14 Python 3 Text Processing

Author: Brinnatt  Category: python  Published: 2023-03-30 10:18

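14.1 Case 1: Word Count

Count the words in the sample file, ignoring case.

Users must be able to exclude certain words from the count: words such as a, the, and of carry no real meaning in the statistics and should be ignored.
All of the code must be wrapped in functions and run by calling them.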

def mkkey(line: str, chars=set("""!'"#./\\()[],*- \r\n""")):  # \\ avoids the invalid escape sequence \(
    start = 0

    for i, c in enumerate(line):
        if c in chars:
            if start == i:
                start += 1
                continue
            yield line[start:i]
            start = i + 1
    else:
        if start < len(line):
            yield line[start:]

def wdcount(filename, encoding='utf8', ignore=set()):
    d = {}

    with open(filename, encoding=encoding) as f:
        for line in f:
            for word in map(str.lower, mkkey(line)):
                if word not in ignore:
                    d[word] = d.get(word, 0) + 1

    return d

def top(d: dict, n=10):
    for i, (k, v) in enumerate(sorted(d.items(), key=lambda item: (item[1], item[0]), reverse=True)):
        if i >= n:  # stop after n entries; i > n would print n + 1
            break
        print(k, v)

top(wdcount('sample.txt', ignore={'the', 'a'}))
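As a point of comparison, the same count can be sketched with re and collections.Counter from the standard library. This is an alternative, not the approach used above; the function name count_words, the regular expression, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def count_words(text, ignore=frozenset()):
    """Count words case-insensitively, skipping anything in ignore."""
    # [a-z0-9_]+ treats letters, digits and underscores as word characters (an assumption)
    words = re.findall(r'[a-z0-9_]+', text.lower())
    return Counter(w for w in words if w not in ignore)

counts = count_words("The path of a file. The PATH!", ignore={'the', 'a', 'of'})
for word, n in counts.most_common(2):
    print(word, n)
```

Counter.most_common(n) returns exactly n entries, which sidesteps the off-by-one risk of a hand-written loop.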

14.2 Case 2: INI to JSON

Given a configuration file test.ini with the following content, convert it to a JSON file.

[DEFAULT]
a = test

[mysql]
default-character-set=utf8
a = 1000

[mysqld]
datadir =/dbserver/data
port = 33060
character-set-server=utf8
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

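Iterate over the sections and convert each one to a dictionary. cfg.sections() does not list [DEFAULT], but cfg.items(sec) merges the [DEFAULT] values into every section: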

from configparser import ConfigParser
import json

filename = 'test.ini'
jsonname = 'test.json'

cfg = ConfigParser()
cfg.read(filename)

dest = {}

for sec in cfg.sections():
    print(sec, cfg.items(sec))
    dest[sec] = dict(cfg.items(sec))

# Use a with block so the output file is flushed and closed
with open(jsonname, 'w') as f:
    json.dump(dest, f)

Check the result in a terminal:
(venv) PS D:\JetBrains\pythonProject> python -m json.tool test.json
{
    "mysql": {
        "a": "1000",
        "default-character-set": "utf8"
    },
    "mysqld": {
        "a": "test",
        "datadir": "/dbserver/data",
        "port": "33060",
        "character-set-server": "utf8",
        "sql_mode": "NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES"
    }
}
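If the [DEFAULT] section should also appear in the JSON, one option is to add cfg.defaults() under its own key. The following is a minimal sketch that assumes the INI text is passed in as a string; the function name ini_to_dict and the shortened sample text are hypothetical.

```python
import json
from configparser import ConfigParser

def ini_to_dict(text):
    """Parse INI text; keep [DEFAULT] as its own key next to the sections."""
    cfg = ConfigParser()
    cfg.read_string(text)
    dest = {'DEFAULT': dict(cfg.defaults())}
    for sec in cfg.sections():
        # items() still merges the DEFAULT values into each section
        dest[sec] = dict(cfg.items(sec))
    return dest

sample = "[DEFAULT]\na = test\n\n[mysql]\na = 1000\n\n[mysqld]\nport = 33060\n"
result = ini_to_dict(sample)
print(json.dumps(result, indent=4))
```

All values stay strings, because ConfigParser does not guess types.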

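14.3 Case 3: Implementing ls

Implement the functionality of the ls command:

Implement the -l, -a / --all, and -h options
List the files under a given path
-a and --all also show files whose names start with .
-l shows a detailed listing
-h combined with -l shows file sizes in human-readable form, such as 1K, 1G, 1T; you may assume 1G = 1000M
```
-rw-rw-r--        1       python  python  5       Oct 25  00:07   test4
mode          links    owner    group    bytes    time            file name
```
- The first character of mode is the file type: c character device; d directory; - regular file; l symbolic link; b block device; s socket; p pipe, i.e. FIFO
Sort the output by file name; the order may differ from that of ls, but it must be sorted by file name
In the detailed listing, the time may be shown in the format year-month-day hour:minute:second

Hint: use the argparse module.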

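14.4 The argparse Module

Any executable or script can accept arguments.

$ ls -l /etc

/etc is a positional argument
-l is a short option

How are these arguments passed to the program?

Since version 3.2, Python has shipped argparse, a module for parsing arguments.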
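Before any parsing library gets involved, the raw arguments reach a Python program as the list sys.argv. The sketch below only illustrates that idea; the helper classify is hypothetical and far cruder than what argparse does.

```python
import sys

def classify(argv):
    """Split raw command-line tokens into options and positional arguments."""
    options = [a for a in argv if a.startswith('-') and a != '-']
    positionals = [a for a in argv if not a.startswith('-') or a == '-']
    return options, positionals

if __name__ == '__main__':
    # sys.argv[0] is the script name; the remaining items are the user's arguments
    print(classify(sys.argv[1:]))
```

Running the script as `python main.py -l /etc` would hand classify the list ['-l', '/etc'].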

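14.4.1 Kinds of Arguments

Arguments fall into two kinds:

Positional arguments: the argument is matched by where it appears, so each one corresponds to an argument position. For example, /etc fills one argument position.

Optional arguments: they must be introduced by a short option starting with - or a long option starting with --, and whatever follows counts as the option's value. A short option may also take no value.

In the example above, /etc is the positional argument and -l is the optional argument.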

ls -alh src

14.4.2 Basic Parsing

Start with the simplest possible program.

import argparse

parser = argparse.ArgumentParser()  # create an argument parser
args = parser.parse_args()  # parse the arguments
parser.print_help()  # print help

Output:
usage: main.py [-h]

options:
  -h, --help  show this help message and exit

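argparse not only defines and parses the arguments, it also generates the help text automatically. The usage line is especially handy for checking whether the arguments defined so far are the ones you intended.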

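14.4.3 Parser Parameters

parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')

Parameter	Description
prog	Program name; defaults to the basename of sys.argv[0]
add_help	Automatically adds the -h and --help options to the parser; defaults to True
description	A description of what the program does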
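The effect of these three parameters can be checked without running the script from a shell. The sketch below builds a parser with add_help=False purely to show the difference; format_usage() and format_help() return the same text that print_usage() and print_help() would print.

```python
import argparse

# Explicit prog and description; add_help=False drops -h/--help
ls_parser = argparse.ArgumentParser(prog='ls', add_help=False,
                                    description='list directory contents')

print(ls_parser.prog)            # the name used in the usage line
print(ls_parser.format_usage())  # usage line only, with no [-h] in it
print(ls_parser.format_help())   # usage line plus the description
```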

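14.4.4 Parsing Positional Arguments

The basic job of ls is to print the contents of a directory. The directory path has to be specified, which calls for a positional argument.

import argparse

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')
parser.add_argument('path')
args = parser.parse_args()  # parse the arguments
parser.print_help()  # print help

Output when run without a path: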
usage: ls [-h] path
ls: error: the following arguments are required: path

-h is the help option and may be omitted; path is a positional argument and must be provided.

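14.4.5 Parsing Arguments

parse_args(args=None, namespace=None)

args is the list of arguments, given as an iterable that is converted to a list internally. If it is None, the arguments come from the command line; otherwise the iterable passed in args is parsed.

import argparse

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')
parser.add_argument('path')  # positional argument

args = parser.parse_args(('/etc', ))  # parse the arguments passed in as an iterable
print(args)  # print the arguments collected in the namespace

parser.print_help()  # print help

Output: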
Namespace(path='/etc')
usage: ls [-h] path

list directory contents

positional arguments:
  path

options:
  -h, --help  show this help message and exit

In Namespace(path='/etc'), the path argument is stored as an attribute of a Namespace object and can be read through that attribute, for example args.path.
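A short sketch of reading the parsed values, both as an attribute and as a plain dictionary through the built-in vars():

```python
import argparse

parser = argparse.ArgumentParser(prog='ls')
parser.add_argument('path')

args = parser.parse_args(['/etc'])
print(args.path)    # attribute access
print(vars(args))   # the same data as a plain dict
```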

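14.4.6 Optional Positional Arguments

The code above requires the positional argument; without it, an error is reported.

usage: ls [-h] path
ls: error: the following arguments are required: path

Sometimes, however, running ls without any path should list the files in the current directory.

import argparse

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')
parser.add_argument('path', nargs='?', default='.', help="path help")  # optional positional argument with a default and help text
args = parser.parse_args()  # parse the command-line arguments
print(args)  # print the arguments collected in the namespace
parser.print_help()  # print help

Output: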
Namespace(path='.')
usage: ls [-h] [path]

list directory contents

positional arguments:
  path        path help

options:
  -h, --help  show this help message and exit

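As the output shows, path has become an optional positional argument; when it is omitted, the default value . (the current directory) is used.

help is the description of this argument shown in the help text.
nargs is the number of values this argument accepts: ? means zero or one, + means at least one, * means any number, and an integer means exactly that many.
default is the value used when the argument is not supplied. It is usually paired with ? or *, since both allow the positional argument to be omitted, in which case the default is used.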
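The nargs variants are easiest to compare side by side. In the sketch below, the helper parse and the prog name demo are hypothetical; each call builds a throwaway parser with a single positional argument.

```python
import argparse

def parse(nargs, argv, default=None):
    """Build a throwaway parser whose only positional uses the given nargs."""
    parser = argparse.ArgumentParser(prog='demo')
    parser.add_argument('path', nargs=nargs, default=default)
    return parser.parse_args(argv).path

print(parse('?', [], default='.'))      # nothing given: falls back to the default
print(parse('?', ['/etc']))             # a single value, not a list
print(parse('+', ['/etc', '/tmp']))     # one or more, collected in a list
print(parse(2, ['/etc', '/tmp']))       # exactly two, collected in a list
```

Note that ? yields a bare value, while +, * and an integer always yield a list.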

14.4.7 Optional Arguments

14.4.7.1 Implementing the `-l` option

import argparse

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')

# Add a positional argument
parser.add_argument('path')

# Add an optional argument
parser.add_argument('-l')

parser.print_help()  # print help

Output:

usage: ls [-h] [-l L] path

list directory contents

positional arguments:
  path

options:
  -h, --help  show this help message and exit
  -l L

This is not the result we want; we expect [-l]. How can it be fixed?

Can nargs solve it?

parser.add_argument('-l', nargs='?')

Output:
usage: ls [-h] [-l [L]] path

That is still not quite right. So what about setting nargs=0 directly, meaning the option accepts zero values?

parser.add_argument('-l', nargs=0)

Output:
raise ValueError('nargs for store actions must be != 0; if you '
ValueError: nargs for store actions must be != 0; if you have nothing to store, actions such as store true or store const may be more appropriate

This raises an error outright. The fix is the action parameter: store_true stores True when the option is present and False otherwise.

parser.add_argument('-l', action='store_true')

Output:
usage: ls [-h] [-l] path

That matches what we expect. Finally, test it:

Code in main.py:

import argparse

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')
# Add a positional argument
parser.add_argument('path', nargs='?', default='.', help="path help")
# Add an optional argument
parser.add_argument('-l', action='store_true')
args = parser.parse_args()
print(args)
parser.print_help()  # print help

Test in a terminal:

(venv) PS D:\JetBrains\pythonProject> python .\main.py -l
Namespace(path='.', l=True) # with -l
usage: ls [-h] [-l] [path]

list directory contents

positional arguments:
  path        path help

options:
  -h, --help  show this help message and exit
  -l

Namespace(path='.', l=False) # without -l
usage: ls [-h] [-l] [path]

list directory contents

positional arguments:
  path        path help

options:
  -h, --help  show this help message and exit
  -l

14.4.7.2 Implementing the `-a` option

parser.add_argument('-a', '--all', action='store_true')

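A short and a long option can be given together. When a long option is present, the attribute name is derived from it, so the value is read as args.all rather than args.a.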
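How the attribute name is chosen can be verified with a short sketch. The dest value long_format is a hypothetical name used only to show that the derived name can be overridden.

```python
import argparse

parser = argparse.ArgumentParser(prog='ls')
# With both forms given, the attribute name comes from the long option: all
parser.add_argument('-a', '--all', action='store_true')
# dest overrides the derived name; 'long_format' is a hypothetical choice
parser.add_argument('-l', dest='long_format', action='store_true')

args = parser.parse_args(['--all'])
print(args.all, args.long_format)
```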

14.4.8 Code

import argparse

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')

# Add a positional argument
parser.add_argument('path', nargs='?', default='.', help="directory")

# Add several optional arguments
parser.add_argument('-l', action='store_true', help='use a long listing format')
parser.add_argument('-a', '--all', action='store_true', help='show all files, do not ignore entries starting with .')

# Parse the command-line arguments
args = parser.parse_args()

# Print the arguments collected in the namespace
print(args)

# Print help
parser.print_help()

Output:
Namespace(path='.', l=False, all=False)
usage: ls [-h] [-l] [-a] [path]

list directory contents

positional arguments:
  path        directory

options:
  -h, --help  show this help message and exit
  -l          use a long listing format
  -a, --all   show all files, do not ignore entries starting with .

# Parse the arguments passed in as an iterable
args = parser.parse_args('-l -a /tmp'.split())

# Print the arguments collected in the namespace
print(args)

Output:
Namespace(path='/tmp', l=True, all=True)

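14.5 Implementing ls

With argument definition and passing solved, the next step is the actual functionality:

List all files under the given path, without recursion by default
-a shows all files, including hidden ones
-l shows a detailed listing

14.5.1 Implementation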

import argparse
from pathlib import Path
from datetime import datetime

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')

# Add a positional argument
parser.add_argument('path', nargs='?', default='.', help="directory")

# Add several optional arguments
parser.add_argument('-l', action='store_true', help='use a long listing format')
parser.add_argument('-a', '--all', action='store_true', help='show all files, do not ignore entries starting with .')

# Parse the command-line arguments
args = parser.parse_args()

# Print the arguments collected in the namespace
print(1, '-->', args)

# Print help
# parser.print_help()

def flist(path, all=False):
    """List the files in the given directory."""
    p = Path(path)
    for i in p.iterdir():
        if not all and i.name.startswith('.'):
            continue
        yield i.name

print(2, '-->', list(flist(args.path, all=True)))

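def ftype(f: Path):
    # Test for a symbolic link first: is_dir() and the other checks follow links
    if f.is_symlink():
        return 'l'
    elif f.is_dir():
        return 'd'
    elif f.is_block_device():
        return 'b'
    elif f.is_char_device():
        return 'c'
    elif f.is_socket():
        return 's'
    elif f.is_fifo():
        return 'p'
    else:
        return '-'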

def fdetail(path, all=False):
    p = Path(path)
    for i in p.iterdir():
        if not all and i.name.startswith('.'):
            continue
        stat = i.stat()
        tp = ftype(i)
        mode = oct(stat.st_mode)[-3:]
        atime = datetime.fromtimestamp(stat.st_atime).strftime('%Y %m %d %H:%M:%S')
        yield tp, mode, stat.st_uid, stat.st_gid, stat.st_size, atime, i.name

print(3, '-->', list(fdetail(args.path)))

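mode is an integer holding the permission bits, usually written in octal; it should ultimately be displayed in rwx form.

Method 1:

mlist = ['r', 'w', 'x', 'r', 'w', 'x', 'r', 'w', 'x']
def mstr(mode: int):
    mode = mode & 0o777
    mconcat = ""
    # format(mode, '09b') zero-pads to 9 bits; bin(mode)[-9:] would leak the '0b' prefix for modes below 0o400
    for i, v in enumerate(format(mode, '09b')):
        if v == '1':
            mconcat += mlist[i]
        else:
            mconcat += '-'
    return mconcat
print(mstr(0o640))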

Method 2:

mlist = dict(zip(range(9), ['r', 'w', 'x', 'r', 'w', 'x', 'r', 'w', 'x']))
def mstr(mode: int):
    mode = mode & 0o777
    mconcat = ""

    for i in range(8, -1, -1):
        if mode >> i & 1:
            mconcat += mlist[8 - i]
        else:
            mconcat += '-'
    return mconcat
print(mstr(0o644))

14.5.2 Merging the Listing Functions

fdetail and flist are almost identical, with too much duplication, so merge them:

import argparse
from pathlib import Path
from datetime import datetime

mlist = dict(zip(range(9), ['r', 'w', 'x', 'r', 'w', 'x', 'r', 'w', 'x']))

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')

# Add a positional argument
parser.add_argument('path', nargs='?', default='.', help="directory")

# Add several optional arguments
parser.add_argument('-l', action='store_true', help='use a long listing format')
parser.add_argument('-a', '--all', action='store_true', help='show all files, do not ignore entries starting with .')

# Parse the command-line arguments
args = parser.parse_args()

# Print the arguments collected in the namespace
print(1, '-->', args)

# Print help
# parser.print_help()

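def ftype(f: Path):
    # Test for a symbolic link first: is_dir() and the other checks follow links
    if f.is_symlink():
        return 'l'
    elif f.is_dir():
        return 'd'
    elif f.is_block_device():
        return 'b'
    elif f.is_char_device():
        return 'c'
    elif f.is_socket():
        return 's'
    elif f.is_fifo():
        return 'p'
    else:
        return '-'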

def mstr(mode: int):
    mode = mode & 0o777
    mconcat = ""

    for i in range(8, -1, -1):
        if mode >> i & 1:
            mconcat += mlist[8 - i]
        else:
            mconcat += '-'
    return mconcat

def flist(path, all=False, detail=False):
    p = Path(path)
    for i in p.iterdir():
        if not all and i.name.startswith('.'):
            continue

        if not detail:
            yield (i.name,)
        else:
            stat = i.stat()
            mode = ftype(i) + mstr(stat.st_mode)  # type character followed by rwx permissions
            atime = datetime.fromtimestamp(stat.st_atime).strftime('%Y %m %d %H:%M:%S')
            yield mode, stat.st_nlink, stat.st_uid, stat.st_gid, stat.st_size, atime, i.name

for x in flist(args.path, detail=True):
    print(x)

14.5.3 Sorting

ls prints file names sorted in ascending order.

# Sort by file name, the last element of each tuple
print(sorted(flist(args.path, detail=True), key=lambda x: x[-1]))
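Sorting on the last tuple element can be tried on a few hand-written rows. The rows below are made-up sample data, and operator.itemgetter is simply an alternative way of writing the key function; the case-insensitive variant is an optional extra, not something the requirements ask for.

```python
from operator import itemgetter

# Made-up rows in the shape (mode, links, size, name)
rows = [
    ('-rw-rw-rw-', 1, 751, 't1.py'),
    ('drwxrwxrwx', 1, 0, 'venv'),
    ('-rw-rw-rw-', 1, 1024, 'Backeyes.py'),
]

by_name = sorted(rows, key=itemgetter(-1))                  # plain string order
by_name_ci = sorted(rows, key=lambda row: row[-1].lower())  # ignore case

print([row[-1] for row in by_name])
print([row[-1] for row in by_name_ci])
```

Plain string order puts uppercase letters before lowercase ones, which is why the two results can differ on mixed-case names.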

14.5.4 Complete Code

Refactor the code once more:

import argparse
from pathlib import Path
from datetime import datetime

mlist = dict(zip(range(9), ['r', 'w', 'x', 'r', 'w', 'x', 'r', 'w', 'x']))

# Create an argument parser
parser = argparse.ArgumentParser(prog='ls', add_help=True, description='list directory contents')

# Add a positional argument
parser.add_argument('path', nargs='?', default='.', help="directory")

# Add several optional arguments
parser.add_argument('-l', action='store_true', help='use a long listing format')
parser.add_argument('-a', '--all', action='store_true', help='show all files, do not ignore entries starting with .')

def flist(path, all=False, detail=False):
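    def _ftype(f: Path):
        # Test for a symbolic link first: is_dir() and the other checks follow links
        if f.is_symlink():
            return 'l'
        elif f.is_dir():
            return 'd'
        elif f.is_block_device():
            return 'b'
        elif f.is_char_device():
            return 'c'
        elif f.is_socket():
            return 's'
        elif f.is_fifo():
            return 'p'
        else:
            return '-'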

    def _mstr(mode: int):
        mode = mode & 0o777
        mconcat = ""

        for i in range(8, -1, -1):
            if mode >> i & 1:
                mconcat += mlist[8 - i]
            else:
                mconcat += '-'
        return mconcat

    def _flist(path, all=False, detail=False):
        p = Path(path)
        for i in p.iterdir():
            if not all and i.name.startswith('.'):
                continue

            if not detail:
                yield i.name,
            else:
                stat = i.stat()
                mode = _ftype(i) + _mstr(stat.st_mode)  # type character followed by rwx permissions
                atime = datetime.fromtimestamp(stat.st_atime).strftime('%Y %m %d %H:%M:%S')
                yield mode, stat.st_nlink, stat.st_uid, stat.st_gid, stat.st_size, atime, i.name

    # Sort by file name, the last element of each tuple
    yield from sorted(_flist(path, all, detail), key=lambda x: x[-1])

if __name__ == '__main__':
    args = parser.parse_args()
    print(args)
    parser.print_help()
    files = flist(args.path, args.all, args.l)
    print(list(files))

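14.5.5 Implementing `-h`

-h, --human-readable: takes effect only when -l is also given.

Add the option. argparse reserves -h for help by default, so the parser is created with add_help=False to free it up:

parser = argparse.ArgumentParser(prog='ls', add_help=False, description='list directory contents')
parser.add_argument('-h', '--human-readable', action='store_true', help='with -l, show size in form of KMGTP')

Add a function that handles the unit conversion:

def _fsize(size: int):
    units = ' KMGTP'
    depth = 0
    # Stop at the last unit so units[depth] can never go out of range
    while size >= 1000 and depth < len(units) - 1:
        size //= 1000
        depth += 1
    return '{}{}'.format(size, units[depth])

Add the handling to the -l branch: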

size = stat.st_size if not human else _fsize(stat.st_size)
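For comparison, here is a variant that keeps one decimal place and adds no trailing space for plain byte counts. It is only a sketch: the name human_size and the one-decimal format are assumptions, and the base of 1000 follows the requirement stated earlier.

```python
def human_size(size, base=1000):
    """Return size with one decimal and a K/M/G/T/P suffix."""
    units = ['', 'K', 'M', 'G', 'T', 'P']
    value = float(size)
    depth = 0
    # Stop at the last unit so the index can never run past the list
    while value >= base and depth < len(units) - 1:
        value /= base
        depth += 1
    if depth == 0:
        return str(size)  # plain bytes: no suffix, no decimals
    return '{:.1f}{}'.format(value, units[depth])

for n in (751, 1500, 4096, 2_000_000):
    print(n, human_size(n))
```

Integer division, as used in _fsize, truncates 1999 to 1K; dividing as a float keeps the fractional part, so the same value is shown as 2.0K.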

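14.5.6 Other Improvements

Converting uid and gid to names.

The pwd module (the password database) provides access to the password file on Linux and Unix. It is not available on Windows.

pwd.getpwuid(Path().stat().st_uid).pw_name

The grp module retrieves group information on Linux and Unix. It is not available on Windows.

grp.getgrgid(Path().stat().st_gid).gr_name

With pathlib, Path().owner() or Path().group() also works; under the hood they call the pwd and grp modules.

Since Windows does not support these, the uid and gid conversion can be left out this time.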
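If the conversion is wanted on Linux and Unix while the script still has to run on Windows, one option is to guard the imports and fall back to the numeric id. A minimal sketch; the helper names owner_name and group_name are hypothetical.

```python
from pathlib import Path

try:
    import pwd
    import grp
except ImportError:      # these modules do not exist on Windows
    pwd = grp = None

def owner_name(uid):
    """Return the user name for uid, or the number as text if it cannot be resolved."""
    if pwd is None:
        return str(uid)
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:     # uid has no entry in the password database
        return str(uid)

def group_name(gid):
    """Return the group name for gid, or the number as text if it cannot be resolved."""
    if grp is None:
        return str(gid)
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:     # gid has no entry in the group database
        return str(gid)

st = Path().stat()
print(owner_name(st.st_uid), group_name(st.st_gid))
```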

14.5.7 Improved Code

import argparse
from pathlib import Path
from datetime import datetime

mlist = dict(zip(range(9), ['r', 'w', 'x', 'r', 'w', 'x', 'r', 'w', 'x']))

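# Create an argument parser; add_help=False frees -h for --human-readable
parser = argparse.ArgumentParser(prog='ls', add_help=False, description='list directory contents')

# Add a positional argument
parser.add_argument('path', nargs='?', default='.', help="directory")

# Add several optional arguments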
parser.add_argument('-l', action='store_true', help='use a long listing format')
parser.add_argument('-a', '--all', action='store_true', help='show all files, do not ignore entries starting with .')
parser.add_argument('-h', '--human-readable', action='store_true', help='with -l, show size in form of KMGTP')

def flist(path, all=False, detail=False, human=False):
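    def _ftype(f: Path):
        # Test for a symbolic link first: is_dir() and the other checks follow links
        if f.is_symlink():
            return 'l'
        elif f.is_dir():
            return 'd'
        elif f.is_block_device():
            return 'b'
        elif f.is_char_device():
            return 'c'
        elif f.is_socket():
            return 's'
        elif f.is_fifo():
            return 'p'
        else:
            return '-'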

    def _mstr(mode: int):
        mode = mode & 0o777
        mconcat = ""

        for i in range(8, -1, -1):
            if mode >> i & 1:
                mconcat += mlist[8 - i]
            else:
                mconcat += '-'
        return mconcat

    def _fsize(size: int):
        units = ' KMGTP'
        depth = 0
        while size >= 1000 and depth < len(units) - 1:  # stop at the last unit
            size //= 1000
            depth += 1
        return '{}{}'.format(size, units[depth])

    def _flist(path, all=False, detail=False, human=False):
        p = Path(path)
        for i in p.iterdir():
            if not all and i.name.startswith('.'):
                continue

            if not detail:
                yield i.name,
            else:
                stat = i.stat()
                mode = _ftype(i) + _mstr(stat.st_mode)  # type character followed by rwx permissions
                atime = datetime.fromtimestamp(stat.st_atime).strftime('%Y %m %d %H:%M:%S')
                size = str(stat.st_size) if not human else _fsize(stat.st_size)
                yield mode, stat.st_nlink, stat.st_uid, stat.st_gid, size, atime, i.name

    # Sort by file name, the last element of each tuple
    yield from sorted(_flist(path, all, detail, human), key=lambda x: x[-1])

if __name__ == '__main__':
    args = parser.parse_args()
    print(args)
    parser.print_help()
    files = flist(args.path, args.all, args.l, args.human_readable)
    print(list(files))

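14.5.8 Improving mode

Use the stat module, whose filemode() function converts st_mode into the full string, file type character included: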

from pathlib import Path
import stat

stat.filemode(Path().stat().st_mode)
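stat.filemode() can also be tried on literal mode values, which makes it easy to see how the file-type bits and the permission bits combine. The octal constants below are the standard POSIX type bits (0o100000 regular file, 0o040000 directory, 0o120000 symbolic link).

```python
import stat

print(stat.filemode(0o100644))   # regular file  -> -rw-r--r--
print(stat.filemode(0o040755))   # directory     -> drwxr-xr-x
print(stat.filemode(0o120777))   # symbolic link -> lrwxrwxrwx
print(stat.S_ISDIR(0o040755))    # the type can also be tested directly
```

Because the type character is already part of the result, the separate file-type and permission helpers are no longer needed.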

14.5.9 Final Code

import argparse
import stat
from pathlib import Path
from datetime import datetime

# Create an argument parser; add_help=False frees -h for --human-readable
parser = argparse.ArgumentParser(prog='ls', add_help=False, description='list directory contents')

# Add a positional argument
parser.add_argument('path', nargs='?', default='.', help="directory")

# Add several optional arguments
parser.add_argument('-l', action='store_true', help='use a long listing format')
parser.add_argument('-a', '--all', action='store_true', help='show all files, do not ignore entries starting with .')
parser.add_argument('-h', '--human-readable', action='store_true', help='with -l, show size in form of KMGTP')

def flist(path, all=False, detail=False, human=False):
    def _fsize(size: int):
        units = ' KMGTP'
        depth = 0
        while size >= 1000 and depth < len(units) - 1:  # stop at the last unit
            size //= 1000
            depth += 1
        return '{}{}'.format(size, units[depth])

    def _flist(path, all=False, detail=False, human=False):
        p = Path(path)
        for i in p.iterdir():
            if not all and i.name.startswith('.'):
                continue

            if not detail:
                yield i.name,
            else:
                st = i.lstat()  # lstat reports a symbolic link itself ('l') instead of its target
                mode = stat.filemode(st.st_mode)
                atime = datetime.fromtimestamp(st.st_atime).strftime('%Y %m %d %H:%M:%S')
                size = str(st.st_size) if not human else _fsize(st.st_size)
                yield mode, st.st_nlink, st.st_uid, st.st_gid, size, atime, i.name

    # Sort by file name, the last element of each tuple
    yield from sorted(_flist(path, all, detail, human), key=lambda x: x[-1])

if __name__ == '__main__':
    args = parser.parse_args()
    # print(args)
    # parser.print_help()
    files = flist(args.path, args.all, args.l, args.human_readable)
    for file in files:
        print(file)

14.5.10 Testing

(venv) PS D:\JetBrains\Projects> python.exe .\backeyes.py -a -l -h
('drwxrwxrwx', 1, 0, 0, '4K', '2023 03 09 15:35:44', '.idea')
('-rw-rw-rw-', 1, 0, 0, '1K', '2023 03 09 15:39:40', 'backeyes.py')
('-rw-rw-rw-', 1, 0, 0, '751 ', '2023 03 09 15:08:13', 't1.py')
('drwxrwxrwx', 1, 0, 0, '0 ', '2023 03 09 15:39:40', 'venv')
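The tuples above could be turned into ls-style lines with a small formatting helper. This is only a sketch: the name format_row and the 6-column width for the size are arbitrary choices, and the sample row is copied from the test output.

```python
def format_row(row):
    """Join one detailed-listing tuple into a single ls-style line."""
    mode, nlink, uid, gid, size, when, name = row
    # Right-align the size in 6 columns; strip the padding that the size helper may add
    return '{} {} {} {} {:>6} {} {}'.format(
        mode, nlink, uid, gid, str(size).strip(), when, name)

line = format_row(('-rw-rw-rw-', 1, 0, 0, '751 ', '2023 03 09 15:08:13', 't1.py'))
print(line)
```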