16. Reformatting text to a fixed number of columns
Use the textwrap module to rewrap text to a given line width.
>>> s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
... the eyes, not around the eyes, don't look around the eyes, \
... look into my eyes, you're under."
>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.

textwrap.fill() limits each line to a maximum number of characters and breaks only at whitespace, so ordinary words are not split. The initial_indent and subsequent_indent keyword arguments set the prefix of the first line and of every following line, respectively.

>>> print(textwrap.fill(s, 40, initial_indent='    '))
    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.
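The session above only shows initial_indent. As a small sketch of the other option, the following uses subsequent_indent to produce a hanging indent and textwrap.wrap() to get the wrapped lines as a list instead of one string. The variable names hanging and lines are illustrative only.

```python
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Hanging indent: the first line starts at column 0,
# every following line is prefixed with four spaces.
hanging = textwrap.fill(s, 40, subsequent_indent='    ')
print(hanging)

# wrap() returns the lines as a list rather than a single joined string.
lines = textwrap.wrap(s, 40)
for line in lines:
    print(repr(line))
```

The width counts the indent, so indented lines hold fewer characters of text than unindented ones.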
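17. Handling HTML and XML entities in text (Python 3 only)
Python 2 has two string types: Unicode strings and non-Unicode strings. Python 3 has a single text string type, which is always Unicode.
You want to replace HTML or XML entities such as &entity; or &#code; with their corresponding text. Alternatively, you need to produce text in which certain characters are escaped.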
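html.escape() replaces special characters such as '<' and '>' with their entities:

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> # Disable escaping of quotes
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".

If you need to emit ASCII and represent non-ASCII characters as character reference entities, pass errors='xmlcharrefreplace' to the encoding step or to I/O functions that accept an errors argument:

>>> s = 'Spicy Jalapeño'
>>> s.encode('ascii', errors='xmlcharrefreplace')
b'Spicy Jalape&#241;o'

If, for some reason, you receive raw text that contains entities and want to replace them manually, you can use the utility functions of the HTML and XML libraries. Note that HTMLParser().unescape() was deprecated in Python 3.4 and removed in Python 3.9; html.unescape() is the replacement:

>>> s = 'Spicy &quot;Jalape&#241;o&quot;.'
>>> import html
>>> html.unescape(s)
'Spicy "Jalapeño".'
>>>
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'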
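To tie the escaping and unescaping steps together, here is a minimal round-trip sketch. It assumes Python 3.4 or later (for html.unescape); the sample strings and variable names are made up for illustration.

```python
import html

raw = 'Elements are written as "<tag>text</tag>" & so on.'

# Escape markup-significant characters, then reverse the operation.
escaped = html.escape(raw)
print(escaped)
restored = html.unescape(escaped)

# Combine escaping with ASCII output: non-ASCII characters become
# numeric character references such as &#241;.
ascii_bytes = html.escape('Spicy Jalapeño <hot>', quote=False).encode(
    'ascii', errors='xmlcharrefreplace')
print(ascii_bytes)

# Decoding and unescaping recovers the original text.
back = html.unescape(ascii_bytes.decode('ascii'))
print(back)
```

Escaping before encoding matters here: the ampersands produced by xmlcharrefreplace are added after html.escape() has run, so they are not escaped a second time.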
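18. Tokenizing text
Suppose you have the string

text = 'foo = 23 + 42 * 10'

To tokenize it, matching patterns is not enough; you also need to identify what kind of token each match is, producing a sequence of pairs such as

tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '23'), ('PLUS', '+'),
          ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]

The regular expressions, written with named capture groups, are as follows:

import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))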
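The ?P<TOKENNAME> syntax assigns a name to a group in the regular expression.
scanner() (an undocumented method of compiled patterns) creates a scanner object. Calling its match() method repeatedly steps through the supplied text one match at a time; match() returns None when the text at the current position matches none of the patterns, or when the input is exhausted.
The order of the alternatives also matters: when one pattern is a prefix of another, put the longer pattern first.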
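The ordering rule can be seen with two patterns where one is a prefix of the other. This is a small sketch; the LT and LE token names and the first_token helper are hypothetical and not part of the patterns defined above.

```python
import re

LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'

def first_token(pattern, text):
    """Return (token name, matched text) for the match at the start of text."""
    m = pattern.match(text)
    return m.lastgroup, m.group()

# Alternation tries the alternatives left to right and takes the first
# one that matches, not the longest one.
good = re.compile('|'.join([LE, LT]))   # longer pattern first
bad = re.compile('|'.join([LT, LE]))    # shorter pattern first

print(first_token(good, '<= 3'))   # ('LE', '<=')
print(first_token(bad, '<= 3'))    # ('LT', '<')
```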
Stepping through 'foo = 42' with a scanner:

>>> scanner = master_pat.scanner('foo = 42')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()  # '_' holds the last result in the interactive shell, here the return value of scanner.match()
('NAME', 'foo')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677759>
>>> _.lastgroup, _.group()
('EQ', '=')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677768>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x1006777390>
>>> _.lastgroup, _.group()
('NUM', '42')
>>> scanner.match()
>>>
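The manual stepping shown above can be wrapped in a generator. The following is a sketch, not the only way to do it: the Token namedtuple and the generate_tokens function are names chosen for this example, and the patterns are repeated so that the block runs on its own.

```python
import re
from collections import namedtuple

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    """Yield a Token for each successive match of pat in text."""
    scanner = pat.scanner(text)
    # iter(callable, sentinel) keeps calling scanner.match()
    # until it returns None.
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Filter out whitespace tokens while consuming the stream.
tokens = [tok for tok in generate_tokens(master_pat, 'foo = 23 + 42 * 10')
          if tok.type != 'WS']
print(tokens)
```

Because match() returns None at the first character that no pattern covers, this generator stops silently at that point; real code would probably want to check that the whole input was consumed.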
20. Performing text operations on byte strings (Python 3 only)
Byte strings support most of the operations that text strings do, but there are exceptions.
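>>> b = b'Hello World'
>>> b
b'Hello World'
>>> b[0]
72
>>> b[1]
101

Indexing a byte string yields integers, not single characters. Byte strings also do not support string formatting in Python 3.4 and earlier (Python 3.5 added %-formatting for bytes):

>>> b'%10s %10d %10.2f' % (b'ACME', 100, 490.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'

Formatting does work when the format string is a text string, although a bytes argument is then rendered as its repr:

>>> '%10s %10d %10.2f' % (b'ACME', 100, 490.1)
"   b'ACME'        100     490.10"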
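One way to get formatted bytes on any Python 3 version, without the b'...' repr leaking into the result, is to format as text and encode afterwards. This is a minimal sketch assuming the field values are ASCII-only; the variable names are illustrative.

```python
name = b'ACME'

# Indexing gives an int; slicing gives a bytes object of length 1.
first_int = name[0]
first_byte = name[0:1]

# Decode the bytes field, format as text, then encode the whole line.
line = '{:10s} {:10d} {:10.2f}'.format(name.decode('ascii'), 100, 490.1)
record = line.encode('ascii')
print(record)
```

The text detour costs an extra decode and encode, but it makes the full set of text formatting features available for data that ends up as bytes.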