Day 20 Python Homework: Regular Expressions

Today's material is mainly about regular expressions: a set of exercises and a mind map.

Regular expressions

[Image: mind map (python提高.png)]

Write a regular expression that matches all of the following strings: 'bat', 'bit', 'but', 'hat', 'hit', 'hut'

import re
words = ['bat','bit', 'but', 'hat', 'hit', 'hut']
for word in words:
    # [bh], then [aiu], then t; $ anchors the end of the word
    res = re.match(r'[bh][aiu]t$', word)
    # res = re.match(r'\w{3}', word)
    print(res.group())
    
Output:
bat
bit
but
hat
hit
hut
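
To see why the commented-out `\w{3}` alternative is looser than the character-class pattern, the sketch below runs both against one extra word. The word `'cat'` is an assumed counter-example and is not part of the exercise; `re.fullmatch` is used so that the whole word has to match.

```python
import re

strict = re.compile(r'[bh][aiu]t')   # b or h, then a, i or u, then t
loose = re.compile(r'\w{3}')         # any three word characters

targets = ['bat', 'bit', 'but', 'hat', 'hit', 'hut']
extra = 'cat'  # assumed counter-example, not in the exercise

# Words from the exercise that the strict pattern accepts
strict_hits = [w for w in targets if strict.fullmatch(w)]
strict_accepts_extra = strict.fullmatch(extra) is not None
loose_accepts_extra = loose.fullmatch(extra) is not None

print(strict_hits)
print(strict_accepts_extra, loose_accepts_extra)
```

Both patterns accept the six target words, but only the character-class version rejects other three-letter words.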

Match any pair of words separated by a single space, i.e. a first name and a last name

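import re

def match_pairword(words):
    # the parameter was originally called `list`, which shadowed the built-in
    for word in words:
        # fullmatch: the whole string must be two words split by exactly one space
        res = re.fullmatch(r'[a-zA-Z]+ [a-zA-Z]+', word)
        if res:
            print('%s is a word pair.' % res.group())
        else:
            print('%s is not a word pair.' % word)


if __name__ == '__main__':
    list_name = ['a b', 'wang dao', 'cs kaoyan', 'wangdaoluntan']
    match_pairword(list_name)

Output:
a b is a word pair.
wang dao is a word pair.
cs kaoyan is a word pair.
wangdaoluntan is not a word pair.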
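
One thing worth checking with a word-pair pattern is whether trailing text is rejected. `re.match` only anchors at the start of the string, while `re.fullmatch` requires the entire string to match. The sketch below contrasts the two; the helper names and the sample inputs are made up for illustration.

```python
import re

PAIR = r'[a-z]+ [a-z]+'  # two lowercase words separated by one space

def is_pair_prefix(text):
    """True if the text merely starts with a word pair."""
    return re.match(PAIR, text) is not None

def is_pair_exact(text):
    """True only if the whole text is a word pair."""
    return re.fullmatch(PAIR, text) is not None

samples = ['wang dao', 'wang dao luntan', 'wang  dao']  # assumed inputs
for s in samples:
    print(repr(s), is_pair_prefix(s), is_pair_exact(s))
```

With three words, `re.match` still succeeds on the first two, so an end anchor or `re.fullmatch` is needed when the input must be exactly one pair.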

Match any word and single letter separated by a single comma and a single whitespace character, such as a surname and an initial

import re
def match_wordsplit(sentence):
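    # In English text a comma is usually followed by a space (I think).
    # Inside [...] a | is a literal character, not alternation, so the class is
    # simply "comma or whitespace"; this finds each word preceded by one of those.
    res = re.findall(r'[,\s][a-zA-Z]+', sentence)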
    if res:
        print(res)
    else:
        print('Nothing')

if __name__ == '__main__':
    sentence = 'hello Python, how are you?You are wenbo Song'
    match_wordsplit(sentence)

Output:
[' Python', ' how', ' are', ' you', ' are', ' wenbo', ' Song']
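
Read literally, the exercise asks for a word and a single letter joined by a comma and one whitespace character, for example a surname followed by an initial. A minimal sketch of that reading follows; the function name `find_surname_initial` and the sample text are assumptions made for illustration.

```python
import re

# word, comma, one whitespace character, then exactly one letter;
# the trailing \b rejects longer words such as "Python, how"
SURNAME_INITIAL = re.compile(r'\b[A-Za-z]+,\s[A-Za-z]\b')

def find_surname_initial(text):
    """Return every 'Word, X' occurrence found in the text."""
    return SURNAME_INITIAL.findall(text)

sample = 'Authors: Song, W and Python, how are you'  # assumed sample text
print(find_surname_initial(sample))
```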

Match simple web domain names that start with "www" and end with ".com", for example http://www.yahoo.com ; other suffixes such as .edu and .net should be supported too

import re

def match_list_domain(domains):
    for domain in domains:
        # Multi-level domains are allowed: one or more dot-separated labels
        # after "www", followed by one of the accepted suffixes at the very end.
        res = re.match(r'^www(\.[\w-]+)+\.(com|edu|net|cn|top)$', domain)
        if res:
            print('%s meets the requirement.' % res.group())
        else:
            print('%s does not meet the requirement.' % domain)

def match_sentence_domain(sentence):
    # (?:...) is non-capturing, so findall returns each full match as a string
    res = re.findall(r'www\.[\w.-]+\.(?:com|edu|net|cn|top)', sentence)
    if res:
        print(res)
    else:
        print('Nothing')

if __name__ == '__main__':
    domain_list = [
        'www.baidu.com',
        'www.sdu.edu.cn',
        'www.va1id.top',
        'blog.va1id.top'
    ]
    sentence = '''
    you can click this website: www.va1lid.top or www.baidu.com to get what you want!
    '''

    match_list_domain(domain_list)
    match_sentence_domain(sentence)

Output:
www.baidu.com meets the requirement.
www.sdu.edu.cn meets the requirement.
www.va1id.top meets the requirement.
blog.va1id.top does not meet the requirement.
['www.va1lid.top', 'www.baidu.com']
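
`re.findall` returns tuples when the pattern contains more than one capturing group, and plain strings when it contains none. The sketch below contrasts the two forms on an assumed sample sentence.

```python
import re

text = 'visit www.baidu.com or www.sdu.edu.cn today'  # assumed sample text

# Two capturing groups: each result is a (whole match, suffix) tuple
capturing = re.findall(r'(www\.[\w.]+\.(com|edu|net|cn|top))', text)

# (?:...) groups without capturing: each result is the whole match
non_capturing = re.findall(r'www\.[\w.]+\.(?:com|edu|net|cn|top)', text)

print(capturing)
print(non_capturing)
```

The non-capturing form is usually more convenient when only the full domain is wanted.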

Match the leading letters of a line of text

import re
string = 'Do not litter Rubbish'
res = re.match('[a-zA-Z]+', string)
print(res.group())

Output:
Do

Match the leading digits of a line of text

import re
string = '128 is my math grade.'
res = re.match('[0-9]+', string)
print(res.group())

Output:
128
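
`re.match` returns `None` when the string does not start with the pattern, and calling `.group()` on `None` raises `AttributeError`. A small guard avoids that; the helper name `leading` and its `default` parameter are hypothetical.

```python
import re

def leading(pattern, text, default=''):
    """Return the text matched at the start of `text`, or `default` if none."""
    m = re.match(pattern, text)
    return m.group() if m else default

digits = leading(r'[0-9]+', '128 is my math grade.')
no_digits = leading(r'[0-9]+', 'Do not litter Rubbish')
letters = leading(r'[a-zA-Z]+', 'Do not litter Rubbish')
print(repr(digits), repr(no_digits), repr(letters))
```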

Match only the lines that contain letters and digits (a line counts as long as it has a letter or a digit somewhere in it)

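import re
# The text is stored in a file; each matching line is appended to match_list,
# which is printed at the end.
# Without a file, split the long string on \n into a list and process that instead.
match_list = []
with open('matchfile', mode='r', encoding='utf-8') as file:
    for line in file:
        # keep the line if it contains at least one ASCII letter or digit
        if re.search('[a-zA-Z0-9]', line):
            match_list.append(line)
print(match_list)

Output: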
['matchfile1\n', 'matchfile2\n', 'matchfile3\n', 'matchfile4\n', 'matchfile5\n', 'matchfile6\n', 'matchfile7\n', '现在是2019-07-21 12:00']
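
The code comment above mentions an alternative for text that is not stored in a file: split the long string into lines and filter the list. A minimal sketch of that variant follows; the sample content of `text` is assumed.

```python
import re

# Assumed sample content; two of the four lines contain no letter or digit
text = 'matchfile1\n----\n!!!\nnow 2019-07-21 12:00'

has_alnum = re.compile(r'[a-zA-Z0-9]')

# splitlines() drops the trailing newline characters
match_list = [line for line in text.splitlines() if has_alnum.search(line)]
print(match_list)
```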

Extract the complete date and time fields from each line

import re
match_list = []
with open('matchfile', mode='r', encoding='utf-8') as file:
    for line in file:
        # date (YYYY-M-D), one whitespace character, then HH:MM at the end of the line
        res = re.findall(r'(\d{4}-\d{1,2}-\d{1,2}\s\d{2}:\d{2}$)', line)
        if res:
            match_list.append(res)
print(match_list)

Output:
[['2019-07-21 12:00']]
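
When the date and the time are needed separately, named groups are one option. The sketch below is an assumption-laden variant: the group names `date` and `time` and the function `extract_stamp` are invented here, and the pattern drops the end-of-line anchor so the fields may appear anywhere in the line.

```python
import re

STAMP = re.compile(r'(?P<date>\d{4}-\d{1,2}-\d{1,2})\s(?P<time>\d{2}:\d{2})')

def extract_stamp(line):
    """Return {'date': ..., 'time': ...} for the first timestamp, or None."""
    m = STAMP.search(line)
    return m.groupdict() if m else None

found = extract_stamp('now it is 2019-07-21 12:00')
missing = extract_stamp('matchfile1')
print(found, missing)
```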

Replace the email address in each line with your own email address

import re
def replace_email(string):
    # re.sub returns the original string unchanged if nothing was replaced
    res = re.sub(r'[a-zA-Z_]+@(163|126|qq)\.com', 'va1id@va1id.top', string)
    return res


if __name__ == '__main__':
    string = '''
    sorry, i forgot your email? dsjavnal@qq.com
    issnvjajshek snafhwe; sajf;w vnbaes@163.com
    sbvi vsdhv  jhehfkwj jhsfhewjh shf sfiuwe@126.com
    '''
    print(string)
    a = replace_email(string)
    print(a)
    
Output:
    sorry, i forgot your email? dsjavnal@qq.com
    issnvjajshek snafhwe; sajf;w vnbaes@163.com
    sbvi vsdhv  jhehfkwj jhsfhewjh shf sfiuwe@126.com
    

    sorry, i forgot your email? va1id@va1id.top
    issnvjajshek snafhwe; sajf;w va1id@va1id.top
    sbvi vsdhv  jhehfkwj jhsfhewjh shf va1id@va1id.top
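
Because `re.sub` hands back the original string when nothing matches, it can be hard to tell whether a replacement happened. `re.subn` returns the new string together with the number of substitutions. The function name `replace_email_counted` and the sample sentences are assumptions.

```python
import re

EMAIL = re.compile(r'[a-zA-Z_]+@(?:163|126|qq)\.com')

def replace_email_counted(text, mine='va1id@va1id.top'):
    """Return (new_text, number_of_replacements)."""
    return EMAIL.subn(mine, text)

new_text, count = replace_email_counted('mail dsjavnal@qq.com or vnbaes@163.com')
untouched, zero = replace_email_counted('no address here')
print(new_text, count)
print(untouched, zero)
```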

Match the keyword \home:

import re
# raw string, so \h and \s are not read as (invalid) escape sequences
string = r'''
\home\sjbs,fjaosejlfje\home
'''
res = re.findall(r'\\home', string)
print(res)

Output: ['\\home', '\\home']
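
Backslashes need care twice here: once in the Python string literal and once in the regular expression. The sketch below shows that a raw-string pattern and its doubly escaped equivalent are the same string, and that each hit is five characters long even though the list display shows two backslashes.

```python
import re

text = r'\home\sjbs,fjaosejlfje\home'   # raw literal: each backslash is one character

raw_pattern = r'\\home'      # the regex sees \\home: an escaped backslash, then "home"
plain_pattern = '\\\\home'   # the same pattern written without the r prefix

same = raw_pattern == plain_pattern
hits = re.findall(raw_pattern, text)
print(same, len(hits), len(hits[0]))
```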

Remove the tags from the following HTML and show only the text.


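import re
html = '''
<div>
<p>Job responsibilities:</p>
<p>Server-side work such as recommendation algorithms, data statistics, APIs, and back-end services</p>
<p><br></p>
<p>Required qualities:</p>
<p>Strong self-motivation and professionalism; proactive and results-oriented</p>
<p> <br></p>
<p>Technical requirements:</p>
<p>1. At least one year of Python development experience; solid grasp of object-oriented analysis and design; familiarity with design patterns</p>
<p>2. Good knowledge of the HTTP protocol; familiarity with concepts such as MVC and MVVM and the related web frameworks</p>
<p>3. Good knowledge of relational database design and SQL; proficiency in either MySQL or PostgreSQL<br></p>
<p>4. Good knowledge of NoSQL and MQ; proficiency with the corresponding technical solutions</p>
<p>5. Familiarity with Javascript/CSS/HTML5, JQuery, React, Vue.js</p>
<p> <br></p>
<p>Nice to have:</p>
<p>Big data, mathematical statistics, machine learning, sklearn, high performance, high concurrency.</p>
</div> 
'''
# Delete everything from each < through the next >
res = re.sub(r'<[^>]*>', '', html)
print(res)

Output:
Job responsibilities:
Server-side work such as recommendation algorithms, data statistics, APIs, and back-end services

Required qualities:
Strong self-motivation and professionalism; proactive and results-oriented
 
Technical requirements:
1. At least one year of Python development experience; solid grasp of object-oriented analysis and design; familiarity with design patterns
2. Good knowledge of the HTTP protocol; familiarity with concepts such as MVC and MVVM and the related web frameworks
3. Good knowledge of relational database design and SQL; proficiency in either MySQL or PostgreSQL
4. Good knowledge of NoSQL and MQ; proficiency with the corresponding technical solutions
5. Familiarity with Javascript/CSS/HTML5, JQuery, React, Vue.js
 
Nice to have:
Big data, mathematical statistics, machine learning, sklearn, high performance, high concurrency.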
 


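
A regex that deletes `<...>` spans leaves character references such as `&amp;` untouched. As an alternative sketch, the standard library's `html.parser.HTMLParser` can collect only the text nodes and decodes those references by default. The class name `TextExtractor`, the function `strip_tags`, and the sample markup are assumptions, not part of the exercise.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for each run of text between tags
        self.parts.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts)

sample = '<p>Python &amp; SQL</p>\n<p>MySQL<br></p>'  # assumed sample markup
print(strip_tags(sample))
```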

Extract the domain name from each of the following URLs:

http://www.interoem.com/messageinfo.asp?id=35
http://3995503.com/class/class09/news_show.asp?id=14
http://lib.wzmc.edu.cn/news/onews.asp?id=769
http://www.zy-ls.com/alfx.asp?newsid=377&id=6
http://www.fincm.com/newslist.asp?id=415

import re
domain_str = '''
http://www.interoem.com/messageinfo.asp?id=35
http://3995503.com/class/class09/news_show.asp?id=14
http://lib.wzmc.edu.cn/news/onews.asp?id=769
http://www.zy-ls.com/alfx.asp?newsid=377&id=6
http://www.fincm.com/newslist.asp?id=415
'''

# [a-zA-Z0-9] replaces the original [a-zA-z0-9]; the A-z range also matches [ \ ] ^ _ `
res = re.findall(r'(http://[a-zA-Z0-9]*\.{0,1}[\w-]+[\w\.]*\.(com|cn))', domain_str)
print(res)
# each result is a (full match, suffix) tuple; print only the full match
for full_match, _suffix in res:
    print(full_match)

Output:
[('http://www.interoem.com', 'com'), ('http://3995503.com', 'com'), ('http://lib.wzmc.edu.cn', 'cn'), ('http://www.zy-ls.com', 'com'), ('http://www.fincm.com', 'com')]
http://www.interoem.com
http://3995503.com
http://lib.wzmc.edu.cn
http://www.zy-ls.com
http://www.fincm.com
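
For well-formed URLs the standard library can do the same job without a hand-written pattern. The sketch below uses `urllib.parse.urlparse` and reads the `netloc` attribute; note that it yields the host without the `http://` prefix, so it is an alternative rather than a drop-in replacement.

```python
from urllib.parse import urlparse

urls = [
    'http://www.interoem.com/messageinfo.asp?id=35',
    'http://3995503.com/class/class09/news_show.asp?id=14',
    'http://lib.wzmc.edu.cn/news/onews.asp?id=769',
    'http://www.zy-ls.com/alfx.asp?newsid=377&id=6',
    'http://www.fincm.com/newslist.asp?id=415',
]

# netloc is the network-location part: everything between "//" and the next "/"
hosts = [urlparse(u).netloc for u in urls]
for host in hosts:
    print(host)
```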

Extract the words from the following string:

hello world ha ha

# My guess is that this exercise is meant to practise split?
import re
string = 'hello world ha ha'
res = re.split(' ', string)
print(res)

Output:
['hello', 'world', 'ha', 'ha']
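
Splitting on a single space works for this input, but it produces empty strings as soon as the spacing is irregular. The sketch below compares three approaches on an assumed messier string.

```python
import re

messy = 'hello  world\tha   ha'  # assumed input with irregular spacing

by_single_space = re.split(' ', messy)       # leaves '' entries and keeps the tab
by_whitespace_run = re.split(r'\s+', messy)  # one split per run of whitespace
by_findall = re.findall(r'[a-zA-Z]+', messy) # pick out the words directly

print(by_single_space)
print(by_whitespace_run)
print(by_findall)
```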

14. Practise deep copy and shallow copy

import copy
a = [1, 2]
b = [3, 4]
a1 = copy.copy(a)
print('id of a: %d, id of a1: %d' % (id(a), id(a1)))

c = [a, b]
print('id of a: %d, id of c: %d, id of b: %d' % (id(a), id(c), id(b)))
# shallow copy: d is a new outer list, but its elements are still a and b
d = copy.copy(c)
print('id of a: %d, id of b: %d' % (id(a), id(b)))
print('id of c: %d, id of d: %d' % (id(c), id(d)))

# a is shared, so the change is visible through both c and d
a[0] = 3
print('id of a: %d, id of b: %d' % (id(a), id(b)))
print('id of c: %d, id of d: %d' % (id(c), id(d)))
print(c)
print(d)
print('-'*100)


a = [1, 2]
b = [3, 4]
a1 = copy.deepcopy(a)
print('id of a: %d, id of a1: %d' % (id(a), id(a1)))

c = [a, b]
print('id of a: %d, id of c: %d, id of b: %d' % (id(a), id(c), id(b)))
# deep copy: d holds independent copies of a and b
d = copy.deepcopy(c)
print('id of a: %d, id of b: %d' % (id(a), id(b)))
print('id of c: %d, id of d: %d' % (id(c), id(d)))

# only c sees this change; d keeps the old value
a[0] = 3
print('id of a: %d, id of b: %d' % (id(a), id(b)))
print('id of c: %d, id of d: %d' % (id(c), id(d)))
print(c)
print(d)
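
Comparing `id()` values by eye is error-prone; the `is` operator expresses the same check directly. The compact sketch below demonstrates the difference between the two kinds of copy with identity tests. The variable names `shallow` and `deep` and the value 99 are chosen here for illustration.

```python
import copy

a = [1, 2]
b = [3, 4]
c = [a, b]

shallow = copy.copy(c)     # new outer list, same inner lists
deep = copy.deepcopy(c)    # new outer list and new inner lists

a[0] = 99                  # mutate one of the shared inner lists

print(shallow is c, shallow[0] is a)   # outer differs, inner is shared
print(deep[0] is a)                    # inner list was copied
print(c, shallow, deep)
```

The shallow copy reflects the change to `a` because it still refers to the same inner list, while the deep copy keeps the original values.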

15. Understand the pitfalls of import