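Regular Expressions in Depth

Regular expressions (regex for short) are a powerful tool for processing text. In crawler development they are commonly used for data extraction, text cleaning, and URL matching. They are an essential skill for writing efficient crawlers.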

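What Is a Regular Expression?

A regular expression describes a pattern of characters. It defines a search pattern that can be used to check whether a string matches, or to extract the substrings that match.

Uses in Crawlers

Use case	Description	Example
Data extraction	Extract data in a specific format from HTML or text	Prices, dates, email addresses
Data cleaning	Remove or replace unwanted characters	Stripping HTML tags and whitespace
URL handling	Match and parse URLs	Extracting the domain, path, and parameters
Data validation	Check whether data is well-formed	Validating email and phone number formats
Content filtering	Select content that meets a condition	Filtering on specific keywords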

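Basic Syntax

Metacharacters

Metacharacters are characters that have a special meaning in a regular expression:

Metacharacter	Description	Example
`.`	Matches any single character except a newline	`a.c` matches "abc", "a1c"
`^`	Matches the start of the string	`^hello` matches strings that start with hello
`$`	Matches the end of the string	`world$` matches strings that end with world
`*`	Matches the preceding element 0 or more times	`ab*` matches "a", "ab", "abb"
`+`	Matches the preceding element 1 or more times	`ab+` matches "ab", "abb"
`?`	Matches the preceding element 0 or 1 times	`ab?` matches "a", "ab"
`\`	Escape character	`\.` matches a literal dot
`|`	Alternation (or)	`cat|dog` matches "cat" or "dog"
`[]`	Character set	`[abc]` matches "a", "b", or "c"
`()`	Grouping	`(ab)+` matches "ab", "abab"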
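
The rows of the table can be checked directly. The following is a minimal sketch that uses `re.fullmatch`, so each pattern has to cover the whole candidate string; the entries in `cases` are illustrative test strings chosen for this demonstration.

```python
import re

# Each entry: (pattern, candidate string, whether the whole string should match)
cases = [
    (r'a.c', 'a1c', True),
    (r'a.c', 'a\nc', False),    # '.' does not match a newline by default
    (r'ab*', 'abbb', True),
    (r'ab+', 'a', False),       # '+' needs at least one 'b'
    (r'ab?', 'abb', False),     # '?' allows at most one 'b'
    (r'a\.c', 'abc', False),    # an escaped dot matches only a literal '.'
    (r'cat|dog', 'dog', True),
    (r'[abc]', 'b', True),
    (r'(ab)+', 'ababab', True),
]

results = [bool(re.fullmatch(p, s)) for p, s, _ in cases]
for (p, s, expected), got in zip(cases, results):
    print(f'{p!r:12} {s!r:10} -> {got}')
```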

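Quantifiers

Quantifiers specify how many times an element may repeat:

Quantifier	Description	Example
`{n}`	Matches exactly n times	`a{3}` matches "aaa"
`{n,}`	Matches at least n times	`a{2,}` matches "aa", "aaa", ...
`{n,m}`	Matches from n to m times	`a{2,4}` matches "aa", "aaa", "aaaa"
`*`	Equivalent to `{0,}`
`+`	Equivalent to `{1,}`
`?`	Equivalent to `{0,1}`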
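
As a small illustration of the counted forms, the sketch below tries `a{2,4}` against runs of one to five characters. The variable names are only for this example.

```python
import re

pattern = re.compile(r'a{2,4}')

# fullmatch requires the whole string to match, so only lengths 2..4 pass
lengths_ok = [bool(pattern.fullmatch('a' * n)) for n in range(1, 6)]
print(lengths_ok)  # [False, True, True, True, False]

# search returns the leftmost match; the greedy quantifier takes up to 4
longest = pattern.search('aaaaa').group()
print(longest)  # aaaa

# * is {0,}, so it can also match the empty string
empty_ok = re.fullmatch(r'a*', '') is not None
print(empty_ok)  # True
```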

Greedy and Non-Greedy Matching

By default, quantifiers are greedy and match as many characters as possible. Appending `?` to a quantifier makes it non-greedy (lazy):

import re

text = '<div>content1</div><div>content2</div>'

# Greedy: matches the whole string
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group())  # <div>content1</div><div>content2</div>

# Non-greedy: matches the shortest possible span
non_greedy = re.search(r'<div>.*?</div>', text)
print(non_greedy.group())  # <div>content1</div>
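
The same difference shows up when collecting every element with `re.findall`. The sketch below also includes a negated character class, a common alternative to a lazy quantifier when the content is known not to contain `<`; that assumption holds for this sample string but not for nested markup.

```python
import re

text = '<div>content1</div><div>content2</div>'

# Lazy quantifier: stop at the first closing tag after each opening tag
lazy = re.findall(r'<div>(.*?)</div>', text)
print(lazy)  # ['content1', 'content2']

# Negated character class: the group can never run across a '<'
negated = re.findall(r'<div>([^<]*)</div>', text)
print(negated)  # ['content1', 'content2']

# Greedy quantifier: a single match that spans both elements
greedy = re.findall(r'<div>(.*)</div>', text)
print(greedy)  # ['content1</div><div>content2']
```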

Character Classes

A character class matches any one character from a set:

import re

# Simple character classes
re.search(r'[abc]', 'bed')  # matches 'b'
re.search(r'[0-9]', 'abc123')  # matches '1'
re.search(r'[a-zA-Z]', '123abc')  # matches 'a'

# Negated character class
re.search(r'[^0-9]', '123abc')  # matches 'a' (a non-digit)

# Common ranges
re.search(r'[a-z]', 'ABCabc')  # matches 'a'
re.search(r'[A-Z]', 'ABCabc')  # matches 'A'
re.search(r'[0-9]', 'abc123')  # matches '1'

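Predefined Character Classes

Python provides several predefined character classes. For `str` patterns they are Unicode-aware by default, so the bracketed forms below are exact only when the `re.ASCII` flag is set:

Class	Description	ASCII equivalent
`\d`	Digit	`[0-9]`
`\D`	Non-digit	`[^0-9]`
`\w`	Word character (letters, digits, underscore)	`[a-zA-Z0-9_]`
`\W`	Non-word character	`[^a-zA-Z0-9_]`
`\s`	Whitespace (space, tab, newline, and so on)	`[ \t\n\r\f\v]`
`\S`	Non-whitespace	`[^ \t\n\r\f\v]`
`\b`	Word boundary
`\B`	Non-word boundary

import re

# Digits
re.findall(r'\d+', '订单号：12345，金额：99.5')  # ['12345', '99', '5']

# Word characters (Chinese characters count as word characters)
re.findall(r'\w+', 'Hello World! 你好')  # ['Hello', 'World', '你好']

# Whitespace
re.split(r'\s+', 'a  b\tc\nd')  # ['a', 'b', 'c', 'd']

# Word boundaries
re.findall(r'\bword\b', 'a word words')  # ['word']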
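
Because `\w` and `\d` are Unicode-aware for `str` patterns, they match more than their ASCII bracket forms. The sketch below compares the default behavior with `re.ASCII` on a sample string that ends with three full-width digits (written as escapes so they are easy to spot); the sample text is made up for illustration.

```python
import re

# 'Hello', two Chinese characters, ASCII digits, then full-width digits
text = 'Hello 你好 123 \uff11\uff12\uff13'

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)
print(ascii_words)  # ['Hello', '123']

unicode_digits = re.findall(r'\d+', text)
ascii_digits = re.findall(r'\d+', text, re.ASCII)
print(ascii_digits)  # ['123']

# Without re.ASCII, the Chinese word and the full-width digits match too
print(len(unicode_words), len(unicode_digits))  # 4 2
```

When scraped numbers are later passed to code that expects ASCII digits, `[0-9]` or the `re.ASCII` flag is the more predictable choice.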

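Special Sequences

Sequence	Description
`\A`	Matches only at the start of the string (like `^`, but unaffected by MULTILINE)
`\Z`	Matches only at the very end of the string (like `$`, but unaffected by MULTILINE and not matching before a trailing newline)
`\b`	Word boundary
`\B`	Non-word boundary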
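
The difference between the anchors is easiest to see on a multi-line string with a trailing newline. This is a minimal sketch; the sample text and variable names are illustrative.

```python
import re

text = 'first\nsecond\n'

# With MULTILINE, ^ matches at the start of every line, \A only at the very start
caret = re.findall(r'^\w+', text, re.M)
anchor_a = re.findall(r'\A\w+', text, re.M)
print(caret)     # ['first', 'second']
print(anchor_a)  # ['first']

# $ also matches just before a trailing newline, \Z only at the very end
dollar = re.search(r'second$', text)
anchor_z = re.search(r'second\Z', text)
print(dollar is not None)  # True
print(anchor_z)            # None
```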

Groups and References

Basic Groups

Parentheses `()` create a group, which lets you extract part of a match:

import re

# Extract the year, month, and day from a date
text = '日期：2024-01-15'
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

if match:
    print(match.group(0))  # 2024-01-15 (the full match)
    print(match.group(1))  # 2024 (first group)
    print(match.group(2))  # 01 (second group)
    print(match.group(3))  # 15 (third group)
    print(match.groups())  # ('2024', '01', '15')

Named Groups

`(?P<name>...)` gives a group a name, which makes the code easier to read:

import re

text = '邮箱：user@example.com'

# Named groups
pattern = r'(?P<username>\w+)@(?P<domain>\w+\.\w+)'
match = re.search(pattern, text)

if match:
    print(match.group('username'))  # user
    print(match.group('domain'))    # example.com
    print(match.groupdict())        # {'username': 'user', 'domain': 'example.com'}

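Non-Capturing Groups

`(?:...)` creates a non-capturing group, which groups the pattern without capturing the text it matches:

import re

text = 'price: 100USD'

# Ordinary group (captures)
m1 = re.search(r'(\d+)(USD|EUR)', text)
print(m1.groups())  # ('100', 'USD')

# Non-capturing group (does not capture)
m2 = re.search(r'(\d+)(?:USD|EUR)', text)
print(m2.groups())  # ('100',)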

Backreferences

A backreference refers to an earlier group from within the same pattern:

import re

# Match repeated words
text = 'the the quick brown fox fox'
matches = re.findall(r'(\b\w+\b)\s+\1', text)
print(matches)  # ['the', 'fox']

# Backreference to a named group
text = '<div>content</div>'
match = re.search(r'<(?P<tag>\w+)>.*?</(?P=tag)>', text)
print(match.group('tag'))  # div
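Lookaround Assertions

Lookaround (zero-width) assertions check what comes before or after a position without consuming any characters:

import re

text = 'price: $100, $200, $300'

# Positive lookahead: match a $ that is followed by digits
matches = re.findall(r'\$(?=\d+)', text)
print(matches)  # ['$', '$', '$']

# Negative lookahead: match a $ that is not followed by digits
text2 = 'price: $100, $abc'
matches = re.findall(r'\$(?!\d+)', text2)
print(matches)  # ['$'] (the $ in $abc)

# Positive lookbehind: match digits preceded by "price: "
text = 'price: 100, count: 200'
matches = re.findall(r'(?<=price: )\d+', text)
print(matches)  # ['100']

# Negative lookbehind: match numbers not preceded by "price: "
# \b pins the match to the start of a number; without it the
# pattern would also match '00' from inside '100'
matches = re.findall(r'(?<!price: )\b\d+', text)
print(matches)  # ['200']

Lookaround summary:

Syntax	Name	Description
`(?=...)`	Positive lookahead	Matches a position that is followed by ...
`(?!...)`	Negative lookahead	Matches a position that is not followed by ...
`(?<=...)`	Positive lookbehind	Matches a position that is preceded by ...
`(?<!...)`	Negative lookbehind	Matches a position that is not preceded by ...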
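
Lookbehind and lookahead can be combined so that only the middle of a larger phrase is returned. The sketch below, on a made-up sample string, extracts amounts that are preceded by `$` and followed by ` USD`. It also shows a restriction of Python's `re` module: a lookbehind must have a fixed width, so a variable-length one is rejected when the pattern is compiled.

```python
import re

text = 'cost: $100 USD, $200 CAD, 300 USD'

# Digits that directly follow '$' and are directly followed by ' USD'
usd = re.findall(r'(?<=\$)\d+(?= USD)', text)
print(usd)  # ['100']

# A lookbehind containing \s* has no fixed width, so compiling it fails
try:
    re.compile(r'(?<=\$\s*)\d+')
    fixed_width_required = False
except re.error:
    fixed_width_required = True
print(fixed_width_required)  # True
```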

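re Module Functions

Python's `re` module provides a rich set of functions for working with regular expressions.

re.match()

Matches only at the start of the string and returns at most one match:

import re

text = 'Hello World'

# Match at the start
match = re.match(r'Hello', text)
if match:
    print(match.group())  # Hello

# Returns None when the start does not match
match = re.match(r'World', text)
print(match)  # None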
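
`re.match` anchors only the start, so it accepts strings with trailing content. For validation, `re.fullmatch` requires the whole string to match. The sketch below uses a hypothetical helper, `is_date`, that checks only the shape of a date and not whether it exists on the calendar.

```python
import re

DATE = r'\d{4}-\d{2}-\d{2}'

def is_date(s):
    """Return True only if the whole string has the form YYYY-MM-DD."""
    return re.fullmatch(DATE, s) is not None

# match succeeds on a prefix and ignores the rest of the string
prefix_only = re.match(DATE, '2024-01-15T10:30')
print(prefix_only.group())          # 2024-01-15

print(is_date('2024-01-15T10:30'))  # False
print(is_date('2024-01-15'))        # True
```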

re.search()

Scans the string and returns the first match:

import re

text = 'Hello World'

# Find the first match
match = re.search(r'World', text)
if match:
    print(match.group())   # World
    print(match.start())   # 6 (start of the match)
    print(match.end())     # 11 (end of the match)
    print(match.span())    # (6, 11)

re.findall()

Finds all matches and returns them as a list:

import re

text = '订单号：A001, A002, A003'

# Find all order numbers
orders = re.findall(r'A\d{3}', text)
print(orders)  # ['A001', 'A002', 'A003']

# With groups, findall returns a list of tuples of the groups
text = 'name: Tom, age: 18; name: Jerry, age: 20'
results = re.findall(r'name: (\w+), age: (\d+)', text)
print(results)  # [('Tom', '18'), ('Jerry', '20')]

re.finditer()

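Finds all matches and returns an iterator of Match objects (a good fit when there are many matches):

import re

text = '价格：100元，200元，300元'

# Process the matches one at a time
for match in re.finditer(r'(\d+)元', text):
    print(f'price: {match.group(1)}, span: {match.span()}')
# Output:
# price: 100, span: (3, 7)
# price: 200, span: (8, 12)
# price: 300, span: (13, 17)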

re.sub()

Replaces the text that matches:

import re

text = '价格：100元，200元，300元'

# Simple replacement
result = re.sub(r'元', 'CNY', text)
print(result)  # 价格：100CNY，200CNY，300CNY

# Replacement computed by a function
def double_price(match):
    price = int(match.group(1))
    return f'{price * 2}元'

result = re.sub(r'(\d+)元', double_price, text)
print(result)  # 价格：200元，400元，600元

# Limit the number of replacements with count
result = re.sub(r'元', 'CNY', text, count=1)
print(result)  # 价格：100CNY，200元，300元
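
The replacement string can also refer to captured groups, using `\1` for numbered groups and `\g<name>` for named ones. A minimal sketch on a made-up sample string that reorders dates:

```python
import re

text = '2024-01-15 and 2024-02-20'

# Numbered groups: reorder year-month-day into day/month/year
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', text)
print(swapped)  # 15/01/2024 and 20/02/2024

# Named groups: the same idea with \g<name>
named = re.sub(
    r'(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})',
    r'\g<d>.\g<m>.\g<y>',
    text,
)
print(named)  # 15.01.2024 and 20.02.2024
```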

re.subn()

Replaces matches and also returns the number of replacements made:

import re

text = 'aaa bbb aaa ccc'

result, count = re.subn(r'aaa', 'xxx', text)
print(result)  # xxx bbb xxx ccc
print(count)   # 2 (two replacements were made)

re.split()

Splits a string wherever the pattern matches:

import re

text = 'a1b2c3d4e'

# Split on digits
parts = re.split(r'\d+', text)
print(parts)  # ['a', 'b', 'c', 'd', 'e']

# Keep the separators (use a capturing group)
parts = re.split(r'(\d+)', text)
print(parts)  # ['a', '1', 'b', '2', 'c', '3', 'd', '4', 'e']

# Limit the number of splits
parts = re.split(r'\d+', text, maxsplit=2)
print(parts)  # ['a', 'b', 'c3d4e']

re.compile()

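Compiles a pattern into a reusable pattern object, which is convenient and slightly faster when the same pattern is used many times:

import re

# Compile the pattern
pattern = re.compile(r'\d+')

# Use the compiled pattern
text = '订单号：12345，金额：99.5'
matches = pattern.findall(text)
print(matches)  # ['12345', '99', '5']

# Pass flags at compile time
pattern = re.compile(r'hello', re.IGNORECASE)
match = pattern.search('Hello World')
print(match.group())  # Hello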

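Flags

Flags change how a pattern is matched:

Flag	Short form	Description
`re.IGNORECASE`	`re.I`	Case-insensitive matching
`re.MULTILINE`	`re.M`	Multiline mode: `^` and `$` match at the start and end of each line
`re.DOTALL`	`re.S`	`.` matches any character, including a newline
`re.VERBOSE`	`re.X`	Verbose mode: allows comments and whitespace in the pattern
`re.ASCII`	`re.A`	Makes `\w`, `\d`, and similar classes match ASCII characters only

import re

text = '''第一行
第二行
第三行'''

# IGNORECASE: ignore case
print(re.search(r'FIRST', 'First Line', re.I))  # matches 'First'

# MULTILINE: ^ matches at the start of every line
print(re.findall(r'^第\w+行', text, re.M))  # ['第一行', '第二行', '第三行']

# DOTALL: . also matches newlines, so .* can span several lines
print(re.search(r'第一行.*第三行', text, re.S))  # matches the whole text

# Combine several flags with |
print(re.search(r'first.+third', 'First line\nThird line', re.I | re.S))  # matches

# VERBOSE: whitespace and comments inside the pattern are ignored
pattern = re.compile(r'''
    \d{4}      # year
    -          # separator
    \d{2}      # month
    -          # separator
    \d{2}      # day
''', re.VERBOSE)

print(pattern.search('日期：2024-01-15').group())  # 2024-01-15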

Match Objects

A successful match returns a Match object, which holds information about the match and methods for reading it:

import re

text = '订单号：ORD12345，金额：99.5元'
match = re.search(r'订单号：(\w+)，金额：([\d.]+)元', text)

if match:
    # Matched text
    print(match.group())      # full match: 订单号：ORD12345，金额：99.5元
    print(match.group(1))     # first group: ORD12345
    print(match.group(2))     # second group: 99.5
    print(match.groups())     # all groups: ('ORD12345', '99.5')

    # Match position
    print(match.start())      # start of the match
    print(match.end())        # end of the match
    print(match.span())       # (start, end)

    # The original input string
    print(match.string)

    # Named groups
    text = 'name: Tom, age: 18'
    match = re.search(r'name: (?P<name>\w+), age: (?P<age>\d+)', text)
    print(match.group('name'))      # Tom
    print(match.group('age'))       # 18
    print(match.groupdict())        # {'name': 'Tom', 'age': '18'}
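
When nothing matches, `re.search` returns `None`, and calling `.group()` on it raises `AttributeError`. In a crawler, where fields are often missing from a page, it helps to wrap that check once. The helper below, `first_group`, is a hypothetical convenience function written for this sketch; it is not part of the `re` module.

```python
import re

def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

page = '订单号：ORD12345，金额：99.5元'

order_id = first_group(r'订单号：(\w+)', page)
coupon = first_group(r'优惠券：(\w+)', page, default='')  # field absent from the page

print(order_id)      # ORD12345
print(repr(coupon))  # ''
```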

Common Regex Patterns for Crawlers

Extracting URLs

import re

text = '访问 https://example.com/path?query=1 了解更多'

# Basic URL matching
url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
urls = re.findall(url_pattern, text)
print(urls)  # ['https://example.com/path?query=1']

# A more complete URL pattern (the path class excludes spaces,
# so the match stops at the end of the URL)
url_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+[/\w.-]*/?(?:\?[^#\s]*)?(?:#\S*)?'
urls = re.findall(url_pattern, text)
print(urls)  # ['https://example.com/path?query=1']

# Extract the individual parts of a URL
url = 'https://user:pass@example.com:8080/path/to/page?query=value#section'
pattern = r'(?P<scheme>https?)://(?:(?P<user>\w+):(?P<pass>\w+)@)?(?P<host>[\w.-]+)(?::(?P<port>\d+))?(?P<path>/[^?#]*)?(?:\?(?P<query>[^#]*))?(?:#(?P<fragment>.*))?'
match = re.match(pattern, url)
if match:
    print(match.groupdict())
    # {'scheme': 'https', 'user': 'user', 'pass': 'pass', 'host': 'example.com',
    #  'port': '8080', 'path': '/path/to/page', 'query': 'query=value', 'fragment': 'section'}

Extracting Email Addresses

import re

# Placeholder addresses on example.com and example.org
text = '联系我们：support@example.com 或 sales@example.org'

# Email matching
email_pattern = r'[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(emails)  # ['support@example.com', 'sales@example.org']

Extracting Mobile Phone Numbers

import re

text = '联系电话：13812345678，186-1234-5678'

# Mainland China mobile numbers
phone_pattern = r'1[3-9]\d{9}'
phones = re.findall(phone_pattern, text)
print(phones)  # ['13812345678']

# Mobile numbers with optional separators
phone_pattern = r'1[3-9]\d-?\d{4}-?\d{4}'
phones = re.findall(phone_pattern, text)
print(phones)  # ['13812345678', '186-1234-5678']
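
One caveat with the patterns above is that they also match 11 digits found inside a longer run of digits, such as an order ID. A possible refinement, sketched below on a made-up sample string, is to add lookarounds that require the number not to touch another digit on either side.

```python
import re

text = 'ID 91381234567890, phone 13812345678'

# Without boundaries, 11 digits from inside the 14-digit ID also match
naive = re.findall(r'1[3-9]\d{9}', text)
print(naive)   # ['13812345678', '13812345678']

# (?<!\d) and (?!\d) reject matches that are part of a longer digit run
strict = re.findall(r'(?<!\d)1[3-9]\d{9}(?!\d)', text)
print(strict)  # ['13812345678']
```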

Extracting Prices

import re

text = '原价：￥199.00，现价：¥99.5，折扣价：$49.99'

# Extract amounts in several currencies
price_pattern = r'[¥$￥]\s*(\d+(?:\.\d{1,2})?)'
prices = re.findall(price_pattern, text)
print(prices)  # ['199.00', '99.5', '49.99']

# Extract the currency symbol together with the amount
price_pattern = r'([¥$￥])\s*(\d+(?:\.\d{1,2})?)'
matches = re.findall(price_pattern, text)
for currency, amount in matches:
    print(f'currency: {currency}, amount: {amount}')

Extracting Dates and Times

import re

text = '创建时间：2024-01-15 10:30:45，更新于 2024/02/20'

# Match the YYYY-MM-DD format
date_pattern = r'\d{4}-\d{2}-\d{2}'
dates = re.findall(date_pattern, text)
print(dates)  # ['2024-01-15']

# Match several date formats
date_pattern = r'\d{4}[-/]\d{2}[-/]\d{2}'
dates = re.findall(date_pattern, text)
print(dates)  # ['2024-01-15', '2024/02/20']

# Match a date followed by a time
datetime_pattern = r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})'
match = re.search(datetime_pattern, text)
if match:
    print(f'date: {match.group(1)}, time: {match.group(2)}')
    # date: 2024-01-15, time: 10:30:45

Extracting the Content of HTML Tags

import re

html = '''
<div class="content">
    <h1>标题</h1>
    <p>段落内容</p>
    <a href="https://example.com">链接</a>
</div>
'''

# Extract the content of one tag
content = re.search(r'<h1>(.*?)</h1>', html, re.S)
print(content.group(1))  # 标题

# Extract links
links = re.findall(r'<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>', html)
print(links)  # [('https://example.com', '链接')]

# Extract the text of every element that contains no nested tags
# (the outer div is skipped because its content includes other tags)
tags = re.findall(r'<(\w+)[^>]*>([^<]*)</\1>', html)
print(tags)  # [('h1', '标题'), ('p', '段落内容'), ('a', '链接')]

Cleaning Text

import re

text = '  Hello   World!  \n\n  This is  a test.  '

# Collapse runs of whitespace
cleaned = re.sub(r'\s+', ' ', text).strip()
print(cleaned)  # Hello World! This is a test.

# Strip HTML tags
html = '<p>Hello <b>World</b>!</p>'
cleaned = re.sub(r'<[^>]+>', '', html)
print(cleaned)  # Hello World!

# Remove special characters
text = 'Hello@World#123!'
cleaned = re.sub(r'[^\w\s]', '', text)
print(cleaned)  # HelloWorld123

# Keep only Chinese characters
text = 'Hello世界123'
chinese = re.sub(r'[^\u4e00-\u9fa5]', '', text)
print(chinese)  # 世界

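Performance Tips

Precompiling Regular Expressions

For a pattern that is used repeatedly, compiling it once avoids redoing work on every call. The `re` module caches recently compiled patterns, so the module-level functions do not reparse the pattern each time; the saving comes mainly from skipping the cache lookup and is usually modest:

import re
import time

# Without compiling: every call looks the pattern up in re's cache
def without_compile(texts):
    results = []
    for text in texts:
        results.extend(re.findall(r'\d+', text))
    return results

# Precompiled: the pattern object is created once and reused
def with_compile(texts):
    pattern = re.compile(r'\d+')
    results = []
    for text in texts:
        results.extend(pattern.findall(text))
    return results

# Benchmark
texts = ['abc123def456'] * 10000

start = time.perf_counter()
without_compile(texts)
print(f'without compile: {time.perf_counter() - start:.3f}s')

start = time.perf_counter()
with_compile(texts)
print(f'precompiled: {time.perf_counter() - start:.3f}s')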

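Avoiding Catastrophic Backtracking

Some patterns can trigger an enormous amount of backtracking and cause serious performance problems:

import re
import time

# Dangerous: nested quantifiers can backtrack catastrophically
# on input such as 'aaaaaaaaaaaaaaaaaaaaaaa!'
bad_pattern = r'(a+)+b'

# Safer: an atomic group or a possessive quantifier (both need Python 3.11+)
safe_pattern = r'(?>a+)+b'  # atomic group
# or
safe_pattern = r'(a++)b'    # possessive quantifier

# Another dangerous example: the running time roughly doubles with
# each extra 'a', so keep the input short when trying this out
text = 'a' * 22 + '!'

start = time.perf_counter()
re.match(r'(a+)+$', text)  # returns None, but only after exhaustive backtracking
print(f'elapsed: {time.perf_counter() - start:.3f}s')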
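
On Python versions before 3.11, where atomic groups and possessive quantifiers are not available, the usual remedy is to rewrite the pattern so that quantifiers are not nested. The sketch below assumes the intent of `(a+)+$` is simply "one or more a up to the end"; under that assumption `a+$` accepts the same strings and backtracks only linearly. The sample inputs are illustrative.

```python
import re

nested = re.compile(r'(?:a+)+$')  # nested quantifiers: exponential on failure
flat = re.compile(r'a+$')         # same set of strings, a single quantifier

# Both patterns agree on short inputs (kept short so `nested` stays fast)
samples = ['aaa', 'a' * 15 + '!', '', 'baaa']
agree = all(bool(nested.match(s)) == bool(flat.match(s)) for s in samples)
print(agree)  # True

# The flat pattern fails quickly even on a long non-matching input
long_input = 'a' * 5000 + '!'
print(flat.match(long_input))  # None
```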

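Using Raw Strings

Always write patterns as raw strings (`r'...'`) to avoid problems with escape sequences:

import re

# Not recommended: every backslash has to be doubled
pattern = '\\d+\\.\\d+'  # matches a decimal number

# Recommended: use a raw string
pattern = r'\d+\.\d+'    # matches a decimal number

# Example
text = '价格：99.5元'
match = re.search(r'\d+\.\d+', text)
print(match.group())  # 99.5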

Common Mistakes and Pitfalls

1. Forgetting to use a raw string

import re

# Wrong: in an ordinary string, \b is a backspace character
pattern = '\bword\b'  # this is not a word boundary

# Right: use a raw string
pattern = r'\bword\b'

2. Confusing match and search

import re

text = 'Hello World'

# match only matches at the start of the string
result = re.match(r'World', text)
print(result)  # None

# search scans the whole string
result = re.search(r'World', text)
print(result.group())  # World

3. Forgetting about newlines

import re

text = '第一行\n第二行'

# By default, . does not match a newline
match = re.search(r'第一行.第二行', text)
print(match)  # None

# Use re.DOTALL
match = re.search(r'第一行.第二行', text, re.DOTALL)
print(repr(match.group()))  # '第一行\n第二行'

4. Misjudging what a quantifier will match

import re

# Surprise: \w also matches digits, so the second group swallows '456'
text = '123abc456'
match = re.search(r'(\d+)(\w+)', text)
print(match.groups())  # ('123', 'abc456')

# To separate the digits from the letters, be explicit about each part
match = re.search(r'(\d+)([a-z]+)(\d+)', text)
print(match.groups())  # ('123', 'abc', '456')

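Summary

In this chapter we covered:

Basic syntax - metacharacters, quantifiers, character classes
Groups and references - capturing groups, named groups, non-capturing groups, lookaround assertions
re module functions - match, search, findall, finditer, sub, split, compile
Flags - IGNORECASE, MULTILINE, DOTALL, VERBOSE
Common crawler patterns - URLs, email addresses, phone numbers, prices, dates, HTML tags
Performance - precompiling, avoiding backtracking, using raw strings
Common mistakes - raw strings, match vs search, newline handling

Regular expressions are a sharp tool for crawler development, and mastering them makes data extraction and cleaning considerably more efficient.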

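Exercises

Write a regular expression that extracts every URL from a string
Write a regular expression that validates the format of an email address
Use regular expressions to strip HTML tags, keeping only the plain text
Write a regular expression that extracts the timestamp and log level from log lines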
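
As a starting point for the last exercise, here is one possible sketch. The log format (`YYYY-MM-DD HH:MM:SS [LEVEL] message`), the sample lines, and the name `LOG_RE` are all assumptions made for illustration; real logs will need the pattern adjusted to their own layout.

```python
import re

# Assumed line format: "2024-01-15 10:30:45 [INFO] message"
LOG_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>[A-Z]+)\] '
    r'(?P<msg>.*)$',
    re.MULTILINE,  # let ^ and $ match at every line
)

log = '''2024-01-15 10:30:45 [INFO] spider started
2024-01-15 10:30:46 [ERROR] request failed: timeout'''

entries = [m.groupdict() for m in LOG_RE.finditer(log)]
for entry in entries:
    print(entry['ts'], entry['level'])
# 2024-01-15 10:30:45 INFO
# 2024-01-15 10:30:46 ERROR
```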
