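Regular Expressions in Practice

The real value of regular expressions lies in solving concrete problems. This chapter works through practical cases in data validation, extraction, transformation, log analysis, and related tasks. Each case includes a complete code example and an explanation of the approach.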

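Data Validation

Data validation is the most common use of regular expressions. Whenever users submit a form or an API receives parameters, the input needs a format check.

Validating Email Addresses

Email validation is a classic form-validation task. Note that a regular expression that fully conforms to RFC 5322 is extremely complex, so real applications usually settle for a simplified version.

Simplified version (suitable for most cases):

// Matches most common email formats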
const emailRegex = /^[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}$/;

console.log(emailRegex.test("test@example.com"));            // true
console.log(emailRegex.test("user.name@example.com"));       // true
console.log(emailRegex.test("user_name@mail.example.org"));  // true
console.log(emailRegex.test("invalid"));               // false
console.log(emailRegex.test("@example.com"));          // false

Practical version (stricter, suitable for production):

const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

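// Breakdown:
// [a-zA-Z0-9._%+-]+   local part: letters, digits, dot, underscore, percent, plus, hyphen
// @                   the required @ sign
// [a-zA-Z0-9.-]+      domain part: letters, digits, dot, hyphen
// \.                  the required dot
// [a-zA-Z]{2,}        top-level domain, at least 2 letters

Python implementation: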

import re

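def validate_email(email):
    """Validate the format of an email address."""
    pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
    # fullmatch() rejects a trailing newline, which '$' combined with match() would accept
    return bool(pattern.fullmatch(email))

# Tests
print(validate_email("test@example.com"))  # True
print(validate_email("invalid-email"))     # False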

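A Note on Email Validation

Do not try to achieve perfect email validation with a regular expression. The address format defined by RFC 5322 is very complex, and a full implementation takes a regular expression thousands of characters long. In real projects:

Use a simplified regular expression for an initial check
Send a verification email to confirm that the address actually exists
Combine this with a DNS lookup to verify that the domain is valid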
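
The first step in the list above can be sketched in Python as a cheap syntactic pre-check. This is only an illustration: the function name `precheck_email` is hypothetical, and the length caps (254 characters overall, 64 for the local part) are assumptions based on commonly cited SMTP limits. It does not replace the confirmation email or the DNS lookup.

```python
import re

# Same simplified pattern as the Python example above.
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def precheck_email(email):
    """Cheap syntactic pre-check (hypothetical helper, not a full validator)."""
    if len(email) > 254:                 # assumed overall length cap
        return False
    if not EMAIL_RE.fullmatch(email):    # fullmatch also rejects a trailing newline
        return False
    local, domain = email.rsplit('@', 1)
    if len(local) > 64:                  # assumed local-part length cap
        return False
    # Reject shapes that the simplified pattern lets through.
    if '..' in email or local.startswith('.') or local.endswith('.'):
        return False
    if domain.startswith(('.', '-')):
        return False
    return True

print(precheck_email("test@example.com"))   # True
print(precheck_email("a..b@example.com"))   # False
```

Keeping the extra rules in ordinary code rather than in the pattern keeps the regular expression short and each rule easy to read.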

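Validating Mobile Phone Numbers

Mobile number formats differ from country to country. The examples here use mainland China numbers.

Mainland China mobile number:

// Mainland China mobile number: starts with 1, second digit 3-9, 11 digits in total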
const phoneRegex = /^1[3-9]\d{9}$/;

console.log(phoneRegex.test("13812345678"));  // true
console.log(phoneRegex.test("12812345678"));  // false (second digit must be 3-9)
console.log(phoneRegex.test("1381234567"));   // false (only 10 digits)
console.log(phoneRegex.test("138123456789")); // false (more than 11 digits)

With an international dialing prefix:

// Optional mainland China country code: +86 13812345678 or 0086 13812345678
// (restricting the prefix to 86 avoids accepting arbitrary leading digits)
const internationalPhone = /^(?:(?:\+|00)86\s*)?1[3-9]\d{9}$/;

console.log(internationalPhone.test("+86 13812345678"));   // true
console.log(internationalPhone.test("0086 13812345678"));  // true
console.log(internationalPhone.test("13812345678"));       // true

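Validating ID Card Numbers

A mainland China resident ID number has 18 characters: a region code, a date of birth, a sequence code, and a check character.

// 18-character ID number validation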
const idCardRegex = /^[1-9]\d{5}(18|19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\d{3}[\dXx]$/;

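/*
Breakdown:
[1-9]\d{5}                    region code, 6 digits
(18|19|20)                    first two digits of the year
\d{2}                         last two digits of the year
(0[1-9]|1[0-2])               month, 01-12
(0[1-9]|[12]\d|3[01])         day, 01-31 (not checked against the month)
\d{3}                         sequence code
[\dXx]                        check character, a digit or X
*/

console.log(idCardRegex.test("11010519900307233X"));  // true (format only; the check character is not verified)
console.log(idCardRegex.test("11010519900307233"));   // false (only 17 characters)

Full ID validation (including the check character):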

function validateIdCard(idCard) {
    // Check the format first
    const regex = /^[1-9]\d{5}(18|19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\d{3}[\dXx]$/;
    if (!regex.test(idCard)) {
        return false;
    }

    // Verify the check character: weighted sum of the first 17 digits, modulo 11
    const weights = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2];
    const checkCodes = ['1', '0', 'X', '9', '8', '7', '6', '5', '4', '3', '2'];

    let sum = 0;
    for (let i = 0; i < 17; i++) {
        sum += parseInt(idCard[i], 10) * weights[i];
    }

    const checkCode = checkCodes[sum % 11];
    return idCard[17].toUpperCase() === checkCode;
}

console.log(validateIdCard("11010519491231002X"));  // true (the check character matches)
console.log(validateIdCard("11010519900307233X"));  // false (well-formed, but the check character should be 8)
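
A Python port of the same check might look like the sketch below. It additionally parses the birth date with `datetime.strptime`, since the pattern alone accepts impossible dates such as February 30. The names `validate_id_card`, `ID_RE`, `WEIGHTS`, and `CHECK_CODES` are illustrative choices, not part of the original example.

```python
import re
from datetime import datetime

# Same format rule as the JavaScript pattern, with the birth date in a named group.
ID_RE = re.compile(
    r'[1-9]\d{5}'
    r'(?P<birth>(?:18|19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01]))'
    r'\d{3}[\dXx]'
)
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = '10X98765432'  # indexed by (weighted sum % 11)

def validate_id_card(id_card):
    """Check format, calendar validity of the birth date, and the check character."""
    m = ID_RE.fullmatch(id_card)
    if not m:
        return False
    try:
        datetime.strptime(m.group('birth'), '%Y%m%d')  # rejects dates like 19900230
    except ValueError:
        return False
    total = sum(int(d) * w for d, w in zip(id_card[:17], WEIGHTS))
    return CHECK_CODES[total % 11] == id_card[17].upper()

print(validate_id_card("11010519491231002X"))  # True
print(validate_id_card("11010519900307233X"))  # False
```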

Validating URLs

// URL validation (supports http, https, ftp)
const urlRegex = /^(https?|ftp):\/\/[^\s/$.?#].[^\s]*$/i;

// Stricter URL validation
const strictUrlRegex = /^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$/;

console.log(urlRegex.test("https://www.example.com"));       // true
console.log(urlRegex.test("http://example.com/path?q=1"));   // true
console.log(urlRegex.test("ftp://ftp.example.com"));         // true
console.log(urlRegex.test("example.com"));                   // false (missing scheme)

Python version:

import re
from urllib.parse import urlparse

def validate_url(url):
    """Check the URL format with a regex, then check its structure with urlparse."""
    regex = re.compile(
        r'^https?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain name
        r'localhost|'  # localhost
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # or an IP address
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

    if not regex.match(url):
        return False

    # Further structural check
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:  # raised by urlparse for malformed input
        return False

Validating IP Addresses

IPv4 address:

// IPv4 address validation
const ipv4Regex = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/;

/*
Breakdown:
25[0-5]            matches 250-255
2[0-4][0-9]        matches 200-249
[01]?[0-9][0-9]?   matches 0-199 (also accepts leading zeros such as 01 or 001)
(?:...\.){3}       the first three octets, each followed by a dot
*/

console.log(ipv4Regex.test("192.168.1.1"));    // true
console.log(ipv4Regex.test("255.255.255.255")); // true
console.log(ipv4Regex.test("256.1.1.1"));      // false (greater than 255)
console.log(ipv4Regex.test("192.168.1"));      // false (only 3 octets)

IPv6 address (simplified):

// IPv6 address validation (simplified)
const ipv6Regex = /^([\da-fA-F]{1,4}:){7}[\da-fA-F]{1,4}$/;

console.log(ipv6Regex.test("2001:0db8:85a3:0000:0000:8a2e:0370:7334"));  // true
console.log(ipv6Regex.test("::1"));  // false (the simplified version does not support the :: shorthand)
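
In Python, the standard library's `ipaddress` module is an alternative to hand-written patterns and also handles the `::` shorthand that the simplified IPv6 pattern rejects. The wrapper name `ip_version` below is illustrative.

```python
import ipaddress

def ip_version(text):
    """Return 4 or 6 for a valid IP address literal, or None otherwise."""
    try:
        return ipaddress.ip_address(text).version
    except ValueError:
        return None

print(ip_version("192.168.1.1"))  # 4
print(ip_version("256.1.1.1"))    # None
print(ip_version("::1"))          # 6
```

A regular expression is still useful for finding candidate addresses inside free text; a parser like this is the safer choice when the whole input must be one address.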

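Validating Password Strength

Checking password strength with zero-width lookahead assertions is a classic use of regular expressions.

At least 8 characters, with lowercase letters, uppercase letters, and digits: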

const passwordRegex = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$/;

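/*
Breakdown:
(?=.*[a-z])     positive lookahead: a lowercase letter must appear somewhere ahead
(?=.*[A-Z])     positive lookahead: an uppercase letter must appear somewhere ahead
(?=.*\d)        positive lookahead: a digit must appear somewhere ahead
[a-zA-Z\d]{8,}  the actual match: at least 8 letters or digits
*/

console.log(passwordRegex.test("Password123"));  // true
console.log(passwordRegex.test("password"));     // false (no uppercase letter, no digit)
console.log(passwordRegex.test("PASSWORD123"));  // false (no lowercase letter)
console.log(passwordRegex.test("Pass1"));        // false (fewer than 8 characters)

A stronger requirement that also demands a special character: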

const strongPasswordRegex = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$/;

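/*
Added requirements:
(?=.*[@$!%*?&]) at least one special character must appear
The character class of the main match also includes the special characters
*/

console.log(strongPasswordRegex.test("Password123!"));  // true
console.log(strongPasswordRegex.test("Password123"));   // false (no special character)

Tiered password strength check: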

import re

def check_password_strength(password):
    """Return the password strength level: weak, medium, or strong."""
    checks = {
        'length': len(password) >= 8,
        'lower': bool(re.search(r'[a-z]', password)),
        'upper': bool(re.search(r'[A-Z]', password)),
        'digit': bool(re.search(r'\d', password)),
        'special': bool(re.search(r'[@$!%*?&]', password))
    }
    
    score = sum(checks.values())
    
    if score >= 5:
        return 'strong'
    elif score >= 3:
        return 'medium'
    else:
        return 'weak'

print(check_password_strength("Password123!"))  # strong
print(check_password_strength("password123"))   # medium
print(check_password_strength("pass"))          # weak

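Data Extraction

Another important use of regular expressions is pulling data in a specific format out of text.

Extracting Email Addresses

// Extract every email address from a text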
const text = "Contact us: support@example.com, sales@example.org or call 400-123-4567";
const emails = text.match(/[\w.-]+@[\w.-]+\.\w{2,}/g);

console.log(emails);  // ["support@example.com", "sales@example.org"]

Python version:

import re

text = "Contact us: support@example.com, sales@example.org"
emails = re.findall(r'[\w.-]+@[\w.-]+\.\w{2,}', text)
print(emails)  # ['support@example.com', 'sales@example.org']

Extracting URLs and Links

// Extract every URL from a text
const text = "Visit https://example.com or http://test.org/page for details";
const urls = text.match(/https?:\/\/[^\s]+/g);

console.log(urls);  // ["https://example.com", "http://test.org/page"]

Extracting Markdown links:

const markdown = "See the [official docs](https://example.com/docs) and the [tutorial](https://tutorial.com)";
const linkRegex = /\[([^\]]+)\]\(([^)]+)\)/g;

let match;
while ((match = linkRegex.exec(markdown)) !== null) {
    console.log(`text: ${match[1]}, link: ${match[2]}`);
}
// Output: text: official docs, link: https://example.com/docs
//         text: tutorial, link: https://tutorial.com

Extracting HTML Content

Note

Regular expressions are a poor fit for parsing complex HTML structures. For complex HTML parsing tasks, use a dedicated parser (such as BeautifulSoup, jsoup, or DOMParser).

Extracting the content of simple tags:

// Extract the content of div tags (simple cases only)
const html = "<div>Content 1</div><span>Other</span><div>Content 2</div>";
const contents = html.match(/<div>([^<]*)<\/div>/g);

console.log(contents);  // ["<div>Content 1</div>", "<div>Content 2</div>"]

// Extract only the content (without the tags)
const textContents = [...html.matchAll(/<div>([^<]*)<\/div>/g)].map(m => m[1]);
console.log(textContents);  // ["Content 1", "Content 2"]

Extracting the src attribute of images:

const html = '<img src="image1.jpg" alt="Image 1"><img src="image2.png">';
const srcs = [...html.matchAll(/<img[^>]+src=["']([^"']+)["'][^>]*>/gi)].map(m => m[1]);

console.log(srcs);  // ["image1.jpg", "image2.png"]

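Extracting Numbers and Amounts

// Extract every number from a text
const text = "Price: 100 yuan, quantity: 5, total: 500 yuan";
const numbers = text.match(/\d+/g);
console.log(numbers);  // ["100", "5", "500"]

// Extract numbers with an optional decimal part
// (the fraction is matched as a unit, so a trailing dot as in "5." is not captured)
const prices = "List price: 99.9 yuan, after discount: 89.5 yuan";
const decimals = prices.match(/\d+(?:\.\d+)?/g);
console.log(decimals);  // ["99.9", "89.5"]

// Extract currency amounts (several formats)
// Commas are only accepted between groups of three digits, so a comma that follows an amount is left out
const amounts = "Order total: ¥1,234.56, $99.99, €100";
const moneyRegex = /[$€¥]\s*\d+(?:,\d{3})*(?:\.\d+)?/g;
const matches = amounts.match(moneyRegex);
console.log(matches);  // ["¥1,234.56", "$99.99", "€100"]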

Extracting Dates and Times

// Extract dates in YYYY-MM-DD format
const text = "The project starts on 2024-01-15 and ends on 2024-12-31";
const dates = text.match(/\d{4}-\d{2}-\d{2}/g);
console.log(dates);  // ["2024-01-15", "2024-12-31"]

// Extract several date formats
const mixedDates = "Dates: 2024/01/15, 2024-01-16, 01/17/2024";
const datePatterns = mixedDates.match(/\d{4}[\/-]\d{2}[\/-]\d{2}|\d{2}[\/-]\d{2}[\/-]\d{4}/g);
console.log(datePatterns);  // ["2024/01/15", "2024-01-16", "01/17/2024"]

// Extract times in HH:MM:SS format
const times = "Work starts at 09:00:00 and ends at 18:30:00";
const timeList = times.match(/([01]?\d|2[0-3]):[0-5]\d(:[0-5]\d)?/g);
console.log(timeList);  // ["09:00:00", "18:30:00"]
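
The date patterns above only check the shape of the text, so a string such as 2024-02-30 is extracted as well. One way to filter such candidates, sketched here in Python, is to hand each match to `datetime.strptime`. The function name `extract_valid_dates` is illustrative.

```python
import re
from datetime import date, datetime

DATE_RE = re.compile(r'\b\d{4}-\d{2}-\d{2}\b')

def extract_valid_dates(text):
    """Return real calendar dates found in text, skipping impossible ones."""
    result = []
    for candidate in DATE_RE.findall(text):
        try:
            result.append(datetime.strptime(candidate, '%Y-%m-%d').date())
        except ValueError:
            continue  # right shape, but not a real date
    return result

found = extract_valid_dates("start 2024-01-15, bad 2024-02-30, end 2024-12-31")
print(found)  # [datetime.date(2024, 1, 15), datetime.date(2024, 12, 31)]
```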

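Extracting Chinese Characters

// Extract every run of Chinese characters
const text = "Hello 世界，Welcome to 中国！";
const chinese = text.match(/[\u4e00-\u9fa5]+/g);
console.log(chinese);  // ["世界", "中国"]

// Extract Chinese names (2-4 Han characters)
// The names must be separated by non-Han characters: the pattern cannot tell where a name ends,
// so a run such as "李四五和王五六七参加会议" would simply be cut into chunks of four characters
const names = "张三、李四五、王五六七";
const nameList = names.match(/[\u4e00-\u9fa5]{2,4}/g);
console.log(nameList);  // ["张三", "李四五", "王五六七"]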

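Data Replacement and Transformation

Regular expressions play an important role in converting data between formats.

Masking Sensitive Information

// Mask a phone number: 138****5678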
function maskPhone(phone) {
    return phone.replace(/(\d{3})\d{4}(\d{4})/, "$1****$2");
}

console.log(maskPhone("13812345678"));  // "138****5678"

// Mask an ID number: 110105********1234 (the last character may be X)
function maskIdCard(idCard) {
    return idCard.replace(/(\d{6})\d{8}(\d{3}[\dXx])/, "$1********$2");
}

console.log(maskIdCard("110105199001011234"));  // "110105********1234"

// Mask an email address: t***@example.com
function maskEmail(email) {
    return email.replace(/(.{1}).*(@.*)/, "$1***$2");
}

console.log(maskEmail("test@example.com"));  // "t***@example.com"

Python version:

import re

def mask_phone(phone):
    """Mask a phone number."""
    return re.sub(r'(\d{3})\d{4}(\d{4})', r'\1****\2', phone)

def mask_id_card(id_card):
    """Mask an ID number."""
    return re.sub(r'(\d{6})\d{8}(\d{4})', r'\1********\2', id_card)

def mask_bank_card(card):
    """Mask a bank card number."""
    return re.sub(r'(\d{4})\d+(\d{4})', r'\1 **** **** \2', card)

print(mask_phone("13812345678"))       # 138****5678
print(mask_id_card("110105199001011234"))  # 110105********1234
print(mask_bank_card("6222021234567890123"))  # 6222 **** **** 0123

Converting Date Formats

// YYYY-MM-DD -> DD/MM/YYYY
function convertDateFormat(date) {
    return date.replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");
}

console.log(convertDateFormat("2024-03-15"));  // "15/03/2024"

// YYYY-MM-DD -> Chinese date format
function toChineseDate(date) {
    return date.replace(/(\d{4})-(\d{2})-(\d{2})/, "$1年$2月$3日");
}

console.log(toChineseDate("2024-03-15"));  // "2024年03月15日"

Converting Naming Styles

Camel case to snake case:

function camelToSnake(str) {
    return str.replace(/([a-z])([A-Z])/g, "$1_$2").toLowerCase();
}

console.log(camelToSnake("userName"));        // "user_name"
console.log(camelToSnake("getUserById"));     // "get_user_by_id"
console.log(camelToSnake("XMLHttpRequest"));  // "xmlhttp_request" (runs of capitals are not split)

Snake case to camel case:

function snakeToCamel(str) {
    return str.replace(/_([a-z])/g, (_, char) => char.toUpperCase());
}

console.log(snakeToCamel("user_name"));    // "userName"
console.log(snakeToCamel("get_user_by_id"));  // "getUserById"

Python version:

import re

def camel_to_snake(name):
    """Convert camel case to snake case."""
    s1 = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

def snake_to_camel(name):
    """Convert snake case to camel case."""
    return re.sub(r'_([a-z])', lambda m: m.group(1).upper(), name)

print(camel_to_snake("getUserById"))  # get_user_by_id
print(snake_to_camel("user_name"))    # userName

Cleaning Up Text

// Strip HTML tags
const html = "<p>This is a piece of <b>HTML</b> text</p>";
const text = html.replace(/<[^>]+>/g, "");
console.log(text);  // "This is a piece of HTML text"

// Collapse extra whitespace
const messy = "This   text   has too  much whitespace";
const clean = messy.replace(/\s+/g, " ").trim();
console.log(clean);  // "This text has too much whitespace"

// Decode common HTML entities (&amp; goes last so that its output is not decoded a second time)
const encoded = "Hello&nbsp;World&lt;Test&gt;";
const decoded = encoded
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
console.log(decoded);  // "Hello World<Test>"

Adding Thousands Separators

function formatNumber(num) {
    return num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

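/*
Breakdown:
\B                   not a word boundary (so never at the start of the number)
(?=(\d{3})+(?!\d))   positive lookahead: the digits ahead form whole groups of three;
                     the nested negative lookahead (?!\d) requires those groups to run to the last digit
*/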

console.log(formatNumber(1234567890));  // "1,234,567,890"
console.log(formatNumber(12345));       // "12,345"
console.log(formatNumber(123));         // "123"

A version that supports decimals:

function formatPrice(price) {
    const [integer, decimal] = price.toString().split('.');
    const formatted = integer.replace(/\B(?=(\d{3})+(?!\d))/g, ",");
    return decimal ? `${formatted}.${decimal}` : formatted;
}

console.log(formatPrice(1234567.89));  // "1,234,567.89"
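
The same lookahead pattern carries over to Python's `re.sub`. The sketch below (with an illustrative function name) also compares the result with the built-in `format(value, ',')`, which is usually the simpler choice in Python for plain integers.

```python
import re

def format_number(value):
    """Insert thousands separators into an integer using the lookahead pattern."""
    return re.sub(r'\B(?=(\d{3})+(?!\d))', ',', str(value))

print(format_number(1234567890))  # 1,234,567,890
print(format_number(12345))       # 12,345
print(format_number(123))         # 123

# The built-in format specifier gives the same result for integers.
print(format(1234567890, ','))    # 1,234,567,890
```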

Log Analysis

Regular expressions are widely used for parsing logs.

Parsing Apache/Nginx Access Logs

A line in the standard Apache/Nginx access log format:

192.168.1.1 - - [10/Oct/2023:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 1234

Parsing in JavaScript:

const logRegex = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)$/;

function parseAccessLog(log) {
    const match = log.match(logRegex);
    if (!match) return null;

    return {
        ip: match[1],
        ident: match[2],
        user: match[3],
        time: match[4],
        method: match[5],
        path: match[6],
        protocol: match[7],
        status: parseInt(match[8], 10),
        // Apache writes "-" instead of a number when no response body was sent
        size: match[9] === '-' ? 0 : parseInt(match[9], 10)
    };
}

const log = '192.168.1.1 - - [10/Oct/2023:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 1234';
console.log(parseAccessLog(log));
// { ip: '192.168.1.1', method: 'GET', path: '/index.html', status: 200, ... }

Parsing in Python:

import re
from collections import Counter

log_pattern = re.compile(
    r'(?P<ip>\S+)\s+'           # IP address
    r'(?P<ident>\S+)\s+'        # identity
    r'(?P<user>\S+)\s+'         # user name
    r'\[(?P<time>[^\]]+)\]\s+'  # timestamp
    r'"(?P<request>[^"]+)"\s+'  # request line
    r'(?P<status>\d+)\s+'       # status code
    r'(?P<size>\d+|-)'          # response size ("-" when no body was sent)
)

def parse_log_file(filepath):
    """Parse a log file, yielding one dict per matching line."""
    with open(filepath, 'r') as f:
        for line in f:
            match = log_pattern.match(line.strip())
            if match:
                yield match.groupdict()

def analyze_ips(logs):
    """Return the 10 IP addresses with the most requests."""
    ip_counter = Counter(log['ip'] for log in logs)
    return ip_counter.most_common(10)

def analyze_status_codes(logs):
    """Return the distribution of status codes."""
    status_counter = Counter(log['status'] for log in logs)
    return dict(status_counter)
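
A small usage sketch may help here. It runs the same kind of pattern over a few in-memory lines instead of a file; the sample lines and the names `LOG_RE`, `SAMPLE_LINES`, and `parse_lines` are made up for illustration. One detail worth noting: a generator such as the one returned by `parse_log_file` can only be consumed once, so the sketch materializes the records into a list before running two separate analyses.

```python
import re
from collections import Counter

LOG_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical sample data, including one line that is not an access log entry.
SAMPLE_LINES = [
    '192.168.1.1 - - [10/Oct/2023:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 1234',
    '192.168.1.2 - - [10/Oct/2023:13:55:37 +0800] "GET /missing HTTP/1.1" 404 -',
    '192.168.1.1 - - [10/Oct/2023:13:55:38 +0800] "POST /api/login HTTP/1.1" 200 87',
    'not an access log line',
]

def parse_lines(lines):
    """Yield a dict of named groups for every line that matches."""
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            yield m.groupdict()

# Materialize once so that both counters see every record.
records = list(parse_lines(SAMPLE_LINES))
top_ips = Counter(r['ip'] for r in records).most_common(1)
status_counts = Counter(r['status'] for r in records)

print(top_ips)              # [('192.168.1.1', 2)]
print(dict(status_counts))  # {'200': 2, '404': 1}
```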

Parsing Application Logs

Assume the application log format is: [2024-03-15 10:30:45] [ERROR] [main] Something went wrong

import re
from datetime import datetime

app_log_pattern = re.compile(
    r'\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+'
    r'\[(?P<level>\w+)\]\s+'
    r'\[(?P<logger>\w+)\]\s+'
    r'(?P<message>.+)'
)

def parse_app_log(line):
    """Parse one application log line."""
    match = app_log_pattern.match(line)
    if match:
        data = match.groupdict()
        data['timestamp'] = datetime.strptime(data['timestamp'], '%Y-%m-%d %H:%M:%S')
        return data
    return None

# Collect ERROR entries, optionally limited to a time range
def count_errors(log_file, start_time=None, end_time=None):
    errors = []
    with open(log_file, 'r') as f:
        for line in f:
            log = parse_app_log(line)
            if log and log['level'] == 'ERROR':
                if start_time and log['timestamp'] < start_time:
                    continue
                if end_time and log['timestamp'] > end_time:
                    continue
                errors.append(log)
    return errors

Extracting Exception Stack Traces from Logs

import re

# Match Java exception stack traces
java_exception_pattern = re.compile(
    r'(?P<exception>\w+(?:Exception|Error)):\s*(?P<message>[^\n]+)\n'
    r'(?P<stacktrace>(?:\s+at .+\n)+)'
)

def extract_exceptions(log_content):
    """Extract every exception found in the log content."""
    exceptions = []
    for match in java_exception_pattern.finditer(log_content):
        exceptions.append({
            'type': match.group('exception'),
            'message': match.group('message'),
            'stacktrace': match.group('stacktrace')
        })
    return exceptions

# Example
log = """
2024-03-15 10:30:45 ERROR - NullPointerException: Cannot invoke method on null object
    at com.example.Service.process(Service.java:100)
    at com.example.Controller.handle(Controller.java:50)
"""

exceptions = extract_exceptions(log)

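Network Data Processing

Parsing URL Parameters

// Parse URL query parameters
function parseQueryString(url) {
    const params = {};
    // Keys stop at = & or #, values stop at & or #, so a fragment is not swallowed into the last value
    const regex = /[?&]([^=&#]+)=([^&#]*)/g;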
    let match;
    
    while ((match = regex.exec(url)) !== null) {
        const key = decodeURIComponent(match[1]);
        const value = decodeURIComponent(match[2]);
        params[key] = value;
    }
    
    return params;
}

const url = "https://example.com/search?q=正则表达式&page=1&size=10";
console.log(parseQueryString(url));
// { q: '正则表达式', page: '1', size: '10' }
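
For comparison, Python's standard library already provides query-string parsing in `urllib.parse`, which also handles repeated keys, `+` as a space, and URL fragments. The sample URL below is made up for illustration.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit

url = "https://example.com/search?q=regex+tips&page=1&tag=a&tag=b#results"

query = urlsplit(url).query   # the part between '?' and '#'
params = parse_qs(query)      # dict of lists, repeated keys are kept
pairs = parse_qsl(query)      # ordered list of (key, value) pairs

print(params)  # {'q': ['regex tips'], 'page': ['1'], 'tag': ['a', 'b']}
print(pairs)   # [('q', 'regex tips'), ('page', '1'), ('tag', 'a'), ('tag', 'b')]

# Going the other way: build a query string with percent-encoding.
encoded = urlencode({'q': '正则', 'page': 1})
print(encoded)  # q=%E6%AD%A3%E5%88%99&page=1
```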

Building a query string:

function buildQueryString(params) {
    return Object.entries(params)
        .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
        .join('&');
}

console.log(buildQueryString({ q: '正则', page: 1 }));
// "q=%E6%AD%A3%E5%88%99&page=1"

Parsing Cookies

// Parse a cookie string
function parseCookies(cookieString) {
    const cookies = {};
    const regex = /([^=;\s]+)=([^;]*)/g;
    let match;
    
    while ((match = regex.exec(cookieString)) !== null) {
        cookies[match[1].trim()] = match[2].trim();
    }
    
    return cookies;
}

const cookieStr = "sessionId=abc123; userId=user001; token=xyz789";
console.log(parseCookies(cookieStr));
// { sessionId: 'abc123', userId: 'user001', token: 'xyz789' }
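
In Python, the standard library's `http.cookies.SimpleCookie` can take over this parsing. The sketch below wraps it in a helper with an illustrative name and returns a plain dict of cookie values.

```python
from http.cookies import SimpleCookie

def parse_cookies(cookie_string):
    """Parse a Cookie header value into a plain {name: value} dict."""
    jar = SimpleCookie()
    jar.load(cookie_string)
    return {name: morsel.value for name, morsel in jar.items()}

print(parse_cookies("sessionId=abc123; userId=user001; token=xyz789"))
# {'sessionId': 'abc123', 'userId': 'user001', 'token': 'xyz789'}
```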

Parsing HTTP Request Headers

import re

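def parse_headers(header_text):
    """Parse HTTP request headers."""
    headers = {}
    # Line breaks are excluded from the name; otherwise a match starting on a blank line
    # would capture the newline as part of the next header's name
    pattern = re.compile(r'^(?P<name>[^:\r\n]+):[ \t]*(?P<value>[^\r\n]*)', re.MULTILINE)

    for match in pattern.finditer(header_text):
        headers[match.group('name')] = match.group('value')

    return headers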

headers_text = """
Content-Type: application/json
Authorization: Bearer token123
User-Agent: Mozilla/5.0
"""

print(parse_headers(headers_text))
# {'Content-Type': 'application/json', 'Authorization': 'Bearer token123', 'User-Agent': 'Mozilla/5.0'}

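Code Processing

Parsing Code Comments

Removing JavaScript single-line comments:

function removeSingleLineComments(code) {
    // Note: this simplified version also strips // inside string literals (for example, in URLs)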
    return code.replace(/\/\/.*$/gm, '');
}

Removing multi-line comments:

function removeMultiLineComments(code) {
    return code.replace(/\/\*[\s\S]*?\*\//g, '');
}

A safer way to remove comments (string-aware):

import re

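def remove_js_comments(code):
    """Remove JavaScript comments while leaving string contents intact."""
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):  # comment
            return ''
        return s  # string literal, kept as is

    # Template literals and regex literals are not handled by this pattern
    pattern = re.compile(
        r'//[^\n]*|'           # single-line comment
        r'/\*[\s\S]*?\*/|'     # multi-line comment
        r'"(?:\\.|[^"\\])*"|'  # double-quoted string
        r"'(?:\\.|[^'\\])*'"   # single-quoted string
    )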
    
    return pattern.sub(replacer, code)

Extracting Function Definitions

Extracting Python function definitions:

import re

def extract_python_functions(code):
    """Extract top-level function definitions from Python source code."""
    pattern = re.compile(
        r'^def\s+(?P<name>\w+)\s*\((?P<params>[^)]*)\)(?:\s*->\s*(?P<return>[^:]+))?:',
        re.MULTILINE
    )
    
    functions = []
    for match in pattern.finditer(code):
        functions.append({
            'name': match.group('name'),
            'params': match.group('params'),
            'return_type': match.group('return')
        })
    
    return functions

code = '''
def hello(name):
    pass

def add(a: int, b: int) -> int:
    return a + b
'''

print(extract_python_functions(code))
# [{'name': 'hello', 'params': 'name', 'return_type': None},
#  {'name': 'add', 'params': 'a: int, b: int', 'return_type': 'int'}]

Parsing import Statements

import re

def parse_imports(code):
    """Parse Python import statements."""
    imports = []

    # Matches: import xxx
    simple_pattern = re.compile(r'^import\s+(?P<modules>[\w.]+(?:\s*,\s*[\w.]+)*)', re.MULTILINE)

    # Matches: from xxx import yyy
    from_pattern = re.compile(
        r'^from\s+(?P<module>[\w.]+)\s+import\s+(?P<names>[\w.*]+(?:\s*,\s*[\w.*]+)*)',
        re.MULTILINE
    )
    
    for match in simple_pattern.finditer(code):
        modules = [m.strip() for m in match.group('modules').split(',')]
        for module in modules:
            imports.append({'type': 'import', 'module': module})
    
    for match in from_pattern.finditer(code):
        names = [n.strip() for n in match.group('names').split(',')]
        imports.append({
            'type': 'from',
            'module': match.group('module'),
            'names': names
        })
    
    return imports

Search and Highlighting

Keyword Highlighting

function highlightKeywords(text, keywords) {
    // Escape special characters in the keywords
    const escaped = keywords.map(k => k.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
    const pattern = new RegExp(`(${escaped.join('|')})`, 'gi');
    
    return text.replace(pattern, '<mark>$1</mark>');
}

const text = "JavaScript and Python are both popular programming languages";
console.log(highlightKeywords(text, ['JavaScript', 'Python']));
// "<mark>JavaScript</mark> and <mark>Python</mark> are both popular programming languages"

Search Result Highlighting (Preserving Case)

import re

def highlight_search(text, query, tag='mark'):
    """Highlight search matches while preserving the original casing."""
    def replace_match(match):
        return f'<{tag}>{match.group(0)}</{tag}>'
    
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return pattern.sub(replace_match, text)

text = "JavaScript is great. I love javascript!"
print(highlight_search(text, "javascript"))
# "<mark>JavaScript</mark> is great. I love <mark>javascript</mark>!"

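Code Syntax Highlighting (Simplified)

// Simple syntax highlighting for JavaScript-like code (the language argument is currently unused)
function simpleSyntaxHighlight(code, language) {
    // One combined pattern applied in a single pass. Chained replace() calls would re-scan
    // the inserted markup and match things like class="comment" as a string or keyword.
    const tokenRegex = new RegExp([
        // Comments
        /(?<comment>\/\/.*$|\/\*[\s\S]*?\*\/)/.source,
        // Strings
        /(?<string>(?<quote>["'`])(?:(?!\k<quote>)[^\\]|\\.)*\k<quote>)/.source,
        // Keywords
        /(?<keyword>\b(?:const|let|var|function|return|if|else|for|while|class|import|export)\b)/.source,
        // Numbers
        /(?<number>\b\d+(?:\.\d+)?\b)/.source,
        // Function names (an identifier followed by an opening parenthesis)
        /(?<function>\b[a-zA-Z_]\w*(?=\s*\())/.source
    ].join('|'), 'gm');

    // Alternatives are tried left to right at each position, so comments and strings
    // take priority over keywords, numbers, and function names inside them.
    // The code is assumed to be HTML-escaped already.
    return code.replace(tokenRegex, (match, ...args) => {
        const groups = args[args.length - 1];  // named groups arrive as the last argument
        const type = Object.keys(groups).find(k => k !== 'quote' && groups[k] !== undefined);
        return `<span class="${type}">${match}</span>`;
    });
}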

Batch Renaming

Regular expressions are handy for processing file names in bulk.

import re
import os

def batch_rename(directory, pattern, replacement, dry_run=True):
    """
    Rename files in bulk.

    Args:
        directory: path of the directory
        pattern: regular expression pattern
        replacement: replacement string
        dry_run: if True, only print a preview without renaming anything
    """
    regex = re.compile(pattern)
    renamed = []

    for filename in os.listdir(directory):
        new_name = regex.sub(replacement, filename)
        if new_name != filename:
            renamed.append((filename, new_name))

    if dry_run:
        print("Rename preview:")
        for old, new in renamed:
            print(f"  {old} -> {new}")
    else:
        for old, new in renamed:
            os.rename(
                os.path.join(directory, old),
                os.path.join(directory, new)
            )
        print(f"Renamed {len(renamed)} file(s)")

# Example: rename photo_001.jpg to image_001.jpg
# batch_rename('/path/to/files', r'^photo_(\d+)', r'image_\1')

Complete Example: A Form Validator

/**
 * Form validation utility class
 */
class FormValidator {
    constructor() {
        this.rules = {
            email: {
                pattern: /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/,
                message: 'Please enter a valid email address'
            },
            phone: {
                pattern: /^1[3-9]\d{9}$/,
                message: 'Please enter a valid mobile phone number'
            },
            idCard: {
                pattern: /^[1-9]\d{5}(18|19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\d{3}[\dXx]$/,
                message: 'Please enter a valid ID card number'
            },
            url: {
                pattern: /^https?:\/\/[^\s/$.?#].[^\s]*$/i,
                message: 'Please enter a valid URL'
            },
            password: {
                pattern: /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$/,
                message: 'Password must be at least 8 characters and contain uppercase letters, lowercase letters, and digits'
            }
        };
    }

    /**
     * Validate a single field
     */
    validate(value, ruleName) {
        const rule = this.rules[ruleName];
        if (!rule) {
            throw new Error(`Unknown validation rule: ${ruleName}`);
        }

        // Require a string: test() would otherwise coerce a missing field to "undefined",
        // which some patterns (such as the username rule below) would accept
        const isValid = typeof value === 'string' && rule.pattern.test(value);
        return {
            valid: isValid,
            message: isValid ? '' : rule.message
        };
    }

    /**
     * Validate several fields at once
     */
    validateAll(data, rules) {
        const errors = {};
        let isValid = true;

        for (const [field, ruleName] of Object.entries(rules)) {
            const result = this.validate(data[field], ruleName);
            if (!result.valid) {
                errors[field] = result.message;
                isValid = false;
            }
        }

        return { valid: isValid, errors };
    }

    /**
     * Add a custom rule
     */
    addRule(name, pattern, message) {
        this.rules[name] = { pattern, message };
    }
}

// Usage example
const validator = new FormValidator();

// Add a custom rule
validator.addRule('username', /^[a-zA-Z][a-zA-Z0-9_]{3,15}$/, 'Username must start with a letter and be 4-16 letters, digits, or underscores');

// Validate a single field
console.log(validator.validate('test@example.com', 'email'));
// { valid: true, message: '' }

// Validate several fields
const formData = {
    email: 'invalid-email',
    phone: '12345',
    username: 'validuser'
};

const rules = {
    email: 'email',
    phone: 'phone',
    username: 'username'
};

console.log(validator.validateAll(formData, rules));
// { valid: false, errors: { email: 'Please enter a valid email address', phone: 'Please enter a valid mobile phone number' } }

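Summary

This chapter has shown, through a range of practical cases, how widely regular expressions apply:

Data validation: email addresses, phone numbers, ID numbers, URLs, password strength, and more
Data extraction: pulling email addresses, URLs, numbers, dates, and so on out of text
Data transformation: masking sensitive information, format conversion, naming-style conversion
Log analysis: parsing access logs, application logs, and exception stack traces
Network data processing: parsing URL parameters, cookies, and HTTP headers
Code processing: removing comments, extracting functions, parsing imports
Search highlighting: keyword highlighting and syntax highlighting

Regular expressions are the Swiss Army knife of text processing, and mastering them can greatly improve development efficiency. A few cautions apply:

Do not overuse them: if a simple string method solves the problem, skip the regex
Watch performance: avoid dangerous patterns and consider the size of the input
Security first: guard against ReDoS attacks and validate user input
Keep them readable: comment complex patterns and use named groups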
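
To make the performance and ReDoS cautions concrete, the sketch below times a pattern with nested quantifiers against an equivalent flat pattern on an input that almost matches. It assumes a backtracking engine such as Python's `re`; the helper name `time_match` and the payload are illustrative, and the absolute timings depend on the machine.

```python
import re
import time

def time_match(pattern, text):
    """Return the seconds spent on a single re.match call."""
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start

# A run of 'a' followed by one character that makes the overall match fail.
payload = 'a' * 18 + 'b'

slow = time_match(r'(a+)+$', payload)  # nested quantifier: many ways to split the run of 'a'
fast = time_match(r'a+$', payload)     # same language, no nesting

print(f"nested: {slow:.4f}s, flat: {fast:.6f}s")
```

Each additional `a` in the payload roughly doubles the work for the nested pattern, which is why attacker-controlled input combined with such a pattern is a denial-of-service risk.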
