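Using Regular Expressions Across Languages

Regular expressions are a cross-language tool, but each programming language exposes a different API and supports a different feature set. This chapter shows how to use regular expressions in mainstream programming languages.

Overview of Regular Expressions Across Languages

Although the regex APIs differ from language to language, the core concepts carry over. Knowing what is shared makes it easier to switch between languages:

Shared core concepts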

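Concept	Description	Same in all languages
Character classes	`[abc]`, `\d`, `\w`, etc.	✓
Quantifiers	`*`, `+`, `?`, `{n,m}`	✓
Anchors	`^`, `$`, `\b`	✓
Groups	`(...)`, `(?:...)`	✓
Alternation	`|`	✓
Escaping	`\` escapes special characters	✓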
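
As a rough illustration of the shared concepts in the table, here is a minimal Python sketch. The sample sentence and variable names are made up for this example; the same patterns should behave the same way in the other languages covered in this chapter.

```python
import re

text = "Order #42 shipped to cat-lovers@example.com on 2024-03-15."

# Character class + quantifier: runs of one or more digits
digits = re.findall(r"\d+", text)

# Anchor: \b restricts the match to a whole word
whole_word = re.search(r"\bcat\b", text)

# Group + alternation: capture whichever word appears
choice = re.search(r"(shipped|delivered)", text)

# Escaping: a literal dot needs a backslash
literal_dot = re.search(r"example\.com", text)

print(digits)            # ['42', '2024', '03', '15']
print(choice.group(1))   # shipped
```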

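Main differences

Difference	Description
Construction	JavaScript has the literal form `/pattern/`; other languages usually build patterns from strings
API style	Method names and argument order differ
Flags	JavaScript uses suffix flags such as `gi`; Python uses constants such as `re.I`
Feature support	Support for advanced features such as named groups and lookaround varies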
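
To make the "Flags" row concrete, the sketch below shows the two ways Python accepts a flag: as a constant passed to `re.compile`, or inline in the pattern. The variable names are illustrative only.

```python
import re

# Flag passed as a constant (the usual Python style)
by_constant = re.compile(r"hello", re.IGNORECASE)

# The same flag written inline, inside the pattern itself
by_inline = re.compile(r"(?i)hello")

# Several flags are combined with bitwise OR
combined = re.compile(r"^hello$", re.IGNORECASE | re.MULTILINE)

print(bool(by_constant.search("HELLO")))             # True
print(bool(by_inline.search("HELLO")))               # True
print(bool(combined.search("first\nHeLLo\nlast")))   # True
```

The inline form is handy when a pattern is stored in configuration and the calling code cannot pass flags separately.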

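Learning suggestions

Master the core syntax first: character classes, quantifiers, and groups work the same way everywhere
Learn the target language's API: get familiar with compiling, matching, replacing, and splitting
Watch for feature differences: before using an advanced feature, confirm that the target language supports it

The sections below cover each mainstream language in turn.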

JavaScript

JavaScript exposes regular expressions through the RegExp object and through string methods.

Creating a regular expression

// 字面量语法（推荐）
const regex1 = /abc/gi;

// 构造函数语法
const regex2 = new RegExp("abc", "gi");

// 动态构建
const keyword = "hello";
const regex3 = new RegExp(keyword, "i");
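
One caveat when building a pattern from a variable: any regex metacharacters in the input are interpreted as syntax. The Python sketch below shows the usual remedy, escaping the input first with `re.escape`. `build_keyword_pattern` is a hypothetical helper written for this example, and the same idea applies to `new RegExp(keyword)` in JavaScript.

```python
import re

def build_keyword_pattern(keyword: str) -> re.Pattern:
    """Hypothetical helper: case-insensitive search for a literal keyword."""
    return re.compile(re.escape(keyword), re.IGNORECASE)

keyword = "$5.00"
text = "Total: $5.00"

# Unescaped, `$` is an end-of-string anchor, so nothing can match
unescaped = re.compile(keyword).search(text)

# Escaped, the keyword is matched character for character
escaped = build_keyword_pattern(keyword).search(text)

print(unescaped)          # None
print(escaped.group())    # $5.00
```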

Common methods

const str = "The quick brown fox jumps over the lazy dog";
const regex = /\b\w{5}\b/g;  // 匹配 5 个字母的单词

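// test() - check whether the pattern matches
// With the g flag, test() and exec() advance regex.lastIndex
regex.test(str);  // true
regex.lastIndex = 0;  // reset so the exec() loop below starts from the beginning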

// exec() - 执行匹配，返回详细信息
let match;
while ((match = regex.exec(str)) !== null) {
    console.log(`找到: ${match[0]}, 位置: ${match.index}`);
}
// 输出: 找到: quick, 位置: 4
//       找到: brown, 位置: 10
//       找到: jumps, 位置: 20

// match() - 返回所有匹配
str.match(/\b\w{5}\b/g);  // ["quick", "brown", "jumps"]

// matchAll() - 返回迭代器（ES2020）
for (const match of str.matchAll(/\b\w{5}\b/g)) {
    console.log(match[0]);
}

// search() - 返回第一个匹配的位置
str.search(/fox/);  // 16

// replace() - 替换
str.replace(/fox/, "cat");  // "The quick brown cat jumps..."
str.replace(/\b\w{5}\b/g, "*****");  // 替换所有 5 字母单词

// replace() with a callback function
str.replace(/\b\w{5}\b/g, (word) => word.toUpperCase());
// "The QUICK BROWN fox JUMPS over the lazy dog"

// split() - 分割
"a,b,c".split(/,/);  // ["a", "b", "c"]
"a, b, c".split(/,\s*/);  // ["a", "b", "c"]

Named groups

const date = "2024-03-15";
const regex = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = date.match(regex);

console.log(match.groups.year);   // "2024"
console.log(match.groups.month);  // "03"
console.log(match.groups.day);    // "15"

// 在替换中使用命名分组
"2024-03-15".replace(regex, "$<day>/$<month>/$<year>");
// "15/03/2024"

Practical examples

// 1. 表单验证
function validateForm(data) {
    const rules = {
        email: /^[\w.-]+@[\w.-]+\.\w{2,}$/,
        phone: /^1[3-9]\d{9}$/,
        password: /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$/
    };
    
    const errors = {};
    for (const [field, regex] of Object.entries(rules)) {
        if (!regex.test(data[field])) {
            errors[field] = `${field} 格式不正确`;
        }
    }
    return errors;
}

// 2. 解析 URL 参数
function parseUrlParams(url) {
    const params = {};
    const regex = /[?&](\w+)=([^&]*)/g;
    let match;
    
    while ((match = regex.exec(url)) !== null) {
        params[decodeURIComponent(match[1])] = decodeURIComponent(match[2]);
    }
    
    return params;
}

parseUrlParams("https://example.com?name=John&age=30");
// { name: "John", age: "30" }

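// 3. Highlight keywords
function highlightKeywords(text, keywords) {
    // Escape regex metacharacters so each keyword is matched literally
    const escaped = keywords.map((k) => k.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
    const pattern = new RegExp(`(${escaped.join("|")})`, "gi");
    return text.replace(pattern, "<mark>$1</mark>");
}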

highlightKeywords("JavaScript is great", ["javascript", "great"]);
// "<mark>JavaScript</mark> is <mark>great</mark>"

Python

The re module in Python's standard library provides regular expression support.

Basic usage

import re

# 编译模式（推荐）
pattern = re.compile(r'\d{3}-\d{4}-\d{4}')

# 或直接使用模块函数
re.search(r'\d+', 'abc123def')

Common functions

text = "The quick brown fox jumps over the lazy dog"
pattern = re.compile(r'\b\w{5}\b')

# match() - 从开始位置匹配
pattern.match(text)  # None（不是以 5 字母单词开始）

# search() - 搜索第一个匹配
match = pattern.search(text)
print(match.group())   # "quick"
print(match.start())   # 4
print(match.end())     # 9
print(match.span())    # (4, 9)

# findall() - 返回所有匹配的列表
pattern.findall(text)  # ['quick', 'brown', 'jumps']

# finditer() - 返回迭代器
for match in pattern.finditer(text):
    print(f"找到: {match.group()}, 位置: {match.span()}")

# fullmatch() - 完整字符串匹配
re.fullmatch(r'\d{4}', '2024')  # 匹配
re.fullmatch(r'\d{4}', '2024-03')  # 不匹配

String anchors \A and \Z in Python

import re

# \A 和 \Z - 不受多行模式影响
text = """first line
second line
third line"""

# ^ 和 $ 在多行模式下匹配每行的开头和结尾
re.findall(r'^\w+', text, re.MULTILINE)
# ['first', 'second', 'third']

# \A 和 \Z 始终匹配整个字符串的开头和结尾
re.search(r'\A\w+', text)  # 只匹配 'first'
re.search(r'\w+\Z', text)  # 只匹配 'line'（最后一行的结尾）

# 对比 $ 和 \Z
re.search(r'line$', 'line\n')   # 匹配成功（$ 可以匹配换行符之前）
re.search(r'line\Z', 'line\n')  # 匹配失败（\Z 只匹配真正的结尾）

# Python 3.14+ adds \z as a synonym for \Z (absolute end of string)
# Like \Z, \z does not match before a trailing newline
# re.search(r'line\z', 'line\n')  # no match

Groups and named groups

# 索引分组
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
match = date_pattern.match('2024-03-15')

print(match.group(0))  # "2024-03-15" - 完整匹配
print(match.group(1))  # "2024" - 第一个分组
print(match.group(2))  # "03"
print(match.group(3))  # "15"
print(match.groups())  # ('2024', '03', '15')

# 命名分组
date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = date_pattern.match('2024-03-15')

print(match.group('year'))   # "2024"
print(match.group('month'))  # "03"
print(match.group('day'))    # "15"
print(match.groupdict())     # {'year': '2024', 'month': '03', 'day': '15'}

Replacing and splitting

# sub() - 替换
text = "The quick brown fox"
result = re.sub(r'\b\w{5}\b', '*****', text)
print(result)  # "The ***** ***** fox"

# 使用函数进行替换
def to_uppercase(match):
    return match.group().upper()

result = re.sub(r'\b\w{5}\b', to_uppercase, text)
print(result)  # "The QUICK BROWN fox"

# subn() - 替换并返回替换次数
result, count = re.subn(r'\b\w{5}\b', '*****', text)
print(result)  # "The ***** ***** fox"
print(count)   # 2

# split() - 分割
re.split(r',\s*', 'a, b, c')     # ['a', 'b', 'c']
re.split(r'\W+', 'Words, words, words.')  # ['Words', 'words', 'words', '']

Flags

# 忽略大小写
re.search(r'hello', 'HELLO', re.IGNORECASE)  # 匹配

# 多行模式
text = "line1\nline2"
re.search(r'^line2$', text, re.MULTILINE)  # 匹配

# dotAll 模式（. 匹配换行）
re.search(r'line1.line2', "line1\nline2", re.DOTALL)  # 匹配

# 详细模式（忽略空白和注释）
pattern = re.compile(r'''
    \d{4}      # 年
    -          # 分隔符
    \d{2}      # 月
    -          # 分隔符
    \d{2}      # 日
''', re.VERBOSE)

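New in Python 3.11+

Python 3.11 introduced possessive quantifiers and atomic groups, which prevent backtracking into the part they have already matched and can improve performance: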

import re

# 占有量词（在量词后加 +）
# 普通贪婪匹配
re.match(r'a+a', 'aaaaa')   # 匹配成功（回溯后匹配）

# 占有量词（Python 3.11+）
re.match(r'a++a', 'aaaaa')  # None（不回溯）

# 原子组（Python 3.11+）
re.match(r'(?>a+)a', 'aaaaa')  # None（原子组阻止回溯）

# 实际应用：优化可能产生回溯灾难的模式
# 危险模式
# dangerous = re.compile(r'(a+)+b')  # 可能很慢

# 安全模式（使用占有量词）
# safe = re.compile(r'(a++)b')  # 快速失败

Conditional matching

Python supports the conditional syntax (?(id/name)yes|no):

import re

# Choose a sub-pattern depending on whether a group matched
# If there is an opening <, a closing > is required
pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|)')

pattern.match('user@example.com')      # matches
pattern.match('<user@example.com>')    # matches
pattern.match('<user@example.com')     # None (angle brackets are unbalanced)

Practical examples

import re

# 1. 日志解析
log_line = '192.168.1.1 - - [10/Oct/2023:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 1234'

log_pattern = re.compile(r'''
    (?P<ip>\S+)\s+                    # IP 地址
    -\s+-\s+                          # 身份标识（忽略）
    \[(?P<time>[^\]]+)\]\s+           # 时间
    "(?P<method>\S+)\s+               # 请求方法
    (?P<path>\S+)\s+                  # 请求路径
    (?P<protocol>[^"]+)"\s+           # 协议
    (?P<status>\d+)\s+                # 状态码
    (?P<size>\d+)                     # 响应大小
''', re.VERBOSE)

match = log_pattern.match(log_line)
if match:
    print(match.groupdict())

# 2. 数据提取
def extract_emails(text):
    pattern = re.compile(r'[\w.-]+@[\w.-]+\.\w{2,}')
    return pattern.findall(text)

# 3. 文本清理
def clean_text(text):
    # 去除 HTML 标签
    text = re.sub(r'<[^>]+>', '', text)
    # 合并多个空白
    text = re.sub(r'\s+', ' ', text)
    # 去除首尾空白
    return text.strip()

# 4. 批量重命名
def batch_rename(files, pattern, replacement):
    """
    批量重命名文件
    例如: 将 "photo_001.jpg" 重命名为 "image_001.jpg"
    """
    renamed = []
    regex = re.compile(pattern)
    for filename in files:
        new_name = regex.sub(replacement, filename)
        renamed.append((filename, new_name))
    return renamed

# 使用示例
files = ['photo_001.jpg', 'photo_002.jpg', 'photo_003.jpg']
result = batch_rename(files, r'^photo', 'image')
# [('photo_001.jpg', 'image_001.jpg'), ...]

Java

Java provides regular expressions through the java.util.regex package.

Basic usage

import java.util.regex.*;

public class RegexExample {
    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        
        // 创建 Pattern
        Pattern pattern = Pattern.compile("\\b\\w{5}\\b");
        
        // 创建 Matcher
        Matcher matcher = pattern.matcher(text);
        
        // 查找所有匹配
        while (matcher.find()) {
            System.out.println("找到: " + matcher.group() + 
                             " 位置: " + matcher.start() + "-" + matcher.end());
        }
    }
}

Common methods

String text = "The quick brown fox jumps over the lazy dog";
Pattern pattern = Pattern.compile("\\b\\w{5}\\b");
Matcher matcher = pattern.matcher(text);

// matches() - 完整匹配
boolean isMatch = Pattern.matches("\\d{4}", "2024");  // true

// find() - 查找下一个匹配
while (matcher.find()) {
    System.out.println(matcher.group());
}

// lookingAt() - 从开始匹配
Pattern.compile("The").matcher(text).lookingAt();  // true

// 获取分组
Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
Matcher dateMatcher = datePattern.matcher("2024-03-15");

if (dateMatcher.find()) {
    System.out.println(dateMatcher.group(0));  // "2024-03-15"
    System.out.println(dateMatcher.group(1));  // "2024"
    System.out.println(dateMatcher.group(2));  // "03"
    System.out.println(dateMatcher.group(3));  // "15"
}

Replacing

String text = "The quick brown fox";

// 替换所有
String result = text.replaceAll("\\b\\w{5}\\b", "*****");
System.out.println(result);  // "The ***** ***** fox"

// 只替换第一个
String result2 = text.replaceFirst("\\b\\w{5}\\b", "*****");
System.out.println(result2);  // "The ***** brown fox"

// 使用 Matcher 进行替换
Pattern pattern = Pattern.compile("\\b(\\w{5})\\b");
Matcher matcher = pattern.matcher(text);
StringBuffer sb = new StringBuffer();

while (matcher.find()) {
    matcher.appendReplacement(sb, matcher.group(1).toUpperCase());
}
matcher.appendTail(sb);
System.out.println(sb.toString());  // "The QUICK BROWN fox"

Practical examples

import java.util.regex.*;
import java.util.*;

public class RegexUtils {
    
    // 1. 验证邮箱
    public static boolean isValidEmail(String email) {
        String regex = "^[\\w.-]+@[\\w.-]+\\.\\w{2,}$";
        return Pattern.matches(regex, email);
    }
    
    // 2. 验证手机号
    public static boolean isValidPhone(String phone) {
        String regex = "^1[3-9]\\d{9}$";
        return Pattern.matches(regex, phone);
    }
    
    // 3. 提取所有链接
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        String regex = "https?://[^\\s]+";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
    
    // 4. 解析 CSV 行
    public static List<String> parseCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        String regex = "\"([^\"]*)\"|([^,]+)";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(line);
        
        while (matcher.find()) {
            String field = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
            fields.add(field.trim());
        }
        return fields;
    }
    
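    // 5. Mask sensitive information
    public static String maskSensitiveInfo(String text) {
        // Mask 18-character ID numbers first; the lookarounds keep a match
        // from starting or ending inside a longer digit run
        text = text.replaceAll("(?<!\\d)(\\d{6})\\d{8}(\\d{3}[\\dXx])(?![\\dXx])", "$1********$2");
        // Mask standalone 11-digit phone numbers
        text = text.replaceAll("(?<!\\d)(\\d{3})\\d{4}(\\d{4})(?!\\d)", "$1****$2");
        return text;
    }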
    
    public static void main(String[] args) {
        // Tests (the address is a placeholder)
        System.out.println(isValidEmail("test@example.com"));  // true
        System.out.println(extractUrls("Visit https://google.com or http://example.com"));
        System.out.println(maskSensitiveInfo("联系电话: 13812345678"));
    }
}

Go

Go provides regular expressions through the regexp package in its standard library.

Basic usage

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // 创建正则（必须编译成功，否则 panic）
    re := regexp.MustCompile(`\b\w{5}\b`)
    
    text := "The quick brown fox jumps over the lazy dog"
    
    // 查找所有匹配
    matches := re.FindAllString(text, -1)
    fmt.Println(matches)  // [quick brown jumps]
}

Common methods

text := "The quick brown fox jumps over the lazy dog"
re := regexp.MustCompile(`\b\w{5}\b`)

// MatchString - 测试是否匹配
matched := re.MatchString(text)  // true

// FindString - 返回第一个匹配
first := re.FindString(text)  // "quick"

// FindAllString - 返回所有匹配
all := re.FindAllString(text, -1)  // ["quick", "brown", "jumps"]
all = re.FindAllString(text, 2)    // ["quick", "brown"]（最多 2 个）

// FindStringIndex - 返回第一个匹配的位置
loc := re.FindStringIndex(text)  // [4, 9]

// FindAllStringIndex - 返回所有匹配的位置
locs := re.FindAllStringIndex(text, -1)  // [[4, 9], [10, 15], [20, 25]]

Groups

// 使用分组
dateRe := regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)
match := dateRe.FindStringSubmatch("2024-03-15")
// match[0] = "2024-03-15"
// match[1] = "2024"
// match[2] = "03"
// match[3] = "15"

// 查找所有带分组的匹配
allMatches := dateRe.FindAllStringSubmatch("2024-03-15 and 2024-04-20", -1)
// [  ["2024-03-15", "2024", "03", "15"],  ["2024-04-20", "2024", "04", "20"]  ]

Replacing and splitting

text := "The quick brown fox"
re := regexp.MustCompile(`\b\w{5}\b`)

// ReplaceAllString - 替换所有
result := re.ReplaceAllString(text, "*****")
// "The ***** ***** fox"

// ReplaceAllStringFunc - 使用函数替换
result = re.ReplaceAllStringFunc(text, func(s string) string {
    return strings.ToUpper(s)
})
// "The QUICK BROWN fox"

// Split - 分割
re2 := regexp.MustCompile(`,\s*`)
parts := re2.Split("a, b, c", -1)
// ["a", "b", "c"]

Practical examples

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// 验证邮箱
func isValidEmail(email string) bool {
    re := regexp.MustCompile(`^[\w.-]+@[\w.-]+\.\w{2,}$`)
    return re.MatchString(email)
}

// 验证手机号
func isValidPhone(phone string) bool {
    re := regexp.MustCompile(`^1[3-9]\d{9}$`)
    return re.MatchString(phone)
}

// 提取所有邮箱
func extractEmails(text string) []string {
    re := regexp.MustCompile(`[\w.-]+@[\w.-]+\.\w{2,}`)
    return re.FindAllString(text, -1)
}

// 解析 URL 参数
func parseURLParams(url string) map[string]string {
    params := make(map[string]string)
    re := regexp.MustCompile(`[?&](\w+)=([^&]*)`)
    matches := re.FindAllStringSubmatch(url, -1)
    
    for _, match := range matches {
        if len(match) == 3 {
            params[match[1]] = match[2]
        }
    }
    return params
}

// Convert camelCase to snake_case
func toSnakeCase(s string) string {
    re := regexp.MustCompile(`([a-z])([A-Z])`)
    s = re.ReplaceAllString(s, "${1}_${2}")
    return strings.ToLower(s)
}

// Convert snake_case to camelCase
func toCamelCase(s string) string {
    re := regexp.MustCompile(`_([a-z])`)
    return re.ReplaceAllStringFunc(s, func(m string) string {
        return strings.ToUpper(m[1:])
    })
}

func main() {
    // Tests (the addresses are placeholders)
    fmt.Println(isValidEmail("test@example.com"))  // true
    fmt.Println(extractEmails("Contact us: test@example.com or admin@example.com"))
    fmt.Println(parseURLParams("https://example.com?name=John&age=30"))
    fmt.Println(toSnakeCase("helloWorld"))  // hello_world
    fmt.Println(toCamelCase("hello_world"))  // helloWorld
}

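Rust

The regex crate is the most widely used regular expression library in the Rust ecosystem. It offers high performance, memory safety, and full Unicode support.

Adding the dependency

Add the following to Cargo.toml: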

[dependencies]
regex = "1"

Basic usage

use regex::Regex;

fn main() {
    // 编译正则表达式
    let re = Regex::new(r"\d{4}").unwrap();
    
    // is_match - 检查是否匹配
    assert!(re.is_match("2024"));
    
    // find - 查找第一个匹配
    let text = "年份是 2024 年";
    if let Some(mat) = re.find(text) {
        println!("找到: {}", &text[mat.start()..mat.end()]);
        // 输出: 找到: 2024
    }
    
    // find_iter - 迭代所有匹配
    let text = "2023 年到 2024 年";
    for mat in re.find_iter(text) {
        println!("匹配: {}", &text[mat.start()..mat.end()]);
    }
}

Capture groups

use regex::Regex;

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    let text = "日期: 2024-03-15";
    
    // captures - 获取捕获分组
    if let Some(caps) = re.captures(text) {
        println!("完整匹配: {}", &caps[0]);   // 2024-03-15
        println!("年: {}", &caps[1]);        // 2024
        println!("月: {}", &caps[2]);        // 03
        println!("日: {}", &caps[3]);        // 15
    }
    
    // captures_iter - 迭代所有捕获
    let text = "2024-03-15 和 2024-04-20";
    for caps in re.captures_iter(text) {
        println!("日期: {}-{}-{}", &caps[1], &caps[2], &caps[3]);
    }
}

Named groups

use regex::Regex;

fn main() {
    // 使用 (?P<name>...) 语法定义命名分组
    let re = Regex::new(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})").unwrap();
    let text = "2024-03-15";
    
    if let Some(caps) = re.captures(text) {
        // 通过名称访问分组
        println!("年: {}", &caps["year"]);
        println!("月: {}", &caps["month"]);
        println!("日: {}", &caps["day"]);
        
        // 也可以通过 name 方法获取
        if let Some(year) = caps.name("year") {
            println!("年份: {}", year.as_str());
        }
    }
}

Replacing

use regex::Regex;

fn main() {
    let re = Regex::new(r"\d{4}").unwrap();
    let text = "年份 2023 变成 2024";
    
    // replace - 替换第一个匹配
    let result = re.replace(text, "YEAR");
    println!("{}", result);  // 年份 YEAR 变成 2024
    
    // replace_all - 替换所有匹配
    let result = re.replace_all(text, "YEAR");
    println!("{}", result);  // 年份 YEAR 变成 YEAR
    
    // Replace using capture groups
    let re = Regex::new(r"(\w+)@(\w+\.\w+)").unwrap();
    let text = "邮箱: test@example.com";
    // $1 and $2 refer to the groups
    let result = re.replace(text, "用户: $1, 域名: $2");
    println!("{}", result);  // 邮箱: 用户: test, 域名: example.com
    
    // 使用命名分组替换
    let re = Regex::new(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)").unwrap();
    let result = re.replace(text, "用户: $user, 域名: $domain");
    
    // 使用闭包进行替换
    let re = Regex::new(r"\d+").unwrap();
    let result = re.replace_all("1 + 2 = 3", |caps: &regex::Captures| {
        let num: i32 = caps[0].parse().unwrap();
        (num * 2).to_string()
    });
    println!("{}", result);  // 2 + 4 = 6
}

Splitting

use regex::Regex;

fn main() {
    let re = Regex::new(r"\s+").unwrap();
    let text = "hello   world\t\nrust";
    
    // split - 分割字符串
    let parts: Vec<&str> = re.split(text).collect();
    println!("{:?}", parts);  // ["hello", "world", "rust"]
    
    // splitn - return at most n substrings
    let parts: Vec<&str> = re.splitn(text, 2).collect();
    println!("{:?}", parts);  // ["hello", "world\t\nrust"]
}

Compile options

The regex crate exposes its compile options through RegexBuilder:

use regex::{Regex, RegexBuilder};

fn main() {
    // 使用 RegexBuilder 配置选项
    let re = RegexBuilder::new(r"hello")
        .case_insensitive(true)      // 忽略大小写
        .multi_line(true)            // 多行模式
        .dot_matches_new_line(true)  // . 匹配换行符
        .ignore_whitespace(true)     // 忽略空白（支持注释）
        .size_limit(10 * (1 << 20))  // 设置大小限制
        .build()
        .unwrap();
    
    assert!(re.is_match("HELLO"));
    
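    // Verbose mode example (comments are allowed inside the pattern)
    // Rust has no triple-quoted strings; a raw string r"..." may span lines
    let re = RegexBuilder::new(r"
        \d{4}    # year
        -        # separator
        \d{2}    # month
        -        # separator
        \d{2}    # day
    ")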
    .ignore_whitespace(true)
    .build()
    .unwrap();
    
    assert!(re.is_match("2024-03-15"));
}

Unicode support

The regex crate is Unicode-aware by default:

use regex::Regex;

fn main() {
    // 匹配中文字符
    let re = Regex::new(r"[\p{Han}]+").unwrap();
    assert!(re.is_match("你好世界"));
    
    // 匹配 Unicode 字母
    let re = Regex::new(r"\p{L}+").unwrap();
    assert!(re.is_match("Hello世界"));
    
    // 匹配 Unicode 数字（包括阿拉伯数字等）
    let re = Regex::new(r"\p{N}+").unwrap();
    assert!(re.is_match("123٠١٢"));  // 包含阿拉伯数字
    
    // 匹配 Emoji
    let re = Regex::new(r"\p{Emoji}+").unwrap();
    assert!(re.is_match("😀🎉"));
}

Practical examples

use regex::Regex;

// 验证邮箱
fn is_valid_email(email: &str) -> bool {
    let re = Regex::new(r"^[\w.-]+@[\w.-]+\.\w{2,}$").unwrap();
    re.is_match(email)
}

// 提取所有邮箱
fn extract_emails(text: &str) -> Vec<String> {
    let re = Regex::new(r"[\w.-]+@[\w.-]+\.\w{2,}").unwrap();
    re.find_iter(text).map(|m| m.as_str().to_string()).collect()
}

// 驼峰转下划线
fn to_snake_case(s: &str) -> String {
    let re = Regex::new(r"([a-z])([A-Z])").unwrap();
    re.replace_all(s, "${1}_${2}").to_lowercase()
}

// 下划线转驼峰
fn to_camel_case(s: &str) -> String {
    let re = Regex::new(r"_([a-z])").unwrap();
    re.replace_all(s, |caps: &regex::Captures| {
        caps[1].to_uppercase()
    }).to_string()
}

// 解析 URL 参数
fn parse_url_params(url: &str) -> std::collections::HashMap<String, String> {
    let mut params = std::collections::HashMap::new();
    let re = Regex::new(r"[?&](\w+)=([^&]*)").unwrap();
    
    for caps in re.captures_iter(url) {
        params.insert(caps[1].to_string(), caps[2].to_string());
    }
    params
}

fn main() {
    // The addresses below are placeholders
    println!("{}", is_valid_email("test@example.com"));  // true
    println!("{:?}", extract_emails("Contact: test@example.com and admin@example.com"));
    println!("{}", to_snake_case("helloWorld"));  // hello_world
    println!("{}", to_camel_case("hello_world"));  // helloWorld
    println!("{:?}", parse_url_params("https://example.com?name=John&age=30"));
}

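Performance characteristics

The regex crate is designed to guarantee linear-time matching: for a given pattern, search time grows linearly with the length of the input. It is built on finite automata, so catastrophic backtracking cannot occur. The trade-off is that features which require backtracking, such as lookaround and backreferences, are not supported.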

use regex::Regex;

fn main() {
    // 这种模式在其他语言可能导致回溯灾难
    // 但在 Rust regex 中是安全的，保证线性时间
    let re = Regex::new(r"(a+)+b").unwrap();
    
    // 即使输入很长也不会卡住
    let text = "a".repeat(100) + "c";
    assert!(!re.is_match(&text));  // 快速返回 false
}
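
For contrast, here is a small Python sketch of the same pattern in a backtracking engine. It assumes the standard `re` module; the input is deliberately kept short so the risky pattern still finishes, and timings will vary by machine. Rewriting the pattern without nested quantifiers is one common way to avoid the blow-up when the engine offers no linear-time guarantee.

```python
import re
import time

risky = re.compile(r"(a+)+b")  # nested quantifiers: exponential backtracking on failure
safer = re.compile(r"a+b")     # finds the same text, without the nesting

text = "a" * 18 + "c"  # short on purpose; each extra "a" roughly doubles the risky time

for name, pattern in (("risky", risky), ("safer", safer)):
    start = time.perf_counter()
    result = pattern.search(text)
    elapsed = time.perf_counter() - start
    print(f"{name}: match={result is not None}, {elapsed:.4f}s")
```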

C# (.NET)

The System.Text.RegularExpressions namespace provides a feature-rich regex engine, including compiled patterns, match timeouts, and named groups.

Basic usage

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // 创建正则表达式
        Regex regex = new Regex(@"\d{4}");
        
        // IsMatch - 检查是否匹配
        bool isMatch = regex.IsMatch("年份是 2024");
        Console.WriteLine(isMatch);  // True
        
        // Match - 查找第一个匹配
        Match match = regex.Match("年份是 2024 年");
        if (match.Success)
        {
            Console.WriteLine($"找到: {match.Value}");  // 找到: 2024
            Console.WriteLine($"位置: {match.Index}");  // 位置: 4
        }
        
        // Matches - 查找所有匹配
        MatchCollection matches = regex.Matches("2023 年到 2024 年");
        foreach (Match m in matches)
        {
            Console.WriteLine($"匹配: {m.Value}");
        }
    }
}

Capture groups

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex regex = new Regex(@"(\d{4})-(\d{2})-(\d{2})");
        string text = "日期: 2024-03-15";
        
        Match match = regex.Match(text);
        if (match.Success)
        {
            Console.WriteLine($"完整匹配: {match.Groups[0].Value}");  // 2024-03-15
            Console.WriteLine($"年: {match.Groups[1].Value}");        // 2024
            Console.WriteLine($"月: {match.Groups[2].Value}");        // 03
            Console.WriteLine($"日: {match.Groups[3].Value}");        // 15
            
            // 遍历所有分组
            for (int i = 0; i < match.Groups.Count; i++)
            {
                Console.WriteLine($"分组 {i}: {match.Groups[i].Value}");
            }
        }
    }
}

Named groups

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // 使用 (?<name>...) 语法定义命名分组
        Regex regex = new Regex(@"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})");
        string text = "2024-03-15";
        
        Match match = regex.Match(text);
        if (match.Success)
        {
            // 通过名称访问分组
            Console.WriteLine($"年: {match.Groups["year"].Value}");
            Console.WriteLine($"月: {match.Groups["month"].Value}");
            Console.WriteLine($"日: {match.Groups["day"].Value}");
        }
        
        // 获取所有分组名称
        string[] groupNames = regex.GetGroupNames();
        Console.WriteLine($"分组名称: {string.Join(", ", groupNames)}");
    }
}

Replacing

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex regex = new Regex(@"\d{4}");
        string text = "年份 2023 变成 2024";
        
        // Replace - 替换所有匹配
        string result = regex.Replace(text, "YEAR");
        Console.WriteLine(result);  // 年份 YEAR 变成 YEAR
        
        // 只替换前 n 个匹配
        result = regex.Replace(text, "YEAR", 1);
        Console.WriteLine(result);  // 年份 YEAR 变成 2024
        
        // Replace using group references
        Regex emailRegex = new Regex(@"(\w+)@(\w+\.\w+)");
        result = emailRegex.Replace("邮箱: test@example.com", "用户: $1, 域名: $2");
        Console.WriteLine(result);  // 邮箱: 用户: test, 域名: example.com
        
        // Replace using named group references
        Regex namedRegex = new Regex(@"(?<user>\w+)@(?<domain>\w+\.\w+)");
        result = namedRegex.Replace("邮箱: test@example.com", "用户: ${user}, 域名: ${domain}");
        
        // 使用 MatchEvaluator 委托
        Regex numRegex = new Regex(@"\d+");
        result = numRegex.Replace("1 + 2 = 3", match => 
        {
            int num = int.Parse(match.Value);
            return (num * 2).ToString();
        });
        Console.WriteLine(result);  // 2 + 4 = 6
    }
}

Splitting

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex regex = new Regex(@"\s+");
        string text = "hello   world	rust";
        
        // Split - 分割字符串
        string[] parts = regex.Split(text);
        Console.WriteLine(string.Join(", ", parts));  // hello, world, rust
        
        // Split 最多分割 n 次
        parts = regex.Split(text, 2);
        Console.WriteLine(string.Join(", ", parts));  // hello, world	rust
        
        // 静态方法
        parts = Regex.Split("a,b,c", @",\s*");
        Console.WriteLine(string.Join(", ", parts));  // a, b, c
    }
}

RegexOptions

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // 常用选项组合
        Regex regex = new Regex(
            @"hello",
            RegexOptions.IgnoreCase |      // 忽略大小写
            RegexOptions.Multiline |        // 多行模式
            RegexOptions.Singleline |       // 单行模式（. 匹配换行）
            RegexOptions.IgnorePatternWhitespace |  // 忽略空白和注释
            RegexOptions.Compiled           // 编译为 IL 代码（提高性能）
        );
        
        // 忽略大小写
        regex = new Regex(@"hello", RegexOptions.IgnoreCase);
        Console.WriteLine(regex.IsMatch("HELLO"));  // True
        
        // 多行模式
        regex = new Regex(@"^line", RegexOptions.Multiline);
        Console.WriteLine(regex.IsMatch("first\nline"));  // True
        
        // 单行模式
        regex = new Regex(@"a.b", RegexOptions.Singleline);
        Console.WriteLine(regex.IsMatch("a\nb"));  // True
        
        // 详细模式（支持注释）
        regex = new Regex(@"
            \d{4}    # 年份
            -        # 分隔符
            \d{2}    # 月份
            -        # 分隔符
            \d{2}    # 日期
        ", RegexOptions.IgnorePatternWhitespace);
        Console.WriteLine(regex.IsMatch("2024-03-15"));  // True
    }
}

Timeout control

.NET regular expressions accept a match timeout, which guards against denial-of-service attacks caused by malicious input:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        try
        {
            // 设置 1 秒超时
            Regex regex = new Regex(
                @"(a+)+b",
                RegexOptions.None,
                TimeSpan.FromSeconds(1)
            );
            
            // 如果匹配时间超过 1 秒，抛出 RegexMatchTimeoutException
            string input = new string('a', 100) + "c";
            bool result = regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException ex)
        {
            Console.WriteLine($"正则匹配超时: {ex.Message}");
        }
    }
}

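Compiled regular expressions

For frequently used patterns, RegexOptions.Compiled compiles the pattern to IL, trading a one-time startup cost for faster matching: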

using System;
using System.Reflection;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
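        // The Compiled option compiles the pattern to IL code
        // The first use pays a compilation cost; later matches are faster
        Regex regex = new Regex(@"\d{4}", RegexOptions.Compiled);
        
        // Compiling to an assembly (legacy)
        // Regex.CompileToAssembly is available only on .NET Framework;
        // it is not supported on .NET Core / .NET 5+, where the source
        // generator ([GeneratedRegex], .NET 7+) serves a similar purpose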
    }
}

Practical examples

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class RegexUtils
{
    // 验证邮箱
    public static bool IsValidEmail(string email)
    {
        Regex regex = new Regex(@"^[\w.-]+@[\w.-]+\.\w{2,}$");
        return regex.IsMatch(email);
    }
    
    // 验证手机号
    public static bool IsValidPhone(string phone)
    {
        Regex regex = new Regex(@"^1[3-9]\d{9}$");
        return regex.IsMatch(phone);
    }
    
    // 提取所有邮箱
    public static List<string> ExtractEmails(string text)
    {
        Regex regex = new Regex(@"[\w.-]+@[\w.-]+\.\w{2,}");
        List<string> emails = new List<string>();
        
        foreach (Match match in regex.Matches(text))
        {
            emails.Add(match.Value);
        }
        return emails;
    }
    
    // 驼峰转下划线
    public static string ToSnakeCase(string s)
    {
        Regex regex = new Regex(@"([a-z])([A-Z])");
        return regex.Replace(s, "$1_$2").ToLower();
    }
    
    // 下划线转驼峰
    public static string ToCamelCase(string s)
    {
        Regex regex = new Regex(@"_([a-z])");
        return regex.Replace(s, m => m.Groups[1].Value.ToUpper());
    }
    
    // Parse URL query parameters
    // (`params` is a reserved keyword in C#, so the dictionary is named queryParams)
    public static Dictionary<string, string> ParseUrlParams(string url)
    {
        Dictionary<string, string> queryParams = new Dictionary<string, string>();
        Regex regex = new Regex(@"[?&](\w+)=([^&]*)");
        
        foreach (Match match in regex.Matches(url))
        {
            queryParams[match.Groups[1].Value] = match.Groups[2].Value;
        }
        return queryParams;
    }
    
    // 敏感信息脱敏
    public static string MaskPhone(string phone)
    {
        Regex regex = new Regex(@"(\d{3})\d{4}(\d{4})");
        return regex.Replace(phone, "$1****$2");
    }
}

class Program
{
    static void Main()
    {
        // The addresses below are placeholders
        Console.WriteLine(RegexUtils.IsValidEmail("test@example.com"));  // True
        Console.WriteLine(string.Join(", ", RegexUtils.ExtractEmails("Contact: test@example.com and admin@example.com")));
        Console.WriteLine(RegexUtils.ToSnakeCase("helloWorld"));  // hello_world
        Console.WriteLine(RegexUtils.ToCamelCase("hello_world"));  // helloWorld
        Console.WriteLine(RegexUtils.MaskPhone("13812345678"));  // 138****5678
    }
}

New in .NET 7+

.NET 7 introduced new high-performance APIs:

using System.Text.RegularExpressions;

// Use the source generator to build the regex at compile time
// (the method must be declared inside a partial class)
[GeneratedRegex(@"\d{4}")]
private static partial Regex YearRegex();

// 使用
bool isMatch = YearRegex().IsMatch("2024");

// 枚举匹配（更高效）
string text = "2023 年到 2024 年";
foreach (var match in YearRegex().EnumerateMatches(text.AsSpan()))
{
    Console.WriteLine(text.Substring(match.Index, match.Length));
}

// 计算匹配数
int count = YearRegex().Count(text);

Feature comparison across languages

Feature	JavaScript	Python	Java	Go	Rust	C#
Regex literal syntax	`/pattern/flags`	No (raw strings `r"..."`)	No	No (raw strings in backquotes)	No (raw strings `r"..."`)	No (verbatim strings `@"..."`)
Named groups	✓ (ES2018)	✓	✓ (Java 7+)	✓	✓	✓
Lookahead	✓	✓	✓	✗	✗	✓
Lookbehind	✓ (ES2018)	✓ (fixed width)	✓ (bounded length)	✗	✗	✓
Recursive patterns	✗	✓ (third-party regex module)	✗	✗	✗	Partial (balancing groups)
Unicode properties	✓ (ES2018, `u` flag)	✓ (third-party regex module)	✓	✓	✓	✓ (categories and named blocks)
Atomic groups	✗	✓ (3.11+)	✓	✗	✗	✓
Possessive quantifiers	✗	✓ (3.11+)	✓	✗	✗	✗
Conditionals	✗	✓	✗	✗	✗	✓
Inline modifiers	✓ (ES2025, scoped `(?i:...)` only)	✓	✓	✓	✓	✓
Global matching	`g` flag	`findall`	`find` loop	`FindAll`	`find_iter`	`Matches`
Case-insensitive matching	`i` flag	`re.I`	`CASE_INSENSITIVE`	`(?i)`	`case_insensitive`	`IgnoreCase`
Match indices	`d` flag (ES2022)	`Match.span()`	`start()`, `end()`	`FindStringIndex`	`start()`, `end()`	`Index`, `Length`
Unicode sets mode	`v` flag (ES2024)	✗	✗	✗	✗	✗
Properties of strings	`v` flag (ES2024)	✗	✗	✗	✗	✗
Set operations in character classes	`v` flag (ES2024)	✗	✓ (intersection `&&`)	✗	✓ (`&&`, `--`, `~~`)	✓ (subtraction `-[...]`)
String literals `\q{}`	`v` flag (ES2024)	✗	✗	✗	✗	✗
Escape helper	`RegExp.escape()` (ES2025)	`re.escape()`	`Pattern.quote()`	`regexp.QuoteMeta()`	`regex::escape()`	`Regex.Escape()`
Timeout control	✗	✗	✗	✗	✗	✓
Explicit precompilation	✗ (handled by the engine)	✓	✓	✓	✓	✓ (`Compiled`)
Linear-time guarantee	✗	✗	✗	✓	✓	✓ (`NonBacktracking`, .NET 7+)
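
Because support depends on both the language and its version, it can help to probe for a feature at runtime instead of relying on a table. The sketch below does this for Python's standard `re` module; `is_supported` and the `features` dictionary are hypothetical names for this example, and the results depend on the interpreter version.

```python
import re

def is_supported(pattern: str) -> bool:
    """Return True if this interpreter's `re` module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

features = {
    "named group": r"(?P<year>\d{4})",
    "lookbehind": r"(?<=\$)\d+",
    "conditional": r"(<)?\w+(?(1)>)",
    "possessive quantifier": r"a++",   # needs Python 3.11+
    "atomic group": r"(?>a+)",         # needs Python 3.11+
    "Unicode property": r"\p{Han}",    # not available in the standard `re` module
}

for name, pattern in features.items():
    print(f"{name}: {is_supported(pattern)}")
```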

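Recommendations

JavaScript: front-end form validation and real-time text processing
Python: data processing, scripting, and log analysis
Java: enterprise applications and high-concurrency workloads
Go: high-performance services that benefit from a small API and linear-time matching
Rust: systems programming, and cases that need memory safety and predictable performance
C#: .NET enterprise applications, and cases that need compiled patterns and timeout control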

Common issues

1. Escaping

// JavaScript: backslashes must be doubled inside a string
new RegExp("\\d{4}");  // matches four digits

# Python: use a raw string to avoid doubling backslashes
re.compile(r"\d{4}")

// Java: backslashes must be doubled
Pattern.compile("\\d{4}")

// Go: raw string
regexp.MustCompile(`\d{4}`)

// Rust: raw string
Regex::new(r"\d{4}").unwrap()

// C#: verbatim string
new Regex(@"\d{4}")
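
A quick way to see why raw strings help is to compare the two spellings directly. This is a minimal Python sketch; the variable names and the sample path are illustrative.

```python
import re

plain = "\\d{4}"   # ordinary string: the backslash itself must be escaped
raw = r"\d{4}"     # raw string: the backslash is kept as written

print(plain == raw)   # True
print(len(raw))       # 5

# Matching a literal backslash takes two backslashes even in a raw string,
# because the regex engine also treats the backslash as an escape character
found = re.search(r"\\", "C:\\temp")
print(found.start())  # 2
```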

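2. Unicode handling

// JavaScript: \p{...} requires the u flag
/\p{Han}+/u.test("你好");  // true

# Python: str patterns are Unicode by default, but the standard re module
# has no \p{...}, so use a code-point range
re.findall(r"[\u4e00-\u9fff]+", "你好")

// Java: script property, written \p{IsHan}
Pattern.compile("\\p{IsHan}+")

// Go: Unicode script class
regexp.MustCompile(`\p{Han}+`)

// Rust: built-in Unicode support
Regex::new(r"\p{Han}+").unwrap()

// C#: .NET has no script properties such as \p{IsHan}; use a named block
new Regex(@"\p{IsCJKUnifiedIdeographs}+")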
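
The Python behaviour can be checked with a short sketch like the one below. The sample text is made up for this example, and the range `\u4e00-\u9fff` covers the common CJK Unified Ideographs block only, not every Han character.

```python
import re

text = "Hello 你好 world 世界"

# Explicit code-point range for common CJK ideographs
han_runs = re.findall(r"[\u4e00-\u9fff]+", text)

# \w is Unicode-aware for str patterns by default
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(han_runs)       # ['你好', '世界']
print(unicode_words)  # ['Hello', '你好', 'world', '世界']
print(ascii_words)    # ['Hello', 'world']
```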

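3. Performance considerations

Precompile patterns that are used frequently
Avoid recompiling a pattern inside a loop
For large texts, consider stream processing
Rust and Go: both engines guarantee linear-time matching, so catastrophic backtracking cannot occur
C#: set a match timeout to guard against ReDoS attacks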
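
The first three points can be sketched together in Python: the pattern is compiled once at module level and applied line by line, so the whole text never has to be held in memory. `count_five_letter_words` is a hypothetical helper for this example; with a real file, the file object itself can be passed in place of the list.

```python
import re

# Compiled once at import time and reused by every call
WORD_RE = re.compile(r"\b\w{5}\b")

def count_five_letter_words(lines):
    """Hypothetical helper: count 5-letter words across an iterable of lines."""
    return sum(len(WORD_RE.findall(line)) for line in lines)

lines = ["The quick brown fox", "jumps over the lazy dog"]
print(count_five_letter_words(lines))  # 3
```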

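Summary

Regex implementations differ in syntax and features from language to language, but the core workflow is the same:

Compile: turn the pattern string into an internal representation
Match: search the target string for the pattern
Extract: read the matched text and its groups
Replace: substitute new content for the matched text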
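
The four steps map onto a few lines of Python. This is only a sketch with made-up sample text; the other languages follow the same sequence with their own method names.

```python
import re

# Compile
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
text = "Released 2024-03-15, patched 2024-04-20."

# Match
first = date_re.search(text)

# Extract
parts = first.groupdict()

# Replace (\g<name> refers to a named group in the replacement string)
swapped = date_re.sub(r"\g<day>/\g<month>/\g<year>", text)

print(parts)    # {'year': '2024', 'month': '03', 'day': '15'}
print(swapped)  # Released 15/03/2024, patched 20/04/2024.
```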

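Distinctive strengths of each language:

JavaScript: native browser support; ES2024 adds the powerful v flag
Python: flexible API, with conditionals and (from 3.11) atomic groups
Java: mature, enterprise-grade engine with a rich feature set
Go: small API in the standard library, with linear-time matching
Rust: linear-time matching and memory safety, well suited to systems programming
C#: compiled patterns and timeout control; .NET 7+ adds a source generator

Pick the language that fits your project and learn its API and feature set, and you can solve text-processing problems efficiently with regular expressions.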
