正则表达式进阶用法与真实调优案例

Aaron Hu

2019-10-17

正则表达式并不陌生，但当面对复杂的正则表达式：长度长，特殊符号多，多分组的情况下，解析速度会怎样？

正则表达式基础

为节省版面直击要害，基础内容本文不作赘述，请点击以下链接查看：

表达式全集

史上最全常用正则表达式大全

真实案例：超级慢的正则表达式

常用的正则表达式一般几秒内解析完毕，那么见过3分钟都没解析出来的正则表达式吗？请往下看：

解析目标

过滤请求中不合法的 uri（不包含参数），如包含特殊符号- = + % ?、多个斜杠//、中文等。

代码实现

正则表达式："^(/?[A-Za-z0-9\\-]+/?[A-Za-z0-9\\-]+)+/?"

编写 java 代码：

// 所有方法设置成静态的，方便 main 调用
private static final Pattern URI_PATTERN = Pattern.compile("^(/?[A-Za-z0-9\\-]+/?[A-Za-z0-9\\-]+)+/?)";

public static boolean validateUri(String uri) {
  Matcher m = URI_PATTERN.matcher(uri);
  return m.matches();
}

public static void validate() {
  System.out.println("开始。。。。。。");
	System.out.println(validateUri("/hotel/getUseraaaaaaaasdfasdfasdfasdfasdf+"));
  System.out.println("结束。。。。。。");
}

执行 validate 方法。

测试结果

6C32G mac pro 跑了3分钟没跑完。。没耐心继续等待，直接中断。

最简单的解决方案

简化正则表达式为：^/?([-A-Za-z0-9]/?)+$

但是我们需要知道为什么？如果就想在原来的基础上做修改，该怎么办？

原因探索：回溯匹配

猜测：考虑字符太长，以及特殊符号的问题

测试1：把 uri 中间的 aaaaaasdfsdf 等去掉后，只保留简单的 /hotel/getUser+，很明显速度提上来了。

测试2：把 uri 中的 + 号提前，变成 /hotel/getUser+aaaaaaaasdfasdfasdfasdfasdf，速度也很快。

初步结论：非匹配字符的位置会影响正则表达式的执行效率

为此需要知道正则表达式的执行规则：回溯。

比如要匹配的字符串是 helloworld,hihaojava

正则表达式是 h(ello|ihao)java

匹配过程：

从字符串第一个字符 h 开始匹配，可以命中。
接下来的匹配正则有2个分支 ello 和 ihao 。在 e 处打标记，先从左边的 ello 开始匹配，可以匹配字符串，但是到了 world 的 w 时，与正则中的 j 不一致，该分支匹配结束。接着回溯到刚刚的标记处，开始第二个分支 ihao，无法匹配，接下来正则会从 h 开始匹配字符串。
从字符串的第二个字符 e 继续匹配，直到正则的第一个字符 h 匹配到字符串的 , 之后匹配成功。
接着正则的2个分支执行匹配，最终只有 ihao匹配成功，最终匹配到的字符串是 hihaojava。

看完了正则的匹配过程，就知道为什么前面代码中的正则匹配效率会那么低下了。每个字母都要匹配到最后的+才发现匹配失败，回溯后继续查找，如果忽略其他，只考虑字符串 getUseraaaaaaaasdfasdfasdfasdfasdf+ 和正则 [-A-Za-z0-9] 的影响，时间复杂度就已经为 n^2，类似于以下模拟代码：

// 模拟代码，真实情况远比这个复杂，这里仅为了方便理解
public static boolean execute(String s) {
  String reg = "-ABCDEFGHIJKMLNOPQRSTUVWSYZabcdefghijkmlnopqrstuvwsyz0123456789";
  int i = 0, count = 0;
  int length = s.length();
  for (; i < length; i++) {
    for (int j = 0; j < length; j++) {
      count ++;
      String idx = s.substring(j, j + 1);
      if (!reg.contains(idx)) {
        System.out.println("匹配失败！特殊字符：" + idx);
        break;
      }
      System.out.println("匹配字符：" + idx);
    }
  }
  System.out.println("一共匹配了 " + count + "次！字符串长度为 " + length);
  return i != length;
}

解决方法：采用进阶匹配模式

正则模式：贪婪、勉强、侵占

假定字符串为：aahelloworldhello

贪婪模式（.*he）：将正则分为两个模式 p1 .* 以及 p2 he 。
1. 第一轮匹配：p1读入所有字符串，那么p2就没什么都没匹配到。
2. 第二轮匹配：字符串被分割为 aahelloworldhell 和 o，p1匹配子串1成功，p2匹配子串2失败。
3. 直到字符串分割为 aahelloworld 和 hello时，两个正则模式都匹配成功。匹配到的子串为aahelloworldhe，停止匹配，返回结果。
勉强模式（.*?he）：最小匹配方式。此时的正则模式为 .*? 和 he。
1. 第一次匹配：p1由于是0或任意次，被忽略，用字符串整体去匹配 p2，当然失败。
2. 第二次匹配：p1读入第一个字符 a，匹配成功，剩余的 ahelloworldhello 由 p2匹配，失败。
3. 直到字符串分割为 aa 和 helloworldhello，两个正则模式都匹配成功。匹配子串 aahe，返回结果。
4. 继续匹配，直到字符串分隔为 lloworld 和 hello，匹配子串 lloworldhe，返回结果。
侵占模式（.*+he）：也叫占用模式。匹配开始时读入所有字符串和 p1匹配成功，但没有剩余字符串去和 p2匹配，因此返回匹配失败。

说明：贪婪模式和占有模式相比，贪婪模式会在只有部分匹配成功的条件下，依次从多到少减少匹配成功部分模式的匹配数量，将字符留给模式其他部分去匹配。而占用模式则是占有所有能匹配成功部分，绝不留给其他部分使用。

代码演示

public static final Pattern PATTERN_GREEDY = Pattern.compile(".*he");
public static final Pattern PATTERN_FORCED = Pattern.compile(".*?he");
public static final Pattern PATTERN_OCCUPIED = Pattern.compile(".*+he");

public static void execute() {
  String s = "aahelloworldhello";
  Matcher greedy = PATTERN_GREEDY.matcher(s);
  while (greedy.find()) {
    System.out.println("贪婪模式：匹配到子串 " + greedy.group());
    System.out.println("贪婪模式：查找匹配到的子串在原始串中的索引位置 " + greedy.start());
    System.out.println("贪婪模式：查找匹配到的子串最后一个字符串在原串中的位置 " + greedy.end());
  }

  Matcher forced = PATTERN_FORCED.matcher(s);
  while (forced.find()) {
    System.out.println("勉强模式：匹配到子串 " + forced.group());
    System.out.println("勉强模式：查找匹配到的子串在原始串中的索引位置 " + forced.start());
    System.out.println("勉强模式：查找匹配到的子串最后一个字符串在原串中的位置 " + forced.end());
  }

  Matcher occupied = PATTERN_OCCUPIED.matcher(s);
  System.out.println("侵占模式：匹配原串任意位置返回的结果： " + occupied.find());
}

返回的结果：

贪婪模式：匹配到子串 aahelloworldhe
贪婪模式：查找匹配到的子串在原始串中的索引位置 0
贪婪模式：查找匹配到的子串最后一个字符串在原串中的位置 14
  
勉强模式：匹配到子串 aahe
勉强模式：查找匹配到的子串在原始串中的索引位置 0
勉强模式：查找匹配到的子串最后一个字符串在原串中的位置 4
勉强模式：匹配到子串 lloworldhe
勉强模式：查找匹配到的子串在原始串中的索引位置 4
勉强模式：查找匹配到的子串最后一个字符串在原串中的位置 14
  
侵占模式：匹配原串任意位置返回的结果： false

在这个基础上，之前的难题迎刃而解。对于判断 uri 是否合法的问题，不需要正则做回溯操作，整体不合法则返回匹配失败即可，因此选用侵占模式，需要将之前的正则表达式改进一下。

原始正则表达式："^(/?[A-Za-z0-9\\-]+/?[A-Za-z0-9\\-]+)+/?"

改进后的正则表达式："^(/?[A-Za-z0-9\\-]++/?[A-Za-z0-9\\-]++)+/?"

测试后发现速度变为了 ms 级，只增加 + 号，效果显而易见。

正则高级用法补充

除了贪婪勉强侵占模式以外，补充一些其他的高级用法。

获取匹配 Capturing

系统在幕后将所有的子模式匹配结果保存起来，供我们查找或替换。

后向引用：使用 \数字 代表前面某个子模式的匹配内容，使用 $数字 代表变量。

例如：匹配合法的 html 标记。

正则：<h([1-6])>.*?</h\1>

文本：<h1> text1</h1> <h2>text23</h3>

其中 <h1>text1</h1> 被成功匹配。\1代表前面的子模式([1-6])的匹配结果1。

常见应用：匹配重复单词 (\w+) \1，匹配合法的 html 标记。

非获取匹配 Non-Capturing

在子模式内部前面添加 ?:。表示这个子模式的匹配内容不会被保存，不能用于后向引用中。

例如：Windows 95 and Windows 98 are the successor. Then Windows 2000 and Windows Xp appeared. Windows Vista is the Latest version of the family.

正则：Windows (?:[\w]+\b)

匹配：Windows 95 and Windows 98 are the successor. Then Windows 2000 and Windows Xp appeared. Windows Vista is the Latest version of the family.

结果：只匹配内容，但并未保存子匹配的结果

正向肯定预查：在子模式内部前面加 ?=，子模式仅仅作为条件限制，并不作为匹配结果输出，匹配子模式前面的内容。

正则：Windows (?=[\d]+\b)

匹配：Windows 95 and Windows 98 are the successor. Then Windows 2000 and Windows Xp appeared. Windows Vista is the Latest version of the family.
正向否定预查：在子模式内部前面加 ?!。

正则：Windows (?![\d]+\b)

匹配：Windows 95 and Windows 98 are the successor. Then Windows 2000 and Windows Xp appeared. Windows Vista is the Latest version of the family.
反向肯定预查：在子模式内部前面加 ?<=，匹配子模式后面的结果作为匹配结果。

例如：CNY:100.2 USD:222.1 USD:301.3 HKD:122.1 CNY:114.4

正则：(?<=CNY:)\d+\.\d

匹配：CNY:100.2 USD:222.1 USD:301.3 HKD:122.1 CNY:114.4
反向否定预查：在子模式内部前面加 ?<!

正则：(?<!CNY:)\b\d+\.\d

匹配：CNY:100.2 USD:222.1 USD:301.3 HKD:122.1 CNY:114.4

代码演示

获取匹配

1 2	String s = "abc def aaa bbb".replaceAll("(\\w+)\\s(\\w+)", "$2 $1"); // 结果是 def abc bbb aaa

非获取匹配

public static final Pattern PATTERN = Pattern.compile("Windows (?:[\\w]+\\b)");
public static final Pattern PATTERN_POSITIVE_YES = Pattern.compile("Windows (?=[\\d]+\\b)");
public static final Pattern PATTERN_POSITIVE_NO = Pattern.compile("Windows (?![\\d]+\\b)");
public static final Pattern PATTERN_NEGATIVE_YES = Pattern.compile("(?<=CNY:)\\d+\\.\\d+");
public static final Pattern PATTERN_NEGATIVE_NO = Pattern.compile("(?<!CNY:)\\b\\d+\\.\\d+");

public static final String STR1 = "Windows 95 and Windows 98 are the successor. Then Windows 2000 and Windows Xp appeared. Windows Vista is the Latest version of the family.";
public static final String STR2 = "CNY:100.25 USD:222.16 USD:301.3 HKD:122.1 CNY:114.4";

public static void execute() {
  Matcher noCapturing = PATTERN.matcher(STR1);
  System.out.println("\n非获取匹配结果：---------");
  while (noCapturing.find()) {
    System.out.print(noCapturing.group() + "\t");
  }

  Matcher positiveYes = PATTERN_POSITIVE_YES.matcher(STR1);
  System.out.println("\n正向肯定预查结果：---------");
  while (positiveYes.find()) {
    System.out.print(positiveYes.group() + "\t");
  }

  Matcher positiveNo = PATTERN_POSITIVE_NO.matcher(STR1);
  System.out.println("\n正向否定预查结果：---------");
  while (positiveNo.find()) {
    System.out.print(positiveNo.group() + "\t");
  }

  Matcher negativeYes = PATTERN_NEGATIVE_YES.matcher(STR2);
  System.out.println("\n负向肯定预查结果：---------");
  while (negativeYes.find()) {
    System.out.print(negativeYes.group() + "\t");
  }

  Matcher negativeNo = PATTERN_NEGATIVE_NO.matcher(STR2);
  System.out.println("\n负向否定预查结果：---------");
  while (negativeNo.find()) {
    System.out.print(negativeNo.group() + "\t");
  }
}

运行结果：

非获取匹配结果：---------
Windows 95	Windows 98	Windows 2000	Windows Xp	Windows Vista
正向肯定预查结果：---------
Windows 	Windows 	Windows
正向否定预查结果：---------
Windows 	Windows
负向肯定预查结果：---------
100.25	114.4
负向否定预查结果：---------
222.16	301.3	122.1

正则表达式进阶用法与真实调优案例

正则表达式基础

真实案例：超级慢的正则表达式

解析目标

代码实现

测试结果

最简单的解决方案

原因探索：回溯匹配

猜测：考虑字符太长，以及特殊符号的问题

初步结论：非匹配字符的位置会影响正则表达式的执行效率

解决方法：采用进阶匹配模式

正则模式：贪婪、勉强、侵占

代码演示

正则高级用法补充

获取匹配 Capturing

非获取匹配 Non-Capturing

代码演示

Previous

大型网站技术架构

book

Next

mac单机部署mysql多实例

database

Comments