re2 is a PHP extension which provides an interface to Google's RE2 regular-expression library.
Backtracking engines are typically full of features and convenient syntactic sugar but can be forced into taking exponential amounts of time on even small inputs. RE2 uses automata theory to guarantee that regular expression searches run in time linear in the size of the input. RE2 implements memory limits, so that searches can be constrained to a fixed amount of memory. RE2 is engineered to use a small fixed C++ stack footprint no matter what inputs or regular expressions it must process; thus RE2 is useful in multithreaded environments where thread stacks cannot grow arbitrarily large.
On large inputs, RE2 is often much faster than backtracking engines; its use of automata theory lets it apply optimizations that the others cannot.
Unlike most automata-based engines, RE2 implements almost all the common Perl and PCRE features and syntactic sugars. It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently. The syntax page gives full details.
<?php
$subject = 'Hello regex world';
re2_match_all('\w+', $subject, $matches);
print_r($matches);
/*
Array
(
[0] => Array
(
[0] => Hello
[1] => regex
[2] => world
)
)
*/
re2_match_all('\w(\w+)', $subject, $matches, RE2_SET_ORDER);
print_r($matches);
/*
Array
(
[0] => Array
(
[0] => Hello
[1] => ello
)
[1] => Array
(
[0] => regex
[1] => egex
)
[2] => Array
(
[0] => world
[1] => orld
)
)
*/
echo re2_replace('\w+', 'foo', $subject), "\n";
/*
foo foo foo
*/
echo re2_replace('\w+', 'foo', $subject, 1), "\n";
/*
foo regex world
*/
echo re2_replace_callback('\w+', function($m) { return strtoupper($m[0]); }, $subject, 2), "\n";
/*
HELLO REGEX world
*/
?>
The interface is intended to follow ext/pcre (preg_match() et al.) as closely as possible.
The main differences are:
- Pattern delimiters (the "/" in "/foo/") are not required.
- The functions which accept a pattern will take either a string or an RE2 object for the pattern.
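The first difference can be seen by running the same pattern through both extensions. This is a minimal sketch: the re2_match() call uses only the signature documented in this README, and it is wrapped in a function_exists() guard (an addition for illustration) so the script still runs when the extension is not loaded.

```php
<?php
$subject = 'Hello regex world';

// ext/pcre: the pattern must be wrapped in delimiters.
$pcreResult = preg_match('/\w+/', $subject, $pcreMatches);
echo $pcreResult, ' ', $pcreMatches[0], "\n";

// re2: the same pattern is passed bare, without delimiters.
// The guard is only here so the sketch runs without the extension.
if (function_exists('re2_match')) {
    $re2Result = re2_match('\w+', $subject, $re2Matches);
    echo $re2Result, ' ', $re2Matches[0], "\n";
} else {
    echo "re2 extension not loaded\n";
}
```

Passing an Re2 object instead of a string works the same way from the caller's side; it is not shown here because the constructor arguments are not spelled out above.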
int re2_match(mixed $pattern, string $subject [, array &$matches [, int $flags = RE2_ANCHOR_NONE [, int $offset = 0]]])
Returns whether the pattern matches the subject.
int re2_match_all(mixed $pattern, string $subject, array &$matches [, int $flags = RE2_PATTERN_ORDER [, int $offset = 0]])
Returns how many times the pattern matched the subject.
mixed re2_replace(mixed $pattern, mixed $replacement, mixed $subject [, int $limit = -1 [, int &$count]])
Replaces all matches of the pattern with the replacement.
mixed re2_replace_callback(mixed $pattern, mixed $callback, mixed $subject [, int $limit = -1 [, int &$count]])
Replaces all matches of the pattern with the value returned by the replacement callback.
mixed re2_filter(mixed $pattern, mixed $replacement, mixed $subject [, int $limit = -1 [, int &$count]])
Replaces all matches of the pattern with the replacement. Returns only the subjects where there was a match.
Returns the array entries which match the pattern (or those which don't, with RE2_GREP_INVERT).
Escapes all potentially meaningful regexp characters in the subject.
Represents a compiled regex pattern.
Construct a new Re2 object.
Returns the pattern.
Returns the options used for this pattern.
Options to be used for a particular pattern.
Construct a new Re2Options object.
Default "utf8". The encoding to use for the pattern and subject strings, "utf8" or "latin1".
Default 8388608 (8 MB).
The max_mem option controls how much memory can be used to hold the compiled form of the regexp (the Prog) and its cached DFA graphs. Code Search placed limits on the number of Prog instructions and DFA states: 10,000 for both. In RE2, those limits would translate to about 240 KB per Prog and perhaps 2.5 MB per DFA (DFA state sizes vary by regexp; RE2 does a better job of keeping them small than Code Search did). Each RE2 has two Progs (one forward, one reverse), and each Prog can have two DFAs (one first match, one longest match).
The RE2 memory budget is statically divided between the two Progs and then the DFAs: two thirds to the forward Prog and one third to the reverse Prog. The forward Prog gives half of what it has left over to each of its DFAs. The reverse Prog gives it all to its longest-match DFA.
Once a DFA fills its budget, it flushes its cache and starts over. If this happens too often, RE2 falls back on the NFA implementation.
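The split described above can be worked through with plain arithmetic. The sketch below is not part of the extension: re2_budget_sketch() is a hypothetical helper, the integer rounding is an assumption, and the 240 KB Prog size is only the rough Code Search figure quoted earlier, used as a placeholder. Real Prog sizes depend on the pattern.

```php
<?php
// Hypothetical helper (not provided by the re2 extension): estimates how a
// max_mem budget would be divided, following the prose description above.
function re2_budget_sketch(int $maxMem, int $forwardProgSize, int $reverseProgSize): array
{
    // Two thirds to the forward Prog, one third to the reverse Prog.
    $forwardBudget = intdiv($maxMem * 2, 3);
    $reverseBudget = intdiv($maxMem, 3);

    // Whatever the compiled Prog does not use is left for its DFA caches.
    $forwardLeft = max(0, $forwardBudget - $forwardProgSize);
    $reverseLeft = max(0, $reverseBudget - $reverseProgSize);

    return [
        'forward_prog_budget'       => $forwardBudget,
        'reverse_prog_budget'       => $reverseBudget,
        // The forward Prog gives half of the remainder to each of its two DFAs.
        'forward_dfa_each'          => intdiv($forwardLeft, 2),
        // The reverse Prog gives everything to its longest-match DFA.
        'reverse_longest_match_dfa' => $reverseLeft,
    ];
}

// Default max_mem, with an assumed 240 KB (245760 bytes) per Prog.
$budget = re2_budget_sketch(8388608, 245760, 245760);
print_r($budget);
```

Under these assumptions each forward DFA gets roughly 2.5 MB, which is about the size the text estimates for a 10,000-state DFA.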
Default false.
Restrict patterns to POSIX egrep syntax.
Default false.
Search for the longest match instead of the first match.
Default true.
Write syntax and execution errors to stderr.
Default false.
Interpret pattern as literal, not regex.
Default false.
Never match \n, even in regex.
Default true.
Match is case-sensitive (regexp can override with (?i) unless in posix_syntax mode).
Default false.
Allow Perl's \d \s \w \D \S \W when in posix_syntax mode.
Default false.
Allow \b \B (word boundary and not) when in posix_syntax mode.
Default false.
^ and $ only match beginning and end of text when in posix_syntax mode.