Be aware that the PCRE support in PHP 5.2.0 through 5.2.3 is buggy when using a simple regex like:
<?php
if (preg_match("/((a+)?)+/", "a")) {
    echo "Matched";
}
?>
Running a nested regex like this will result in a Segmentation Fault on those versions.
The bug is reported to be fixed (http://bugs.php.net/bug.php?id=41796), but think twice before writing a regex like that one if your code needs to be compatible with these versions.
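If the code has to run on those versions, one option is to check the running version and fall back to a flatter pattern. The following is only a sketch: the helper name pcre_nested_quantifier_risky() is hypothetical, the version range is taken from this note rather than from the bug report, and the fallback /a+/ is an assumption about what the pattern is meant to do (it drops the capture groups and no longer matches the empty string).

```php
<?php
// Hypothetical helper: true when $version lies in the range this note
// reports as affected (5.2.0 through 5.2.3).
function pcre_nested_quantifier_risky($version = PHP_VERSION)
{
    return version_compare($version, '5.2.0', '>=')
        && version_compare($version, '5.2.3', '<=');
}

if (pcre_nested_quantifier_risky()) {
    // Flattened pattern: still matches "a", but without nested quantifiers.
    $pattern = '/a+/';
} else {
    $pattern = '/((a+)?)+/';
}

if (preg_match($pattern, 'a')) {
    echo "Matched\n";
}
```

Distribution builds with a suffix in PHP_VERSION may compare differently, so test the check on the actual target.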
PCRE Functions
Table of Contents
- preg_filter — Perform a regular expression search and replace
- preg_grep — Return array entries that match the pattern
- preg_last_error — Returns the error code of the last PCRE regex execution
- preg_match_all — Perform a global regular expression match
- preg_match — Perform a regular expression match
- preg_quote — Quote regular expression characters
- preg_replace_callback — Perform a regular expression search and replace using a callback
- preg_replace — Perform a regular expression search and replace
- preg_split — Split string by a regular expression
bermi
09-Jul-2007 07:20
misc at e2007 dot cynergi dot com
05-May-2007 01:16
PCRE faster than POSIX RE? Not always.
In a recent search-engine project here at Cynergi, I had a simple loop with a few cute ereg_replace() calls that took 3 minutes to process the data. I replaced that 10-line loop with 100 lines of hand-written replacement code, and the loop then took 10 seconds to process the same data! This opened my eyes to how slow regular expressions can be *IN SOME CASES*.
Lately I decided to look into Perl-compatible regular expressions (PCRE). Most pages claim PCRE is faster than POSIX, but a few claim otherwise, so I decided to run benchmarks of my own.
My first few tests confirmed PCRE to be faster, but the results were slightly different from what others were getting, so I decided to benchmark every use of REs in an 8000-line secure (and fast) Webmail project here at Cynergi to check it out.
The results? Inconclusive! Sometimes PCRE is faster (occasionally by a factor of more than 100x!), but at other times POSIX RE is faster (by a factor of 2x).
I have yet to find a rule for when one or the other is faster. It is not only about search data size, amount of data matched, or "RE compilation time" (which would show up when the function is called repeatedly): if it were, one would *always* be faster than the other, and I found no such pattern. Truth be told, I also didn't take the time to look into the source code and analyse the problem.
I can give you some examples, though. The POSIX RE
([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+
([0-9]{2}):([0-9]{2}):([0-9]{2})
is 30% faster in POSIX than when converted to PCRE (even if you use \d and \D and non-greedy matching). On the other hand, a similarly complex PCRE pattern
/[0-9]{1,2}[ \t]+[a-zA-Z]{3}[ \t]+[0-9]{4}[ \t]+[0-9]{1,2}:[0-9]{1,2}(:[0-9]{1,2})?[ \t]+[+-][0-9]{4}/
is 2.5x faster in PCRE than in POSIX RE. Simple replacement patterns like
ereg_replace( "[^a-zA-Z0-9-]+", "", $m );
are 2x faster in POSIX RE than PCRE. And then we get confused again because a POSIX RE pattern like
(^|\n|\r)begin-base64[ \t]+[0-7]{3,4}[ \t]+......
is 2x faster as POSIX RE, but the case-insensitive PCRE
/^Received[ \t]*:[ \t]*by[ \t]+([^ \t]+)[ \t]/i
is 30x faster than its POSIX RE version!
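For readers unfamiliar with the conversion mentioned above, here is a rough sketch of how the first date/time pattern might look as a PCRE. It assumes the two pattern lines above form a single expression; the '#' delimiter, the sample string and the variable names are made up for illustration and are not from the original benchmark.

```php
<?php
// POSIX original, read here as one pattern (an assumption):
// ([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+([0-9]{2}):([0-9]{2}):([0-9]{2})
// '#' is used as the delimiter so the slashes need no escaping.
$pcre = '#(\d{4})/(\d{2})/(\d{2})\D+(\d{2}):(\d{2}):(\d{2})#';

$sample = 'Logged 2007/05/05 at 01:16:42'; // made-up test input

if (preg_match($pcre, $sample, $m)) {
    // $m[1]..$m[6] hold year, month, day, hour, minute, second.
    echo "$m[1]-$m[2]-$m[3] $m[4]:$m[5]:$m[6]\n"; // prints 2007-05-05 01:16:42
}
```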
When it comes to case sensitivity, PCRE has so far seemed to be the best option. But I found some really strange behaviour from ereg/eregi. On a very simple POSIX RE
(^|\r|\n)mime-version[ \t]*:
I found eregi() taking 3.60s (just a number in a test benchmark), while the corresponding PCRE took 0.16s! But if I used ereg() (case-sensitive) the POSIX RE time went down to 0.08s! So I investigated further. I tried to make the POSIX RE case-insensitive itself. I got as far as this:
(^|\r|\n)[mM][iI][mM][eE]-vers[iI][oO][nN][ \t]*:
This version also took 0.08s. But if I apply the same rule to any of the 'v', 'e', 'r' or 's' letters that were left unchanged, the time jumps back to the 3.60s mark, and not gradually, but immediately so! The test data didn't have any "vers" in it, any other "mime" words, or any "ion" that might be confusing the POSIX parser, so I'm at a loss.
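Note that the bracketed pattern is only partly case-insensitive, since "vers" still has to be lower case. A small sketch with PCRE shows the difference; the two header strings are made-up inputs, not the benchmark data.

```php
<?php
// Fully case-insensitive PCRE version of the pattern.
$insensitive = '/(^|\r|\n)mime-version[ \t]*:/i';
// Hand-written variant from the note: "vers" is still case-sensitive.
$bracketed = '/(^|\r|\n)[mM][iI][mM][eE]-vers[iI][oO][nN][ \t]*:/';

$upperV = "Subject: test\r\nMIME-Version: 1.0\r\n"; // capital "V"
$lowerV = "Subject: test\r\nMIME-version: 1.0\r\n"; // lower-case "v"

var_dump(preg_match($insensitive, $upperV)); // int(1)
var_dump(preg_match($bracketed, $upperV));   // int(0): "Vers" is not "vers"
var_dump(preg_match($bracketed, $lowerV));   // int(1)
```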
Bottom line: always benchmark your PCRE / POSIX RE to find the fastest!
Tests were performed with PHP 5.1.2 under Windows, from the command line.
Pedro Freire
cynergi.com
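A minimal timing harness along these lines could be used to repeat such comparisons. This is a sketch, not the code used for the figures above: bench_regex(), the iteration count and the test string are all assumptions. It needs PHP 5.3 or later for the closures, and because ereg_replace() was removed in PHP 7.0 the POSIX variant is only timed where the function still exists.

```php
<?php
// Hypothetical helper: runs $fn $iterations times, returns elapsed seconds.
function bench_regex($fn, $iterations = 10000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$m = str_repeat('Some header-like text, 123! ', 50); // made-up test data

$candidates = array(
    'preg_replace' => function () use ($m) {
        return preg_replace('/[^a-zA-Z0-9-]+/', '', $m);
    },
);

// Only available before PHP 7.0 (and deprecated since 5.3).
if (function_exists('ereg_replace')) {
    $candidates['ereg_replace'] = function () use ($m) {
        return ereg_replace('[^a-zA-Z0-9-]+', '', $m);
    };
}

foreach ($candidates as $name => $fn) {
    printf("%-14s %.4f s\n", $name, bench_regex($fn));
}
```

Timings depend heavily on the input, so run it against data that resembles your real workload.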
nickspring at mail dot ru
14-Oct-2006 04:47
A regular expressions tutorial in Russian is available at http://www.pcre.ru
lgandras at hotmail dot com
19-Feb-2006 11:19
I read this part, but I couldn't understand a single word, because you first need to know basic regular expressions. Somebody posted a link for Perl, which is almost like PHP, but here is one totally dedicated to PHP:
http://weblogtoolscollection.com/regex/regex.php
Gokul
06-Feb-2006 09:59
I came across this nice tutorial for regular expressions in Perl:
http://perldoc.perl.org/perlretut.html
richardh at phpguru dot org
22-Sep-2005 08:50
There's a printable PDF PCRE cheat sheet available here:
http://www.phpguru.org/article.php?ne_id=67
Has the common metacharacters, quantifiers, pattern modifiers, character classes and assertions with short explanations.
hfuecks at nospam dot org
04-Jul-2005 11:21
Good PCRE tutorial at http://www.tote-taste.de/X-Project/regex/ - well explained but still in depth
Ned Baldessin
24-Oct-2004 03:08
If you want to perform regular expressions on Unicode strings, the PCRE functions will NOT be of much help unless the text is UTF-8 and the pattern uses the u modifier. Otherwise you need to use the Multibyte extension: mb_ereg(), mb_eregi(), mb_ereg_replace() and so on. When doing so, be careful to set the default text encoding to the same encoding used by the text you are searching and replacing in. You can do that with the mb_regex_encoding() function. You will probably also want to set the default encoding for the other mb_* string functions with mb_internal_encoding().
So when dealing with, say, French text, I start with these:
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
setlocale(LC_ALL, 'fr-fr');
?>
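With those settings in place, a replacement on accented text might look like the sketch below. It assumes the mbstring extension is built with multibyte regex support and that the source file itself is saved as UTF-8; the sample string is made up.

```php
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

$text = 'Été comme hiver'; // made-up UTF-8 sample

// Each accented letter is treated as one character, not as separate bytes.
$plain = mb_ereg_replace('[éÉ]', 'e', $text);

echo $plain, "\n"; // prints: ete comme hiver
```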
steve at stevedix dot de
20-Jul-2004 02:17
Something to bear in mind is that regex is actually a declarative programming language like Prolog: your regex is a set of rules which the regex interpreter tries to match against a string. During this matching, the interpreter will assume certain things, and continue assuming them until it comes up against a failure to match, which then causes it to backtrack. Regex assumes "greedy matching" unless explicitly told not to, which can cause a lot of backtracking. A general rule of thumb is that the more backtracking, the slower the matching process.
It is therefore vital, if you are trying to optimise your program to run quickly (and if you can't do without regex), to optimise your regexes to match quickly.
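To make the greedy versus non-greedy distinction concrete, here is a small sketch using preg_match(). The input string and variable names are invented for illustration; the third pattern shows one common way to reduce backtracking, not the only one.

```php
<?php
$html = '<b>one</b> and <b>two</b>'; // made-up input

// Greedy: .* first runs to the end of the string, then backtracks
// until the last </b> can match.
preg_match('#<b>(.*)</b>#', $html, $greedy);

// Non-greedy: .*? grows one character at a time and stops at the first </b>.
preg_match('#<b>(.*?)</b>#', $html, $lazy);

// Negated class: [^<]* can never run past a "<", so there is
// very little for the engine to backtrack over.
preg_match('#<b>([^<]*)</b>#', $html, $negated);

echo $greedy[1], "\n";  // one</b> and <b>two
echo $lazy[1], "\n";    // one
echo $negated[1], "\n"; // one
```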
I recommend the use of a tool such as "The Regex Coach" to debug your regex strings.
http://weitz.de/files/regex-coach.exe (Windows installer) http://weitz.de/files/regex-coach.tgz (Linux tar archive)
Biju
21-Sep-2003 06:00
Regular expression tutorials from non-PHP sites:
http://www.amk.ca/python/howto/regex/
http://sitescooper.org/tao_regexps.html
http://www.english.uga.edu/humcomp/perl/regex2a.html
http://www.english.uga.edu/humcomp/perl/regexps.html
http://www.english.uga.edu/humcomp/perl/regular_expressions.HTML
http://www.english.uga.edu/humcomp/perl/
http://java.sun.com/docs/books/tutorial/extra/regex/
http://gnosis.cx/publish/programming/regular_expressions.html
http://www.zvon.org/other/PerlTutorial/Books/Book1/
http://it.metr.ou.edu/regex/
http://www.regular-expressions.info/
hrz at geodata dot soton dot ac dot uk
06-Mar-2002 08:33
If you're venturing into new regular expression territory with a lack of useful examples then it would pay to get familiar with this page:
http://www.pcre.org/man.txt