Regular expressions (or regex) are a powerful way to traverse large strings in order to find information. They rely on underlying patterns in a string’s structure to work their magic. Unfortunately, simple regular expressions are unable to cope with complex patterns and symbols. To deal with this dilemma, you can use advanced regular expressions.
Below, we present an introduction to advanced regular expressions, with eight commonly used concepts and examples. Each example outlines a simple way to match patterns in complex strings. If you do not yet have experience with basic regular expressions, have a look at this article to get started. The syntax used here matches PHP's Perl-compatible regular expressions.
1. Greediness/Laziness
All regex repetition operators are greedy by default. They try to match as much of a string as possible. Unfortunately, this is not always the desired effect, and lazy operators solve the problem. They match the smallest possible pattern and are written by adding a '?' after the respective greedy operator. Alternatively, the 'U' modifier inverts greediness, making all repetition operators lazy unless they are followed by a '?'. Differentiating between greediness and laziness is key to fully understanding advanced regular expressions.
Greedy Operators
The * operator matches the previous expression 0 or more times. It is a greedy operator. Consider the following expression:
preg_match( '/<h1>.*<\/h1>/', '<h1>This is a heading.</h1><h1>This is another one.</h1>', $matches );
Recall that a . means any character except a new line. The above regular expression is looking for an h1 tag and all of its contents. It uses the . and * operators to constantly match anything inside the tag. This pattern will match:
<h1>This is a heading.</h1><h1>This is another one.</h1>
It returns the whole string. The * operator will continuously match everything -- even the middle closing h1 tag -- because it is greedy. Matching the whole string is the best it can do.
Lazy Operators
Let's change the above operator by adding a '?' after it. This will make it lazy:
/<h1>.*?<\/h1>/
The regex now fulfills its duty and matches only the first h1 tag. Another greedy operator that uses this same property is {n,}. This matches the previous expression n or more times. If it is used without a question mark, it looks for the most repetitions possible. Otherwise, it starts from n repetitions:
# Set up a String
$str = 'hihi';
# Match it using the greedy {n,} operator
preg_match( '/(hi){1,}/', $str, $matches ); # $matches[0] will be 'hihi'
# Match it with the lazy {n,}? operator
preg_match( '/(hi){1,}?/', $str, $matches ); # $matches[0] will be 'hi'
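The difference is easiest to see by counting matches. The following is a minimal sketch, assuming the two headings sit on a single line (the sample string and variable names are illustrative); it compares the greedy pattern, the lazy pattern, and the 'U' modifier:

```php
<?php
// Illustrative input: two headings on one line.
$html = '<h1>This is a heading.</h1><h1>This is another one.</h1>';

// Greedy: .* runs to the last closing tag, so there is one long match.
preg_match_all('/<h1>.*<\/h1>/', $html, $greedy);

// Lazy: .*? stops at the first closing tag, so each heading matches on its own.
preg_match_all('/<h1>.*?<\/h1>/', $html, $lazy);

// The U modifier inverts greediness, so a plain .* behaves lazily here.
preg_match_all('/<h1>.*<\/h1>/U', $html, $ungreedy);

echo count($greedy[0]), "\n";   // 1
echo count($lazy[0]), "\n";     // 2
echo count($ungreedy[0]), "\n"; // 2
```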
2. Back Referencing
What it does
Back referencing is a way to refer to previously matched patterns inside a regular expression. For example, take a look at this simple regex that matches an expression in quotes:
# Set up an array of matches
$matches = array();
# Create a String
$str = "\"This is a 'string'\"";
# Traverse it with regular expressions
preg_match( "/(\"|').*?(\"|')/", $str, $matches );
# Print the whole match
echo $matches[0];
Unfortunately, this will not correctly match the string. Instead, it will print:
"This is a '
This regular expression matches the opening double quote but finds a different type of quote to close it. This is because it was given the option of picking a double or single quote at the end. In order to fix this, you can use back referencing. The expressions \1, \2, ..., \9 hold references to previously captured subpatterns. The first matched quote, in this case, will be held by \1.
How to Use It
In order to apply this concept to the aforementioned example, use \1 in place of the last quote:
preg_match( '/("|\').*?\1/', $str, $matches );
This will now correctly return:
"This is a 'string'"
Remember that back referencing may also be used by preg_replace. Note that in the replacement string you should use $1 ... $9 ... $n instead of \1 ... \9 (any number of these will work). For example, if you want to replace all paragraph tags with escaped text that represents them, use:
$text = preg_replace( '/<p>(.*?)<\/p>/',
    '&lt;p&gt;$1&lt;/p&gt;', $html );
The $1 back reference holds the text inside the paragraph and is being used in the replace pattern itself. This completely valid expression shows an easy way to access matched patterns even while replacing.
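As a minimal sketch of both forms, the snippet below uses \1 inside a pattern and $1/$2 inside a replacement string. The 'hello world' input and the variable names are illustrative and not part of the examples above:

```php
<?php
// \1 inside the pattern: the closing quote must equal the opening one.
$str = "\"This is a 'string'\"";
preg_match('/("|\').*?\1/', $str, $matches);
echo $matches[0], "\n"; // "This is a 'string'"

// $1 and $2 inside the replacement: swap two captured words.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');
echo $swapped, "\n"; // world hello
```

Single-quoted PHP strings keep the backslash in \1 intact, which is why the pattern is written with single quotes here.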
3. Named Groups
When using multiple back references, a regular expression can quickly become confusing and hard to understand. An alternative way to back reference is by using named groups. A named group is specified by using (?P<name>pattern), where name is the name of the group and pattern is the regular expression in the group itself. The group can then be referred to by (?P=name). For example, consider the following:
/(?P<quote>"|').*?(?P=quote)/
The above expression will create the same effect as the previous back reference example, but by instead using named groups. It is also significantly easier to read.
Named groups are also useful when sifting through the array of matches. The name given to a specific pattern is also the key of the corresponding matches array.
preg_match( '/(?P<quote>"|\')/', "'String'", $matches );
# This will print "'"
echo $matches[1];
# This will also print "'", as it is a named group
echo $matches['quote'];
Thus, named groups not only make code easier to read but also organize it.
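A short sketch of named groups on a slightly larger pattern follows. The date format, the group names year/month/day, and the sample values are assumptions chosen for illustration:

```php
<?php
// Illustrative input: an ISO-style date.
$date = '2010-06-15';

// Each part of the date gets its own name.
preg_match('/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/', $date, $m);

echo $m['year'], "\n";  // 2010
echo $m['month'], "\n"; // 06
echo $m[3], "\n";       // 15 (named groups keep their numeric index too)

// A named back reference: the closing quote must equal the opening one.
preg_match('/(?P<quote>"|\').*?(?P=quote)/', "say 'hi' now", $q);
echo $q[0], "\n";       // 'hi'
```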
4. Word Boundaries
Word boundaries are places in a string that come between a word character (a letter, digit, or underscore) and a non-word character. The specialty of these boundaries is the fact that they don't actually match a character. Their length is zero. The \b regular expression matches any word boundary.
Unfortunately, boundaries are so often skimmed over that many do not recognize their real significance. For example, let's say you want to match the word "import":
/import/
Watch out! Regular expressions can be tricky. The above expression will also match:
important
You may think it is as simple as adding a space before and after import to prevent these bogus matches:
/ import /
But what about this case?
The trader voted for the import
When import is at the beginning or the end of a string, the modified regex will fail. Thus, splitting this up into cases is required:
/(^import | import | import$)/i
Looking back at our regular expression, it does not take periods or other punctuation into account. Just to match this single word, a regular expression may look like this:
/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i
That's a lot of code to match just a single word. This is why word boundaries are so significant. To accomplish the above statement and many other variations with word boundaries, all that is necessary is:
/\bimport\b/
This will match every case above and more. The flexibility of \b comes from the fact that it matches a zero-length string. All it matches is an imaginary position between two characters. It checks whether one of the characters is a word character and the other is a non-word character. If so, it matches. If the beginning or end of a string is encountered, \b treats it as a non-word character. Because the i in import is a word character, the start of the string still counts as a boundary, and the regex matches import.
Note that the opposite of \b is \B. This operator matches the position between two word characters or between two non-word characters. Thus, if you would like to match 'hi' only inside another word, you could use:
\Bhi\B
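The behaviour of both operators can be checked with preg_match_all. This is a minimal sketch with made-up sample sentences:

```php
<?php
// Illustrative text containing 'import' as a word and inside 'Important'.
$text = 'Important: the import tax applies to every import.';

// \b keeps the match to whole words, regardless of punctuation or position.
preg_match_all('/\bimport\b/i', $text, $whole);
echo count($whole[0]), "\n"; // 2 ('Important' is skipped)

// \B requires word characters on both sides, so a standalone 'hi' is skipped.
preg_match_all('/\Bhi\B/', 'this hi while', $inner);
echo count($inner[0]), "\n"; // 2 (inside 'this' and 'while')
```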
5. Atomic Groups
Atomic groups are special regex groups that are non-capturing. They are usually used to increase the efficiency of a regular expression, but may also be applied to eliminate certain matches. An atomic group is specified by using (?>pattern):
/(?>his|this)/
When the regex engine leaves an atomic group after matching it, it discards the backtracking positions remembered by all tokens inside the group. If a later part of the pattern fails, the engine cannot step back into the group to try another alternative or a shorter repetition; the attempt at that starting position simply fails. Consider the word 'smashing'. Using the above regular expression, the engine tries 'his' and then 'this' at each position and finds neither, so the atomic group makes no difference here: alternatives inside the group are still tried in order until one of them matches. The group only pays off when something after it can fail.
The above example does not have many practical uses. We might as well have used /t?his/ instead. Look at the following:
/\b(engineer|engrave|end)\b/
If the regex engine is given the word 'engineering', the first alternative, 'engineer', matches. The following word boundary, \b, does not, because 'engineer' is followed by 'i'. The engine therefore backtracks into the group and tries the next alternative, 'engrave': 'eng' matches, but the rest does not. Finally, 'end' is attempted and also fails. If you look carefully, you will realize that once the engine has matched 'engineer' and failed on the last word boundary, it cannot possibly match 'engrave' or 'end'. No text can start with 'engineer' and with one of the other two words at the same time, so the regex engine should not continue with the other trials.
/\b(?>engineer|engrave|end)\b/
The above is a much better alternative that will save the regex engine time and improve the code's efficiency.
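Atomic groups can also change whether a pattern matches at all, not only how fast it fails. The sketch below uses a deliberately tiny pattern, a(bc|b)c, which is an illustration and not taken from the examples above:

```php
<?php
$subject = 'abc';

// Ordinary group: 'bc' is tried first, the trailing c then fails,
// so the engine backtracks into the group, takes 'b', and c matches.
$plain = preg_match('/a(?:bc|b)c/', $subject);

// Atomic group: once 'bc' has matched, the alternative 'b' is thrown away,
// so the trailing c can never match and the whole pattern fails.
$atomic = preg_match('/a(?>bc|b)c/', $subject);

echo $plain, "\n";  // 1
echo $atomic, "\n"; // 0

// The pattern from the text: no match on 'engineering', a match on 'engineer'.
echo preg_match('/\b(?>engineer|engrave|end)\b/', 'engineering'), "\n"; // 0
echo preg_match('/\b(?>engineer|engrave|end)\b/', 'an engineer'), "\n"; // 1
```

Because of this, the order of alternatives inside an atomic group matters: a shorter alternative placed after a longer one that shares its prefix may never get a chance.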
6. Recursion
Recursion in regular expressions can be used to match nested constructs, such as parentheses, (this (that)), and HTML tags, <div></div>. It requires the use of (?R), an operator that recursively matches the whole pattern. Consider the regular expression that matches nested parentheses:
/\(((?>[^()]+)|(?R))*\)/
The escaped outer parentheses, \( and \), match the literal opening and closing parentheses of the construct. Between them comes a group with two alternatives: it matches either a run of non-parenthesis characters, (?>[^()]+), or the whole expression again as a sub-pattern, (?R). Notice that this group is repeated with * as many times as possible to match all nested parentheses.
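A minimal sketch of the nested-parentheses pattern in use follows; the sample strings are invented for illustration:

```php
<?php
// Literal ( ... ) around a repeated group of either non-parenthesis runs
// or the whole pattern again via (?R).
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

$text = 'call(this (that) and (those (too))) done';
preg_match($pattern, $text, $m);
echo $m[0], "\n"; // (this (that) and (those (too)))

// With an unbalanced opening parenthesis, only the balanced part matches.
preg_match($pattern, '((a)', $u);
echo $u[0], "\n"; // (a)
```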
Another example of recursion at work is the following:
/<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/
The above expression combines character classes, lazy operators, back references, recursion, and atomic groups to match nested tags. The first parenthesized group, ([\w]+), captures the tag name for use later in the regular expression. The .*?> that follows matches the rest of the opening tag. The next parenthesized sub-expression is very similar to the one above: it either matches non-tag characters, (?>[^<>]+), or recurses over another tag, (?R). Finally, the last part of the expression, <\/\1>, matches the closing tag.
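Here is a small sketch of the tag pattern applied to well-formed markup. The sample HTML is illustrative, and the pattern is written with the backslashes that a PHP single-quoted string requires:

```php
<?php
// Tag name captured in group 1, reused as \1 for the closing tag;
// (?R) recurses into any nested tag.
$pattern = '/<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/';

$html = 'before <div class="box"><p>text</p></div> after';
preg_match($pattern, $html, $m);

echo $m[0], "\n"; // <div class="box"><p>text</p></div>
```

Treat this as a demonstration of recursion and not as an HTML validator: on malformed markup the lazy .*? can stretch across a stray tag while backtracking, so mismatched nesting may still produce a match.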
7. Callbacks
Certain matches in a pattern may require special modifications. In order to apply multiple or complex changes, callbacks can be used. A callback produces dynamic substitution strings in the preg_replace_callback function, which takes a function as a parameter and calls it whenever a match is found. This function receives the match array as a parameter and returns a modified string that is used as the replacement.
As an example, consider a regular expression that capitalizes the first letter of every word in a given string. Unfortunately, PHP's regex syntax has no operator that changes the case of a character. To accomplish this task, a callback may be used. First, the expression must match all letters that need to be capitalized:
/\b\w/
The above uses both word boundaries and character classes to work. Now that we have this expression, we can write a callback function:
function upper_case( $matches ) {
    return strtoupper( $matches[0] );
}
upper_case takes in an array of matches and returns the whole matched pattern in uppercase. $matches[0], in this case, represents the letter that needs to be capitalized. All of this can now be put together using the preg_replace_callback function:
preg_replace_callback( '/\b\w/', "upper_case", $str );
That is the power of a simple callback.
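The same idea can be written with an anonymous function instead of a named one. The sketch below assumes illustrative input strings and adds a second callback that does arithmetic on each match, something a plain replacement string cannot do:

```php
<?php
$str = 'advanced regular expressions';

// Capitalize the first letter of every word.
$title = preg_replace_callback('/\b\w/', function (array $matches): string {
    return strtoupper($matches[0]);
}, $str);
echo $title, "\n"; // Advanced Regular Expressions

// Double every number found in the text.
$doubled = preg_replace_callback('/\d+/', function (array $matches): string {
    return (string) ((int) $matches[0] * 2);
}, 'Buy 3 apples and 10 pears');
echo $doubled, "\n"; // Buy 6 apples and 20 pears
```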
8. Commenting
Commenting is not a way to actually match strings, but it is one of the most important parts of regular expressions. As you dive deep into larger, more complex expressions, it becomes hard to decipher what is actually being matched. Using comments in the middle of regular expressions is the perfect way to minimize such confusion.
To place a comment inside a regular expression, use the (?#comment) format. Replace "comment" with the word(s) of your choice:
/(?#digit)\d/
It is especially important to comment regular expressions that you release to the public. Users of your regex will be able to easily understand and modify the pattern to meet their needs. It can even go so far as to help you decode it when revisiting a program.
Consider using the "x" or (?x) modifier for free-spacing mode with comments. This causes a regular expression to ignore white space between tokens and to treat everything from an unescaped # to the end of the line as a comment. A literal space can still be represented with [ ] or \  (a backslash and a space):
/
\d   #digit
[ ]  #space
\w+  #word
/x
The above is the same as:
/\d(?#digit)[ ](?#space)\w+(?#word)/
Always create well-documented code.
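To confirm that the two commenting styles behave identically, the following sketch runs both against the same input. The sample string 'Room 7 north' is an assumption for illustration:

```php
<?php
// Free-spacing mode: white space is ignored and # starts a comment.
$spaced = '/
    \d    # a single digit
    [ ]   # a literal space
    \w+   # one or more word characters
/x';

// The same pattern with inline (?#...) comments.
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

preg_match($spaced, 'Room 7 north', $a);
preg_match($inline, 'Room 7 north', $b);

echo $a[0], "\n"; // 7 north
echo $b[0], "\n"; // 7 north
```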
Further Resources
- Regular-Expressions.info: Comprehensive website on regular expressions
- Cheat Sheet: Informative regular expressions cheat sheet
- Regex Generator: JavaScript regular expressions generator