Regular Expressions in PHP

Perl-compatible Regular Expressions

Perl Compatible Regular Expressions (normally abbreviated as “PCRE”) offer a very powerful string-matching and replacement mechanism that far surpasses anything we have examined so far.
Regular expressions are often thought of as very complex, and at times they can be. Used properly, however, they are relatively simple to understand and fairly easy to use. Because of their flexibility, they are also much more computationally intensive than the simple search-and-replace functions we examined earlier in this chapter. Therefore, you should use them only when appropriate: that is, when using the simpler functions is either impossible or so complicated that it’s not worth the effort.

A regular expression is a string that describes a set of matching rules. The simplest possible regular expression is one that matches only one string; for example, Davey matches only the string “Davey”. In fact, such a simple regular expression would be pointless, as you could just as easily perform the match using strpos(), which is a much faster alternative.
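As a minimal sketch of that comparison, the snippet below looks for a fixed string with both strpos() and preg_match(). The sample text and variable names are made up for illustration.

```php
<?php
// Hypothetical sample text, used only for illustration.
$text = 'Hello, Davey!';

// strpos() returns the offset of the match or false, so compare with !== false.
$foundWithStrpos = strpos($text, 'Davey') !== false;

// preg_match() returns 1 on a match; for a fixed string the pattern adds nothing.
$foundWithRegex = preg_match('/Davey/', $text) === 1;
```

Both variables end up true, which is why the simpler function is the better choice when the text to find is known in advance.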
The real power of regular expressions comes into play when you don’t know the exact string that you want to match. In this case, you can specify one or more metacharacters and quantifiers, which do not have a literal meaning, but instead stand to be interpreted in a special way.
In this chapter, we will discuss the basics of regular expressions that are required by the exam. More thorough coverage is provided by the PHP manual, or by one of the many regular expression books available (most notably, Mastering Regular Expressions, by Jeffrey Friedl, published by O’Reilly Media).
Delimiters
A regular expression is always delimited by a starting and an ending character. In PHP, the delimiter can be any character that is not alphanumeric, a backslash, or whitespace, and the ending delimiter must match the starting one (bracket-style pairs such as ( ) or { } can also be used). Since any occurrence of the delimiter inside the expression itself must be escaped, it’s usually a good idea to pick a delimiter that isn’t likely to appear inside the expression. By convention, the forward slash is used for this purpose, although another character, such as the octothorpe (#), is sometimes used when dealing with pathnames or URLs.
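To see why the choice of delimiter matters, here is a small sketch that matches the same URL prefix twice, once with forward slashes and once with the octothorpe. The URL and variable names are assumptions chosen for illustration.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$url = 'http://example.com/docs/index.html';

// With "/" as the delimiter, every literal slash must be escaped.
$withSlashes = '/^http:\/\/example\.com\/docs\//';

// With "#" as the delimiter, the slashes can be written as-is.
$withHash = '#^http://example\.com/docs/#';

$a = preg_match($withSlashes, $url); // 1
$b = preg_match($withHash, $url);    // 1
```

Both patterns describe the same rule; the second is simply easier to read.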
Metacharacters
The term “metacharacter” is a bit of a misnomer, as a metacharacter can actually be composed of more than one character. However, every metacharacter stands for a single character in the matched string or, in the case of ^ and $, for a position in it. Here are the most common ones:
. Match any character (by default, any character except a newline)
^ Match the start of the string
$ Match the end of the string
\s Match any whitespace character
\d Match any digit
\w Match any “word” character
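The following sketch tries each of these metacharacters against a short string. The sample strings and variable names are hypothetical and serve only as an illustration.

```php
<?php
// Hypothetical sample string, used only for illustration.
$line = 'Room 42';

$startsWithR   = preg_match('/^R/', $line);    // 1: "R" at the start of the string
$endsWithDigit = preg_match('/\d$/', $line);   // 1: a digit at the end of the string
$spaceDigit    = preg_match('/\s\d/', $line);  // 1: whitespace followed by a digit
$wordAnyWord   = preg_match('/\w.\w/', 'a-b'); // 1: "." stands for the "-"
$anyDigit      = preg_match('/\d/', 'Room');   // 0: no digit anywhere
```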
Metacharacters can also be expressed using grouping expressions. For example, a series of valid alternatives for a character can be provided by using square brackets:
/ab[cd]e/
The expression above will match both abce and abde. You can also use other metacharacters, and provide ranges of valid characters inside a grouping expression:
/ab[c-e\d]/
This will match abc, abd, abe and any combination of ab followed by a digit.
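A sketch of a grouping expression that combines a range with another metacharacter: [c-e\d] accepts c, d, e or any single digit. The ^ and $ anchors are an addition made for this example, so that only the whole string counts; the candidate strings are made up.

```php
<?php
// "ab" followed by exactly one of c, d, e or a digit, and nothing else.
$pattern = '/^ab[c-e\d]$/';

$results = [];
foreach (['abc', 'abe', 'ab7', 'abf', 'abcd'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
}
// $results: abc => 1, abe => 1, ab7 => 1, abf => 0, abcd => 0
```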
Quantifiers
A quantifier allows you to specify the number of times a particular character or metacharacter can appear in a matched string. There are four types of quantifiers:
* The character can appear zero or more times
+ The character can appear one or more times
? The character can appear zero or one times
{n,m} The character can appear at least n times, and no more than m.
The upper limit can be omitted ({n,}) to indicate a minimum with no maximum. A missing lower limit ({,m}) is not accepted as a quantifier by every PCRE version, so {0,m} is the safer way to express a maximum without a minimum.
Thus, for example, the expression ab?c matches both ac and abc, while ab{1,3}c matches abc, abbc and abbbc.
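The quantifiers can be checked with a short sketch like the one below. The patterns are anchored with ^ and $ (an addition for this example) so that the whole string has to match; the variable names are arbitrary.

```php
<?php
$r1 = preg_match('/^ab?c$/', 'ac');         // 1: zero "b"s is allowed
$r2 = preg_match('/^ab?c$/', 'abbc');       // 0: two "b"s is too many
$r3 = preg_match('/^ab{1,3}c$/', 'abbbc');  // 1: three "b"s is within range
$r4 = preg_match('/^ab{1,3}c$/', 'abbbbc'); // 0: four "b"s exceeds the maximum
$r5 = preg_match('/^ab*c$/', 'ac');         // 1: "*" accepts zero occurrences
$r6 = preg_match('/^ab+c$/', 'ac');         // 0: "+" needs at least one
$r7 = preg_match('/^ab{2,}c$/', 'abbbbc');  // 1: minimum of two, no maximum
```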
Sub-Expressions
A sub-expression is a regular expression contained within the main regular expression (or another sub-expression); you define one by encapsulating it in parentheses:
/a(bc.)e/
This expression will match the letter a, followed by the letters b and c, followed by any character and, finally, the letter e. As you can see, sub-expressions by themselves do not have any influence on the way a regular expression is executed; however, you can use them in conjunction with quantifiers to allow complex expressions to occur more than once. For example:
/a(bc.)+e/
This expression will match the letter a, followed by the expression bc. repeated one or more times, followed by the letter e.
Sub-expressions can also be used as capturing patterns, which we will examine in the next section.
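As a rough illustration of a quantified sub-expression, the sketch below tries an anchored version of /a(bc.)+e/ against a few made-up strings. The anchors and the sample strings are assumptions added for this example.

```php
<?php
// "a", then one or more repetitions of "bc" plus any character, then "e".
$pattern = '/^a(bc.)+e$/';

$once    = preg_match($pattern, 'abcXe');    // 1: "bcX" appears once
$twice   = preg_match($pattern, 'abcXbcYe'); // 1: "bcX" and "bcY"
$missing = preg_match($pattern, 'ae');       // 0: the group must appear at least once
$short   = preg_match($pattern, 'abce');     // 0: "bce" uses up the "e", none is left
```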
Matching and Extracting Strings
The preg_match() function can be used to match a regular expression against a given string. The function returns 1 if the pattern matches, 0 if it does not, and false if an error occurs. If an optional third parameter is supplied, it is passed by reference and filled with an array of the captured subpatterns. Here’s an example:
<?php
$name = "Davey Shafik";
// Simple match: only letters and whitespace, from start to end
$regex = '/^[a-zA-Z\s]+$/';
if (preg_match($regex, $name)) {
   // Valid Name
}
// Match with subpatterns and capture
$regex = '/^(\w+)\s(\w+)/';
$matches = array();
if (preg_match($regex, $name, $matches)) {
   var_dump($matches);
}
?>
If you run the second example, you will notice that the $matches array is populated, on return, with the following values:
array(3) {
   [0]=>
   string(12) "Davey Shafik"
   [1]=>
   string(5) "Davey"
   [2]=>
   string(6) "Shafik"
}
As you can see, the first element of the array contains the entire matched string, while the second element (index 1) contains the first captured subpattern, and the third element (index 2) contains the second captured subpattern.
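Because preg_match() returns 1, 0 or false rather than a plain boolean, it can help to check the result explicitly before reading the captures. The sketch below wraps that check in splitName(), a hypothetical helper written for this example; it is not part of PHP.

```php
<?php
// splitName() is a hypothetical helper, defined only for this sketch.
function splitName(string $name): ?array
{
    $result = preg_match('/^(\w+)\s(\w+)$/', $name, $matches);
    if ($result !== 1) {
        // 0 means "no match"; false means the pattern could not be applied.
        return null;
    }
    // $matches[0] holds the whole match; 1 and 2 hold the captured subpatterns.
    return ['first' => $matches[1], 'last' => $matches[2]];
}

$parts = splitName('Davey Shafik'); // ['first' => 'Davey', 'last' => 'Shafik']
$none  = splitName('Davey');        // null: there is no second word
```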
Performing Multiple Matches
The preg_match_all() function allows you to perform multiple matches on a given string based on a single regular expression. For example:
<?php
$string = "a1bb b2cc c2dd";
$regex = "#([abc])\d#";
$matches = array();
if(preg_match_all($regex, $string, $matches)){
   var_dump($matches);
}
?>
This script outputs the following:
array(2) {
   [0]=>
   array(3) {
      [0]=>
      string(2) "a1"
      [1]=>
      string(2) "b2"
      [2]=>
      string(2) "c2"
   }
   [1]=>
   array(3) {
      [0]=>
      string(1) "a"
      [1]=>
      string(1) "b"
      [2]=>
      string(1) "c"
   }
}
As you can see, all the whole-pattern matches are stored in the first sub-array of the result, while the first captured subpattern of every match is stored in the corresponding slot of the second sub-array.
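If it is more convenient to have one array per match instead of one array per subpattern, preg_match_all() accepts the PREG_SET_ORDER flag as a fourth argument. The sketch below reuses the string and pattern from the example above; the variable names $sets and $letters are arbitrary.

```php
<?php
$string = "a1bb b2cc c2dd";

// PREG_SET_ORDER groups the results by match rather than by subpattern.
$count = preg_match_all('#([abc])\d#', $string, $sets, PREG_SET_ORDER);

// $count is 3; $sets[0] is ['a1', 'a'], $sets[1] is ['b2', 'b'], $sets[2] is ['c2', 'c'].
$letters = [];
foreach ($sets as $set) {
    $letters[] = $set[1]; // first captured subpattern of this match
}
```

The return value of preg_match_all() is the number of full matches found, which is why it can be used directly in an if statement.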
Using PCRE to Replace Strings
Whilst str_replace() is quite flexible, it still only works on “whole” strings, that is, where you know the exact text to search for. Using preg_replace(), however, you can replace text that matches a pattern you specify. It is even possible to reuse captured subpatterns directly in the substitution string by prefixing their index with a dollar sign. In the example below, we use this technique to replace the entire matched pattern with a string that is composed using the first captured subpattern ($1).
<?php
$body = "[b]Make Me Bold![/b]";
$regex = "@\[b\](.*?)\[/b\]@i";
$replacement = '<b>$1</b>';
$body = preg_replace($regex, $replacement, $body);
?>
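The question mark in (.*?) makes the quantifier stop at the first closing tag instead of the last one. A small sketch, using a made-up input with two bold sections, shows the difference; the variable names are arbitrary.

```php
<?php
// Hypothetical input with two tagged sections.
$body = '[b]One[/b] and [b]Two[/b]';

// Lazy: each [b]...[/b] pair is replaced on its own.
$lazy = preg_replace('@\[b\](.*?)\[/b\]@i', '<b>$1</b>', $body);
// <b>One</b> and <b>Two</b>

// Greedy: .* runs to the last [/b], swallowing the tags in between.
$greedy = preg_replace('@\[b\](.*)\[/b\]@i', '<b>$1</b>', $body);
// <b>One[/b] and [b]Two</b>
```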
Just like with str_replace(), we can pass arrays of search and replacement arguments; however, unlike str_replace(), we can also pass in an array of subjects on which to perform the search-and-replace operation. This can speed things up considerably, since the regular expression (or expressions) are compiled once and reused multiple times. Here’s an example:
<?php
$subjects['body'] = "[b]Make Me Bold![/b]";
$subjects['subject'] = "[i]Make Me Italics![/i]";
$regex[] = "@\[b\](.*?)\[/b\]@i";
$regex[] = "@\[i\](.*?)\[/i\]@i";
// Single quotes keep $1 safely away from variable interpolation
$replacements[] = '<b>$1</b>';
$replacements[] = '<i>$1</i>';
$results = preg_replace($regex, $replacements, $subjects);
?>
When you execute the code shown above, you will end up with an array that looks like this:
array(2) {
   ["body"]=>
   string(20) "<b>Make Me Bold!</b>"
   ["subject"]=>
   string(23) "<i>Make Me Italics!</i>"
}

 
