Mastering Regular Expressions (2)
Friday, March 30th, 2007
First chapter complete! And so I shall share with thee my knowledge of this divine subject…
The book includes a summary of the chapter, so I’ll basically just hit those points. Here are the symbols and what they mean.
- . or dot means any single character: letter, digit or symbol.
- [] encloses a character class, which has its own set of rules. It will match any one character listed. For example, [0-9a-zA-Z] is looking for any 1 character that is a digit, a lowercase letter or an uppercase letter. The hyphen in a character class shows a range, like 0-9 means 0 through 9. The only exception is if it’s at the beginning, where it is just a literal hyphen, so [-0-9] would match either a hyphen or a digit.
- \char can change the value or escape a character. For instance, \. would mean literally a ., while \< and \> can mean the beginning and ending of a word, respectively.
- ? applies to the preceding 1 character or expression, meaning there may be 0 or 1 of the expression. So you could say colou?r to allow an optional u in the word color. If you said [0-9a-zA-Z]? it would mean an optional letter or digit.
- * applies to the preceding 1 character or expression, meaning there may be 0 or more. So if you said [0-9]* it would mean that you were allowing any number of digits, or none.
- + applies to the preceding 1 character or expression, meaning there may be 1 or more of the expression. So if you put [0-9]+ that would mean that there needs to be at least 1 digit, but there can be any number of them.
- ^ (also called a caret) means to match from the beginning of the line or string, so if you said ^[0-9], it would not match the string “I’m 12”, but it would match the string “12 I am” (it would match the 1, since the character class only matches one character). It also has a special meaning inside of a character class. If you put a caret at the start of a character class, like [^0-9], that means any character that is not a ______. In this case, any character that is not a digit.
- $ means it matches at the end of a line. So if we used our previous example, but tweaked it a little bit, ^[0-9]$ would mean the string or line would need to have 1 digit on it, and nothing else. Both “12 I am” and “I’m 12” would not match. “12” would also not match. It would match “1” or “2”. On the other hand, if you put in ^[0-9]+$ that would mean that it would match “1”, as well as “12”, or “314159265” etc.
- \<, as I briefly mentioned earlier, means the start of a word. So if you said “\<at” (I’m starting to use double quotes to make it more clear) that would match “at” or even “attached”, but it would not match “categories”.
- \> means the end of a word, so if you put “ate\>” it would match “ate” and “fate”, but not “categories”. “\<ate\>” would only match the word “ate”. A note on the last two: they are not supported by all regex utilities, so you should test them before you rely on them.
- | means alternation: match either the expression on its left or the one on its right. It is usually used inside of parentheses, which limit how far the alternation reaches. For example, if you were matching an extension of a file, like an image, your regex might look something like “(jpg|jpeg|gif|bmp|png|tiff)$”. That would mean any of those options would match.
- () (parentheses) can be used for alternation, or for grouping so that the symbols ?, * and + will work on an entire expression. For instance, if you said ([0-9]\.)+ that would mean you would need the whole expression ([0-9]\.) 1 or more times to match. Parentheses are also used for captures (coming up next).
- \1, \2, etc… refer to back references. This means that \1 will refer to the text matched in the first set of parentheses, \2 to the second, and so on. The example used in the book was for editing. If you wanted to find every time a word was repeated, you could use a regex like “\<([a-zA-Z]+) +\1\>” and that would match any word that doubled itself. In English it says: start at the beginning of a word, followed by a word with 1 or more letters, with 1 or more spaces between it and the next word, while the next word is the same as the first word. If that’s a bit confusing, I’m sorry, but here’s an example it would match: “He ran ran to the store.” It would also match “He ran    ran to the store,” or a number of other combinations. Back references may not be supported by all regex checkers, so make sure you test them.
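If you want to poke at these yourself, here is a rough sketch in Python (my pick for illustration, not something the book uses here). One caveat: Python’s re module does not support \< and \>, so the sketch swaps in \b, which matches at either edge of a word. The helper names first_match and find_doubled_words are made up for this example.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

def find_doubled_words(text):
    """List every word that is immediately repeated, using back reference \\1."""
    # \b stands in for \< and \>, which Python's re does not support.
    return [m.group(1) for m in re.finditer(r"\b([a-zA-Z]+) +\1\b", text)]

# Character classes: a leading hyphen is literal, a leading caret negates.
assert first_match(r"[-0-9]", "a-b") == "-"
assert first_match(r"[^0-9]", "12a") == "a"

# Quantifiers: ? is 0 or 1, * is 0 or more, + is 1 or more.
assert first_match(r"colou?r", "colour") == "colour"
assert first_match(r"[0-9]*", "abc") == ""    # zero digits still counts
assert first_match(r"[0-9]+", "abc") is None  # needs at least one digit

# Anchors: ^ is the start of the string, $ is the end.
assert first_match(r"^[0-9]", "12 I am") == "1"
assert first_match(r"^[0-9]", "I'm 12") is None
assert first_match(r"^[0-9]$", "12") is None
assert first_match(r"^[0-9]+$", "314159265") == "314159265"

# Word boundaries (\b in place of \< and \>).
assert first_match(r"\bat", "attached") == "at"
assert first_match(r"\bat", "categories") is None
assert first_match(r"ate\b", "fate") == "ate"
assert first_match(r"\bate\b", "fate") is None

# Alternation and grouping.
assert first_match(r"(jpg|jpeg|gif|bmp|png|tiff)$", "photo.jpeg") == "jpeg"
assert first_match(r"([0-9]\.)+", "v1.2.3.") == "1.2.3."

# Back references: find repeated words.
print(find_doubled_words("He ran ran to the store."))  # ['ran']
```

Other tools (grep, egrep, Perl and so on) spell some of these differently, so treat this as a way to experiment rather than the one true syntax.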
I hope that helped, and if you have any questions, feel free to contact me or get the book. It explains it far better than I did (that was basically a 30 page chapter). I’m just giving you the highlights if you want to get into the nitty gritty fast.
