For whom is this tutorial

This tutorial is for anyone who is curious about regular expressions and has little or no experience with them. It presumes that the reader has some familiarity with UNIX command-line tools.

POSIX has standardised two different sets of regular expressions: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). This tutorial covers the basics of using BRE. Once they are learned, it is easy to learn how to use ERE. BRE can be used in most programs, such as sed and grep. egrep and awk use ERE, but in most cases the syntax is identical, so there is a fair chance that a simple BRE will have the same syntax as the equivalent ERE. At the end of this page there is a short chapter on the main differences between BRE and ERE.

Comments welcome!

If you wish to give me your comments, or suggest new ideas or examples for this tutorial, please write me an e-mail at: diego at diegoalonso dot net

How to use this guide

You are free to use this guide whichever way you want. To get the most out of it, however, I recommend that you run the examples on your machine, experiment by changing them, and see what happens.

To make the examples clearer, I created a simple C program with a very reduced 'grep' functionality and called it 'mygrep'. It reads lines from standard input and tries to match a regular expression against each one. If a line matches the regular expression, it is printed to the output with the matched string highlighted. I have compiled the program on both Linux and Solaris. If you want a copy of the program, let me know.

The basics

Regular expressions are patterns used to match a sequence of characters within a larger string. A simple regular expression could be 'dog'. A string like 'This is a dog' will match the regular expression 'dog'. A string like 'This is not a cat' will not match it. If you have used grep, you have been using regular expressions all along, whether you knew it or not. The basic syntax for grep is 'grep regular_expression filename'.
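As a minimal sketch of the idea above, the following uses the standard grep rather than the author's mygrep, so matches are not highlighted. Feeding the two sample strings through printf on standard input, instead of reading them from a file, is an assumption made to keep the example self-contained.

```shell
# Send two sample lines to grep on standard input.
# Only the line containing the string 'dog' is printed.
printf '%s\n' 'This is a dog' 'This is not a cat' | grep 'dog'
# prints: This is a dog
```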
Any character

The metacharacter dot (.) matches any single character except the newline.
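A small illustration of the dot, using standard grep in place of mygrep. The three sample words are made up for this sketch.

```shell
# '..l' needs any two characters immediately followed by an 'l'.
# 'lamp' has no characters before its only 'l', so it is not printed.
printf '%s\n' 'total' 'lamp' 'bottle' | grep '..l'
# prints:
# total
# bottle
```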
The regular expression '.' matches any single character. The regular expression '..l' matches any two characters followed by an 'l'. Note that the regular expression is written between single quotes. This prevents the shell from interpreting the expression.

Zero or more instances

The metacharacter star (*) matches zero or more instances of the preceding regular expression.
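Since standard grep prints whole lines, one way to see exactly what the star matched is to let sed wrap every match in square brackets. This is only a stand-in for the highlighting done by mygrep; the sample words are assumptions chosen for the sketch.

```shell
# In the replacement part, '&' stands for the text that was matched.
# The 'g' flag repeats the substitution for every match on the line.
printf '%s\n' 'total' 'boot' 'cab' | sed 's/o*t/[&]/g'
# prints:
# [t][ot]al
# b[oot]
# cab
```

The first 't' of 'total' is matched on its own, because zero instances of 'o' come before it.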
Observe how the expression 'o*t' means zero or more instances of the letter 'o' followed by a 't'. For a line such as 'total', the first 't' alone meets that criterion, because it is preceded by zero instances of 'o'.

Square brackets

Inside the square bracket metacharacters you can define the set of characters you want the regular expression to match.
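A short sketch of a bracket set with standard grep. The sample words are invented for illustration.

```shell
# 'ot[ah]' needs 'o', then 't', then either 'a' or 'h'.
# 'hotel' contains 'ote', which does not qualify.
printf '%s\n' 'other' 'hotel' 'rotate' | grep 'ot[ah]'
# prints:
# other
# rotate
```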
The regular expression 'ot[ah]' matches the character 'o' followed by 't', followed by either 'a' or 'h'.

You can use the dash character (-) to specify a range of characters. That way, [a-z] will match any lowercase letter, [A-Z] will match any uppercase letter, [a-zA-Z] will match any upper or lower case letter, and [0-9] will match any single digit. Look at the following example.
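Ranges can be tried out as follows. Setting LC_ALL=C is an assumption added here: outside the C/POSIX locale, the meaning of a range such as [A-Z] can depend on the collation order. The sample lines are made up.

```shell
# Lines containing at least one digit.
printf '%s\n' 'room 101' 'no digits here' 'ABC' | LC_ALL=C grep '[0-9]'
# prints: room 101

# Lines containing at least one uppercase letter.
printf '%s\n' 'room 101' 'no digits here' 'ABC' | LC_ALL=C grep '[A-Z]'
# prints: ABC
```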
Inside square brackets, use the (^) character as the first character to indicate a set of characters that should not be matched. For instance, [^0-9] matches any single character that is not a digit.
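A negated set still has to match one character, so grep prints a line only if it contains at least one character outside the set. The sample lines below are assumptions for this sketch.

```shell
# Only '12a45' contains a character that is not a digit.
printf '%s\n' '12345' '12a45' '2468' | grep '[^0-9]'
# prints: 12a45

# '2468' consists only of characters between 2 and 8, so it is skipped;
# '2019' contains '0', '1' and '9', which lie outside that range.
printf '%s\n' '2468' '2019' | grep '[^2-8]'
# prints: 2019
```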
A regular expression such as '[^2-8]' means: match any single character that is not between 2 and 8.

Beginning of line

The metacharacter (^) matches the beginning of the line.
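The start-of-line anchor can be tried with standard grep as follows. The sample lines are invented for illustration.

```shell
# '^t' requires the 't' to be the first character of the line.
# 'a total' contains a 't', but not at the beginning.
printf '%s\n' 'total' 'a total' 'tip' | grep '^t'
# prints:
# total
# tip
```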
The regular expression '^t' matches any line starting with the character 't'. Note the difference between (^) as the first character inside square brackets, where it negates the set of characters, and (^) outside square brackets, where it marks the beginning of the line.

End of line

The metacharacter ($) matches the end of the line.
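A matching sketch for the end-of-line anchor, again with standard grep and made-up sample lines. The single quotes also keep the shell from treating '$' specially.

```shell
# 'p$' requires the 'p' to be the last character of the line.
# 'stops' contains a 'p', but ends in 's'.
printf '%s\n' 'stop' 'stops' 'top' | grep 'p$'
# prints:
# stop
# top
```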
The regular expression 'p$' matches any line ending with the character 'p'.

Number of occurrences

The curly bracket metacharacters \{\} allow you to specify the number of times the preceding regular expression has to occur.
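The three forms of the interval (a range, an exact count, and an open-ended minimum) can be compared on the same input. The four test words are assumptions made for this sketch, and standard grep is used instead of mygrep.

```shell
# One 'r', then one to four 'e' characters, then 'gexp'.
lines=$(printf '%s\n' 'regexp' 'reegexp' 'reeegexp' 'reeeegexp')

# Two or three 'e' characters.
printf '%s\n' "$lines" | grep 're\{2,3\}gexp'
# prints:
# reegexp
# reeegexp

# Exactly two 'e' characters.
printf '%s\n' "$lines" | grep 're\{2\}gexp'
# prints: reegexp

# Three or more 'e' characters.
printf '%s\n' "$lines" | grep 're\{3,\}gexp'
# prints:
# reeegexp
# reeeegexp
```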
The regular expression 're\{2,3\}gexp' matches the character 'r' followed by two or three instances of the character 'e' followed by 'gexp'. The regular expression 're\{2\}gexp' matches the character 'r' followed by exactly two instances of the character 'e' followed by 'gexp'. The regular expression 're\{3,\}gexp' matches the character 'r' followed by three or more instances of the character 'e' followed by 'gexp'.

Escape character

We use the metacharacter '\' to escape special characters, so that they are matched literally. For example, '.' matches any character, whereas '\.' matches only a literal dot.
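The effect of escaping can be seen by running the same input through an unescaped and an escaped dot. The patterns 'a.c' and 'a\.c' and the sample lines are illustrative choices, not taken from the original examples.

```shell
# Unescaped: the dot matches any character, so both lines match.
printf '%s\n' 'a.c' 'abc' | grep 'a.c'
# prints:
# a.c
# abc

# Escaped: the dot must be a literal '.', so only the first line matches.
printf '%s\n' 'a.c' 'abc' | grep 'a\.c'
# prints: a.c
```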
Backreference

The metacharacters \(\) group a part of a regular expression so that the text it matched can be referenced later in the same expression. This is better explained with an example.
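A possible way to try a backreference with standard tools is sketched below. The sample lines are made up, and the sed command is only a stand-in for the highlighting done by mygrep.

```shell
# A digit, then zero or more digits, then the same first digit again.
# '1234' has no repeated digit; in 'a5b5' a letter separates the two 5s.
printf '%s\n' '12341' '1234' '77' 'a5b5' | grep '\([0-9]\)[0-9]*\1'
# prints:
# 12341
# 77

# Wrap the matched part in brackets ('&' is the whole match).
printf '%s\n' '5123514' | sed 's/\([0-9]\)[0-9]*\1/[&]/'
# prints: [51235]14
```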
The expression [0-9] will match any single digit. Since it is between escaped parentheses, the text it matches is stored in memory and is available under the expression \1. [0-9]* will match zero or more digits. So \([0-9]\)[0-9]*\1 will match any digit, followed by zero or more digits, followed by the same digit that was matched first. That should explain the output of the command.

Main differences between BRE and ERE

ERE supports all the metacharacters that BRE supports, changing the syntax of some of them and adding a few valuable ones. ERE does not need the backslash for grouping or for the number of occurrences. What in BRE is '\([0-9]\)' in ERE is '([0-9])'. What in BRE is 're\{2,3\}gexp' in ERE is 're{2,3}gexp'. This provides a simpler syntax. Note that POSIX does not define backreferences such as \1 for ERE, although some implementations support them as an extension.

There are also several additions in ERE. The metacharacter (+) in ERE is similar to the metacharacter (*), except that (+) matches one or more instances of the preceding expression (instead of zero or more). The metacharacter (?) matches zero or one instance of the preceding expression.

To know where to use ERE and where to use BRE, look at the chapter about which regular expressions.
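The ERE forms mentioned above can be tried with 'grep -E', which selects extended regular expressions in the same way egrep does. The sample lines and the patterns 'o+t' and '^colou?r$' are assumptions made for this sketch.

```shell
# Interval without backslashes.
printf '%s\n' 'regexp' 'reegexp' 'reeegexp' | grep -E 're{2,3}gexp'
# prints:
# reegexp
# reeegexp

# '+' needs at least one 'o' before the 't', so 'tt' does not match.
printf '%s\n' 'tt' 'boot' | grep -E 'o+t'
# prints: boot

# '?' makes the 'u' optional, but allows it at most once.
printf '%s\n' 'color' 'colour' 'colouur' | grep -E '^colou?r$'
# prints:
# color
# colour
```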