Who is this tutorial for?

This tutorial is for anyone who is curious about regular expressions and has little or no experience with them. It assumes that the reader has some familiarity with UNIX command-line tools.

Which regular expressions?

POSIX has standardised two different flavours of regular expressions: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). This tutorial covers the basics of BRE. Once you have learned them, ERE is easy to pick up. BRE is used by many programs, such as sed and grep. egrep (grep -E) and awk use ERE, but most of the syntax is identical, so there is a fair chance that a simple BRE will be written the same way as an ERE.

At the end of this page there is a short section on the main differences between BRE and ERE.

Comments welcome!

If you would like to send me comments, or suggest new ideas or examples for this tutorial, please e-mail me at: diego at diegoalonso dot net

How to use this guide

You are free to use this guide in whichever way you want. To get the most out of it, though, I recommend that you run the examples on your own machine, experiment by changing them, and see what happens.

To make the examples clearer, I wrote a simple C program with a very reduced 'grep' functionality and called it 'mygrep'. It reads lines from the standard input and tries to match a regular expression against each one. If a line matches the regular expression, it is printed to the standard output with the matched string highlighted. I have compiled the program on both Linux and Solaris. If you want a copy of the program, let me know.
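If you do not have mygrep at hand, the examples can still be tried with the ordinary grep. The following is only a rough sketch of a stand-in, not the original C program: the function name mygrep is reused for convenience, and --color is an extension found in GNU and BSD grep rather than something POSIX requires.

```shell
# Hypothetical stand-in for mygrep, built on the system grep.
# --color=auto highlights the matched text when the output is a terminal
# (a GNU/BSD grep extension; plain POSIX grep has no highlighting).
mygrep() {
    grep --color=auto "$1"
}

# Example: only the line that contains 'oo' is printed.
printf 'root\nother\n' | mygrep 'oo'
```

The remaining sketches call grep directly, so they do not depend on this function.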

The basics

A regular expression is a pattern that is used to match a sequence of characters within a larger string. A simple regular expression could be 'dog'. A string like 'This is a dog' matches the regular expression 'dog'. A string like 'This is not a cat' does not match it. If you have used grep, you have been using regular expressions all along, whether you knew it or not. The basic syntax for grep is 'grep regular_expression filename'.

# ls -l | mygrep root
-rwxr-xr-x 1 root other 11496 Nov 29 13:13 a.out
-rw-r--r-- 1 root other 73901 Nov 30 11:34 hello.c
-rwxr-xr-x 1 root other 12200 Nov 30 11:35 mygrep
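The 'dog' example from the text can be tried in the same way. This is a minimal sketch that assumes the standard grep in place of mygrep; the variable names input and matched are only illustrative.

```shell
# The two sample strings from the text, one per line.
input='This is a dog
This is not a cat'

# Keep only the lines that contain the regular expression 'dog'.
matched=$(printf '%s\n' "$input" | grep 'dog')
printf '%s\n' "$matched"
```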

Any character

The metacharacter dot (.) matches any single character except the newline.

# ls -l | mygrep '.'
total 50
-rwxr-xr-x 1 root other 11496 Nov 29 13:13 a.out
-rw-r--r-- 1 root other 73901 Nov 30 11:34 hello.c
-rwxr-xr-x 1 root other 12200 Nov 30 11:35 mygrep
# ls -l | mygrep '..l'
total 50
-rw-r--r-- 1 root other 73901 Nov 30 11:34 hello.c

The regular expression '.' matches any character. The regular expression '..l' matches any two characters followed by an 'l'. Note that the regular expression is written between single quotes. This prevents the shell from interpreting the expression.
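To see which part of a line the dot expressions pick out, something like the following may help. It is a sketch that assumes a grep with the -o option (available in GNU and BSD grep, but not required by POSIX), which prints only the matched text and so plays the role of the highlighting in mygrep.

```shell
# -o prints only the matched text, one match per line.
matched=$(printf 'total 50\nhello.c\n' | grep -o '..l')
printf '%s\n' "$matched"

# An empty line contains no character at all, so '.' does not match it.
# -c counts the matching lines.
non_empty=$(printf 'a\n\nb\n' | grep -c '.')
printf '%s\n' "$non_empty"
```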

Zero or more instances

The metacharacter star (*) matches zero or more instances of the preceding regular expression.

# ls -l | mygrep 'o*t'
total 50
-rwxr-xr-x 1 root other 11496 Nov 29 13:13 a.out
-rw-r--r-- 1 root other 73901 Nov 30 11:34 hello.c
-rwxr-xr-x 1 root other 12200 Nov 30 11:35 mygrep

Observe how the expression 'o*t' means zero or more instances of the letter 'o' followed by a 't'. In the first line, the 't' at the start of 'total' meets that criterion with zero instances of 'o'.
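The "zero or more" behaviour can be checked on a few short words. The sketch below assumes the standard grep and, for the matched text, the -o extension; the sample words are made up for illustration.

```shell
input='root
total
xyz'

# 'root' and 'total' both contain a 't'; 'xyz' does not.
lines=$(printf '%s\n' "$input" | grep 'o*t')
printf '%s\n' "$lines"

# With two 'o' before the 't', the star takes both of them.
with_o=$(printf 'root\n' | grep -o 'o*t')
# With no 'o' before the 't', the star matches zero instances.
without_o=$(printf 'tree\n' | grep -o 'o*t')
printf '%s %s\n' "$with_o" "$without_o"
```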

Square brackets

Inside square brackets you can list the set of characters, any one of which the regular expression should match.

# ls -l | mygrep 'ot[ah]'
total 50
-rwxr-xr-x 1 root other 11496 Nov 29 13:13 a.out
-rw-r--r-- 1 root other 739 Nov 30 11:34 hello.c
-rwxr-xr-x 1 root other 12200 Nov 30 11:35 mygrep

The regular expression 'ot[ah]' matches the character 'o' followed by 't', followed by either 'a' or 'h'.

You can use the dash character (-) to specify a range of characters. That way, [a-z] will match any lowercase letter, [A-Z] will match any uppercase letter, [a-zA-Z] will match any upper or lower case letter, and [0-9] will match any single digit. Look at the following example.
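A small check of the bracket set, assuming the standard grep; the three sample words are invented for the purpose.

```shell
input='total
other
otter'

# 'total' contains 'ota' and 'other' contains 'oth'.
# In 'otter' the 'ot' is followed by 't', which is not in the set.
matched=$(printf '%s\n' "$input" | grep 'ot[ah]')
printf '%s\n' "$matched"
```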

# ls -l | mygrep "[2-8][2-8]*"
total 70
-rwxr-xr-x 1 root other 97361 Nov 30 13:46 a.out
-rw-r--r-- 1 root other 11340 Dec 12 16:11 hello.c
-rwxr-xr-x 1 root other 10112 Dec 12 16:11 mygrep
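The file sizes from this listing can be fed to grep on their own to see exactly which digits the range picks out. This sketch assumes a grep with the -o extension (GNU and BSD grep have it; POSIX does not require it).

```shell
sizes='97361
11340
10112'

# Print only the runs of digits between 2 and 8, one per line.
runs=$(printf '%s\n' "$sizes" | grep -o '[2-8][2-8]*')
printf '%s\n' "$runs"
```

In 97361 the run is 736, because 9 and 1 fall outside the range; in 11340 it is 34, and in 10112 it is the final 2.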

Inside square brackets, use the (^) character as the first character to indicate a set of characters that should not be matched. For instance, [^0-9] matches any single character that is not a digit.

# ls -l | mygrep "[^2-8]*"
total 70
-rwxr-xr-x 1 root other 97361 Nov 30 13:46 a.out
-rw-r--r-- 1 root other 11340 Dec 12 16:11 hello.c
-rwxr-xr-x 1 root other 10112 Dec 12 16:11 mygrep

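The previous regular expression means: match zero or more characters that are not between 2 and 8. Because the star also accepts zero instances, every line matches, even one that consists only of those digits.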
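The effect of the star on a negated set is easy to overlook, so here is a short comparison. It assumes the standard grep with -c to count matching lines; the three input lines (digits only, letters only, and an empty line) are made up.

```shell
# Three lines: '2345', 'abc' and an empty line.
# With the star, zero characters are enough, so all three lines match.
with_star=$(printf '2345\nabc\n\n' | grep -c '[^2-8]*')

# Without the star, at least one character outside 2-8 is required.
without_star=$(printf '2345\nabc\n\n' | grep -c '[^2-8]')

printf '%s %s\n' "$with_star" "$without_star"
```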

Beginning of line

The metacharacter (^) matches the beginning of the line.

# ls -l | mygrep '^t'
total 50

The regular expression '^t' matches any line starting with the character 't'. Note the difference between (^) inside square brackets, where it negates the set of characters, and (^) outside square brackets, where it matches the beginning of the line.
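The two meanings of (^) can be put side by side. A minimal sketch, assuming the standard grep and two invented input lines:

```shell
input='total 50
root other'

# Outside brackets: only lines that begin with 't'.
starts_with_t=$(printf '%s\n' "$input" | grep '^t')
printf '%s\n' "$starts_with_t"

# Inside brackets: lines that contain at least one character other than 't'.
has_non_t=$(printf '%s\n' "$input" | grep -c '[^t]')
printf '%s\n' "$has_non_t"
```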

End of line

The metacharacter ($) matches the end of the line.

# ls -l | mygrep 'p$'
-rwxr-xr-x 1 root other 12200 Nov 30 11:35 mygrep

The regular expression 'p$' matches any line ending with the character 'p'.
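Both anchors can also be combined to match a whole line. The following sketch assumes the standard grep; the list of names is only an example.

```shell
names='mygrep
mygrep.c
grep'

# Lines that end with 'p'.
ends_with_p=$(printf '%s\n' "$names" | grep 'p$')
printf '%s\n' "$ends_with_p"

# '^' and '$' together: the line must be exactly 'grep'.
exact=$(printf '%s\n' "$names" | grep '^grep$')
printf '%s\n' "$exact"
```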

Number of occurrences

The curly bracket metacharacters \{\} allow you to specify the number of times a regular expression has to occur.

# ls -l
total 0
-rw-r--r-- 1 root other 0 Nov 30 15:38 reeeegexp
-rw-r--r-- 1 root other 0 Nov 30 15:38 reeegexp
-rw-r--r-- 1 root other 0 Nov 30 15:38 reegexp
-rw-r--r-- 1 root other 0 Nov 30 15:38 regexp
# ls -l | mygrep 're\{2,3\}gexp'
-rw-r--r-- 1 root other 0 Nov 30 15:38 reeegexp
-rw-r--r-- 1 root other 0 Nov 30 15:38 reegexp
# ls -l | mygrep 're\{2\}gexp'
-rw-r--r-- 1 root other 0 Nov 30 15:38 reegexp
# ls -l | mygrep 're\{3,\}gexp'
-rw-r--r-- 1 root other 0 Nov 30 15:38 reeeegexp
-rw-r--r-- 1 root other 0 Nov 30 15:38 reeegexp

The regular expression 're\{2,3\}gexp' matches the character 'r' followed by two or three instances of the character 'e' followed by 'gexp'.

The regular expression 're\{2\}gexp' matches the character 'r' followed by exactly two instances of the character 'e' followed by 'gexp'.

The regular expression 're\{3,\}gexp' matches the character 'r' followed by three or more instances of the character 'e' followed by 'gexp'.
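The same three expressions can be run against the bare file names, without creating the files. A sketch that assumes the standard grep:

```shell
names='reeeegexp
reeegexp
reegexp
regexp'

two_or_three=$(printf '%s\n' "$names" | grep 're\{2,3\}gexp')
exactly_two=$(printf '%s\n' "$names" | grep 're\{2\}gexp')
three_or_more=$(printf '%s\n' "$names" | grep 're\{3,\}gexp')

printf 'two or three:\n%s\n' "$two_or_three"
printf 'exactly two:\n%s\n' "$exactly_two"
printf 'three or more:\n%s\n' "$three_or_more"
```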

Escape character

We use the metacharacter '\' to escape special characters, so that they are matched literally. Observe the difference:

# ls -l | mygrep '.'
total 70
-rwxr-xr-x 1 root other 9736 Nov 30 13:46 a.out
-rwxrwxrwx 1 root other 48 Dec 7 11:31 compile.sh
-rw-r--r-- 1 root other 1134 Dec 12 16:11 hello.c
-rwxr-xr-x 1 root other 10112 Dec 12 16:11 mygrep
-rwxr-xr-x 1 root other 10452 Nov 30 14:10 regexp
-rw-r--r-- 1 root other 939 Nov 30 14:12 regexp.c
# ls -l | mygrep '\.'
-rwxr-xr-x 1 root other 9736 Nov 30 13:46 a.out
-rwxrwxrwx 1 root other 48 Dec 7 11:31 compile.sh
-rw-r--r-- 1 root other 1134 Dec 12 16:11 hello.c
-rw-r--r-- 1 root other 939 Nov 30 14:12 regexp.c
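The difference between '.' and '\.' shows up clearly on a short list of names. A minimal sketch, assuming the standard grep:

```shell
names='a.out
mygrep
hello.c'

# Unescaped, the dot matches any character, so every line matches.
any_char=$(printf '%s\n' "$names" | grep -c '.')
printf '%s\n' "$any_char"

# Escaped, the dot matches only a literal '.'.
literal_dot=$(printf '%s\n' "$names" | grep '\.')
printf '%s\n' "$literal_dot"
```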

Backreference

The grouping metacharacters \( \), together with a backreference such as \1, can be used to refer back to the part of the string that an earlier part of the regular expression matched. This is better explained with an example.

# ls -l | mygrep '\([0-9]\)[0-9]*\1'
-rwxrwxrwx 1 root other 48 Dec 7 11:31 compile.sh
-rw-r--r-- 1 root other 1134 Dec 12 16:11 hello.c
-rwxr-xr-x 1 root other 10112 Dec 12 16:11 mygrep
-rw-r--r-- 1 root other 939 Nov 30 14:12 regexp.c

The expression [0-9] matches any single digit. Since it is enclosed in \( and \), the text it matched is stored in memory and is available as \1. [0-9]* matches zero or more digits. So \([0-9]\)[0-9]*\1 matches a digit, followed by zero or more digits, followed by the same digit that was matched first. That should explain the output of the command.
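To see which digits the backreference actually captures, the file sizes from the listing can be tested on their own. This is a sketch that assumes a grep with the -o extension (GNU and BSD grep), which prints only the matched text.

```shell
sizes='1134
10112
939
9736'

# Print the part of each number that starts and ends with the same digit.
repeated=$(printf '%s\n' "$sizes" | grep -o '\([0-9]\)[0-9]*\1')
printf '%s\n' "$repeated"
```

Here 1134 gives 11, 10112 gives 1011 and 939 gives 939. No digit is repeated in 9736, so it produces no output.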

Main differences between BRE and ERE

ERE supports the metacharacters that BRE supports, changing the syntax of some of them and adding a few valuable new ones.

ERE does not need the backslash for grouping or for the number of occurrences. What in BRE is '\([0-9]\)' in ERE is '([0-9])'. What in BRE is 're\{2,3\}gexp' in ERE is 're{2,3}gexp'. This provides a simpler syntax. Note that POSIX does not define backreferences such as \1 for ERE, although some implementations accept them as an extension.

There are also several additions in ERE. The metacharacter (+) in ERE is similar to the metacharacter (*), except that (+) matches one or more instances of the preceding expression (instead of zero or more).

The metacharacter (?) matches zero or one instance of the preceding expression.
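The differences can be tried with grep -E, which selects ERE. The sketch below assumes the standard grep; the name rgexp (with no 'e' at all) is added to the sample list only to show the zero-instance case.

```shell
names='rgexp
regexp
reegexp
reeegexp'

# The same interval written as BRE and as ERE.
bre=$(printf '%s\n' "$names" | grep 're\{2,3\}gexp')
ere=$(printf '%s\n' "$names" | grep -E 're{2,3}gexp')

# '+' needs at least one 'e'; '?' accepts zero or one 'e'.
one_or_more=$(printf '%s\n' "$names" | grep -E 're+gexp')
zero_or_one=$(printf '%s\n' "$names" | grep -E 're?gexp')

printf 'BRE interval:\n%s\n' "$bre"
printf 'ERE interval:\n%s\n' "$ere"
printf 'one or more:\n%s\n' "$one_or_more"
printf 'zero or one:\n%s\n' "$zero_or_one"
```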

To find out which programs use ERE and which use BRE, see the section "Which regular expressions?" above.