Who is this tutorial for?

This tutorial is for anyone who is curious about awk and has little or no experience with it. It presumes that the reader has some basic UNIX command-line skills and basic programming knowledge (if you have seen an 'if' and a 'for' statement before, you should be all right).

Which awk?

This tutorial is based on POSIX awk running under Sun Solaris. The examples should run on any UNIX system with a POSIX awk and, with a bit of work on your side, on non-UNIX operating systems as well. Solaris has its own version of awk, usually located under /usr/bin/, which is not POSIX compliant. A POSIX-compliant version of awk is delivered as /usr/xpg4/bin/awk. You may decide to add /usr/xpg4/bin/ at the beginning of your PATH environment variable.
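
A minimal sketch of that PATH change for a Bourne-style shell. It assumes the Solaris directory /usr/xpg4/bin mentioned above; on systems where that directory does not exist, the extra entry is simply skipped during command lookup.

```shell
# Put the POSIX tools first, so that a plain "awk" finds /usr/xpg4/bin/awk
PATH=/usr/xpg4/bin:$PATH
export PATH

# Show which awk the shell will run from now on
command -v awk
```

To make the change permanent you would put the first two lines in your shell start-up file.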

Comments welcome!

If you wish to send me comments or suggest new ideas or examples for this tutorial, please e-mail me at: diego at diegoalonso dot net. If you have done something with awk that you think is really cool and you want to share it, send it to me, and I will include it on this page if I think it's appropriate.

How to use this guide

You are free to use this guide whichever way you want. To get the most out of it, though, I recommend that you run the examples on your machine, experiment with changes to them, and see what happens.

The basics

awk reads records from a stream, splits them into tokens, and runs a program that manipulates those tokens. By default awk treats every line as a record and uses whitespace (spaces and tabs) to separate tokens.

The basic syntax is awk 'program' filename, so here we go with the first example.

# cat /etc/hosts
127.0.0.1 localhost
222.222.222.222 fake_ip
111.111.111.111 no_real_ip does_not_exist
# awk '{print $1}' /etc/hosts
127.0.0.1
222.222.222.222
111.111.111.111

As mentioned before, awk reads records from a stream. In this case, the file /etc/hosts has three records, since each line is considered a record by default. The default token separator is whitespace, so record one has two tokens (127.0.0.1 and localhost), record two has two tokens, and record three has three tokens. The program '{print $1}' asks awk to print the first token from every record it reads. That should explain the output from the command.
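
To try the same idea without depending on the contents of /etc/hosts, the sketch below feeds three made-up records to awk through printf. The helper name first_token is only illustrative.

```shell
# Hypothetical helper: print the first token of every record read from stdin
first_token() {
    awk '{print $1}'
}

# Three records; the third one has three tokens
printf '%s\n' '127.0.0.1 localhost' \
              '222.222.222.222 fake_ip' \
              '111.111.111.111 no_real_ip does_not_exist' | first_token
```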

Of course, you can use pipes with awk.

# ls -l
total 4
drwxrwxrwx 2 root other 512 Sep 30 14:10 in
drwxrwxrwx 2 root other 512 Sep 28 13:24 out
# ls -l | awk '{print $5, $9}'

512 in
512 out

In this example, tokens 5 and 9 (which correspond to file size and file name) are printed. Note that the first line in the output from ls does not have tokens 5 or 9, and because of that, the first line in the output from awk is empty.

It's possible to change the token separator from whitespace to whatever is desired with the -F option.

# cat /etc/passwd
root:x:0:1:Super-User:/:/sbin/sh
daemon:x:1:1::/:
bin:x:2:2::/usr/bin:
sys:x:3:3::/:
adm:x:4:4:Admin:/var/adm:
# awk -F: '{print $1, $6}' /etc/passwd
root /
daemon /
bin /usr/bin
sys /
adm /var/adm

The previous awk command will print the user name and its home directory for each user defined in the /etc/passwd file. Note how the tokens are now separated by the colon character, instead of the default space.

If the token manipulation program is going to be long and complex, it is good practice to save the awk program in a file. You can then run that program file with awk's -f option.

# cat /etc/passwd
root:x:0:1:Super-User:/:/sbin/sh
daemon:x:1:1::/:
bin:x:2:2::/usr/bin:
sys:x:3:3::/:
adm:x:4:4:Admin:/var/adm:
# cat example.awk
{print $1, $6}
# awk -F: -f example.awk /etc/passwd
root /
daemon /
bin /usr/bin
sys /
adm /var/adm

Using variables

awk has two main types of scalar variables: numbers and strings (scalar variables hold a single value, as opposed to arrays). In awk all new variables are initialized to the empty string, and a conversion to a number is done when necessary. To illustrate how awk works with variables, let's look at this trivial example:

# ls -l
total 6
drwxr-xr-x 2 alon other 512 Feb 28 2005 cdrfiles
-rwxrwxrwx 1 alon other 158 Feb 28 2005 moveFiles
-rwxrwxrwx 1 alon other 254 Feb 26 2005 perfMeasure

# ls -l | awk '{totalSize = totalSize + $5; print totalSize}'
0
512
670
924

In this program we used a variable called totalSize. The first line that is processed (total 6) only has two tokens, so $5 will be empty. When awk executes the code totalSize = totalSize + $5 it realizes that it is doing a numeric operation and converts both empty strings to 0, so totalSize gets the numeric value 0. In the second line (drwxr-xr-x 2 alon other 512 Feb 28 2005 cdrfiles) the fifth token has the value 512, which is added to the previous value of totalSize. On the third line $5 has the value 158, which is again added to the variable. And so it goes; you get the idea.

awk does not have an operator to concatenate strings. Instead, you just write the strings next to each other, and awk will concatenate them for you. Look at the following example.

# ls -l
total 4
drwxrwxrwx 2 alon other 512 Sep 28 13:24 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
#
# ls -l | awk '{name = name "<>" $9; print name}'
<>
<><>in
<><>in<>out

The instruction name = name "<>" $9 will concatenate the content of name (empty string in the first execution) with the string "<>" and the token number 9, the name of the file. You also see in this example that strings are declared between quotes.

Conversion between numbers and strings

The conversion between numbers and strings is simple to do. To convert a number into a string, you do the following:

number = 7; string = "" number;

You are basically concatenating an empty string with the number, with the result of a string containing the number you had previously.

The conversion from string to number is equally simple

string = "12"; number = 0 + string;

When awk has to apply the + operator, it converts its operands into numbers. As a curiosity, the POSIX version of awk converts "12A" to the number 12, and the Solaris version converts "12A" to 0.
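
The two conversions can be tried together in a BEGIN block. This is only a sketch; conversion_demo is an illustrative name. The last two lines show why the conversion matters: two strings are compared as text, while two numbers are compared by value.

```shell
conversion_demo() {
    awk 'BEGIN {
        number = 7
        string = "" number          # concatenation gives the string "7"
        print string "0"            # string context: prints 70
        print 0 + "12" + 1          # numeric context: prints 13
        print ("10" < "9")          # text comparison: prints 1 (true)
        print (0 + "10" < 0 + "9")  # numeric comparison: prints 0 (false)
    }'
}

conversion_demo
```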

Pre execution and post execution

awk has a mechanism to execute code before the input stream is parsed, and after the whole stream has been parsed. For this you use the BEGIN and END patterns. Look at this example:

# ls -l
total 4
drwxrwxrwx 2 alon other 512 Sep 28 13:24 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
# ls -l | awk 'BEGIN {print "Before parsing file"} {print $0} \
? END {print "After parsing file"}'
Before parsing file
total 4
drwxrwxrwx 2 alon other 512 Sep 28 13:24 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
After parsing file

The code BEGIN {print "Before parsing file"} will be executed before the file is parsed. The code {print $0} will be executed once for every record (or line) and the code END {print "After parsing file"} will be executed after the whole file has been parsed.

If you look at the example, you should also realize that $0 is a variable that contains the whole record.
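
BEGIN and END combine naturally with the variables from the previous section. The sketch below (sum_sizes is an illustrative name, and the listing is made up) adds token 5 of every record but prints the total only once, in the END action.

```shell
# Hypothetical helper: add up token 5 of every record, report once at the end
sum_sizes() {
    awk 'BEGIN {print "Adding sizes"}
         {totalSize = totalSize + $5}
         END {print "Total: " (totalSize + 0)}'
}

printf '%s\n' 'total 4' \
              'drwxrwxrwx 2 alon other 512 Sep 28 13:24 in' \
              'drwxrwxrwx 2 alon other 512 Sep 28 13:24 out' | sum_sizes
```

The + 0 in the END action forces a number, so an empty input prints 0 instead of an empty string.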

If statement

# ls -lrt
total 10
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
drwxrwxrwx 2 alon other 512 Sep 30 14:10 in
-rw-r--r-- 1 alon other 600 Oct 3 15:44 bigger.txt
-rw-r--r-- 1 alon other 600 Oct 3 15:45 bigger.copy
-rw-r--r-- 1 alon other 198 Oct 3 15:49 listBigger.awk
#
# cat listBigger.awk
BEGIN {maxSize=0}
{
    if ($5 > maxSize) # Bigger file found
    {
        maxFile=$9;
        maxSize=$5;
    }
    else if ($5 == maxSize) # identical size
        maxFile=maxFile ", " $9; # Concatenate current filename
}
END {print "The biggest file(s) is " maxFile}
#
# ls -lrt | awk -f listBigger.awk
The biggest file(s) is bigger.txt, bigger.copy

The listBigger.awk program begins by initializing the variable maxSize to 0 (actually, this was not necessary, because the first time a variable is used it holds the empty string, which converts to 0 when treated as a number). Then awk goes through every line of the stream, comparing the value of token 5 (the file size, except for the first line) with maxSize. If the file size is bigger, we record the file name in maxFile and set maxSize to the current size. Otherwise, if token 5 has the same value as maxSize, we append the current filename to the previous one, thus reporting all files of maximum size. Also note how # can be used to put comments into your code.

You can also see that, like in C, if you want to put more than one instruction in an if or else statement you need to put them between curly brackets ({}).

For statement

The for statement is very similar to the for statement in C. The for statement has three members, separated by semicolons. The first member is executed before the first execution of the loop. The second member is a conditional expression, which is checked every time before executing the loop, and if it evaluates to TRUE, the code in the loop is run. The third member is executed every time the loop finishes. Take a look at the following examples.

# ls -l
total 4
drwxrwxrwx 2 alon other 512 Sep 30 14:10 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
# ls -l | awk '{for (token=1;token<=NF;token++) \
? printf "[" token ":" $token "]"; print ""}'
[1:total][2:4]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:30][8:14:10][9:in]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:28][8:13:24][9:out]

A few things to note in this example. First, note how the variable NF holds the number of tokens in the current record. Also, note that when the variable token is 1, $token is interpreted as if it were $1. And finally, you can see that printf does not print a new line at the end, as opposed to print. We will look into printf with more detail in a while.
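
printf also accepts a C-style format string followed by the values to insert, which keeps the data out of the format itself. That matters when a token contains a % character, since a % inside the format would be read as the start of a conversion. A minimal sketch (show_tokens is an illustrative name):

```shell
show_tokens() {
    awk '{
        for (token = 1; token <= NF; token++)
            printf "[%d:%s]", token, $token   # %d = integer, %s = string
        printf "\n"                           # printf adds no newline by itself
    }'
}

printf '%s\n' 'total 4' '100% done' | show_tokens
```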

There is a second version of the loop command to work with arrays, and it will be explained in the arrays chapter.

While statement

The while statement can be seen as a simplified for, since while(expression) is identical to for(;expression;). Let's try to rewrite the previous example using while instead of for.

# ls -l
total 4
drwxrwxrwx 2 alon other 512 Sep 30 14:10 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
# ls -l | awk '{token=0; while(++token<=NF) \
? printf "["token":"$token"]"; print ""}'
[1:total][2:4]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:30][8:14:10][9:in]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:28][8:13:24][9:out]

Here we use the pre-increment operator (++), which will increment the value of token, and return its new value (for an explanation of the difference between pre-increment and post-increment look into the numeric functions section). The rest of the awk program resembles very much the for counterpart.

Do-while statement

Similar to the while statement, we have the do-while statement. Let's try to rewrite the previous code with a do-while.

# ls -l
total 4
drwxrwxrwx 2 alon other 512 Sep 30 14:10 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
# ls -l | awk '{token=1; do {printf "["token":"$token"]"} \
? while (++token<=NF); print ""}'
[1:total][2:4]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:30][8:14:10][9:in]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:28][8:13:24][9:out]

The main difference between while and do-while is that with the do-while the code in the loop is executed at least once, and with the while statement, if the condition evaluates to FALSE the first time, the code in the loop will never be executed.
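
The difference becomes visible on an empty record, where NF is 0. The sketch below wraps the two loops in illustrative helper functions (with_while and with_do_while) and feeds each of them one empty line.

```shell
with_while() {
    awk '{token = 0; while (++token <= NF) printf "[%d:%s]", token, $token; print ""}'
}

with_do_while() {
    awk '{token = 1; do {printf "[%d:%s]", token, $token} while (++token <= NF); print ""}'
}

echo "" | with_while      # loop body never runs: prints an empty line
echo "" | with_do_while   # loop body runs once: prints [1:]
```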

Pattern matching

Until now we have been missing an important part of the awk programming model. All the scripts that we have seen follow the model '{action}', with the variants 'BEGIN {action}' and 'END {action}'. Actually, BEGIN and END are two special types of patterns, and the full syntax for an awk script is 'pattern {action}'. The general rule is that awk tries to match each record against the specified pattern, and executes the action only if the record matches.

The usual way to define the regular expression is between slash characters.

# ls -l
total 6
drwxrwxrwx 2 root other 512 Sep 30 14:10 in
drwxrwxrwx 2 root other 512 Sep 28 13:24 out
drwxr-xr-x 2 root other 512 Oct 31 16:11 output
# ls -l | awk '/out/ {print $0}'
drwxrwxrwx 2 root other 512 Sep 28 13:24 out
drwxr-xr-x 2 root other 512 Oct 31 16:11 output

As you can see in the example, only the lines that match the regular expression 'out' are printed. More elaborate regular expressions can be used as well, as shown in the following example:

# ls -l
total 8
-rw-r--r-- 1 alon other 25 Sep 30 16:24 casa.txt
lrwxrwxrwx 1 alon other 8 Sep 30 16:25 dd -> casa.txt
drwxrwxrwx 2 alon other 512 Sep 30 14:10 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
#
# ls -l | awk '/^d/ {print "Record " FNR;}'
Record 4
Record 5

Here only lines that match '^d' (lines that start with the character 'd') are selected, and their record numbers are printed. As you can see in this example, FNR is a variable that contains the number of the record being processed.
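
Either half of 'pattern {action}' may be left out. The earlier examples omit the pattern, which makes the action run for every record. When the action is omitted instead, awk prints each matching record, as if the action were {print $0}. A short sketch with made-up input (only_dirs is an illustrative name):

```shell
only_dirs() {
    awk '/^d/'        # no action given: matching records are printed
}

printf '%s\n' 'drwxrwxrwx 2 alon other 512 Sep 30 14:10 in' \
              '-rw-r--r-- 1 alon other 25 Sep 30 16:24 casa.txt' | only_dirs
```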

It is also possible to use expressions to filter records. In the following example only records that match the NF>=9 criterion (number of tokens greater than or equal to 9) will be printed out.

# ls -l
total 4
drwxrwxrwx 2 alon other 512 Sep 30 14:10 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
# ls -l | awk 'NF>=9 {printf "[Record" FNR"]"; \
? for (token=1;token<=NF;token++) printf "["token">"$token"]"; print ""}'
[Record2][1>drwxrwxrwx][2>2][3>alon][4>other][5>512][6>Sep][7>30][8>14:10][9>in]
[Record3][1>drwxrwxrwx][2>2][3>alon][4>other][5>512][6>Sep][7>28][8>13:24][9>out]

As a last example we have an awk program with two 'pattern {action}' rules: one that prints records 6 to 8, and another that prints records 7 to 10.

# ls -l | awk '((FNR > 5) && (FNR < 9)) {print "[First print " FNR"]"} \
? ((FNR > 6) && (FNR <11 )) {print "[Second print " FNR"]"}'
[First print 6]
[First print 7]
[Second print 7]
[First print 8]
[Second print 8]
[Second print 9]
[Second print 10]

Arrays

Arrays are variables that can contain several values. Arrays in awk follow a similar syntax to C, but internally they work in a different way. Let's take a look at some simple examples:

table[1] = "one";
table["cat"] = "Lile";
table["dog"] = "Puff";

As you can see, you can define elements of an array 'on the fly'. The array in the previous example has three elements, with keys 1, "cat", and "dog", and values "one", "Lile", and "Puff".

There is an easy way to iterate through arrays using a special form of the for command:

# awk 'BEGIN {table[1] = "one"; \
? table["cat"] = "Lile";\
? table["dog"] = "Puff"; \
? for (element in table) print element, table[element]}'
dog Puff
cat Lile
1 one

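The instruction for (element in table) will iterate through all the members of the array, assigning to the variable 'element' each of the keys used in the array. The order in which the keys are visited is not specified, so it may differ from the order in which the elements were created.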

To test if an array contains a certain key, you use the command if (element in array) or if (!(element in array))
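
A small sketch of the membership test (key_demo is an illustrative name). It also shows a detail worth knowing: merely referring to table["dog"] creates that element with an empty value, which is why the in test is the safe way to check for a key.

```shell
key_demo() {
    awk 'BEGIN {
        table["cat"] = "Lile"
        if ("cat" in table)     print "cat is a key"
        if (!("dog" in table))  print "dog is not a key"
        if (table["dog"] == "") print "dog has an empty value"   # this reference creates the key
        if ("dog" in table)     print "dog is a key now"
    }'
}

key_demo
```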

The following example shows how to use the array functions. We have an xml file, and we want to create an array that will contain how many cups and plates (or any other piece of tableware) we have in it.

# cat xml.txt
<input>
<param>cups</param>
<value>10</value>
</input>
<input>
<param>plates</param>
<value>8</value>
</input>
<input>
<param>plates</param>
<value>3</value>
</input>
<input>
<param>cups</param>
<value>1</value>
</input>
<input>
<param>cups</param>
<value>5</value>
</input>
# cat xml.txt | \
? awk -F "<|>" '/param/ {param=$3; if (!(param in table)) \
? print "Found new param:", param} \
? /value/ {table[param]+=$3} \
? END {for (object in table) print object, table[object]}'
Found new param: cups
Found new param: plates
plates 11
cups 16

First let's take a look at the token delimiter: "<|>". This means that either '<' or '>' acts as a token delimiter. In the xml.txt example, that means that for a record like '<param>cups</param>' $1 will have the value '', $2 will be 'param', and $3 will be 'cups'. awk iterates through the file, and when it finds a record that matches the regular expression 'param', it stores its value (plates or cups) in the variable param. When awk finds a line that matches the regular expression 'value', it adds the value from the xml structure to the appropriate array element (table["cups"] or table["plates"]). Finally, once all records are processed, it loops through all the elements in the array and prints out the respective totals. This example assumes that the <param> line always comes before the <value> line.

String functions

The functions tolower and toupper return their argument converted to lowercase and to uppercase, respectively.

# awk 'BEGIN {print tolower("caSa4Mv"), toupper("caSa4Mv")}'
casa4mv CASA4MV

The function length returns the length of its argument. If no argument is passed, it returns the length of the whole record.

# ls -l
total 0
-rw-r--r-- 1 alon other 0 Nov 25 11:12 one
-rw-r--r-- 1 alon other 0 Nov 25 11:12 three
# ls -l | awk '{print length($9), length}'
0 7
3 57
5 59

The function match returns the position at which a regular expression matches in a string. If the string does not match the regular expression, it returns 0. After running match, the variable RLENGTH contains the length of the matched text (or -1 if there was no match).

# ls -1
ama.txt
filterAma.sh
lk
testamma.txt
# ls -1 | awk '{print match($0, "[aA]m*a"), RLENGTH}'
1 3
7 3
0 -1
5 4

The regular expression [aA]m*a matches an 'a' or an 'A', followed by 0 or more 'm's, followed by an 'a'. Observe how the length for 'ama.txt' is 3, and the length for 'testamma.txt' is 4.
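
match also sets the variable RSTART to the position it returns, so together with RLENGTH and the substr function (described further down) the matched text itself can be extracted. A sketch, with matched_text as an illustrative name:

```shell
matched_text() {
    awk '{
        if (match($0, "[aA]m*a"))
            print substr($0, RSTART, RLENGTH)   # the text that matched
        else
            print "(no match)"
    }'
}

printf '%s\n' ama.txt filterAma.sh lk testamma.txt | matched_text
```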

The function index operates in a similar way, but does not accept regular expressions:

# ls -1
ama.txt
filterAma.sh
lk
# ls -1 | awk '{print index($0, "Ama")}'
0
7
0

The functions sub and gsub replace the text matched by a regular expression with a replacement string, inside a given string. This is easier to explain with an example:

# ls -1
amanotAma.txt
# ls -1 | awk '{sub("[Aa]m*a","-",$1); print $1}'
-notAma.txt
# ls -1 | awk '{gsub("[Aa]m*a","-",$1); print $1}'
-not-.txt

As you can see, sub only substitutes the first instance of the regular expression, while gsub substitutes all.
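
Two more details of sub and gsub are handy in practice: both return the number of substitutions made, and an & in the replacement string stands for the text that was matched. A minimal sketch (sub_demo is an illustrative name):

```shell
sub_demo() {
    awk '{
        count = gsub("[Aa]m*a", "<&>", $1)   # & stands for the matched text
        print count, $1
    }'
}

echo amanotAma.txt | sub_demo
```

With the file name from the example above, this marks both matches and reports that two substitutions were made.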

The function split splits a string into tokens, using a regular expression as separator, and stores the tokens in an array. It returns the number of tokens found.

# awk 'BEGIN {count=split("An awk tutorial",array," "); \
? while (++item<=count) print array[item];}'
An
awk
tutorial
# awk 'BEGIN {count=split("An*&awk&&*tutorial",array,"[*&]+"); \
? while (++item<=count) print array[item];}'
An
awk
tutorial

In the first example, " " was used as the token separator, while in the second example "[*&]+" is used, which means any run of one or more '*' and '&' characters. Note that the result of split is stored in a variable called count rather than length, since length is the name of a built-in function.

The function substr returns a substring of a given string, starting at a given position and containing a given number of characters. If the number of characters is not given, it returns the rest of the string.

# awk 'BEGIN {print substr("An awk tutorial",4,3); \
? print substr("An awk tutorial",4);}'
awk
awk tutorial

Numeric functions

The following numeric functions are provided: atan2, cos, sin, exp, log, sqrt, int, rand, and srand.
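
A few of these functions in action, as a sketch (numeric_demo is an illustrative name, and the seed 42 is arbitrary). Since the values returned by rand differ between awk implementations, the example only checks that the value falls in the expected range.

```shell
numeric_demo() {
    awk 'BEGIN {
        print int(7.9), int(-7.9)        # int truncates toward zero: 7 -7
        print sqrt(16), exp(0), log(1)   # 4 1 0
        srand(42)                        # seed the random number generator
        r = rand()                       # a number with 0 <= r < 1
        print (r >= 0 && r < 1)          # 1
    }'
}

numeric_demo
```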

awk provides two expressions to increment the value of a variable by one: the pre-increment and the post-increment operators. The pre-increment expression ++variable increments the value of a variable and returns its new value. The post-increment expression variable++ increments the value of a variable and returns the value the variable had before the increment. Look at the example:

# awk 'BEGIN {var=5; print ++var; print var}'
6
6
# awk 'BEGIN {var=5; print var++; print var}'
5
6

In the first example, ++var increments the variable from 5 to 6, and returns 6. In the second example, var++ increments the variable from 5 to 6, and returns 5.

Special variables

There are several variables that are available in the awk programming language. Only the most useful ones will be explained in this tutorial.

The variable FILENAME contains the name of the file being processed at the time. It is useful when processing several files.

# grep -c param *
other.txt:0
xml.txt:5
# awk '/param/ {file[FILENAME]++} \
? END {for (elem in file) print elem " contains " file[elem] \
? " line(s) with param"}' *
xml.txt contains 5 line(s) with param

The variable FNR contains the current record number in the current file. The variable NR contains the number of records processed from the beginning of the awk program.

# cat xml.txt
<input>
</input>
# cat xml2.txt
<param>cups</param>
# awk '{print FILENAME ":" NR}' xml.txt xml2.txt
xml.txt:1
xml.txt:2
xml2.txt:3
# awk '{print FILENAME ":" FNR}' xml.txt xml2.txt
xml.txt:1
xml.txt:2
xml2.txt:1

The variable NF contains the number of fields (tokens) in the current record. You may recall the previous example:

# ls -l
total 4
drwxrwxrwx 2 alon other 512 Sep 30 14:10 in
drwxrwxrwx 2 alon other 512 Sep 28 13:24 out
# ls -l | awk '{token=0; while(++token<=NF) \
? printf "["token":"$token"]"; print ""}'
[1:total][2:4]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:30][8:14:10][9:in]
[1:drwxrwxrwx][2:2][3:alon][4:other][5:512][6:Sep][7:28][8:13:24][9:out]

Other resources

This tutorial does not cover all that awk has to offer, only the features that I use most often. So here we go with the last example, and probably most useful one:

# man nawk
User Commands nawk(1)

NAME
nawk - pattern scanning and processing language

SYNOPSIS
/usr/bin/nawk [-F ERE] [-v assignment] 'program' |
-f progfile... [argument...]
/usr/xpg4/bin/awk [-F ERE] [-v assignment...] 'program' |
-f progfile... [argument...]

DESCRIPTION
The /usr/bin/nawk and /usr/xpg4/bin/awk utilities execute
programs written in the nawk programming language, which is
...
...

There you will find all the information missing in this tutorial. Observe that the man page is nawk, and not awk. If you run man awk you will get the man page for the Solaris version of awk, which is not POSIX compliant.

If you want to become an awk guru, I would recommend the following book: sed & awk, by Dale Dougherty & Arnold Robbins (O'Reilly)