Do it better with awk 2

January 12th, 2008

Note: The log file format used below is the same as in my earlier article, Do it better with awk 1.

Today I was able to meet Bryan of Guru Labs. During our conversation he posed the following question: “Find the 3rd field in a file consisting of space separated fields, the first being an ip address, in the range 192.168.1-2.1-255. There may be lines in the file containing invalid ip addresses.”

I used grep to find the lines and then used awk to find the field. For example:

$ egrep '^192\.168\.[1-2]\.([1-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]) ' access_log | \
awk '{print $(NF-1)}'
200
304
304
...

He pointed out that, while this works, there is no reason to invoke grep. He is certainly correct. Indeed, awk is all powerful! The basic form of an awk program is:

awk 'pattern { command }'

In its most common and simple usage, to print a field delimited by spaces:

awk '{print $3}'

Here you are specifying no pattern, which matches every line. To solve the problem posed by Bryan, simply specify the pattern and eliminate grep from the pipeline. Here is the equivalent awk command:

$ awk '/^192\.168\.[1-2]\.([1-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]) / {print $(NF-1)}' \
 access_log
200
304
304
...
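Because the file may contain invalid addresses, it can be easier to check the octets as numbers than to encode the ranges in a regular expression. The following is only a sketch under that assumption, not what Bryan or I ran: the sample lines are made up, and the variable names (sample, codes, o) are mine.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.2.255 - - [01/Jan/2008:00:01:09 -0600] "GET / HTTP/1.1" 200 38856
192.168.1.999 - - [01/Jan/2008:00:02:00 -0600] "GET /bad HTTP/1.1" 404 120
192.168.3.7 - - [01/Jan/2008:00:03:00 -0600] "GET /other HTTP/1.1" 500 0
10.0.0.1 - - [01/Jan/2008:00:04:00 -0600] "GET / HTTP/1.1" 200 38856'

codes=$(printf '%s\n' "$sample" | awk '
  # Only consider first fields made of four dot-separated groups of digits.
  $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
    split($1, o, ".")
    # Adding 0 forces a numeric comparison of each octet.
    if (o[1]+0 == 192 && o[2]+0 == 168 &&
        o[3]+0 >= 1 && o[3]+0 <= 2 &&
        o[4]+0 >= 1 && o[4]+0 <= 255)
      print $(NF-1)
  }')
echo "$codes"
```

With this sample, only the first two lines fall inside 192.168.1-2.1-255, so only their response codes are printed.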

Awk has some extremely powerful selection operators. Here I am using the ~ operator to match the fourth field from the right (the resource) against ^/man, and printing the matched field:

$ awk '$(NF-3) ~ /^\/man/ {print $(NF-3)}' access_log
/man/cmd/info
/man/cmd/Mail
/man/s/Z
/man/cmd/mv
...

This invocation uses the !~ operator to select lines where the resource does not match the pattern ^/man:

$ awk '$(NF-3) !~ /^\/man/ {print $(NF-3)}' access_log
/feed/
/feed
/robots.txt
/10-linux-commands-youve-never-used.html
...
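Since an awk program can hold several pattern and action pairs, ~ and !~ can be used side by side in a single pass. Here is a small sketch of that idea. The sample lines and the counter names (man, other) are made up for illustration.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /man/cmd/info HTTP/1.1" 200 4096
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /man/cmd/mv HTTP/1.1" 200 2048
192.168.1.5 - - [01/Jan/2008:00:03:11 -0600] "GET /robots.txt HTTP/1.1" 200 26'

summary=$(printf '%s\n' "$sample" | awk '
  $(NF-3) ~  /^\/man/ { man++ }      # resource starts with /man
  $(NF-3) !~ /^\/man/ { other++ }    # everything else
  END { print "man:", man + 0, "other:", other + 0 }')
echo "$summary"
```

Adding 0 in the END block makes a counter that was never incremented print as 0 instead of an empty string.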

Here I am selecting lines where the response code $(NF-1) is greater than or equal to 200, but less than 400 and printing the resource and response code. I use awk’s boolean “and” operator && to perform this operation:

$ awk '$(NF-1) >= 200 && $(NF-1) <= 399 {print $(NF-3), $(NF-1)}' access_log
/man/cmd/info 200
/feed/ 304
/feed 304
...
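When you want every range of response codes rather than just one, a chain of && comparisons gets long. One possible alternative, sketched here with made-up sample lines, is to compute the class of the code (2xx, 3xx, and so on) and count per class. The names class and counts are my own.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /man/cmd/info HTTP/1.1" 200 4096
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.3 - - [01/Jan/2008:00:01:40 -0600] "GET /feed HTTP/1.1" 304 -
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /missing HTTP/1.1" 404 120
192.168.1.5 - - [01/Jan/2008:00:03:11 -0600] "GET /broken HTTP/1.1" 500 0
192.168.1.6 - - [01/Jan/2008:00:04:12 -0600] "GET / HTTP/1.1" 200 38856'

classes=$(printf '%s\n' "$sample" | awk '
  # int(304 / 100) is 3, so the class becomes the string "3xx".
  { class = int($(NF-1) / 100) "xx"; counts[class]++ }
  END { for (c in counts) print c, counts[c] }' | sort)
echo "$classes"
```

The output is piped through sort because awk does not promise any particular order when looping over an array.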

The following example uses the boolean “or” operator || to print lines where the resource matches ^/feed or ^/sitemap:

$ awk '$(NF-3) ~ /^\/feed/ || $(NF-3) ~ /^\/sitemap/ {print $0}' access_log
192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed HTTP/1.1" 304 -
...
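When both sides of the || test the same field, the two patterns can also be folded into one regular expression using alternation. The sketch below runs both forms over a few made-up lines so you can compare them. Note that a pattern with no action prints the whole line, the same as {print $0}.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /sitemap.xml HTTP/1.1" 200 5120
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /robots.txt HTTP/1.1" 200 26'

# Two separate matches joined with ||.
with_or=$(printf '%s\n' "$sample" | awk '$(NF-3) ~ /^\/feed/ || $(NF-3) ~ /^\/sitemap/')

# One match using alternation inside the regular expression.
with_alt=$(printf '%s\n' "$sample" | awk '$(NF-3) ~ /^\/(feed|sitemap)/')

printf '%s\n' "$with_alt"
```

Both commands select the /feed/ and /sitemap.xml lines and skip /robots.txt.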

Do it better with awk 1

January 10th, 2008

If you are a system administrator or developer, you need to process log files to get a better grasp of what is happening. Many people use Perl or Python to help with this task. However, using one of the P languages is often overkill. Furthermore, every single day I am on a machine that I cannot make changes to, and thus cannot use my helper scripts. Awk, however, has the tools to solve most on-the-fly log processing problems directly from the command line. In addition, awk can provide a more concise and faster solution than the pipeline of cut, grep, sort, and other commands you are currently using.

In this article, this is the format of the file I am working with:

$ tail -n 1 access_log-2008-01
1.1.1.1 - - [10/Jan/2008:17:26:51 -0600] "GET / HTTP/1.1" 200 38856

Basically what we have here is: ip address, date, request, response code, response size. (Ignoring the dashes after the ip address.)

How would you find the largest response sent by your HTTP server? My typical solution has always been:

$ awk '{print $NF}' access_log-2008-01 | egrep -v '\-'  | sort -n | tail -n 1
10678272

However, there is clearly a better solution.

By default, awk splits input lines on whitespace, and assigns the entire line to $0, each field to $n, and the number of fields to NF. See this example:

$ echo a b c d e f | awk '{print $0}'
a b c d e f
$ echo a b c d e f | awk '{print $1}'
a
$ echo a b c d e f | awk '{print $2}'
b
$ echo a b c d e f | awk '{print NF}'
6

Note that you can print the last field with $NF, since NF holds the number of the last field:

$ echo a b c d e f | awk '{print $NF}'
f

Or print the second field from the end:

$ echo a b c d e f | awk '{print $(NF-1)}'
e
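To tie this back to the log format, here is the sample line from above run through the same kind of commands. This is just an illustration. It shows why counting from the end is convenient: the request line and date contain spaces, so the interesting fields are easiest to address relative to NF.

```shell
line='1.1.1.1 - - [10/Jan/2008:17:26:51 -0600] "GET / HTTP/1.1" 200 38856'

fields=$(echo "$line" | awk '{
  print NF        # number of whitespace-separated fields: 10
  print $1        # ip address: 1.1.1.1
  print $(NF-3)   # requested resource: /
  print $(NF-1)   # response code: 200
  print $NF       # response size: 38856
}')
echo "$fields"
```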

Look at my example again:

$ awk '{print $NF}' access_log-2008-01 | egrep -v '\-'  | sort -n | tail -n 1
10678272

That solution starts four processes (awk, egrep, sort, and tail) and passes the data through each of them in turn. That is exceedingly inefficient! How about this:

$ awk '{if ($NF > max) { max = $NF;}} END {print max}' access_log-2008-01
10678272

This starts one process and reads the data only once. In English, that command says: for each line, if the last field is greater than max, store it in the variable “max”. Once we have processed all the lines, print max.
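One caveat: some lines have a dash instead of a size, and awk compares a non-numeric field as a string. A slightly more defensive variant, sketched below on made-up sample lines, adds 0 to force a numeric comparison (a dash becomes 0) and also remembers which resource produced the largest response. The variable url is my own addition.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET / HTTP/1.1" 200 38856
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /media/centos_5.0_boot.iso HTTP/1.1" 200 10678272
192.168.1.5 - - [01/Jan/2008:00:03:11 -0600] "GET /robots.txt HTTP/1.1" 200 26'

largest=$(printf '%s\n' "$sample" | awk '
  # $NF + 0 converts the field to a number; "-" converts to 0.
  $NF + 0 > max { max = $NF + 0; url = $(NF-3) }
  END { print max, url }')
echo "$largest"
```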

Which command do you suppose is faster?

$ time awk '{print $NF}' access_log-2008-01 | egrep -v '\-'  | sort -n | tail -n 1
10678272
real    0m1.107s
user    0m1.070s
sys     0m0.037s

$ time awk '{if ($NF > max) { max = $NF;}} END {print max}' access_log-2008-01
10678272
real    0m0.207s
user    0m0.194s
sys     0m0.012s

Experts state that “1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay.” That log file is only 12MB in size, and there is already a difference in speed that you can notice at the terminal. Imagine if the log were 300MB.

Awk also has extremely accessible associative arrays. Here I use an array to count HTTP response codes:

$ awk '{counts[$(NF-1)]+=1}; END {for(code in counts) print code, counts[code]}' \
access_log-2008-01
206 177
301 1212
302 302
304 5051
403 5
200 82539
404 906
405 1
500 183

The previous command in English says: for each line, using the second-to-last field as the index, increment the matching array element. Once we have processed all lines, loop through the array, assigning each index to “code”, and print the code along with its count.
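The same array can feed a small report. As a sketch (with made-up sample lines), the END block below uses NR, which by then holds the total number of lines read, to print each code's share of all requests.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET / HTTP/1.1" 200 38856
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /robots.txt HTTP/1.1" 200 26
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.5 - - [01/Jan/2008:00:03:11 -0600] "GET / HTTP/1.1" 200 38856'

report=$(printf '%s\n' "$sample" | awk '
  { counts[$(NF-1)]++ }
  END {
    # Print: code, number of requests, percentage of all requests.
    for (code in counts)
      printf "%s %d %.1f%%\n", code, counts[code], 100 * counts[code] / NR
  }' | sort)
echo "$report"
```

The loop order of for (code in counts) is not defined, which is why the output goes through sort.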

Let's count the number of requests for each URL:

$ awk '{counts[$(NF-3)]+=1}; END {for(url in counts) print counts[url], url}' \
access_log-2008-01 | sort -n
...output removed...
796 /media/centos5.0_install/common/AA-bios.jpg
846 /robots.txt
1063 /media/misc/why-bad-interpreter-premature-end-of-script-headers.png
1425 /media/10-linux-commands-youve-never-used/mkfifo-write-to-pipe.png
1443 /media/10-linux-commands-youve-never-used/read-from-pipe.png
1629 /
2066 /feed/
3073 /10-linux-commands-youve-never-used.html
3909 /wp2.3/wp-content/themes/minn-01/style.css
6989 /favicon.ico

Now let's sum the response sizes for each URL and display the totals in MB:

$ awk '{sizes[$(NF-3)]+=$NF}; END {for(url in sizes) print (sizes[url]/1024/1024) "MB", url}' \
access_log-2008-01  | sort -n
...output removed...
68.6784MB /media/centos5.0_install/gui_common/AQ-install-in-progress-3.png
72.0453MB /media/centos5.0_install/gui_common/AP-install-in-progress-2.png
74.0067MB /media/centos5.0_install/gui_common/AT-setup-agent-welcome.png
74.6089MB /media/centos5.0_install/gui_common/AV-setup-agent-firewall-r-u-sure.png
78.2652MB /media/centos5.0_install/gui_common/BA-setup-agent-sound-card.png
80.3148MB /media/centos5.0_install/gui_common/AG-bootloader-configuration.png
85.8359MB /media/centos5.0_install/gui_common/AI-set-timezone.png
101.836MB /media/centos_4.4_boot.iso
137.622MB /
263.253MB /media/centos_5.0_boot.iso
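If you want a fixed number of decimal places, printf gives more control over the output than print. The following is a sketch on made-up sample lines, not a replacement for the command above. The format string and the LC_ALL=C setting for sort are my own choices.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET / HTTP/1.1" 200 1048576
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET / HTTP/1.1" 200 1048576
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /media/boot.iso HTTP/1.1" 200 5242880
192.168.1.5 - - [01/Jan/2008:00:03:11 -0600] "GET /feed/ HTTP/1.1" 304 -'

totals=$(printf '%s\n' "$sample" | awk '
  { sizes[$(NF-3)] += $NF }    # a size of "-" adds 0
  END {
    for (url in sizes)
      printf "%.2f MB %s\n", sizes[url] / 1024 / 1024, url
  }' | LC_ALL=C sort -n)
echo "$totals"
```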

Let's do the same for IP addresses:

$ awk '{counts[$1]+=1}; END {for(ip in counts) print counts[ip], ip}' \
access_log-2008-01 | sort -n
...output removed...
378 67.202.20.7
402 65.214.45.100
476 195.225.177.39
493 87.207.147.201
702 66.150.96.121
704 213.239.195.172
968 82.150.18.3
1335 65.28.61.246
2330 66.249.73.75
2883 71.63.249.40

$ awk '{sizes[$1]+=$NF}; END {for(ip in sizes) print (sizes[ip]/1024/1024) "MB", ip}' \
access_log-2008-01 | sort -n
...output removed...
20.9338MB 61.64.209.144
21.8517MB 116.71.182.210
23.4265MB 85.102.126.48
31.5194MB 213.239.195.172
32.732MB 67.176.123.158
37.9046MB 66.249.73.75
56.1901MB 71.63.249.40
57.9892MB 67.202.20.7
78.6117MB 65.28.61.246

Sum the size of all responses by ip address if the response code is 200:

$ awk '$(NF-1) == 200 {sizes[$1]+=$NF}; END {for(ip in sizes) print (sizes[ip]/1024/1024) "MB", ip}' \
access_log-2008-01 | sort -n
...output removed...
16.5405MB 220.181.38.245
16.7031MB 207.67.117.178
16.7661MB 128.227.0.66
16.9171MB 67.176.123.158
18.2246MB 71.72.54.173
31.5194MB 213.239.195.172
37.3774MB 66.249.73.75
53.6944MB 71.63.249.40
57.9885MB 67.202.20.7
76.9965MB 65.28.61.246

The command in English: for each line, if the response code is 200 ($(NF-1)), then increment our array at index ip address ($1), by response size ($NF).
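Two arrays indexed by the same key can be combined in the END block. As a sketch of that idea, the command below keeps a count next to the sum and prints the average size of the 200 responses for each ip address. The sample lines and the array name hits are made up for illustration.

```shell
# Hypothetical sample data in the same format as access_log.
sample='192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /a HTTP/1.1" 200 100
192.168.1.2 - - [01/Jan/2008:00:01:09 -0600] "GET /b HTTP/1.1" 200 300
192.168.1.2 - - [01/Jan/2008:00:02:10 -0600] "GET /missing HTTP/1.1" 404 50
192.168.1.3 - - [01/Jan/2008:00:03:11 -0600] "GET /c HTTP/1.1" 200 1000'

averages=$(printf '%s\n' "$sample" | awk '
  # Only lines with response code 200 update the arrays.
  $(NF-1) == 200 { sizes[$1] += $NF; hits[$1]++ }
  END {
    # Print: ip address, number of 200 responses, average size in bytes.
    for (ip in sizes) print ip, hits[ip], sizes[ip] / hits[ip]
  }' | sort)
echo "$averages"
```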

Any questions, comments, or suggestions? I will be writing a second article on some other features of awk in the near future.