Do it better with awk 2
January 12th, 2008
Note: The log file format used below is the same as in my earlier article, Do it better with awk 1.
Today I met Bryan of Guru Labs. During our conversation he posed the following question: “Find the 3rd field in a file consisting of space separated fields, the first being an IP address, in the range 192.168.1-2.1-255. There may be lines in the file containing invalid IP addresses.”
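I used grep to select the lines and then awk to print the field (the examples here print the response code, $(NF-1)). The space at the end of the pattern anchors the last octet, so an invalid address such as 192.168.1.999 is rejected. For example:
$ grep -E '^192\.168\.[12]\.([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]) ' access_log | \
awk '{print $(NF-1)}'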
200
304
304
...
He pointed out that, while this works, there is no reason to invoke grep. He is certainly correct. Indeed, awk is all powerful! The general form of an awk program is:
awk 'pattern { action }'
Its simplest and most common use is printing a whitespace-delimited field:
awk '{print $3}'
No pattern is specified here, so the action runs on every line. To solve the problem posed by Bryan, simply specify the pattern and eliminate grep from the pipeline. The octets are written without {n} repetition counts because not every awk implementation supports them. Here is the equivalent awk command:
$ awk '/^192\.168\.[12]\.([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]) / {print $(NF-1)}' \
access_log
200
304
304
...
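The regular expression above encodes a numeric range as text, which is easy to get wrong. As an alternative, here is a minimal sketch that splits the address into octets and compares them as numbers. The sample log lines and the function name status_in_range are made up for illustration; only the field layout follows the log format used in this article.

```shell
# Hypothetical sample data: two valid addresses, one invalid octet (999),
# one third octet outside 1-2, and one unrelated network.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.2.255 - - [01/Jan/2008:00:02:00 -0600] "GET /robots.txt HTTP/1.1" 200 68
192.168.1.999 - - [01/Jan/2008:00:03:00 -0600] "GET /feed HTTP/1.1" 304 -
192.168.3.7 - - [01/Jan/2008:00:04:00 -0600] "GET /feed HTTP/1.1" 404 -
10.0.0.1 - - [01/Jan/2008:00:05:00 -0600] "GET / HTTP/1.1" 200 512
EOF

# Print the response code for addresses in 192.168.1-2.1-255.
status_in_range() {
    awk '
    {
        # Split the first field into octets; n is the number of pieces.
        n = split($1, octet, ".")
        if (n != 4) next

        # Skip lines whose octets are not plain digits.
        for (i = 1; i <= 4; i++)
            if (octet[i] !~ /^[0-9]+$/) next

        # Adding 0 forces a numeric comparison in every awk.
        if (octet[1] + 0 == 192 && octet[2] + 0 == 168 &&
            octet[3] + 0 >= 1 && octet[3] + 0 <= 2 &&
            octet[4] + 0 >= 1 && octet[4] + 0 <= 255)
            print $(NF-1)
    }' "$1"
}

status_in_range "$sample_log"
```

With the sample data this prints 304 and 200. The numeric version is longer than the regular expression, but the range limits are readable at a glance and easy to change.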
Awk has some extremely powerful selection operators. Here I am using the ~ operator to match the fourth field from the right, $(NF-3) (the resource), against ^/man, and printing the matched field:
$ awk '$(NF-3) ~ /^\/man/ {print $(NF-3)}' access_log
/man/cmd/info
/man/cmd/Mail
/man/s/Z
/man/cmd/mv
...
This invocation uses the !~ operator to select lines where the resource does not match the pattern ^/man:
$ awk '$(NF-3) !~ /^\/man/ {print $(NF-3)}' access_log
/feed/
/feed
/robots.txt
/10-linux-commands-youve-never-used.html
...
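Because an awk program can hold several pattern and action pairs, both operators can be used in a single pass over the file. The following is a small sketch rather than part of the original examples: the sample log lines and the function name count_man are invented for illustration.

```shell
# Hypothetical sample data in the same layout as access_log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
192.168.1.2 - - [01/Jan/2008:00:00:05 -0600] "GET /man/cmd/info HTTP/1.1" 200 1024
192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed HTTP/1.1" 304 -
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /man/cmd/mv HTTP/1.1" 200 2048
EOF

# Count resources that start with /man and those that do not.
count_man() {
    awk '
    $(NF-3) ~  /^\/man/ { man++ }      # resource matches ^/man
    $(NF-3) !~ /^\/man/ { other++ }    # resource does not match
    END { printf "man=%d other=%d\n", man, other }
    ' "$1"
}

count_man "$sample_log"
```

With the sample data this prints man=2 other=2. Each input line is tested against both patterns, so the two counters always add up to the number of lines read.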
Here I am selecting lines where the response code $(NF-1) is greater than or equal to 200, but less than 400 and printing the resource and response code. I use awk’s boolean “and” operator && to perform this operation:
$ awk '$(NF-1) >= 200 && $(NF-1) <= 399 {print $(NF-3), $(NF-1)}' access_log
/man/cmd/info 200
/feed/ 304
/feed 304
...
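The same range test can drive something other than print. As a sketch, the function below tallies how often each response code in the 200 to 399 range occurs. The sample lines and the name tally_ok are hypothetical, and the output is piped through sort because awk does not guarantee the order of a for (key in array) loop.

```shell
# Hypothetical sample data in the same layout as access_log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
192.168.1.2 - - [01/Jan/2008:00:00:05 -0600] "GET /man/cmd/info HTTP/1.1" 200 1024
192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed HTTP/1.1" 304 -
192.168.1.4 - - [01/Jan/2008:00:02:10 -0600] "GET /missing HTTP/1.1" 404 -
192.168.1.5 - - [01/Jan/2008:00:03:10 -0600] "GET /robots.txt HTTP/1.1" 200 68
EOF

# Count requests per response code, keeping only codes from 200 to 399.
tally_ok() {
    awk '
    $(NF-1) >= 200 && $(NF-1) <= 399 { count[$(NF-1)]++ }
    END { for (code in count) print code, count[code] }
    ' "$1" | sort
}

tally_ok "$sample_log"
```

With the sample data this prints "200 2" and "304 2"; the 404 line fails the range test and is never counted.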
The following example uses the boolean “or” operator || to print lines where the resource matches ^/feed or ^/sitemap:
$ awk '$(NF-3) ~ /^\/feed/ || $(NF-3) ~ /^\/sitemap/ {print $0}' access_log
192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed HTTP/1.1" 304 -
...
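When both sides of || test the same field, the two patterns can usually be folded into one regular expression with alternation. The sketch below shows that form. The /sitemap.xml and /robots.txt lines and the function name feed_or_sitemap are made up for illustration.

```shell
# Hypothetical sample data in the same layout as access_log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
192.168.1.2 - - [01/Jan/2008:00:00:31 -0600] "GET /feed/ HTTP/1.1" 304 -
192.168.1.3 - - [01/Jan/2008:00:01:09 -0600] "GET /feed HTTP/1.1" 304 -
192.168.1.5 - - [01/Jan/2008:00:02:10 -0600] "GET /sitemap.xml HTTP/1.1" 200 4096
192.168.1.6 - - [01/Jan/2008:00:03:10 -0600] "GET /robots.txt HTTP/1.1" 200 68
EOF

# Print lines whose resource starts with /feed or /sitemap.
# A pattern with no action prints the whole line, like {print $0}.
feed_or_sitemap() {
    awk '$(NF-3) ~ /^\/(feed|sitemap)/' "$1"
}

feed_or_sitemap "$sample_log"
```

With the sample data this prints the first three lines and skips /robots.txt, which is the same result the || version gives on this input.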


August 10th, 2008 at 5:03 pm
I strongly disagree with your conclusion that you should use awk alone instead of grep | awk.
The problem is that awk is very slow. If you have huge files, or run this on a regular basis, and grep would filter out most of the lines, it is much faster to let grep reduce the input before awk sees it.
br,
pasp