Splitting Strings Natively with the Shell: Native vs Native
December 9th, 2009
In my previous post on why to split strings with bash itself, I used set to split the string.
This was much faster than using a sub-shell and awk or cut. However, we can do better! The read command accepts a list of variables and splits the input across them. Combined with a per-command variable assignment, we can write an even more elegant solution.
The magic is here:
while IFS=: read username x uid gid gecos home shell
First, we set IFS=: only for the execution of read, so there is no need to reset it once we are done splitting the string. Second, we read each field (separated by : via IFS) directly into a variable.
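To make the scoping concrete, here is a minimal sketch that splits a single line the same way. The sample line is made up rather than read from a real /etc/passwd, and the -r flag (which stops read from interpreting backslashes) is my addition, not part of the loop above.

```shell
#!/bin/bash
# Made-up sample line in /etc/passwd format (not read from a real system).
sample='daemon:x:2:2:daemon:/sbin:/sbin/nologin'
ifs_before=$IFS

# IFS=: is in effect only while this one read command runs.
IFS=: read -r username x uid gid gecos home shell <<EOF
$sample
EOF

echo "user=$username uid=$uid shell=$shell"

# The shell's own IFS was never changed, so nothing has to be restored.
if [ "$IFS" = "$ifs_before" ]; then
    echo "IFS is unchanged"
fi
```

The check at the end is only there for illustration; in a real script you would simply carry on, since IFS was never modified.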
Below is the script we will use to compare the two methods. You will notice I had to raise the iteration count to about 100 in order to see a difference in execution speed:
[root@sandbox ~]# cat ifs-test2.sh
#!/bin/bash
split_words_native() {
# execute 101 times ({0..100} is inclusive)
for i in {0..100}
do
while read line
do
oldIFS=$IFS
IFS=:
set -- $line
IFS=$oldIFS
# at this point $1 is the username, $3
# is the uid, and $7 is the shell
if [[ $3 -gt 10 ]] && [[ '/sbin/nologin' == "$7" ]]
then
echo $1
fi
done < /etc/passwd
done
}
split_words_native_read() {
# execute 101 times ({0..100} is inclusive)
for i in {0..100}
do
while IFS=: read username x uid gid gecos home shell
do
if [[ $uid -gt 10 ]] && [[ '/sbin/nologin' == "$shell" ]]
then
echo $username
fi
done < /etc/passwd
done
}
echo "---Native---"
time split_words_native >/dev/null
echo -e "\n---Read---"
time split_words_native_read >/dev/null
Using read is more elegant and a little faster (about 18% in this run):
[root@sandbox ~]# ./ifs-test2.sh
---Native---
real 0m0.179s
user 0m0.168s
sys 0m0.010s

---Read---
real 0m0.147s
user 0m0.135s
sys 0m0.012s
Splitting Strings Natively with the Shell: Why
December 9th, 2009
Today I want to discuss splitting strings into tokens or “words”. I previously discussed how to do this with the IFS variable and promised a more in-depth discussion. Today, I will make the case for WHY to use IFS to split strings as opposed to using a subshell combined with awk or cut.
I wrote this script, which reads the /etc/passwd file line by line and prints the username of any user that has a UID greater than 10 and the shell /sbin/nologin. Each test function performs this task 10 times to increase the length of the test:
[root@sandbox ~]# cat ifs-test.sh
#!/bin/bash
split_words_cut() {
# execute 10 times
for i in {0..9}
do
while read line
do
# get uid
id=$(echo $line | cut -d: -f3)
if [[ $id -gt 10 ]]
then
# get shell
shell=$(echo $line | cut -d: -f7)
if [[ '/sbin/nologin' == "$shell" ]]
then
# print username
echo $line | cut -d: -f1
fi
fi
done < /etc/passwd
done
}
split_words_awk() {
# execute 10 times
for i in {0..9}
do
while read line
do
# get uid
id=$(echo $line | awk -F: '{print $3}')
if [[ $id -gt 10 ]]
then
# get shell
shell=$(echo $line | awk -F: '{print $NF}')
if [[ '/sbin/nologin' == "$shell" ]]
then
# print username
echo $line | awk -F: '{print $1}'
fi
fi
done < /etc/passwd
done
}
split_words_native() {
# execute 10 times
for i in {0..9}
do
while read line
do
oldIFS=$IFS
IFS=:
set -- $line
IFS=$oldIFS
# at this point $1 is the username, $3
# is the uid, and $7 is the shell
if [[ $3 -gt 10 ]] && [[ '/sbin/nologin' == "$7" ]]
then
echo $1
fi
done < /etc/passwd
done
}
echo -e "---Cut---"
time split_words_cut >/dev/null
echo -e "\n---Awk---"
time split_words_awk >/dev/null
echo -e "\n---Native---"
time split_words_native >/dev/null
As you can see, using the shell itself is roughly 65 to 70 times faster than the subshell awk/cut method, which is nearly two orders of magnitude:
[root@sandbox ~]# ./ifs-test.sh
---Cut---
real 0m1.184s
user 0m0.118s
sys 0m0.676s

---Awk---
real 0m1.279s
user 0m0.151s
sys 0m0.750s

---Native---
real 0m0.018s
user 0m0.014s
sys 0m0.003s
This is why you should use IFS when splitting strings.
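If you want the set-based split without the save and restore dance, one variation is to wrap it in a function. This is only a sketch: split_fields is a hypothetical name, and both the local IFS and the set -f guard are my additions rather than part of the benchmark script.

```shell
#!/bin/bash
# split_fields is a hypothetical helper, not part of the benchmark above.
# It prints fields 1, 3 and 7 of a colon-separated line.
split_fields() {
    local IFS=:      # the change is confined to this function
    set -f           # disable globbing so a field such as * is not expanded
    set -- $1        # unquoted on purpose: split the line on IFS
    set +f           # re-enable globbing (assumes it was on before the call)
    printf '%s %s %s\n' "$1" "$3" "$7"
}

split_fields 'daemon:x:2:2:daemon:/sbin:/sbin/nologin'
```

Because IFS is declared local, the caller's IFS is untouched once the function returns.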
Reading a file, line by line
November 24th, 2009
nixcraft has a link on how to read a file line by line. The method is a great way to read a file, but there are some trouble spots I thought I would point out.
In the script, the special variable IFS is set:
# set the Internal Field Separator to a pipe symbol
IFS='|'
This tells the read command to split “cyberciti.biz|74.86.48.99” into “cyberciti.biz” and “74.86.48.99” and thus fill both the domain and ip variables here:
while read domain ip
Using BASH to split strings is much faster than doing something like this:
while read line
do
domain=$(echo $line | awk -F'|' '{print $1}')
ip=$(echo $line | awk -F'|' '{print $2}')
New script writers typically write loops like that. However, setting IFS and forgetting to reset the special variable can cause some odd problems in longer scripts. For example, let's say you needed to read a second file later on in the script, this one delimited by spaces. For simplicity, I will take the same file and just replace the pipe characters with spaces.
/tmp/domains-using-space.txt
root@b92 [~]# cat /tmp/domains-using-space.txt
cyberciti.biz 74.86.48.99
nixcraft.com 75.126.168.152
theos.in 75.126.168.153
cricketnow.in 75.126.168.154
vivekgite.com 75.126.168.155
Now, here is my new script:
#!/bin/ksh
# set the Internal Field Separator to a pipe symbol
IFS='|'
# file name
file=/tmp/domains.txt
# use while loop to read domain and ip
while read domain ip
do
print "$domain has address $ip"
done <"$file"
echo ------------------------
file=/tmp/domains-using-space.txt
# use while loop to read domain and ip
while read domain ip
do
print "$domain has address $ip"
done <"$file"
As you can see, the output of the second loop is incorrect. IFS is still set to the pipe symbol, so read cannot split on the space and puts the whole line into domain, leaving ip empty:
root@b92 [~]# ./test.sh
cyberciti.biz has address 74.86.48.99
nixcraft.com has address 75.126.168.152
theos.in has address 75.126.168.153
cricketnow.in has address 75.126.168.154
vivekgite.com has address 75.126.168.155
------------------------
cyberciti.biz 74.86.48.99 has address
nixcraft.com 75.126.168.152 has address
theos.in 75.126.168.153 has address
cricketnow.in 75.126.168.154 has address
vivekgite.com 75.126.168.155 has address
By saving and resetting the special variable IFS, we can eliminate this problem:
#!/bin/ksh
# file name
file=/tmp/domains.txt
# set the Internal Field Separator to a pipe symbol
oldIFS="$IFS"
IFS='|'
# use while loop to read domain and ip
while read domain ip
do
print "$domain has address $ip"
done <"$file"
IFS="$oldIFS"
echo ------------------------
file=/tmp/domains-using-space.txt
# use while loop to read domain and ip
while read domain ip
do
print "$domain has address $ip"
done <"$file"
The output from the new script, which saves and resets IFS:
cyberciti.biz has address 74.86.48.99
nixcraft.com has address 75.126.168.152
theos.in has address 75.126.168.153
cricketnow.in has address 75.126.168.154
vivekgite.com has address 75.126.168.155
------------------------
cyberciti.biz has address 74.86.48.99
nixcraft.com has address 75.126.168.152
theos.in has address 75.126.168.153
cricketnow.in has address 75.126.168.154
vivekgite.com has address 75.126.168.155
In short, IFS is a great way to split strings. My next article will be a more in-depth discussion of this topic. In the meantime, one thing to remember when using IFS is to always save and reset this variable.
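An alternative to saving and resetting is to hand the separator to read alone, so the rest of the script never sees the change. Here is a rough sketch of the two-file scenario. It is written for bash rather than ksh, print_addresses is a hypothetical helper, and the data is fed in with printf instead of being read from the files in /tmp.

```shell
#!/bin/bash
# print_addresses is a hypothetical helper. It reads "domain<sep>ip" lines
# from standard input and takes the separator as its first argument.
print_addresses() {
    local sep=$1 domain ip
    # IFS is set for each read call only; the script-wide IFS never changes.
    while IFS="$sep" read -r domain ip; do
        echo "$domain has address $ip"
    done
}

printf '%s\n' 'cyberciti.biz|74.86.48.99' 'nixcraft.com|75.126.168.152' | print_addresses '|'
echo ------------------------
printf '%s\n' 'cyberciti.biz 74.86.48.99' 'nixcraft.com 75.126.168.152' | print_addresses ' '
```

Both halves print correctly because the pipe separator from the first call never leaks into the second.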
The best in command line xml: XMLStarlet
June 23rd, 2008
Quite some time ago I wrote about using xsltproc to process XML on the command line. Thankfully, someone pointed out XMLStarlet. I now use XMLStarlet almost every day. I work with a variety of REST-based APIs to gather information. XMLStarlet along with a simple for loop or xargs gives you an exceedingly powerful set of tools.
Here is a quick introduction into the power of XMLStarlet. This is just a teaser as I cannot share the data I work with. However, you should be able to see the power of this tool.
All the links from my RSS feed:
$ curl -s 'http://bashcurescancer.com/rss/' | xml sel -t -m '//link' -v '.' -n
http://bashcurescancer.com
http://bashcurescancer.com/processing-xml-on-the-command-line.html
http://bashcurescancer.com/do-not-close-stderr.html
http://bashcurescancer.com/prepend-to-a-file-with-sponge-from-moreutils.html
http://bashcurescancer.com/bug-in-curl-is-fixed.html
http://bashcurescancer.com/using-kill-to-see-if-a-process-is-alive.html
http://bashcurescancer.com/performance-testing-with-curl.html
http://bashcurescancer.com/new-command-prepend.html
http://bashcurescancer.com/shell-function-which-webserver-does-that-site-run.html
http://bashcurescancer.com/exposing-command-line-programs-as-web-services.html
http://bashcurescancer.com/wrapping-dynamic-languages-in-shell-without-an-extra-script.html
Or how about “Title: link”
$ curl -s 'http://bashcurescancer.com/rss/' | xml sel -t -m '//item' -v 'title' -o ': ' -v 'link' -n
Processing XML on the Command Line: http://bashcurescancer.com/processing-xml-on-the-command-line.html
Do not close stderr: http://bashcurescancer.com/do-not-close-stderr.html
prepend to a file with sponge from moreutils: http://bashcurescancer.com/prepend-to-a-file-with-sponge-from-moreutils.html
Bug in Curl is fixed: http://bashcurescancer.com/bug-in-curl-is-fixed.html
using kill to see if a process is alive: http://bashcurescancer.com/using-kill-to-see-if-a-process-is-alive.html
Performance testing - with curl: http://bashcurescancer.com/performance-testing-with-curl.html
New command: prepend: http://bashcurescancer.com/new-command-prepend.html
Shell Function - Which Webserver Does That Site Run?: http://bashcurescancer.com/shell-function-which-webserver-does-that-site-run.html
Exposing command line programs as web services: http://bashcurescancer.com/exposing-command-line-programs-as-web-services.html
Wrapping dynamic languages in shell without an extra script: http://bashcurescancer.com/wrapping-dynamic-languages-in-shell-without-an-extra-script.html
You may need to do some reading on XPath and XSLT stylesheets to use the full power of the tool.
Do not close stderr
April 22nd, 2008
A few years ago, I wrote a post commenting on how ugly this was:
$ someprog 2>/dev/null
I was nearly imploring the reader to close stderr:
$ someprog 2>&-
Some very knowledgeable anonymous commenter explained why that was a bad idea. At the time, I didn’t understand exactly what they were saying. As such, I deleted the post. Yesterday, for no particular reason, the implications of closing stderr popped into my head. In the shower no less.
I wrote a simple little C program named do-not-close-stderr.c. It takes two parameters, a string you want written to a file and the file you want said string written to. After opening the file, it prints “some kind of warning message” to stderr. Here we are:
$ gcc -Wall do-not-close-stderr.c -o do-not-close-stderr
$ ./do-not-close-stderr "Brock was here." output
Some kind of warning message.
$ cat output
Brock was here.
Now let's close standard error when executing:
$ ./do-not-close-stderr "Brock was here." output 2>&-
$ cat output
Some kind of warning message.
Brock was here.
With stderr closed, file descriptor 2 is free, so opening the output file hands the program descriptor 2. The warning written to stderr therefore ends up in the file itself.
Thanks to whoever that commenter was.
prepend to a file with sponge from moreutils
April 17th, 2008
A few weeks ago I wrote about a tool which helps you easily prepend to a file. I submitted prepend to moreutils and Joey was kind enough to point out this could be done with sponge. sponge reads standard input and, when done, writes it to a file:
Probably the most general purpose tool in moreutils so far is
sponge(1), which lets you do things like this:
% sed "s/root/toor/" /etc/passwd | grep -v joey | sponge /etc/passwd
Two days ago Joey released version 0.29 of moreutils including a patch by yours truly (with much help from Joey).
sponge: Handle large data sizes by using a temp file rather than by consuming arbitrary amounts of memory. Patch by Brock Noland. – version 0.29 changelog
Also, on a non-command line note, I found a video on Joey’s site which I thought was pretty cool, Joey Learns to Fly.
using kill to see if a process is alive
April 9th, 2008
I am making some changes to the moreutils sponge command. Sponge provides a method of prepending which is less specialized than my prepend util. However, it has trouble with large amounts of input.
Regardless, while testing my changes, I want to watch it operate. Normally, you would just do so from a second terminal. That is a pain. kill -0 can be very useful for this. After backgrounding the command, I assign the pid (via the variable $!) to $pid. I wrapped the assignment in eval with single quotes to make sure BASH did not expand $! before the background operation started, although a plain pid=$! placed after the & behaves the same way.
After that, I enter a while loop on kill -0 $pid, which will not kill $pid, but will return successfully until $pid has died:
# cat large-file-GB | ./sponge large-file-GB-copy & eval 'pid=$!'; while kill -0 $pid; do sleep 10; ls -lh large-file* /tmp/sponge.*; echo;done
[1] 7937
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 128M 2008-04-09 17:23 /tmp/sponge.JMsBWG

-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 384M 2008-04-09 17:23 /tmp/sponge.JMsBWG

-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 877M 2008-04-09 17:24 /tmp/sponge.JMsBWG

-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 20M 2008-04-09 17:24 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG

-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 413M 2008-04-09 17:25 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG

-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 836M 2008-04-09 17:25 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG

-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 920M 2008-04-09 17:25 large-file-GB-copy

[1]+ Done cat large-file-GB | ./sponge large-file-GB-copy
ls: cannot access /tmp/sponge.*: No such file or directory
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 977M 2008-04-09 17:25 large-file-GB-copy

-bash: kill: (7937) - No such process
# md5sum large-file-GB*
b5c667a723a10a3485a33263c4c2b978 large-file-GB
b5c667a723a10a3485a33263c4c2b978 large-file-GB-copy
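The one-liner above can be turned into a small reusable function. The following is a sketch rather than what I actually ran: watch_pid is a hypothetical name, sleep stands in for the long-running sponge job, and the error output of kill is discarded so the loop ends quietly.

```shell
#!/bin/bash
# watch_pid is a hypothetical helper: run a command at a fixed interval
# for as long as the given process is still alive.
watch_pid() {
    local pid=$1 interval=$2
    shift 2
    # kill -0 sends no signal; it only reports whether the pid can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        "$@"
        sleep "$interval"
    done
}

sleep 2 &        # stand-in for the long-running job
pid=$!           # pid of the most recent background job
watch_pid "$pid" 1 echo "process $pid is still running"
echo "process $pid has exited"
```

Swap the echo for something like ls -lh on the files you care about to get the same effect as the transcript.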
Performance testing – with curl
April 8th, 2008
Often I need or want to do some type of performance testing. Given my ideas on software development, I can usually do this by making simple HTTP requests. I use curl for this. You may be tempted to do this in a for loop (or worse, actually write something!):
$ time for i in {1..1000}; do curl -s "http://bashcurescancer.com/blank.html";done
real 0m23.436s
user 0m6.416s
sys 0m7.351s
Curl provides the same functionality:
$ time curl -s "http://bashcurescancer.com/blank.html?[1-1000]"
real 0m6.561s
user 0m0.294s
sys 0m0.494s
Here are the details from the curl manual:
The URL syntax is protocol dependent. You’ll find a detailed description in RFC 3986.
You can specify multiple URLs or parts of URLs by writing part sets within braces as in:
http://site.{one,two,three}.com
or you can get sequences of alphanumeric series by using [ ] as in:
ftp://ftp.numericals.com/file[1-100].txt
ftp://ftp.numericals.com/file[001-100].txt (with leading zeros)
ftp://ftp.letters.com/file[a-z].txt
No nesting of the sequences is supported at the moment, but you can use several ones next to each other:
http://any.org/archive[1996-1999]/vol[1-4]/part{a,b,c}.html
You can specify any amount of URLs on the command line. They will be fetched in a sequential manner in the specified order.
Since curl 7.15.1 you can also specify step counter for the ranges, so that you can get every Nth number or letter:
http://www.numericals.com/file[1-100:10].txt
http://www.letters.com/file[a-z:2].txt
If you specify URL without protocol:// prefix, curl will attempt to guess what protocol you might want. It will then default to HTTP but try other protocols based on often-used host name prefixes. For example, for host names starting with “ftp.” curl will assume you want to speak FTP.
Curl will attempt to re-use connections for multiple file transfers, so that getting many files from the same server will not do multiple connects / handshakes. This improves speed. Of course this is only done on files specified on a single command line and cannot be used between separate curl invokes.
This is important as it helps measure the actual change being tested. A for loop, by creating a new process every loop, will fill up your test with “local” time. Using a single curl process eliminates this – which should allow you to see the results of your test in a more transparent manner.
For example, let's say you have a change that reduces page production time. You're not sure by how much, so you decide to run 1000 tests. Eliminating a second from a 23 second test is a change of less than 5 percent, while removing a second from a 6 second test is about 17 percent.
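To put numbers on that using the two timings measured above, here is a quick sketch. percent_saved is a hypothetical helper, and awk is used only because the shell has no floating-point arithmetic of its own.

```shell
#!/bin/bash
# percent_saved is a hypothetical helper: what share of a total run time
# does a fixed saving represent?
percent_saved() {
    awk -v saved="$1" -v total="$2" 'BEGIN { printf "%.1f\n", saved / total * 100 }'
}

percent_saved 1 23.436   # for loop timing: prints 4.3
percent_saved 1 6.561    # single curl process timing: prints 15.2
```

The same one second improvement is more than three times as visible in the shorter run.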
New command: prepend
April 6th, 2008
I am utilizing Google’s project hosting to host software which I create and feel is useful or want to keep track of. I called the project Brock’s Tools. The code that led me to create this project was a command I am calling prepend 1.1. (UPDATE: See this post on sponge as it's a better general case tool.)
prepend prepends files or standard input to a file. For example, you have three files:
$ echo BROCK > a
$ echo DAVID > b
$ echo NOLAND > c
And you want to combine them into one file:
$ echo "My name is:" | prepend - a b c
$ cat c
My name is:
BROCK
DAVID
NOLAND
Or let's say you just want to append a file to itself:
$ cat a
BROCK
$ cat a >> a
cat: a: input file is output file
prepend does this:
$ prepend a
$ cat a
BROCK
BROCK
I come across situations where this would be useful quite often. Of course, prepending can be done in the shell:
$ { echo "My name is:"; cat a b c; } > tmp && mv -f tmp c
$ cat c
My name is:
BROCK
DAVID
NOLAND
However, that is unsafe and I have lost data that way. I perform this operation most often when dealing with XML. In this example, it's trivial to open the file in an editor, but with a large file, it's quite nasty to do so:
$ cat something.xml
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
$ echo "</entries>" >> something.xml
$ cat something.xml
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
</entries>
$ echo "<entries>" | prepend - something.xml
$ cat something.xml
<entries>
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
</entries>
Shell Function – Which Webserver Does That Site Run?
April 4th, 2008
I just read the following post Python – Script – Which Webserver Does That Site Run? by blogger Corey Goldberg.
I prefer the shell version:
$ what-http-server() { curl -s -I "http://$1" | awk -F': ' '/^Server:/ {print $2}'; }
$ what-http-server www.pylot.org
Apache/2.0.52
$ what-http-server() { curl -s -I "$@" | awk -F': ' '/^Server:/ {print $2}'; }
$ what-http-server www.pylot.org google.com bashcurescancer.com
Apache/2.0.52
gws
Apache/2.2.6 (Unix)
That works, but this version is more correct:
what-http-server() { curl -s -I $(for h in "$@"; do printf "http://%s " "$h"; done) | awk -F': ' '/^Server:/ {print $2}'; }
In the version which works for multiple hosts, we are letting curl assume the protocol is HTTP. This works fine most of the time. However, there are exceptions:
If you specify URL without protocol:// prefix, curl will attempt to guess what protocol you might want. It will then default to HTTP but try other protocols based on often-used host name prefixes. For example, for host names starting with “ftp.” curl will assume you want to speak FTP. – man curl
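For readability, the one-liner can be split into two small pieces that can be checked without touching the network. This is a sketch under a few assumptions: the function names use underscores and are hypothetical, the header match is made case-insensitive, and the carriage returns that HTTP headers carry are stripped, none of which the original function does.

```shell
#!/bin/bash
# http_urls prints each host with an explicit http:// prefix, one per line,
# so curl never has to guess the protocol.
http_urls() {
    local h
    for h in "$@"; do
        printf 'http://%s\n' "$h"
    done
}

# server_header reads HTTP response headers on standard input and prints
# the value of every Server header.
server_header() {
    tr -d '\r' | awk -F': ' 'tolower($1) == "server" { print $2 }'
}

# Unquoted substitution on purpose: one argument per URL (hostnames have no spaces).
what_http_server() {
    curl -s -I $(http_urls "$@") | server_header
}
```

Usage is the same as before, for example what_http_server www.pylot.org google.com, and a host such as ftp.example.com would now be queried over HTTP rather than FTP.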

