Splitting Strings Natively with the Shell: Why
December 9th, 2009
Today I want to discuss splitting strings into tokens or “words”. I previously discussed how to do this with the IFS variable and promised a more in-depth discussion. In this post I will make the case for WHY to use IFS to split strings instead of a subshell combined with awk or cut.
I wrote this script, which reads the /etc/passwd file line by line and prints the username of any user who has a UID greater than 10 and a shell of /sbin/nologin. Each test function performs this task 10 times to increase the length of the test:
[root@sandbox ~]# cat ifs-test.sh
#!/bin/bash
split_words_cut() {
# execute 10 times
for i in {0..9}
do
while read line
do
# get uid
id=$(echo $line | cut -d: -f3)
if [[ $id -gt 10 ]]
then
# get shell
shell=$(echo $line | cut -d: -f7)
if [[ '/sbin/nologin' == "$shell" ]]
then
# print username
echo $line | cut -d: -f1
fi
fi
done < /etc/passwd
done
}
split_words_awk() {
# execute 10 times
for i in {0..9}
do
while read line
do
# get uid
id=$(echo $line | awk -F: '{print $3}')
if [[ $id -gt 10 ]]
then
# get shell
shell=$(echo $line | awk -F: '{print $NF}')
if [[ '/sbin/nologin' == "$shell" ]]
then
# print username
echo $line | awk -F: '{print $1}'
fi
fi
done < /etc/passwd
done
}
split_words_native() {
# execute 10 times
for i in {0..9}
do
while read line
do
oldIFS=$IFS
IFS=:
set -- $line
IFS=$oldIFS
# at this point $1 is the username, $3
# is the uid, and $7 is the shell
if [[ $3 -gt 10 ]] && [[ '/sbin/nologin' == "$7" ]]
then
echo $1
fi
done < /etc/passwd
done
}
echo -e "---Cut---"
time split_words_cut >/dev/null
echo -e "\n---Awk---"
time split_words_awk >/dev/null
echo -e "\n---Native---"
time split_words_native >/dev/null
As you can see, using the shell itself is roughly 65 to 70 times faster (nearly two orders of magnitude) than spawning awk or cut in a subshell for every field:
[root@sandbox ~]# ./ifs-test.sh
---Cut---
real 0m1.184s
user 0m0.118s
sys 0m0.676s

---Awk---
real 0m1.279s
user 0m0.151s
sys 0m0.750s

---Native---
real 0m0.018s
user 0m0.014s
sys 0m0.003s
This is why you should use IFS when splitting strings.
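The same native splitting can also be done by letting read do the work, with IFS set only for that one command so nothing has to be saved and restored. The following is a minimal sketch, not part of the benchmark above: the sample data, the function name list_nologin_users, and the field variable names are all made up for illustration.

```shell
#!/bin/bash
# Hypothetical passwd-style sample data, so the sketch does not
# depend on the contents of the real /etc/passwd.
sample_passwd='root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
alice:x:500:500:Alice:/home/alice:/bin/bash
nobody:x:99:99:Nobody:/:/sbin/nologin'

# Print users with a UID greater than 10 and a shell of /sbin/nologin.
# Reads passwd-style lines from standard input.
list_nologin_users() {
    local user pass uid gid gecos home shell
    # IFS=: applies only to this read, so the global IFS is untouched.
    while IFS=: read -r user pass uid gid gecos home shell
    do
        if [[ $uid -gt 10 && $shell == '/sbin/nologin' ]]
        then
            echo "$user"
        fi
    done
}

result=$(list_nologin_users <<< "$sample_passwd")
echo "$result"
```

To run it against the real file, the function would be called as list_nologin_users < /etc/passwd. No extra process is started per line, which is the property the timings above depend on.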


December 9th, 2009 at 11:18 pm
The overhead is most likely due to the system call overhead of spawning and starting (fork()ing and exec()ing) the awk and cut processes. It would be interesting to see how “already started” cut and awk compare with bash/IFS speedwise.
December 9th, 2009 at 11:36 pm
Yes, you are correct!
However, I do not see how that would work. One thought would be to set them up to listen on a fifo, ensure they are line buffered, and have them ignore EOF. You could then send them a single line and get a single response.
Like this:
# make fifo
mkfifo awk-input-username
mkfifo awk-output-username
awk -F: '{print $1}' < awk-input-username > awk-output-username
….
# get username
echo $line > awk-input-username &
read username < awk-output-username
Of course, they do NOT ignore EOF, so the awk process will exit after reading a single line.
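If the goal is only to avoid paying the fork and exec cost on every line, a simpler alternative to the fifo idea is to start awk once and let it read the whole file. This is a rough sketch under that assumption, not the per-line request and response setup described above; the sample data and the variable names sample_file and nologin_users are made up for illustration.

```shell
#!/bin/bash
# Hypothetical passwd-style sample data written to a temporary file.
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
alice:x:500:500:Alice:/home/alice:/bin/bash
nobody:x:99:99:Nobody:/:/sbin/nologin
EOF

# One awk process handles every line: field 3 is the UID, the last
# field is the shell, and field 1 is the username.
nologin_users=$(awk -F: '$3 > 10 && $NF == "/sbin/nologin" { print $1 }' "$sample_file")

rm -f "$sample_file"
echo "$nologin_users"
```

Here the filtering logic moves into awk, so the shell loop disappears entirely. That makes it a different design from the benchmark functions, which keep the logic in bash and only use awk or cut to extract fields.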
December 10th, 2009 at 1:26 am
Doh. I should have thought of using fifos…
Dumb suggestion: if you were totally desperate to use cut or awk in this way, could you convince tr to translate EOF (that is, ^D = 0x04) into something less destructive? I wonder if tr is coded to allow this behavior. Just send a signal to cut later to get it to die.
December 10th, 2009 at 1:43 am
I do not believe there is a way to “filter” out the EOF.
December 20th, 2009 at 12:24 am
tr -d ‘4′
December 20th, 2009 at 12:25 am
That should be: