Processing XML on the Command Line
April 24th, 2008
The other day on the cURL email list, someone asked:
Could someone please tell me (preferably with an example) of how I could parse and xml like the following:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FileRetriever>
<FileList>
<File name="AMERI08.D4860.ZIP" />
<File name="DTCCRSF.D4861.ZIP" />
<File name="DTGSS01.D4862.ZIP" />
<File name="DTGSS02.D4863.ZIP" />
<File name="DTGSS03.D4864.ZIP" />
</FileList>
</FileRetriever>
This is off-topic for the cURL list, but I thought it was a fair question. You could do this:
$ grep '<File ' config.xml | awk -F'"' '{print $2}' | xargs -l -I {} echo curl -I "http://bashcurescancer.com/{}"
curl -I http://bashcurescancer.com/AMERI08.D4860.ZIP
curl -I http://bashcurescancer.com/DTCCRSF.D4861.ZIP
curl -I http://bashcurescancer.com/DTGSS01.D4862.ZIP
curl -I http://bashcurescancer.com/DTGSS02.D4863.ZIP
curl -I http://bashcurescancer.com/DTGSS03.D4864.ZIP
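The grep/awk pipeline assumes that name is the first quoted value on each File line. A slightly more defensive sketch, still line-based and not a real XML parser, pulls out the name attribute with sed. The function name list_file_names and the inline sample input are illustrative only:

```bash
#!/usr/bin/env bash
# list_file_names: print the name attribute of every <File ...> element
# read from standard input. Line-based; assumes one element per line.
list_file_names() {
    sed -n 's/.*<File[^>]* name="\([^"]*\)".*/\1/p'
}

# Illustrative input; the second element carries an extra attribute first,
# which would confuse the awk -F'"' '{print $2}' approach.
list_file_names <<'EOF'
<FileList>
<File name="AMERI08.D4860.ZIP" />
<File id="2" name="DTCCRSF.D4861.ZIP" />
</FileList>
EOF
```

The output can be fed to xargs exactly as in the pipeline above.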
Or, you could use the xsltproc command with an associated style sheet. This is really the correct method, and it is much more effective when you’re processing complex XML or XML that is not easily grep’able:
$ xsltproc --nonet config.xsl config.xml | xargs -l -I {} echo curl -I "http://bashcurescancer.com/{}"
curl -I http://bashcurescancer.com/AMERI08.D4860.ZIP
curl -I http://bashcurescancer.com/DTCCRSF.D4861.ZIP
curl -I http://bashcurescancer.com/DTGSS01.D4862.ZIP
curl -I http://bashcurescancer.com/DTGSS02.D4863.ZIP
curl -I http://bashcurescancer.com/DTGSS03.D4864.ZIP
Links to config.xml and config.xsl.
using kill to see if a process is alive
April 9th, 2008
I am making some changes to the moreutils sponge command. Sponge provides a method of prepending which is less specialized than my prepend util. However, it has trouble with large amounts of input.
Regardless, while testing my changes, I want to watch it operate. Normally, you would just do so from a second terminal. That is a pain. kill -0 can be very useful for this. After backgrounding the command, I assign the pid (via the variable $!) to $pid using eval. eval is needed to stop BASH from expanding $! until after the background operation.
After that, I enter a while loop on kill -0 $pid, which will not kill $pid, but will return successfully until $pid has died:
# cat large-file-GB | ./sponge large-file-GB-copy & eval 'pid=$!'; while kill -0 $pid; do sleep 10; ls -lh large-file* /tmp/sponge.*; echo;done
[1] 7937
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 128M 2008-04-09 17:23 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 384M 2008-04-09 17:23 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 877M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 20M 2008-04-09 17:24 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 413M 2008-04-09 17:25 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 836M 2008-04-09 17:25 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 920M 2008-04-09 17:25 large-file-GB-copy
[1]+ Done cat large-file-GB | ./sponge large-file-GB-copy
ls: cannot access /tmp/sponge.*: No such file or directory
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 977M 2008-04-09 17:25 large-file-GB-copy
-bash: kill: (7937) - No such process
# md5sum large-file-GB*
b5c667a723a10a3485a33263c4c2b978 large-file-GB
b5c667a723a10a3485a33263c4c2b978 large-file-GB-copy
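The same polling idea can be written as a small script. This is a minimal sketch, not the command used above: sleep stands in for the long-running job, and the function name wait_until_dead and the polls counter are made up for illustration. In a script, where history expansion is off, a plain pid=$! assignment is enough.

```bash
#!/usr/bin/env bash
# wait_until_dead PID [INTERVAL]
# Loops while kill -0 succeeds for PID, sleeping INTERVAL seconds between
# checks. kill -0 sends no signal; it only reports whether PID can be signalled.
# The number of checks that found the process alive is left in $polls.
wait_until_dead() {
    local pid=$1 interval=${2:-1}
    polls=0
    while kill -0 "$pid" 2>/dev/null; do
        polls=$((polls + 1))
        sleep "$interval"
    done
}

sleep 2 &      # stand-in for the long-running command
pid=$!         # PID of the most recent background job
wait_until_dead "$pid" 1
echo "process $pid is gone after $polls checks"
```

Anything you want to watch while the job runs (the ls -lh in the example above) goes inside the loop body, next to the sleep.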
New command: prepend
April 6th, 2008
I am using Google’s project hosting to host software that I create and either find useful or want to keep track of. I called the project Brock’s Tools. The code that led me to create this project was a command I am calling prepend 1.1. (UPDATE: See this post on sponge, as it’s a better general-case tool.)
prepend prepends files or standard input to a file. For example, you have three files:
$ echo BROCK > a
$ echo DAVID > b
$ echo NOLAND > c
And you want to combine them into one file:
$ echo "My name is:" | prepend - a b c
$ cat c
My name is:
BROCK
DAVID
NOLAND
Or let’s say you just want to append a file to itself:
$ cat a
BROCK
$ cat a >> a
cat: a: input file is output file
prepend does this:
$ prepend a
$ cat a
BROCK
BROCK
I quite often come across situations where this would be useful. Of course, prepending can be done in the shell:
$ { echo "My name is:"; cat a b c; } > tmp && mv -f tmp c
$ cat c
My name is:
BROCK
DAVID
NOLAND
However, that is unsafe and I have lost data that way. I perform this operation most often when dealing with XML. In this example, it’s trivial to open the file in an editor, but with a large file, it’s quite nasty to do so:
$ cat something.xml
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
$ echo "</entries>" >> something.xml
$ cat something.xml
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
</entries>
$ echo "<entries>" | prepend - something.xml
$ cat something.xml
<entries>
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
</entries>
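For comparison, here is a somewhat more careful version of the pure-shell approach. It is only a sketch and not how prepend itself is implemented: the function name prepend_stdin is hypothetical, and it assumes mktemp is available. The temporary file is created next to the target so the final mv is a rename on the same filesystem, and the target is only replaced if the copy succeeded.

```bash
#!/usr/bin/env bash
# prepend_stdin FILE
# Writes standard input, followed by FILE's current content, back to FILE.
prepend_stdin() {
    local target=$1 tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if cat - "$target" > "$tmp"; then
        mv -f "$tmp" "$target"      # replace the target only after a full copy
    else
        rm -f "$tmp"                # leave the original untouched on failure
        return 1
    fi
}

# Illustrative usage in a scratch directory.
dir=$(mktemp -d)
printf '<entry/>\n' > "$dir/something.xml"
echo "<entries>" | prepend_stdin "$dir/something.xml"
cat "$dir/something.xml"
rm -rf "$dir"
```

One caveat: mktemp creates the file with mode 0600, so the target ends up with those permissions unless you restore them afterwards.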
Exposing command line programs as web services
March 27th, 2008
The web services paradigm of development is based on the Unix philosophy of “small is good”. Web services should do one job, and do it well, allowing users to develop complex solutions by combining small, reliable and proven services.
Why not then, expose the power of familiar Unix commands like sort, grep, gzip… to the web?
Here is a proof of concept python script (Python 2.3 version) to demonstrate.
Start services:
$ ./to_web.py -p8008 sort &
Thu Mar 27 13:45:54 2008 sort server started - 8008
$ ./to_web.py -p8009 gzip &
Thu Mar 27 13:46:29 2008 gzip server started - 8009
Use the services:
$ for i in {1..10}; do echo ${RANDOM:0:2}; done | \
> curl --data-binary @- "http://swat:8008/sort+-nr" | \
> curl --data-binary @- "http://swat:8009/gzip" | \
> gunzip
97
37
23
23
21
18
11
11
10
10
In my position, we have a database with host information, which has a command line interface. This tool has dependencies which are painful to resolve. With to_web.py, we can turn the command line tool into a web service and access the data without having to satisfy those additional dependencies.
This is a guest post by my esteemed colleague Adam Fokken. He can be reached here: Sadly, he does not have a blog.
Wrapping dynamic languages in shell without an extra script
March 25th, 2008
There are situations where, if you want a Python, Perl, PHP, etc. script to be portable across a few different servers, it makes sense to wrap the script in shell. A few years ago I was trying to use the Python cx_Oracle module. This module is a wrapper for the native Oracle database driver. However, it requires that the driver library directory be in the LD_LIBRARY_PATH environment variable.
No problem I thought. I’ll use the os.environ dict to set the variable. Example script:
$ cat python-only.sh
#!/usr/bin/python
import sys, os
sys.path.append("/usr/local/lib/python2.4/site-packages/")
if not os.environ.has_key('LD_LIBRARY_PATH'):
    os.environ['LD_LIBRARY_PATH'] = "/home/noland/oracle-lib"
else:
    os.environ['LD_LIBRARY_PATH'] = "/home/noland/oracle-lib:" + os.environ['LD_LIBRARY_PATH']
print "LD_LIBRARY_PATH looks OK in Python: LD_LIBRARY_PATH = ", os.environ['LD_LIBRARY_PATH']
os.system('echo LD_LIBRARY_PATH looks OK via os.system: LD_LIBRARY_PATH = $LD_LIBRARY_PATH')
try:
    import cx_Oracle
    print "Imported cx_Oracle! LD_LIBRARY_PATH was set correctly."
except ImportError, e:
    print "Woops, LD_LIBRARY_PATH was not set correctly: ", e
This method does not work:
$ ./python-only.sh
LD_LIBRARY_PATH looks OK in Python: LD_LIBRARY_PATH = /home/noland/oracle-lib
LD_LIBRARY_PATH looks OK via os.system: LD_LIBRARY_PATH = /home/noland/oracle-lib
Woops, LD_LIBRARY_PATH was not set correctly: libclntsh.so.10.1: cannot open shared object file: No such file or directory
The dynamic linker reads LD_LIBRARY_PATH when a process starts, so changing os.environ afterwards only affects child processes (which is why the os.system line looks fine). This seems to be a common problem. However, when I was dealing with this a few years ago, I could not find a good resource on Google. I bit the bullet and wrote a separate shell script wrapper, even though I hated having to invoke the Python through a second script. However, there was absolutely no reason I needed a separate shell script. I could have embedded the Python within a shell script. Example:
$ cat python-and-bash.sh
#!/bin/bash
export LD_LIBRARY_PATH=/home/noland/oracle-lib:$LD_LIBRARY_PATH
/usr/bin/python<<END_OF_PYTHON
import sys
sys.path.append("/usr/local/lib/python2.4/site-packages/")
try:
    import cx_Oracle
    print "Imported cx_Oracle! LD_LIBRARY_PATH was set correctly."
except ImportError, e:
    print "Woops, LD_LIBRARY_PATH was not set correctly: ", e
END_OF_PYTHON
Ahh, much better:
$ ./python-and-bash.sh
Imported cx_Oracle! LD_LIBRARY_PATH was set correctly.
Of course I could have just set this variable in my profile. However, this creates an additional external dependency - which is what I was trying to avoid.
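One caveat with embedding a script in a here-document: when the delimiter is unquoted, as in python-and-bash.sh, the shell expands $variables and command substitutions in the body before the interpreter ever sees it. Quoting the delimiter passes the body through untouched. A small sketch, using cat as a stand-in for the interpreter (the function and variable names are illustrative):

```bash
#!/usr/bin/env bash
name=shell-value

# Unquoted delimiter: the shell substitutes $name before cat reads the text.
show_unquoted() {
    cat <<END_OF_SCRIPT
value: $name
END_OF_SCRIPT
}

# Quoted delimiter: the body is passed through literally, dollar sign included.
show_quoted() {
    cat <<'END_OF_SCRIPT'
value: $name
END_OF_SCRIPT
}

show_unquoted
show_quoted
```

For an embedded Python script that happens to contain a dollar sign or backticks, the quoted form is usually the safer default.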
Process Substitution
March 23rd, 2008
Quite some time ago, someone wrote me to ask about a possible article on process substitution. Sadly, I could not find the email so I cannot credit them. As you likely have guessed, I am finally writing a post on process substitution.
Many times I have used pipelines and temporary files when process substitution would be a much cleaner solution.
First, I am going to create two test files:
$ dd if=/dev/urandom of=file-small count=750001
$ dd if=/dev/urandom of=file-large count=1000000
$ ls -l file-*
-rw-r--r-- 1 noland noland 512000000 Mar 23 08:53 file-large
-rw-r--r-- 1 noland noland 384000512 Mar 23 08:49 file-small
I thought of writing this article while writing a script to test ftp servers and file locking. As such I will upload the small file to a file named append-example:
$ curl -T file-small --user noland ftp://localhost/append-example
Enter host password for user 'noland':
$ ls -l append-example
-rw-r--r-- 1 noland noland 384000512 Mar 23 11:52 append-example
Now I will append the large file:
$ curl -s -a -T file-large --user noland ftp://localhost/append-example
Enter host password for user 'noland':
$ ls -l append-example
-rw-r--r-- 1 noland noland 896000512 Mar 23 11:54 append-example
I am going to use dd and process substitution to calculate the MD5 hash of the first upload:
$ md5sum file-small <(dd if=append-example count=750001 status=noxfer)
dfabff7441bd814145a804e03d333864 file-small
1000000+0 records in
1000000+0 records out
dfabff7441bd814145a804e03d333864 /dev/fd/63
Now the portion that was appended:
$ md5sum file-large <(dd if=append-example skip=750001 status=noxfer)
1b8daed9e435fc90b4a49d74b55f96f4 file-large
1000000+0 records in
1000000+0 records out
1b8daed9e435fc90b4a49d74b55f96f4 /dev/fd/63
When you place a command inside <( ), the shell connects the command’s standard output to a pipe, exposes that pipe as a file under /dev/fd/, and replaces the <( ) expression with that file name. Here is the classic example:
$ echo <(echo) <(echo) <(echo) <(echo)
/dev/fd/63 /dev/fd/62 /dev/fd/61 /dev/fd/60
In my script I use process substitution (effectively) as below, which feels exceedingly clean:
$ read hash name < <(md5sum <(dd if=append-example skip=750001 status=noxfer))
1000000+0 records in
1000000+0 records out
$ printf "hash=%s name=%s\n" $hash $name
hash=1b8daed9e435fc90b4a49d74b55f96f4 name=/dev/fd/63
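Part of why read ... < <(...) is so convenient: with an ordinary pipeline, bash runs read in a subshell by default, so the variables are gone once the pipeline finishes. A minimal sketch of the difference; fake_md5sum is a made-up stand-in that just prints a hash-like string and a name, so the example needs neither large files nor md5sum:

```bash
#!/usr/bin/env bash
# Stand-in for md5sum: prints "<hash>  <name>" (illustrative values only).
fake_md5sum() {
    printf '%s  %s\n' abc123 demo-file
}

hash='' name=''
fake_md5sum | read -r hash name        # read runs in a subshell; values are lost
piped="hash=$hash name=$name"

read -r hash name < <(fake_md5sum)     # read runs in the current shell
substituted="hash=$hash name=$name"

printf 'pipeline:             %s\n' "$piped"
printf 'process substitution: %s\n' "$substituted"
```

With bash defaults (the lastpipe option off), the first line shows empty values and the second shows the values that were read.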
Keeping your SSH sessions alive with NOOP
March 12th, 2008
In the past, my SSH sessions died due to inactivity. In order to solve this, I used to:
while true; do uptime; sleep 5;done
Obviously, this eventually clears your terminal history. BASH to the rescue! My noop script solves this problem. (Please see the comments; there may be a better solution. Thanks, David!) noop, standing for no operation, is a processor instruction and is common in protocols. You may find it interesting that exploit code is often filled with NOPs; padding with them increases the chances of successfully exploiting a buffer overflow.
The source:
$ cat /usr/bin/noop
#!/bin/bash
backspace() {
    echo -e "\b\c"
}
cleanup() {
    backspace
    exit
}
trap "cleanup" 2
while :
do
    num=${RANDOM:0:1}
    printf $num
    sleep ".$num"
    backspace
done
For the hell of it, I made a video of noop in action.
If you’re wondering how the script works, here is a quick explanation. The script defines two functions: backspace and cleanup. backspace prints the special characters \b and \c. Backslash b is a backspace, and backslash c stops echo from printing a trailing newline:
backspace() {
    echo -e "\b\c"
}
The cleanup function prints a backspace and then exits. It is run by trap when the script receives a SIGINT (2):
cleanup() {
    backspace
    exit
}
trap "cleanup" 2
The main body of the script is an infinite loop which generates a random number using the special variable $RANDOM. Only the first digit of that random number is assigned to the variable num. After printing that digit, the script sleeps num tenths of a second, and then the backspace function is called:
while :
do
    num=${RANDOM:0:1}
    printf $num
    sleep ".$num"
    backspace
done
rpm2tgz - web interface and web service
February 23rd, 2008
My favorite site for converting RPMs to tar gzip files appears to have shut down. As such, I wrote my own tool. It has a web interface: Convert a RPM to a tgz and (keeping in line with my thoughts on software) can be used from the command line.
Five usage examples:
$ wget -q "http://bashcurescancer.com/rpm2tgz.ws?url=http://bashcurescancer.com/media/rpm2tgz/telnet-0.17-39.el5.i386.rpm"
$ ls -l telnet-0.17-39.el5.i386.tgz
-rw-r--r-- 1 noland noland 49804 Feb 23 17:09 telnet-0.17-39.el5.i386.tgz
$ curl -s -F "rpm=@telnet-0.17-39.el5.i386.rpm" \
"http://bashcurescancer.com/rpm2tgz.ws" >telnet-0.17-39.el5.i386.tgz.1
$ curl -s -F "url=http://bashcurescancer.com/media/rpm2tgz/telnet-0.17-39.el5.i386.rpm" \
http://bashcurescancer.com/rpm2tgz.ws > telnet-0.17-39.el5.i386.tgz.2
$ curl -s "http://bashcurescancer.com/rpm2tgz.ws?url=http://bashcurescancer.com/media/rpm2tgz/telnet-0.17-39.el5.i386.rpm" \
> telnet-0.17-39.el5.i386.tgz.3
$ wget -q -O telnet-0.17-39.el5.i386.tgz.4 \
"http://bashcurescancer.com/rpm2tgz.ws?url=http://bashcurescancer.com/media/rpm2tgz/telnet-0.17-39.el5.i386.rpm"
Needless to say, if you abuse this, I will block your IP address from accessing the service. If there is an error, the script will return either 404 File Not Found or 500 Internal Server Error, with an empty body. As such, you should be able to use the -s expression of test, [, and [[ to check the validity of the file.
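A sketch of that check, assuming the service behaves as described (an empty body on error). Both function names are hypothetical, and fetch_tgz is shown only for shape; it is defined but not called here:

```bash
#!/usr/bin/env bash
# is_valid_download FILE
# True only if FILE exists and is larger than zero bytes (the -s test).
is_valid_download() {
    [ -s "$1" ]
}

# fetch_tgz RPM_URL OUT
# Downloads the converted archive to OUT, then validates it with -s.
fetch_tgz() {
    curl -s -o "$2" "http://bashcurescancer.com/rpm2tgz.ws?url=$1" || return 1
    is_valid_download "$2"
}
```

A caller could then write: fetch_tgz "$url" out.tgz || echo "conversion failed" >&2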
Which comparator, test, bracket, or double bracket, is fastest?
January 24th, 2008
The other day, I began wondering which comparator, test, [, or [[, was fastest? Here are the results:
$ time for i in {1..100000}; do [[ -d . ]];done
real 0m1.256s
user 0m1.018s
sys 0m0.238s
$ time for i in {1..100000}; do [ -d . ];done
real 0m3.407s
user 0m2.704s
sys 0m0.703s
$ time for i in {1..100000}; do test -d .;done
real 0m3.223s
user 0m2.607s
sys 0m0.616s
The double bracket is a “compound command”, whereas test and the single bracket are shell built-ins (and in actuality are the same command). Thus, the single bracket and double bracket execute different code.
The test and single bracket are the most portable, as they also exist as separate, external commands. However, if you’re using any remotely modern version of BASH, the double bracket is supported.
Here are the performance numbers for the external versions of test and single bracket:
$ time for i in {1..100000}; do /usr/bin/test -d .;done
real 5m49.324s
user 0m51.771s
sys 4m48.013s
$ time for i in {1..100000}; do /usr/bin/[ -d . ];done
real 5m45.728s
user 0m52.536s
sys 4m46.259s
Wow! This shows the high cost of process creation!
dssh - executing an arbitrary command in parallel on an arbitrary number of hosts
January 21st, 2008
I asked “What do you want” and you said scripting. Which is good, because I have felt like scripting lately!
I help a website hosting company, Idologic, on the weekends. (Side note: I highly recommend Idologic. I have worked with and been a customer of many other hosting companies. I really doubt you will find better customer service elsewhere.) Like many businesses these days, Idologic has quite a few Linux servers. When presented with many servers, I typically want to parallelize my work.
As such, I have written a script called dssh (previous version), which allows you to execute commands on n hosts, in parallel. This can be used to find information on the hosts, such as load average, number of processes by user, number of processes by process name, etc.
There are other options such as pssh and p-run; however, I wanted to create a shell solution which could be easily and simply “installed”. dssh reads standard input. It expects one host per line. Host-specific ssh options are supported. Here is my sample hosts file:
$ cat hosts
mojito -l noland
kodiak
mojito
kodiak -C
mojito -i /home/noland/.ssh/id_rsa
kodiak
There is nothing restricting you from generating this input from some type of metadata (e.g. a database). Here are some examples of output:
$ ./dssh.sh "uptime" < hosts
First time huh? Think your cmd over and then try again.
$ ./dssh.sh "uptime" < hosts
mojito:O:0:19:16:45 up 3 days, 14 min, 5 users, load average: 0.22, 0.22, 0.20
kodiak:O:0:13:24:00 up 20:00, 1 user, load average: 0.42, 0.16, 0.05
mojito:O:0:19:16:45 up 3 days, 14 min, 5 users, load average: 0.22, 0.22, 0.20
kodiak:O:0:13:24:00 up 20:00, 1 user, load average: 0.42, 0.16, 0.05
mojito:O:0:19:16:45 up 3 days, 14 min, 5 users, load average: 0.22, 0.22, 0.20
kodiak:O:0:13:24:00 up 20:00, 1 user, load average: 0.42, 0.16, 0.0
$ ./dssh.sh "pgrep -u noland | wc -w" < hosts
mojito:O:0:60
kodiak:O:0:5
mojito:O:0:60
kodiak:O:0:5
mojito:O:0:60
kodiak:O:0:5
$ ./dssh.sh "ls not_a_file" < hosts
mojito:E:2:ls: not_a_file: No such file or directory
kodiak:E:2:ls: not_a_file: No such file or directory
mojito:E:2:ls: not_a_file: No such file or directory
kodiak:E:2:ls: not_a_file: No such file or directory
mojito:E:2:ls: not_a_file: No such file or directory
kodiak:E:2:ls: not_a_file: No such file or directory
Notes:
- With great power comes even greater responsibility. Running rm -rf / as root with this script would do exactly that.
- I don’t recommend doing anything with this script that “changes state”.
- I make no warranties or promises.
- You need ssh keys to use this. I recommend using ssh-agent.
- By default dssh will execute 10 children in parallel. If you have a large host, increase this.
- When looping through the hosts, if the maximum number of children are still processing, the script will sleep 500ms. If your version of sleep does not support fractional seconds, you will need to change this.
Here is an outline of the script:
- Read from standard input a list of hosts
- Configure trap to remove temporary files on exit
- For each host
  - Sleep while the maximum number of children are still running
  - Generate three temporary files, one for each of
    - Standard Output
    - Standard Error
    - Exit value
  - Create a child process, saving stdout, stderr, and the exit value in their respective files.
- Wait for all children to exit
- For each host
  - If the standard output or error files are of size greater than zero, print the content, prefacing each line with the hostname, standard error/output indicator, and exit status.
  - Else print something to indicate we executed a process and have an exit value.
Once again, here is the script I am calling dssh.
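To make the outline concrete, here is a minimal sketch of the same steps. It is not the dssh source. Assumptions are marked in the comments: run_on_host is a hypothetical stand-in that runs the command locally so the example works without ssh keys, the "-" marker for a run with no output is invented, and the throttle sleeps a whole second for portability.

```bash
#!/usr/bin/env bash
# ASSUMPTION: stand-in for the ssh call; ignores the host line ($1) and
# runs the command ($2) locally. A real version would invoke ssh here.
run_on_host() {
    sh -c "$2"
}

# Prefix every line of standard input with $1.
prefix_lines() {
    local line
    while IFS= read -r line; do printf '%s%s\n' "$1" "$line"; done
}

# usage: dssh_sketch "command" < hosts
# The body is a subshell, so the EXIT trap and variables stay private to it.
dssh_sketch() (
    cmd=$1
    max_children=${MAX_CHILDREN:-10}
    tmpdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmpdir"' EXIT          # remove temporary files on exit
    hosts=() pids=() i=0
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        while :; do                       # throttle: count children still alive
            running=0
            for pid in "${pids[@]}"; do
                kill -0 "$pid" 2>/dev/null && running=$((running + 1))
            done
            [ "$running" -lt "$max_children" ] && break
            sleep 1
        done
        hosts[i]=$host
        (                                 # child: save stdout, stderr, exit value
            run_on_host "$host" "$cmd" >"$tmpdir/$i.out" 2>"$tmpdir/$i.err" </dev/null
            echo "$?" >"$tmpdir/$i.rc"
        ) &
        pids[i]=$!
        i=$((i + 1))
    done
    wait                                  # wait for all children to exit
    for i in "${!hosts[@]}"; do
        name=${hosts[i]%% *}              # host name is the first word of the line
        rc=$(cat "$tmpdir/$i.rc")
        if [ -s "$tmpdir/$i.out" ] || [ -s "$tmpdir/$i.err" ]; then
            prefix_lines "$name:O:$rc:" <"$tmpdir/$i.out"
            prefix_lines "$name:E:$rc:" <"$tmpdir/$i.err"
        else
            echo "$name:-:$rc:"           # ASSUMPTION: marker for "ran, no output"
        fi
    done
)

printf 'alpha\nbeta -C\n' | dssh_sketch 'echo hello'
```

Results are printed in host-file order after every child has finished, which keeps the output deterministic even though the commands run in parallel.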

