Processing XML on the Command Line
April 24th, 2008
The other day on the cURL email list, someone asked:
Could someone please tell me (preferably with an example) of how I could parse and xml like the following:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FileRetriever>
<FileList>
<File name="AMERI08.D4860.ZIP" />
<File name="DTCCRSF.D4861.ZIP" />
<File name="DTGSS01.D4862.ZIP" />
<File name="DTGSS02.D4863.ZIP" />
<File name="DTGSS03.D4864.ZIP" />
</FileList>
</FileRetriever>
This is not appropriate for the cURL list, but I thought it a fair question. You could do this:
$ grep '<File ' config.xml | awk -F'"' '{print $2}' | xargs -l -I {} echo curl -I "http://bashcurescancer.com/{}"
curl -I http://bashcurescancer.com/AMERI08.D4860.ZIP
curl -I http://bashcurescancer.com/DTCCRSF.D4861.ZIP
curl -I http://bashcurescancer.com/DTGSS01.D4862.ZIP
curl -I http://bashcurescancer.com/DTGSS02.D4863.ZIP
curl -I http://bashcurescancer.com/DTGSS03.D4864.ZIP
Or, you could use the xsltproc command with an associated style sheet. This is really the correct method and much more effective when you're processing complex XML or XML that is not easily grep'able:
$ xsltproc --nonet config.xsl config.xml | xargs -l -I {} echo curl -I "http://bashcurescancer.com/{}"
curl -I http://bashcurescancer.com/AMERI08.D4860.ZIP
curl -I http://bashcurescancer.com/DTCCRSF.D4861.ZIP
curl -I http://bashcurescancer.com/DTGSS01.D4862.ZIP
curl -I http://bashcurescancer.com/DTGSS02.D4863.ZIP
curl -I http://bashcurescancer.com/DTGSS03.D4864.ZIP
Links to config.xml and config.xsl.
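Since config.xsl is only linked above, here is a rough sketch of what such a style sheet could look like. The names.xsl file, the scratch directory and the trimmed two-file config.xml are my own stand-ins, not the files from this post, and the example assumes xsltproc is installed:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Sample input, modelled on the XML from the mailing list question.
cat > "$workdir/config.xml" <<'EOF'
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FileRetriever>
  <FileList>
    <File name="AMERI08.D4860.ZIP" />
    <File name="DTCCRSF.D4861.ZIP" />
  </FileList>
</FileRetriever>
EOF

# Hypothetical style sheet: print each File's name attribute on its own line.
cat > "$workdir/names.xsl" <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="FileRetriever/FileList/File">
      <xsl:value-of select="@name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

if command -v xsltproc >/dev/null 2>&1; then
  names=$(xsltproc --nonet "$workdir/names.xsl" "$workdir/config.xml")
  printf '%s\n' "$names"
else
  names=""
  echo "xsltproc is not installed" >&2
fi

rm -rf "$workdir"
```

Piping that output into the same xargs command as before should produce one curl line per file name.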
Do not close stderr
April 22nd, 2008
A few years ago, I wrote a post commenting on how ugly this was:
$ someprog 2>/dev/null
I was nearly imploring the reader to close stderr:
$ someprog 2>&-
Some very knowledgeable anonymous commenter explained why that was a bad idea. At the time, I didn’t understand exactly what they were saying. As such, I deleted the post. Yesterday, for no particular reason, the implications of closing stderr popped into my head. In the shower no less.
I wrote a simple little C program named do-not-close-stderr.c. It takes two parameters, a string you want written to a file and the file you want said string written to. After opening the file, it prints “some kind of warning message” to stderr. Here we are:
$ gcc -Wall do-not-close-stderr.c -o do-not-close-stderr
$ ./do-not-close-stderr "Brock was here." output
Some kind of warning message.
$ cat output
Brock was here.
Now let's close standard error when executing:
$ ./do-not-close-stderr "Brock was here." output 2>&-
$ cat output
Some kind of warning message.
Brock was here.
What happened: open() always returns the lowest unused file descriptor. With descriptor 2 closed, the output file became descriptor 2, so the warning written to stderr went straight into the file.
Thanks to whoever that commenter was.
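Here is a shell-only sketch of a related symptom, assuming a POSIX sh. With stderr closed there is no descriptor 2 to write to, so a command that tries to print a warning fails outright. With 2>/dev/null it succeeds and the message is simply discarded. The variable names are mine:

```shell
# Redirected to /dev/null: descriptor 2 exists and the message is discarded.
if sh -c 'echo "Some kind of warning message." >&2' 2>/dev/null; then
  devnull_result=ok
else
  devnull_result=failed
fi

# Closed: descriptor 2 does not exist, so the write cannot happen.
if sh -c 'echo "Some kind of warning message." >&2' 2>&-; then
  closed_result=ok
else
  closed_result=failed
fi

echo "2>/dev/null: $devnull_result"
echo "2>&-: $closed_result"
```

A program that ignores write errors would not even notice, which is what makes the closed case so easy to miss.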
prepend to a file with sponge from moreutils
April 17th, 2008
A few weeks ago I wrote about a tool which helps you easily prepend to a file. I submitted prepend to moreutils and Joey was kind enough to point out this could be done with sponge. sponge reads standard input and, when done, writes it to a file:
Probably the most general purpose tool in moreutils so far is
sponge(1), which lets you do things like this:
% sed "s/root/toor/" /etc/passwd | grep -v joey | sponge /etc/passwd
Two days ago Joey released version 0.29 of moreutils including a patch by yours truly (with much help from Joey).
sponge: Handle large data sizes by using a temp file rather than by consuming arbitrary amounts of memory. Patch by Brock Noland.
- version 0.29 changelog
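To show the idea rather than the real implementation (sponge itself is a C program), here is a minimal shell sketch of "soak up everything, then write". The soak function and the demo file are hypothetical names of my own:

```shell
# soak FILE: read all of standard input, then write it to FILE.
# Hypothetical stand-in for illustration only.
soak() {
  soak_tmp=$(mktemp) || return 1
  # Nothing touches FILE until standard input has been read to the end.
  cat > "$soak_tmp" && cat "$soak_tmp" > "$1"
  soak_status=$?
  rm -f "$soak_tmp"
  return "$soak_status"
}

# Demo: edit a file "in place" through a pipeline.
demo=$(mktemp)
printf 'root\njoey\nbrock\n' > "$demo"
sed 's/root/toor/' "$demo" | grep -v joey | soak "$demo"
result=$(cat "$demo")
rm -f "$demo"
printf '%s\n' "$result"
```

With a plain `> "$demo"` at the end of that pipeline, the shell would truncate the file before sed ever got to read it.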
Also, on a non-command line note, I found a video on Joey’s site which I thought was pretty cool, Joey Learns to Fly.
Bug in Curl is fixed
April 14th, 2008
I love curl. I use it quite often to perform HTTP HEAD requests:
$ curl -I http://bashcurescancer.com
HTTP/1.1 200 OK
Date: Mon, 14 Apr 2008 03:11:35 GMT
Server: Apache/2.2.6 (Unix)
X-Pingback: http://bashcurescancer.com/wordpress/xmlrpc.php
Last-Modified: Mon, 14 Apr 2008 02:38:11 GMT
Connection: close
Content-Type: text/html; charset=UTF-8
However, I sometimes forget whether a HEAD request is -I or -i, so I usually specify both. Lowercase i is “include headers in output” and uppercase I tells curl to use HEAD instead of GET. When you use -I, -i is implied.
Given all this, there should be no problems specifying both options. However, if you place -I before -i, curl doesn’t actually display the response. Here is the output from my bug report to curl-users:
$ curl -I -i http://bashcurescancer.com
$ curl -i -I http://bashcurescancer.com
HTTP/1.1 200 OK
Date: Mon, 14 Apr 2008 03:11:35 GMT
Server: Apache/2.2.6 (Unix)
X-Pingback: http://bashcurescancer.com/wordpress/xmlrpc.php
Last-Modified: Mon, 14 Apr 2008 02:38:11 GMT
Connection: close
Content-Type: text/html; charset=UTF-8
Curl uses a long integer for configuration flags via bit masking. The problem arises because the -I option sets two bits and the -i option XORs one of those same bits:
src/main.c
case 'i':
  config->conf ^= CONF_HEADER; /* include the HTTP header as well */
  break;
…
case 'I':
  /*
   * This is a bit tricky. We either SET both bits, or we clear both
   * bits. Let's not make any other outcomes from this.
   */
  if((CONF_HEADER|CONF_NOBODY) !=
     (config->conf&(CONF_HEADER|CONF_NOBODY)) ) {
    /* one of them weren't set, set both */
    config->conf |= (CONF_HEADER|CONF_NOBODY);
    if(SetHTTPrequest(config, HTTPREQ_HEAD, &config->httpreq))
      return PARAM_BAD_USE;
  }
  else {
    /* both were set, clear both */
    config->conf &= ~(CONF_HEADER|CONF_NOBODY);
    if(SetHTTPrequest(config, HTTPREQ_GET, &config->httpreq))
      return PARAM_BAD_USE;
  }
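The two orderings can be traced with a little shell arithmetic. This is only a sketch of the logic quoted above: the bit values 1 and 2 are made up for illustration (the real constants live in curl's source), and apply_i and apply_I are my own helper names:

```shell
# Hypothetical bit values, chosen only so the arithmetic is easy to follow.
CONF_HEADER=1
CONF_NOBODY=2
BOTH=$((CONF_HEADER | CONF_NOBODY))

# Each helper takes the current flag word as $1 and prints the new one.
apply_i() { echo $(( $1 ^ CONF_HEADER )); }
apply_I() {
  if [ $(( $1 & BOTH )) -ne "$BOTH" ]; then
    echo $(( $1 | BOTH ))     # one of them was not set: set both
  else
    echo $(( $1 & ~BOTH ))    # both were set: clear both
  fi
}

# -I then -i: both bits are set, then CONF_HEADER is toggled off again.
I_then_i=$(apply_i "$(apply_I 0)")
# -i then -I: CONF_HEADER is set, then -I sets the missing bit.
i_then_I=$(apply_I "$(apply_i 0)")

echo "-I -i leaves flags=$I_then_i (CONF_NOBODY only)"
echo "-i -I leaves flags=$i_then_I (both bits set)"
```

With -I first, the header bit ends up cleared while the no-body bit stays set, which matches the empty output in my bug report.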
Thanks to Daniel Stenberg, the fix “is now committed!”
using kill to see if a process is alive
April 9th, 2008
I am making some changes to the moreutils sponge command. Sponge provides a method of prepending which is less specialized than my prepend util. However, it has trouble with large amounts of input.
Regardless, while testing my changes, I want to watch it operate. Normally, you would just do so from a second terminal. That is a pain. kill -0 can be very useful for this. After backgrounding the command, I assign the pid (via the variable $!) to $pid using eval. The eval is not strictly required: bash expands $! only when it executes the assignment, which happens after the job has been started, so a plain pid=$! behaves the same.
After that, I enter a while loop on kill -0 $pid, which will not kill $pid, but will return successfully until $pid has died:
# cat large-file-GB | ./sponge large-file-GB-copy & eval 'pid=$!'; while kill -0 $pid; do sleep 10; ls -lh large-file* /tmp/sponge.*; echo;done
[1] 7937
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 128M 2008-04-09 17:23 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 384M 2008-04-09 17:23 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw------- 1 root root 877M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 20M 2008-04-09 17:24 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 413M 2008-04-09 17:25 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 836M 2008-04-09 17:25 large-file-GB-copy
-rw------- 1 root root 896M 2008-04-09 17:24 /tmp/sponge.JMsBWG
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 920M 2008-04-09 17:25 large-file-GB-copy
[1]+ Done cat large-file-GB | ./sponge large-file-GB-copy
ls: cannot access /tmp/sponge.*: No such file or directory
-rw-r--r-- 1 root root 977M 2008-04-09 16:18 large-file-GB
-rw-r--r-- 1 root root 977M 2008-04-09 17:25 large-file-GB-copy
-bash: kill: (7937) - No such process
# md5sum large-file-GB*
b5c667a723a10a3485a33263c4c2b978 large-file-GB
b5c667a723a10a3485a33263c4c2b978 large-file-GB-copy
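The same pattern in a form that is easier to read and try out. This is a sketch only: sleep 2 stands in for the long-running job, and the polls counter is just there to show the loop ran:

```shell
# Stand-in for the long-running job; any background command works here.
sleep 2 &
pid=$!            # $! is expanded only now, after the job has been started

polls=0
# kill -0 sends no signal; it only reports whether the process can be signalled.
while kill -0 "$pid" 2>/dev/null; do
  polls=$((polls + 1))
  echo "process $pid still running (poll $polls)"
  sleep 1
done
echo "process $pid has exited"
```

Redirecting kill's stderr keeps the final "No such process" message out of the output. If all you need is to block until the job finishes, wait "$pid" is simpler, but it gives you no chance to look around in the meantime.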
Performance testing - with curl
April 8th, 2008
Often I need or want to do some type of performance testing. Given my ideas on software development, I can usually do this by making simple HTTP requests. I use curl for this. You may be tempted to do this in a for loop (or worse, actually write something!):
$ time for i in {1..1000}; do curl -s "http://bashcurescancer.com/blank.html";done
real 0m23.436s
user 0m6.416s
sys 0m7.351s
Curl provides the same functionality:
$ time curl -s "http://bashcurescancer.com/blank.html?[1-1000]"
real 0m6.561s
user 0m0.294s
sys 0m0.494s
Here are the details from the curl manual:
The URL syntax is protocol dependent. You’ll find a detailed description in RFC 3986.
You can specify multiple URLs or parts of URLs by writing part sets within braces as in:
http://site.{one,two,three}.com
or you can get sequences of alphanumeric series by using [ ] as in:
ftp://ftp.numericals.com/file[1-100].txt
ftp://ftp.numericals.com/file[001-100].txt (with leading zeros)
ftp://ftp.letters.com/file[a-z].txt
No nesting of the sequences is supported at the moment, but you can use several ones next to each other:
http://any.org/archive[1996-1999]/vol[1-4]/part{a,b,c}.html
You can specify any amount of URLs on the command line. They will be fetched in a sequential manner in the specified order.
Since curl 7.15.1 you can also specify step counter for the ranges, so that you can get every Nth number or letter:
http://www.numericals.com/file[1-100:10].txt
http://www.letters.com/file[a-z:2].txt
If you specify URL without protocol:// prefix, curl will attempt to guess what protocol you might want. It will then default to HTTP but try other protocols based on often-used host name prefixes. For example, for host names starting with “ftp.” curl will assume you want to speak FTP.
Curl will attempt to re-use connections for multiple file transfers, so that getting many files from the same server will not do multiple connects / handshakes. This improves speed. Of course this is only done on files specified on a single command line and cannot be used between separate curl invokes.
This is important as it helps measure the actual change being tested. A for loop, by creating a new process every loop, will fill up your test with “local” time. Using a single curl process eliminates this - which should allow you to see the results of your test in a more transparent manner.
For example, let's say you have a change that reduces page production time. You're not sure by how much, so you decide to run 1000 tests. Eliminating a second from a 23-second test is a change of less than 5 percent, while removing a second from a 6-second test is almost 17 percent.
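To make that arithmetic concrete, here is a quick calculation using the rounded timings from above (23 seconds for the loop, 6 seconds for the single curl process). The saving_percent function is just an illustrative helper of mine:

```shell
# saving_percent SAVED TOTAL: what share of a TOTAL-second run is SAVED seconds?
saving_percent() {
  awk -v saved="$1" -v total="$2" 'BEGIN { printf "%.1f\n", saved / total * 100 }'
}

loop_share=$(saving_percent 1 23)    # for loop, roughly 23 seconds
single_share=$(saving_percent 1 6)   # single curl process, roughly 6 seconds

echo "1 second out of 23: ${loop_share}%"
echo "1 second out of 6: ${single_share}%"
```

The same one-second improvement is about four times more visible in the shorter run.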
New command: prepend
April 6th, 2008
I am utilizing Google’s project hosting to host software which I create and feel is useful or want to keep track of. I called the project Brock’s Tools. The code that led me to create this project was a command I am calling prepend 1.1. (UPDATE: See this post on sponge as it's a better general case tool.)
prepend prepends files or standard input to a file. For example, you have three files:
$ echo BROCK > a
$ echo DAVID > b
$ echo NOLAND > c
And you want to combine them into one file:
$ echo "My name is:" | prepend - a b c
$ cat c
My name is:
BROCK
DAVID
NOLAND
Or let's say you just want to append a file to itself:
$ cat a
BROCK
$ cat a >> a
cat: a: input file is output file
prepend does this:
$ prepend a
$ cat a
BROCK
BROCK
I come across a situation where this would be useful quite often. Of course prepend’ing can be done in the shell:
$ { echo "My name is:"; cat a b c; } > tmp && mv -f tmp c
$ cat c
My name is:
BROCK
DAVID
NOLAND
However, that is unsafe and I have lost data that way. I perform this operation most often when dealing with XML. In this example, it's trivial to open the file in an editor, but with a large file, it's quite nasty to do so:
$ cat something.xml
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
$ echo "</entries>" >> something.xml
$ cat something.xml
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
</entries>
$ echo "<entries>" | prepend - something.xml
$ cat something.xml
<entries>
<entry><blah/><more>stuff 1</more></entry>
<entry><blah/><more>stuff 2</more></entry>
<entry><blah/><more>stuff 3 </more></entry>
<entry><blah/><more>stuff 4</more></entry>
</entries>
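If you do want to stay in the shell, the temp-file approach shown earlier can be made somewhat less risky. The sketch below is one way to do that, not how prepend itself works: safe_prepend is a made-up helper that uses a uniquely named temp file next to the target and only replaces the target once the new content has been written successfully:

```shell
# safe_prepend TEXT FILE: put TEXT before the current contents of FILE.
# Hypothetical helper for illustration; it is not the prepend tool itself.
safe_prepend() {
  text=$1
  target=$2
  # Create the temp file next to the target so the final mv stays on one filesystem.
  tmp=$(mktemp "$(dirname "$target")/prepend.XXXXXX") || return 1
  if { printf '%s\n' "$text"; cat "$target"; } > "$tmp"; then
    mv -f "$tmp" "$target"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Demo in a scratch directory.
demo_dir=$(mktemp -d)
echo BROCK > "$demo_dir/a"
safe_prepend "My name is:" "$demo_dir/a"
prepended=$(cat "$demo_dir/a")
leftovers=$(ls "$demo_dir" | wc -l)
rm -rf "$demo_dir"
printf '%s\n' "$prepended"
```

One caveat: the replaced file takes on the permissions of the temp file that mktemp created, so a real tool would need to copy the original mode across.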
Shell Function - Which Webserver Does That Site Run?
April 4th, 2008
I just read the following post Python - Script - Which Webserver Does That Site Run? by blogger Corey Goldberg.
I prefer the shell version:
$ what-http-server() { curl -s -I "http://$1" | awk -F': ' '/^Server:/ {print $2}'; }
$ what-http-server www.pylot.org
Apache/2.0.52
$ what-http-server() { curl -s -I "$@" | awk -F': ' '/^Server:/ {print $2}'; }
$ what-http-server www.pylot.org google.com bashcurescancer.com
Apache/2.0.52
gws
Apache/2.2.6 (Unix)
That works but this version is more correct:
what-http-server() { curl -s -I $(for h in "$@"; do printf "http://%s " "$h"; done) | awk -F': ' '/^Server:/ {print $2}'; }
In the version which works for multiple hosts, we are letting curl assume the protocol is HTTP. This works fine most of the time. However, there are exceptions:
If you specify URL without protocol:// prefix, curl will attempt to guess what protocol you might want. It will then default to HTTP but try other protocols based on often-used host name prefixes. For example, for host names starting with “ftp.” curl will assume you want to speak FTP. - man curl
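A slightly more forgiving variant of that idea is to add http:// only when the argument does not already name a protocol. The with_scheme helper below is a sketch of my own, not part of curl or of the function above:

```shell
# with_scheme HOST_OR_URL: prefix http:// unless a scheme such as https:// or
# ftp:// is already present. Hypothetical helper for illustration.
with_scheme() {
  case $1 in
    *://*) printf '%s\n' "$1" ;;
    *)     printf 'http://%s\n' "$1" ;;
  esac
}

for h in www.pylot.org ftp.example.com https://bashcurescancer.com; do
  with_scheme "$h"
done
```

This way a host name that starts with “ftp.” is still fetched over HTTP, while an explicit https:// URL is passed through unchanged.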
Exposing command line programs as web services
March 27th, 2008
The web services paradigm of development is based on the Unix philosophy of “small is good”. Web services should do one job, and do it well, allowing users to develop complex solutions by combining small, reliable and proven services.
Why not then, expose the power of familiar Unix commands like sort, grep, gzip… to the web?
Here is a proof of concept python script (Python 2.3 version) to demonstrate.
Start services:
$ ./to_web.py -p8008 sort &
Thu Mar 27 13:45:54 2008 sort server started - 8008
$ ./to_web.py -p8009 gzip &
Thu Mar 27 13:46:29 2008 gzip server started - 8009
Use the services:
$ for i in {1..10}; do echo ${RANDOM:0:2}; done | \
> curl --data-binary @- "http://swat:8008/sort+-nr" | \
> curl --data-binary @- "http://swat:8009/gzip" | \
> gunzip
97
37
23
23
21
18
11
11
10
10
In my position, we have a database with host information - which has a command line interface. This tool has dependencies which are painful to resolve. With to_web.py, we can turn the command line tool into a web service and access the data without having to satisfy those additional dependencies.
This is a guest post by my esteemed colleague Adam Fokken. He can be reached here: Sadly, he does not have a blog.
Wrapping dynamic languages in shell without an extra script
March 25th, 2008
There are situations where, if you want a Python, Perl, PHP, etc. script to be portable among a few different servers, it makes sense to wrap the script in shell. A few years ago I was trying to use the Python cx_Oracle module. This module is a wrapper for the native Oracle database driver. However, it requires the driver library directory be in the LD_LIBRARY_PATH environment variable.
No problem I thought. I’ll use the os.environ dict to set the variable. Example script:
$ cat python-only.sh
#!/usr/bin/python
import sys, os
sys.path.append("/usr/local/lib/python2.4/site-packages/")
if not os.environ.has_key('LD_LIBRARY_PATH'):
    os.environ['LD_LIBRARY_PATH'] = "/home/noland/oracle-lib"
else:
    os.environ['LD_LIBRARY_PATH'] = "/home/noland/oracle-lib:" + os.environ['LD_LIBRARY_PATH']
print "LD_LIBRARY_PATH looks OK in Python: LD_LIBRARY_PATH = ", os.environ['LD_LIBRARY_PATH']
os.system('echo LD_LIBRARY_PATH looks OK via os.system: LD_LIBRARY_PATH = $LD_LIBRARY_PATH')
try:
    import cx_Oracle
    print "Imported cx_Oracle! LD_LIBRARY_PATH was set correctly."
except ImportError, e:
    print "Woops, LD_LIBRARY_PATH was not set correctly: ", e
This method does not work. The dynamic loader reads LD_LIBRARY_PATH when the process starts, so changing it from inside the running Python process only affects child processes (which is why the os.system line looks fine):
$ ./python-only.sh
LD_LIBRARY_PATH looks OK in Python: LD_LIBRARY_PATH = /home/noland/oracle-lib
LD_LIBRARY_PATH looks OK via os.system: LD_LIBRARY_PATH = /home/noland/oracle-lib
Woops, LD_LIBRARY_PATH was not set correctly: libclntsh.so.10.1: cannot open shared object file: No such file or directory
This seems to be a common problem. However, when I was dealing with this a few years ago, I could not find a good resource on Google. I bit the bullet and wrote a separate shell script wrapper, hating every invocation of it. However, there was absolutely no reason I needed a separate shell script. I could have embedded the Python within a shell script. Example:
$ cat python-and-bash.sh
#!/bin/bash
export LD_LIBRARY_PATH=/home/noland/oracle-lib:$LD_LIBRARY_PATH
/usr/bin/python<<END_OF_PYTHON
import sys
sys.path.append("/usr/local/lib/python2.4/site-packages/")
try:
    import cx_Oracle
    print "Imported cx_Oracle! LD_LIBRARY_PATH was set correctly."
except ImportError, e:
    print "Woops, LD_LIBRARY_PATH was not set correctly: ", e
END_OF_PYTHON
Ahh, much better:
$ ./python-and-bash.sh
Imported cx_Oracle! LD_LIBRARY_PATH was set correctly.
Of course I could have just set this variable in my profile. However, this creates an additional external dependency - which is what I was trying to avoid.
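One detail of the here-document approach deserves a quick sketch. With an unquoted delimiter such as END_OF_PYTHON, the shell expands $variables and backticks inside the embedded script before the interpreter ever sees it. Quoting the delimiter passes the text through untouched. The example uses cat as a stand-in for the interpreter so it runs anywhere, and the variable names are mine:

```shell
name=shell

# Unquoted delimiter: the shell substitutes $name before cat reads the text.
unquoted=$(cat <<EOF
value: $name
EOF
)

# Quoted delimiter: the text reaches cat exactly as written.
quoted=$(cat <<'EOF'
value: $name
EOF
)

printf '%s\n' "$unquoted"
printf '%s\n' "$quoted"
```

For an embedded Python or Perl script that contains dollar signs of its own, the quoted form is usually the safer choice.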

