Sorting large files faster with a shell script
January 18th, 2008
I often find myself having to sort large files. Gigabyte files take an extremely long time to sort. Twenty to thirty minutes is common, even when utilizing an entire CPU. As such, I wrote a script to perform distributed sorting.
Setup requires editing the “hosts” array in the top of the script. Put the hosts or username@host pair, like so:
hosts[0]="localhost" hosts[1]="localhost hosts[2]="user@otherhost" hosts[3]="yahost" hosts[4]="finalhost"
Then, optionally setup SSH keys to automate authentication and start sorting!
$ ls -l unsorted_file -rw-r--r-- 1 noland users 499328352 2008-01-18 18:19 unsorted_file
$ time sort unsorted_file > sorted_file2 real 19m17.247s user 18m8.720s sys 0m10.169s
$ time ./distsort.sh unsorted_file > sorted_file real 9m46.398s user 7m19.651s sys 0m21.397s $ md5sum sorted_file sorted_file2 ceb8a3aa2947868cae6ee33c457ba8d0 sorted_file ceb8a3aa2947868cae6ee33c457ba8d0 sorted_file2
Here is the proccess:
- Split the file into equal portions, one for each entry in the hosts array.
- For each host, either background local sort process or a remote sort process using SSH. The file is transferred via a pipeline to the remote hosts.
- Wait for all processes to exit.
- Merge the sorted files.
- Remove temp files.
Some things to keep in mind:
- If you don’t have mktemp available, you will have to add a mktemp function.
- The demonstration above was for a 500MB file and utilized three sort processes, one on the local host and two remote.
- You only want one sort process per CPU, as sort will consume 100% of the CPU. If you have two CPU’s on your local machine, place two localhost entries and it will start two sort processes. The same goes for remote hosts.
- The file must be large enough so that the cost of transferring each “chunk” over the wire twice is less than the cost of sorting the entire file locally.
If you change this script or have suggestions of any kind, please do comment!

