I often find myself having to sort large files. Gigabyte files take an extremely long time to sort. Twenty to thirty minutes is common, even when utilizing an entire CPU. As such, I wrote a script to perform distributed sorting.

Setup requires editing the “hosts” array in the top of the script. Put the hosts or username@host pair, like so:

hosts[0]="localhost"
hosts[1]="localhost
hosts[2]="user@otherhost"
hosts[3]="yahost"
hosts[4]="finalhost"

Then, optionally setup SSH keys to automate authentication and start sorting!

$ ls -l unsorted_file
-rw-r--r-- 1 noland users 499328352 2008-01-18 18:19 unsorted_file
 $ time sort unsorted_file > sorted_file2

real    19m17.247s
user    18m8.720s
sys     0m10.169s
$ time ./distsort.sh unsorted_file > sorted_file

real    9m46.398s
user    7m19.651s
sys     0m21.397s

$ md5sum sorted_file sorted_file2
ceb8a3aa2947868cae6ee33c457ba8d0  sorted_file
ceb8a3aa2947868cae6ee33c457ba8d0  sorted_file2

Here is the proccess:

  1. Split the file into equal portions, one for each entry in the hosts array.
  2. For each host, either background local sort process or a remote sort process using SSH. The file is transferred via a pipeline to the remote hosts.
  3. Wait for all processes to exit.
  4. Merge the sorted files.
  5. Remove temp files.

Some things to keep in mind:

  • If you don’t have mktemp available, you will have to add a mktemp function.
  • The demonstration above was for a 500MB file and utilized three sort processes, one on the local host and two remote.
  • You only want one sort process per CPU, as sort will consume 100% of the CPU. If you have two CPU’s on your local machine, place two localhost entries and it will start two sort processes. The same goes for remote hosts.
  • The file must be large enough so that the cost of transferring each “chunk” over the wire twice is less than the cost of sorting the entire file locally.

If you change this script or have suggestions of any kind, please do comment!