I often find myself having to sort large files. Gigabyte files take an extremely long time to sort. Twenty to thirty minutes is common, even when utilizing an entire CPU. As such, I wrote a script to perform distributed sorting.

Setup requires editing the “hosts” array in the top of the script. Put the hosts or username@host pair, like so:

hosts[0]="localhost"
hosts[1]="localhost
hosts[2]="user@otherhost"
hosts[3]="yahost"
hosts[4]="finalhost"

Then, optionally setup SSH keys to automate authentication and start sorting!

$ ls -l unsorted_file
-rw-r--r-- 1 noland users 499328352 2008-01-18 18:19 unsorted_file
 $ time sort unsorted_file > sorted_file2

real    19m17.247s
user    18m8.720s
sys     0m10.169s
$ time ./distsort.sh unsorted_file > sorted_file

real    9m46.398s
user    7m19.651s
sys     0m21.397s

$ md5sum sorted_file sorted_file2
ceb8a3aa2947868cae6ee33c457ba8d0  sorted_file
ceb8a3aa2947868cae6ee33c457ba8d0  sorted_file2

Here is the proccess:

  1. Split the file into equal portions, one for each entry in the hosts array.
  2. For each host, either background local sort process or a remote sort process using SSH. The file is transferred via a pipeline to the remote hosts.
  3. Wait for all processes to exit.
  4. Merge the sorted files.
  5. Remove temp files.

Some things to keep in mind:

  • If you don’t have mktemp available, you will have to add a mktemp function.
  • The demonstration above was for a 500MB file and utilized three sort processes, one on the local host and two remote.
  • You only want one sort process per CPU, as sort will consume 100% of the CPU. If you have two CPU’s on your local machine, place two localhost entries and it will start two sort processes. The same goes for remote hosts.
  • The file must be large enough so that the cost of transferring each “chunk” over the wire twice is less than the cost of sorting the entire file locally.

If you change this script or have suggestions of any kind, please do comment!

5 Responses to “Sorting large files faster with a shell script”

  1. Pages tagged "real" Says:

    [...] Sorting large files faster with a shell script. xmiszBubblez bookmarked on 01/18/08, saved by 1 others [...]

  2. Brandon Says:

    You may get even better performance if you transmit your data compressed (use the -C option to ssh).

  3. Yves Junqueira Says:

    @Brandon,

    Good point. Even if ’sort’ is CPU-bound and SSH would be competing for processor cycles with the local ’sort’ process, that may not be a problem in the end, since the local sort probably ends much faster than the others, so there is free local cpu power to spare.

    And if you have SMP, ssh -C makes even more sense.

  4. admin Says:

    Good points guys. I think if you have multiple processors, the -C option makes sense. However, since compression is also CPU intensive, tests would need to be done to see whether it makes more sense to use -C or to just fire up an additional local sort process.

  5. Joel Says:

    Thank you! I just applied this to ~80 million lines of data (700MB in bz2), sorted in 40 minutes on 16 machines.

Leave a Reply

If Wordpress eats your comment (shell output, loops, ex..) email the text to me.