Sorting large files faster with a shell script
January 18th, 2008
I often find myself having to sort large files. Gigabyte files take an extremely long time to sort. Twenty to thirty minutes is common, even when utilizing an entire CPU. As such, I wrote a script to perform distributed sorting.
Setup requires editing the “hosts” array in the top of the script. Put the hosts or username@host pair, like so:
hosts[0]="localhost" hosts[1]="localhost hosts[2]="user@otherhost" hosts[3]="yahost" hosts[4]="finalhost"
Then, optionally setup SSH keys to automate authentication and start sorting!
$ ls -l unsorted_file -rw-r--r-- 1 noland users 499328352 2008-01-18 18:19 unsorted_file
$ time sort unsorted_file > sorted_file2 real 19m17.247s user 18m8.720s sys 0m10.169s
$ time ./distsort.sh unsorted_file > sorted_file real 9m46.398s user 7m19.651s sys 0m21.397s $ md5sum sorted_file sorted_file2 ceb8a3aa2947868cae6ee33c457ba8d0 sorted_file ceb8a3aa2947868cae6ee33c457ba8d0 sorted_file2
Here is the proccess:
- Split the file into equal portions, one for each entry in the hosts array.
- For each host, either background local sort process or a remote sort process using SSH. The file is transferred via a pipeline to the remote hosts.
- Wait for all processes to exit.
- Merge the sorted files.
- Remove temp files.
Some things to keep in mind:
- If you don’t have mktemp available, you will have to add a mktemp function.
- The demonstration above was for a 500MB file and utilized three sort processes, one on the local host and two remote.
- You only want one sort process per CPU, as sort will consume 100% of the CPU. If you have two CPU’s on your local machine, place two localhost entries and it will start two sort processes. The same goes for remote hosts.
- The file must be large enough so that the cost of transferring each “chunk” over the wire twice is less than the cost of sorting the entire file locally.
If you change this script or have suggestions of any kind, please do comment!


January 18th, 2008 at 11:16 pm
[...] Sorting large files faster with a shell script. xmiszBubblez bookmarked on 01/18/08, saved by 1 others [...]
January 19th, 2008 at 9:43 am
You may get even better performance if you transmit your data compressed (use the -C option to ssh).
January 21st, 2008 at 8:59 pm
@Brandon,
Good point. Even if ’sort’ is CPU-bound and SSH would be competing for processor cycles with the local ’sort’ process, that may not be a problem in the end, since the local sort probably ends much faster than the others, so there is free local cpu power to spare.
And if you have SMP, ssh -C makes even more sense.
January 21st, 2008 at 9:56 pm
Good points guys. I think if you have multiple processors, the -C option makes sense. However, since compression is also CPU intensive, tests would need to be done to see whether it makes more sense to use -C or to just fire up an additional local sort process.
April 27th, 2009 at 6:57 pm
Thank you! I just applied this to ~80 million lines of data (700MB in bz2), sorted in 40 minutes on 16 machines.