Radix sorting & searching

Stefan Nilsson

Efficient sorting and searching are corner-stones in algorithm design. In computer science it has become a deep-rooted habit to use comparison-based methods to solve these problems. In this thesis we argue that radix sorting and searching algorithms are superior in many respects, both practical and theoretical. They are often remarkably simple and many, if not most, sorting and searching problems may be cast in a form suitable for such algorithms.

In particular, we show that string sorting can be reduced to integer sorting in optimal asymptotic time and we show that a unit-cost RAM with a word length of w bits can sort n word-sized integers in O(n log log n) time for arbitrary w ≥ log n, a significantly improved upper bound for sorting. The algorithm is simple and runs surprisingly fast also in practice.

We introduce a new practical radix sorting algorithm, Forward radixsort, that combines the advantages of traditional LSD and MSD radixsort in a simple way. Adding a preprocessing step to Forward radixsort we obtain an algorithm with interesting theoretical properties. For example, if B / n = O(w), where B is the minimum number of bits that must be inspected to distinguish the strings, it is possible to sort n binary strings in O(n log(B / n log n)) time. The result can also be expressed in terms of the entropy H per bit of the input: n strings from a stationary ergodic process can be sorted in O(n log 1/H) time. We implement some of the new radix sorting algorithms in the C programming language and achieve sorting routines that run significantly faster than previously presented string sorting routines.

The other theme of the thesis is a new compression technique, level compression, for trie structures. A level-compressed trie, or LC-trie, is easy to implement and inherits the good properties of standard tries with respect to neighbor and range searches, while the average depth is significantly decreased. The expected average depth of an element in an LC-trie is proportional to the iterated logarithm function for uniformly distributed data and proportional to log log n for data from a Bernoulli-type distribution.

We also study the problem of string searching using suffix trees. Combining the methods of path compression, level compression, and data compression we achieve an efficient, compact, and fast implementation of a suffix tree. Based on extensive simulations, we argue that this new data structure is useful in many practical situations. For example, it can be used to reduce the number of accesses to secondary memory when searching in very large sets of data.