One of the bottlenecks of the Internet is the address lookup operations performed by the routers. Expensive tailor-made hardware solutions have typically been used to achieve the necessary speed. In this article, we’ll show that it is possible to perform the lookups efficiently with a simple data structure – a level-compressed trie or LC-trie. A software implementation can sustain several million lookups per second, enough to match a Gbit/sec link. The data structure is simple and it scales well. No modifications are needed when switching from the 32-bit addresses of IP version 4 (IPv4) to the 128-bit addresses of IP version 6 (IPv6), and we expect the lookup operation to be almost as fast for the longer addresses.

The Internet protocol (IP) sends all data in packets. Each packet has a header that contains, among other things, a destination address. Currently, this address is a 32-bit number. In the next version of the Internet protocol, IPv6, these addresses will be 128 bits in length.

When a packet is sent over the Internet, it typically encounters several routers that are responsible for forwarding the packet in the correct direction toward its final destination. The router has a table that indicates a suitable forwarding direction for each possible address. This routing table does not contain complete addresses but only prefixes. All addresses starting with the same bits are forwarded in the same direction. The table below is a small routing table with 15 entries.

    Entry   Prefix
    ---------------
      0     0000
      1     0001
      2     00101
      3     010
      4     0110
      5     0111
      6     100
      7     101000
      8     101001
      9     10101
     10     10110
     11     10111
     12     110
     13     11101000
     14     11101001

The destination address of a packet is compared to the prefixes in the routing table. If there is more than one match, the longest matching prefix is chosen. If no match is found, a default route is used. The core routers in the Internet backbone are required to recognize all addresses, and their routing tables therefore tend to be large.
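To make the longest-matching-prefix rule concrete, here is a minimal sketch (ours, not the article's; addresses and prefixes are written as bit strings) that scans the 15-entry table linearly:

```python
# The 15-entry routing table from above, written as bit strings.
TABLE = ["0000", "0001", "00101", "010", "0110", "0111", "100", "101000",
         "101001", "10101", "10110", "10111", "110", "11101000", "11101001"]

def longest_prefix_match(table, addr, default=None):
    """Return the index of the longest prefix matching addr, or default."""
    best, best_len = default, -1
    for i, prefix in enumerate(table):
        if addr.startswith(prefix) and len(prefix) > best_len:
            best, best_len = i, len(prefix)
    return best
```

A real router cannot afford this linear scan over tens of thousands of entries; the rest of the article is about answering the same query in a handful of memory accesses.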

The trie is a general-purpose data structure for storing strings. The idea behind it is simple: Each string is represented by a leaf in a tree structure, and the value of the string corresponds to the path from the root of the tree to the leaf. The binary strings in the routing table correspond to the trie in Figure A.

For example, the string 010 corresponds to the path starting at the root and ending in leaf number 3: first a left-turn (0), then a right-turn (1), and finally a turn to the left (0).
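For illustration, the trie of Figure A can be sketched with nested dictionaries (our representation for exposition, not the article's array layout):

```python
TABLE = ["0000", "0001", "00101", "010", "0110", "0111", "100", "101000",
         "101001", "10101", "10110", "10111", "110", "11101000", "11101001"]

def build_trie(strings):
    """Build a binary trie; each leaf records the index of its string."""
    root = {}
    for i, s in enumerate(strings):
        node = root
        for bit in s:                      # one level per bit
            node = node.setdefault(bit, {})
        node["leaf"] = i                   # mark the end of the string
    return root
```

Following left (0), right (1), left (0) from the root reaches the leaf for entry 3, exactly the path described above.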

This simple structure is not very efficient.
The number of nodes may be large and the average depth may be long.
The traditional technique to overcome this problem is to use path compression:
Each internal node with only one child is removed.
We store a number, the skip value, in each node
that indicates how many bits have been skipped on the path.
The path-compressed version of the trie in Figure A
is shown in Figure B.
The total number of nodes in a path-compressed binary trie
is exactly 2*n* - 1,
where *n* is the number of leaves in the trie.
For a large class of distributions the average depth of
a path-compressed trie is proportional to log *n*.

Path compression is a way to compress parts of the trie that are sparsely populated.
Level compression, described by A. Andersson and S. Nilsson in
“Improved Behavior of Tries by Adaptive Branching”
(Information Processing Letters, 46(6):295-300, 1993),
is a new technique for compressing parts of the trie that are densely populated.
The idea is to replace the *i* highest complete levels of the binary trie
with a single node with 2^{i} descendants.
This replacement is performed recursively on each subtrie.
The level-compressed version, the LC-trie,
of the trie in Figure B
is shown in Figure C.
For a large class of distributions the average depth of
an LC-trie is proportional to log log *n*.
That is, the depth of the LC-trie grows very slowly as a function of the number of entries.

If we want to achieve the efficiency promised by these theoretical bounds, it is of course important to represent the trie efficiently. The standard implementation of a trie, where a set of child pointers is stored at each internal node, is not a good solution, since it has a large space overhead. A space-efficient alternative is to store the children of a node in consecutive memory locations. In this way, only a pointer to the left-most child is needed. In fact, the nodes may be stored in an array and each node can be represented by a single word containing three numbers:

- The branching factor.
- The skip value.
- A pointer to the left-most child.
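For example, the three numbers fit into one 32-bit word with shifts and masks. The field widths below are our own assumption for illustration (5 bits suffice for a branching factor up to 32 levels, 7 bits for skips up to 128-bit IPv6 addresses, leaving 20 bits for the pointer):

```python
BRANCH_BITS, SKIP_BITS, PTR_BITS = 5, 7, 20  # assumed split of a 32-bit word

def pack(branch, skip, ptr):
    """Pack branching factor, skip value, and child pointer into one word."""
    return (branch << (SKIP_BITS + PTR_BITS)) | (skip << PTR_BITS) | ptr

def unpack(word):
    """Recover the three fields from a packed node word."""
    branch = word >> (SKIP_BITS + PTR_BITS)
    skip = (word >> PTR_BITS) & ((1 << SKIP_BITS) - 1)
    ptr = word & ((1 << PTR_BITS) - 1)
    return branch, skip, ptr
```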

The array representation of the LC-trie in Figure C is shown below.

    Node   Branch   Skip   Pointer
    ------------------------------
      0       3       0       1
      1       1       0       9
      2       0       2       2
      3       0       0       3
      4       1       0      11
      5       0       0       6
      6       2       0      13
      7       0       0      12
      8       1       4      17
      9       0       0       0
     10       0       0       1
     11       0       0       4
     12       0       0       5
     13       1       0      19
     14       0       0       9
     15       0       0      10
     16       0       0      11
     17       0       0      13
     18       0       0      14
     19       0       0       7
     20       0       0       8

Using this implementation, there is no apparent way to perform efficient dynamic updates, inserting and removing single entries. However, for routing tables, you can get away with rebuilding the complete structure at regular intervals (on the order of seconds).

The search algorithm can be implemented very efficiently as demonstrated by this pseudocode:

```
node = T[0];
pos = node.skip;
branch = node.branch;
adr = node.adr;
while (branch != 0) {
    node = T[adr + EXTRACT(pos, branch, s)];
    pos = pos + branch + node.skip;
    branch = node.branch;
    adr = node.adr;
}
return adr;
```

Let `s` be the string searched for, and let `EXTRACT(p, b, s)` be a function that returns the number given by the `b` bits starting at position `p` in the string `s`. We denote the array representing the trie by `T`; the root of the trie is stored in `T[0]`.
Note that the address returned only indicates a possible hit;
the bits that have been skipped during the search may not match.
Therefore, we need to store the values of the strings separately and
perform one additional comparison to check whether the search actually was successful.

For example, we’ll search for the string 10110111. We start at the root, node number 0, and find that the branching value is 3 and the skip value is 0. Therefore, we extract the first three bits from the search string. These three bits have the value 5, which is added to the pointer, leading to position 6 in the array. At this node, the branching value is 2 and the skip value is 0. Therefore we extract the next two bits. They have the value 2. Adding 2 to the pointer, we arrive at position 15. At this node the branching value is 0, which implies that it is a leaf. The pointer value 10 gives the position of the string in the base vector. Note that we always check whether this constitutes a true hit: we compare the first five bits of the search string with the first five bits of the value stored in the base vector at the position indicated by the pointer (10) in the leaf. In fact, the routing table contains the prefix 10110, matching the string; the search was successful.
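This walk-through can be reproduced end to end. The sketch below (Python standing in for the word-level implementation; `extract` and `lookup` are our names) transcribes the node array and the 15-entry base vector, follows the pseudocode, and adds the final prefix comparison:

```python
# Array representation of the LC-trie: (branch, skip, pointer) per node.
T = [(3, 0, 1), (1, 0, 9), (0, 2, 2), (0, 0, 3), (1, 0, 11), (0, 0, 6),
     (2, 0, 13), (0, 0, 12), (1, 4, 17), (0, 0, 0), (0, 0, 1), (0, 0, 4),
     (0, 0, 5), (1, 0, 19), (0, 0, 9), (0, 0, 10), (0, 0, 11), (0, 0, 13),
     (0, 0, 14), (0, 0, 7), (0, 0, 8)]

# Base vector: the routing-table prefixes, indexed as in the table of entries.
BASE = ["0000", "0001", "00101", "010", "0110", "0111", "100", "101000",
        "101001", "10101", "10110", "10111", "110", "11101000", "11101001"]

def extract(p, b, s):
    """The number given by the b bits starting at position p of bit string s."""
    return int(s[p:p + b], 2) if b else 0

def lookup(s):
    """Return the base-vector index of the matching entry, or None on a miss."""
    branch, skip, adr = T[0]
    pos = skip
    while branch != 0:
        nbranch, nskip, nadr = T[adr + extract(pos, branch, s)]
        pos += branch + nskip      # bits consumed here, plus bits skipped next
        branch, adr = nbranch, nadr
    # The skipped bits may not match: verify against the stored prefix.
    return adr if s.startswith(BASE[adr]) else None
```

`lookup("10110111")` returns 10, reproducing the example; an address whose skipped bits fail the final comparison returns `None` and would fall through to the default route.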

The idea of level compression turns out to be quite efficient in reducing the depth of the trie. However, it is rather sensitive: You might lose some good compression opportunities if a subtrie is almost complete, with just a few strings missing. Therefore, we use a less restrictive criterion for compression. We compress a subtrie if a certain fraction, called the “fill factor,” of the strings are present. This relaxed level compression reduces the depth of the trie, but increases the size since we introduce some superfluous empty leaves in the trie.
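The fill-factor criterion can be sketched as follows (our simplification, not the article's build algorithm: prefixes shorter than the inspected depth are simply ignored). Starting from the one-bit branching that path compression guarantees, keep widening the branch as long as the fraction of occupied slots stays at or above the fill factor:

```python
TABLE = ["0000", "0001", "00101", "010", "0110", "0111", "100", "101000",
         "101001", "10101", "10110", "10111", "110", "11101000", "11101001"]

def choose_branch(prefixes, fill_factor):
    """Pick the number of bits to branch on at the root of a subtrie.

    prefixes: the distinct bit strings under the subtrie (skip removed).
    In a path-compressed trie both one-bit subtries exist, so b = 1 is safe.
    """
    b = 1
    while True:
        nb = b + 1
        occupied = len({p[:nb] for p in prefixes if len(p) >= nb})
        if occupied >= fill_factor * (1 << nb):
            b = nb
        else:
            return b
```

On the 15-entry table, a fill factor of 1.00 picks a 3-bit branch at the root (all eight 3-bit prefixes occur), matching the array above; relaxing to 0.50 widens the root to 4 bits at the cost of some empty leaves.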

Another simple optimization is to use a large branching factor at the root of the trie. This should be a good idea, since every bit that we gain at the top of the trie will affect each and every entry. You may recognize this as a standard technique known as bucketing: The complete address space of the problem is divided into buckets of equal size and the entries belonging to each bucket are treated separately.

The trie structure, as we have described it, cannot handle routing tables containing entries that are prefixes of other entries. There are several ways to address this problem. One technique is to store entries not only at the leaves of the trie but also in internal nodes. However, this makes the structure more complicated (we need to implement nodes of different sizes), and the search operation will be slower since we need to check for a match at the internal nodes of the trie as well. Therefore, we have chosen to use a special prefix table that is accessed if no match is found at the leaf of the trie. This table is simple, and, in practice, it is also small and shallow. Conceptually, an entry that is a prefix corresponds to an exception in the address space. Each entry in the routing table defines a set of addresses that share the same routing table entry. In such an address set, a longer match corresponds to a subset of addresses that should be routed differently. This special case will probably be even less frequent with IPv6. With a much larger address space, it is possible to allocate the addresses in a strictly hierarchical fashion.
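The fallback can be sketched like this (the names are ours): the trie holds only the leaves, and entries that are prefixes of other entries live in a small separate table that is consulted only when the leaf comparison fails:

```python
def route(addr, trie_lookup, prefix_table, default):
    """trie_lookup returns an entry on a verified hit, or None on a miss."""
    hit = trie_lookup(addr)
    if hit is not None:
        return hit
    # The exceptions: internal prefixes, tried longest-first.
    for prefix, entry in sorted(prefix_table, key=lambda pe: -len(pe[0])):
        if addr.startswith(prefix):
            return entry
    return default
```

Because the prefix table is small and is reached only on a miss at a leaf, the linear scan here costs little in practice.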

We performed measurements on a Sun Ultra Sparc II running in multiuser mode equipped with two 296-MHz processors and 512 MB of RAM. The code is available electronically.

We have performed simulations for the U.S. core routers as provided by the Internet Performance Measurement and Analysis (IPMA) project. Traffic traces for these routers are not freely available, since they contain potentially sensitive data. However, we have been able to obtain both routing tables and partial traffic traces from the Finnish University and Research Network (FUNET). We had access only to the first 24 bits of each address. However, this should have little effect on the measurements since the last eight bits are local addresses that are rarely inspected by this router. The routing table contained more than 40,000 entries, which is comparable to the largest U.S. core routers. Finland is densely populated, not with people, but with IP addresses.

The table below shows some of our measurements.

    Fill      Million lookups       Average path     Branch     Size of tree
    factor    per sec               [Maximum path]   at root    (kwords)
              [Simulated traffic]
    ------------------------------------------------------------------------
    No LC     1.2 [0.8]             19.5 [25]          2          80
    ------------------------------------------------------------------------
    1.00      2.5 [1.4]             8.31 [14]          2          64
    ------------------------------------------------------------------------
    0.10      4.5 [1.6]             1.81 [4]          16         304
    0.20      4.3 [1.6]             2.36 [4]          12         159
    0.30      4.2 [1.7]             2.71 [5]           8         110
    0.40      4.4 [1.9]             2.93 [6]           7          86
    0.50      4.3 [1.8]             3.36 [7]           6          78
    ------------------------------------------------------------------------
    0.25      4.8 [2.2]             1.73 [5]          16 fix     171
    0.50      5.0 [2.1]             1.73 [5]          16 fix     129
    0.75      4.4 [1.9]             1.82 [6]          16 fix     118
    1.00      4.3 [1.7]             2.10 [9]          16 fix     117
    ------------------------------------------------------------------------

A standard path-compressed trie with no level compression doesn’t behave well. The average depth, which is the average number of memory accesses required by the search algorithm, is almost 20. Level compression gives a big improvement. Not only does it reduce the depth, but it even makes the structure smaller. Relaxed level compression gives further improvements. A lookup in an LC-trie with fill factor 0.30 needs less than three memory accesses on average and only five memory accesses in the worst case. Yet the size of the structure is only slightly larger than for a standard LC-trie. The number of lookups per second is sufficient to match a Gbit/sec link, since the average size of a packet is about 250 bytes.

It is interesting to compare the simulated traffic with the real traffic traces. The simulated traffic simply consists of the entries in the routing table randomly permuted. This means that the lookups for simulated traffic have very bad memory access patterns, while the real traffic has a lot of locality -- packets typically come in groups headed for the same destination. Therefore, the number of cache misses will be much smaller for real traffic. Even so, the difference in speed is only a factor of 2. This may surprise you, since a cache miss can be 100 times slower than a direct access to the cache. However, the number of cache misses is small, even for bad memory access patterns. It seems that caching is better than its reputation suggests!

Another interesting observation is that the bucketing scheme alone, with no relaxed level compression, behaves well. This is not so surprising. In IPv4 only the first 24 bits of the addresses are typically inspected by the core routers. Hence, using bucketing on the first 16 bits, each bucket will contain only a few elements and all that remains is a small (at most 256 elements) search problem that could be solved by any standard technique. Hence, IP address lookup is easy. This simple bucketing scheme will, however, not be sufficient for IP version 6 with its 128-bit addresses.
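To see why prefixes fit the bucketing scheme, note that a prefix shorter than the bucket width covers a contiguous run of buckets, so table entries can be distributed to their buckets up front. A sketch (ours, with a toy 4-bit bucket index in place of 16):

```python
def covered_buckets(prefix, width=4):
    """Bucket indices covered by a prefix no longer than width bits."""
    lo = int(prefix.ljust(width, "0"), 2)  # prefix followed by all zeros
    hi = int(prefix.ljust(width, "1"), 2)  # prefix followed by all ones
    return range(lo, hi + 1)
```

For instance, the 2-bit prefix 10 covers the four buckets 1000 through 1011; a prefix of exactly the bucket width covers a single bucket.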

Fortunately, our structure scales well with the length of the addresses.
In fact, no changes to the trie are needed.
We just need to make more room in the array where the strings are stored.
This array is only accessed once for each search and
accessing four consecutive words instead of one is only marginally slower.
Also, note that even without bucketing, using only relaxed level compression,
the measurements are good.
Furthermore, theory tells us that the depth of the trie
does not depend on the lengths of the strings
but only on their number and distribution.
And for a large class of distributions, the average depth is proportional
to log log *n* – very close to constant.
For a more detailed discussion,
including a comparison with other proposed software solutions,
see our paper “Fast Address Look-up for Internet Routers”
(Fourth IFIP Conference on Broadband Communications, Stuttgart, Germany, April 1-3, 1998).

Address lookup in IPv4 is easy. Since only the first 24 bits are used by the core routers, a simple bucketing scheme solves the problem. In IPv6, the addresses will be longer and we will need more sophisticated methods. We believe that the LC-trie is a suitable data structure. It’s simple and compact and an address lookup requires only a few memory accesses. The depth of the structure does not depend on the length of the strings and grows slowly as a function of the number of entries in the table.

Stefan Nilsson and Gunnar Karlsson