Avalg homework 2 fall 2004

This homework is due December 7 at 15.15. It can be delivered to Stefan personally at his office or put in his mail slot at the department. Do not use the department mail box. It can also be given to Stefan in connection with one of the lectures. Solutions handed in late are not accepted and will not be graded.

Some forms of collaboration are allowed and required. The size of a group of collaboration should be two people. The group should hand in only one solution and for each problem it should be clearly marked which of the members have contributed.

All members of the group must be present when the homework is returned. If your group talked to Per for homework 1 you should talk to Isaac this time and vice versa.

1. Text searching with preprocessing (90p)
Your task is to build a simple data structure that facilitates searching in a text. It consists of a sorted array of pointers. There is one pointer for each position in the text. The pointer array is sorted according to the suffixes to which the pointers refer. For example, the text "abba0" (0 is a special end of string character) has the suffix array [3, 0, 2, 1], where the pointers refer to the suffixes "a0", abba0", "ba0", and "bba0", respectively.

  • Build this data structure for a text t. A solution with worst-case time complexity Omega(n*n), where n is the length of the text, is not good enough. For a full score you will need to find a better algorithm; a quadratic algorithm will not meet the time bounds for strings containing 1,000,000 characters. (30p)
  • Find the number of substrings of t that match a given string. You should count the number of matching strings. (10p)
  • Find the length of the longest repeated substring in t. (20p)
  • Find the number of substrings of t that match a string which may contain one or more '*'-symbols. A '*' matches any string. For example, consider the string "abbc" and the pattern "a*b*c". There are two matches, b can match either the first or the second b, even though there is only one string that matches, namely "abbc". (30p)
You should give a clear description of the algorithms and an estimate of the asymptotic worst-case performance for each of the four subtasks.
     This is a programming exercise and the same quality standards as in Homework 1 apply. You may assume that the text is stored on file and consists of ASCII characters (excluding '0' and '*'). The program should contain a simple user interface for the queries and the time to execute each command should be reported to the user. The text contains at most 1,000,000 characters. Your goal is to build the suffix array within one minute and the searches should take no more than 10 seconds. You do not know what the text files and queries will look like.

2. Fast sorting (60p)
In the lectures you've seen how to sort n word-sized integers on a unit-cost RAM in O(n log log n) time. In this exercise you will study some special cases where it's possible to find easier algorithms or better time bounds.

  • If there are many elements - e.g. if n is close to 2^w, where w is the word length - you may sort in linear time. Give an algorithm and discuss how large n needs to be. (15p)
  • If the elements are small you may also sort in linear time. Give an algorithm and discuss how small the elements need to be. (15p)
  • You may sort faster if there are many repeated elements. How many distinct elements can you handle and how fast can you sort? (15p)
  • The algorithm discussed in class uses large amounts of memory. How much? By doing the radix sorting phase in more than two steps you may reduce the memory requirements while increasing the running time. Explain how to do this and give formulas describing the time-space tradeoff. (15p)

Stefan Nilsson