Avalg homework 2 fall 2002
This homework is due December 10 at 10.15. It can be delivered to Stefan or Isaac personally at their offices or put in their mail slot at the department. It can also be given to Stefan in connection with one of the lectures. Solutions handed in late are not accepted and will not be graded.
Some forms of collaboration are allowed and required. The size of a group of collaboration should be two people. The group should hand in only one solution and for each problem it should be clearly marked which of the members have contributed.
The homework will be returned on Thurday and Friday December 12-13. All members of the group must be present when the homework is returned. If your group talked to Stefan for homework 1 you should talk to Isaac this time and vice versa.
This is a programming exercise and the same quality standards as in Homework 1 apply. You may assume that the text is stored on file and consists of ASCII characters (excluding '0' and '*'). The program should contain a simple user interface for the queries and the time to execute each command should be reported to the user. The text contains at most 1,000,000 characters. Your goal is to build the suffix array within one minute and the searches should take no more than 10 seconds. You do not know what the text files and queries will look like.
Questions and answers
Q: When counting the number of strings matching a pattern, should we count the number of MATCHES or the number of MATCHING STRINGS?
Consider the string "abbc" and the pattern "a*b*c". There are two matches, b can match either the first or the second b, but there is only one string that matches, namely "abbc".
A: You should count the number of matching strings.
Q: Look at the pattern "ab" and the string "abab". What are the number of matching substrings?
Q: Look at the pattern "a*" and the string "abc". What are the number of matching substrings?
Q: I've found a solution with worst-case time complexity Omega(n*n), where n is the length of the string. Is that good enough?
A: For a full score you will need to find a better algorithm. It's not likely that a quadratic algorithm will meet the time bounds for strings containing 1,000,000 characters.
Q: We have a question about the "longest repeated substring". Does this mean the longest substring that occurs at least two times or does it mean the longest substring that repeats itself, as in "abcabc".
A: It's the longest substring that occurs at least twice.
Q: Look at the string "AABB" and the pattern "A*B". Can the string that matches * contain the characters 'A' and 'B'? Is the answer 1 or 4?
A: The '*' character matches _any_ string. There are 4 matching substrings in your example.