I played a bit more with my C++ Suffix Tree experiment, gradually exploring how Suffix Trees are equivalent to a Suffix Array and its associated LCP Array, allowing us to do anything with Suffix Arrays (and LCP arrays – Longest Common Prefix) that we can do with Suffix Trees, but with less memory (but only by a constant factor) and, I think, more cache-friendly use of memory. This code is still far from suitable for any real use.
Converting between Suffix Trees and Suffix Arrays
I first moved my previous code for Ukkonen’s algorithm from a branch into master, so we can now construct a SuffixTree from a string, in O(n) linear time. Subsequent insert()s still use the naive O(n^2) code. Traditionally, you’d concatenate all your strings with unique separator characters, but I’m trying to avoid that. And, really, I’m just trying to explore the use of the relevant algorithms – it’s a bonus if I can wrap it up into something useful.
I then added a get_suffix_array_and_lcp_array() method to get the Suffix Array and associated LCP Array, in O(n) linear time. This works by doing a lexicographically-ordered depth first search.
Finally, I added a SuffixTree constructor that takes such a Suffix Array and LCP Array to build the suffix tree, again in O(n) linear time. This iterates over the suffixes in the Suffix Array adding a node for each, walking back up the tree to split at the point specified by the LCP array. I kept the path in a stack for that walk up. This constructor let me test that I could use a Suffix Array (and LCP Array) obtained from the SuffixTree to recreate the SuffixTree correctly.
This was based on the ideas I learned from Dan Gusfield’s Fundamentals of Stringology lectures, specifically part 3. Incidentally, his later discussion of the Burrows Wheeler Transform led me to Colt McAnlis’ wonderful Compressor Head series, which far more people should watch.
Towards a usable SuffixArray API.
I’ve already created a SuffixArray class to do much of what the SuffixTree class can do, such as finding matches for a substring. It currently does an O(log(n)) binary search over the suffix array. But this should be possible in O(m) time, where m is the length of the substring, hopefully with far less string comparisons, just as it is possible with a Suffix Array. This can apparently be done with the help of the LCP Array. Hopefully I’ll get to that.
I also want to implement Kärkkäinen & Sanders DC3 algorithm to construct the Suffix Array directly from the string in O(n) linear time. And, if I understand Dan Gusfield’s lecture properly, we can use the Burrows-Wheeler transform to construct the associated LCP Array also in linear time. Hopefully I’ll get to that too.