1 Suffix trees
To this point we have only examined linear time solutions to the exact pattern matching problem that preprocess
the pattern. However, there are also linear time solutions that preprocess the text. Although we will find the worst
case complexity of these methods to be the same (or similar depending on the level of granularity of our analysis),
whether it is best to preprocess the pattern or the text is often dictated by the problem at hand. In fact, often we
are not given a choice as in many problems only one of the pattern or the text are made available for preprocessing.
Definition 1: The substring matching problem
Given a reference text txt[1 . . . n], preprocess txt such that any given pat[1 . . . m] can be searched in time
proportional to the length of the pattern, O(m).
In the substring matching problem we are given access to the text and need to preprocess it in order to support
rapid queries. In contrast, the exact pattern matching problem gave us a pattern and asked us to find it in an
initially unknown text. The algorithms that we have considered previously enable us to efficiently search for a single
pattern within multiple large texts. However, now we are tasked with efficiently searching for multiple patterns
within a single text. This reversal of roles of the pattern and the text reduces the effectiveness of the methods that
preprocess the pattern and to remedy this we need to construct methods that preprocess the text.
There are several data structures that can be constructed from the text that support pattern matching in O(m)-
time. For instance, we examined suffix trees and the burrows wheeler transform (BWT) in FIT2004. Here we
choose to focus on suffix trees as they are tremendously versatile data structures that enable efficient solutions to
a multitude of important problems (see tutorial 3 and assignment 2 for examples). However, naive construction of
suffix trees requires O(n2)-time and yields an O(n2+km)-time solution to the substring matching problem; assuming
we are given k patterns of the same length. On the surface they do not appear to be an attractive alternative to
exact pattern matching algorithms that promise an O(m)-time preprocessing phase and an O(n)-time search phase
for each pattern, giving O(km + kn) overall. Fortunately, there exists an ingenious algorithm for constructing a
suffix tree in linear time known as Ukkonen’s algorithm, and with it suffix trees allow us to solve the substring
matching problem in O(n + km)-time. But before we examine the inner workings of Ukkonen’s, we will quickly
remind ourselves of what a suffix tree is and examine their connection to the implicit suffix trees, which are central
to the Ukkonen’s algorithm.
Definition 2: Suffix trees
A suffix tree for a string str[1 . . . n] is a tree with the following properties.
1. It has exactly n leaves numbered 1 through n.
2. Except for the root, every internal node has at least two children.
3. Each edge is labelled with a non-empty substring of str.
4. Every node has at most one edge beginning with any given character of the alphabet.
5. The string-label of the path starting at the root and ending at the leaf numbered i is the suffix str[i . . . n],for i = 1 . . . n.
Definition 3: Terminal character
The terminal character ‘$’ is defined to be a character in the alphabet such that,
1. It occurs only once in the string and is always the last character in the string.
2. It is lexicographically smaller than any other character in the alphabet.
本网站支持 Alipay WeChatPay PayPal等支付方式
E-mail: email@example.com 微信号:vipnxx