1 Suffix trees
To this point we have only examined linear time solutions to the exact pattern matching problem that preprocess the pattern. However, there are also linear time solutions that preprocess the text. Although we will find the worst case complexity of these methods to be the same (or similar depending on the level of granularity of our analysis),whether it is best to preprocess the pattern or the text is often dictated by the problem at hand. In fact, often we are not given a choice as in many problems only one of the pattern or the text are made available for preprocessing.
Definition 1: The substring matching problem
Given a reference text txt[1 . . . n], preprocess txt such that any given pat[1 . . . m] can be searched in time proportional to the length of the pattern, O(m).
In the substring matching problem we are given access to the text and need to preprocess it in order to support rapid queries. In contrast, the exact pattern matching problem gave us a pattern and asked us to find it in an initially unknown text. The algorithms that we have considered previously enable us to efficiently search for a single pattern within multiple large texts. However, now we are tasked with efficiently searching for multiple patterns within a single text. This reversal of roles of the pattern and the text reduces the effectiveness of the methods that preprocess the pattern and to remedy this we need to construct methods that preprocess the text.
There are several data structures that can be constructed from the text that support pattern matching in O(m)-time. For instance, we examined suffix trees and the burrows wheeler transform (BWT) in FIT2004. Here we choose to focus on suffix trees as they are tremendously versatile data structures that enable efficient solutions to a multitude of important problems (see tutorial 3 and assignment 2 for examples). However, naive construction of suffix trees requires O(n2)-time and yields an O(n2+km)-time solution to the substring matching problem; assuming we are given k patterns of the same length. On the surface they do not appear to be an attractive alternative to exact pattern matching algorithms that promise an O(m)-time preprocessing phase and an O(n)-time search phase for each pattern, giving O(km + kn) overall. Fortunately, there exists an ingenious algorithm for constructing a suffix tree in linear time known as Ukkonen’s algorithm, and with it suffix trees allow us to solve the substring matching problem in O(n + km)-time. But before we examine the inner workings of Ukkonen’s, we will quickly remind ourselves of what a suffix tree is and examine their connection to the implicit suffix trees, which are central to the Ukkonen’s algorithm.
Definition 2: Suffix trees
A suffix tree for a string str[1 . . . n] is a tree with the following properties.
1. It has exactly n leaves numbered 1 through n.
2. Except for the root, every internal node has at least two children.
3. Each edge is labelled with a non-empty substring of str.
4. Every node has at most one edge beginning with any given character of the alphabet.
5. The string-label of the path starting at the root and ending at the leaf numbered i is the suffix str[i . . . n],for i = 1 . . . n.
Definition 3: Terminal character
The terminal character ‘$’ is defined to be a character in the alphabet such that,
1. It occurs only once in the string and is always the last character in the string.
2. It is lexicographically smaller than any other character in the alphabet.
本网站支持 Alipay WeChatPay PayPal等支付方式
E-mail: email@example.com 微信号:vipnxx