A simple and dynamic data structure for pattern matching in texts

Woo, Sung-Whan, author; McConnell, Ross M., advisor; Bohm, A. P. Willem, committee member; Penttila, Tim, committee member; Malaiya, Yashwant K., committee member

A simple and dynamic data structure for pattern matching in texts

Files

Woo_colostate_0053A_10463.pdf (1.21 MB)

Date

2011

Authors

Woo, Sung-Whan, author

McConnell, Ross M., advisor

Bohm, A. P. Willem, committee member

Penttila, Tim, committee member

Malaiya, Yashwant K., committee member

Abstract

The demand for a pattern matching algorithm is currently on the rise from diverse areas such as string search, image matching, voice recognition and bioinformatics. In particular, string search or matching algorithms have been growing in popularity as they have been applied to areas such as text editors, search engines and bioinformatics. To satisfy these various demands, many string matching methods have been developed to search for substrings (pattern strings) within a text, and several techniques employ the use of tree data structures, deterministic finite automata, and other structures. The problem of string matching is defined by finding all location of a pattern string P within a text T, where preprocessing of T is allowed in order to facilitate the queries. There has been significant success in finding a pattern string in O(m+k) time, where m is the length of the pattern string and k is the number of occurrences, using data structures that can be constructed in O(n) time, where n is the length of T. Suffix trees and directed acyclic word graphs are such data structures. All of these data structures index the searched text in O(m+k) time. However, the difficulty of understanding and programming the construction algorithms is rarely mentioned. Also, they have significant space requirements and take Θ(n) time to update even if one character of T is changed. To solve these problems, we propose the augmented position heap. It can be built in O(n) time, and can be used to search a pattern string in O(m+k) time. Most importantly, when a block of j characters are inserted or deleted, the asymptotic updating it when a text is modified is O((h(T) + j)h(T)), where h(T) is the length of the longest substring X of T that occurs at least
X
times in T, where
X
is the length of X. For texts arising from practical applications, h(T) is typically slowly growing function of
T
; for a random text T, its expected value is O(logn). Another issue in data structures that must be addressed is space requirement. The most space efficient data structure for string search is the suffix array, which uses 2n words and supports searches in O(nlogn + m + k). A compact representation of the position heap proposed in this thesis also takes 2n words, but can be updated in O((h(T) + j)h(T)) time, but takes O(m2+k) time for a search. The best bound known bound for updating the suffix array or the directed acyclic word graph is O(n), and they both take considerably more space. A compact representation proposed in this thesis for the augmented position heap takes 4n words, can be updated just as efficiently as the position heap, and takes O(m+k) time for a search.

Subject

data structure

string searching

pattern matching

dynamic string matching

URI

http://hdl.handle.net/10217/48169

Collections

2000-2019
Theses and Dissertations

Full item page

A simple and dynamic data structure for pattern matching in texts

Files

Date

Authors

Journal Title

Journal ISSN

Volume Title

Abstract

Description

Rights Access

Subject

Citation

URI

Associated Publications

Collections