Repository logo
 

A simple and dynamic data structure for pattern matching in texts

dc.contributor.authorWoo, Sung-Whan, author
dc.contributor.authorMcConnell, Ross M., advisor
dc.contributor.authorBohm, A. P. Willem, committee member
dc.contributor.authorPenttila, Tim, committee member
dc.contributor.authorMalaiya, Yashwant K., committee member
dc.date.accessioned2007-01-03T05:34:49Z
dc.date.available2007-01-03T05:34:49Z
dc.date.issued2011
dc.description.abstractThe demand for a pattern matching algorithm is currently on the rise from diverse areas such as string search, image matching, voice recognition and bioinformatics. In particular, string search or matching algorithms have been growing in popularity as they have been applied to areas such as text editors, search engines and bioinformatics. To satisfy these various demands, many string matching methods have been developed to search for substrings (pattern strings) within a text, and several techniques employ the use of tree data structures, deterministic finite automata, and other structures. The problem of string matching is defined by finding all location of a pattern string P within a text T, where preprocessing of T is allowed in order to facilitate the queries. There has been significant success in finding a pattern string in O(m+k) time, where m is the length of the pattern string and k is the number of occurrences, using data structures that can be constructed in O(n) time, where n is the length of T. Suffix trees and directed acyclic word graphs are such data structures. All of these data structures index the searched text in O(m+k) time. However, the difficulty of understanding and programming the construction algorithms is rarely mentioned. Also, they have significant space requirements and take Θ(n) time to update even if one character of T is changed. To solve these problems, we propose the augmented position heap. It can be built in O(n) time, and can be used to search a pattern string in O(m+k) time. Most importantly, when a block of j characters are inserted or deleted, the asymptotic updating it when a text is modified is O((h(T) + j)h(T)), where h(T) is the length of the longest substring X of T that occurs at least
dc.description.abstractX
dc.description.abstracttimes in T, where
dc.description.abstractX
dc.description.abstractis the length of X. For texts arising from practical applications, h(T) is typically slowly growing function of
dc.description.abstractT
dc.description.abstract; for a random text T, its expected value is O(logn). Another issue in data structures that must be addressed is space requirement. The most space efficient data structure for string search is the suffix array, which uses 2n words and supports searches in O(nlogn + m + k). A compact representation of the position heap proposed in this thesis also takes 2n words, but can be updated in O((h(T) + j)h(T)) time, but takes O(m2+k) time for a search. The best bound known bound for updating the suffix array or the directed acyclic word graph is O(n), and they both take considerably more space. A compact representation proposed in this thesis for the augmented position heap takes 4n words, can be updated just as efficiently as the position heap, and takes O(m+k) time for a search.
dc.format.mediumborn digital
dc.format.mediumdoctoral dissertations
dc.identifierWoo_colostate_0053A_10463.pdf
dc.identifier.urihttp://hdl.handle.net/10217/48169
dc.languageEnglish
dc.language.isoeng
dc.publisherColorado State University. Libraries
dc.relation.ispartof2000-2019
dc.rightsCopyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subjectdata structure
dc.subjectstring searching
dc.subjectpattern matching
dc.subjectdynamic string matching
dc.titleA simple and dynamic data structure for pattern matching in texts
dc.typeText
dcterms.rights.dplaThis Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.disciplineComputer Science
thesis.degree.grantorColorado State University
thesis.degree.levelDoctoral
thesis.degree.nameDoctor of Philosophy (Ph.D.)

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Woo_colostate_0053A_10463.pdf
Size:
1.21 MB
Format:
Adobe Portable Document Format
Description: