Turkish Journal of Electrical Engineering and Computer Sciences

A new word-based compression model allowing compressed pattern matching

Abstract

In this study, a new semistatic data compression model is introduced that has a fast coding process and allows compressed pattern matching (CPM). The proposed model is named the tagged word-based compression algorithm (TWBCA) because both its coding and its compressed matching algorithm are word based. The model has two phases. In the first phase, a dictionary is constructed by adding phrases while respecting word boundaries; in the second phase, compression is performed using the codewords of the phrases in this dictionary. The first byte of a codeword determines whether the word is compressed or not. Because of this rule, the CPM process can be conducted on a word basis. In addition, the proposed method makes it possible to search for groups of consecutively compressed words. Any existing pattern matching algorithm can be used as a black box for compressed pattern matching. The CPM process always takes less time on TWBCA-coded texts than the same process on texts coded by the Gzip tool. When matching longer patterns, compressed pattern matching takes more time on texts coded by compress and end-tagged dense code (ETDC). However, searching for shorter patterns takes less time on texts coded by our approach than on texts compressed with compress. In terms of compression ratio, our algorithm outperforms ETDC only on a file written in Turkish. The compression performance of TWBCA is stable and does not vary by more than 6% across different text files.
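The two-phase, tagged, word-based idea described in the abstract can be illustrated with a toy sketch. This is not the TWBCA codeword layout from the paper; it only assumes, for illustration, ASCII input, a dictionary of at most 128 words, one-byte codewords whose high bit (`TAG`) marks a compressed word, and Python's `bytes.find` standing in for the black-box pattern matcher. All function names are hypothetical.

```python
import re
from collections import Counter

TAG = 0x80  # assumed layout: high bit set = one-byte dictionary codeword


def tokenize(text):
    # Split into alternating word and separator tokens (word boundaries kept).
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)


def build_dictionary(text, size=128):
    # Phase 1 (semistatic): one pass over the text to pick frequent words.
    counts = Counter(t for t in tokenize(text) if t[0].isalnum() and len(t) >= 2)
    return [word for word, _ in counts.most_common(min(size, 128))]


def encode(text, words):
    # Phase 2: dictionary words become tagged bytes, the rest stays literal ASCII.
    index = {w: i for i, w in enumerate(words)}
    out = bytearray()
    for tok in tokenize(text):
        if tok in index:
            out.append(TAG | index[tok])
        else:
            out += tok.encode("ascii")  # raises on non-ASCII input (toy limitation)
    return bytes(out)


def decode(data, words):
    # The tag bit alone tells a codeword apart from a literal byte.
    return "".join(words[b & 0x7F] if b & TAG else chr(b) for b in data)


def search(compressed, pattern, words):
    # Compressed pattern matching: encode the pattern with the same
    # dictionary, then run any byte-level matcher on the compressed data.
    return compressed.find(encode(pattern, words))
```

Because tagged bytes never occur inside literal ASCII text, a match on a codeword in this sketch always corresponds to a whole dictionary word, and a pattern made of several dictionary words is found as a run of consecutive codewords. Literal (uncompressed) words are matched as plain byte substrings, so the sketch gives no whole-word guarantee for them.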

DOI

10.3906/elk-1601-92

Keywords

Compression, pattern matching, compressed pattern matching, semistatic model

First Page

3607

Last Page

3622

Recommended Citation

BULUŞ, H. N., CARUS, A., & MESUT, A. (2017). A new word-based compression model allowing compressed pattern matching. Turkish Journal of Electrical Engineering and Computer Sciences 25 (5): 3607-3622. https://doi.org/10.3906/elk-1601-92

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons
