DOI

10.3906/elk-1410-124

Abstract

Various scholarly works have pointed out that placing a preprocessor in front of a standard postcompressor helps achieve higher gains when compressing natural-language text files, and much research has since been devoted to preprocessors that improve the gain attained by such concatenated systems. With the same goal in mind, our paper proposes a new word-based preprocessor named METEHAN190 (M190) and compares its performance with that of four other state-of-the-art preprocessors. The experiments used source files from the Wall Street Journal (WSJ) archive and from the Calgary, Canterbury, Gutenberg, and Pizza and Chili corpora. The postcompressors adopted were the Prediction by Partial Matching compressor using method D (PPMD) and the Monstrous PPM II compressor (PPMonstr). In all three experiments, WRT and M190 achieved the two highest compression gains. For small text and transcription files from the Calgary corpus, M190 outperformed all other preprocessors, including WRT. On the other hand, average encoding and decoding times show that the semistatic byte-oriented methods are much faster than the static dictionary-based methods that encode words with characters.
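As a reading aid only: the abstract does not describe how M190 lays out its codewords, so the sketch below merely illustrates what radix-190 positional numbering of a dictionary index means. The function names `to_radix190` and `from_radix190` are hypothetical, and the mapping of digit values to actual output characters is deliberately left out because it is not stated here.

```python
RADIX = 190  # radix taken from the paper's title; everything else is an assumption


def to_radix190(index: int) -> list[int]:
    """Return the base-190 digits of a non-negative index, most significant first."""
    if index < 0:
        raise ValueError("index must be non-negative")
    digits = [index % RADIX]
    index //= RADIX
    while index > 0:
        digits.append(index % RADIX)
        index //= RADIX
    return digits[::-1]


def from_radix190(digits: list[int]) -> int:
    """Inverse of to_radix190: rebuild the index from its base-190 digits."""
    value = 0
    for d in digits:
        if not 0 <= d < RADIX:
            raise ValueError("digit out of range for radix 190")
        value = value * RADIX + d
    return value
```

Under this illustration, two digits cover 190² = 36,100 distinct indices and three digits cover 190³ = 6,859,000, which is why a large radix keeps the codewords for dictionary words short.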

Keywords

Lossless text compression, preprocessing, postcompressor, dictionary, semistatic byte-oriented preprocessors, METEHAN 190

First Page

4465

Last Page

4480

Recommended Citation

ŞENERGİN, METE ERAY and İNCE, ERHAN ALİRİZA (2016) "A new dictionary-based preprocessor that uses radix-190 numbering," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 24: No. 5, Article 84. https://doi.org/10.3906/elk-1410-124
Available at: https://journals.tubitak.gov.tr/elektrik/vol24/iss5/84

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons

Turkish Journal of Electrical Engineering and Computer Sciences

A new dictionary-based preprocessor that uses radix-190 numbering
