DOI

10.3906/elk-2004-67

Abstract

Cascading Style Sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. To extract data from a web page with these patterns, an HTML parser first constructs a document object model (DOM) tree for the page. Building this tree and extracting data from it incur time and memory costs that grow with the number of HTML elements and their hierarchies. Regular expressions can be considered as a way to reduce these costs; however, preparing regular expression patterns is a laborious task. In this study, a heuristic approach, namely Regex Generator (REGEXN), that automatically generates these patterns from CSS selectors is introduced, and the performance gains are analyzed on a web crawler. The analysis shows that the regular expression patterns generated by this approach can significantly reduce the average extraction time, from 743.31 ms to 1.03 ms, compared with extraction from a DOM tree. Similarly, the average memory usage drops from 1054.01 B to 1.59 B. Moreover, REGEXN can be easily adapted to existing frameworks and tools for this task.
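To make the idea in the abstract concrete, the following is a minimal sketch of turning a CSS selector into a regular expression that is applied directly to the HTML text, without building a DOM tree. It is not the REGEXN algorithm: the function name `selector_to_regex` is hypothetical, and the sketch assumes only three selector forms (`tag`, `tag.class`, `tag#id`), quoted attribute values, and target elements that do not contain a nested element of the same tag.

```python
import re


def selector_to_regex(selector: str) -> re.Pattern:
    """Translate a simple CSS selector into a compiled regex.

    Hypothetical helper for illustration only; supports `tag`,
    `tag.class`, and `tag#id`, and is not the REGEXN heuristic.
    """
    m = re.fullmatch(r"([a-zA-Z][a-zA-Z0-9]*)(?:([.#])([\w-]+))?", selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, kind, name = m.groups()

    if kind is None:
        attr = ""  # bare tag selector: no attribute constraint
    elif kind == "#":
        # id must equal the given name exactly
        attr = rf'[^>]*\bid\s*=\s*["\']{re.escape(name)}["\']'
    else:
        # class attribute must contain the name as a whole word
        # (assumption: word boundaries are a good enough token test)
        attr = rf'[^>]*\bclass\s*=\s*["\'][^"\']*\b{re.escape(name)}\b[^"\']*["\']'

    # Capture the inner HTML lazily up to the first matching close tag.
    return re.compile(rf"<{tag}\b{attr}[^>]*>(.*?)</{tag}>",
                      re.DOTALL | re.IGNORECASE)


html = ('<div><h1 class="title main">Hello</h1>'
        '<p id="intro">First <b>para</b></p><p>Second</p></div>')

titles = selector_to_regex("h1.title").findall(html)
intro = selector_to_regex("p#intro").findall(html)
paragraphs = selector_to_regex("p").findall(html)
```

Because the lazy match stops at the first closing tag, a pattern like this returns truncated content for elements that nest the same tag (for example a `div` inside a `div`); handling such cases is where a heuristic generator would have to do more work than this sketch does.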

Keywords

Web data extraction, computational efficiency, regular expressions, heuristic algorithms

First Page

3389

Last Page

3401

Recommended Citation

UZUN, ERDİNÇ (2020) "A regular expression generator based on CSS selectors for efficient extraction from HTML pages," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 28: No. 6, Article 20. https://doi.org/10.3906/elk-2004-67
Available at: https://journals.tubitak.gov.tr/elektrik/vol28/iss6/20

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons

Turkish Journal of Electrical Engineering and Computer Sciences

A regular expression generator based on CSS selectors for efficient extraction from HTML pages
