Integrate a DOM-Based Similarity Filter to Reduce Duplication in Web Crawling #1266

@Xianyu0day

Background and Objective
Our current crawler system often produces a high proportion of duplicate pages. Pages served under different URLs are frequently generated from the same template, so the crawler collects a large amount of redundant data. This wastes system resources and complicates subsequent data cleaning and processing.

To address this issue, I propose adding a filter based on DOM similarity to the crawler system. The filter would compare the structure and content of each fetched page against pages already collected and drop duplicates or near-duplicates, which should improve both crawler efficiency and the quality of the collected data.

Implementation Approach
Convert Web Pages to DOM Structure
Each web page can be parsed into a tree-like DOM structure. By extracting the content and position information (for example, tag path and depth) of each node, we can build a compact representation of each web page.
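
As a rough illustration of this step, the sketch below uses Python's standard `html.parser` to walk a page and record a tag path, depth, and text for each element. The names `NodeCollector` and `extract_nodes`, and the choice of a tag path plus depth as the "position information", are assumptions made for this example rather than part of the proposal.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NodeCollector(HTMLParser):
    """Collects [tag_path, depth, text] for every element in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open tags from the root to the current element
        self.index_stack = []  # index into self.nodes for each open tag
        self.nodes = []        # one [path, depth, text] entry per element

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.nodes.append([path, len(self.stack), ""])
        if tag not in VOID_TAGS:
            self.stack.append(tag)
            self.index_stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag,
            # which tolerates children that were never closed.
            while self.stack:
                self.index_stack.pop()
                if self.stack.pop() == tag:
                    break

    def handle_data(self, data):
        text = data.strip()
        if text and self.index_stack:
            # Attach text to the innermost open element.
            self.nodes[self.index_stack[-1]][2] += text


def extract_nodes(html):
    """Return a list of (tag_path, depth, text) tuples for the page."""
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return [tuple(node) for node in parser.nodes]


nodes = extract_nodes("<html><body><div><p>Hello</p><br></div></body></html>")
for node in nodes:
    print(node)
```

A production crawler would more likely reuse the DOM it already builds for extraction; the standard-library parser is used here only to keep the sketch dependency-free.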
Generate Web Page Embedding Vectors
Use a hash algorithm to encode the content of each node, and combine it with parameters such as node depth and weight to generate a fixed-length vector (embedding). This vector acts as a "fingerprint" of the web page, enabling quick comparisons of page similarity.
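
One possible way to realise this is feature hashing: each node is hashed into one of a fixed number of buckets, and a depth-dependent weight is added to that bucket. The sketch below assumes nodes arrive as `(tag_path, depth, text)` tuples; the function names, the 128-dimension default, and the `1 / (1 + depth)` weighting are illustrative choices, not something the proposal specifies.

```python
import hashlib


def bucket(feature, dim):
    """Map a feature string to a stable bucket index in [0, dim)."""
    # hashlib is used instead of built-in hash(), which is salted per process.
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % dim


def embed_page(nodes, dim=128):
    """Build a fixed-length vector from (tag_path, depth, text) tuples."""
    vec = [0.0] * dim
    for path, depth, text in nodes:
        # Assumed weighting: nodes closer to the root count more.
        weight = 1.0 / (1 + depth)
        # Structural feature: where the node sits in the tree.
        vec[bucket("tag:" + path, dim)] += weight
        # Content feature: only added when the node carries text.
        if text:
            vec[bucket("text:" + text, dim)] += weight
    return vec


example_nodes = [
    ("html", 0, ""),
    ("html/body", 1, ""),
    ("html/body/p", 2, "Hi"),
]
vector = embed_page(example_nodes)
print(len(vector), sum(vector))
```

Hashing structure and text as separate features means two pages built from the same template still share their structural buckets even when their text differs. Different features can collide in the same bucket, so the vector is a fingerprint rather than a unique identifier.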
Calculate Web Page Similarity
Use cosine similarity to compare the embedding vectors of two web pages. Pages whose similarity score meets or exceeds a configurable threshold are treated as duplicates or near-duplicates and filtered out.
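
A minimal sketch of the comparison and thresholding step follows. `SimilarityFilter`, its `is_duplicate` method, and the 0.95 default threshold are hypothetical names and values chosen for illustration; a suitable threshold would need to be tuned on real crawl data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


class SimilarityFilter:
    """Remembers vectors of accepted pages and rejects near-duplicates."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.seen = []

    def is_duplicate(self, vec):
        # Linear scan over every accepted page.
        for other in self.seen:
            if cosine_similarity(vec, other) >= self.threshold:
                return True
        self.seen.append(vec)
        return False


page_filter = SimilarityFilter(threshold=0.95)
print(page_filter.is_duplicate([1.0, 0.0, 1.0]))  # first page is kept
print(page_filter.is_duplicate([2.0, 0.0, 2.0]))  # same direction, filtered
print(page_filter.is_duplicate([0.0, 1.0, 0.0]))  # orthogonal, kept
```

The linear scan costs one comparison per previously accepted page, so a large crawl would likely need an approximate nearest-neighbour index or a per-site scope instead.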

Reduce Duplication Rate
This feature can effectively identify and filter out template-generated web pages, reducing the collection of redundant data and improving the efficiency of the crawler.

Labels: Type: Enhancement
