Logic problem with extractLinks in annotation mode #87

Open
@ccliangbo

ModelPageProcessor contains the following code:

    public void process(Page page) {
        for (PageModelExtractor pageModelExtractor : pageModelExtractorList) {
            extractLinks(page, pageModelExtractor.getHelpUrlRegionSelector(), pageModelExtractor.getHelpUrlPatterns());
            extractLinks(page, pageModelExtractor.getTargetUrlRegionSelector(), pageModelExtractor.getTargetUrlPatterns());
            Object process = pageModelExtractor.process(page);
            if (process == null || (process instanceof List && ((List) process).size() == 0)) {
                continue;
            }
            postProcessPageModel(pageModelExtractor.getClazz(), process);
            page.putField(pageModelExtractor.getClazz().getCanonicalName(), process);
        }
        if (page.getResultItems().getAll().size() == 0) {
            page.getResultItems().setSkip(true);
        }
    }

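With the current logic, every page has the TargetUrl and HelpUrl links of all pageModelExtractors extracted from it, not just those of the extractor whose model matches the page. This makes the RegionSelector meaningless and makes it hard to control the crawler precisely. For example:
Suppose page A matches ModelA's TargetUrl, and ModelA specifies that URLs matching ModelC are extracted only from RegionA. URLs elsewhere on page A should not be extracted.
Page B matches ModelB's TargetUrl, and ModelB specifies that URLs matching ModelC are extracted from the whole of page B.
So when the page for A is passed into this function, pageModelExtractorA adds only the URLs inside RegionA to the queue, but page A is also processed by pageModelExtractorB, which adds the matching URLs from the whole page to the queue. The precise control that specifying a Region is meant to provide is therefore lost.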
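To make the scenario concrete, here is a minimal, self-contained sketch. It does not use WebMagic classes: `SimplePage`, `SimpleExtractor`, the region names and the example.com URLs are all hypothetical stand-ins. `extractWithAll` mimics the loop in `process` above, and `extractWithMatching` shows one possible guard (an assumption, not a confirmed fix) that applies an extractor's link rules only when the page URL matches that extractor's own page pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Simplified, hypothetical model of the scenario; none of these types are WebMagic classes.
class RegionScopeDemo {

    // A page reduced to its URL plus the links found in each named region.
    static class SimplePage {
        final String url;
        final Map<String, List<String>> linksByRegion;

        SimplePage(String url, Map<String, List<String>> linksByRegion) {
            this.url = url;
            this.linksByRegion = linksByRegion;
        }

        // A null region means "search the whole page".
        List<String> links(String region) {
            if (region != null) {
                List<String> found = linksByRegion.get(region);
                return found != null ? found : Collections.<String>emptyList();
            }
            List<String> all = new ArrayList<>();
            for (List<String> links : linksByRegion.values()) {
                all.addAll(links);
            }
            return all;
        }
    }

    // An extractor reduced to: which pages it belongs to, where it looks, what it queues.
    static class SimpleExtractor {
        final Pattern pagePattern;
        final String region;
        final Pattern linkPattern;

        SimpleExtractor(String pageRegex, String region, String linkRegex) {
            this.pagePattern = Pattern.compile(pageRegex);
            this.region = region;
            this.linkPattern = Pattern.compile(linkRegex);
        }
    }

    // Mimics the current loop: every extractor's link rules run on every page.
    static Set<String> extractWithAll(SimplePage page, List<SimpleExtractor> extractors) {
        Set<String> queued = new LinkedHashSet<>();
        for (SimpleExtractor e : extractors) {
            collect(page, e, queued);
        }
        return queued;
    }

    // Possible guard (assumption): skip extractors whose page pattern does not match this page.
    static Set<String> extractWithMatching(SimplePage page, List<SimpleExtractor> extractors) {
        Set<String> queued = new LinkedHashSet<>();
        for (SimpleExtractor e : extractors) {
            if (e.pagePattern.matcher(page.url).matches()) {
                collect(page, e, queued);
            }
        }
        return queued;
    }

    private static void collect(SimplePage page, SimpleExtractor e, Set<String> queued) {
        for (String link : page.links(e.region)) {
            if (e.linkPattern.matcher(link).matches()) {
                queued.add(link);
            }
        }
    }

    // Page A: one ModelC link inside regionA, another one in the sidebar.
    static SimplePage samplePageA() {
        Map<String, List<String>> regions = new LinkedHashMap<>();
        regions.put("regionA", Arrays.asList("http://example.com/c/1"));
        regions.put("sidebar", Arrays.asList("http://example.com/c/2"));
        return new SimplePage("http://example.com/a/1", regions);
    }

    // Extractor A is limited to regionA; extractor B searches the whole page.
    static List<SimpleExtractor> sampleExtractors() {
        String modelC = "http://example\\.com/c/\\d+";
        return Arrays.asList(
                new SimpleExtractor("http://example\\.com/a/\\d+", "regionA", modelC),
                new SimpleExtractor("http://example\\.com/b/\\d+", null, modelC));
    }

    public static void main(String[] args) {
        System.out.println(extractWithAll(samplePageA(), sampleExtractors()));
        System.out.println(extractWithMatching(samplePageA(), sampleExtractors()));
    }
}
```

With the sample data, `extractWithAll` queues both `/c/1` and `/c/2`, because extractor B's whole-page rule also runs on page A, while `extractWithMatching` queues only `/c/1` from regionA. Whether such a guard suits the real ModelPageProcessor depends on how help pages, which match no model's TargetUrl, are supposed to contribute links, so treat it as a starting point for discussion.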
