feat: Add AdaptivePlaywrightCrawler #872

Pijukatel · 2025-01-06T13:32:39Z

Add AdaptivePlaywrightCrawler, based on JS implementation

Issues

Purpose:

The adaptive crawler can crawl a page with either a static crawler (such as BeautifulSoupCrawler or ParselCrawler) or the browser-based PlaywrightCrawler.

The goal is to benefit from both approaches: use the faster static crawler where possible and fall back to the slower browser-based crawler where needed.
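
To make the idea concrete, here is a minimal sketch of the "static first, browser as fallback" flow. It does not use the crawlee API: `CrawlResult`, `StaticCrawlFailed`, `crawl_adaptively` and the two fake crawl functions are hypothetical names made up for illustration, and the real crawler decides the mode in a more involved way.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CrawlResult:
    """Hypothetical result container used only in this sketch."""

    url: str
    mode: str  # "static" or "browser"
    data: dict


class StaticCrawlFailed(Exception):
    """Hypothetical error: the static crawl could not produce a usable result."""


def crawl_adaptively(
    url: str,
    static_crawl: Callable[[str], dict],
    browser_crawl: Callable[[str], dict],
) -> CrawlResult:
    """Try the cheap static crawl first and fall back to the browser crawl."""
    try:
        return CrawlResult(url, "static", static_crawl(url))
    except StaticCrawlFailed:
        return CrawlResult(url, "browser", browser_crawl(url))


def fake_static_crawl(url: str) -> dict:
    # Assumption for the sketch: pages under /js-only need a browser.
    if "js-only" in url:
        raise StaticCrawlFailed(url)
    return {"title": "static page"}


def fake_browser_crawl(url: str) -> dict:
    return {"title": "rendered page"}


plain = crawl_adaptively("https://example.com/plain", fake_static_crawl, fake_browser_crawl)
dynamic = crawl_adaptively("https://example.com/js-only", fake_static_crawl, fake_browser_crawl)
print(plain.mode, dynamic.mode)
```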

Implementation

The Python implementation differs from the JS implementation mainly in how the adaptive crawler performs the browser crawling and the static crawling.

The JS implementation explicitly defines the code used for static/browser crawling. This makes it simpler and easier to understand, but it duplicates code that already exists in other crawlers and drops some features that those crawlers have, such as pre-navigation hooks.

The Python implementation does not explicitly define code for static/browser crawling. Instead, it delegates this task to "sub crawlers" (an instance of an AbstractHttpCrawler implementation for static crawling and an instance of PlaywrightCrawler for browser crawling).
This avoids code duplication and supports more features and more control over the static/browser crawling, but it adds the complexity of connecting all the components together.

The "sub crawlers" are used directly only in the __init__ method of the top crawler, to create their context pipelines. The top crawler then uses these pipelines to perform single-request crawls when needed. This approach reuses existing code from other crawlers and fully preserves their context pipeline handling.

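The composition described above could be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `FakeStaticCrawler`, `FakeBrowserCrawler`, `build_pipeline`, `TopCrawler` and `crawl_one` are hypothetical names, and a "pipeline" here is just an async callable.

```python
import asyncio
from typing import Awaitable, Callable

# In this sketch a "context pipeline" is simply an async callable taking a URL.
Pipeline = Callable[[str], Awaitable[dict]]


class FakeStaticCrawler:
    """Hypothetical stand-in for a static sub crawler."""

    def build_pipeline(self) -> Pipeline:
        async def pipeline(url: str) -> dict:
            return {"url": url, "via": "static"}

        return pipeline


class FakeBrowserCrawler:
    """Hypothetical stand-in for a browser sub crawler."""

    def build_pipeline(self) -> Pipeline:
        async def pipeline(url: str) -> dict:
            return {"url": url, "via": "browser"}

        return pipeline


class TopCrawler:
    def __init__(self, static_crawler: FakeStaticCrawler, browser_crawler: FakeBrowserCrawler) -> None:
        # The sub crawlers are touched only here; afterwards the top crawler
        # works with the pipelines they produced.
        self._static_pipeline = static_crawler.build_pipeline()
        self._browser_pipeline = browser_crawler.build_pipeline()

    async def crawl_one(self, url: str, *, use_browser: bool) -> dict:
        """Run a single request through one of the stored pipelines."""
        pipeline = self._browser_pipeline if use_browser else self._static_pipeline
        return await pipeline(url)


top = TopCrawler(FakeStaticCrawler(), FakeBrowserCrawler())
static_result = asyncio.run(top.crawl_one("https://example.com", use_browser=False))
browser_result = asyncio.run(top.crawl_one("https://example.com", use_browser=True))
print(static_result["via"], browser_result["via"])
```
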
Crawling contexts
Since AdaptivePlaywrightCrawler can crawl a page in two different ways, the contexts can also look different. I chose an approach with a single unified context for both crawl types (static/browser). This unified context contains all the attributes found on both the static context and the browser context, but browser-based attributes raise AdaptiveContextError (a subclass of RuntimeError) when accessed during static crawling. It is impossible to know in advance which crawling mode will be used, so users have to use the try/except pattern when working with these unified contexts.

For user-defined pre-navigation hooks there is AdaptivePlaywrightPreNavCrawlingContext (very nice name!), whose page property can raise AdaptiveContextError.

For the user-defined request handler there is AdaptivePlaywrightCrawlingContext, whose page, infinite_scroll and response properties (all copied from PlaywrightCrawlingContext) can raise AdaptiveContextError.
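
A minimal sketch of the try/except pattern this implies for handler code. The classes below are stand-ins written for this example, not the ones added in this PR: `SketchAdaptiveContext` and `describe` are hypothetical, and `AdaptiveContextError` is redefined locally only so the snippet runs on its own.

```python
class AdaptiveContextError(RuntimeError):
    """Local stand-in for the error raised by browser-only attributes."""


class SketchAdaptiveContext:
    """Hypothetical unified context: `page` exists only in browser mode."""

    def __init__(self, url: str, page: object = None) -> None:
        self.url = url
        self._page = page

    @property
    def page(self) -> object:
        if self._page is None:
            raise AdaptiveContextError("page is not available during static crawling")
        return self._page


def describe(context: SketchAdaptiveContext) -> str:
    """Handler-style code that works for both crawl modes."""
    try:
        page = context.page
    except AdaptiveContextError:
        # Static crawl: only use attributes shared by both modes.
        return f"{context.url}: static crawl, no page"
    return f"{context.url}: browser crawl, page={page}"


static_msg = describe(SketchAdaptiveContext("https://example.com/a"))
browser_msg = describe(SketchAdaptiveContext("https://example.com/b", page="<fake page>"))
print(static_msg)
print(browser_msg)
```

Because the error derives from RuntimeError, an `except RuntimeError` clause would catch it as well, but catching the specific error avoids hiding unrelated failures.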

Missing in this PR

Documentation
Documentation, examples and a template will be added as the last step in follow-up PRs. The feature should be documented only once it is ready to be used; updating the documentation too early would lead users to try an incomplete feature.

Rendering type predictor
The JS implementation uses a predictor based on statistical regression to decide which crawl type should be used. This predictor is not implemented in this PR and is replaced by a dummy version, RandomRenderingTypePredictor.
A proper predictor will be implemented in a follow-up PR.
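
For illustration, a dummy predictor of this kind might look roughly like the sketch below. The class name `SketchRandomPredictor`, its `predict`/`store_result` methods and the `"static"`/`"browser"` labels are assumptions made for this example and may not match RandomRenderingTypePredictor in the PR.

```python
import random
from typing import Optional


class SketchRandomPredictor:
    """Hypothetical placeholder predictor: picks a crawl type at random."""

    def __init__(self, seed: Optional[int] = None) -> None:
        # A private Random instance keeps the sketch reproducible when seeded.
        self._rng = random.Random(seed)

    def predict(self, url: str) -> str:
        """Return "static" or "browser"; the URL is ignored on purpose."""
        return self._rng.choice(("static", "browser"))

    def store_result(self, url: str, rendering_type: str) -> None:
        """A random predictor learns nothing, so feedback is discarded."""


urls = [f"https://example.com/{i}" for i in range(5)]
first = SketchRandomPredictor(seed=1)
second = SketchRandomPredictor(seed=1)
first_predictions = [first.predict(url) for url in urls]
second_predictions = [second.predict(url) for url in urls]
print(first_predictions)
```

A real predictor would use `store_result` to learn from observed outcomes, which is the part left for the follow-up PR.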

Helper methods on adaptive context
The JS implementation adds several helper methods on the adaptive context. Since the adaptive crawler in JS is not generic with respect to static crawling, they are straightforward to implement with the only option, Cheerio.
Doing the same in Python would require more handling of generics: it would basically mean wrapping some sort of generic "select" method from whatever parser is used, and its return value would have to be another generic parameter.
This change is standalone, so it is better to do it separately.
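
A rough sketch of the typing burden mentioned above, using a toy parser. Everything here is hypothetical (`Parser`, `WordParser`, `SketchHelpers`, `query_selector_all`); it only shows why the selected element type ends up as a second generic parameter.

```python
from typing import Generic, Protocol, Sequence, TypeVar

TParsed = TypeVar("TParsed")
TSelected_co = TypeVar("TSelected_co", covariant=True)


class Parser(Protocol[TParsed, TSelected_co]):
    """Hypothetical parser interface: parse a body, then select from it."""

    def parse(self, body: str) -> TParsed: ...

    def select(self, parsed: TParsed, selector: str) -> Sequence[TSelected_co]: ...


class WordParser:
    """Toy parser: the parsed form is a list of words, a selector is a prefix."""

    def parse(self, body: str) -> list[str]:
        return body.split()

    def select(self, parsed: list[str], selector: str) -> Sequence[str]:
        return [word for word in parsed if word.startswith(selector)]


class SketchHelpers(Generic[TParsed, TSelected_co]):
    """Helper wrapper whose return type depends on the parser in use."""

    def __init__(self, parser: Parser[TParsed, TSelected_co], body: str) -> None:
        self._parser = parser
        self._parsed = parser.parse(body)

    def query_selector_all(self, selector: str) -> Sequence[TSelected_co]:
        return self._parser.select(self._parsed, selector)


helpers = SketchHelpers(WordParser(), "alpha beta apple")
matches = helpers.query_selector_all("a")
print(matches)
```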

Parity changes to JS code

Some improvements could be ported back to the JS implementation. These improvements and issues are summarized in apify/crawlee#2798.

janbuchar

LGTM.

Pijukatel · 2025-01-27T15:22:21Z

I have to properly integrate the block_requests changes from master.

src/crawlee/crawlers/_basic/_basic_crawler.py

src/crawlee/crawlers/_adaptive_playwright/__init__.py

src/crawlee/crawlers/_basic/_basic_crawler.py

docs/examples/code/adaptive_playwright_crawler.py

src/crawlee/crawlers/__init__.py

src/crawlee/crawlers/_adaptive_playwright/_result_comparator.py

src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py

src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py

src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py

vdusek

Good job Pepa, and thanks for your patience 👀.

Pijukatel added 17 commits December 24, 2024 09:38

WIP

d8b2438

More feasible version of composition.

c12acda

Add method to BasicCrawler to handle just one request.

Pass properly all kwargs to subcrawlers

623d341

Add run decision logic from JS version

548349e

Add statistics TODO: Make mypy happy about statistics. Wrap existing statistics from init in adaptive statistics. Silent subcrawler statistics and loggers in general. (Set level to error?)

Statistics class change led to too many ripple changes.

29d510a

Ignore and create TODO follow up issue for refactoring Statistics class after technical discussion.

Handle sub crawlers loggers

04eefd9

Handle use state.

Statistics change to be usable without ignores.

33efed3

Pre-navigation hooks delegation to sub crawler hooks.

Align use_state with JS implementation.

202aceb

Remove "fake generics" from Statistics.

d474a94

Statistics were marked as generics, but in reality were not. Hardcoding state_model to make it explicit and clear.

Align result comparator with JS implementation.

e95ff19

Add doc strings.

2408d85

WIP KVS handling. Currently it does not go through Result handler.

use_state through RequestHandlerRunResult

d345259

Revert "use_state through RequestHandlerRunResult"

bfa9290

This reverts commit d345259.

Add basic delegation test.

63f278a

Add context test.

e190788

Add tests for use_state and predictor.

0ecb137

Remove unintended edit.

b73c702

Pijukatel requested a review from janbuchar January 6, 2025 13:32

github-actions bot assigned Pijukatel Jan 6, 2025

github-actions bot added this to the 105th sprint - Tooling team milestone Jan 6, 2025

github-actions bot added the t-tooling and tested labels Jan 6, 2025

Pijukatel added 3 commits January 7, 2025 09:34

Add tests for statistics.

5c79d0d

Fix wrong id for predictor_state persistence.

Add test for error handling and committing correct results.

957915a

Add test for pre nav hook Add test for statistics in crawler init

Add crawl_one_required_contexts property. (Alternative to accessing i…

f12f605

…nternals of sub crawlers) Cleanup commit results.

Pijukatel force-pushed the adaptive-PwCrawler branch from 340c53d to f12f605 January 7, 2025 10:35

Lint

5256af2

janbuchar reviewed Jan 7, 2025

Pijukatel added 2 commits January 8, 2025 09:18

Remove BasicCrawler modifications.

2fd7aae

Add it in adaptive crawler instead at the cost of accessing many private members.

Make _commit_result consistent with how other result components are h…

714b5bd

…andled.

Pijukatel added 4 commits January 23, 2025 08:44

Review comments

949c4ff

Add optional argument to pre navigation hook decorator

56ad33a

Remove _push_result_to_context and add result argument/return to _run…

781d5ff

…_request_handler.

Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler

a93d6a1

janbuchar self-requested a review January 27, 2025 15:00

janbuchar approved these changes Jan 27, 2025

Pijukatel added 2 commits January 28, 2025 10:11

Add block_request to adaptive pre nav context

b4ba31b

Use context result map for handling request handler results

8bce425

Pijukatel force-pushed the adaptive-PwCrawler branch from baa2052 to 8bce425 January 29, 2025 09:26

janbuchar reviewed Jan 29, 2025

src/crawlee/crawlers/_basic/_basic_crawler.py (outdated, resolved)

src/crawlee/crawlers/_basic/_basic_crawler.py (resolved)

Pijukatel added 3 commits January 30, 2025 09:00

Review comments based comments

5f8c26c

Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler

e1d0c7e

Integrate RenderingTypePredictor

aacb90a

janbuchar reviewed Jan 30, 2025

src/crawlee/crawlers/_adaptive_playwright/__init__.py (resolved)

janbuchar reviewed Jan 30, 2025

src/crawlee/crawlers/_basic/_basic_crawler.py (outdated, resolved)

Pijukatel and others added 2 commits January 30, 2025 13:09

Update src/crawlee/crawlers/_basic/_basic_crawler.py

ab1c40e

Co-authored-by: Jan Buchar <[email protected]>

Finalize exports and re exports

2bf43f6

janbuchar self-requested a review January 30, 2025 13:49

janbuchar approved these changes Jan 30, 2025

janbuchar requested a review from vdusek January 30, 2025 14:11

vdusek modified the milestones: 105th sprint - Tooling team, 107th sprint - Tooling team Feb 4, 2025

vdusek reviewed Feb 4, 2025

Pijukatel added 3 commits February 5, 2025 14:08

Review comments

5a6d07c

Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler

e20aa49

Review comments

1f48d53

vdusek approved these changes Feb 6, 2025

Pijukatel merged commit 5ba70b6 into master Feb 7, 2025
23 checks passed

Pijukatel deleted the adaptive-PwCrawler branch February 7, 2025 07:03
