Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add AdaptivePlaywrightCrawler #872

Open
wants to merge 68 commits into
base: master
Choose a base branch
from
Open

Conversation

Pijukatel
Copy link
Contributor

@Pijukatel Pijukatel commented Jan 6, 2025

Add AdaptivePlaywrightCrawler, based on JS implementation

Issues

#249

Purpose:

Adaptive crawler can choose to crawl page with either static crawler(like BeautifulSoupCrawler or ParselCrawler) or browser-based PlaywrightCrawler.

This is done to theoretically benefit from both approaches - use faster static crawler where it is possible and fallback to slower browser-based crawler where needed.

Implementation

Python implementation differs from JS implementation mainly in the way adaptive crawler performs the browser crawling or the static crawling.

JS implementation explicitly defines the code that is used for static/browser crawling - this makes it more simple and easier to understand, but it is some code duplication(code that already exists in other crawlers) and also it does drop some features that exist on those crawlers, like pre-navigation hooks.

Python implementation does not define explicitly code for static/browser crawling. It instead delegates this task to "sub crawlers" (instance of implementation of AbstractHttpCrawler for static crawling and instance of PlaywrightCrawler.)
This avoids code duplication and supports more features and control over the static/browser crawling, but it adds complexity of connecting all the components together.

"Sub crawlers" are directly used only in __init__ method of top crawler to create their context pipelines. These are then used by top crawler to perform single request crawls when needed. This approach re-uses existing code from other crawlers and fully preserve their context pipeline handling.

Crawling contexts
Since AdaptivePlaywrightCrawler can crawl page in two different manners the contexts can also look differently. I selected approach where there is unified context for both crawl types(static/browser) and this unified context contains all the attributes that are found both on static context and browser context, but browser based attributes will raise AdaptiveContextError(RuntimeError) when accessed during static crawling. It is impossible to know in advance which mode of crawling will be used, so this forces users to use try-catch pattern when working with these unified contexts.

For user defined pre-navigation hooks there is AdaptivePlaywrightPreNavCrawlingContext (very nice name!) that has page property that can raise AdaptiveContextError.

For user defined request handler there is AdaptivePlaywrightCrawlingContext that has page, infinite_scroll and response properties (all copied from PlaywrightCrawlingContext) that can raise AdaptiveContextError.

Missing in this PR

Documentation
Documentation, examples and template will be added as last step in follow up PRs. The reason is that this feature should be documented only when it is in state that is ready to be used by users. Updating documentation too early would lead to users to try to use incomplete feature.

Rendering type predictor
JS implementation uses statistical regression based predictor to decide what crawl type should be used. This predictor is not implemented in this PR and is replaced by dummy version - RandomRenderingTypePredictor.
Proper predictor will be implemented in follow up PR.

Helper methods on adaptive context
JS implementation adds several helper methods on adaptive context. Since adaptive crawler in JS is not generic with respect to static crawling, it is straight forward to implement it with the only option Cheerio.
If same was to be done in Python it would require more handling of generics as it would basically require to wrap some sort of generic "select" method from whatever generic parser is used and it's return value would have to be another generic parameter.
This change is standalone and it is better to do it separately.

Parity changes to JS code

Some improvements could be ported back to JS implementation. These are improvements and issues are summarized in apify/crawlee#2798

Add method to BasicCrawler to handle just one request.
Add statistics

TODO:
Make mypy happy about statistics.
Wrap existing statistics from init in adaptive statistics.
Silent subcrawler statistics and loggers in general. (Set level to error?)
Ignore and create TODO follow up issue for refactoring Statistics class after technical discussion.
Handle use state.
Pre-navigation hooks delegation to sub crawler hooks.
Statistics were marked as generics, but in reality were not.
Hardcoding state_model to make it explicit and clear.
WIP KVS handling. Currently it does not go through Result handler.
@Pijukatel Pijukatel requested a review from janbuchar January 6, 2025 13:32
@github-actions github-actions bot added this to the 105th sprint - Tooling team milestone Jan 6, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jan 6, 2025
Fix wrong id for predictor_state persistence.
Add test for pre nav hook
Add test for statistics in crawler init
…nternals of sub crawlers)

Cleanup commit results.
Add it in adaptive crawler instead at the cost of accessing many private members.
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I finally managed to read this one through again 🙂 It needs some cleaning up, but everything important looks correct - nice job!

@janbuchar janbuchar self-requested a review January 27, 2025 15:00
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Pijukatel
Copy link
Contributor Author

I have to integrate properly block_requests changes from master

@janbuchar janbuchar self-requested a review January 30, 2025 13:49
@janbuchar janbuchar requested a review from vdusek January 30, 2025 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants