
feat: Add AdaptivePlaywrightCrawler #872

Open · wants to merge 68 commits into base: master

Changes from 57 commits
d8b2438
WIP
Pijukatel Dec 24, 2024
c12acda
More feasible version of composition.
Pijukatel Dec 31, 2024
623d341
Pass properly all kwargs to subcrawlers
Pijukatel Dec 31, 2024
548349e
Add run decision logic from JS version
Pijukatel Jan 1, 2025
29d510a
Statistics class change led to too many ripple changes.
Pijukatel Jan 1, 2025
04eefd9
Handle sub crawlers loggers
Pijukatel Jan 1, 2025
33efed3
Statistics change to be usable without ignores.
Pijukatel Jan 2, 2025
202aceb
Align use_state with JS implementation.
Pijukatel Jan 2, 2025
d474a94
Remove "fake generics" from Statistics.
Pijukatel Jan 2, 2025
e95ff19
Align result comparator with JS implementation.
Pijukatel Jan 2, 2025
2408d85
Add doc strings.
Pijukatel Jan 2, 2025
d345259
use_state through RequestHandlerRunResult
Pijukatel Jan 3, 2025
bfa9290
Revert "use_state through RequestHandlerRunResult"
Pijukatel Jan 6, 2025
63f278a
Add basic delegation test.
Pijukatel Jan 6, 2025
e190788
Add context test.
Pijukatel Jan 6, 2025
0ecb137
Add tests for use_state and predictor.
Pijukatel Jan 6, 2025
b73c702
Remove unintended edit.
Pijukatel Jan 6, 2025
5c79d0d
Add tests for statistics.
Pijukatel Jan 7, 2025
957915a
Add test for error handling and committing correct results.
Pijukatel Jan 7, 2025
f12f605
Add crawl_one_required_contexts property. (Alternative to accessing i…
Pijukatel Jan 7, 2025
5256af2
Lint
Pijukatel Jan 7, 2025
2fd7aae
Remove BasicCrawler modifications.
Pijukatel Jan 8, 2025
714b5bd
Make _commit_result consistent with how other result components are h…
Pijukatel Jan 8, 2025
b38dda1
Remove subcrawlers and add _OrphanPipeline
Pijukatel Jan 8, 2025
ffb2a78
Use dummy statistics in subcrawlers.
Pijukatel Jan 8, 2025
3b05228
Keep predictor related functions on predictor_state
Pijukatel Jan 8, 2025
dc06490
Unify pre-nav hooks.
Pijukatel Jan 8, 2025
0766b7a
Simplify pre-nav hook common context.
Pijukatel Jan 9, 2025
4a63c2c
Make static crawling part of AdaptiveCrawler generic.
Pijukatel Jan 9, 2025
bd72a84
Update tests to remove bs references.
Pijukatel Jan 9, 2025
c964d44
Revert accidental Lint edits to website/*.py
Pijukatel Jan 10, 2025
a471395
Review comments.
Pijukatel Jan 14, 2025
dbf6310
Sub crawler timeout handling + test
Pijukatel Jan 15, 2025
8ee8f99
Simplify prenav hooks
janbuchar Jan 15, 2025
c291d3e
Simplify context manager handling
janbuchar Jan 15, 2025
2c23238
Review comments - _run_request_handler + timeouts
Pijukatel Jan 15, 2025
b2a29c1
Statistics.
Pijukatel Jan 15, 2025
0786f87
Make statistics generic again!
Pijukatel Jan 15, 2025
4f7d7f4
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 15, 2025
1234ea7
Mock requests in tests.
Pijukatel Jan 16, 2025
e85ec9f
Improve error readability.
Pijukatel Jan 17, 2025
6e8635a
Mock both static and browser requests in tests.
Pijukatel Jan 17, 2025
0e9146a
Create proper example code
Pijukatel Jan 17, 2025
a4eac8f
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 17, 2025
44ad898
Relax timeout in test to avoid flakiness in CI
Pijukatel Jan 20, 2025
f219453
Remove AdaptivePlaywrightCrawlerStatistics
Pijukatel Jan 21, 2025
64d9e54
WIP
Pijukatel Jan 21, 2025
08fc81f
Update options typed dicts
Pijukatel Jan 21, 2025
9bde9dc
Add docstrings to adaptive context public stuff
Pijukatel Jan 21, 2025
9a14569
Make crawl_one_with private.
Pijukatel Jan 21, 2025
fd8dd82
Update tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwri…
Pijukatel Jan 22, 2025
4221219
Review comments
Pijukatel Jan 22, 2025
565d36b
Remove _run_subcrawler_pipeline
Pijukatel Jan 22, 2025
4bd8251
Remove Orphans
Pijukatel Jan 22, 2025
a3ab2e8
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 22, 2025
fc95132
Move SubCrawlerRun to where it is used
Pijukatel Jan 22, 2025
4f316b5
Use custom _TestInput dataclass
Pijukatel Jan 22, 2025
949c4ff
Review comments
Pijukatel Jan 23, 2025
56ad33a
Add optional argument to pre navigation hook decorator
Pijukatel Jan 27, 2025
781d5ff
Remove _push_result_to_context and add result argument/return to _run…
Pijukatel Jan 27, 2025
a93d6a1
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 27, 2025
b4ba31b
Add `block_request` to adaptive pre nav context
Pijukatel Jan 28, 2025
8bce425
Use context result map for handling request handler results
Pijukatel Jan 29, 2025
5f8c26c
Review comments based comments
Pijukatel Jan 30, 2025
e1d0c7e
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 30, 2025
aacb90a
Integrate RenderingTypePredictor
Pijukatel Jan 30, 2025
ab1c40e
Update src/crawlee/crawlers/_basic/_basic_crawler.py
Pijukatel Jan 30, 2025
2bf43f6
Finalize exports and re exports
Pijukatel Jan 30, 2025
55 changes: 55 additions & 0 deletions docs/examples/code/adaptive_playwright_crawler.py
@@ -0,0 +1,55 @@
import asyncio

from playwright.async_api import Route

from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import (
    AdaptiveContextError,
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'headless': False}
    )

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Code that will be executed in both crawl types.
        context.log.info(f'User handler processing: {context.request.url} ...')

        try:
            # Code that will be executed only in a Playwright crawl.
            # Trying to access `context.page` in a static crawl will throw `AdaptiveContextError`.
            some_locator = context.page.locator('div').first

            await some_locator.wait_for()
            # Do stuff with the locator...
            context.log.info(f'Playwright processing of: {context.request.url} ...')
        except AdaptiveContextError:
            # Code that will be executed only in a static crawl.
            context.log.info(f'Static processing of: {context.request.url} ...')
Collaborator:
This is technically correct, but not the way it is supposed to be used. Here's what I'd do:

  • in the default handler, just use the parsed content and enqueue_links
    • when we have the new context helpers, we'll use those here
  • add a new handler to the router that uses the page and add a comment that the crawler will take care of this
  • do not catch the AdaptiveContextError - that's an implementation detail and we shouldn't make it look like this is something you should do
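The handler layout suggested in the bullets above can be modeled with a small, self-contained sketch. The `Router` class and the dict-based contexts below are illustrative stand-ins, not crawlee's real API; they only show the shape of "default handler for parsed content, separate labeled handler for page work".

```python
# Simplified model of a router with a default handler plus a labeled handler.
# All names here are illustrative, not crawlee's actual API.
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Dict, Optional

Handler = Callable[[dict], Awaitable[None]]


class Router:
    """Maps request labels to handlers; the `None` key is the default handler."""

    def __init__(self) -> None:
        self._handlers: Dict[Optional[str], Handler] = {}

    def default_handler(self, fn: Handler) -> Handler:
        self._handlers[None] = fn
        return fn

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        def register(fn: Handler) -> Handler:
            self._handlers[label] = fn
            return fn

        return register

    async def route(self, context: dict) -> None:
        # Fall back to the default handler when the label is unknown/missing.
        handler = self._handlers.get(context.get('label'), self._handlers[None])
        await handler(context)


router = Router()
seen = []


@router.default_handler
async def default(context: dict) -> None:
    # In a real crawler this would use parsed content and enqueue_links.
    seen.append(('default', context['url']))


@router.handler('DETAIL')
async def detail(context: dict) -> None:
    # In a real crawler this handler would use the page object.
    seen.append(('detail', context['url']))


asyncio.run(router.route({'url': 'https://example.com'}))
asyncio.run(router.route({'url': 'https://example.com/item', 'label': 'DETAIL'}))
```

The point of the split is that neither handler needs to branch on the crawl type.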

Contributor Author:
As of now this example is more of a starting point for the reviewers and not really for end users given the fact that it is not mentioned in docs. I agree with the changes you suggest, but I would prefer to do them once the context helpers are in place, so that the example is matching the current state of the code.

Collaborator:
I get that, but I believe that even though that example is not actually accessible from anywhere, it shouldn't contain code that goes against how the adaptive crawler should be used. I'd prefer to update that file now and then once more after we have those context helpers in place.

Contributor Author:

Ok, but unless we split pre-navigation hooks to two different functions I do not think we can avoid try-catch there.

Collaborator:

Or unless we use the new block_requests context helper there and just have it do nothing in a static crawling context. Would that work?
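The idea in this comment — the same hook body running under both crawl types because `block_requests` silently does nothing in a static context — can be sketched as follows. These context classes are hypothetical stand-ins, not crawlee's real implementation.

```python
# Sketch: a `block_requests` helper that is a real operation in the browser
# context and a no-op in the static context, so user hooks need no branching.
# All class and method names are illustrative assumptions.
from __future__ import annotations

import asyncio


class StaticPreNavContext:
    """Static (HTTP-only) pre-navigation context: blocking is a silent no-op."""

    async def block_requests(self, patterns: list[str]) -> None:
        # Nothing to block - the static crawler never loads subresources.
        pass


class BrowserPreNavContext:
    """Browser pre-navigation context: records patterns to abort."""

    def __init__(self) -> None:
        self.blocked: list[str] = []

    async def block_requests(self, patterns: list[str]) -> None:
        # A real version would register page.route(...) handlers instead.
        self.blocked.extend(patterns)


async def hook(context) -> None:
    # The very same hook body runs unchanged under both crawl types.
    await context.block_requests(['**/*.png', '**/*.css'])


static_ctx = StaticPreNavContext()
browser_ctx = BrowserPreNavContext()
asyncio.run(hook(static_ctx))
asyncio.run(hook(browser_ctx))
```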

Contributor Author:

But that covers only some specific use cases. There can be any code in those hooks. So to cover the general case and have 1-to-1 matching behavior with both PlaywrightCrawler and whatever variant of ParsedHttpCrawler, I still see just two options: try-catch or separate hooks.

Collaborator:

Again, the point of the adaptive crawler is that you shouldn't need to make separate branches for browser/static. Is there any operation that you can do in a prenav hook in the static context but not in the browser? If not (I can't think of any, but that's not saying anything), then we can just provide block_requests (a possibly common use case) and page for all the other usage scenarios (but it will force browser-based crawling when used).

Contributor Author (@Pijukatel, Jan 27, 2025):

Ok, I tried a compromise solution.

@crawler.pre_navigation_hook
This decorator registers hooks for both the static and the playwright sub crawler -> if anyone tries to access context.page, it will raise an exception. (The exception text points directly to the example below.)

@crawler.pre_navigation_hook(playwright_only=True)
This decorator registers hooks only for the playwright sub crawler -> safe to access context.page.

Collaborator:

Cool, this is fine by me
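The agreed-upon compromise — one decorator usable both bare and with `playwright_only=True` — is a standard optional-argument-decorator pattern. A minimal self-contained sketch of the registration mechanics (class and attribute names are illustrative, not crawlee's internals):

```python
# Sketch of a decorator that works both as @hook and @hook(playwright_only=True),
# registering the function with one or both sub-crawlers. Names are illustrative.
from __future__ import annotations

from typing import Awaitable, Callable, Optional

Hook = Callable[[object], Awaitable[None]]


class AdaptiveCrawlerSketch:
    def __init__(self) -> None:
        self.static_hooks: list[Hook] = []
        self.playwright_hooks: list[Hook] = []

    def pre_navigation_hook(self, hook: Optional[Hook] = None, *, playwright_only: bool = False):
        def register(fn: Hook) -> Hook:
            self.playwright_hooks.append(fn)
            if not playwright_only:
                self.static_hooks.append(fn)
            return fn

        # Bare usage: @crawler.pre_navigation_hook
        if hook is not None:
            return register(hook)
        # Parameterized usage: @crawler.pre_navigation_hook(playwright_only=True)
        return register


crawler = AdaptiveCrawlerSketch()


@crawler.pre_navigation_hook
async def runs_everywhere(context) -> None:
    ...  # must not touch context.page


@crawler.pre_navigation_hook(playwright_only=True)
async def runs_in_browser_only(context) -> None:
    ...  # safe to touch context.page


print(len(crawler.static_hooks), len(crawler.playwright_hooks))  # prints: 1 2
```

The bare form receives the function directly; the parameterized form returns the inner `register` closure, which captures the `playwright_only` flag.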


        # Find more links and enqueue them.
        await context.enqueue_links()
        await context.push_data({'Top crawler Url': context.request.url})

    @crawler.pre_navigation_hook
    async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        async def some_routing_function(route: Route) -> None:
            await route.continue_()

        try:
            await context.page.route('*/**', some_routing_function)
            context.log.info(f'Playwright pre navigation hook for: {context.request.url} ...')
        except AdaptiveContextError:
            context.log.info(f'Static pre navigation hook for: {context.request.url} ...')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://warehouse-theme-metal.myshopify.com/'])


if __name__ == '__main__':
    asyncio.run(main())
169 changes: 88 additions & 81 deletions poetry.lock

Large diffs are not rendered by default.

55 changes: 45 additions & 10 deletions src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py
janbuchar marked this conversation as resolved.
@@ -5,7 +5,7 @@
from typing import TYPE_CHECKING, Any, Callable, Generic

from pydantic import ValidationError
from typing_extensions import NotRequired, TypeVar
from typing_extensions import NotRequired, TypedDict, TypeVar

from crawlee import EnqueueStrategy
from crawlee._request import BaseRequestData
@@ -14,6 +14,7 @@
from crawlee.crawlers._basic import BasicCrawler, BasicCrawlerOptions, ContextPipeline
from crawlee.errors import SessionError
from crawlee.http_clients import HttpxHttpClient
from crawlee.statistics import StatisticsState

from ._http_crawling_context import HttpCrawlingContext, ParsedHttpCrawlingContext, TParseResult

@@ -27,24 +28,33 @@
from ._abstract_http_parser import AbstractHttpParser

TCrawlingContext = TypeVar('TCrawlingContext', bound=ParsedHttpCrawlingContext)
TStatisticsState = TypeVar('TStatisticsState', bound=StatisticsState, default=StatisticsState)


@docs_group('Data structures')
class HttpCrawlerOptions(Generic[TCrawlingContext], BasicCrawlerOptions[TCrawlingContext]):
    """Arguments for the `AbstractHttpCrawler` constructor.

    It is intended for typing forwarded `__init__` arguments in the subclasses.
    """


class _HttpCrawlerAdditionalOptions(TypedDict):
    additional_http_error_status_codes: NotRequired[Iterable[int]]
    """Additional HTTP status codes to treat as errors, triggering automatic retries when encountered."""

    ignore_http_error_status_codes: NotRequired[Iterable[int]]
    """HTTP status codes that are typically considered errors but should be treated as successful responses."""


@docs_group('Data structures')
class HttpCrawlerOptions(
    Generic[TCrawlingContext, TStatisticsState],
    _HttpCrawlerAdditionalOptions,
    BasicCrawlerOptions[TCrawlingContext, StatisticsState],
):
    """Arguments for the `AbstractHttpCrawler` constructor.

    It is intended for typing forwarded `__init__` arguments in the subclasses.
    """


@docs_group('Abstract classes')
class AbstractHttpCrawler(Generic[TCrawlingContext, TParseResult], BasicCrawler[TCrawlingContext], ABC):
class AbstractHttpCrawler(
    Generic[TCrawlingContext, TParseResult], BasicCrawler[TCrawlingContext, StatisticsState], ABC
):
    """A web crawler for performing HTTP requests.

    The `AbstractHttpCrawler` builds on top of the `BasicCrawler`, inheriting all its features. Additionally,
Expand All @@ -65,7 +75,7 @@ def __init__(
parser: AbstractHttpParser[TParseResult],
additional_http_error_status_codes: Iterable[int] = (),
ignore_http_error_status_codes: Iterable[int] = (),
**kwargs: Unpack[BasicCrawlerOptions[TCrawlingContext]],
**kwargs: Unpack[BasicCrawlerOptions[TCrawlingContext, StatisticsState]],
) -> None:
self._parser = parser
self._pre_navigation_hooks: list[Callable[[BasicCrawlingContext], Awaitable[None]]] = []
Expand All @@ -87,6 +97,31 @@ def __init__(
kwargs.setdefault('_logger', logging.getLogger(__name__))
super().__init__(**kwargs)

    @staticmethod
    def create_parsed_http_crawler_class(
        static_parser: AbstractHttpParser[TParseResult],
    ) -> type[AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult]]:
        """Convenience class factory that creates a specific version of the `AbstractHttpCrawler` class.

        In a general typing sense, the two generic parameters of `AbstractHttpCrawler` do not have to depend
        on each other. This is a convenience constructor for the specific case where `TParseResult` is used
        to specify both generic parameters of `AbstractHttpCrawler`.
        """

        class _ParsedHttpCrawler(AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult]):
            def __init__(
                self,
                parser: AbstractHttpParser[TParseResult] = static_parser,
                **kwargs: Unpack[HttpCrawlerOptions[ParsedHttpCrawlingContext[TParseResult]]],
            ) -> None:
                kwargs['_context_pipeline'] = self._create_static_content_crawler_pipeline()
                super().__init__(
                    parser=parser,
                    **kwargs,
                )

        return _ParsedHttpCrawler

    def _create_static_content_crawler_pipeline(self) -> ContextPipeline[ParsedHttpCrawlingContext[TParseResult]]:
        """Create static content crawler context pipeline with expected pipeline steps."""
        return (
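The mechanism behind `create_parsed_http_crawler_class` — a static method that closes over a default parser and returns a subclass with that parser pre-bound — can be reduced to a small self-contained sketch. The classes below are stand-ins for illustration, not crawlee's real ones.

```python
# Sketch of the class-factory pattern: bind a default collaborator object into a
# subclass at creation time. `Parser` and `BaseCrawler` are illustrative stand-ins.
from __future__ import annotations


class Parser:
    def parse(self, body: str) -> str:
        return body.upper()


class BaseCrawler:
    def __init__(self, parser: Parser) -> None:
        self.parser = parser

    @staticmethod
    def with_default_parser(default_parser: Parser) -> type['BaseCrawler']:
        # The inner class closes over `default_parser`, so callers of the
        # returned class no longer need to supply a parser themselves.
        class _BoundCrawler(BaseCrawler):
            def __init__(self, parser: Parser = default_parser) -> None:
                super().__init__(parser=parser)

        return _BoundCrawler


BoundCrawler = BaseCrawler.with_default_parser(Parser())
crawler = BoundCrawler()  # no parser argument needed anymore
result = crawler.parser.parse('hello')
```

In the PR, the same trick additionally pins down both generic parameters of `AbstractHttpCrawler` to the parser's `TParseResult`.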
7 changes: 7 additions & 0 deletions src/crawlee/crawlers/_adaptive_playwright/__init__.py
@@ -0,0 +1,7 @@
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import (
AdaptivePlaywrightCrawlingContext,
AdaptivePlaywrightPreNavCrawlingContext,
)

__all__ = ['AdaptivePlaywrightCrawler', 'AdaptivePlaywrightCrawlingContext', 'AdaptivePlaywrightPreNavCrawlingContext']