feat: Add AdaptivePlaywrightCrawler #872

Status: Open. Wants to merge 68 commits into base `master`; changes shown from 66 commits.

Commits (68):
d8b2438
WIP
Pijukatel Dec 24, 2024
c12acda
More feasible version of composition.
Pijukatel Dec 31, 2024
623d341
Pass properly all kwargs to subcrawlers
Pijukatel Dec 31, 2024
548349e
Add run decision logic from JS version
Pijukatel Jan 1, 2025
29d510a
Statistics class change led to too many ripple changes.
Pijukatel Jan 1, 2025
04eefd9
Handle sub crawlers loggers
Pijukatel Jan 1, 2025
33efed3
Change Statistics to be usable without ignores.
Pijukatel Jan 2, 2025
202aceb
Align use_state with JS implementation.
Pijukatel Jan 2, 2025
d474a94
Remove "fake generics" from Statistics.
Pijukatel Jan 2, 2025
e95ff19
Align result comparator with JS implementation.
Pijukatel Jan 2, 2025
2408d85
Add doc strings.
Pijukatel Jan 2, 2025
d345259
use_state through RequestHandlerRunResult
Pijukatel Jan 3, 2025
bfa9290
Revert "use_state through RequestHandlerRunResult"
Pijukatel Jan 6, 2025
63f278a
Add basic delegation test.
Pijukatel Jan 6, 2025
e190788
Add context test.
Pijukatel Jan 6, 2025
0ecb137
Add tests for use_state and predictor.
Pijukatel Jan 6, 2025
b73c702
Remove unintended edit.
Pijukatel Jan 6, 2025
5c79d0d
Add tests for statistics.
Pijukatel Jan 7, 2025
957915a
Add test for error handling and committing correct results.
Pijukatel Jan 7, 2025
f12f605
Add crawl_one_required_contexts property. (Alternative to accessing i…
Pijukatel Jan 7, 2025
5256af2
Lint
Pijukatel Jan 7, 2025
2fd7aae
Remove BasicCrawler modifications.
Pijukatel Jan 8, 2025
714b5bd
Make _commit_result consistent with how other result components are h…
Pijukatel Jan 8, 2025
b38dda1
Remove subcrawlers and add _OrphanPipeline
Pijukatel Jan 8, 2025
ffb2a78
Use dummy statistics in subcrawlers.
Pijukatel Jan 8, 2025
3b05228
Keep predictor related functions on predictor_state
Pijukatel Jan 8, 2025
dc06490
Unify pre-nav hooks.
Pijukatel Jan 8, 2025
0766b7a
Simplify pre-nav hook common context.
Pijukatel Jan 9, 2025
4a63c2c
Make static crawling part of AdaptiveCrawler generic.
Pijukatel Jan 9, 2025
bd72a84
Update tests to remove bs references.
Pijukatel Jan 9, 2025
c964d44
Revert accidental Lint edits to website/*.py
Pijukatel Jan 10, 2025
a471395
Review comments.
Pijukatel Jan 14, 2025
dbf6310
Sub crawler timeout handling + test
Pijukatel Jan 15, 2025
8ee8f99
Simplify prenav hooks
janbuchar Jan 15, 2025
c291d3e
Simplify context manager handling
janbuchar Jan 15, 2025
2c23238
Review comments - _run_request_handler + timeouts
Pijukatel Jan 15, 2025
b2a29c1
Statistics.
Pijukatel Jan 15, 2025
0786f87
Make statistics generic again!
Pijukatel Jan 15, 2025
4f7d7f4
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 15, 2025
1234ea7
Mock requests in tests.
Pijukatel Jan 16, 2025
e85ec9f
Improve error readability.
Pijukatel Jan 17, 2025
6e8635a
Mock both static and browser requests in tests.
Pijukatel Jan 17, 2025
0e9146a
Create proper example code
Pijukatel Jan 17, 2025
a4eac8f
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 17, 2025
44ad898
Relax timeout in test to avoid flakiness in CI
Pijukatel Jan 20, 2025
f219453
Remove AdaptivePlaywrightCrawlerStatistics
Pijukatel Jan 21, 2025
64d9e54
WIP
Pijukatel Jan 21, 2025
08fc81f
Update options typed dicts
Pijukatel Jan 21, 2025
9bde9dc
Add docstrings to adaptive context public stuff
Pijukatel Jan 21, 2025
9a14569
Make crawl_one_with private.
Pijukatel Jan 21, 2025
fd8dd82
Update tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwri…
Pijukatel Jan 22, 2025
4221219
Review comments
Pijukatel Jan 22, 2025
565d36b
Remove _run_subcrawler_pipeline
Pijukatel Jan 22, 2025
4bd8251
Remove Orphans
Pijukatel Jan 22, 2025
a3ab2e8
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 22, 2025
fc95132
Move SubCrawlerRun to where it is used
Pijukatel Jan 22, 2025
4f316b5
Use custom _TestInput dataclass
Pijukatel Jan 22, 2025
949c4ff
Review comments
Pijukatel Jan 23, 2025
56ad33a
Add optional argument to pre navigation hook decorator
Pijukatel Jan 27, 2025
781d5ff
Remove _push_result_to_context and add result argument/return to _run…
Pijukatel Jan 27, 2025
a93d6a1
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 27, 2025
b4ba31b
Add `block_request` to adaptive pre nav context
Pijukatel Jan 28, 2025
8bce425
Use context result map for handling request handler results
Pijukatel Jan 29, 2025
5f8c26c
Changes based on review comments
Pijukatel Jan 30, 2025
e1d0c7e
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 30, 2025
aacb90a
Integrate RenderingTypePredictor
Pijukatel Jan 30, 2025
ab1c40e
Update src/crawlee/crawlers/_basic/_basic_crawler.py
Pijukatel Jan 30, 2025
2bf43f6
Finalize exports and re-exports
Pijukatel Jan 30, 2025
57 changes: 57 additions & 0 deletions docs/examples/code/adaptive_playwright_crawler.py
@@ -0,0 +1,57 @@
import asyncio

from playwright.async_api import Route

from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import (
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'headless': False}
    )

    @crawler.router.handler(label='label')
    async def request_handler_for_label(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Do some processing using `page`
        some_locator = context.page.locator('div').first
        await some_locator.wait_for()
        # Do stuff with locator...
        context.log.info(f'Playwright processing of: {context.request.url} ...')

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'User handler processing: {context.request.url} ...')
        # Do some processing using `parsed_content`
        context.log.info(context.parsed_content.title)

        # Find more links and enqueue them.
        await context.enqueue_links()
        await context.push_data({'Top crawler Url': context.request.url})

    @crawler.pre_navigation_hook
    async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        """Hook executed both in static sub crawler and playwright sub crawler."""
        # Trying to access context.page in this hook would raise `AdaptiveContextError` for pages
        # crawled without playwright.
        context.log.info(f'pre navigation hook for: {context.request.url} ...')

    @crawler.pre_navigation_hook(playwright_only=True)
    async def hook_playwright(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
        """Hook executed only in playwright sub crawler."""

        async def some_routing_function(route: Route) -> None:
            await route.continue_()

        await context.page.route('*/**', some_routing_function)
        context.log.info(f'Playwright only pre navigation hook for: {context.request.url} ...')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://warehouse-theme-metal.myshopify.com/'])


if __name__ == '__main__':
    asyncio.run(main())
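The adaptive crawler decides per request whether static HTTP crawling suffices or a browser is needed; the `RenderingTypePredictor` integrated in aacb90a drives that decision and can be swapped for a custom one. Below is a minimal sketch of such a predictor. The names `RenderingType` and `RenderingTypePrediction`, the two-method interface, and the `rendering_type_predictor` constructor argument are assumptions based on the JS implementation this PR mirrors; they are not confirmed by the diff shown here.

from crawlee import Request
from crawlee.crawlers._adaptive_playwright._rendering_type_predictor import (
    RenderingType,            # assumed Literal['static', 'client only']
    RenderingTypePrediction,  # assumed result type produced by predictors
    RenderingTypePredictor,
)


class ShopifyNeedsBrowser(RenderingTypePredictor):
    """Hypothetical predictor: render one domain in a browser, everything else statically."""

    def predict(self, request: Request) -> RenderingTypePrediction:
        rendering_type: RenderingType = 'client only' if 'myshopify.com' in request.url else 'static'
        # A low recommendation means the crawler rarely double-checks the guess.
        return RenderingTypePrediction(rendering_type, detection_probability_recommendation=0.1)

    def store_result(self, request: Request, rendering_type: RenderingType) -> None:
        """A real predictor would learn here from the rendering type actually observed."""


# Assumed wiring (the argument name is not shown in this diff):
# crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
#     rendering_type_predictor=ShopifyNeedsBrowser(),
# )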
4 changes: 4 additions & 0 deletions src/crawlee/_types.py
@@ -565,3 +565,7 @@ class BasicCrawlingContext:

    log: logging.Logger
    """Logger instance."""

    def __hash__(self) -> int:
        """Return hash of the context. Each context is considered unique."""
        return id(self)
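The identity-based `__hash__` above makes every context instance usable as a dictionary key, even when two contexts carry identical field values; the context result map introduced in 8bce425 relies on this. A small illustrative stand-in (not the real class):

from dataclasses import dataclass


@dataclass(frozen=True)
class ContextStandin:
    """Stand-in for BasicCrawlingContext: dataclasses hash by value, so an
    explicit identity hash is needed to keep per-object result entries apart."""

    url: str

    def __hash__(self) -> int:
        return id(self)  # mirrors the BasicCrawlingContext.__hash__ added above


a = ContextStandin(url='https://example.com')
b = ContextStandin(url='https://example.com')  # equal fields, distinct object
results = {a: 'static sub crawler result', b: 'browser sub crawler result'}
assert len(results) == 2  # identity hashing keeps both entries separate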
55 changes: 45 additions & 10 deletions src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py
Expand Up @@ -5,7 +5,7 @@
from typing import TYPE_CHECKING, Any, Callable, Generic

from pydantic import ValidationError
from typing_extensions import NotRequired, TypeVar
from typing_extensions import NotRequired, TypedDict, TypeVar

from crawlee import EnqueueStrategy, RequestTransformAction
from crawlee._request import Request, RequestOptions
@@ -14,6 +14,7 @@
from crawlee.crawlers._basic import BasicCrawler, BasicCrawlerOptions, ContextPipeline
from crawlee.errors import SessionError
from crawlee.http_clients import HttpxHttpClient
from crawlee.statistics import StatisticsState

from ._http_crawling_context import HttpCrawlingContext, ParsedHttpCrawlingContext, TParseResult

@@ -27,24 +28,33 @@
from ._abstract_http_parser import AbstractHttpParser

TCrawlingContext = TypeVar('TCrawlingContext', bound=ParsedHttpCrawlingContext)
TStatisticsState = TypeVar('TStatisticsState', bound=StatisticsState, default=StatisticsState)


@docs_group('Data structures')
class HttpCrawlerOptions(Generic[TCrawlingContext], BasicCrawlerOptions[TCrawlingContext]):
"""Arguments for the `AbstractHttpCrawler` constructor.

It is intended for typing forwarded `__init__` arguments in the subclasses.
"""

class _HttpCrawlerAdditionalOptions(TypedDict):
additional_http_error_status_codes: NotRequired[Iterable[int]]
"""Additional HTTP status codes to treat as errors, triggering automatic retries when encountered."""

ignore_http_error_status_codes: NotRequired[Iterable[int]]
"""HTTP status codes that are typically considered errors but should be treated as successful responses."""


@docs_group('Data structures')
class HttpCrawlerOptions(
Generic[TCrawlingContext, TStatisticsState],
_HttpCrawlerAdditionalOptions,
BasicCrawlerOptions[TCrawlingContext, StatisticsState],
):
"""Arguments for the `AbstractHttpCrawler` constructor.

It is intended for typing forwarded `__init__` arguments in the subclasses.
"""


@docs_group('Abstract classes')
class AbstractHttpCrawler(Generic[TCrawlingContext, TParseResult], BasicCrawler[TCrawlingContext], ABC):
class AbstractHttpCrawler(
Generic[TCrawlingContext, TParseResult], BasicCrawler[TCrawlingContext, StatisticsState], ABC
):
"""A web crawler for performing HTTP requests.

The `AbstractHttpCrawler` builds on top of the `BasicCrawler`, inheriting all its features. Additionally,
@@ -65,7 +75,7 @@ def __init__(
parser: AbstractHttpParser[TParseResult],
additional_http_error_status_codes: Iterable[int] = (),
ignore_http_error_status_codes: Iterable[int] = (),
**kwargs: Unpack[BasicCrawlerOptions[TCrawlingContext]],
**kwargs: Unpack[BasicCrawlerOptions[TCrawlingContext, StatisticsState]],
) -> None:
self._parser = parser
self._pre_navigation_hooks: list[Callable[[BasicCrawlingContext], Awaitable[None]]] = []
@@ -87,6 +97,31 @@ def __init__(
kwargs.setdefault('_logger', logging.getLogger(__name__))
super().__init__(**kwargs)

@staticmethod
def create_parsed_http_crawler_class(
static_parser: AbstractHttpParser[TParseResult],
) -> type[AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult]]:
"""Convenience class factory that creates specific version of `AbstractHttpCrawler` class.

In general typing sense two generic types of `AbstractHttpCrawler` do not have to be dependent on each other.
This is convenience constructor for specific cases when `TParseResult` is used to specify both generic
parameters in `AbstractHttpCrawler`.
"""

class _ParsedHttpCrawler(AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult]):
def __init__(
self,
parser: AbstractHttpParser[TParseResult] = static_parser,
**kwargs: Unpack[HttpCrawlerOptions[ParsedHttpCrawlingContext[TParseResult]]],
) -> None:
kwargs['_context_pipeline'] = self._create_static_content_crawler_pipeline()
super().__init__(
parser=parser,
**kwargs,
)

return _ParsedHttpCrawler

def _create_static_content_crawler_pipeline(self) -> ContextPipeline[ParsedHttpCrawlingContext[TParseResult]]:
"""Create static content crawler context pipeline with expected pipeline steps."""
return (
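For orientation, a hedged usage sketch of `create_parsed_http_crawler_class`; `SomeParser` stands in for any concrete `AbstractHttpParser` implementation:

# Build a crawler class whose context and parse result share the same TParseResult.
ParsedCrawler = AbstractHttpCrawler.create_parsed_http_crawler_class(static_parser=SomeParser())

# The generated class forwards HttpCrawlerOptions kwargs to AbstractHttpCrawler.__init__.
crawler = ParsedCrawler(max_requests_per_crawl=10)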
20 changes: 20 additions & 0 deletions src/crawlee/crawlers/_adaptive_playwright/__init__.py
@@ -0,0 +1,20 @@
try:
    from ._rendering_type_predictor import RenderingTypePredictor
except ImportError as exc:
    raise ImportError(
        "To import this, you need to install the 'adaptive-playwright' extra. "
        "For example, if you use pip, run `pip install 'crawlee[adaptive-playwright]'`.",
    ) from exc

from ._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
from ._adaptive_playwright_crawling_context import (
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)

__all__ = [
    'AdaptivePlaywrightCrawler',
    'AdaptivePlaywrightCrawlingContext',
    'AdaptivePlaywrightPreNavCrawlingContext',
    'RenderingTypePredictor',
]
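With this `__init__.py` in place, user code can import the public names from the package instead of the private modules used in the docs example above:

from crawlee.crawlers._adaptive_playwright import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
    AdaptivePlaywrightPreNavCrawlingContext,
)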