feat: Add AdaptivePlaywrightCrawler #872

Open

wants to merge 68 commits into base: master

Changes shown from 20 of 68 commits.

Commits (68):
d8b2438 WIP (Pijukatel, Dec 24, 2024)
c12acda More feasible version of composition. (Pijukatel, Dec 31, 2024)
623d341 Properly pass all kwargs to subcrawlers (Pijukatel, Dec 31, 2024)
548349e Add run decision logic from JS version (Pijukatel, Jan 1, 2025)
29d510a Statistics class change led to too many ripple changes. (Pijukatel, Jan 1, 2025)
04eefd9 Handle sub-crawler loggers (Pijukatel, Jan 1, 2025)
33efed3 Change Statistics to be usable without ignores. (Pijukatel, Jan 2, 2025)
202aceb Align use_state with JS implementation. (Pijukatel, Jan 2, 2025)
d474a94 Remove "fake generics" from Statistics. (Pijukatel, Jan 2, 2025)
e95ff19 Align result comparator with JS implementation. (Pijukatel, Jan 2, 2025)
2408d85 Add docstrings. (Pijukatel, Jan 2, 2025)
d345259 use_state through RequestHandlerRunResult (Pijukatel, Jan 3, 2025)
bfa9290 Revert "use_state through RequestHandlerRunResult" (Pijukatel, Jan 6, 2025)
63f278a Add basic delegation test. (Pijukatel, Jan 6, 2025)
e190788 Add context test. (Pijukatel, Jan 6, 2025)
0ecb137 Add tests for use_state and predictor. (Pijukatel, Jan 6, 2025)
b73c702 Remove unintended edit. (Pijukatel, Jan 6, 2025)
5c79d0d Add tests for statistics. (Pijukatel, Jan 7, 2025)
957915a Add test for error handling and committing correct results. (Pijukatel, Jan 7, 2025)
f12f605 Add crawl_one_required_contexts property. (Alternative to accessing i…) (Pijukatel, Jan 7, 2025)
5256af2 Lint (Pijukatel, Jan 7, 2025)
2fd7aae Remove BasicCrawler modifications. (Pijukatel, Jan 8, 2025)
714b5bd Make _commit_result consistent with how other result components are h… (Pijukatel, Jan 8, 2025)
b38dda1 Remove subcrawlers and add _OrphanPipeline (Pijukatel, Jan 8, 2025)
ffb2a78 Use dummy statistics in subcrawlers. (Pijukatel, Jan 8, 2025)
3b05228 Keep predictor-related functions on predictor_state (Pijukatel, Jan 8, 2025)
dc06490 Unify pre-nav hooks. (Pijukatel, Jan 8, 2025)
0766b7a Simplify pre-nav hook common context. (Pijukatel, Jan 9, 2025)
4a63c2c Make static crawling part of AdaptiveCrawler generic. (Pijukatel, Jan 9, 2025)
bd72a84 Update tests to remove bs references. (Pijukatel, Jan 9, 2025)
c964d44 Revert accidental Lint edits to website/*.py (Pijukatel, Jan 10, 2025)
a471395 Review comments. (Pijukatel, Jan 14, 2025)
dbf6310 Sub-crawler timeout handling + test (Pijukatel, Jan 15, 2025)
8ee8f99 Simplify pre-nav hooks (janbuchar, Jan 15, 2025)
c291d3e Simplify context manager handling (janbuchar, Jan 15, 2025)
2c23238 Review comments - _run_request_handler + timeouts (Pijukatel, Jan 15, 2025)
b2a29c1 Statistics. (Pijukatel, Jan 15, 2025)
0786f87 Make statistics generic again! (Pijukatel, Jan 15, 2025)
4f7d7f4 Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler (Pijukatel, Jan 15, 2025)
1234ea7 Mock requests in tests. (Pijukatel, Jan 16, 2025)
e85ec9f Improve error readability. (Pijukatel, Jan 17, 2025)
6e8635a Mock both static and browser requests in tests. (Pijukatel, Jan 17, 2025)
0e9146a Create proper example code (Pijukatel, Jan 17, 2025)
a4eac8f Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler (Pijukatel, Jan 17, 2025)
44ad898 Relax timeout in test to avoid flakiness in CI (Pijukatel, Jan 20, 2025)
f219453 Remove AdaptivePlaywrightCrawlerStatistics (Pijukatel, Jan 21, 2025)
64d9e54 WIP (Pijukatel, Jan 21, 2025)
08fc81f Update options typed dicts (Pijukatel, Jan 21, 2025)
9bde9dc Add docstrings to adaptive context public stuff (Pijukatel, Jan 21, 2025)
9a14569 Make crawl_one_with private. (Pijukatel, Jan 21, 2025)
fd8dd82 Update tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwri… (Pijukatel, Jan 22, 2025)
4221219 Review comments (Pijukatel, Jan 22, 2025)
565d36b Remove _run_subcrawler_pipeline (Pijukatel, Jan 22, 2025)
4bd8251 Remove Orphans (Pijukatel, Jan 22, 2025)
a3ab2e8 Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler (Pijukatel, Jan 22, 2025)
fc95132 Move SubCrawlerRun to where it is used (Pijukatel, Jan 22, 2025)
4f316b5 Use custom _TestInput dataclass (Pijukatel, Jan 22, 2025)
949c4ff Review comments (Pijukatel, Jan 23, 2025)
56ad33a Add optional argument to pre-navigation hook decorator (Pijukatel, Jan 27, 2025)
781d5ff Remove _push_result_to_context and add result argument/return to _run… (Pijukatel, Jan 27, 2025)
a93d6a1 Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler (Pijukatel, Jan 27, 2025)
b4ba31b Add `block_request` to adaptive pre-nav context (Pijukatel, Jan 28, 2025)
8bce425 Use context result map for handling request handler results (Pijukatel, Jan 29, 2025)
5f8c26c Changes based on review comments (Pijukatel, Jan 30, 2025)
e1d0c7e Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler (Pijukatel, Jan 30, 2025)
aacb90a Integrate RenderingTypePredictor (Pijukatel, Jan 30, 2025)
ab1c40e Update src/crawlee/crawlers/_basic/_basic_crawler.py (Pijukatel, Jan 30, 2025)
2bf43f6 Finalize exports and re-exports (Pijukatel, Jan 30, 2025)
361 changes: 174 additions & 187 deletions poetry.lock

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion src/crawlee/_types.py
@@ -298,7 +298,6 @@ class UseStateFunction(Protocol):

     def __call__(
         self,
-        key: str,
         default_value: dict[str, JsonSerializable] | None = None,
     ) -> Coroutine[None, None, dict[str, JsonSerializable]]: ...
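
For context, this change drops the `key` argument from `use_state`, matching the "Align use_state with JS implementation" commit. A handler would now call it roughly like this (a minimal sketch; the state keys are illustrative):

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    # use_state now takes only an optional default value; the key argument is gone.
    state = await context.use_state({'visit_count': 0})
    state['visit_count'] += 1
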
17 changes: 10 additions & 7 deletions src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py
@@ -5,7 +5,7 @@

 from typing import TYPE_CHECKING, Any, Callable, Generic

 from pydantic import ValidationError
-from typing_extensions import NotRequired, TypeVar
+from typing_extensions import NotRequired, TypedDict, TypeVar

 from crawlee import EnqueueStrategy
 from crawlee._request import BaseRequestData

@@ -30,18 +30,21 @@


-@docs_group('Data structures')
-class HttpCrawlerOptions(Generic[TCrawlingContext], BasicCrawlerOptions[TCrawlingContext]):
-    """Arguments for the `AbstractHttpCrawler` constructor.
-
-    It is intended for typing forwarded `__init__` arguments in the subclasses.
-    """
-
+class _HttpCrawlerOptions(Generic[TCrawlingContext], TypedDict):
     additional_http_error_status_codes: NotRequired[Iterable[int]]
     """Additional HTTP status codes to treat as errors, triggering automatic retries when encountered."""

     ignore_http_error_status_codes: NotRequired[Iterable[int]]
     """HTTP status codes typically considered errors but to be treated as successful responses."""

+
+@docs_group('Data structures')
+class HttpCrawlerOptions(Generic[TCrawlingContext], _HttpCrawlerOptions, BasicCrawlerOptions[TCrawlingContext]):
+    """Arguments for the `AbstractHttpCrawler` constructor.
+
+    It is intended for typing forwarded `__init__` arguments in the subclasses.
+    """


 @docs_group('Abstract classes')
 class AbstractHttpCrawler(Generic[TCrawlingContext, TParseResult], BasicCrawler[TCrawlingContext], ABC):
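
Splitting the plain option fields into a `TypedDict` lets subclasses forward typed keyword arguments to the base constructor. A minimal sketch of the pattern, assuming a hypothetical subclass (real subclasses also wire up a parser):

from typing_extensions import Unpack

class MyHttpCrawler(AbstractHttpCrawler[ParsedHttpCrawlingContext, bytes]):
    def __init__(self, **kwargs: Unpack[HttpCrawlerOptions[ParsedHttpCrawlingContext]]) -> None:
        # All options, including additional_http_error_status_codes and
        # ignore_http_error_status_codes, are forwarded with full type checking.
        super().__init__(**kwargs)
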
14 changes: 14 additions & 0 deletions src/crawlee/crawlers/_adaptive_playwright/__init__.py
@@ -0,0 +1,14 @@
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import (
    AdaptivePlaywrightCrawlingContext,
)

__all__ = [
    'AdaptivePlaywrightCrawler',
    'AdaptivePlaywrightCrawlingContext',
    'HttpCrawlerOptions',
    'ParsedHttpCrawlingContext',
]
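
Note that `HttpCrawlerOptions` and `ParsedHttpCrawlingContext` are listed in `__all__` without being imported here, so a star-import from this package would fail at this point in the PR; the exports are settled in the later "Finalize exports and re-exports" commit. Explicit imports of the two defined names work as expected:

from crawlee.crawlers._adaptive_playwright import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)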



Large diffs are not rendered by default.

@@ -0,0 +1,88 @@
from __future__ import annotations

from datetime import timedelta
from typing import TYPE_CHECKING, Annotated, Any, cast

from pydantic import BaseModel, ConfigDict, Field
from typing_extensions import override

from crawlee._utils.docs import docs_group
from crawlee.statistics import Statistics, StatisticsState

if TYPE_CHECKING:
from logging import Logger

from typing_extensions import Self

from crawlee.storages import KeyValueStore


@docs_group('Data structures')
class PredictorState(BaseModel):
    model_config = ConfigDict(populate_by_name=True, ser_json_inf_nan='constants')

    http_only_request_handler_runs: Annotated[int, Field(alias='http_only_request_handler_runs')] = 0
    browser_request_handler_runs: Annotated[int, Field(alias='browser_request_handler_runs')] = 0
    rendering_type_mispredictions: Annotated[int, Field(alias='rendering_type_mispredictions')] = 0


@docs_group('Classes')
class AdaptivePlaywrightCrawlerStatistics(Statistics):
    def __init__(
        self,
        *,
        persistence_enabled: bool = False,
        persist_state_kvs_name: str = 'default',
        persist_state_key: str | None = None,
        key_value_store: KeyValueStore | None = None,
        log_message: str = 'Statistics',
        periodic_message_logger: Logger | None = None,
        log_interval: timedelta = timedelta(minutes=1),
        state_model: type[StatisticsState] = StatisticsState,
    ) -> None:
        self.predictor_state = PredictorState()
        super().__init__(
            persistence_enabled=persistence_enabled,
            persist_state_kvs_name=persist_state_kvs_name,
            persist_state_key=persist_state_key,
            key_value_store=key_value_store,
            log_message=log_message,
            periodic_message_logger=periodic_message_logger,
            log_interval=log_interval,
            state_model=state_model,
        )
        self._persist_predictor_state_key = self._persist_state_key + '_PREDICTOR'

    @classmethod
    def from_statistics(cls, statistics: Statistics) -> Self:
        # Accessing private members to create a copy-like object.
        return cls(
            persistence_enabled=statistics._persistence_enabled,  # noqa: SLF001
            persist_state_kvs_name=statistics._persist_state_kvs_name,  # noqa: SLF001
            persist_state_key=statistics._persist_state_key,  # noqa: SLF001
            key_value_store=statistics._key_value_store,  # noqa: SLF001
            log_message=statistics._log_message,  # noqa: SLF001
            periodic_message_logger=statistics._periodic_message_logger,  # noqa: SLF001
            log_interval=statistics._log_interval,  # noqa: SLF001
            state_model=statistics._state_model,  # noqa: SLF001
        )

def track_http_only_request_handler_runs(self) -> None:
self.predictor_state.http_only_request_handler_runs += 1

def track_browser_request_handler_runs(self) -> None:
self.predictor_state.browser_request_handler_runs += 1

def track_rendering_type_mispredictions(self) -> None:
self.predictor_state.rendering_type_mispredictions += 1

@override
async def _persist_other_statistics(self, key_value_store: KeyValueStore) -> None:
"""Persist state of predictor."""
await key_value_store.set_value(
self._persist_predictor_state_key,
self.predictor_state.model_dump(mode='json', by_alias=True),
'application/json',
)


@override
async def _load_other_statistics(self, key_value_store: KeyValueStore) -> None:
"""Load state of predictor."""
stored_state = await key_value_store.get_value(self._persist_predictor_state_key, cast(Any, {}))
self.predictor_state = self.predictor_state.__class__.model_validate(stored_state)
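
As a rough usage sketch, the adaptive crawler is expected to bump these counters as it decides between sub-crawlers; the surrounding calls below are illustrative, not part of this diff:

stats = AdaptivePlaywrightCrawlerStatistics()

# After a successful HTTP-only (static) pass:
stats.track_http_only_request_handler_runs()

# After a browser (Playwright) pass:
stats.track_browser_request_handler_runs()

# When the predicted rendering type turned out to be wrong:
stats.track_rendering_type_mispredictions()

print(stats.predictor_state.model_dump(mode='json', by_alias=True))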

@@ -0,0 +1,92 @@
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import TYPE_CHECKING

from bs4 import BeautifulSoup

from crawlee import HttpHeaders
from crawlee._utils.docs import docs_group
from crawlee.crawlers import BeautifulSoupCrawlingContext, BeautifulSoupParserType, PlaywrightCrawlingContext

if TYPE_CHECKING:
from collections.abc import Awaitable, Callable

from playwright.async_api import Page, Response
from typing_extensions import Self

class AdaptiveContextError(RuntimeError):
    pass


@dataclass(frozen=True)
@docs_group('Data structures')
class AdaptivePlaywrightCrawlingContext(BeautifulSoupCrawlingContext):
    _response: Response | None = None
    _infinite_scroll: Callable[[], Awaitable[None]] | None = None
    _page: Page | None = None

@property
def page(self) -> Page:
if not self._page:
raise AdaptiveContextError('Page was not crawled with PlaywrightCrawler.')
return self._page

@property
def infinite_scroll(self) -> Callable[[], Awaitable[None]]:
if not self._infinite_scroll:
raise AdaptiveContextError('Page was not crawled with PlaywrightCrawler.')
return self._infinite_scroll

@property
def response(self) -> Response:
if not self._response:
raise AdaptiveContextError('Page was not crawled with PlaywrightCrawler.')
return self._response

@classmethod
def from_beautifulsoup_crawling_context(cls, context: BeautifulSoupCrawlingContext) -> Self:
"""Convenience constructor that creates new context from existing `BeautifulSoupCrawlingContext`."""
return cls(**{field.name: getattr(context, field.name) for field in fields(context)})

    @classmethod
    async def from_playwright_crawling_context(
        cls, context: PlaywrightCrawlingContext, beautiful_soup_parser_type: BeautifulSoupParserType | None
    ) -> Self:
        """Convenience constructor that creates a new context from an existing `PlaywrightCrawlingContext`."""
        context_kwargs = {field.name: getattr(context, field.name) for field in fields(context)}
        # Remove Playwright-specific attributes and pass them as private instead, to be available as properties.
        context_kwargs['_response'] = context_kwargs.pop('response')
        context_kwargs['_page'] = context_kwargs.pop('page')
        context_kwargs['_infinite_scroll'] = context_kwargs.pop('infinite_scroll')
        # This might always be available.
        protocol_guess = await context_kwargs['_page'].evaluate('() => performance.getEntries()[0].nextHopProtocol')
        http_response = await _HttpResponse.from_playwright_response(
            response=context.response, protocol=protocol_guess or ''
        )
        return cls(
            parsed_content=BeautifulSoup(http_response.read(), features=beautiful_soup_parser_type),
            http_response=http_response,
            **context_kwargs,
        )


@dataclass(frozen=True)
class _HttpResponse:
    http_version: str
    status_code: int
    headers: HttpHeaders
    _content: bytes

    def read(self) -> bytes:
        return self._content

    @classmethod
    async def from_playwright_response(cls, response: Response, protocol: str) -> Self:
        headers = HttpHeaders(response.headers)
        status_code = response.status
        # The HTTP version is not exposed anywhere in Playwright, but some headers can include protocol
        # information, for example 'x-firefox-spdy' in Firefox. It might also be obtained by executing JS code
        # in the browser: performance.getEntries()[0].nextHopProtocol
        # Response header capitalization does not respect HTTP/1.1 Pascal case; headers are always lower case
        # in Playwright.
        http_version = protocol
        _content = await response.body()

        return cls(http_version=http_version, status_code=status_code, headers=headers, _content=_content)
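
In a request handler, this adaptive context lets the same code serve both sub-crawlers: the parsed content is always present, while the Playwright-only properties raise `AdaptiveContextError` when the page was crawled statically. A minimal hedged sketch (handler registration as in the example file below):

@crawler.router.default_handler
async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    # Available regardless of which sub-crawler produced the result.
    await context.push_data({'title': str(context.parsed_content.title)})

    try:
        # Only available when the page was rendered by PlaywrightCrawler.
        await context.page.wait_for_selector('h1')
    except AdaptiveContextError:
        pass  # Static (HTTP-only) crawl; there is no live page to drive.
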
45 changes: 45 additions & 0 deletions src/crawlee/crawlers/_adaptive_playwright/_example.py
@@ -0,0 +1,45 @@
import asyncio
import logging
from logging import getLogger

from crawlee._types import BasicCrawlingContext
from crawlee.crawlers import PlaywrightPreNavCrawlingContext
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import (
AdaptivePlaywrightCrawlingContext,
)


async def main() -> None:
    # TODO: remove in review. Move this to documentation examples instead.
    top_logger = getLogger(__name__)
    top_logger.setLevel(logging.DEBUG)
    i = 0

    crawler = AdaptivePlaywrightCrawler(
        max_requests_per_crawl=10,
        _logger=top_logger,
        playwright_crawler_args={'headless': False},
    )

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        nonlocal i
        i += 1
        context.log.info(f'Processing with top adaptive crawler: {context.request.url} ...')
        await context.enqueue_links()
        await context.push_data({'Top crawler URL': context.request.url})
        await context.use_state({'bla': i})

    @crawler.pre_navigation_hook_bs
    async def bs_hook(context: BasicCrawlingContext) -> None:
        context.log.info(f'BS pre-navigation hook for: {context.request.url} ...')

    @crawler.pre_navigation_hook_pw
    async def pw_hook(context: PlaywrightPreNavCrawlingContext) -> None:
        context.log.info(f'PW pre-navigation hook for: {context.request.url} ...')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
asyncio.run(main())
@@ -0,0 +1,39 @@
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from random import choice
from typing import Literal

from typing_extensions import override

RenderingType = Literal['static', 'client only']


@dataclass(frozen=True)
class RenderingTypePrediction:
    rendering_type: RenderingType
    detection_probability_recommendation: float

class RenderingTypePredictor(ABC):
    @abstractmethod
    def predict(self, url: str, label: str | None) -> RenderingTypePrediction: ...

    @abstractmethod
    def store_result(self, url: str, label: str | None, crawl_type: RenderingType) -> None: ...


class DefaultRenderingTypePredictor(RenderingTypePredictor):
    # Dummy version of the predictor. The proper version will be implemented in another change.

    @override
    def predict(self, url: str, label: str | None) -> RenderingTypePrediction:  # Will be implemented later
        return RenderingTypePrediction(choice(['static', 'client only']), 0.1)

    @override
    def store_result(self, url: str, label: str | None, crawl_type: RenderingType) -> None:
        pass
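
Since `DefaultRenderingTypePredictor` is only a random placeholder, a real predictor implements the same two methods with actual logic. A hedged sketch of one possible custom predictor (the label-memory heuristic is purely illustrative, not part of this PR):

class LabelMemoryPredictor(RenderingTypePredictor):
    """Remembers the last rendering type observed for each request label."""

    def __init__(self) -> None:
        self._seen: dict[str | None, RenderingType] = {}

    @override
    def predict(self, url: str, label: str | None) -> RenderingTypePrediction:
        if label in self._seen:
            # Trust the remembered type; recommend only occasional re-detection.
            return RenderingTypePrediction(self._seen[label], 0.05)
        # Unknown label: guess static, but strongly recommend detection.
        return RenderingTypePrediction('static', 1.0)

    @override
    def store_result(self, url: str, label: str | None, crawl_type: RenderingType) -> None:
        self._seen[label] = crawl_type
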
43 changes: 43 additions & 0 deletions src/crawlee/crawlers/_adaptive_playwright/_result_comparator.py
@@ -0,0 +1,43 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

if TYPE_CHECKING:
from collections.abc import Callable

from crawlee._types import RequestHandlerRunResult


@dataclass(frozen=True)
class SubCrawlerRun:
    result: RequestHandlerRunResult | None = None
    exception: Exception | None = None


def create_comparator(
    result_checker: Callable[[RequestHandlerRunResult], bool] | None,
) -> Callable[[RequestHandlerRunResult, RequestHandlerRunResult], bool]:
    """Factory for creating a comparator function."""
    if result_checker:
        # If a user-specific checker is defined, compare results by requiring both to pass it.
        return lambda result_1, result_2: result_checker(result_1) and result_checker(result_2)
    # Fall back to the default comparator.
    return push_data_only_comparator


def full_result_comparator(result_1: RequestHandlerRunResult, result_2: RequestHandlerRunResult) -> bool:
    """Compare results by comparing all their parts."""
    # Playwright can produce links with extra arguments compared to pure BeautifulSoup, which the default
    # comparator ignores. Maybe the full comparator should have a flag to take into account only URLs without
    # parameters, for example:
    # https://sdk.apify.com/docs/guides/getting-started
    # https://sdk.apify.com/docs/guides/getting-started?__hsfp=1136113150&__hssc=7591405.1.1735494277124&__hstc=7591405.e2b9302ed00c5bfaee3a870166792181.1735494277124.1735494277124.1735494277124.1

    return (
        (result_1.push_data_calls == result_2.push_data_calls)
        and (result_1.add_requests_calls == result_2.add_requests_calls)
        and (result_1.key_value_store_changes == result_2.key_value_store_changes)
    )


def push_data_only_comparator(result_1: RequestHandlerRunResult, result_2: RequestHandlerRunResult) -> bool:
    """Compare results by comparing their push_data calls. Ignore other parts of the results in the comparison."""
    return result_1.push_data_calls == result_2.push_data_calls
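
To illustrate the factory: with a user-supplied checker it requires both sub-crawler results to pass the check, and without one it falls back to comparing only `push_data` calls. A short hedged sketch (the checker predicate is made up):

# Hypothetical checker: a result is acceptable if it pushed at least one record.
def has_data(result: RequestHandlerRunResult) -> bool:
    return bool(result.push_data_calls)

comparator = create_comparator(has_data)
# comparator(result_1, result_2) is True only when both sub-crawler results pushed data.

default_comparator = create_comparator(None)  # Same as push_data_only_comparator.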