feat: Add AdaptivePlaywrightCrawler #872

Merged on Feb 7, 2025 (71 commits)
d8b2438
WIP
Pijukatel Dec 24, 2024
c12acda
More feasible version of composition.
Pijukatel Dec 31, 2024
623d341
Pass properly all kwargs to subcrawlers
Pijukatel Dec 31, 2024
548349e
Add run decision logic from JS version
Pijukatel Jan 1, 2025
29d510a
Statistics class change led to too many ripple changes.
Pijukatel Jan 1, 2025
04eefd9
Handle sub crawlers loggers
Pijukatel Jan 1, 2025
33efed3
Statistics change to be usable without ignores.
Pijukatel Jan 2, 2025
202aceb
Align use_state with JS implementation.
Pijukatel Jan 2, 2025
d474a94
Remove "fake generics" from Statistics.
Pijukatel Jan 2, 2025
e95ff19
Align result comparator with JS implementation.
Pijukatel Jan 2, 2025
2408d85
Add doc strings.
Pijukatel Jan 2, 2025
d345259
use_state through RequestHandlerRunResult
Pijukatel Jan 3, 2025
bfa9290
Revert "use_state through RequestHandlerRunResult"
Pijukatel Jan 6, 2025
63f278a
Add basic delegation test.
Pijukatel Jan 6, 2025
e190788
Add context test.
Pijukatel Jan 6, 2025
0ecb137
Add tests for use_state and predictor.
Pijukatel Jan 6, 2025
b73c702
Remove unintended edit.
Pijukatel Jan 6, 2025
5c79d0d
Add tests for statistics.
Pijukatel Jan 7, 2025
957915a
Add test for error handling and committing correct results.
Pijukatel Jan 7, 2025
f12f605
Add crawl_one_required_contexts property. (Alternative to accessing i…
Pijukatel Jan 7, 2025
5256af2
Lint
Pijukatel Jan 7, 2025
2fd7aae
Remove BasicCrawler modifications.
Pijukatel Jan 8, 2025
714b5bd
Make _commit_result consistent with how other result components are h…
Pijukatel Jan 8, 2025
b38dda1
Remove subcrawlers and add _OrphanPipeline
Pijukatel Jan 8, 2025
ffb2a78
Use dummy statistics in subcrawlers.
Pijukatel Jan 8, 2025
3b05228
Keep predictor related functions on predictor_state
Pijukatel Jan 8, 2025
dc06490
Unify pre-nav hooks.
Pijukatel Jan 8, 2025
0766b7a
Simplify pre-nav hook common context.
Pijukatel Jan 9, 2025
4a63c2c
Make static crawling part of AdaptiveCrawler generic.
Pijukatel Jan 9, 2025
bd72a84
Update tests to remove bs references.
Pijukatel Jan 9, 2025
c964d44
Revert accidental Lint edits to website/*.py
Pijukatel Jan 10, 2025
a471395
Review comments.
Pijukatel Jan 14, 2025
dbf6310
Sub crawler timeout handling + test
Pijukatel Jan 15, 2025
8ee8f99
Simplify prenav hooks
janbuchar Jan 15, 2025
c291d3e
Simplify context manager handling
janbuchar Jan 15, 2025
2c23238
Review comments - _run_request_handler + timeouts
Pijukatel Jan 15, 2025
b2a29c1
Statistics.
Pijukatel Jan 15, 2025
0786f87
Make statistics generic again!
Pijukatel Jan 15, 2025
4f7d7f4
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 15, 2025
1234ea7
Mock requests in tests.
Pijukatel Jan 16, 2025
e85ec9f
Improve error readability.
Pijukatel Jan 17, 2025
6e8635a
Mock both static and browser requests in tests.
Pijukatel Jan 17, 2025
0e9146a
Create proper example code
Pijukatel Jan 17, 2025
a4eac8f
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 17, 2025
44ad898
Relax timeout in test to avoid flakiness in CI
Pijukatel Jan 20, 2025
f219453
Remove AdaptivePlaywrightCrawlerStatistics
Pijukatel Jan 21, 2025
64d9e54
WIP
Pijukatel Jan 21, 2025
08fc81f
Update options typed dicts
Pijukatel Jan 21, 2025
9bde9dc
Add docstrings to adaptive context public stuff
Pijukatel Jan 21, 2025
9a14569
Make crawl_one_with private.
Pijukatel Jan 21, 2025
fd8dd82
Update tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwri…
Pijukatel Jan 22, 2025
4221219
Review comments
Pijukatel Jan 22, 2025
565d36b
Remove _run_subcrawler_pipeline
Pijukatel Jan 22, 2025
4bd8251
Remove Orphans
Pijukatel Jan 22, 2025
a3ab2e8
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 22, 2025
fc95132
Move SubCrawlerRun to where it is used
Pijukatel Jan 22, 2025
4f316b5
Use custom _TestInput dataclass
Pijukatel Jan 22, 2025
949c4ff
Review comments
Pijukatel Jan 23, 2025
56ad33a
Add optional argument to pre navigation hook decorator
Pijukatel Jan 27, 2025
781d5ff
Remove _push_result_to_context and add result argument/return to _run…
Pijukatel Jan 27, 2025
a93d6a1
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 27, 2025
b4ba31b
Add `block_request` to adaptive pre nav context
Pijukatel Jan 28, 2025
8bce425
Use context result map for handling request handler results
Pijukatel Jan 29, 2025
5f8c26c
Review comments based comments
Pijukatel Jan 30, 2025
e1d0c7e
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Jan 30, 2025
aacb90a
Integrate RenderingTypePredictor
Pijukatel Jan 30, 2025
ab1c40e
Update src/crawlee/crawlers/_basic/_basic_crawler.py
Pijukatel Jan 30, 2025
2bf43f6
Finalize exports and re exports
Pijukatel Jan 30, 2025
5a6d07c
Review comments
Pijukatel Feb 5, 2025
e20aa49
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel Feb 5, 2025
1f48d53
Review comments
Pijukatel Feb 6, 2025
Finalize exports and re exports
Pijukatel committed Jan 30, 2025
commit 2bf43f63d5bb701a9c79f0997dafd417ada6d3f1
24 changes: 24 additions & 0 deletions src/crawlee/crawlers/__init__.py
@@ -18,10 +18,31 @@
 with _try_import(__name__, 'PlaywrightCrawler', 'PlaywrightCrawlingContext', 'PlaywrightPreNavCrawlingContext'):
     from ._playwright import PlaywrightCrawler, PlaywrightCrawlingContext, PlaywrightPreNavCrawlingContext

+with _try_import(
+    __name__,
+    'AdaptivePlaywrightCrawler',
+    'AdaptivePlaywrightCrawlingContext',
+    'AdaptivePlaywrightPreNavCrawlingContext',
+    'RenderingType',
+    'RenderingTypePrediction',
+    'RenderingTypePredictor',
+):
+    from ._adaptive_playwright import (
+        AdaptivePlaywrightCrawler,
+        AdaptivePlaywrightCrawlingContext,
+        AdaptivePlaywrightPreNavCrawlingContext,
+        RenderingType,
+        RenderingTypePrediction,
+        RenderingTypePredictor,
+    )
+

 __all__ = [
     'AbstractHttpCrawler',
     'AbstractHttpParser',
+    'AdaptivePlaywrightCrawler',
+    'AdaptivePlaywrightCrawlingContext',
+    'AdaptivePlaywrightPreNavCrawlingContext',
     'BasicCrawler',
     'BasicCrawlerOptions',
     'BasicCrawlingContext',
@@ -39,4 +60,7 @@
     'PlaywrightCrawler',
     'PlaywrightCrawlingContext',
     'PlaywrightPreNavCrawlingContext',
+    'RenderingType',
+    'RenderingTypePrediction',
+    'RenderingTypePredictor',
 ]
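The guarded import in this file can be read as "expose these names only when the optional dependency is installed". Below is a minimal stand-alone sketch of that idea. The names `try_import_sketch`, `UNAVAILABLE`, and `demo_pkg` are hypothetical and chosen for illustration only; this is not crawlee's `_try_import` implementation, whose internals are not shown in this diff.

```python
from contextlib import contextmanager
from typing import Iterator

# Hypothetical registry: "package.symbol" -> reason the import failed.
UNAVAILABLE: dict[str, str] = {}


@contextmanager
def try_import_sketch(module_name: str, *symbol_names: str) -> Iterator[None]:
    """Swallow ImportError raised inside the `with` body and record the symbols as unavailable."""
    try:
        yield
    except ImportError as exc:
        # Returning normally from the generator suppresses the exception.
        for name in symbol_names:
            UNAVAILABLE[f'{module_name}.{name}'] = str(exc)


# The optional dependency is missing: the failure is recorded, not raised.
with try_import_sketch('demo_pkg', 'MissingThing'):
    from a_module_that_does_not_exist_12345 import MissingThing  # noqa: F401

# The dependency is present: the import behaves as usual.
with try_import_sketch('demo_pkg', 'OrderedDict'):
    from collections import OrderedDict
```

With this shape, importing the package never fails just because an optional extra is absent; only the names tied to that extra are affected.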
4 changes: 3 additions & 1 deletion src/crawlee/crawlers/_adaptive_playwright/__init__.py
@@ -1,5 +1,5 @@
 try:
-    from ._rendering_type_predictor import RenderingTypePredictor
+    from ._rendering_type_predictor import RenderingType, RenderingTypePrediction, RenderingTypePredictor
 except ImportError as exc:
     raise ImportError(
         "To import this, you need to install the 'adaptive-playwright' extra. "
@@ -16,5 +16,7 @@
     'AdaptivePlaywrightCrawler',
     'AdaptivePlaywrightCrawlingContext',
     'AdaptivePlaywrightPreNavCrawlingContext',
+    'RenderingType',
+    'RenderingTypePrediction',
+    'RenderingTypePredictor',
 ]
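The `try`/`except ImportError` block in this file turns a bare import failure into a message that names the missing extra. A small self-contained sketch of that pattern follows. The helper `import_with_extra_hint`, the extra name `demo-extra`, and the message wording are assumptions made for illustration, not code from this pull request.

```python
import importlib
from types import ModuleType


def import_with_extra_hint(module_name: str, extra: str) -> ModuleType:
    """Import a module, re-raising ImportError with a hint about which extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the root cause stays visible in the traceback.
        raise ImportError(f"To import this, you need to install the '{extra}' extra.") from exc


# A module that exists is returned unchanged.
json_module = import_with_extra_hint('json', 'demo-extra')

# A missing module produces the hint instead of a bare ModuleNotFoundError message.
try:
    import_with_extra_hint('module_that_does_not_exist_12345', 'demo-extra')
except ImportError as exc:
    hint = str(exc)
```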
@@ -13,6 +13,7 @@
 from typing_extensions import Self, TypeVar, override

 from crawlee._types import BasicCrawlingContext, JsonSerializable, RequestHandlerRunResult
+from crawlee._utils.docs import docs_group
 from crawlee._utils.wait import wait_for
 from crawlee.crawlers import (
     AbstractHttpCrawler,
@@ -82,6 +83,7 @@ async def __aexit__(
         self._active = False


+@docs_group('Classes')
 class AdaptivePlaywrightCrawler(
     Generic[TStaticCrawlingContext, TStaticParseResult],
     BasicCrawler[AdaptivePlaywrightCrawlingContext, AdaptivePlaywrightCrawlerStatisticState],
@@ -100,6 +100,7 @@ async def from_playwright_crawling_context(


 @dataclass(frozen=True)
+@docs_group('Data structures')
 class AdaptivePlaywrightPreNavCrawlingContext(BasicCrawlingContext):
     """This is just wrapper around BasicCrawlingContext or AdaptivePlaywrightCrawlingContext.

@@ -11,18 +11,21 @@
 from typing_extensions import override

 from crawlee import Request
+from crawlee._utils.docs import docs_group

 UrlComponents = list[str]
 RenderingType = Literal['static', 'client only']
 FeatureVector = tuple[float, float]


+@docs_group('Data structures')
 @dataclass(frozen=True)
 class RenderingTypePrediction:
     rendering_type: RenderingType
     detection_probability_recommendation: float


+@docs_group('Classes')
 class RenderingTypePredictor(ABC):
     @abstractmethod
     def predict(self, request: Request) -> RenderingTypePrediction:
@@ -42,6 +45,7 @@ def store_result(self, request: Request, rendering_type: RenderingType) -> None:
         """


+@docs_group('Classes')
 class DefaultRenderingTypePredictor(RenderingTypePredictor):
     """Stores rendering type for previously crawled URLs and predicts the rendering type for unvisited urls.

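The abstract `predict` / `store_result` pair shown in this file suggests how a custom predictor could plug in. The sketch below mirrors that shape in a stand-alone form so it runs without crawlee installed. It takes a plain URL string instead of a `Request`; the class names `PredictorSketch` and `LastSeenPerHostPredictor`, the per-host lookup, and the 0.1 / 1.0 probabilities are all assumptions for illustration. It is not the algorithm used by `DefaultRenderingTypePredictor`. The field `detection_probability_recommendation` is read here as "how often the caller should re-check the rendering type", which is also an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Literal
from urllib.parse import urlparse

RenderingType = Literal['static', 'client only']


@dataclass(frozen=True)
class Prediction:
    rendering_type: RenderingType
    detection_probability_recommendation: float


class PredictorSketch(ABC):
    @abstractmethod
    def predict(self, url: str) -> Prediction:
        """Guess the rendering type for a URL."""

    @abstractmethod
    def store_result(self, url: str, rendering_type: RenderingType) -> None:
        """Remember the rendering type that was actually observed."""


class LastSeenPerHostPredictor(PredictorSketch):
    """Toy predictor: reuse the last observed rendering type for the same host."""

    def __init__(self) -> None:
        self._last_seen: dict[str, RenderingType] = {}

    def predict(self, url: str) -> Prediction:
        host = urlparse(url).netloc
        if host in self._last_seen:
            # Known host: trust the stored result, recommend only occasional re-detection.
            return Prediction(self._last_seen[host], 0.1)
        # Unknown host: assume the expensive case and recommend always detecting.
        return Prediction('client only', 1.0)

    def store_result(self, url: str, rendering_type: RenderingType) -> None:
        self._last_seen[urlparse(url).netloc] = rendering_type


predictor = LastSeenPerHostPredictor()
first_guess = predictor.predict('https://example.com/a')
predictor.store_result('https://example.com/a', 'static')
second_guess = predictor.predict('https://example.com/b')
```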
@@ -14,19 +14,20 @@

 from crawlee import Request
 from crawlee.browsers import BrowserPool
-from crawlee.crawlers import BasicCrawler
-from crawlee.crawlers._adaptive_playwright import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext
+from crawlee.crawlers import (
+    AdaptivePlaywrightCrawler,
+    AdaptivePlaywrightCrawlingContext,
+    AdaptivePlaywrightPreNavCrawlingContext,
+    BasicCrawler,
+    RenderingType,
+    RenderingTypePrediction,
+    RenderingTypePredictor,
+)
 from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler_statistics import (
     AdaptivePlaywrightCrawlerStatisticState,
 )
 from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import (
     AdaptiveContextError,
-    AdaptivePlaywrightPreNavCrawlingContext,
 )
-from crawlee.crawlers._adaptive_playwright._rendering_type_predictor import (
-    RenderingType,
-    RenderingTypePrediction,
-    RenderingTypePredictor,
-)
 from crawlee.statistics import Statistics