Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix!: Refactor service usage to rely on `service_locator` (#691)
### Description

- The `service_container` module has been completely refactored. Its usage has changed, which results in many changes across the code base.
- While it remains possible to pass "services" directly to components as before, components now rely on the `service_container` internally. They no longer store service instances themselves.
- A new `force` flag has been added to the `service_container`'s setters, which is especially useful for testing.
- This also considerably simplifies the `memory_storage_client`.
- We now have only `set_storage_client`, the same approach as for the `event_manager`. This is more flexible (it allows more environments than just local and cloud). In the SDK Actor, we can set them based on `is_at_home`.
- This is a breaking change, but it affects only the `service_container` interface.

### Open discussion

- [x] Should we go further and remove the option to pass configuration, event managers, or storage clients directly to components, requiring them to be set exclusively via the `service_container`? Thoughts are welcome.
  - No
- [x] Better name for `service_container`? `service_locator`?
  - `service_locator`

### Issues

- Closes: #699
- Closes: #539
- Closes: #369
- It also unlocks:
  - #670,
  - and maybe apify/apify-sdk-python#324 (comment).

### Testing

- Existing tests, including those covering the `service_container`, have been updated to reflect these changes.
- New tests verify that the `MemoryStorageClient` respects the `Configuration`.

### Manual reproduction

- This code snippet demonstrates that the `Crawler` and `MemoryStorageClient` respect the custom `Configuration`.
- Note: Some fields remain non-working for now. These will be addressed in a future PR, as this refactor is already quite big. However, with the new architecture, those updates should now be easy.

```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
from crawlee.service_container import set_configuration


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    set_configuration(config)

    # or Crawler(config=config)
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Checklist

- [x] CI passed
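
The description says components still accept services directly but no longer store them, relying on the locator internally. One plausible reading of that pattern is sketched below. Everything in it is a hypothetical stand-in written for illustration (`Configuration`, `ServiceLocator`, `Crawler`, `should_persist`); none of it is Crawlee's real code, and the simplified setter deliberately omits any conflict check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Configuration:
    """Hypothetical stand-in for a configuration object."""

    persist_storage: bool = True


class ServiceLocator:
    """Simplified stand-in: holds one service and creates a default lazily."""

    def __init__(self) -> None:
        self._configuration: Optional[Configuration] = None

    def get_configuration(self) -> Configuration:
        # Create the default only when somebody first asks for it.
        if self._configuration is None:
            self._configuration = Configuration()
        return self._configuration

    def set_configuration(self, configuration: Configuration) -> None:
        # No conflict detection here; this sketch only shows the lookup flow.
        self._configuration = configuration


service_locator = ServiceLocator()


class Crawler:
    """Hypothetical component that never stores a service instance itself."""

    def __init__(self, configuration: Optional[Configuration] = None) -> None:
        # A directly passed service is registered in the locator, not kept here.
        if configuration is not None:
            service_locator.set_configuration(configuration)

    def should_persist(self) -> bool:
        # The service is looked up at the point of use.
        return service_locator.get_configuration().persist_storage


first = Crawler()
print(first.should_persist())  # True: the lazily created default is used

Crawler(configuration=Configuration(persist_storage=False))
print(first.should_persist())  # False: both components read the same locator
```

Because the lookup happens at call time, a service registered later is visible to components created earlier. That is the trade-off of a shared locator compared with per-instance storage.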
Showing 34 changed files with 692 additions and 657 deletions.

    @@ -13,6 +13,9 @@ __pycache__
     # Poetry
     poetry.toml
     
    +# Other Python tools
    +.ropeproject
    +
     # Mise
     mise.toml
     .mise.toml


    @@ -1,9 +1,10 @@
     from importlib import metadata
     
     from ._request import Request
    +from ._service_locator import service_locator
     from ._types import ConcurrencySettings, EnqueueStrategy, HttpHeaders
     from ._utils.globs import Glob
     
     __version__ = metadata.version('crawlee')
     
    -__all__ = ['ConcurrencySettings', 'EnqueueStrategy', 'Glob', 'HttpHeaders', 'Request']
    +__all__ = ['ConcurrencySettings', 'EnqueueStrategy', 'Glob', 'HttpHeaders', 'Request', 'service_locator']


    @@ -0,0 +1,98 @@
    +from __future__ import annotations
    +
    +from crawlee._utils.docs import docs_group
    +from crawlee.base_storage_client._base_storage_client import BaseStorageClient
    +from crawlee.configuration import Configuration
    +from crawlee.errors import ServiceConflictError
    +from crawlee.events._event_manager import EventManager
    +
    +
    +@docs_group('Classes')
    +class ServiceLocator:
    +    """Service locator for managing the services used by Crawlee.
    +
    +    All services are initialized to its default value lazily.
    +    """
    +
    +    def __init__(self) -> None:
    +        self._configuration: Configuration | None = None
    +        self._event_manager: EventManager | None = None
    +        self._storage_client: BaseStorageClient | None = None
    +
    +        # Flags to check if the services were already set.
    +        self._configuration_was_set = False
    +        self._event_manager_was_set = False
    +        self._storage_client_was_set = False
    +
    +    def get_configuration(self) -> Configuration:
    +        """Get the configuration."""
    +        if self._configuration is None:
    +            self._configuration = Configuration()
    +
    +        return self._configuration
    +
    +    def set_configuration(self, configuration: Configuration) -> None:
    +        """Set the configuration.
    +
    +        Args:
    +            configuration: The configuration to set.
    +
    +        Raises:
    +            ServiceConflictError: If the configuration was already set.
    +        """
    +        if self._configuration_was_set:
    +            raise ServiceConflictError(Configuration, configuration, self._configuration)
    +
    +        self._configuration = configuration
    +        self._configuration_was_set = True
    +
    +    def get_event_manager(self) -> EventManager:
    +        """Get the event manager."""
    +        if self._event_manager is None:
    +            from crawlee.events import LocalEventManager
    +
    +            self._event_manager = LocalEventManager()
    +
    +        return self._event_manager
    +
    +    def set_event_manager(self, event_manager: EventManager) -> None:
    +        """Set the event manager.
    +
    +        Args:
    +            event_manager: The event manager to set.
    +
    +        Raises:
    +            ServiceConflictError: If the event manager was already set.
    +        """
    +        if self._event_manager_was_set:
    +            raise ServiceConflictError(EventManager, event_manager, self._event_manager)
    +
    +        self._event_manager = event_manager
    +        self._event_manager_was_set = True
    +
    +    def get_storage_client(self) -> BaseStorageClient:
    +        """Get the storage client."""
    +        if self._storage_client is None:
    +            from crawlee.memory_storage_client import MemoryStorageClient
    +
    +            self._storage_client = MemoryStorageClient.from_config()
    +
    +        return self._storage_client
    +
    +    def set_storage_client(self, storage_client: BaseStorageClient) -> None:
    +        """Set the storage client.
    +
    +        Args:
    +            storage_client: The storage client to set.
    +
    +        Raises:
    +            ServiceConflictError: If the storage client was already set.
    +        """
    +        if self._storage_client_was_set:
    +            raise ServiceConflictError(BaseStorageClient, storage_client, self._storage_client)
    +
    +        self._storage_client = storage_client
    +        self._storage_client_was_set = True
    +
    +
    +service_locator = ServiceLocator()
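
The getter/setter pairs in this file combine two behaviours: a lazily created default and a guard against setting the same service twice. The sketch below reproduces that logic for a single service so it can be run without Crawlee installed. `MiniLocator` and the stub `Configuration` and `ServiceConflictError` classes are hypothetical stand-ins; the real `ServiceConflictError` takes different constructor arguments, as the diff shows.

```python
from typing import Optional


class ServiceConflictError(RuntimeError):
    """Hypothetical stand-in for the error raised on a second set."""


class Configuration:
    """Hypothetical stand-in for the configuration service."""


class MiniLocator:
    """Single-service reduction of the getter/setter logic in the diff."""

    def __init__(self) -> None:
        self._configuration: Optional[Configuration] = None
        self._configuration_was_set = False

    def get_configuration(self) -> Configuration:
        # Lazy default: created on first access, then cached.
        if self._configuration is None:
            self._configuration = Configuration()
        return self._configuration

    def set_configuration(self, configuration: Configuration) -> None:
        # Only an earlier explicit set blocks this call.
        if self._configuration_was_set:
            raise ServiceConflictError('Configuration was already set.')
        self._configuration = configuration
        self._configuration_was_set = True


locator = MiniLocator()

# The default is created once and reused.
default = locator.get_configuration()
assert locator.get_configuration() is default

# The getter does not flip the flag, so the first explicit set still succeeds.
custom = Configuration()
locator.set_configuration(custom)
assert locator.get_configuration() is custom

# A second explicit set is rejected.
try:
    locator.set_configuration(Configuration())
except ServiceConflictError:
    second_set_rejected = True
else:
    second_set_rejected = False
```

One consequence of the logic as written in this diff is that a default handed out by the getter can later be replaced by the first explicit set, so callers that cached the default would keep the old instance.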