Add SwitchPage #103
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,170 @@ | ||
| .. _layouts: | ||
|
|
||
| =============== | ||
| Webpage layouts | ||
| =============== | ||
|
|
||
| Different webpages may show the same *type* of page, but different *data*. For | ||
| example, in an e-commerce website there are usually many product detail pages, | ||
| each showing data from a different product. | ||
|
|
||
| The code that those webpages have in common is their **webpage layout**. | ||
|
|
||
| Coding for webpage layouts | ||
| ========================== | ||
|
|
||
| Webpage layouts should inform how you organize your data extraction code. | ||
|
|
||
| A good practice to keep your code maintainable is to have a separate :ref:`page | ||
| object class <page-objects>` per webpage layout. | ||
|
|
||
| Trying to support multiple webpage layouts with the same page object class can | ||
| make your class hard to maintain. | ||
|
|
||
|
|
||
| Identifying webpage layouts | ||
| =========================== | ||
|
|
||
| There is no precise way to determine whether 2 webpages have the same or a | ||
| different webpage layout. You must decide based on what you know, and be ready | ||
| to adapt if things change. | ||
|
|
||
| It is also often difficult to identify webpage layouts before you start writing | ||
| extraction code. Completely different webpage layouts can have the same look, | ||
| and very similar webpage layouts can look completely different. | ||
|
|
||
| It can be a good starting point to assume that, for a given combination of | ||
| data type and website, there is going to be a single webpage layout. For | ||
| example, assume that all product pages of a given e-commerce website will have | ||
| the same webpage layout. | ||
|
|
||
| Then, as you write a :ref:`page object class <page-objects>` for that webpage | ||
| layout, you may find out more, and adapt. | ||
|
|
||
| When the same piece of information must be extracted from a different place for | ||
| different webpages, that is a sign that you may be dealing with more than 1 | ||
| webpage layout. For example, if on some webpages the product name is in an | ||
| ``h1`` element, but on some webpages it is in an ``h2`` element, chances are | ||
| there are at least 2 different webpage layouts. | ||
|
|
||
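As a rough illustration of that check, the sketch below uses only the Python standard library to report which heading element a page contains. The function name `guess_layout` and the layout labels are hypothetical. A real page object would more likely rely on CSS or XPath selectors, and the presence of a heading is only a hint, not proof, of a distinct layout.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect the names of all start tags seen in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.tags.add(tag)


def guess_layout(html: str) -> str:
    """Return a hypothetical layout label based on which heading is present."""
    collector = _TagCollector()
    collector.feed(html)
    if "h1" in collector.tags:
        return "h1-layout"
    if "h2" in collector.tags:
        return "h2-layout"
    return "unknown"
```

Running a helper like this over a sample of pages from the same website can show whether the "single layout" assumption still holds.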
| However, whether you continue to work as if everything uses the same webpage | ||
| layout, or split your page object class into 2 page object classes, each | ||
| targeting one of the webpage layouts you have found, is entirely up to you. | ||
|
|
||
| Ask yourself: Is supporting all webpage layout differences making your page | ||
| object class implementation only a few lines of code longer, or is it making it | ||
| an unmaintainable bowl of spaghetti code? | ||
|
|
||
|
|
||
| Mapping webpage layouts | ||
| ======================= | ||
|
|
||
| Once you have written a :ref:`page object class <page-objects>` for a webpage | ||
| layout, you need to make it so that your page object class is used for webpages | ||
| that use that webpage layout. | ||
|
|
||
| URL patterns | ||
| ------------ | ||
|
|
||
| Webpage layouts are often associated with specific URL patterns. For example, all | ||
| the product detail pages of an e-commerce website usually have similar URLs, | ||
| such as ``https://example.com/product/<product ID>``. | ||
|
|
||
| When that is the case, you can :ref:`associate your page object class with the | ||
| corresponding URL pattern <rules-intro>`. | ||
|
|
||
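To make the idea concrete, here is a minimal, standalone sketch of mapping URL patterns to page object classes with regular expressions. It is not how web-poet matches URLs. The registry `URL_RULES`, the function `page_object_for`, and the class names are all hypothetical stand-ins.

```python
import re
from typing import Optional

# Hypothetical registry: compiled URL pattern -> page object class name.
URL_RULES = [
    (re.compile(r"^https://example\.com/product/[^/?#]+/?$"), "ProductPage"),
    (re.compile(r"^https://example\.com/category/[^/?#]+/?$"), "CategoryPage"),
]


def page_object_for(url: str) -> Optional[str]:
    """Return the name of the first page object class whose pattern matches."""
    for pattern, name in URL_RULES:
        if pattern.match(url):
            return name
    return None
```

In practice you would let the framework's own rule mechanism do this matching. The sketch only shows why a stable URL pattern is enough to pick a page object class before any content is downloaded.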
|
|
||
| .. _multi-layout: | ||
|
|
||
| Multi-layout page object classes | ||
| -------------------------------- | ||
|
|
||
| Sometimes it is impossible to know, based on the target URL, which webpage | ||
| layout you are getting. For example, during `A/B testing`_, you could get a | ||
| random webpage layout on every request. | ||
|
|
||
| .. _A/B testing: https://en.wikipedia.org/wiki/A/B_testing | ||
|
|
||
| For these scenarios, we recommend that you create different page object classes | ||
| for the different layouts that you may get, and then write a special | ||
| “multi-layout” page object class, and use it to select the right page object | ||
| class at run time based on the input you receive. | ||
|
|
||
| Your multi-layout page object class should: | ||
|
||
|
|
||
| #. Declare attributes for the input that you will need to determine which page | ||
|    object class to use. | ||
|
|
||
|    For example, declare an :class:`HttpResponse` attribute to select a page | ||
|    object class based on the response content. | ||
|
|
||
| #. Declare an attribute for every page object class that you may use depending | ||
|    on which webpage layout you get from the target website. | ||
|
|
||
|    They should all return the same type of :ref:`item <item-classes>` as your | ||
|    multi-layout page object class. | ||
|
|
||
|    Note that all inputs of all those page object classes will be resolved and | ||
|    requested along with the input of your multi-layout page object class. For | ||
|    example, if one page object class requires browser HTML as input, while | ||
|    another requires an HTTP response, your multi-layout page object class asks | ||
|    for both inputs. | ||
|
|
||
|    If combining different inputs is a problem, consider refactoring your page | ||
|    object classes to require similar inputs. | ||
|
|
||
| #. In its :meth:`~web_poet.pages.ItemPage.to_item` method: | ||
|
|
||
|    #. Determine, based on inputs, which page object to use. | ||
|
|
||
|    #. Return the output of the :meth:`~web_poet.pages.ItemPage.to_item` | ||
|       method of that page object. | ||
|
|
||
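The three steps above can be sketched without any base class. The example below is a dependency-free stand-in, not web-poet code: `FakeResponse`, `_text_between`, and `build_page` are hypothetical helpers, and plain dataclasses take the place of real page object classes so that the selection logic in `to_item` is easy to follow.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Header:
    """Item type returned by every layout."""
    text: str


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response; only holds the body."""
    body: str


def _text_between(body: str, tag: str) -> str:
    """Return the text of the first <tag>...</tag> pair (naive, no nesting)."""
    start = body.index(f"<{tag}>") + len(tag) + 2
    end = body.index(f"</{tag}>", start)
    return body[start:end]


@dataclass
class H1Page:
    response: FakeResponse

    async def to_item(self) -> Header:
        return Header(text=_text_between(self.response.body, "h1"))


@dataclass
class H2Page:
    response: FakeResponse

    async def to_item(self) -> Header:
        return Header(text=_text_between(self.response.body, "h2"))


@dataclass
class HeaderMultiLayoutPage:
    response: FakeResponse  # Step 1: input needed to pick a layout.
    h1: H1Page              # Step 2: one attribute per candidate page object.
    h2: H2Page

    async def to_item(self) -> Header:
        # Step 3: pick a page object from the input, then delegate to it.
        page = self.h1 if "<h1>" in self.response.body else self.h2
        return await page.to_item()


def build_page(body: str) -> HeaderMultiLayoutPage:
    """Wire the dependencies by hand; a framework would normally inject them."""
    response = FakeResponse(body)
    return HeaderMultiLayoutPage(response, H1Page(response), H2Page(response))
```

For instance, `asyncio.run(build_page("<h2>Sub</h2>").to_item())` delegates to the `H2Page` instance, because the body contains no `h1` element.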
| You may use :class:`~web_poet.pages.MultiLayoutPage` as a base class for your | ||
| multi-layout page object class, so you only need to implement the | ||
| :meth:`~web_poet.pages.MultiLayoutPage.layout` method that determines which | ||
| page object to use. For example: | ||
|
|
||
| .. code-block:: python | ||
|
||
|     import attrs | ||
|     from web_poet import HttpResponse, ItemPage, MultiLayoutPage, WebPage, field, handle_urls | ||
|
|
||
|     # Item type shared by every layout. | ||
|     @attrs.define | ||
|     class Header: | ||
|         text: str | ||
|
|
||
|     # Layout where the header text is in an h1 element. | ||
|     @attrs.define | ||
|     class H1Page(WebPage[Header]): | ||
|         @field | ||
|         def text(self) -> str: | ||
|             return self.css("h1::text").get() | ||
|
|
||
|     # Layout where the header text is in an h2 element. | ||
|     @attrs.define | ||
|     class H2Page(WebPage[Header]): | ||
|         @field | ||
|         def text(self) -> str: | ||
|             return self.css("h2::text").get() | ||
|
|
||
|     @handle_urls("example.com") | ||
|     @attrs.define | ||
|     class HeaderMultiLayoutPage(MultiLayoutPage[Header]): | ||
|         response: HttpResponse | ||
|         h1: H1Page | ||
|         h2: H2Page | ||
|
|
||
|         async def layout(self) -> ItemPage[Header]: | ||
|             # Pick the page object that matches the received response. | ||
|             if self.response.css("h1::text"): | ||
|                 return self.h1 | ||
|             return self.h2 | ||
|
|
||
| .. note:: If you use :func:`~web_poet.handle_urls` both for your multi-layout | ||
|           page object class and for any of the page object classes that it | ||
|           uses, you may need to :ref:`grant your multi-layout page object class | ||
|           a higher priority <rules-priority-resolution>`. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
| @@ -68,6 +68,25 @@ async def to_item(self) -> ItemT: | ||
| ) | ||
|
|
||
|
|
||
| class MultiLayoutPage(ItemPage[ItemT]): | ||
|
Member
I wonder if we should support fields for MultiLayoutPage, or not. There is also a use case when you need a "partial" layout, layout of a region of a page. In this case, only some fields are extracted by the layout; others are common. It seems we have a few options:
A separate, but related issue is if it's possible to use 2 or more regions, with different layouts, in the same page object. If fields are supported, it seems it makes sense to move the logic to ItemPage. I think it may simplify typing, and inheritance as well. E.g. layouts can be used with ProductPage from zytedata/zyte-common-items#19 without using multiple inheritance. It seems that if fields are not supported, it's better to keep MultiLayoutPage as a separate class, and probably raise an error or issue a warning if fields are defined. There is one argument for keeping it separate and not supporting fields: it'd allow to define fields named
Sorry for a braindump :) What do you think?
Member
A disadvantage of using MultiLayoutPage without fields: it's not possible to use fields in the code which uses the page object. So let's say we have ProductPage, it uses fields for data extraction. Then, it's refactored to use MultiLayoutPage as a base class. It means that the fields are no longer supported, and so the code which uses this page may break.
Member
Author
I am slightly against field support in MultiLayoutPage. In addition to those 2 options you mention, I can think of a 3rd that is a variation on 1 based precisely on how you amended my API proposal for MultiLayoutPage:
We would still have the same problem as with the lack of fields in MultiLayoutPage, i.e. you could not access the fields of the dependency layout through the layout that uses it (other than accessing the dependency directly, e.g.
Contributor
I'm also not sure about support It would seem it'd be best to keep its task simple wherein it simply identifies and returns the PO instance based on the layout.
Member
Author
This is a valid point, but I think we could switch to
Member
Yeah, this looks nice.
Member
We were discussing it today on a call with @proway2. They implemented a multi-layout page object, tried a few approaches. In short - having fields on the final page object is a must :) That's the reason the documented approach here won't work well for them. Taking union of all dependencies is fine.
Member
Currently they define all the fields, and call to the self._layout in each field. It's a lot of boilerplate; exactly something a library should be solving.
Member
It seems having fields available is required, but not necessarily being able to define top-level fields. |
||
|     """Base class for :ref:`multi-layout page object classes <multi-layout>`. | ||
|
|
||
|     Subclasses must reimplement the :meth:`layout` method. | ||
|     """ | ||
|
|
||
|     @abc.abstractmethod | ||
|     async def layout(self) -> ItemPage[ItemT]: | ||
|         """Return the :ref:`page object <page-objects>` to use based on the | ||
|         received input.""" | ||
|
||
|
|
||
|
||
|     async def to_item(self) -> ItemT: | ||
|         """Return the output of the :meth:`~web_poet.pages.ItemPage.to_item` | ||
|         method of the :ref:`page object <page-objects>` that :meth:`layout` | ||
|         returns.""" | ||
|         page_object = await self.layout() | ||
|         return await page_object.to_item() | ||
|
|
||
|
|
||
| @attr.s(auto_attribs=True) | ||
| class WebPage(ItemPage[ItemT], ResponseShortcutsMixin): | ||
| """Base Page Object which requires :class:`~.HttpResponse` | ||
|
|
||