How do various readable website extractor libraries (i.e., libraries that provide a feature like Reader View in Safari) perform?
This repo exists to provide a way to compare many libraries at once across many pages at once.
Currently the following libraries are implemented:
- mozilla/readability
- cleanview
- metascraper
- @postlight/mercury-parser
- TODO - clean-mark (377 stars)
- TODO - ascrape-js (13 stars)
The latest output from running the comparisons on a set of 16 random pages selected from Hacker News in June 2020 is available on the gh-pages branch (direct link to report).
Based on these comparisons, @awendland intends to use the mozilla/readability project.
Run yarn first to ensure all dependencies are installed. Each command should include --help documentation and produce explanatory output during execution.
Create a newline-delimited list of URLs to fetch and store them in a text file such as test_urls.txt.
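As a minimal sketch of what "newline-delimited" means in practice, the helper below normalizes such a list by trimming whitespace, skipping blank lines, and dropping duplicates. The name parseUrlList is hypothetical and not part of this repo; it may be handy for sanity-checking a list before passing it to the fetch script, and it makes no claim about how that script reads the file.

```typescript
// Hypothetical helper (not part of this repo): normalize a newline-delimited URL list.
function parseUrlList(text: string): string[] {
  return text
    .split(/\r?\n/) // accept both Unix and Windows line endings
    .map((line) => line.trim()) // strip surrounding whitespace
    .filter((line) => line !== "") // skip blank lines
    .filter((line, i, all) => all.indexOf(line) === i) // keep first occurrence only
}

// Placeholder URLs for illustration only.
const urls = parseUrlList(
  "https://example.com/a\n\n  https://example.com/b  \nhttps://example.com/a\n"
)
console.log(urls.length)
```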
Use the fetch-test-pages script to retrieve and save them into a folder such as test_pages/ for report processing.
    yarn scripts:run ./scripts/fetch-test-pages.ts --listOfUrls test_urls.txt --outDir test_pages/ --parallelism 30

The pages will be saved as JSON files containing information such as the source URL and the HTML contents of the page.
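To make the saved-page format concrete, here is a rough sketch of such a record and a defensive parser for it. The type name SavedTestPage and the field names url and html are assumptions chosen for illustration; the files written by fetch-test-pages.ts may use different names and include additional fields.

```typescript
// Hypothetical record shape; the real files may use other field names.
type SavedTestPage = { url: string; html: string }

// Serialize a page to pretty-printed JSON.
function serializeTestPage(page: SavedTestPage): string {
  return JSON.stringify(page, null, 2)
}

// Parse JSON text back into a page, rejecting anything without string url/html.
function parseTestPage(json: string): SavedTestPage {
  const data: unknown = JSON.parse(json)
  if (typeof data !== "object" || data === null) {
    throw new Error("expected a JSON object")
  }
  const { url, html } = data as Record<string, unknown>
  if (typeof url !== "string" || typeof html !== "string") {
    throw new Error("missing url or html")
  }
  return { url, html }
}

// Round trip with placeholder content.
const roundTripped = parseTestPage(
  serializeTestPage({ url: "https://example.com/a", html: "<p>hello</p>" })
)
console.log(roundTripped.url)
```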
Once test pages have been retrieved a report can be generated. The following command would be used to generate a report named report.html from test pages saved in test_pages/.
    yarn scripts:run ./scripts/generate-report.ts --testPages 'test_pages/*.json' --reportFile report.html

Adding a new library to the comparison involves several steps:
- Add the library (and any associated @types/ package) as a project dependency: yarn add LIBRARY_NAME --exact
- Author an adapter for the library in scripts/lib/adapters/adapter-LIBRARY_NAME.ts which conforms to the following type (detailed in scripts/lib/types.ts):

      type Adapter = {
        metadata: AdapterMetadata
        extract(params: ExtractParams): Promise<ExtractedInfo | null>
      }

- Register the adapter in scripts/lib/adapters/index.ts
- Generate a report to make sure that it works
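The adapter step can be sketched roughly as follows. Only the Adapter type itself comes from this README; the fields of AdapterMetadata, ExtractParams, and ExtractedInfo shown here are hypothetical stand-ins (check scripts/lib/types.ts for the real definitions), and naiveTitle is a placeholder for the call into the actual library.

```typescript
// Hypothetical stand-ins for the types in scripts/lib/types.ts; real fields may differ.
type AdapterMetadata = { name: string }
type ExtractParams = { url: string; html: string }
type ExtractedInfo = { title: string | null; contentHtml: string }

// The Adapter type as given in this README.
type Adapter = {
  metadata: AdapterMetadata
  extract(params: ExtractParams): Promise<ExtractedInfo | null>
}

// Placeholder for the wrapped library: pull the <title> text out of the HTML.
function naiveTitle(html: string): string | null {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html)
  return match?.[1]?.trim() ?? null
}

const adapter: Adapter = {
  metadata: { name: "LIBRARY_NAME" },
  async extract({ html }) {
    // Returning null signals that nothing could be extracted.
    if (html.trim() === "") return null
    return { title: naiveTitle(html), contentHtml: html }
  },
}
```

A real adapter would replace naiveTitle with the library's own extraction call and map its result onto whatever ExtractedInfo actually requires.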
