Keep technology detections up to date #70

Closed
rviscomi opened this issue Feb 14, 2022 · 3 comments
@rviscomi
Member

WebPageTest integrates with Wappalyzer to get the list of detection rules and run them through the detection engine during each test. The results are parsed from the HAR and written to the technologies dataset in the Dataflow pipeline. IIUC the biggest challenge is keeping the engine up to date because it needs to be reimplemented for WebPageTest's environment; it's less of an issue to keep the detection rules up to date for each technology since it's a simple JSON schema. Still, WebPageTest needs to periodically check for updates and stay in sync.

When technology detections are outdated or broken, several HTTP Archive dependencies are affected. Many Web Almanac chapters segment by technologies, like JS, CSS, CMS, and Ecommerce. Additionally, the Core Web Vitals Technology Report is a direct visualization layer on top of the output of the detections, so any bugs would be immediately visible there.

Similar to HTTPArchive/data-pipeline#30, the WebPageTest repo can use automation like GitHub Actions and/or Dependabot to keep the rules in sync. But keeping the engine in sync will be much harder because it requires manual integration. At a minimum we need to know when the engine is out of date, using something like a file watcher on the engine's source code. I think it'd be worth connecting with @AliasIO and @tkadlec to brainstorm more reliable ways to keep Wappalyzer and WebPageTest in sync.
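
As a rough illustration of the "know when the engine is out of date" idea, here is a minimal Python sketch that fingerprints a set of engine source files and compares the result against a previously recorded value. The function names and the `engine.js` file name are hypothetical, and how the files are fetched (for example, from a scheduled CI job) is left out. This is an assumption about how such a check could look, not how WebPageTest or Wappalyzer actually do it.

```python
import hashlib
from typing import Dict


def engine_fingerprint(files: Dict[str, bytes]) -> str:
    """Return a stable SHA-256 fingerprint for a set of source files.

    `files` maps a (hypothetical) file path to its raw contents.
    Paths are sorted so the result does not depend on dict ordering.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        # Hash each file separately so file boundaries stay unambiguous.
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


def engine_changed(files: Dict[str, bytes], recorded_fingerprint: str) -> bool:
    """True when the upstream engine source no longer matches the recorded fingerprint."""
    return engine_fingerprint(files) != recorded_fingerprint
```

A scheduled job could store the fingerprint from the last manual integration and open an issue whenever `engine_changed` returns `True`.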

There's also more we can do on the HTTP Archive side to try to catch any anomalies late in the pipeline. While it'd be too late to fix broken detections, this should hopefully alert us to the bugs so that we can get them fixed before the next crawl. One thing we can do is look at a subset of individual pages with known technologies, and assert that they're detected correctly month after month. We could also look at the adoption rates in aggregate and flag anything anomalous like a steep rise or drop. The individual page approach has the benefit of being able to alert us ASAP before the crawl is even complete, but it does require some manual curation and upkeep. Not only can these approaches catch bugs arising from version skew across projects, but they can also help catch bugs in the rules/engine itself.
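
The two checks described above could be sketched roughly as follows in Python. All names, the input shapes (plain dicts of page to technologies, and technology to page count), and the 50% threshold are assumptions for illustration only. In practice the inputs would come from the technologies dataset rather than in-memory dicts.

```python
from typing import Dict, List, Set


def missing_detections(
    expected: Dict[str, Set[str]], detected: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Per page, return the expected technologies that were not detected."""
    missing: Dict[str, Set[str]] = {}
    for page, techs in expected.items():
        gap = techs - detected.get(page, set())
        if gap:
            missing[page] = gap
    return missing


def anomalous_changes(
    previous: Dict[str, int],
    current: Dict[str, int],
    max_relative_change: float = 0.5,
) -> List[str]:
    """Return technologies whose page count rose or fell by more than the threshold."""
    flagged: List[str] = []
    for tech in sorted(previous):
        before = previous[tech]
        if before == 0:
            continue  # no baseline to compare against
        after = current.get(tech, 0)  # a technology that vanished counts as 0
        if abs(after - before) / before > max_relative_change:
            flagged.append(tech)
    return flagged
```

The per-page check can run as soon as the curated pages have been crawled, while the aggregate check needs the counts from two complete crawls.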

@AliasIO

AliasIO commented Feb 14, 2022

You should be able to automatically pull in updates to technology definitions. For API changes, I can create a discussion that anyone can subscribe to and announce changes there.

@max-ostapenko

The automated updates are implemented.
This lets us move the quality and freshness discussion to the technology definitions.

After we implemented the automation, I listed a few actionable insights based on historical analysis. I suggest moving the discussion there.

@rviscomi do you have a list of critical technologies/websites in mind for early detection? We could add those to the Wappalyzer GitHub checks.

max-ostapenko transferred this issue from HTTPArchive/data-pipeline Oct 18, 2024

@tunetheweb
Member

Let's close this.


5 participants