This repository has been archived by the owner on Jan 4, 2023. It is now read-only.
When the HTTP Archive dataset was expanded in July 2018, new page ids were assigned for the newer URLs. This broke the historical reports, which lost continuity for the URLs that were previously tracked.
@rviscomi and I believe that this can be corrected by mapping the old pages table records to the new pageids. I've assigned this to myself and will look into it.
FWIW I worked around this on my dataset by stripping the protocol from the URL (most of the new sites are https). But I use a more normalised schema with a separate URLs table and am thus not dependent upon the page_id.
Happy to share my changes if they'd be any use.
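For anyone following along, the workaround above might look roughly like the sketch below. This is not the actual change set: `strip_protocol`, `build_url_table` and the `url_id` numbering are hypothetical names made up for illustration, and it assumes URLs differ only in their http/https prefix.

```python
import re

# Matches a leading "http://" or "https://" (case-insensitive).
_SCHEME = re.compile(r"^https?://", re.IGNORECASE)


def strip_protocol(url: str) -> str:
    """Drop the scheme so http:// and https:// variants share one key."""
    return _SCHEME.sub("", url.strip())


def build_url_table(urls):
    """Assign one hypothetical url_id per protocol-less URL, in first-seen order."""
    table = {}
    for url in urls:
        key = strip_protocol(url)
        # setdefault keeps the id already assigned to a previously seen key.
        table.setdefault(key, len(table) + 1)
    return table
```

With a separate URLs table keyed this way, test records can reference the `url_id` rather than relying on a page id that changes between crawls.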
We have a goal in 2019 to reimplement the URL dashboard using BigQuery and Data Studio but we would still have the same continuity issues across URL corpus changes, even as we update the CrUX corpus monthly. Something to keep in mind.
One idea is to group by domain and have a line / table row for each origin. Note that some domains with user-generated content (like wordpress.com) would have a very large number of origins.
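A rough sketch of the grouping idea, purely for discussion. `naive_domain` and `group_origins_by_domain` are hypothetical helpers, and taking the last two host labels is a deliberate simplification: it gives the wrong answer for suffixes such as co.uk, so a real version would need a public suffix list.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def naive_domain(origin: str) -> str:
    """Return the last two host labels (naive: wrong for suffixes like co.uk)."""
    host = urlsplit(origin).hostname or ""
    return ".".join(host.split(".")[-2:])


def group_origins_by_domain(origins):
    """Map each domain to the sorted list of distinct origins seen under it."""
    groups = defaultdict(set)
    for origin in origins:
        groups[naive_domain(origin)].add(origin)
    return {domain: sorted(members) for domain, members in groups.items()}
```

Each key would become one chart or table, with one line or row per origin in its list. Both the http and https origins of a site end up in the same group.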
Just a note: the problem isn't really with the pageids, as these are distinct for each test. The issue is mainly the change from http to https, which breaks the lookup by URL, so a site that was in as http://www.archive.org is considered distinct from https://www.archive.org. I suspect that working directly with the host name in the CrUX dataset would resolve this. This would require some minor schema changes, which MySQL often doesn't take kindly to (adding columns in particular), and there would be more work for the loader and pages. And if the idea is to replace the reports with something derived more directly from BigQuery, then it's probably best to wait for this.
An example of the broken historical reports can be seen at https://legacy.httparchive.org/viewsite.php?pageid=94191763. The legacy report continues to include the latest stats, but its trends now start only in July 2018.