Exclude contact cards (and other non-editable blocks) from HIX #3281

Open
wants to merge 1 commit into base: develop

charludo
Contributor

@charludo charludo commented Dec 9, 2024

Short description

The contact cards introduced in #3169 have a severe negative impact on the HIX score. This PR excludes them from the HIX calculations.

Proposed changes

  • before sending text to Textlab, remove all divs with contenteditable="false".
  • this currently only affects contacts, but makes it simple to exclude other blocks in the future, should the need arise
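The filtering step described in the list above can be sketched roughly as follows. This is a minimal illustration using only Python's standard-library `html.parser`, not the code from this PR (which works on the parsed HTML tree and `html.text_content()`). The class and function names are hypothetical, and the sketch assumes well-formed markup in which every non-void element has an explicit closing tag.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class EditableTextExtractor(HTMLParser):
    """Collect text, skipping every <div contenteditable="false"> subtree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside an excluded div

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        if self.skip_depth:
            # Already inside an excluded block: track nesting only.
            self.skip_depth += 1
        elif tag == "div" and dict(attrs).get("contenteditable") == "false":
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def text_without_non_editable_blocks(html):
    """Hypothetical helper: the text that would be sent for HIX scoring."""
    parser = EditableTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = (
    '<p>Hello</p>'
    '<div contenteditable="false"><div><a href="#">Jane Doe</a><br>+49 123</div></div>'
    '<p>World</p>'
)
print(text_without_non_editable_blocks(page))
```

Because the check only looks at the `contenteditable` attribute, any future non-editable block would be excluded the same way without further changes.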

Side effects

  • html.text_content() was previously only used to check for empty pages. For convenience, I have changed the code to send the result of this operation to Textlab instead of the raw HTML. I am uncertain, though, whether there was a decision against this in the past when you added the code in question. @david-venhoff - are my changes OK?

Resolved issues

Fixes: #3268


Pull Request Review Guidelines

@david-venhoff
Member

> @david-venhoff - are my changes OK?

I fear that this might change the hix scores of pages again. The last time we did this our service team had quite a nightmare dealing with municipalities that suddenly could not translate their pages anymore because the hix score was too low.
If we do this, we should probably at least do some tests that this change does not decrease the hix score in comparison to right now.
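A check along these lines could look like the sketch below. It assumes the HIX scores of the same pages have already been collected once with the current code and once with the changed code. The function name, the page identifiers, the scores, and the threshold value are all made up for illustration; the actual minimum score required for translation is not stated in this thread, so it is passed in explicitly.

```python
def find_hix_regressions(scores_before, scores_after, threshold):
    """Compare per-page HIX scores from before and after a change.

    Returns two lists of page ids:
    - pages whose score decreased at all
    - pages that were at or above `threshold` before and are below it now
    """
    decreased, newly_below = [], []
    for page_id, before in scores_before.items():
        after = scores_after.get(page_id)
        if after is None:
            continue  # page was not re-scored, nothing to compare
        if after < before:
            decreased.append(page_id)
            if before >= threshold > after:
                newly_below.append(page_id)
    return decreased, newly_below


# Made-up example data, not real measurements.
before = {"page-a": 16.0, "page-b": 15.2, "page-c": 12.0}
after = {"page-a": 16.5, "page-b": 14.8, "page-c": 11.0}
decreased, newly_below = find_hix_regressions(before, after, threshold=15)
print(decreased, newly_below)
```

The second list is the one that matters most for the scenario described above, since it contains the pages that could be translated before the change and could not afterwards.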

@JoeyStk JoeyStk self-assigned this Jan 5, 2025
Contributor

@JoeyStk JoeyStk left a comment

I don't have a lot of context about this piece of code. I hope @david-venhoff knows more about it. From what I can see, this piece of code looks good and the logic itself makes sense, but I might not think of all possible side effects :/

@charludo
Contributor Author

charludo commented Jan 7, 2025

> @david-venhoff - are my changes OK?
>
> I fear that this might change the hix scores of pages again. The last time we did this our service team had quite a nightmare dealing with municipalities that suddenly could not translate their pages anymore because the hix score was too low. If we do this, we should probably at least do some tests that this change does not decrease the hix score in comparison to right now.

That's a very valid concern.

I'm not sure how else to progress though, tbh. Maybe passing tostring(html) would be the better option, but that is not guaranteed to achieve text == tostring(fromstring(text)), afaik.

Removing the divs in question without parsing the HTML is an entirely different can of worms.

Frankly I don't think we have a choice but to risk slightly changing HIX scores :( (With prior testing still, of course)
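The round-trip concern in this comment (a parse followed by a serialize does not necessarily reproduce the input text) can be illustrated with a toy example. The sketch below does not use the `fromstring`/`tostring` functions mentioned in the comment. It is a stand-in built on the standard-library `html.parser` with a deliberately simple serializer, and `Reserializer` and `roundtrip` are hypothetical names. It only shows the general effect: equivalent markup can come back in a different textual form.

```python
from html import escape
from html.parser import HTMLParser


class Reserializer(HTMLParser):
    """Toy serializer: re-emit the markup from the parser's events."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # The parser has already lower-cased names and stripped the quotes.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def roundtrip(html):
    """Parse `html` and serialize it again."""
    parser = Reserializer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


original = "<P CLASS='a'>Hi</P>"
print(roundtrip(original) == original)
```

Markup that is already in the serializer's preferred form survives unchanged, while markup that merely means the same thing is rewritten. That is why sending re-serialized HTML could alter the scored input even for pages without contact cards.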

@MizukiTemma added the "deadline" label (Needs to be fixed in the given time) Jan 7, 2025
@charludo force-pushed the fix/exclude-contacts-from-hix branch 4 times, most recently from 51c3185 to 7c9d876, January 11, 2025 07:59
@charludo
Contributor Author

I think we need @osmers to chime in on this 😅
In short: we must remove contact cards, otherwise the HIX score worsens a lot. But the only way to do so reliably can lead to very slight changes in the HTML content we send to Textlab, and so could result in slightly changed HIX scores compared to right now, including for pages that do not contain any contact cards. I really cannot say whether that change in score would be an increase or a decrease.

Labels
deadline Needs to be fixed in the given time
Development

Successfully merging this pull request may close these issues.

Exclude contact card from HIX score calculation
5 participants