fix: add spam filtering for lists (closes #11905)#12045
fix: add spam filtering for lists (closes #11905)#12045shoaib-inamdar wants to merge 2 commits intointernetarchive:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds multi-layer spam filtering for Open Library lists by preventing common spam patterns (notably zero-seed lists and spam-word matches) from being created and/or returned via Solr-backed list search.
Changes:
- Prevents zero-seed lists from being indexed in Solr (deletes any existing Solr doc for such lists).
- Introduces
is_spam_list()for detecting spam words in list name/description and uses it to block spam list creation via the HTML form flow. - Adds Solr query filters to exclude zero-seed lists and spam-word list names from
/search/lists.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| openlibrary/solr/updater/list.py | Adds zero-seed deletion behavior; also changes key matching and list/series typing behavior. |
| openlibrary/plugins/worksearch/schemes/lists.py | Adds Solr fq filters for zero-seed lists and spam-word list names. |
| openlibrary/plugins/upstream/spamcheck.py | Adds is_spam_list() helper for list-specific spam detection. |
| openlibrary/plugins/openlibrary/tests/test_lists.py | Adds tests for is_spam_list() and the zero-seed Solr indexing rule. |
| openlibrary/plugins/openlibrary/lists.py | Blocks spam list creation via the HTML form path using is_spam_list(). |
You can also share your feedback on Copilot code review. Take the survey.
| if spam_words := get_spam_words(): | ||
| escaped = ' OR '.join(f'"{w}"' for w in spam_words) | ||
| params.append(('fq', f'-name:({escaped})')) | ||
| return params |
There was a problem hiding this comment.
The fq built from get_spam_words() is interpolated into a Solr query without escaping. Since spam words are stored data (and can include quotes, parentheses, : etc.), this can produce invalid queries or allow Solr query injection via the spamwords doc. Use the existing Solr escaping helpers (e.g. openlibrary.solr.query_utils.fully_escape_query) and/or ensure values are safely quoted/escaped before embedding them in name:(...).
| # NEW FUNCTION — catches spam in list names AND descriptions | ||
| def is_spam_list(name: str, description: str = "") -> bool: | ||
| """ | ||
| Check if a list's name or description contains spam words. | ||
|
|
||
| Our dump analysis found spam in both fields: | ||
| - 1,163 casino spam, 1,173 pharma spam, 1,759 phone scams | ||
| - Many had innocent names but spammy descriptions | ||
| """ |
There was a problem hiding this comment.
The PR description mentions a new cleanup script at scripts/cleanup_spam_lists.py, but that file does not appear to be included in the changes/repository. Either add the script (if still intended) or update the PR description so it matches what’s actually being shipped.
openlibrary/solr/updater/list.py
Outdated
| def key_test(self, key: str) -> bool: | ||
| return bool(re.match(r'^(/people/[^/]+)?/(lists|series)/[^/]+$', key)) | ||
| return bool(re.match(r'^(/people/[^/]+)?/lists/[^/]+$', key)) |
There was a problem hiding this comment.
ListSolrUpdater.key_test() no longer matches /series/... keys. This will prevent series documents from being updated in Solr, which can break features that query Solr for series (e.g. the /series/_autocomplete handler uses fq=type:series). If the intent is to keep series searchable, restore matching/indexing for /series/ keys (and adjust the zero-seed logic so it only deletes true lists, not series).
| return self._list['key'] | ||
|
|
||
| @property | ||
| def type(self) -> str: |
There was a problem hiding this comment.
ListSolrBuilder.type is now hard-coded to 'list', which will cause /series/... docs (if/when indexed) to lose their type:series classification in Solr. This is relied on by Solr consumers such as the series autocomplete (/series/_autocomplete uses fq=type:series). Please keep returning 'series' for series keys (and consider keeping list_type if downstream queries depend on it).
| def type(self) -> str: | |
| def type(self) -> str: | |
| # Preserve 'series' classification for /series/... docs so Solr consumers | |
| # (e.g., /series/_autocomplete with fq=type:series) continue to work. | |
| if self.key.startswith('/series/'): | |
| return 'series' |
68a2cd1 to
0ad9ef5
Compare
cdrini
left a comment
There was a problem hiding this comment.
As I mentioned here: #11905 (comment) , only the two changes should be sufficient. We'll need to do a clean-up operation to delete all these bad lists, and we'll likely use solr to help find those lists! So don't want to have the spam checks during search/indexing.
|
|
||
|
|
||
| # NEW FUNCTION — catches spam in list names AND descriptions | ||
| def is_spam_list(name: str, description: str = "") -> bool: |
There was a problem hiding this comment.
We can use the existing is_spam method ; it'll check all fields in the provided first object. You should be able to do is_spam(list_record.to_thing_json())
| if spam_words := get_spam_words(): | ||
| # Escape Solr special chars in each word to prevent query injection | ||
| import re as _re | ||
|
|
||
| _solr_special = _re.compile(r'([+\-&|!(){}[\]^"~*?:\\/])') | ||
| safe_words = [_solr_special.sub(r'\\\1', w) for w in spam_words] | ||
| escaped = ' OR '.join(f'"{w}"' for w in safe_words) | ||
| params.append(('fq', f'-name:({escaped})')) | ||
| return params |
There was a problem hiding this comment.
Preventing spam lists from being created is sufficient. Since we will need to run a cleanup step to delete all those spam lists, we don't want to add this kind of logic to the queries.
openlibrary/solr/updater/list.py
Outdated
| builder = ListSolrBuilder(list) | ||
|
|
||
| # NEW: Lists with 0 seeds should not be indexed. | ||
| # Our dump analysis found 23,091 zero-seed lists (9.1% of all lists), | ||
| # and 87% of detected spam falls in this category. | ||
| # Only apply to lists, not series (which may intentionally have 0 seeds). | ||
| if not builder.seed and not builder.key.startswith('/series/'): | ||
| return SolrUpdateRequest(deletes=[builder.key]), [] | ||
|
|
||
| seeds = builder.seed |
There was a problem hiding this comment.
Ditto here ; the preventing save is sufficient. This can be rolled back.
| ('q.op', 'AND'), | ||
| ('defType', 'edismax'), | ||
| # Exclude zero-seed lists (23,091 in dump, 87% of spam) | ||
| ('fq', 'seed_count:[1 TO *]'), |
There was a problem hiding this comment.
Actually, let's set this only if seed_count isn't in q. That way we can override it.
Also, let's make this: seed_count:[2 TO *] OR list_type:series. User lists with only one book aren't much of a list! But there are some series, eg new series, that only have one book on them.
|
@cdrini , aah sorry, i just got excited and implemented all the changes 😅. |
- Add spam check in lists_edit.POST() using existing is_spam() method Checks list name + description against spamwords store before saving - Add default fq filter: seed_count:[2 TO *] OR list_type:series Excludes low-seed lists from search results (overridable via query) Series are exempt since new series may have only one book - Fix import ordering in lists.py (pre-existing lint I001)
168c20f to
340f886
Compare
for more information, see https://pre-commit.ci
Based on analysis of 254,050 lists from ol_dump_lists dump:
Closes #11905
Adds multi-layered spam filtering for Open Library lists, informed by analysis of 254,050 lists from the
ol_dump_listsdatabase dump.Technical
Testing
Screenshot
Stakeholders