Skip to content

fix: add spam filtering for lists (closes #11905)#12045

Open
shoaib-inamdar wants to merge 2 commits intointernetarchive:masterfrom
shoaib-inamdar:fix/spam-list-filtering-11905
Open

fix: add spam filtering for lists (closes #11905)#12045
shoaib-inamdar wants to merge 2 commits intointernetarchive:masterfrom
shoaib-inamdar:fix/spam-list-filtering-11905

Conversation

@shoaib-inamdar
Copy link

  • Block zero-seed lists from Solr indexing (87% of spam has 0 seeds)
  • Add is_spam_list() to check list names AND descriptions against spam words
  • Add spam guard in lists_edit.POST() to block spam via HTML form
  • Add Solr fq filters to exclude zero-seed and spam-named lists from search
  • Add cleanup script for existing spam lists (scripts/cleanup_spam_lists.py)
  • Add 7 new tests (5 for is_spam_list, 2 for zero-seed Solr behavior)

Based on analysis of 254,050 lists from ol_dump_lists dump:

  • 23,091 zero-seed lists (9.1%)
  • 13,740 high-confidence spam (pattern match + 0 seeds)
  • Top categories: casino/gambling, pharmaceuticals, phone scams

Closes #11905

Adds multi-layered spam filtering for Open Library lists, informed by analysis of 254,050 lists from the ol_dump_lists database dump.

Technical

Testing

  • All 17 focused tests pass in Docker
  • Full CI-style test suite: 2940 passed, 16 skipped, 0 failures
  • Lint (ruff): clean on all modified files

Screenshot

Stakeholders

Copilot AI review requested due to automatic review settings March 8, 2026 10:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds multi-layer spam filtering for Open Library lists by preventing common spam patterns (notably zero-seed lists and spam-word matches) from being created and/or returned via Solr-backed list search.

Changes:

  • Prevents zero-seed lists from being indexed in Solr (deletes any existing Solr doc for such lists).
  • Introduces is_spam_list() for detecting spam words in list name/description and uses it to block spam list creation via the HTML form flow.
  • Adds Solr query filters to exclude zero-seed lists and spam-word list names from /search/lists.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
openlibrary/solr/updater/list.py Adds zero-seed deletion behavior; also changes key matching and list/series typing behavior.
openlibrary/plugins/worksearch/schemes/lists.py Adds Solr fq filters for zero-seed lists and spam-word list names.
openlibrary/plugins/upstream/spamcheck.py Adds is_spam_list() helper for list-specific spam detection.
openlibrary/plugins/openlibrary/tests/test_lists.py Adds tests for is_spam_list() and the zero-seed Solr indexing rule.
openlibrary/plugins/openlibrary/lists.py Blocks spam list creation via the HTML form path using is_spam_list().

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +77 to +80
if spam_words := get_spam_words():
escaped = ' OR '.join(f'"{w}"' for w in spam_words)
params.append(('fq', f'-name:({escaped})'))
return params
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fq built from get_spam_words() is interpolated into a Solr query without escaping. Since spam words are stored data (and can include quotes, parentheses, : etc.), this can produce invalid queries or allow Solr query injection via the spamwords doc. Use the existing Solr escaping helpers (e.g. openlibrary.solr.query_utils.fully_escape_query) and/or ensure values are safely quoted/escaped before embedding them in name:(...).

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +76
# NEW FUNCTION — catches spam in list names AND descriptions
def is_spam_list(name: str, description: str = "") -> bool:
"""
Check if a list's name or description contains spam words.

Our dump analysis found spam in both fields:
- 1,163 casino spam, 1,173 pharma spam, 1,759 phone scams
- Many had innocent names but spammy descriptions
"""
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions a new cleanup script at scripts/cleanup_spam_lists.py, but that file does not appear to be included in the changes/repository. Either add the script (if still intended) or update the PR description so it matches what’s actually being shipped.

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +23
def key_test(self, key: str) -> bool:
return bool(re.match(r'^(/people/[^/]+)?/(lists|series)/[^/]+$', key))
return bool(re.match(r'^(/people/[^/]+)?/lists/[^/]+$', key))
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListSolrUpdater.key_test() no longer matches /series/... keys. This will prevent series documents from being updated in Solr, which can break features that query Solr for series (e.g. the /series/_autocomplete handler uses fq=type:series). If the intent is to keep series searchable, restore matching/indexing for /series/ keys (and adjust the zero-seed logic so it only deletes true lists, not series).

Copilot uses AI. Check for mistakes.
return self._list['key']

@property
def type(self) -> str:
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListSolrBuilder.type is now hard-coded to 'list', which will cause /series/... docs (if/when indexed) to lose their type:series classification in Solr. This is relied on by Solr consumers such as the series autocomplete (/series/_autocomplete uses fq=type:series). Please keep returning 'series' for series keys (and consider keeping list_type if downstream queries depend on it).

Suggested change
def type(self) -> str:
def type(self) -> str:
# Preserve 'series' classification for /series/... docs so Solr consumers
# (e.g., /series/_autocomplete with fq=type:series) continue to work.
if self.key.startswith('/series/'):
return 'series'

Copilot uses AI. Check for mistakes.
@shoaib-inamdar shoaib-inamdar force-pushed the fix/spam-list-filtering-11905 branch from 68a2cd1 to 0ad9ef5 Compare March 8, 2026 11:16
Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned here: #11905 (comment) , only the two changes should be sufficient. We'll need to do a clean-up operation to delete all these bad lists, and we'll likely use solr to help find those lists! So don't want to have the spam checks during search/indexing.



# NEW FUNCTION — catches spam in list names AND descriptions
def is_spam_list(name: str, description: str = "") -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the existing is_spam method ; it'll check all fields in the provided first object. You should be able to do is_spam(list_record.to_thing_json())

Comment on lines +77 to +85
if spam_words := get_spam_words():
# Escape Solr special chars in each word to prevent query injection
import re as _re

_solr_special = _re.compile(r'([+\-&|!(){}[\]^"~*?:\\/])')
safe_words = [_solr_special.sub(r'\\\1', w) for w in spam_words]
escaped = ' OR '.join(f'"{w}"' for w in safe_words)
params.append(('fq', f'-name:({escaped})'))
return params
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preventing spam lists from being created is sufficient. Since we will need to run a cleanup step to delete all those spam lists, we don't want to add this kind of logic to the queries.

Comment on lines +26 to +35
builder = ListSolrBuilder(list)

# NEW: Lists with 0 seeds should not be indexed.
# Our dump analysis found 23,091 zero-seed lists (9.1% of all lists),
# and 87% of detected spam falls in this category.
# Only apply to lists, not series (which may intentionally have 0 seeds).
if not builder.seed and not builder.key.startswith('/series/'):
return SolrUpdateRequest(deletes=[builder.key]), []

seeds = builder.seed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here ; the preventing save is sufficient. This can be rolled back.

('q.op', 'AND'),
('defType', 'edismax'),
# Exclude zero-seed lists (23,091 in dump, 87% of spam)
('fq', 'seed_count:[1 TO *]'),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let's set this only if seed_count isn't in q. That way we can override it.

Also, let's make this: seed_count:[2 TO *] OR list_type:series. User lists with only one book aren't much of a list! But there are some series, eg new series, that only have one book on them.

@shoaib-inamdar
Copy link
Author

@cdrini , aah sorry, i just got excited and implemented all the changes 😅.

- Add spam check in lists_edit.POST() using existing is_spam() method
  Checks list name + description against spamwords store before saving
- Add default fq filter: seed_count:[2 TO *] OR list_type:series
  Excludes low-seed lists from search results (overridable via query)
  Series are exempt since new series may have only one book
- Fix import ordering in lists.py (pre-existing lint I001)
@shoaib-inamdar shoaib-inamdar force-pushed the fix/spam-list-filtering-11905 branch from 168c20f to 340f886 Compare March 8, 2026 18:44
@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Mar 9, 2026
@shoaib-inamdar shoaib-inamdar requested a review from cdrini March 10, 2026 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Needs: Response Issues which require feedback from lead

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spam lists should be prevented / removed

4 participants