fix: add spam filtering for lists (closes #11905) by shoaib-inamdar · Pull Request #12045 · internetarchive/openlibrary

shoaib-inamdar · 2026-03-08T10:53:38Z

Block zero-seed lists from Solr indexing (87% of spam has 0 seeds)
Add is_spam_list() to check list names AND descriptions against spam words
Add spam guard in lists_edit.POST() to block spam via HTML form
Add Solr fq filters to exclude zero-seed and spam-named lists from search
Add cleanup script for existing spam lists (scripts/cleanup_spam_lists.py)
Add 7 new tests (5 for is_spam_list, 2 for zero-seed Solr behavior)

Based on analysis of 254,050 lists from ol_dump_lists dump:

23,091 zero-seed lists (9.1%)
13,740 high-confidence spam (pattern match + 0 seeds)
Top categories: casino/gambling, pharmaceuticals, phone scams

Closes #11905

Adds multi-layered spam filtering for Open Library lists, informed by analysis of 254,050 lists from the ol_dump_lists database dump.

Technical

Testing

All 17 focused tests pass in Docker
Full CI-style test suite: 2940 passed, 16 skipped, 0 failures
Lint (ruff): clean on all modified files

Screenshot

Stakeholders

Copilot

Pull request overview

Adds multi-layer spam filtering for Open Library lists by preventing common spam patterns (notably zero-seed lists and spam-word matches) from being created and/or returned via Solr-backed list search.

Changes:

Prevents zero-seed lists from being indexed in Solr (deletes any existing Solr doc for such lists).
Introduces is_spam_list() for detecting spam words in list name/description and uses it to block spam list creation via the HTML form flow.
Adds Solr query filters to exclude zero-seed lists and spam-word list names from /search/lists.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `openlibrary/solr/updater/list.py` | Adds zero-seed deletion behavior; also changes key matching and list/series typing behavior. |
| `openlibrary/plugins/worksearch/schemes/lists.py` | Adds Solr `fq` filters for zero-seed lists and spam-word list names. |
| `openlibrary/plugins/upstream/spamcheck.py` | Adds `is_spam_list()` helper for list-specific spam detection. |
| `openlibrary/plugins/openlibrary/tests/test_lists.py` | Adds tests for `is_spam_list()` and the zero-seed Solr indexing rule. |
| `openlibrary/plugins/openlibrary/lists.py` | Blocks spam list creation via the HTML form path using `is_spam_list()`. |


Copilot · 2026-03-08T10:59:17Z

openlibrary/plugins/worksearch/schemes/lists.py

+        if spam_words := get_spam_words():
+            escaped = ' OR '.join(f'"{w}"' for w in spam_words)
+            params.append(('fq', f'-name:({escaped})'))
+        return params


The fq built from get_spam_words() is interpolated into a Solr query without escaping. Since spam words are stored data (and can include quotes, parentheses, : etc.), this can produce invalid queries or allow Solr query injection via the spamwords doc. Use the existing Solr escaping helpers (e.g. openlibrary.solr.query_utils.fully_escape_query) and/or ensure values are safely quoted/escaped before embedding them in name:(...).
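As a rough illustration of the escaping concern, here is a minimal, self-contained sketch. The helper names `escape_solr_term` and `build_spam_name_fq` are hypothetical and are not part of Open Library; the sketch deliberately avoids calling `fully_escape_query`, whose signature is not shown in this thread. It conservatively backslash-escapes every character that is special in the standard Solr query syntax before quoting each word.

```python
from __future__ import annotations

import re

# Characters with special meaning in the Lucene/Solr standard query syntax.
_SOLR_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_solr_term(term: str) -> str:
    """Backslash-escape every Solr special character in ``term``."""
    return _SOLR_SPECIAL.sub(r'\\\1', term)


def build_spam_name_fq(spam_words: list[str]) -> str | None:
    """Return a negative ``fq`` on ``name``, or None when there are no words.

    Hypothetical helper: each word is escaped, then wrapped in quotes so it
    is matched as a phrase.
    """
    phrases = [f'"{escape_solr_term(w)}"' for w in spam_words if w.strip()]
    if not phrases:
        return None
    return f"-name:({' OR '.join(phrases)})"
```

With this shape, a stored word containing a quote or parenthesis cannot close the phrase or the group early, because those characters arrive in the query already escaped.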

Copilot · 2026-03-08T10:59:18Z

openlibrary/plugins/upstream/spamcheck.py

+# NEW FUNCTION — catches spam in list names AND descriptions
+def is_spam_list(name: str, description: str = "") -> bool:
+    """
+    Check if a list's name or description contains spam words.
+
+    Our dump analysis found spam in both fields:
+    - 1,163 casino spam, 1,173 pharma spam, 1,759 phone scams
+    - Many had innocent names but spammy descriptions
+    """


The PR description mentions a new cleanup script at scripts/cleanup_spam_lists.py, but that file does not appear to be included in the changes/repository. Either add the script (if still intended) or update the PR description so it matches what’s actually being shipped.

Copilot · 2026-03-08T10:59:18Z

openlibrary/solr/updater/list.py

    def key_test(self, key: str) -> bool:
-        return bool(re.match(r'^(/people/[^/]+)?/(lists|series)/[^/]+$', key))
+        return bool(re.match(r'^(/people/[^/]+)?/lists/[^/]+$', key))


ListSolrUpdater.key_test() no longer matches /series/... keys. This will prevent series documents from being updated in Solr, which can break features that query Solr for series (e.g. the /series/_autocomplete handler uses fq=type:series). If the intent is to keep series searchable, restore matching/indexing for /series/ keys (and adjust the zero-seed logic so it only deletes true lists, not series).
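The difference between the two patterns can be checked directly. The sketch below copies both regular expressions from the diff above; the sample keys are made up for illustration.

```python
import re

# Pattern before the change (matches lists and series).
OLD_KEY_RE = re.compile(r'^(/people/[^/]+)?/(lists|series)/[^/]+$')
# Pattern after the change (matches lists only).
NEW_KEY_RE = re.compile(r'^(/people/[^/]+)?/lists/[^/]+$')

# Hypothetical keys, chosen only to exercise each branch of the pattern.
sample_keys = ['/people/alice/lists/OL1L', '/lists/OL2L', '/series/OL3L']

for key in sample_keys:
    old_match = bool(OLD_KEY_RE.match(key))
    new_match = bool(NEW_KEY_RE.match(key))
    print(f'{key}: old={old_match} new={new_match}')
```

Only the `/series/...` key changes from matching to not matching, which is the regression described in the comment.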

Copilot · 2026-03-08T10:59:18Z

openlibrary/solr/updater/list.py

        return self._list['key']

    @property
    def type(self) -> str:


ListSolrBuilder.type is now hard-coded to 'list', which will cause /series/... docs (if/when indexed) to lose their type:series classification in Solr. This is relied on by Solr consumers such as the series autocomplete (/series/_autocomplete uses fq=type:series). Please keep returning 'series' for series keys (and consider keeping list_type if downstream queries depend on it).

Suggested change

-    def type(self) -> str:
+    def type(self) -> str:
+        # Preserve 'series' classification for /series/... docs so Solr consumers
+        # (e.g., /series/_autocomplete with fq=type:series) continue to work.
+        if self.key.startswith('/series/'):
+            return 'series'

cdrini

As I mentioned in #11905 (comment), only the two changes should be sufficient. We'll need to do a clean-up operation to delete all these bad lists, and we'll likely use Solr to help find those lists! So we don't want to have the spam checks during search/indexing.

cdrini · 2026-03-08T16:28:18Z

openlibrary/plugins/upstream/spamcheck.py

+
+
+# NEW FUNCTION — catches spam in list names AND descriptions
+def is_spam_list(name: str, description: str = "") -> bool:


We can use the existing is_spam method; it'll check all fields in the provided first object. You should be able to do is_spam(list_record.to_thing_json()).
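To illustrate the idea of checking every field of a record rather than only name and description, here is a simplified stand-in. It is not Open Library's `is_spam` and makes no claim about how that function works; `iter_strings`, `record_contains_spam`, and the sample record shape are all assumptions made for this sketch.

```python
import re
from typing import Any, Iterator


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested anywhere inside dicts, lists, or tuples."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def record_contains_spam(record: dict, spam_words: list[str]) -> bool:
    """Return True if any string field contains any spam word (case-insensitive)."""
    words = [w for w in spam_words if w]
    if not words:
        return False
    # re.escape keeps stored words from being interpreted as regex syntax.
    pattern = re.compile('|'.join(re.escape(w) for w in words), re.IGNORECASE)
    return any(pattern.search(text) for text in iter_strings(record))
```

Because the walk is generic, a spammy description nested under an innocent name is caught without a list-specific helper.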

cdrini · 2026-03-08T16:30:33Z

openlibrary/plugins/worksearch/schemes/lists.py

+        if spam_words := get_spam_words():
+            # Escape Solr special chars in each word to prevent query injection
+            import re as _re
+
+            _solr_special = _re.compile(r'([+\-&|!(){}[\]^"~*?:\\/])')
+            safe_words = [_solr_special.sub(r'\\\1', w) for w in spam_words]
+            escaped = ' OR '.join(f'"{w}"' for w in safe_words)
+            params.append(('fq', f'-name:({escaped})'))
+        return params


Preventing spam lists from being created is sufficient. Since we will need to run a cleanup step to delete all those spam lists, we don't want to add this kind of logic to the queries.

cdrini · 2026-03-08T16:31:15Z

openlibrary/solr/updater/list.py

+        builder = ListSolrBuilder(list)
+
+        # NEW: Lists with 0 seeds should not be indexed.
+        # Our dump analysis found 23,091 zero-seed lists (9.1% of all lists),
+        # and 87% of detected spam falls in this category.
+        # Only apply to lists, not series (which may intentionally have 0 seeds).
+        if not builder.seed and not builder.key.startswith('/series/'):
+            return SolrUpdateRequest(deletes=[builder.key]), []
+
+        seeds = builder.seed


Ditto here; preventing the save is sufficient. This can be rolled back.

cdrini · 2026-03-08T16:56:26Z

openlibrary/plugins/worksearch/schemes/lists.py

+            ('q.op', 'AND'),
+            ('defType', 'edismax'),
+            # Exclude zero-seed lists (23,091 in dump, 87% of spam)
+            ('fq', 'seed_count:[1 TO *]'),


Actually, let's set this only if seed_count isn't in q. That way we can override it.

Also, let's make this: seed_count:[2 TO *] OR list_type:series. User lists with only one book aren't much of a list! But there are some series, e.g. new series, that only have one book on them.
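One possible reading of this request, as a sketch: add the default filter only when the user's query does not already mention `seed_count`. The function name `with_default_seed_filter` and the plain substring check are assumptions for illustration, not the code that ended up in the PR.

```python
from __future__ import annotations

# Default filter proposed in the review comment above.
DEFAULT_SEED_FQ = 'seed_count:[2 TO *] OR list_type:series'


def with_default_seed_filter(
    q: str, params: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return params plus the default seed filter, unless q overrides it.

    Assumption: any mention of ``seed_count`` in the query counts as an
    explicit override, so no default filter is added in that case.
    """
    if 'seed_count' in q:
        return list(params)
    return [*params, ('fq', DEFAULT_SEED_FQ)]
```

A query such as `seed_count:0` would then bypass the default, which keeps zero-seed lists findable for the planned cleanup.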

shoaib-inamdar · 2026-03-08T18:30:06Z

@cdrini, ah sorry, I just got excited and implemented all the changes 😅.

- Add spam check in lists_edit.POST() using existing is_spam() method
  Checks list name + description against spamwords store before saving
- Add default fq filter: seed_count:[2 TO *] OR list_type:series
  Excludes low-seed lists from search results (overridable via query)
  Series are exempt since new series may have only one book
- Fix import ordering in lists.py (pre-existing lint I001)


Copilot AI review requested due to automatic review settings March 8, 2026 10:53

github-actions bot assigned mekarpeles Mar 8, 2026

Copilot started reviewing on behalf of shoaib-inamdar March 8, 2026 10:54

Copilot AI reviewed Mar 8, 2026

shoaib-inamdar force-pushed the fix/spam-list-filtering-11905 branch from 68a2cd1 to 0ad9ef5 March 8, 2026 11:16

cdrini requested changes Mar 8, 2026

shoaib-inamdar force-pushed the fix/spam-list-filtering-11905 branch from 168c20f to 340f886 March 8, 2026 18:44

[pre-commit.ci] auto fixes from pre-commit.com hooks

b894423

for more information, see https://pre-commit.ci

github-actions bot added the "Needs: Response" label (issues which require feedback from lead) Mar 9, 2026

shoaib-inamdar requested a review from cdrini March 10, 2026 16:59

fix: add spam filtering for lists (closes #11905) #12045
shoaib-inamdar wants to merge 2 commits into internetarchive:master from shoaib-inamdar:fix/spam-list-filtering-11905

4 participants
