
Conversation

@habemusne commented on Apr 28, 2025

This PR is both a bug report and a proposed fix.

Description

After splitting a long text and feeding each chunk to an LLM for the entity linking task, the _get_prompt_data function fails with "IndexError: list index out of range" when trying to access self._ents_cands_by_shard[i_doc].

The line self._ents_cands_by_shard = [[] * len(self._ents_cands_by_doc)] looks like it is meant to initialize a list of empty lists, one per doc. But it does not: [] * 3 is still [], so [[] * 3] evaluates to [[]] rather than [[], [], []]. The resulting list always has length 1, and indexing it with i_doc fails for any doc beyond the first. A demonstration follows below; please review the proposed change and adjust as you see fit.
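A quick, self-contained demonstration of the pitfall (plain Python, independent of spacy-llm), including the related trap of multiplying a one-element list:

```python
# `[] * 3` multiplies an empty list, so it stays empty;
# the outer brackets then wrap it into a single-element list.
assert [[] * 3] == [[]]

# `[[]] * 3` has the right length, but all three slots alias one inner list:
shared = [[]] * 3
shared[0].append("x")
assert shared == [["x"], ["x"], ["x"]]

# A comprehension creates independent inner lists, one per element:
independent = [[] for _ in range(3)]
independent[0].append("x")
assert independent == [["x"], [], []]
```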

Before applying the change: error log

yy@yys-MacBook-Pro spacy-llm % pytest spacy_llm/tests/tasks/test_entity_linker.py::test_entity_linker_on_splitted_chunks
========================================== test session starts ==========================================
platform darwin -- Python 3.11.10, pytest-8.3.5, pluggy-1.5.0
rootdir: /Users/yy/Project/2024/spacy-llm
configfile: pyproject.toml
plugins: anyio-4.8.0, langsmith-0.3.33
collected 1 item

spacy_llm/tests/tasks/test_entity_linker.py F                                                     [100%]

=============================================== FAILURES ================================================
_________________________________ test_entity_linker_on_splitted_chunks _________________________________

zeroshot_cfg_string = '\n    [paths]\n    el_nlp = null\n    el_kb = null\n    el_desc = null\n\n    [nlp]\n    lang = "en"\n    pipeline = ....KBObjectLoader.v1"\n    path = ${paths.el_kb}\n    nlp_path = ${paths.el_nlp}\n    desc_path = ${paths.el_desc}\n    '
tmp_path = PosixPath('/private/var/folders/sr/frr2h94d5gxbs296hhrnd7mh0000gn/T/pytest-of-yy/pytest-17/test_entity_linker_on_splitted0')

    def test_entity_linker_on_splitted_chunks(zeroshot_cfg_string, tmp_path):
        config = Config().from_str(
            zeroshot_cfg_string,
            overrides={
                "paths.el_nlp": str(tmp_path),
                "paths.el_kb": str(tmp_path / "entity_linker" / "kb"),
                "paths.el_desc": str(tmp_path / "desc.csv"),
            },
        )
        build_el_pipeline(nlp_path=tmp_path, desc_path=tmp_path / "desc.csv")
        nlp = assemble_from_config(config)
        nlp_ner = spacy.load("en_core_web_md")
        docs = [nlp_ner(text) for text in [
            'Alice goes to Boston to see the Boston Celtics game.',
            'Alice goes to New York to see the New York Knicks game.',
            'I went to see Boston in concert yesterday',
            'Thibeau Courtois plays for the Red Devils in New York',
        ]]
>       docs = [doc for doc in nlp.pipe(docs, batch_size=50)]

spacy_llm/tests/tasks/test_entity_linker.py:815:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
spacy_llm/tests/tasks/test_entity_linker.py:815: in <listcomp>
    docs = [doc for doc in nlp.pipe(docs, batch_size=50)]
../../../.pyenv/versions/3.11.10/lib/python3.11/site-packages/spacy/language.py:1621: in pipe
    for doc in docs:
../../../.pyenv/versions/3.11.10/lib/python3.11/site-packages/spacy/util.py:1703: in _pipe
    yield from proc.pipe(docs, **kwargs)
spacy_llm/pipeline/llm.py:207: in pipe
    error_handler(self._name, self, doc_batch, e)
../../../.pyenv/versions/3.11.10/lib/python3.11/site-packages/spacy/util.py:1722: in raise_error
    raise e
spacy_llm/pipeline/llm.py:205: in pipe
    yield from iter(self._process_docs(doc_batch))
spacy_llm/pipeline/llm.py:242: in _process_docs
    self._model(
spacy_llm/models/rest/openai/model.py:78: in __call__
    for prompts_for_doc in prompts:
spacy_llm/pipeline/llm.py:245: in <genexpr>
    (
spacy_llm/tasks/builtin_task.py:89: in generate_prompts
    self._shard_mapper(_doc, _i_doc, context_length, render_template)
spacy_llm/tasks/util/sharding.py:43: in map_doc_to_shards
    prompt = render_template(doc, 0, i_doc, 1)
spacy_llm/tasks/builtin_task.py:83: in render_template
    **self._get_prompt_data(shard, i_shard, i_doc, n_shards),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <spacy_llm.tasks.entity_linker.task.EntityLinkerTask object at 0x325396dd0>
shard = Alice goes to *New York* to see the *New York* Knicks game., i_shard = 0, i_doc = 1, n_shards = 1

    def _get_prompt_data(
        self, shard: Doc, i_shard: int, i_doc: int, n_shards: int
    ) -> Dict[str, Any]:
        # n_shards changes before reset happens in _preprocess_docs() whenever sharding mechanism varies number of
        # shards. In this case we have to reset task state as well.
        if n_shards != self._n_shards:
            self._n_shards = n_shards
            self._ents_cands_by_shard = [[] * len(self._ents_cands_by_doc)]
            self._has_ent_cands_by_shard = [[] * len(self._ents_cands_by_doc)]

        # It's not ideal that we have to run candidate selection again here - but due to (1) us wanting to know whether
        # all entities have candidates before sharding and, more importantly, (2) some entities maybe being split up in
        # the sharding process it's cleaner to look for candidates again.
        if n_shards == 1:
            # If only one shard: shard is identical to original doc, so we don't have to rerun candidate search.
            ents_cands, has_cands = (
                self._ents_cands_by_doc[i_doc],
                self._has_ent_cands_by_doc[i_doc],
            )
        else:
            cands_info = self._find_entity_candidates([shard])
            ents_cands, has_cands = cands_info[0][0], cands_info[1][0]

        # Update shard-wise candidate info so it can be reused during parsing.
>       if len(self._ents_cands_by_shard[i_doc]) == 0:
E       IndexError: list index out of range

spacy_llm/tasks/entity_linker/task.py:161: IndexError
======================================== short test summary info ========================================
FAILED spacy_llm/tests/tasks/test_entity_linker.py::test_entity_linker_on_splitted_chunks - IndexError: list index out of range
=========================================== 1 failed in 7.90s ===========================================
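For reference, a minimal sketch of the kind of change that fixes the initialization; the actual diff in this PR may differ in detail:

```python
# spacy_llm/tasks/entity_linker/task.py, inside _get_prompt_data():
# before: self._ents_cands_by_shard = [[] * len(self._ents_cands_by_doc)]
# which always evaluates to [[]], a list of length 1.
self._ents_cands_by_shard = [[] for _ in range(len(self._ents_cands_by_doc))]
self._has_ent_cands_by_shard = [[] for _ in range(len(self._ents_cands_by_doc))]
```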

After applying the change: all external tests passed

[Screenshot, 2025-04-29: pytest output after applying the fix]

Types of change

Bug fix

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran all tests in tests and usage_examples/tests, and all new and existing tests passed. This includes
    • all external tests (i.e. pytest run with --external)
    • all tests requiring a GPU (i.e. pytest run with --gpu)
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@habemusne changed the title from "Fix IndexError when running with splitted tech chunks" to "Fix IndexError when running with splitted text chunks" on May 1, 2025