
Conversation

@habemusne commented on Apr 28, 2025

This PR is both a bug report and a proposed fix.

Description

After splitting a long text and feeding each chunk to an LLM for the entity linking task, the _get_prompt_data function fails with "IndexError: list index out of range" when trying to access self._ents_cands_by_shard[i_doc].

The line self._ents_cands_by_shard = [[] * len(self._ents_cands_by_doc)] looks like it is meant to initialize a list of empty lists, one per doc. But it does not: [] * 3 is still [], so [[] * 3] evaluates to [[]] rather than [[], [], []]. The resulting list always has length 1, and indexing it with i_doc fails for any doc beyond the first. A demonstration follows below; please review the proposed change and adjust as you see fit.
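A quick, self-contained demonstration of the pitfall (plain Python, independent of spacy-llm), including the related trap of multiplying a one-element list:

```python
# `[] * 3` multiplies an empty list, so it stays empty;
# the outer brackets then wrap it into a single-element list.
assert [[] * 3] == [[]]

# `[[]] * 3` has the right length, but all three slots alias one inner list:
shared = [[]] * 3
shared[0].append("x")
assert shared == [["x"], ["x"], ["x"]]

# A comprehension creates independent inner lists, one per element:
independent = [[] for _ in range(3)]
independent[0].append("x")
assert independent == [["x"], [], []]
```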

Before applying the change: error log

yy@yys-MacBook-Pro spacy-llm % pytest spacy_llm/tests/tasks/test_entity_linker.py::test_entity_linker_on_splitted_chunks
========================================== test session starts ==========================================
platform darwin -- Python 3.11.10, pytest-8.3.5, pluggy-1.5.0
rootdir: /Users/yy/Project/2024/spacy-llm
configfile: pyproject.toml
plugins: anyio-4.8.0, langsmith-0.3.33
collected 1 item

spacy_llm/tests/tasks/test_entity_linker.py F                                                     [100%]

=============================================== FAILURES ================================================
_________________________________ test_entity_linker_on_splitted_chunks _________________________________

zeroshot_cfg_string = '\n    [paths]\n    el_nlp = null\n    el_kb = null\n    el_desc = null\n\n    [nlp]\n    lang = "en"\n    pipeline = ....KBObjectLoader.v1"\n    path = ${paths.el_kb}\n    nlp_path = ${paths.el_nlp}\n    desc_path = ${paths.el_desc}\n    '
tmp_path = PosixPath('/private/var/folders/sr/frr2h94d5gxbs296hhrnd7mh0000gn/T/pytest-of-yy/pytest-17/test_entity_linker_on_splitted0')

    def test_entity_linker_on_splitted_chunks(zeroshot_cfg_string, tmp_path):
        config = Config().from_str(
            zeroshot_cfg_string,
            overrides={
                "paths.el_nlp": str(tmp_path),
                "paths.el_kb": str(tmp_path / "entity_linker" / "kb"),
                "paths.el_desc": str(tmp_path / "desc.csv"),
            },
        )
        build_el_pipeline(nlp_path=tmp_path, desc_path=tmp_path / "desc.csv")
        nlp = assemble_from_config(config)
        nlp_ner = spacy.load("en_core_web_md")
        docs = [nlp_ner(text) for text in [
            'Alice goes to Boston to see the Boston Celtics game.',
            'Alice goes to New York to see the New York Knicks game.',
            'I went to see Boston in concert yesterday',
            'Thibeau Courtois plays for the Red Devils in New York',
        ]]
>       docs = [doc for doc in nlp.pipe(docs, batch_size=50)]

spacy_llm/tests/tasks/test_entity_linker.py:815:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
spacy_llm/tests/tasks/test_entity_linker.py:815: in <listcomp>
    docs = [doc for doc in nlp.pipe(docs, batch_size=50)]
../../../.pyenv/versions/3.11.10/lib/python3.11/site-packages/spacy/language.py:1621: in pipe
    for doc in docs:
../../../.pyenv/versions/3.11.10/lib/python3.11/site-packages/spacy/util.py:1703: in _pipe
    yield from proc.pipe(docs, **kwargs)
spacy_llm/pipeline/llm.py:207: in pipe
    error_handler(self._name, self, doc_batch, e)
../../../.pyenv/versions/3.11.10/lib/python3.11/site-packages/spacy/util.py:1722: in raise_error
    raise e
spacy_llm/pipeline/llm.py:205: in pipe
    yield from iter(self._process_docs(doc_batch))
spacy_llm/pipeline/llm.py:242: in _process_docs
    self._model(
spacy_llm/models/rest/openai/model.py:78: in __call__
    for prompts_for_doc in prompts:
spacy_llm/pipeline/llm.py:245: in <genexpr>
    (
spacy_llm/tasks/builtin_task.py:89: in generate_prompts
    self._shard_mapper(_doc, _i_doc, context_length, render_template)
spacy_llm/tasks/util/sharding.py:43: in map_doc_to_shards
    prompt = render_template(doc, 0, i_doc, 1)
spacy_llm/tasks/builtin_task.py:83: in render_template
    **self._get_prompt_data(shard, i_shard, i_doc, n_shards),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <spacy_llm.tasks.entity_linker.task.EntityLinkerTask object at 0x325396dd0>
shard = Alice goes to *New York* to see the *New York* Knicks game., i_shard = 0, i_doc = 1, n_shards = 1

    def _get_prompt_data(
        self, shard: Doc, i_shard: int, i_doc: int, n_shards: int
    ) -> Dict[str, Any]:
        # n_shards changes before reset happens in _preprocess_docs() whenever sharding mechanism varies number of
        # shards. In this case we have to reset task state as well.
        if n_shards != self._n_shards:
            self._n_shards = n_shards
            self._ents_cands_by_shard = [[] * len(self._ents_cands_by_doc)]
            self._has_ent_cands_by_shard = [[] * len(self._ents_cands_by_doc)]

        # It's not ideal that we have to run candidate selection again here - but due to (1) us wanting to know whether
        # all entities have candidates before sharding and, more importantly, (2) some entities maybe being split up in
        # the sharding process it's cleaner to look for candidates again.
        if n_shards == 1:
            # If only one shard: shard is identical to original doc, so we don't have to rerun candidate search.
            ents_cands, has_cands = (
                self._ents_cands_by_doc[i_doc],
                self._has_ent_cands_by_doc[i_doc],
            )
        else:
            cands_info = self._find_entity_candidates([shard])
            ents_cands, has_cands = cands_info[0][0], cands_info[1][0]

        # Update shard-wise candidate info so it can be reused during parsing.
>       if len(self._ents_cands_by_shard[i_doc]) == 0:
E       IndexError: list index out of range

spacy_llm/tasks/entity_linker/task.py:161: IndexError
======================================== short test summary info ========================================
FAILED spacy_llm/tests/tasks/test_entity_linker.py::test_entity_linker_on_splitted_chunks - IndexError: list index out of range
=========================================== 1 failed in 7.90s ===========================================
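For reference, a minimal sketch of the kind of change that fixes the initialization; the actual diff in this PR may differ in detail:

```python
# spacy_llm/tasks/entity_linker/task.py, inside _get_prompt_data():
# before: self._ents_cands_by_shard = [[] * len(self._ents_cands_by_doc)]
# which always evaluates to [[]], a list of length 1.
self._ents_cands_by_shard = [[] for _ in range(len(self._ents_cands_by_doc))]
self._has_ent_cands_by_shard = [[] for _ in range(len(self._ents_cands_by_doc))]
```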

After applying the change: all external tests passed

[Screenshot, 2025-04-29: pytest output after applying the fix]

Types of change

Bug fix

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran all tests in tests and usage_examples/tests, and all new and existing tests passed. This includes
    • all external tests (i.e. pytest run with --external)
    • all tests requiring a GPU (i.e. pytest run with --gpu)
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@habemusne changed the title from "Fix IndexError when running with splitted tech chunks" to "Fix IndexError when running with splitted text chunks" on May 1, 2025