
[Bug]: Default vector_size=3072 in IndexSchema causes LanceDB FixedSizeList dimension mismatch for non-OpenAI embedding models #2231

@laomomo

Description

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

Default vector_size=3072 in IndexSchema causes LanceDB FixedSizeList dimension mismatch for non-OpenAI embedding models

GraphRAG version: 3.0.2
graphrag-vectors version: 3.0.2


Summary

When using any embedding model that does not output 3072-dimensional vectors, the pipeline crashes with a LanceError(Arrow): Size of FixedSizeList is not the same error during the generate_text_embeddings workflow (the final step). The root cause is a hardcoded default of vector_size=3072 in graphrag_vectors/index_schema.py, which assumes OpenAI text-embedding-3-large as the universal embedding model.


Steps to Reproduce

  1. Initialize a GraphRAG project:

    graphrag init --root ./my_project
    
  2. Configure settings.yaml to use any embedding model that does not output 3072-dimensional vectors, for example:

    • NVIDIA NIM baai/bge-m3 → 1024d
    • OpenAI text-embedding-3-small → 1536d
    • OpenAI text-embedding-ada-002 → 1536d
    • Ollama nomic-embed-text → 768d
  3. Do not set vector_size explicitly under vector_store in settings.yaml (which is the case for all graphrag init-generated configs, since the template contains no index_schema entries at all)

  4. Run the indexing pipeline:

    graphrag index --root ./my_project
    
  5. Pipeline fails during generate_text_embeddings (workflow 10/10) with:

    LanceError(Arrow): Size of FixedSizeList is not the same.
    input list: fixed_size_list<item: float>[1024]
    output list: fixed_size_list<item: float>[3072]
    

Root Cause

graphrag_vectors/index_schema.py line 10:

DEFAULT_VECTOR_SIZE: int = 3072   # hardcoded to OpenAI text-embedding-3-large dims

graphrag_vectors/vector_store.py line 42:

def __init__(self, ..., vector_size: int = 3072, ...):

graphrag/config/models/graph_rag_config.py lines 259–265 (_validate_vector_store):
When an embedding's schema entry is missing from settings.yaml, it is auto-created as IndexSchema(index_name=embedding) — which inherits DEFAULT_VECTOR_SIZE=3072.

graphrag_vectors/lancedb.py create_index():
The LanceDB table schema is created using self.vector_size (3072). Then load_documents() attempts to write actual embedding vectors (e.g., 1024-dim), triggering the Arrow FixedSizeList mismatch at line 80:

vector_column = pa.FixedSizeListArray.from_arrays(flat_array, self.vector_size)

Note that load_documents() does attempt to update self.vector_size from the first document (lines 65–66), but by that point the LanceDB table schema has already been created with the wrong size in create_index().
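The ordering problem described above can be sketched in a few lines of pure Python. Class and method names here mirror the report for readability; they are not GraphRAG's actual implementation.

```python
DEFAULT_VECTOR_SIZE = 3072  # the index_schema.py default

class Store:
    """Illustrative stand-in for the LanceDB vector store."""

    def __init__(self, vector_size: int = DEFAULT_VECTOR_SIZE):
        self.vector_size = vector_size
        self.schema_size = None

    def create_index(self) -> None:
        # The table schema is frozen here, still at 3072
        self.schema_size = self.vector_size

    def load_documents(self, docs) -> None:
        # vector_size is updated from the first document (too late)...
        self.vector_size = len(docs[0])
        # ...so the write fails against the already-created schema
        if self.vector_size != self.schema_size:
            raise ValueError(
                f"Size of FixedSizeList is not the same: "
                f"input [{self.vector_size}], schema [{self.schema_size}]"
            )

store = Store()
store.create_index()                 # schema fixed at 3072
try:
    store.load_documents([[0.0] * 1024])  # e.g. baai/bge-m3 output
except ValueError as exc:
    print(exc)
```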


Impact

This affects all users who use any embedding model other than OpenAI text-embedding-3-large:

| Model | Dims | Affected |
| --- | --- | --- |
| OpenAI text-embedding-3-large | 3072 | ✅ Works (this is the hardcoded default) |
| OpenAI text-embedding-3-small | 1536 | ❌ Crashes |
| OpenAI text-embedding-ada-002 | 1536 | ❌ Crashes |
| NVIDIA NIM baai/bge-m3 | 1024 | ❌ Crashes |
| Ollama nomic-embed-text | 768 | ❌ Crashes |
| Any other non-3072-dim model | varies | ❌ Crashes |

The error occurs at workflow 10/10 — the very last step — after all LLM calls for entity extraction and community reports have already been made and billed. There is no graceful error message pointing to vector_size or settings.yaml as the cause.


Workaround

Explicitly set vector_size under each schema entry in settings.yaml to match the actual embedding model output dimensions:

vector_store:
  default:
    type: lancedb
    db_uri: ./output/lancedb
  entity_description:
    vector_size: 1024   # set to match your embedding model
  community_full_content:
    vector_size: 1024
  text_unit_text:
    vector_size: 1024

Suggested Fix

Option A — Auto-detect from first document (minimal change):
In lancedb.py, move the self.vector_size update from load_documents() to before create_index() is called, by inspecting the first document's vector length before creating the table schema.
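A rough sketch of what Option A could look like; the names (LanceDBVectorStore, table_schema) are illustrative, not GraphRAG's real API:

```python
class LanceDBVectorStore:
    """Illustrative sketch: infer vector_size before creating the schema."""

    def __init__(self, vector_size: int = 3072):
        self.vector_size = vector_size  # current hardcoded default
        self.table_schema = None

    def create_index(self) -> None:
        # Schema is created from whatever vector_size holds right now
        self.table_schema = ("fixed_size_list<float>", self.vector_size)

    def load_documents(self, documents) -> None:
        # Proposed change: update vector_size from the first document
        # BEFORE the table schema is created, not after
        if documents:
            self.vector_size = len(documents[0]["vector"])
        self.create_index()

store = LanceDBVectorStore()
store.load_documents([{"vector": [0.0] * 1024}])
print(store.table_schema)  # ('fixed_size_list<float>', 1024)
```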

Option B — Infer from model config at init time (better UX):
In _validate_vector_store() in graph_rag_config.py, when auto-creating missing IndexSchema entries, look up the output dimension from the configured embedding model's configuration (e.g., via a dimensions field in ModelConfig), instead of defaulting to 3072.

Option C — Warning at config validation (minimal, immediate improvement):
Emit a clear warning during config loading if vector_size is left at default 3072 and the configured embedding model is not text-embedding-3-large, prompting the user to set it explicitly in settings.yaml.
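A minimal sketch of Option C's check, assuming a hypothetical validation hook; the function name and signature are illustrative:

```python
import warnings

DEFAULT_VECTOR_SIZE = 3072

def validate_vector_store(embedding_model: str, vector_size: int) -> None:
    """Warn when vector_size is still the default but the model is not
    the one that default was derived from."""
    if (
        vector_size == DEFAULT_VECTOR_SIZE
        and embedding_model != "text-embedding-3-large"
    ):
        warnings.warn(
            f"vector_size is the default {DEFAULT_VECTOR_SIZE}, but model "
            f"'{embedding_model}' may emit different dimensions; set "
            "vector_size explicitly in settings.yaml",
            UserWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate_vector_store("nomic-embed-text", DEFAULT_VECTOR_SIZE)
print(len(caught))  # 1
```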


Additional Context

The graphrag init-generated settings.yaml contains no vector_store.index_schema entries, so new users have no indication that vector_size must be set manually for non-OpenAI embeddings. The only hint is buried in the LanceDB Arrow error message, which does not reference settings.yaml or vector_size as the fix.

