Description
Do you need to file an issue?
- I have searched the existing issues and this bug is not already filed.
- My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
Default vector_size=3072 in IndexSchema causes LanceDB FixedSizeList dimension mismatch for non-OpenAI embedding models
GraphRAG version: 3.0.2
graphrag-vectors version: 3.0.2
Summary
When using any embedding model that does not output 3072-dimensional vectors, the pipeline crashes with a `LanceError(Arrow): Size of FixedSizeList is not the same` error during the `generate_text_embeddings` workflow (the final step). The root cause is a hardcoded default of `vector_size=3072` in `graphrag_vectors/index_schema.py`, which assumes OpenAI `text-embedding-3-large` as the universal embedding model.
Steps to Reproduce

1. Initialize a GraphRAG project: `graphrag init --root ./my_project`
2. Configure `settings.yaml` to use any embedding model that does not output 3072-dimensional vectors, for example:
   - NVIDIA NIM `baai/bge-m3` → 1024d
   - OpenAI `text-embedding-3-small` → 1536d
   - OpenAI `text-embedding-ada-002` → 1536d
   - Ollama `nomic-embed-text` → 768d
3. Do not set `vector_size` explicitly under `vector_store` in `settings.yaml` (which is the case for all `graphrag init`-generated configs, since the template contains no `index_schema` entries at all).
4. Run the indexing pipeline: `graphrag index --root ./my_project`
5. The pipeline fails during `generate_text_embeddings` (workflow 10/10) with:

   ```
   LanceError(Arrow): Size of FixedSizeList is not the same. input list: fixed_size_list<item: float>[1024] output list: fixed_size_list<item: float>[3072]
   ```
Root Cause

1. `graphrag_vectors/index_schema.py` line 10:

   ```python
   DEFAULT_VECTOR_SIZE: int = 3072  # hardcoded to OpenAI text-embedding-3-large dims
   ```

2. `graphrag_vectors/vector_store.py` line 42:

   ```python
   def __init__(self, ..., vector_size: int = 3072, ...):
   ```

3. `graphrag/config/models/graph_rag_config.py` lines 259–265 (`_validate_vector_store`): when an embedding's schema entry is missing from `settings.yaml`, it is auto-created as `IndexSchema(index_name=embedding)`, which inherits `DEFAULT_VECTOR_SIZE=3072`.

4. `graphrag_vectors/lancedb.py` `create_index()`: the LanceDB table schema is created using `self.vector_size` (3072). Then `load_documents()` attempts to write actual embedding vectors (e.g., 1024-dim), triggering the Arrow FixedSizeList mismatch at line 80:

   ```python
   vector_column = pa.FixedSizeListArray.from_arrays(flat_array, self.vector_size)
   ```

   Note that `load_documents()` does attempt to update `self.vector_size` from the first document (lines 65–66), but by that point the LanceDB table schema has already been created with the wrong size in `create_index()`.
Impact
This affects all users who use any embedding model other than OpenAI `text-embedding-3-large`:
| Model | Dims | Affected |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | ✅ Works (this is the hardcoded default) |
| OpenAI text-embedding-3-small | 1536 | ❌ Crashes |
| OpenAI text-embedding-ada-002 | 1536 | ❌ Crashes |
| NVIDIA NIM baai/bge-m3 | 1024 | ❌ Crashes |
| Ollama nomic-embed-text | 768 | ❌ Crashes |
| Any other non-3072-dim model | varies | ❌ Crashes |
The error occurs at workflow 10/10 (the very last step), after all LLM calls for entity extraction and community reports have already been made and billed. There is no graceful error message pointing to `vector_size` or `settings.yaml` as the cause.
Workaround

Explicitly set `vector_size` under each schema entry in `settings.yaml` to match the actual embedding model output dimensions:

```yaml
vector_store:
  default:
    type: lancedb
    db_uri: ./output/lancedb
    entity_description:
      vector_size: 1024  # set to match your embedding model
    community_full_content:
      vector_size: 1024
    text_unit_text:
      vector_size: 1024
```

Suggested Fix
Option A — Auto-detect from first document (minimal change):
In `lancedb.py`, move the `self.vector_size` update from `load_documents()` to before `create_index()` is called, by inspecting the first document's vector length before creating the table schema.
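A minimal sketch of Option A, using a stand-in class with assumed names patterned on this report (not the actual `graphrag_vectors` source): the dimension is inferred from the first document before the table schema is frozen.

```python
class LanceDBVectorStoreSketch:
    """Illustrative stand-in for the LanceDB vector store in this report."""

    def __init__(self, vector_size: int = 3072):
        self.vector_size = vector_size
        self.table_schema_size = None  # stands in for the real LanceDB table schema

    def create_index(self):
        # The table schema is frozen here, so vector_size must already be correct.
        self.table_schema_size = self.vector_size

    def load_documents(self, documents):
        # FIX: update vector_size from the first document BEFORE create_index(),
        # instead of after the table already exists (the reported bug).
        if documents:
            self.vector_size = len(documents[0]["vector"])
        if self.table_schema_size is None:
            self.create_index()
        # Schema and data now always agree:
        assert self.table_schema_size == self.vector_size


store = LanceDBVectorStoreSketch()                # default is still 3072
store.load_documents([{"vector": [0.0] * 1024}])  # 1024-dim model
print(store.vector_size)  # → 1024
```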
Option B — Infer from model config at init time (better UX):
In `_validate_vector_store()` in `graph_rag_config.py`, when auto-creating missing `IndexSchema` entries, look up the output dimension from the configured embedding model's configuration (e.g., via a `dimensions` field in `ModelConfig`), instead of defaulting to 3072.
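Even without a `dimensions` field, Option B could be approximated by probing the configured model once at startup. The helper below is hypothetical (`detect_vector_size` and the stand-in embedder are not GraphRAG APIs); it only shows the idea of deriving the dimension from the model itself.

```python
def detect_vector_size(embed_fn, probe_text: str = "dimension probe") -> int:
    """Embed one short string and return the vector length.

    embed_fn is any callable mapping text -> list[float]; in GraphRAG this
    would be the configured embedding client (hypothetical wiring).
    """
    return len(embed_fn(probe_text))


# Stand-in for a 1024-dim model such as baai/bge-m3:
def fake_bge_m3(text: str) -> list[float]:
    return [0.0] * 1024


print(detect_vector_size(fake_bge_m3))  # → 1024
```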
Option C — Warning at config validation (minimal, immediate improvement):
Emit a clear warning during config loading if `vector_size` is left at the default 3072 and the configured embedding model is not `text-embedding-3-large`, prompting the user to set it explicitly in `settings.yaml`.
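Option C could look roughly like this; the function and its wiring are assumptions for illustration, not the actual graphrag config code.

```python
import warnings

DEFAULT_VECTOR_SIZE = 3072


def warn_on_default_vector_size(model_name: str, vector_size: int) -> None:
    """Warn when vector_size was left at the default but the configured model
    is not text-embedding-3-large (the only model the default matches)."""
    if vector_size == DEFAULT_VECTOR_SIZE and model_name != "text-embedding-3-large":
        warnings.warn(
            f"vector_size is the default {DEFAULT_VECTOR_SIZE}, but model "
            f"'{model_name}' may output a different dimension; set "
            "vector_size explicitly in settings.yaml.",
            UserWarning,
        )


warn_on_default_vector_size("nomic-embed-text", 3072)        # emits the warning
warn_on_default_vector_size("text-embedding-3-large", 3072)  # silent
```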
Additional Context
The `graphrag init`-generated `settings.yaml` contains no `vector_store.index_schema` entries, so new users have no indication that `vector_size` must be set manually for non-OpenAI embeddings. The only hint is buried in the LanceDB Arrow error message, which does not reference `settings.yaml` or `vector_size` as the fix.
Steps to reproduce
No response
Expected Behavior
No response
GraphRAG Config Used
# Paste your config here
Logs and screenshots
No response
Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues: