Description
Do you need to file an issue?
- I have searched the existing issues and this bug is not already filed.
- My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
The create_final_text_units workflow has been failing since yesterday's pandas 3.0.0 release; environments installed before that release still work fine.
Steps to reproduce
python3 -m venv graphrag
source graphrag/bin/activate
pip install graphrag
graphrag index
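If the failure is indeed caused by the pandas 3.0.0 upgrade (as the timing suggests), pinning pandas below 3.0 in the same environment should restore the previous behavior:
pip install graphrag "pandas<3.0"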
Expected Behavior
All pipeline stages complete successfully.
GraphRAG Config Used
defaults for Azure OpenAI:
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: azure_openai_chat
    model_provider: openai
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file, or remove if managed identity
    model: gpt-5-nano
    deployment_name: gpt-5-nano
    api_base: <url>
    api_version: 2025-01-01-preview
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 20
    async_mode: threaded # or asyncio
    retry_strategy: exponential_backoff
    max_retries: 10
    tokens_per_minute: 50000
    requests_per_minute: 300
    max_completion_tokens: 10000
    temperature: 1
  default_embedding_model:
    type: azure_openai_embedding
    model_provider: openai
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model: text-embedding-3-small
    deployment_name: text-embedding-3-small
    api_base: <url>
    api_version: 2025-01-01-preview
    concurrent_requests: 25
    async_mode: threaded # or asyncio
    retry_strategy: exponential_backoff
    max_retries: 10
    tokens_per_minute: null
    requests_per_minute: null
    max_tokens: 8000
    temperature: 1

### Input settings ###

input:
  storage:
    type: file # or blob
    base_dir: "input"
  file_type: text # [csv, text, json]

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output/storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob]
  base_dir: "logs"

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default

### Workflow settings ###

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]
  async_mode: threaded # or asyncio

cluster_graph:
  max_cluster_size: 10

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: false
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"

Logs and screenshots
2026-01-22 11:59:14.0063 - ERROR - graphrag.index.run.run_pipeline - error running workflow create_final_text_units
Traceback (most recent call last):
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/graphrag/index/run/run_pipeline.py", line 121, in _run_pipeline
result = await workflow_function(config, context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/create_final_text_units.py", line 49, in run_workflow
await write_table_to_storage(output, "text_units", context.output_storage)
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/graphrag/utils/storage.py", line 34, in write_table_to_storage
await storage.set(f"{name}.parquet", table.to_parquet())
^^^^^^^^^^^^^^^^^^
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 3135, in to_parquet
return to_parquet(
^^^^^^^^^^^
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pandas/io/parquet.py", line 490, in to_parquet
impl.write(
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pandas/io/parquet.py", line 191, in write
table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 4796, in pyarrow.lib.Table.from_pandas
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 651, in dataframe_to_arrays
arrays = [convert_column(c, f)
^^^^^^^^^^^^^^^^^^^^
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 639, in convert_column
raise e
File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 633, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 365, in pyarrow.lib.array
File "pyarrow/array.pxi", line 91, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert <ArrowStringArray>\n['ab0ce889-2dd5-4842-ba4e-0cf150772c09',\n 'e196f731-e522-454b-9aa3-2d3fa4171124',\n '7e339fc2-6a41-48f1-9aef-0bf4e9574b8e',\n '7bb73f8b-90f8-4ff4-be04-e445f7b21a20',\n '8d57bd9e-2fd8-483c-ac7f-425097b390ed']\nLength: 5, dtype: str with type ArrowStringArray: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column entity_ids with type object')
2026-01-22 11:59:14.0064 - ERROR - graphrag.api.index - Workflow create_final_text_units completed with errors
2026-01-22 11:59:14.0065 - ERROR - graphrag.cli.index - Errors occurred during the pipeline run, see logs for more details.
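The traceback suggests that entity_ids is an object column whose cells hold pandas ArrowStringArray instances rather than plain Python lists, and pyarrow cannot infer an Arrow type for such values. Below is a minimal sketch of what I believe the failing state looks like (the column name comes from the traceback; the construction is my guess at what the workflow's aggregation leaves behind under pandas 3.x, where the default string dtype is Arrow-backed):

import pandas as pd

# "string[pyarrow]" is used to obtain an ArrowStringArray explicitly;
# under pandas 3.x the default str dtype is assumed to produce the same type.
ids = pd.Series(
    ["ab0ce889-2dd5-4842-ba4e-0cf150772c09",
     "e196f731-e522-454b-9aa3-2d3fa4171124"],
    dtype="string[pyarrow]",
)

# An object column whose single cell is the raw ArrowStringArray itself,
# e.g. what a groupby(...).agg(...) over id strings might leave behind.
df = pd.DataFrame({"entity_ids": pd.Series([ids.array], dtype=object)})

# pyarrow does not recognize ArrowStringArray as a cell value when inferring
# an Arrow type, so this should raise pyarrow.lib.ArrowInvalid as in the log.
df.to_parquet("text_units.parquet")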
Additional Information
- GraphRAG Version: graphrag==2.7.0
- Operating System: Ubuntu 24.04
- Python Version: 3.12.3
- Related Issues: None
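For anyone blocked on this, one possible local patch (an untested sketch; normalize_list_columns is a hypothetical helper, not part of GraphRAG) is to coerce array-valued object cells to plain lists right before the to_parquet call in graphrag/utils/storage.py:

import pandas as pd

def normalize_list_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: replace ExtensionArray cell values (e.g.
    # ArrowStringArray) with plain Python lists so pyarrow can infer
    # list<string> instead of failing on an unrecognized value type.
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(
            lambda v: list(v) if isinstance(v, pd.api.extensions.ExtensionArray) else v
        )
    return out

# e.g. in write_table_to_storage, before the table.to_parquet() call:
# table = normalize_list_columns(table)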