[Bug]: error running workflow create_final_text_units: Could not convert <ArrowStringArray> #2177

@Green0wl

Description

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

The create_final_text_units workflow has been failing since yesterday's pandas==3.0.0 release; environments installed before the release work fine.
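
The error points at pandas 3.0's new default string dtype: list-valued columns such as entity_ids can end up holding ArrowStringArray objects instead of plain Python lists, and pyarrow cannot infer an Arrow type for those elements inside an object-dtype column. A minimal sketch of the suspected failure mode outside GraphRAG (hypothetical repro; the dtype="string[pyarrow]" construction stands in for whatever intermediate step produces the extension array):

import pandas as pd

# A one-row frame resembling the text_units table; entity_ids should be
# a list of ID strings per row.
df = pd.DataFrame({"id": ["t1"]})

# Simulate what pandas 3.0 appears to produce: the cell is an
# ArrowStringArray rather than a plain list of str.
df["entity_ids"] = [pd.array(["a", "b"], dtype="string[pyarrow]")]

# pyarrow does not recognize the extension-array element when inferring
# an Arrow type for the object column, so this raises
# pyarrow.lib.ArrowInvalid, matching the traceback below.
df.to_parquet("text_units.parquet")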

Steps to reproduce

python3 -m venv graphrag
source graphrag/bin/activate
pip install graphrag
graphrag index
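
A likely temporary workaround (untested here, inferred from the fact that pre-release installations work) is to pin pandas below 3.0 in the same environment, e.g. pip install "pandas<3.0", before running graphrag index.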

Expected Behavior

All pipeline stages complete successfully.

GraphRAG Config Used

defaults for Azure OpenAI:

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: azure_openai_chat
    model_provider: openai
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file, or remove if managed identity
    model: gpt-5-nano
    deployment_name: gpt-5-nano
    api_base: <url>    
    api_version: 2025-01-01-preview
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 20
    async_mode: threaded # or asyncio
    retry_strategy: exponential_backoff
    max_retries: 10
    tokens_per_minute: 50000
    requests_per_minute: 300
    max_completion_tokens: 10000
    temperature: 1
  default_embedding_model:
    type: azure_openai_embedding
    model_provider: openai
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model: text-embedding-3-small
    deployment_name: text-embedding-3-small
    api_base: <url>
    api_version: 2025-01-01-preview
    concurrent_requests: 25
    async_mode: threaded # or asyncio
    retry_strategy: exponential_backoff
    max_retries: 10
    tokens_per_minute: null
    requests_per_minute: null
    max_tokens: 8000
    temperature: 1

### Input settings ###

input:
  storage:
    type: file # or blob
    base_dir: "input"
  file_type: text # [csv, text, json]

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output/storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"
    
cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob]
  base_dir: "logs"

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default

### Workflow settings ###

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]
  async_mode: threaded # or asyncio

cluster_graph:
  max_cluster_size: 10

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: false
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"

Logs and screenshots

2026-01-22 11:59:14.0063 - ERROR - graphrag.index.run.run_pipeline - error running workflow create_final_text_units
Traceback (most recent call last):
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/graphrag/index/run/run_pipeline.py", line 121, in _run_pipeline
    result = await workflow_function(config, context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/create_final_text_units.py", line 49, in run_workflow
    await write_table_to_storage(output, "text_units", context.output_storage)
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/graphrag/utils/storage.py", line 34, in write_table_to_storage
    await storage.set(f"{name}.parquet", table.to_parquet())
                                         ^^^^^^^^^^^^^^^^^^
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 3135, in to_parquet
    return to_parquet(
           ^^^^^^^^^^^
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pandas/io/parquet.py", line 490, in to_parquet
    impl.write(
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pandas/io/parquet.py", line 191, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 4796, in pyarrow.lib.Table.from_pandas
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 651, in dataframe_to_arrays
    arrays = [convert_column(c, f)
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 639, in convert_column
    raise e
  File "/home/greenowl/Downloads/graphrag/lib/python3.12/site-packages/pyarrow/pandas_compat.py", line 633, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 365, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 91, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert <ArrowStringArray>\n['ab0ce889-2dd5-4842-ba4e-0cf150772c09',\n 'e196f731-e522-454b-9aa3-2d3fa4171124',\n '7e339fc2-6a41-48f1-9aef-0bf4e9574b8e',\n '7bb73f8b-90f8-4ff4-be04-e445f7b21a20',\n '8d57bd9e-2fd8-483c-ac7f-425097b390ed']\nLength: 5, dtype: str with type ArrowStringArray: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column entity_ids with type object')
2026-01-22 11:59:14.0064 - ERROR - graphrag.api.index - Workflow create_final_text_units completed with errors
2026-01-22 11:59:14.0065 - ERROR - graphrag.cli.index - Errors occurred during the pipeline run, see logs for more details.
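
The conversion fails on the entity_ids column: each cell is an ArrowStringArray, and pyarrow's type inference does not recognize that element type inside an object-dtype column. A possible user-side workaround sketch (hypothetical; normalize_list_column is not part of GraphRAG) is to coerce such cells back to plain Python lists before to_parquet is called:

import pandas as pd

def normalize_list_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    # Convert extension-array cells (e.g. ArrowStringArray) into plain
    # lists of str so pyarrow can infer list<string> for the column.
    df[column] = df[column].apply(lambda v: None if v is None else list(v))
    return df

# Applied to the failing table, this would run just before the
# write_table_to_storage call shown in the traceback, e.g.:
# normalize_list_column(text_units, "entity_ids")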

Additional Information

  • GraphRAG Version: graphrag==2.7.0
  • Pandas Version: pandas==3.0.0 (the regression trigger; earlier releases work, per the description)
  • Operating System: Ubuntu 24.04
  • Python Version: 3.12.3
  • Related Issues: None
