DOC-829 | GraphRAG: Add multi-file import, semantic units, restructure content #856

nerpaula · 2025-12-22T12:33:11Z

Description

Note

Major GraphRAG docs update focused on Importer capabilities and structure.

Importer docs restructured: Replace reference/importer.md with a new reference/importer/ section (_index.md, importing-files.md, llm-configuration.md, parameters.md, semantic-units.md, verify-and-explore.md).
New features documented: Multi-file import via POST /v1/import-multiple with SSE streaming progress; semantic units (image/URL extraction) with configurable options; expanded parameter reference and verification guidance.
Global link updates: Update references from ../reference/importer.md to ../reference/importer/ (and anchors), including in graphrag/_index.md, technical-overview.md, web-interface.md, ai-orchestrator.md, retriever.md, and triton-inference-server.md.
Limitations update: Reflect project-based graph/collection naming ({project_name}_kg, {project_name}_Documents, etc.) and remove single-file-only wording.

Written by Cursor Bugbot for commit d3f79fd. This will update automatically on new commits.

arangodb-docs-automation · 2025-12-22T12:33:13Z

Deploy Preview Available Via
https://deploy-preview-856--docs-hugo.netlify.app

site/content/ai-suite/reference/importer/llm-configuration.md

bluepal-pavan-kothapalli · 2025-12-23T05:05:25Z

Thanks for your work, @nerpaula. I have added one comment; please resolve it. Otherwise, LGTM.
Also, there are some missing parameters that need to be included, such as vector_index_n_lists, vector_index_metric, and vector_index_use_hnsw. I would suggest getting a review from @aMahanna, because he is the one who implemented most of these parameters.

Full example JSON payload for ImportMultipleFilesRequest (/v1/import-multiple):

{
  "files": [
    {
      "name": "document1.txt",
      "content": "VGhpcyBpcyBkb2MxIGNvbnRlbnQgaW4gYmFzZTY0Lg==",
      "citable_url": "https://example.com/doc1"
    },
    {
      "name": "document2.pdf",
      "content": "VGhpcyBpcyBkb2MyIGNvbnRlbnQgaW4gYmFzZTY0Lg==",
      "citable_url": "https://example.com/doc2"
    }
  ],
  "store_in_s3": false,
  "batch_size": 1000,
  "enable_chunk_embeddings": true,
  "enable_edge_embeddings": true,
  "chunk_token_size": 1000,
  "chunk_overlap_token_size": 200,
  "entity_types": [
    "PERSON",
    "ORGANIZATION",
    "LOCATION",
    "TECHNOLOGY"
  ],
  "relationship_types": [
    "RELATED_TO",
    "PART_OF",
    "USES",
    "LOCATED_IN"
  ],
  "community_report_num_findings": "5-10",
  "community_report_instructions": "Focus on key entities, relationships, and risk-related findings.",
  "partition_id": "my_partition_id_001",
  "enable_semantic_units": true,
  "process_images": true,
  "store_image_data": true,
  "chunk_min_token_size": 50,
  "chunk_custom_separators": [
    "\n\n",
    "---",
    "###"
  ],
  "preserve_chunk_separator": true,
  "smart_graph_attribute": "region",
  "shard_count": 3,
  "is_disjoint": false,
  "satellite_collections": [
    "sat_col_1",
    "sat_col_2"
  ],
  "enable_strict_types": true,
  "entity_extract_max_gleaning": 1,
  "vector_index_n_lists": 2048,
  "vector_index_metric": "cosine",
  "vector_index_use_hnsw": true,
  "enable_community_embeddings": true
}
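
A minimal sketch of how such a request body could be assembled in Python, assuming (as the sample values above suggest) that `content` holds the base64-encoded file bytes. The helper name `build_import_payload` is hypothetical, and leaving out `citable_url` assumes that field is optional.

```python
import base64
import json


def build_import_payload(files: dict[str, bytes], **options) -> dict:
    """Build a request body for POST /v1/import-multiple (hypothetical helper)."""
    return {
        "files": [
            # Each file's raw bytes are base64-encoded into the "content" field.
            {"name": name, "content": base64.b64encode(data).decode("ascii")}
            for name, data in files.items()
        ],
        # Remaining importer parameters are passed through unchanged.
        **options,
    }


payload = build_import_payload(
    {"document1.txt": b"This is doc1 content in base64."},
    batch_size=1000,
    enable_semantic_units=True,
)

# Serializing with json.dumps guarantees syntactically valid JSON.
body = json.dumps(payload, indent=2)
```

Building the body from a dictionary instead of a hand-written string avoids syntax slips such as trailing commas.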

diegomendez40

Thanks for your work, @nerpaula. I have added some comments. Please feel free to reach out for any clarification.

site/content/ai-suite/reference/importer/_index.md

diegomendez40 · 2025-12-23T13:53:23Z

site/content/ai-suite/reference/importer/importing-files.md

+## Multi-File Import
+
+Use multi-file import when you need to process multiple documents into a single 
+Knowledge Graph. This API provides streaming progress updates, making it 
+ideal for batch processing and long-running imports where you need to track progress.
+
+```
+POST /v1/import-multiple
+```
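
A minimal sketch of how a client might consume the streaming progress updates, assuming the endpoint emits standard server-sent events with `data:` lines. The JSON shape of each event (the `progress` field here) is hypothetical and not taken from the Importer docs; the parser works on any iterable of text lines, so no HTTP client is assumed.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops one leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event that is not terminated by a blank line is discarded.


# Hypothetical stream; the real event payloads may look different.
sample_stream = [
    'data: {"progress": 50}\n',
    "\n",
    ": keep-alive\n",
    'data: {"progress": 100}\n',
    "\n",
]
events = [json.loads(data) for data in iter_sse_data(sample_stream)]
```

In practice the lines would come from a streaming HTTP response rather than a list.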


There is no mention of a reference architecture or of AutoGraph. Our recommended way to ingest a huge corpus is to use AutoGraph to create different clusters (mini-topics), then ingest each of them with the multi-file importer into its own GraphRAG partition.

This is expected for now. The docs will be enhanced with a reference architecture and AutoGraph documentation; I will handle that in a separate task/PR.

diegomendez40 · 2025-12-23T13:57:58Z

site/content/ai-suite/reference/importer/parameters.md

+- `vector_index_metric`: Distance metric for vector similarity search. The supported values are `"cosine"` (default), `"l2"`, and `"innerProduct"`.
+- `vector_index_n_lists`: Number of lists for approximate search (optional). If not set, it is automatically computed as `8 * sqrt(collection_size)`. This parameter is ignored when using HNSW.
+- `vector_index_use_hnsw`: Whether to use HNSW (Hierarchical Navigable Small World) index instead of the default inverted index (default: `false`).
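
As a rough illustration of the `vector_index_n_lists` default described above, the following sketch computes `8 * sqrt(collection_size)`. The function name is hypothetical, and rounding up to an integer is an assumption; the quoted text does not say how the Importer rounds.

```python
import math
from typing import Optional


def default_n_lists(collection_size: int, use_hnsw: bool = False) -> Optional[int]:
    """Approximate the automatic n_lists value (hypothetical helper)."""
    if use_hnsw:
        # n_lists is ignored when the HNSW index is used.
        return None
    # Rounding up is an assumption, not confirmed by the docs.
    return math.ceil(8 * math.sqrt(collection_size))


n_lists = default_n_lists(65536)
```

Under this formula, a collection of 65,536 documents yields 2048 lists.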


I am not sure these params will make the February release. I will check and get back to you.

@diegomendez40 what's the status of these parameters?

diegomendez40 · 2025-12-23T13:59:13Z

site/content/ai-suite/reference/importer/parameters.md

+- `smart_graph_attribute`: SmartGraph attribute for graph sharding.
+- `shard_count`: Number of shards for the collections.
+- `is_disjoint`: Whether the graphs must be disjoint.
+- `satellite_collections`: An array of collection names to create as Satellite Collections.


I am not sure these params will make the February release. I will check and get back to you.

@diegomendez40 what's the status of these parameters?

diegomendez40 · 2025-12-23T14:02:00Z

site/content/ai-suite/reference/importer/semantic-units.md

+## Performance Considerations
+
+### Size Guidelines
+
+- **Small Documents** (< 1MB): All features enabled with minimal impact.
+- **Medium Documents** (1-10MB): Consider disabling `store_image_data` for large images.
+- **Large Documents** (> 10MB): Use `enable_semantic_units=true, process_images=false, store_image_data=false` for basic URL extraction.
+
+### LLM Compatibility
+
+The semantic units processing works with all LLM providers:
+- **OpenAI**: GPT-4o, GPT-4o-mini (all models supported).
+- **OpenRouter**: Gemini Flash, Claude Sonnet (all models supported).
+- **Triton**: Mistral-Nemo-Instruct (all models supported).


Where did this come from? I don't think most of this information is correct. Semantic Units require multimodal LLMs.

Comes from here: https://github.com/arangoml/graphrag_importer/blob/main/docs/user_facing_documentation.md#performance-considerations

@diegomendez40 can you please provide the correct information?

diegomendez40 · 2025-12-23T14:05:03Z

site/content/ai-suite/reference/importer/verify-and-explore.md

+### Semantic Units Collection
+
+- **Purpose**: Stores semantic units extracted from documents, including image
+  references and web URLs. This collection is only created when `enable_semantic_units`
+  is set to `true`.
+- **Key Fields**:
+  - `_key`: Unique identifier for the semantic unit.
+  - `type`: Type of semantic unit (always "image" for image references).
+  - `image_url`: URL or reference to the image/web resource.
+  - `is_storage_url`: Boolean indicating if the URL is a storage URL (base64/S3) or web URL.
+  - `import_number`: Import batch number for tracking.
+  - `source_chunk_id`: Reference to the chunk where this semantic unit was found.
+
+{{< info >}}
+Learn more about semantic units in the [Semantic Units guide](semantic-units.md).
+{{< /info >}}
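
For readers who consume this collection from code, the key fields listed above could be modeled roughly as follows. This is a sketch only: the class name, the Python field types (for example, `import_number` as an integer), and the sample document values are assumptions and are not taken from the Importer implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticUnit:
    """Typed view of a semantic unit document (field types are assumed)."""

    key: str
    type: str
    image_url: str
    is_storage_url: bool
    import_number: int
    source_chunk_id: str

    @classmethod
    def from_document(cls, doc: dict) -> "SemanticUnit":
        # Map the stored field names onto the dataclass attributes.
        return cls(
            key=doc["_key"],
            type=doc["type"],
            image_url=doc["image_url"],
            is_storage_url=doc["is_storage_url"],
            import_number=doc["import_number"],
            source_chunk_id=doc["source_chunk_id"],
        )


# Hypothetical document, for illustration only.
unit = SemanticUnit.from_document({
    "_key": "su1",
    "type": "image",
    "image_url": "https://example.com/figure.png",
    "is_storage_url": False,
    "import_number": 1,
    "source_chunk_id": "chunk-42",
})
```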


If this is intended for immediate release, it makes sense. If we want to use it for the February release, then we should also add that Semantic Units can store other sources as well, such as DB entities via the VirtualGraph.

This PR is intended for immediate release. VirtualGraph will be treated in a separate task/PR.

Co-authored-by: Diego M-R <[email protected]>

site/content/ai-suite/reference/importer/verify-and-explore.md

bluepal-pavan-parakala

Thank you for your work, @nerpaula. I have added some comments. Please consider them.

site/content/ai-suite/reference/importer/importing-files.md

site/content/ai-suite/reference/importer/llm-configuration.md

bluepal-pavan-parakala · 2025-12-30T05:39:09Z

site/content/ai-suite/reference/importer/parameters.md

+  "vector_index_metric": "cosine",
+  "vector_index_use_hnsw": true,
+  "batch_size": 500
+}


Some defaults are mentioned inconsistently:

- batch_size: Default 1000 (mentioned in some places, not others)
- chunk_token_size: Default 1200 (mentioned)
- chunk_overlap_token_size: Default 100 (mentioned)
- entity_types: Default ["person", "organization", "geo", "event"] (mentioned)
- community_report_num_findings: Default "5-10" (mentioned)

@bluepal-pavan-parakala @diegomendez40 How can I see the default values of all parameters? How can I know whether a parameter is required or optional? I don't see this information in the proto files, and I don't know how to find it in the source code. This makes it a bit hard to ensure that the docs accurately reflect the implementation.

@nerpaula, @hkernbach has created https://effective-barnacle-wr1zqer.pages.github.io/ where we can find the proto definition descriptions for all the services.

site/content/ai-suite/reference/importer/llm-configuration.md

bluepal-pavan-parakala · 2025-12-30T05:45:30Z

site/content/ai-suite/reference/importer/verify-and-explore.md

+3. Verify that the following collections exist:
+   - `knowledge_graph_vertices`: Contains the nodes of the knowledge graph i.e. documents, chunks, communities, and entities.
+   - `knowledge_graph_edges`: Contains the relationships between nodes i.e. relations.
+


Please consider adding a section explaining:

- Which collections are vertices vs edges
- How the graph structure works
- Key generation and uniqueness

@bluepal-pavan-parakala I have added which collections are vertex/edge collections. Regarding graph structure and key generation, please elaborate on what you think needs to be added. As far as I know, the generated KG is a general graph, and the concept and data structure are described in the ArangoDB manual. As for keys, I see that the collections use traditional key generators (again, this is described in the general data structure concepts of ArangoDB). Is there anything else in particular?

site/content/ai-suite/reference/importer/verify-and-explore.md

add multi-file import, semantic units, restructure content

dbf1a3f

cla-bot bot added the cla-signed label Dec 22, 2025

nerpaula self-assigned this Dec 22, 2025

nerpaula requested a review from diegomendez40 December 22, 2025 12:40

diegomendez40 requested review from bluepal-pavan-kothapalli and bluepal-pavan-parakala December 22, 2025 14:12

bluepal-pavan-kothapalli reviewed Dec 23, 2025

site/content/ai-suite/reference/importer/llm-configuration.md (resolved)

bluepal-pavan-kothapalli reviewed Dec 23, 2025

site/content/ai-suite/reference/importer/llm-configuration.md (resolved)

bluepal-pavan-kothapalli reviewed Dec 23, 2025

site/content/ai-suite/reference/importer/llm-configuration.md (outdated, resolved)

bluepal-pavan-kothapalli reviewed Dec 23, 2025

site/content/ai-suite/reference/importer/llm-configuration.md (resolved)

nerpaula added 2 commits December 23, 2025 12:07

address review comments

66ef4bf

fix broken links

560e978

diegomendez40 reviewed Dec 23, 2025

Apply suggestions from code review

cb6ea66

Co-authored-by: Diego M-R <[email protected]>

bluepal-pavan-parakala reviewed Dec 30, 2025

site/content/ai-suite/reference/importer/verify-and-explore.md (resolved)

bluepal-pavan-parakala reviewed Dec 30, 2025

nerpaula and others added 2 commits December 30, 2025 13:07

Merge branch 'main' into DOC-829

4af894a

review

d3f79fd
