2 changes: 1 addition & 1 deletion open-source/ingestion/supported-file-types.mdx
@@ -4,6 +4,6 @@ title: Supported file types

The Unstructured Ingest CLI and Unstructured Ingest Python library support processing of the following file types:

import SupportedFileTypesPlatform from '/snippets/general-shared-text/supported-file-types-platform.mdx';
import SupportedFileTypesPlatform from '/snippets/general-shared-text/supported-file-types-open-source.mdx';

<SupportedFileTypesPlatform />
@@ -7,13 +7,14 @@ strategies other than **Auto** for sets of documents of different types could pr
including reduction in transformation quality.

- **VLM**: For the highest-quality transformation of these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.
- **High Res**: For all other [supported file types](/ui/supported-file-types), and for the generation of bounding box coordinates.
- **High Res**: For all other [supported file types](/ui/supported-file-types) except video and audio files, and for the generation of bounding box coordinates.
- **Fast**: For text-only documents.
- **Multimedia**: For video and audio files.

The **Auto** partitioning strategy routes each file as a complete unit to the appropriate partitioning strategy (**VLM**, **High Res**, or **Fast**)
The **Auto** partitioning strategy routes each file as a complete unit to the appropriate partitioning strategy (**VLM**, **High Res**, **Fast**, or **Multimedia**)
based on the preceding file types. Additionally, for `.pdf` files, the **Auto** partitioning strategy routes these files' pages
on a page-by-page basis, as follows:

- A page is routed to **Fast** when it contains only embedded text and no images or tables are detected.
- All other kinds of pages are routed to **VLM** or **High Res**, depending on the complexity of a page's
content. Unstructured constantly optimizes its proprietary algorithm for routing to **VLM** or **High Res** in these cases.
content. Unstructured constantly optimizes its proprietary algorithm for routing to **VLM** or **High Res** in these cases.
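
As a rough illustration of the file-level routing described above, the following sketch maps a file extension to a strategy name. The function name `route_file`, the multimedia and text-only extension sets, and the fallback to **High Res** are assumptions made for illustration only. This is not Unstructured's routing algorithm, and the page-by-page routing of `.pdf` files is not modeled.

```python
from pathlib import Path

# Extensions that the text above lists for the VLM strategy.
VLM_EXTENSIONS = {".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp"}

# Assumption: a small illustrative subset of video and audio extensions.
MULTIMEDIA_EXTENSIONS = {".flac", ".mov", ".mp3", ".mp4", ".wav", ".webm"}

# Assumption: which extensions count as "text-only" is not stated in the text.
TEXT_ONLY_EXTENSIONS = {".md", ".txt"}


def route_file(filename: str) -> str:
    """Return a strategy name for a whole file, based only on its extension."""
    extension = Path(filename).suffix.lower()
    if extension in MULTIMEDIA_EXTENSIONS:
        return "Multimedia"
    if extension in VLM_EXTENSIONS:
        return "VLM"
    if extension in TEXT_ONLY_EXTENSIONS:
        return "Fast"
    # Assumption: every other supported file type falls through to High Res.
    return "High Res"
```

For example, `route_file("clip.MP4")` falls into the multimedia branch because the extension is lowercased before the lookup.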
20 changes: 19 additions & 1 deletion snippets/general-shared-text/supported-file-types-platform.mdx
@@ -2,6 +2,8 @@ By file extension:

| File extension |
| --- |
| `.3gp` |
| `.aac` |
| `.abw` |
| `.bmp` |
| `.csv` |
@@ -16,21 +18,32 @@ By file extension:
| `.epub` |
| `.et` |
| `.eth` |
| `.flac` |
| `.flv` |
| `.fods` |
| `.heic` |
| `.htm` |
| `.htm` |
| `.html` |
| `.hwp` |
| `.jpeg` |
| `.jpg` |
| `.m4a` |
| `.md` |
| `.mcw` |
| `.mov` |
| `.mp3` |
| `.mp4` |
| `.mpeg` |
| `.mpg` |
| `.msg` |
| `.mw` |
| `.odt` |
| `.ogg` |
| `.opus` |
| `.org` |
| `.p7s` |
| `.pbd` |
| `.pcm` |
| `.pdf` |
| `.png` |
| `.pot` |
@@ -45,6 +58,9 @@ By file extension:
| `.tiff` |
| `.txt` |
| `.tsv` |
| `.wav` |
| `.webm` |
| `.wmv` |
| `.xls` |
| `.xlsx` |
| `.xml` |
@@ -54,6 +70,7 @@ By file type:

| Category | File types |
| --- | --- |
| Audio | `.aac`, `.flac`, `.m4a`, `.mp3`, `.mp4`, `.ogg`, `.opus`, `.pcm`, `.wav`, `.webm` |
| Apple | `.cwk`, `.mcw` |
| CSV | `.csv` |
| Data Interchange | `.dif`* |
@@ -74,6 +91,7 @@ By file type:
| Spreadsheet | `.et`, `.fods`, `.mw`, `.xls`, `.xlsx` |
| StarOffice | `.sxg` |
| TSV | `.tsv` |
| Video | `.3gp`, `.flv`, `.mov`, `.mp4`, `.mpeg`, `.mpg`, `.webm`, `.wmv` |
| Word processing | `.abw`, `.doc`, `.docx`, `.dot`, `.dotm`, `.hwp`, `.zabw` |
| XML | `.xml` |
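
The Audio and Video rows in the table above overlap: `.mp4` and `.webm` appear in both. The sketch below, which copies the two extension lists from the table, shows one way to look up every category that applies to a file. The function name `media_categories` is hypothetical and is not part of any Unstructured API.

```python
from pathlib import Path

# Extension sets copied from the Audio and Video rows of the table above.
AUDIO_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".pcm", ".wav", ".webm"}
VIDEO_EXTENSIONS = {".3gp", ".flv", ".mov", ".mp4", ".mpeg", ".mpg", ".webm", ".wmv"}


def media_categories(filename: str) -> list[str]:
    """Return every multimedia category whose extension list contains the file's extension."""
    extension = Path(filename).suffix.lower()
    categories = []
    if extension in AUDIO_EXTENSIONS:
        categories.append("Audio")
    if extension in VIDEO_EXTENSIONS:
        categories.append("Video")
    return categories


# Extensions that are listed under both categories.
SHARED_EXTENSIONS = sorted(AUDIO_EXTENSIONS & VIDEO_EXTENSIONS)
```

Because a container format such as `.mp4` can hold audio only or audio plus video, the extension alone does not determine which category applies; the lookup returns both.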

53 changes: 36 additions & 17 deletions ui/document-elements.mdx
@@ -52,23 +52,29 @@ of the file and not care about its headers and footers. You can easily filter ou
Here are some examples of the element types your file might contain:

| Element type | Description |
|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Address` | A text element for capturing physical addresses. |
| `CodeSnippet` | A text element for capturing code snippets. |
| `EmailAddress` | A text element for capturing email addresses. |
| `FigureCaption` | An element for capturing text associated with figure captions. |
| `Footer` | An element for capturing document footers. |
| `FormKeysValues` | An element for capturing key-value pairs in a form. |
| `Formula` | An element containing formulas in a file. |
| `Header` | An element for capturing document headers. |
| `Image` | A text element for capturing image metadata. |
| `ListItem` | `ListItem` is a `NarrativeText` element that is part of a list. |
| `NarrativeText` | `NarrativeText` is an element consisting of multiple, well-formulated sentences. This excludes elements such titles, headers, footers, and captions. |
| `PageBreak` | An element for capturing page breaks. |
| `PageNumber` | An element for capturing page numbers. |
| `Table` | An element for capturing tables. |
| `Title` | A text element for capturing titles. |
| `UncategorizedText` | Base element for capturing free text from within files. Applies to extracted text not associated with bounding boxes if the input is a PDF file. |
|--------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Address` | A text element for capturing physical addresses. |
| `CodeSnippet` | A text element for capturing code snippets. |
| `EmailAddress` | A text element for capturing email addresses. |
| `FigureCaption` | An element for capturing text associated with figure captions. |
| `Footer` | An element for capturing document footers. |
| `FormKeysValues` | An element for capturing key-value pairs in a form. |
| `Formula` | An element containing formulas in a file. |
| `Header` | An element for capturing document headers. |
| `Image` | A text element for capturing image metadata. |
| `ListItem` | `ListItem` is a `NarrativeText` element that is part of a list. |
| `NarrativeText` | `NarrativeText` is an element consisting of multiple, well-formulated sentences. This excludes elements such as titles, headers, footers, and captions. |
| `PageBreak` | An element for capturing page breaks. |
| `PageNumber` | An element for capturing page numbers. |
| `SceneDescription` | An element for capturing scene descriptions, for example a description of a scene in a video. |
| `Table` | An element for capturing tables. |
| `Title` | A text element for capturing titles. |
| `TranscriptFragment` | An element for capturing transcription of speech, for example a speaker's words in an audio clip or video. |
| `UncategorizedText` | Base element for capturing free text from within files. Applies to extracted text not associated with bounding boxes if the input is a PDF file. |

<Note>
`SceneDescription` and `TranscriptFragment` are specific to video and audio file processing, which is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
</Note>

If you apply chunking, you will also see the `CompositeElement` type.
`CompositeElement` is a chunk formed from text (non-`Table`) elements.
@@ -187,6 +193,19 @@ file.
Headers and footers in Word files include a `header_footer_type` indicating which page a header or footer applies to.
Valid values are `"primary"`, `"even_only"`, and `"first_page"`.

#### Video files

Elements for video files include a `start_time` and `end_time`, which mark the start and end of the clip within the
parent video file that the element describes. Also included are `model_version`, which identifies the model that
generated the element, and `average_log_probability`, the model's average confidence in its output across the document.
Values closer to zero indicate higher confidence.
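
To see why values closer to zero mean higher confidence, it can help to convert an average log probability back to a probability. The sketch below assumes the value is a natural-log average, which the text above does not state; the function name `log_probability_to_probability` is hypothetical.

```python
import math


def log_probability_to_probability(average_log_probability: float) -> float:
    """Convert an average natural-log probability to a probability in (0, 1].

    Assumption: the metadata value uses the natural logarithm. Under that
    assumption, exponentiating the average gives the geometric mean of the
    underlying probabilities.
    """
    return math.exp(average_log_probability)
```

Under this assumption, a value of `0.0` corresponds to a probability of 1, and increasingly negative values correspond to probabilities closer to 0.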

#### Audio files

Elements for audio files include a `start_time`, `end_time`, and `speaker`, representing the start and end times of a clip of audio
made by a specific speaker, as part of the parent audio file to which this element belongs.
If the speaker cannot be determined, `speaker` is set to `0` or `unknown`.
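
The fields described above can be used to group a transcript by speaker. The following is a minimal sketch, assuming that elements are available as Python dictionaries, that `start_time`, `end_time`, and `speaker` sit under a `metadata` key, and that transcribed speech uses the `TranscriptFragment` type. The sample data and the function name `transcript_by_speaker` are hypothetical.

```python
from collections import defaultdict


def transcript_by_speaker(elements):
    """Group TranscriptFragment elements by speaker as (start, end, text) tuples."""
    grouped = defaultdict(list)
    for element in elements:
        if element.get("type") != "TranscriptFragment":
            continue
        metadata = element.get("metadata", {})
        # Normalize to a string so that 0 and "unknown" can both be used as keys.
        speaker = str(metadata.get("speaker", "unknown"))
        grouped[speaker].append(
            (metadata.get("start_time"), metadata.get("end_time"), element.get("text", ""))
        )
    return dict(grouped)


# Hypothetical sample data; real output may be structured differently.
sample_elements = [
    {"type": "TranscriptFragment", "text": "Welcome, everyone.",
     "metadata": {"start_time": 0.0, "end_time": 2.5, "speaker": "1"}},
    {"type": "TranscriptFragment", "text": "Thanks for having me.",
     "metadata": {"start_time": 2.5, "end_time": 4.0, "speaker": "unknown"}},
    {"type": "Title", "text": "Agenda", "metadata": {}},
    {"type": "TranscriptFragment", "text": "Let's begin.",
     "metadata": {"start_time": 4.0, "end_time": 5.2, "speaker": "1"}},
]
```

Calling `transcript_by_speaker(sample_elements)` skips the `Title` element and returns one list of fragments per speaker label, in the order the fragments appear.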

### Table-specific metadata

For `Table` elements, the raw text of the table will be stored in the `text` attribute for the element, and HTML representation
1 change: 1 addition & 0 deletions ui/workflows.mdx
@@ -69,6 +69,7 @@ By default, this workflow partitions, chunks, and generates embeddings as follow
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
- If the page or document is a video or audio file, **Multimedia** partitioning is used.

[Learn about partitioning strategies](/ui/partitioning).
