Reading htm, chm, epub files from disk, and images containing text/tables/diagrams #32

@Boscop

Thanks for making this useful library! 🙂
I'm wondering where I can find the docs, so that I can see how to use it for each data source.

Btw, it lists web scraping but what about reading .htm/.html files or images from disk?

And what about files that are very similar to htm files, like chm and epub?

(In my use case I need to ingest a lot of .htm files from disk, as well as chm files and images & PDF files that contain schematics and tables in embedded images, and convert them all into vector embeddings. Each image should be converted to alt text, or to a markdown table if it contains a table.)

I'm also curious, what about reading DJVU files (which are similar to scanned PDF files)?

Thanks 🙏
