Reading htm, chm, epub files from disk, and images containing text/tables/diagrams #32

Thanks for making this useful library! 🙂
I'm wondering where I can find the docs, so that I can see how to use it for each data source.

By the way, it lists web scraping, but what about reading .htm/.html files or images from disk?

And what about files that are very similar to htm files, like chm and epub?

In my use case, I need to ingest a lot of .htm files from disk, along with chm files and images and PDF files that contain schematics and tables in embedded images. Everything should end up as vector embeddings, with each image converted to alt text, or to a markdown table if it contains a table.
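To make the .htm part of the request concrete, here is a rough sketch of what I mean, using only the Python standard library. It is not this library's API: `html_to_text` and `read_htm_files` are hypothetical names I made up for illustration, and the sketch only reuses alt text that is already in the markup rather than generating it from the image.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text and existing <img> alt text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(f"[image: {alt}]")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Hypothetical helper: reduce an HTML string to plain text lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def read_htm_files(root):
    """Hypothetical helper: yield (path, text) for each .htm/.html file under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in (".htm", ".html"):
            raw = path.read_text(encoding="utf-8", errors="replace")
            yield path, html_to_text(raw)
```

The text yielded by `read_htm_files` is what I would then like to hand to the embedding step. Images without alt text, and tables inside images, would still need OCR or a vision model, which is the part I am hoping the library covers.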

I'm also curious: what about reading DjVu files (which are similar to scanned PDF files)?

Thanks 🙏