Feasibility of adding TextComposer::ComposeJSON to output JSON bounding boxes 

Hi,
I have been shredding PDFs for some time using a variety of tools (including pdftohtml) and only recently discovered your library. I really like the API and the CMake-based build system, which makes it very easy to incorporate into projects. Thanks!

I have a question about how best to add a feature that returns the text elements as a JSON representation of their bounding boxes. I plan to package this as a SQLite extension that would let you query the text contents of a PDF.

Here is a sample of the XML generated by `pdftohtml`. I find this schema fine as-is and would like to produce something similar with your library, taking advantage of how much easier `PDFHummus` is to incorporate than `pdftohtml`. I assume this work is not particularly difficult, since it can build on what has already been done, but I wanted to get your advice before jumping into it.

```xml
<pdf2xml producer="poppler" version="20.11.0">
<page number="10" position="absolute" top="0" left="0" height="1188" width="918">
        <fontspec id="0" size="54" family="ArialMT" color="#b6b6b6"/>
        <fontspec id="1" size="16" family="ArialMT" color="#000000"/>
        <fontspec id="2" size="23" family="TrebuchetMS" color="#000000"/>
        <fontspec id="3" size="16" family="Arial" color="#000000"/>
        <fontspec id="4" size="16" family="Arial" color="#999999"/>
<text top="11" left="459" width="15" height="60" font="0"> </text>
<text top="109" left="108" width="701" height="18" font="1">began in the 1960s with merely identifying the systems from space. Researchers then</text>
<text top="131" left="108" width="701" height="18" font="1">developed a technique to estimate intensity from the storm cloud structure and lifetime. See</text>
```

Let me know if you have any thoughts on whether to return a string (either a single document or one document per page) or an `nlohmann::json` data structure. Because the JSON support in SQLite is very good, I find it easy to write simple scalar functions that return a blob of JSON and then convert that into a table-like value via a Common Table Expression.
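To make the request concrete, here is a minimal sketch of the one-document-per-page string variant. Everything in it is an assumption for illustration: `TextBox`, `EscapeJson`, and `PageToJson` are hypothetical names, not part of PDFHummus, and the field names simply mirror the `<text>` attributes in the `pdftohtml` sample above. It uses only the standard library; an `nlohmann::json` version would replace the manual escaping.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical record mirroring one <text> element of the pdftohtml XML.
// Not a PDFHummus type; it only illustrates the proposed output shape.
struct TextBox {
    int top, left, width, height;
    int font;          // index of the font spec, as in the XML sample
    std::string text;  // assumed to be UTF-8
};

// Escape a UTF-8 string for use inside a JSON string literal.
std::string EscapeJson(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters need a \u00XX escape.
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x",
                                  static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Serialize one page as a single JSON document (hypothetical schema).
std::string PageToJson(int pageNumber, const std::vector<TextBox>& boxes) {
    std::string out =
        "{\"page\":" + std::to_string(pageNumber) + ",\"text\":[";
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        const TextBox& b = boxes[i];
        if (i > 0) out += ',';
        out += "{\"top\":" + std::to_string(b.top) +
               ",\"left\":" + std::to_string(b.left) +
               ",\"width\":" + std::to_string(b.width) +
               ",\"height\":" + std::to_string(b.height) +
               ",\"font\":" + std::to_string(b.font) +
               ",\"text\":\"" + EscapeJson(b.text) + "\"}";
    }
    out += "]}";
    return out;
}
```

A scalar SQLite function could return the result of something like `PageToJson` as text, and the `text` array could then be expanded into rows with SQLite's `json_each` inside a Common Table Expression.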

I am delighted to have found your library and to start using it.
