Unable to extract entire figure from PDF academic paper #4584
Description of the bug

I am working with a PDF export of the paper "PaSa: An LLM Agent for Comprehensive Academic Paper Search" from https://arxiv.org/abs/2501.10120. PyMuPDF is failing to import an entire figure from it.
Can PyMuPDF support figures better from academic papers?

How to reproduce the bug

With Python 3.13.2:

import pathlib
import pymupdf
from pydantic import BaseModel, Field, JsonValue

THIS_DIR = pathlib.Path(__file__).parent


class ParsedImage(BaseModel):
    """Raw image parsed from a document's page."""

    index: int = Field(description="Index of the image in a given page.")
    data: bytes = Field(
        description="Raw image, ideally directly savable to an image file."
    )
    info: dict[str, JsonValue | tuple[float, ...] | bytes] = Field(
        default_factory=dict, description="Optional image metadata."
    )


content: dict[str, tuple[str, list[ParsedImage]]] = {}
with pymupdf.open(THIS_DIR / "pasa.pdf") as file:
    for i in range(file.page_count):
        page = file.load_page(i)
        content[str(i + 1)] = page.get_text("text", sort=True), [
            ParsedImage(
                index=img_index,
                data=file.extract_image(img_info["xref"])["image"],
                info=img_info,
            )
            for img_index, img_info in enumerate(
                # Extract images all at once using get_image_info()
                page.get_image_info(hashes=True, xrefs=True)
            )
        ]

assert content["1"][1], "Expected image on page 1 to be present"
assert len(content["2"][1]) < 5, "Expected figure 1 to be read-in cohesively"

PyMuPDF version
1.26.1

Operating system
MacOS

Python version
3.13
This is not an issue, but a typical Discussions item. Transferring ...
What do you mean by "metadata" of
The objects you are referring to are not images but vector graphics, sometimes with overlaid text fragments.
Vector graphics cannot be extracted as such, at least not in the way you seem to be interested in.
You apparently would like to store them as PNG / JPEG images.
PyMuPDF lets you find, extract and group neighboring vector graphics on a page. The Page method for this is called cluster_drawings(). It returns a list of rectangles, each covering one such graphic. You can then make a "photo" of the corresponding page area (at any desired resolution) and store it as an image. Here is a script that does this for the first two pages:
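
As a minimal sketch of the approach described above (not necessarily the exact script from this reply): each rectangle returned by cluster_drawings() is passed as the clip area to get_pixmap(), and the resulting pixmap is saved as a PNG. The function names, the output file naming scheme, the 150 dpi default and the default tolerances of cluster_drawings() are assumptions made for illustration.

```python
from __future__ import annotations

import pathlib


def cluster_image_name(page_number: int, index: int) -> str:
    """Build a file name like 'page2-figure0.png' (hypothetical naming scheme)."""
    return f"page{page_number}-figure{index}.png"


def save_drawing_clusters(
    pdf_path: str | pathlib.Path,
    out_dir: str | pathlib.Path,
    max_pages: int = 2,
    dpi: int = 150,
) -> list[pathlib.Path]:
    """Render each vector-graphics cluster on the first pages to a PNG file."""
    import pymupdf  # imported here so the naming helper works without PyMuPDF

    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: list[pathlib.Path] = []
    with pymupdf.open(pdf_path) as doc:
        for page_index in range(min(max_pages, doc.page_count)):
            page = doc.load_page(page_index)
            # One rectangle per group of neighboring vector graphics.
            for index, rect in enumerate(page.cluster_drawings()):
                # "Photograph" only that area of the page at the chosen resolution.
                pix = page.get_pixmap(clip=rect, dpi=dpi)
                target = out_dir / cluster_image_name(page_index + 1, index)
                pix.save(str(target))
                saved.append(target)
    return saved
```

Calling save_drawing_clusters("pasa.pdf", "figures") would write one PNG per detected cluster on pages 1 and 2. If a figure ends up split into several clusters, raising the x_tolerance and y_tolerance arguments of cluster_drawings() may merge the pieces; suitable values depend on the document.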