Unable to extract entire figure from PDF academic paper #4584

jamesbraza · 2025-07-01T23:22:56Z

jamesbraza
Jul 1, 2025

Description of the bug

I am working with a PDF export of the paper "PaSa: An LLM Agent for Comprehensive Academic Paper Search" from https://arxiv.org/abs/2501.10120.

PyMuPDF fails to extract two things as images:

The PaSa icon on page 1, which is not returned at all
Figure 1 on page 2, which is read in as many small, individual images

Can PyMuPDF better support figures from academic papers?

How to reproduce the bug

With Python 3.13.2, pymupdf==1.26.1, and pydantic==2.11.7:

import pathlib

import pymupdf
from pydantic import BaseModel, Field, JsonValue

THIS_DIR = pathlib.Path(__file__).parent


class ParsedImage(BaseModel):
    """Raw image parsed from a document's page."""

    index: int = Field(description="Index of the image in a given page.")
    data: bytes = Field(
        description="Raw image, ideally directly savable to an image file."
    )
    info: dict[str, JsonValue | tuple[float, ...] | bytes] = Field(
        default_factory=dict, description="Optional image metadata."
    )


content: dict[str, tuple[str, list[ParsedImage]]] = {}
with pymupdf.open(THIS_DIR / "pasa.pdf") as file:
    for i in range(file.page_count):
        page = file.load_page(i)
        content[str(i + 1)] = page.get_text("text", sort=True), [
            ParsedImage(
                index=img_index,
                data=file.extract_image(img_info["xref"])["image"],
                info=img_info,
            )
            for img_index, img_info in enumerate(
                # Extract images all at once using get_image_info()
                page.get_image_info(hashes=True, xrefs=True)
            )
        ]
assert content["1"][1], "Expected image on page 1 to be present"
assert len(content["2"][1]) < 5, "Expected figure 1 to be read-in cohesively"

PyMuPDF version

1.26.1

Operating system

MacOS

Python version

3.13

Answered by JorjMcKie

Jul 2, 2025


JorjMcKie · 2025-07-02T08:27:50Z

JorjMcKie
Jul 2, 2025
Maintainer

This is not an issue, but a typical Discussions item. Transferring ...

1 reply

jamesbraza Jul 2, 2025
Author

Sorry for this, appreciate your patience

JorjMcKie · 2025-07-02T09:13:55Z

JorjMcKie
Jul 2, 2025
Maintainer

The objects you are referring to are not images but vector graphics, sometimes with overlaid text fragments.
Vector graphics cannot be extracted as such, at least not in the way you seem to be interested in.
You presumably want to store them as PNG / JPEG images.

PyMuPDF lets you find and group neighboring vector graphics on a page. The Page method for this is called cluster_drawings().
It returns a list of rectangles, each covering one such graphic. You can then take a "photo" of the corresponding page area (at any desired resolution) and store it as an image. Here is a script that does this for the first two pages:

import pymupdf

doc = pymupdf.open("2501.10120v2.pdf")
for page in doc[:2]:  # first 2 pages only
    paths = page.get_drawings()  # extract drawings
    # cluster them
    clusters = page.cluster_drawings(drawings=paths)
    for i, box in enumerate(clusters):  # numbered list of areas
        # make a photo of the area on page
        pix = page.get_pixmap(clip=box, dpi=150)
        # save the photo
        pix.save(f"page-{page.number}-{i}.jpg")
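A possible variation on the script above, offered only as a sketch: it skips clusters below a minimum size and writes PNG instead of JPEG, since lossless output tends to suit line art. The helper names keep_figure_sized and render_figures and the 50-point thresholds are assumptions made for illustration; they are not part of PyMuPDF.

```python
from pathlib import Path


def keep_figure_sized(boxes, min_width=50.0, min_height=50.0):
    """Drop boxes smaller than the given size (in points).

    Each box is any 4-item sequence (x0, y0, x1, y1); a pymupdf.Rect
    unpacks the same way. The thresholds are illustrative guesses.
    """
    kept = []
    for box in boxes:
        x0, y0, x1, y1 = box
        if (x1 - x0) >= min_width and (y1 - y0) >= min_height:
            kept.append(box)
    return kept


def render_figures(pdf_path, out_dir, dpi=150):
    """Render every sufficiently large drawing cluster to a PNG file."""
    import pymupdf  # imported lazily so the filter above has no dependencies

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            clusters = page.cluster_drawings(drawings=page.get_drawings())
            for i, box in enumerate(keep_figure_sized(clusters)):
                # "photo" of the clustered area, as in the script above
                pix = page.get_pixmap(clip=box, dpi=dpi)
                target = out_dir / f"page-{page.number}-{i}.png"
                pix.save(str(target))
                written.append(target)
    return written
```

Calling render_figures("2501.10120v2.pdf", "figures") would process every page of the file. Whether the size filter helps depends on the document, so the thresholds should be tuned by looking at the output.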


1 reply

jamesbraza Jul 2, 2025
Author

Wow, cluster_drawings is super useful. Thank you, that's exactly what I needed. I will make a docs-contribution PR to improve the docs so that I could have found this method on my own.

I do have one follow-up question: is there an easy way to get metadata from a Pixmap?

Here's how I am doing it right now:

# Attributes of pymupdf.Pixmap that contain useful metadata
PYMUPDF_PIXMAP_ATTRS = {
    "alpha",
    "digest",
    "height",
    "irect",
    "is_monochrome",
    "is_unicolor",
    "n",
    "size",
    "stride",
    "width",
    "x",
    "xres",
    "y",
    "yres",
}

...

images: list[ParsedImage] = []
for box_i, box in enumerate(
    page.cluster_drawings(drawings=page.get_drawings())
):
    pix = page.get_pixmap(clip=box, dpi=150)
    images.append(
        ParsedImage(
            index=box_i,
            data=pix.tobytes(),
            info={"bbox": tuple(box)}
            | {attr: getattr(pix, attr) for attr in PYMUPDF_PIXMAP_ATTRS},
        )
    )
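The same collection step could be wrapped in a small helper. This is a sketch only: pixmap_info is a hypothetical name, not a PyMuPDF method, and it assumes that digest is a bytes value and that irect can be turned into a tuple, converting both so the resulting dict is easier to serialize.

```python
# Plain-valued Pixmap attributes, taken from the set listed above
PIXMAP_ATTRS = (
    "alpha",
    "height",
    "is_monochrome",
    "is_unicolor",
    "n",
    "size",
    "stride",
    "width",
    "x",
    "xres",
    "y",
    "yres",
)


def pixmap_info(pix, bbox):
    """Collect Pixmap attributes plus the source bbox into one dict.

    `pix` is expected to look like a pymupdf.Pixmap; `bbox` is the
    rectangle the pixmap was clipped from. Hypothetical helper.
    """
    info = {attr: getattr(pix, attr) for attr in PIXMAP_ATTRS}
    info["irect"] = tuple(pix.irect)  # integer rectangle as a plain tuple
    info["digest"] = pix.digest.hex()  # bytes digest as a hex string
    info["bbox"] = tuple(bbox)
    return info
```

In the loop above, the info argument would then become info=pixmap_info(pix, box).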

JorjMcKie · 2025-07-02T20:04:25Z

JorjMcKie
Jul 2, 2025
Maintainer

What do you mean by "metadata" of Pixmap?
Did you see the documentation? It is a class in PyMuPDF and as such fully documented there.

3 replies

jamesbraza Jul 2, 2025
Author

What do you mean by "metadata" of Pixmap?

By metadata I meant the attributes like bbox, alpha, x and y res. To elaborate, methods such as Page.get_images return metadata, but page.cluster_drawings does not.

It would be cool if you could do dict(pix) or pix.get_metadata to just get a dictionary containing bbox, alpha, x/y res, etc.

JorjMcKie Jul 2, 2025
Maintainer

No.
Page.get_images does not return metadata at all. It reports the image definitions contained in the object definition of a page (PDF only!). This does not even mean that the page actually displays these images. Nor does it mean that the list is complete: the page may display images that have no xref.
The authoritative, complete list of all images actually displayed is returned by page.get_image_info(). Each item does contain the complete image metadata.
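To see the difference on a concrete file, the two listings can be compared page by page. A minimal sketch, assuming that each get_images(full=True) entry is a tuple with the xref first and that get_image_info(xrefs=True) reports an xref of 0 for images that have none; compare_image_listings and report are illustrative names, not PyMuPDF functions.

```python
def compare_image_listings(listed, displayed):
    """Compare the two image listings of one page.

    listed:    tuples as returned by page.get_images(full=True), xref first
    displayed: dicts as returned by page.get_image_info(xrefs=True)
    """
    listed_xrefs = {item[0] for item in listed}
    displayed_xrefs = {d["xref"] for d in displayed if d["xref"] > 0}
    return {
        # defined in the page's objects but never shown on the page
        "listed_not_displayed": sorted(listed_xrefs - displayed_xrefs),
        # shown on the page but without an xref
        "displayed_without_xref": sum(1 for d in displayed if d["xref"] == 0),
    }


def report(pdf_path):
    """Print the comparison for every page of a PDF."""
    import pymupdf  # lazy import: the comparison itself needs no PDF library

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            result = compare_image_listings(
                page.get_images(full=True),
                page.get_image_info(xrefs=True),
            )
            print(page.number, result)
```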

Drawings on a page that look like one picture to your eyes are not connected to each other. They have no common, unified identity per se. All that cluster_drawings() does is look at the proximity of what may be hundreds of different lines, rectangles and curves. If they are close enough to each other, their little rectangles are joined.
This process continues until no more neighboring vectors are found. The resulting rectangle is returned. Even then, it is still just a best-effort guess: it may be incomplete or even wrong! That happens often enough. Ultimately it is just a rectangle, not a drawings object.
Therefore, there is no way on this earth to give this rectangle anything like additional "metadata".
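The joining process described above can be illustrated in plain Python. This is a simplified sketch of the general idea and not PyMuPDF's actual implementation; the function names and the tolerance values x_tol and y_tol are assumptions chosen for the example.

```python
def near(a, b, x_tol=3.0, y_tol=3.0):
    """True if rectangle b touches rectangle a grown by the tolerances."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return (
        ax0 - x_tol <= bx1
        and bx0 <= ax1 + x_tol
        and ay0 - y_tol <= by1
        and by0 <= ay1 + y_tol
    )


def union(a, b):
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_rects(rects, x_tol=3.0, y_tol=3.0):
    """Greedily join rectangles (x0, y0, x1, y1) that lie close together."""
    remaining = [tuple(r) for r in rects]
    clusters = []
    while remaining:
        current = remaining.pop(0)
        grew = True
        while grew:  # repeat until no more neighbors are found
            grew = False
            for other in remaining[:]:
                if near(current, other, x_tol, y_tol):
                    current = union(current, other)
                    remaining.remove(other)
                    grew = True
        clusters.append(current)
    return clusters
```

Note that the result carries nothing but coordinates: once the small rectangles are joined, no information about the individual paths survives in the returned rectangle.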

How you use this rectangle is entirely your decision.

You can extract all vectors on a page with page.get_drawings(). Its output is what cluster_drawings() uses for its algorithm.
I recommend inspecting this yourself. It will probably help you understand your misconception.
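One way to do that inspection is to tally what get_drawings() returns. The sketch below assumes that each entry is a dict with a "type" key and an "items" list whose tuples start with an operator code such as "l", "re" or "c"; summarize_drawings is a hypothetical helper name.

```python
from collections import Counter


def summarize_drawings(drawings):
    """Count paths, path types and draw operators in get_drawings() output."""
    path_types = Counter(d.get("type") for d in drawings)
    item_ops = Counter(
        item[0] for d in drawings for item in d.get("items", [])
    )
    return {
        "paths": len(drawings),
        "path_types": dict(path_types),
        "item_ops": dict(item_ops),
    }


def summarize_page(pdf_path, page_number=0):
    """Summarize the vector drawings of one page of a PDF."""
    import pymupdf  # lazy import: the tally itself needs no PDF library

    with pymupdf.open(pdf_path) as doc:
        return summarize_drawings(doc[page_number].get_drawings())
```

On a page holding a figure, the path count shows how many separate, unrelated vectors make up what looks like a single picture.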

jamesbraza Jul 2, 2025
Author

Okay I understand, thank you! Much appreciated 🙏
