Unable to extract entire figure from PDF academic paper #4584

Closed Answered by JorjMcKie
jamesbraza asked this question in Looking for help

The objects you are referring to are not images but vector graphics, sometimes with overlaid text fragments.
Vector graphics cannot be extracted as such, at least not in the way you seem to want.
You apparently would like to store them as PNG / JPEG images.

PyMuPDF lets you find and group neighboring vector graphics on a page. The Page method for this is cluster_drawings().
It returns a list of rectangles, each covering one such graphic. You can then take a "photo" of the corresponding page area (at any desired resolution) and store it as an image. Here is a script that does this for the first two pages:

import pymupdf

doc = pymupdf.open("2501.10120v2.pdf")

Answer selected by jamesbraza

This discussion was converted from issue #4583 on July 02, 2025 08:28.