feat(core): add PDF text extraction to read_file tool #3202
Closed
scrollDynasty wants to merge 3 commits into QwenLM:main from
…ted and image-only PDFs
- Use pdf-parse to extract text from PDF files in read_file tool
- Handle encrypted/password-protected PDFs with clear error message
- Handle image-only/scanned PDFs with OCR suggestion
- Handle corrupted files gracefully
- Remove PDF from media modality check (text extraction works universally)
- Add unit tests for PDF reading scenarios
Closes QwenLM#1149
Collaborator
Thanks for working on this, @scrollDynasty! PDF support is definitely a gap we need to fill. We actually have an overlapping PR in #3160 that covers the same problem: making PDFs readable for text-only models. That one takes a slightly different approach (system …
Appreciate you taking the time to put this together; the error handling for encrypted and image-only PDFs is a nice touch. Going to close this one in favor of #3160.
Summary
Implements PDF text extraction support for the read_file tool.
Closes #1149
Problem
Previously, attempting to read a .pdf file returned binary content or an error about unsupported modality, making PDFs completely unusable in Qwen Code.
Solution
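- Detect the .pdf extension in processSingleFileContent
- Use pdf-parse to extract text content from the binary file
- Remove PDF from mediaModalityKey(): text extraction works regardless of model modality
Edge Cases Handled
- Encrypted/password-protected PDFs: clear error message
- Image-only/scanned PDFs: OCR suggestion
- Corrupted files: handled gracefully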
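The extraction flow and edge cases described in this PR could be sketched roughly as below. This is a minimal illustration, not the PR's actual code: `isPdfPath`, `extractPdfText`, `PdfParser`, and `PdfReadResult` are hypothetical names, the parser is injected so the sketch runs without dependencies, and the assumption that an encrypted file surfaces as an error whose message mentions "password" or "encrypt" is a guess about parser behavior.

```typescript
// Hypothetical sketch; none of these names come from the PR itself.

// Injected parser: takes raw PDF bytes and resolves to the extracted text.
type PdfParser = (data: Uint8Array) => Promise<{ text: string }>;

type PdfReadResult =
  | { ok: true; text: string }
  | {
      ok: false;
      reason: 'encrypted' | 'image-only' | 'corrupted';
      message: string;
    };

// Case-insensitive check on the file extension.
function isPdfPath(filePath: string): boolean {
  return filePath.toLowerCase().endsWith('.pdf');
}

async function extractPdfText(
  data: Uint8Array,
  parse: PdfParser,
): Promise<PdfReadResult> {
  let text: string;
  try {
    text = (await parse(data)).text;
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    // Assumption: protected files fail with a message naming the cause.
    if (/password|encrypt/i.test(detail)) {
      return {
        ok: false,
        reason: 'encrypted',
        message: 'PDF is encrypted or password-protected and cannot be read.',
      };
    }
    // Any other parse failure is reported as a corrupted file.
    return {
      ok: false,
      reason: 'corrupted',
      message: `Could not parse PDF: ${detail}`,
    };
  }

  // Parsing succeeded but yielded no text: likely a scanned document.
  if (text.trim().length === 0) {
    return {
      ok: false,
      reason: 'image-only',
      message: 'No extractable text found; the PDF may be scanned. Try OCR.',
    };
  }
  return { ok: true, text };
}
```

In real use, `parse` would presumably wrap pdf-parse, for example by passing a `Buffer` to its default export and reading the `text` field of the result. Injecting the parser also keeps the encrypted, image-only, and corrupted paths easy to unit test with stubs.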
Testing
Manually verified before/after:
"I cannot directly read PDF files"
Unit tests added:
All checks pass:
- npm run test (113 tests pass)
- npm run lint:ci
- npm run build
- npm run typecheck