@dosumis dosumis commented Nov 6, 2025

No description provided.

dosumis and others added 5 commits November 6, 2025 12:13
…lit_agent into stratified-validation-reporting
- Revert validation demo to a 100-URL sample size for cost control (the 377-URL run was 2.36x more expensive than expected)
- Fix unit test assertion to expect PDF_EXTRACTION instead of API_LOOKUP for PDFExtractor
- Maintain stratified validation reporting benefits while keeping performance manageable

The full 377-URL corpus proved much more complex than the sampled URLs:
- More PDFs requiring LLM extraction
- More web scraping fallbacks for failed Phase 1 extractions
- Research data repositories and paywalled content

A 100-URL sample still provides a statistically valid stratified analysis without the cost and runtime penalty of the full corpus.
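
For illustration, one way a 100-URL stratified sample could be drawn from the 377-URL corpus is sketched below. This is a minimal sketch, not the code in this PR: the function name `stratified_sample`, the choice of URL host as the stratum key, and the proportional allocation rule are all assumptions.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


def stratified_sample(urls, sample_size, key=lambda u: urlparse(u).netloc, seed=0):
    """Draw sample_size URLs, allocating slots to each stratum proportionally.

    Hypothetical helper: the stratum key defaults to the URL host, which is
    an assumption and not necessarily how the PR stratifies.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for url in urls:
        strata[key(url)].append(url)

    total = len(urls)
    if sample_size >= total:
        return list(urls)

    # Proportional allocation, rounded down per stratum.
    quotas = {k: len(v) * sample_size / total for k, v in strata.items()}
    counts = {k: int(q) for k, q in quotas.items()}

    # Largest-remainder rounding: hand leftover slots to the strata whose
    # quotas lost the most when rounded down, so counts sum to sample_size.
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1

    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, counts[k]))
    return sample
```

Calling `stratified_sample(all_urls, 100)` on the full corpus would keep each stratum's share of the sample close to its share of the corpus. Very small strata can receive zero slots under this rule, so a real report might enforce a minimum per stratum.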

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add domain blacklist for 8 problematic publishers (ScienceDirect, MDPI, Oxford Academic, etc.)
- Implement two-stage web search approach: general search + targeted PubMed search
- Add comprehensive fallback system with URL fragment extraction and metadata parsing
- Include NCBI E-utilities integration for complete identifier retrieval (PMID, DOI, PMC)
- Transform 66 blocked domain failures into potential successes
- Update validation demo path to validation_workspace/demo_reports
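
The blacklist and URL-based fallback described above could look roughly like the following. This is a hedged sketch under stated assumptions: the names `BLOCKED_DOMAINS`, `is_blocked`, `identifiers_from_url`, and `pubmed_search_url` are hypothetical, the blacklist lists only the three publishers named in the commit message (using their public hostnames), and the regular expressions are illustrative. The only external interface used is the public NCBI E-utilities `esearch.fcgi` endpoint, and the sketch builds its URL without sending a request.

```python
import re
from urllib.parse import unquote, urlencode, urlparse

# Assumed hostnames for the three publishers named above; the PR's actual
# blacklist has 8 entries that are not listed in the commit message.
BLOCKED_DOMAINS = {"sciencedirect.com", "mdpi.com", "academic.oup.com"}

# Illustrative patterns, not exhaustive identifier grammars.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+")
PMID_RE = re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)")
PMCID_RE = re.compile(r"\bPMC\d+\b", re.IGNORECASE)

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def is_blocked(url):
    """Return True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def identifiers_from_url(url):
    """Extract any DOI, PMID, or PMCID visible in the URL text itself."""
    text = unquote(url)
    found = {}
    doi = DOI_RE.search(text)
    if doi:
        found["doi"] = doi.group(0).rstrip("/.")
    pmid = PMID_RE.search(text)
    if pmid:
        found["pmid"] = pmid.group(1)
    pmcid = PMCID_RE.search(text)
    if pmcid:
        found["pmcid"] = pmcid.group(0).upper()
    return found


def pubmed_search_url(term):
    """Build (but do not send) an E-utilities esearch query against PubMed."""
    query = urlencode({"db": "pubmed", "term": term, "retmode": "json"})
    return ESEARCH_URL + "?" + query
```

In this sketch a blocked URL would skip direct fetching and go straight to `identifiers_from_url`; any DOI found there could then be passed to `pubmed_search_url` (for example as `"<doi>[doi]"`) to look up the matching PMID.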

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Removed Python 3.9 support
@dosumis dosumis merged commit 08afc09 into main Nov 7, 2025
4 checks passed
@dosumis dosumis deleted the stratified-validation-reporting branch November 7, 2025 06:52