StandardPdfPipeline.download_models_hf() method not found - Incompatibility with Docling >= 2.28.4 #60

@KabeerThockchom

Bug Description

When running ilab data generate --pipeline simple, the process fails with the following error:

failed to generate data with exception: type object 'StandardPdfPipeline' has no attribute 'download_models_hf'

Environment

InstructLab version: 0.26.1
instructlab-sdg version: 0.8.3
Docling version: 2.61.1 (required >= 2.28.4)
Platform: macOS (also affects other platforms)

Root Cause

StandardPdfPipeline.download_models_hf() has been removed from StandardPdfPipeline in newer versions of Docling (the failure above was observed with 2.61.1; the required minimum is 2.28.4). The method now exists only on LegacyStandardPdfPipeline, where it emits a deprecation warning.

According to Docling's deprecation notice, the recommended approach is to use:

docling.utils.model_downloader.download_models() (programmatic)
docling-tools models download (CLI)

Affected Code

File: chunkers.py (line ~142)

from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

if self.docling_model_path is None:
    logger.info("Docling models not found on disk, downloading models...")
    self.docling_model_path = StandardPdfPipeline.download_models_hf()

Proposed Fix

Option 1 (Recommended): Use the new Docling API

from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.utils.model_downloader import download_models

if self.docling_model_path is None:
    logger.info("Docling models not found on disk, downloading models...")
    self.docling_model_path = download_models(output_dir=None, force=False, progress=False)

Option 2 (Temporary): Use LegacyStandardPdfPipeline

from docling.pipeline.legacy_standard_pdf_pipeline import LegacyStandardPdfPipeline

if self.docling_model_path is None:
    logger.info("Docling models not found on disk, downloading models...")
    self.docling_model_path = LegacyStandardPdfPipeline.download_models_hf()
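
The two options can also be combined into a version-tolerant lookup. The sketch below is only an illustration, not the project's actual fix: find_docling_downloader and its import_module parameter are hypothetical names introduced here, and it assumes both download_models() and download_models_hf() can be called without arguments. It does not check whether the two functions return directories with the same layout.

```python
import importlib
from pathlib import Path
from typing import Callable, Optional


def find_docling_downloader(
    import_module: Callable = importlib.import_module,
) -> Optional[Callable[[], Path]]:
    """Return a zero-argument model downloader, or None if none is available.

    Hypothetical helper: `import_module` is injectable so the lookup can be
    exercised without Docling installed.
    """
    # Preferred: the module-level helper named in Docling's deprecation notice.
    try:
        module = import_module("docling.utils.model_downloader")
    except ImportError:
        module = None
    download_models = getattr(module, "download_models", None)
    if download_models is not None:
        return lambda: Path(download_models())

    # Fallback: older Docling releases that still expose the class method.
    try:
        module = import_module("docling.pipeline.standard_pdf_pipeline")
    except ImportError:
        module = None
    pipeline_cls = getattr(module, "StandardPdfPipeline", None)
    legacy_download = getattr(pipeline_cls, "download_models_hf", None)
    if legacy_download is not None:
        return lambda: Path(legacy_download())

    # Neither entry point was found.
    return None
```

In chunkers.py, the result would replace the direct call: if find_docling_downloader() returns None, raise a clear error; otherwise assign the value returned by the downloader to self.docling_model_path.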

Steps to Reproduce

1. Install InstructLab with Docling >= 2.28.4
2. Set up a taxonomy with knowledge documents
3. Run ilab data generate --pipeline simple
4. Observe the error
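
To check whether an environment is affected without running the full generation pipeline, a small diagnostic along these lines may help. It is a sketch: docling_status is a hypothetical helper name, and it only reports what it finds in the current Python environment.

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version


def docling_status() -> dict:
    """Report the installed Docling version and whether the old method exists."""
    status: dict = {"version": None, "has_download_models_hf": None}
    try:
        status["version"] = version("docling")
    except PackageNotFoundError:
        return status  # Docling is not installed in this environment.
    try:
        module = import_module("docling.pipeline.standard_pdf_pipeline")
    except ImportError:
        return status  # Installed, but the pipeline module could not be imported.
    pipeline_cls = getattr(module, "StandardPdfPipeline", None)
    # False here means the call in chunkers.py would raise the AttributeError above.
    status["has_download_models_hf"] = hasattr(pipeline_cls, "download_models_hf")
    return status


if __name__ == "__main__":
    print(docling_status())
```

A has_download_models_hf value of False indicates that the environment will hit this bug; None means Docling is missing or could not be imported.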

Impact

This bug prevents users from generating synthetic training data, blocking the entire InstructLab workflow for knowledge-based model fine-tuning.

Workaround

Until a fix is released, manually patch chunkers.py with one of the proposed fixes above (Option 1 is preferred).
