Add Retry Mechanism with Exponential Backoff #19

Description

@69fu

Network operations and external API calls can fail transiently but are not retried, leading to:

  • Failed jobs from temporary network glitches
  • Rate limit errors from LLM providers
  • HuggingFace dataset download failures
  • Unnecessary manual re-runs

Current Behavior

# No retry - single failure kills the job
response = openai_client.chat.completions.create(...)

Proposed Solution

Use the tenacity library for declarative retry logic:

dependencies = [
    "tenacity>=8.2.3",
]

Implementation Example

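import logging

import openai
from openai import OpenAI
from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumes openai>=1.0, where transient failures surface as these exception types.
# APITimeoutError subclasses APIConnectionError; it is listed for clarity.
TRANSIENT_ERRORS = (
    openai.RateLimitError,
    openai.APITimeoutError,
    openai.APIConnectionError,
)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=2, min=2, max=10),  # doubling delay, clamped to 2-10s
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_llm_api(prompt: str, **kwargs) -> str:
    """Call the LLM API, retrying transient failures. Pass model=... via kwargs."""
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return response.choices[0].message.content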

Retry Strategies by Operation Type

LLM API Calls:

  • Max attempts: 3
  • Backoff: Exponential (2s, 4s), capped at 10s
  • Retry on: RateLimitError, Timeout, NetworkError
  • Don't retry: ValidationError, AuthError

Dataset Downloads:

  • Max attempts: 5
  • Backoff: Exponential (5s, 10s, 20s, 40s)
  • Retry on: ConnectionError, Timeout
  • Don't retry: DatasetNotFoundError

Document Parsing:

  • Max attempts: 2
  • Backoff: Fixed (1s)
  • Retry on: TemporaryFileError
  • Don't retry: CorruptedFileError
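
The backoff figures above can be checked with a little arithmetic. This is a minimal sketch, assuming the delay before retry n is multiplier * 2**(n - 1) clamped to [min_wait, max_wait], which is how tenacity's wait_exponential appears to behave in recent releases. The helper name backoff_schedule is hypothetical and not part of tenacity. Note that N attempts allow at most N - 1 waits.

```python
from typing import List


def backoff_schedule(
    max_attempts: int,
    multiplier: float = 1,
    min_wait: float = 0,
    max_wait: float = float("inf"),
) -> List[float]:
    """Return the waits between attempts (one fewer than max_attempts)."""
    waits = []
    for attempt in range(1, max_attempts):
        raw = multiplier * 2 ** (attempt - 1)  # doubles after every failed attempt
        waits.append(max(min_wait, min(raw, max_wait)))  # clamp to [min_wait, max_wait]
    return waits


# Dataset downloads: 5 attempts, starting at 5s.
print(backoff_schedule(5, multiplier=5, max_wait=60))  # [5, 10, 20, 40]
```

A schedule that starts at 2s therefore needs multiplier=2; with multiplier=1 and min_wait=2 the first two waits would both be 2s.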

Configuration

from dataclasses import dataclass

@dataclass
class RetryConfig:
    llm_max_attempts: int = 3
    llm_max_wait_seconds: int = 10
    download_max_attempts: int = 5
    download_max_wait_seconds: int = 60
    enable_retry: bool = True  # Global toggle

Example with Custom Logic

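from typing import Optional

import requests
from requests.exceptions import RequestException
from tenacity import retry, retry_if_result, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(5),
    retry=retry_if_result(lambda result: result is None),  # retry when every mirror failed
    wait=wait_exponential(max=30),
)
def download_with_fallback(url: str) -> Optional[bytes]:
    """Download with automatic failover to mirrors.

    Raises tenacity.RetryError if all 5 attempts return None.
    """
    # get_mirrors is a project helper (not shown) that yields candidate URLs.
    for mirror in get_mirrors(url):
        try:
            response = requests.get(mirror, timeout=30)
            response.raise_for_status()  # treat HTTP 4xx/5xx as a failed mirror
            return response.content
        except RequestException:
            continue
    return None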

Logging Integration

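import logging
from tenacity import after_log, before_log, retry, stop_after_attempt

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),  # without a stop condition, tenacity retries forever
    before=before_log(logger, logging.INFO),
    after=after_log(logger, logging.INFO),
)
def operation():
    ...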

Example output (illustrative; the exact wording comes from tenacity's log helpers):

INFO: Starting call to 'call_llm_api', attempt 1
WARNING: Retrying call_llm_api in 2.0 seconds (RateLimitError)
INFO: Starting call to 'call_llm_api', attempt 2
INFO: Finished call to 'call_llm_api' after 2 attempts
