Network operations and external API calls can fail transiently but are not retried, leading to:
- Failed jobs from temporary network glitches
- Rate limit errors from LLM providers
- HuggingFace dataset download failures
- Unnecessary manual re-runs
Current Behavior
# No retry - single failure kills the job
response = openai_client.chat.completions.create(...)
Proposed Solution
Use the tenacity library for declarative retry logic:
dependencies = [
    "tenacity>=8.2.3",
]
Implementation Example
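import logging

from openai import APIConnectionError, APITimeoutError, OpenAI, RateLimitError
from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger(__name__)
client = OpenAI()  # assumes the openai>=1.0 client; reads OPENAI_API_KEY from the environment

@retry(
    stop=stop_after_attempt(3),
    # Three attempts leave two waits: about 2s, then 4s (capped at 10s).
    wait=wait_exponential(multiplier=2, min=2, max=10),
    # Retry transient failures only (exception names assume the openai>=1.0 client).
    retry=retry_if_exception_type((RateLimitError, APITimeoutError, APIConnectionError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_llm_api(prompt: str, **kwargs) -> str:
    """Call LLM API with automatic retry."""
    # kwargs must include the model name.
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return response.choices[0].message.content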
Retry Strategies by Operation Type
LLM API Calls:
- Max attempts: 3
- Backoff: Exponential (2s, 4s)
- Retry on: RateLimitError, Timeout, NetworkError
- Don't retry: ValidationError, AuthError
Dataset Downloads:
- Max attempts: 5
- Backoff: Exponential (5s, 10s, 20s, 40s)
- Retry on: ConnectionError, Timeout
- Don't retry: DatasetNotFoundError
Document Parsing:
- Max attempts: 2
- Backoff: Fixed (1s)
- Retry on: TemporaryFileError
- Don't retry: CorruptedFileError
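A policy with n attempts sleeps at most n - 1 times, so the backoff schedules implied by the attempt limits above can be worked out directly. The sketch below uses only the standard library; `backoff_schedule` is a hypothetical helper, and the 10s and 60s caps are assumptions taken from the Configuration values further down.

```python
def backoff_schedule(max_attempts: int, first_wait: float, max_wait: float) -> list[float]:
    """Waits (in seconds) between attempts: first_wait doubled each time, capped at max_wait."""
    # n attempts are separated by n - 1 waits.
    return [min(first_wait * 2**i, max_wait) for i in range(max_attempts - 1)]


llm_waits = backoff_schedule(max_attempts=3, first_wait=2, max_wait=10)
download_waits = backoff_schedule(max_attempts=5, first_wait=5, max_wait=60)
parsing_waits = [1] * (2 - 1)  # fixed backoff: a single 1s wait between two attempts

print(llm_waits)       # [2, 4]
print(download_waits)  # [5, 10, 20, 40]
print(parsing_waits)   # [1]
```

Worst-case added latency is the sum of each list: 6s for LLM calls, 75s for dataset downloads, 1s for document parsing.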
Configuration
class RetryConfig:
    llm_max_attempts: int = 3
    llm_max_wait_seconds: int = 10
    download_max_attempts: int = 5
    download_max_wait_seconds: int = 60
    enable_retry: bool = True  # Global toggle
Example with Custom Logic
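from typing import Optional

import requests
from requests.exceptions import RequestException
from tenacity import retry, retry_if_result, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(5),
    retry=retry_if_result(lambda x: x is None),  # retry when every mirror failed
    wait=wait_exponential(max=30),
)
def download_with_fallback(url: str) -> Optional[bytes]:
    """Download with automatic failover to mirrors."""
    # get_mirrors is assumed to be defined elsewhere in the project.
    for mirror in get_mirrors(url):
        try:
            response = requests.get(mirror, timeout=30)
            response.raise_for_status()  # treat HTTP error statuses as failures too
            return response.content
        except RequestException:
            continue
    return None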
Logging Integration
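import logging

from tenacity import after_log, before_log, retry, stop_after_attempt

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),  # without a stop condition tenacity retries forever
    before=before_log(logger, logging.INFO),
    after=after_log(logger, logging.INFO),
)
def operation():
    ...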
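Example output (illustrative; the exact wording of tenacity's log messages differs, and the WARNING line comes from before_sleep_log):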
INFO: Starting call to 'call_llm_api', attempt 1
WARNING: Retrying call_llm_api in 2.0 seconds (RateLimitError)
INFO: Starting call to 'call_llm_api', attempt 2
INFO: Finished call to 'call_llm_api' after 2 attempts