Add Retry Mechanism with Exponential Backoff

Network operations and external API calls can fail transiently but are not retried, leading to:
- Failed jobs from temporary network glitches
- Rate limit errors from LLM providers
- HuggingFace dataset download failures
- Unnecessary manual re-runs

## Current Behavior
```python
# No retry - single failure kills the job
response = openai_client.chat.completions.create(...)
```

## Proposed Solution

Use the `tenacity` library for declarative retry logic:

```toml
dependencies = [
    "tenacity>=8.2.3",
]
```

### Implementation Example
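```python
import logging

from openai import APIConnectionError, APITimeoutError, OpenAI, RateLimitError
from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The OpenAI SDK raises its own exception types, not `requests` exceptions.
# Only transient failures are listed; auth and validation errors propagate at once.
TRANSIENT_ERRORS = (RateLimitError, APITimeoutError, APIConnectionError)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=2, min=2, max=10),  # waits 2s, then 4s
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,  # surface the original exception instead of tenacity.RetryError
)
def call_llm_api(prompt: str, **kwargs) -> str:
    """Call LLM API with automatic retry."""
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **kwargs,  # must include `model`
    )
    return response.choices[0].message.content
```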

### Retry Strategies by Operation Type

**LLM API Calls:**
- Max attempts: 3
- Backoff: Exponential (2s, then 4s; three attempts allow at most two waits)
- Retry on: RateLimitError, Timeout, NetworkError
- Don't retry: ValidationError, AuthError

**Dataset Downloads:**
- Max attempts: 5
- Backoff: Exponential (5s, 10s, 20s, 40s)
- Retry on: ConnectionError, Timeout
- Don't retry: DatasetNotFoundError

**Document Parsing:**
- Max attempts: 2
- Backoff: Fixed (1s)
- Retry on: TemporaryFileError
- Don't retry: CorruptedFileError
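
The backoff figures above can be sanity-checked with a few lines of plain Python. This is a minimal sketch that assumes the clamped exponential formula `wait = clamp(multiplier * 2**(attempt - 1), min, max)`, which is how `tenacity.wait_exponential` is understood to behave here. `backoff_schedule` and the parameter values passed to it are hypothetical, chosen only to reproduce the schedules listed. A policy with N attempts sleeps at most N - 1 times.

```python
from typing import List


def backoff_schedule(
    max_attempts: int, multiplier: float, min_wait: float, max_wait: float
) -> List[float]:
    """Return the sleep before each retry (max_attempts - 1 entries)."""
    waits = []
    for attempt in range(1, max_attempts):
        raw = multiplier * 2 ** (attempt - 1)  # 1x, 2x, 4x, ... the multiplier
        waits.append(max(min_wait, min(raw, max_wait)))  # clamp to [min, max]
    return waits


# Hypothetical parameters for the LLM and download policies.
llm_waits = backoff_schedule(3, multiplier=2, min_wait=2, max_wait=10)
download_waits = backoff_schedule(5, multiplier=5, min_wait=5, max_wait=60)
print(llm_waits)       # [2, 4]
print(download_waits)  # [5, 10, 20, 40]
```

For comparison, `multiplier=1` with `min_wait=2` yields `[2, 2]` for three attempts, because the first raw value (1s) is clamped up to the minimum.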

## Configuration
```python
from dataclasses import dataclass

@dataclass
class RetryConfig:
    llm_max_attempts: int = 3
    llm_max_wait_seconds: int = 10
    download_max_attempts: int = 5
    download_max_wait_seconds: int = 60
    enable_retry: bool = True  # Global toggle
```
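
One way the global toggle could be wired in is sketched below. It is a stdlib-only stand-in for the `tenacity` decorator (a hand-rolled loop, not tenacity itself) so that it runs without extra dependencies. `with_llm_retry`, the `retry_on` tuple, and the injectable `sleep` argument are hypothetical names introduced for illustration; `RetryConfig` is repeated as a dataclass, trimmed to the LLM fields, so the block is self-contained.

```python
import functools
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type


@dataclass
class RetryConfig:
    llm_max_attempts: int = 3
    llm_max_wait_seconds: int = 10
    enable_retry: bool = True  # Global toggle


def with_llm_retry(
    config: RetryConfig,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
):
    """Build a decorator that retries `retry_on` errors according to `config`."""

    def decorator(func):
        if not config.enable_retry:
            return func  # toggle off: no retry wrapper at all

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, config.llm_max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == config.llm_max_attempts:
                        raise  # out of attempts: surface the last error
                    # 2s, 4s, 8s, ... capped at the configured maximum
                    sleep(min(2 ** attempt, config.llm_max_wait_seconds))

        return wrapper

    return decorator


# Usage: a function that fails twice, then succeeds (sleep stubbed out).
calls = []


@with_llm_retry(RetryConfig(), sleep=lambda seconds: None)
def flaky() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"


print(flaky())  # ok
```

Exceptions outside `retry_on` propagate on the first failure, which matches the "Don't retry" rows in the strategy lists.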

## Example with Custom Logic
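```python
from typing import Optional

import requests
from requests.exceptions import RequestException
from tenacity import retry, retry_if_result, stop_after_attempt, wait_exponential

# If all 5 attempts return None, tenacity raises RetryError.
@retry(
    stop=stop_after_attempt(5),
    retry=retry_if_result(lambda x: x is None),  # retry when every mirror failed
    wait=wait_exponential(max=30),
)
def download_with_fallback(url: str) -> Optional[bytes]:
    """Download with automatic failover to mirrors."""
    # get_mirrors() is assumed to be defined elsewhere in the project.
    for mirror in get_mirrors(url):
        try:
            response = requests.get(mirror, timeout=30)
            response.raise_for_status()  # treat HTTP 4xx/5xx as a failed mirror
            return response.content
        except RequestException:
            continue
    return None
```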

## Logging Integration
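```python
import logging
from tenacity import after_log, before_log, retry, stop_after_attempt

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),  # without a stop condition tenacity retries forever
    before=before_log(logger, logging.INFO),
    after=after_log(logger, logging.INFO),
)
def operation():
    ...
```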

Illustrative output (tenacity's actual log messages are worded differently):
```
INFO: Starting call to 'call_llm_api', attempt 1
WARNING: Retrying call_llm_api in 2.0 seconds (RateLimitError)
INFO: Starting call to 'call_llm_api', attempt 2
INFO: Finished call to 'call_llm_api' after 2 attempts
```

