@wortmanb

This is the first releasable version of deepfreeze for curator, tested as thoroughly as I can manage. All unit tests pass. Integration tests are still a work in progress: they take so long to run that iterating on them is difficult, and parallelizing them hasn't worked well either.

Bret Wortman and others added 30 commits January 17, 2025 11:36
This wasn't working when trying to map with filters.
I added several new options and adjusted others so that we can now
specify --rotate_by and choose bucket or path. The suffix then gets
applied to either the bucket name or the path name, depending on that
choice. The repo name will always get the suffix.
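
The --rotate_by behavior described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `apply_suffix` and the `-` separator are assumptions made for the example.

```python
def apply_suffix(bucket: str, path: str, repo: str, suffix: str, rotate_by: str):
    """Return (bucket, path, repo) with the rotation suffix applied.

    Hypothetical helper: the name and the "-" separator are assumptions.
    """
    if rotate_by == "bucket":
        # Rotating by bucket: the bucket name changes, the path stays fixed.
        bucket = f"{bucket}-{suffix}"
    elif rotate_by == "path":
        # Rotating by path: the path changes, the bucket stays fixed.
        path = f"{path}-{suffix}"
    else:
        raise ValueError(f"rotate_by must be 'bucket' or 'path', got {rotate_by!r}")
    # The repository name gets the suffix in both modes.
    repo = f"{repo}-{suffix}"
    return bucket, path, repo
```

With `rotate_by="path"`, only the path and the repository name change, so every rotation lands in the same bucket.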
Switched most settings to being part of a Settings object.
Completed updating Rotate up through ILM changes.
Fully implemented style.
Verified and fixed code for removing old repositories.
For oneup, at least. Need to ensure this works for date-based rotation
too.
Removed commented-out code now that I know it's safe
Finally got black configured and disabled Flake. Much happier now.
templated these, which we'll use to track repos and thawsets inside of
the status index in elasticsearch
Unit tests for utility classes used by DeepFreeze.
These tests cover all remaining utility (module-level) functions. They
could perhaps be collected into a single file.
I plan to do this wherever possible, and anywhere it doesn't cause more
problems than it solves.
This is almost certainly incomplete, but I'll add to it as we go along.
This completely breaks a number of things, but I wanted to capture it
mid-stream so as not to lose it. Flaky network at BAH.
Set defaults for this code formatter, which is faster than black but can
format just as well and to the same standard.
Switched to Ruff. It really wants " instead of '.
Added s3client.py to encapsulate S3 client code for various providers
under a consistent interface. Includes classes S3Client and its
implementation classes, plus a factory method to return a client object
for a particular provider.
Also made some updates to deepfreeze.py to comply with testing better.
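
A rough sketch of the S3 client layout described above, an abstract S3Client with provider-specific implementations and a factory that picks one by provider name. Only the name S3Client comes from the commit message. The method `bucket_exists`, the class `InMemoryS3Client`, and the function `make_s3_client` are hypothetical and exist only to keep the example runnable without cloud credentials.

```python
from abc import ABC, abstractmethod


class S3Client(ABC):
    """Provider-neutral interface (the method set here is illustrative)."""

    @abstractmethod
    def create_bucket(self, bucket_name: str) -> None: ...

    @abstractmethod
    def bucket_exists(self, bucket_name: str) -> bool: ...


class InMemoryS3Client(S3Client):
    """Hypothetical stand-in implementation backed by a set."""

    def __init__(self) -> None:
        self._buckets: set[str] = set()

    def create_bucket(self, bucket_name: str) -> None:
        self._buckets.add(bucket_name)

    def bucket_exists(self, bucket_name: str) -> bool:
        return bucket_name in self._buckets


# Registry of provider name -> implementation class (assumed names).
_PROVIDERS = {"memory": InMemoryS3Client}


def make_s3_client(provider: str) -> S3Client:
    """Factory: return a client object for the requested provider."""
    try:
        return _PROVIDERS[provider]()
    except KeyError:
        raise ValueError(f"Unsupported provider: {provider!r}") from None
```

Callers depend only on the S3Client interface, so tests can swap in a fake provider without touching the action code.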
Allows us to persist more details about the repo.
Bret Wortman and others added 30 commits October 23, 2025 06:08
We now create and assign a new "frozen-only" ilm policy to each thawed
index, based on the repository it was thawed from. This prevents all
thawed indices from showing up on Index Management as having lifecycle
errors.
Users can still list all by adding --include-completed or -c
1. Added Status Constants (constants.py)

  - Added THAW_STATUS_IN_PROGRESS, THAW_STATUS_COMPLETED,
    THAW_STATUS_FAILED, and THAW_STATUS_REFROZEN constants
  - Created THAW_REQUEST_STATUSES list for validation

  2. Updated Refreeze Action (refreeze.py)

  - Changed status from "completed" to THAW_STATUS_REFROZEN when
refreeze completes
  - Now properly indicates that thawed data has been cleaned up and
returned to frozen state

  3. Added Retention Setting (helpers.py)

  - Added thaw_request_retention_days_refrozen setting (default: 35
days)
  - This aligns with the 30-day max for data to return to Glacier, plus
5 days buffer

  4. Updated Cleanup Logic (cleanup.py)

  - Added handling for "refrozen" status in both
_cleanup_old_thaw_requests() and dry-run mode
  - Refrozen requests are automatically deleted after 35 days

  5. Updated Thaw List Filtering (thaw.py - do_list_requests())

  - Now excludes both "completed" AND "refrozen" requests by default
  - Use --include-completed or -c flag to see all requests
  - Updated help messages to reflect "completed/refrozen" filtering

  6. Updated Status Checking (thaw.py)

  - do_check_status(): Skips refrozen requests with helpful message
  - do_check_all_status(): Filters out refrozen requests before
processing

  Status Lifecycle

  The complete thaw request lifecycle is now:

  1. in_progress → Thaw operation is actively running
  2. completed → Thaw succeeded, data is available and mounted
  3. refrozen → Data has been cleaned up via refreeze (new!)
  4. failed → Thaw operation failed

  Retention Periods (Cleanup)

  - Completed: 7 days (default)
  - Failed: 30 days (default)
  - Refrozen: 35 days (new!)

  All syntax validation passed! The new status properly distinguishes
  between "thaw completed and data available" vs "thaw was completed
  but has been cleaned up."
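
As a sketch of how the status constants and retention periods above fit together: the constant names, status strings, and day counts are taken from the commit message, while the `RETENTION_DAYS` mapping, the `is_expired` helper, and the choice to measure age from a creation timestamp are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

THAW_STATUS_IN_PROGRESS = "in_progress"
THAW_STATUS_COMPLETED = "completed"
THAW_STATUS_FAILED = "failed"
THAW_STATUS_REFROZEN = "refrozen"

THAW_REQUEST_STATUSES = [
    THAW_STATUS_IN_PROGRESS,
    THAW_STATUS_COMPLETED,
    THAW_STATUS_FAILED,
    THAW_STATUS_REFROZEN,
]

# Default retention per terminal status, in days (hypothetical mapping).
RETENTION_DAYS = {
    THAW_STATUS_COMPLETED: 7,
    THAW_STATUS_FAILED: 30,
    THAW_STATUS_REFROZEN: 35,  # 30-day Glacier return window plus 5 days buffer
}


def is_expired(status: str, created_at: datetime, now: datetime) -> bool:
    """Return True if a thaw request is old enough to be cleaned up."""
    if status not in THAW_REQUEST_STATUSES:
        raise ValueError(f"Unknown thaw request status: {status!r}")
    days = RETENTION_DAYS.get(status)
    if days is None:
        # in_progress requests have no retention period and are never expired.
        return False
    return now - created_at > timedelta(days=days)
```

For example, `is_expired("refrozen", created_at, datetime.now(timezone.utc))` is true only once the request is more than 35 days old.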
Added descriptions of all actions in markdown.
Due to issues in rotate, not all repos were being marked 'frozen'. This
necessitated adding repair_metadata, which can be used should this ever
occur again and serves as a foundation for other potential repair work
in the future.

Updated integration tests and applied fixes revealed by testing.
1. Parallelized AWS S3 API Calls (10-15x speedup on S3 checks)

  File: curator/actions/deepfreeze/utilities.py

  - Modified check_restore_status() to use ThreadPoolExecutor with 15
concurrent workers
  - Instead of checking objects sequentially (one by one), now checks up
to 15 objects in parallel
  - This is the biggest win: it cuts 10,000 sequential API calls
from 16+ minutes to ~1 minute

  Technical details:
  - boto3 client is thread-safe, making this safe to implement
  - Separates instant-access objects (no check needed) from Glacier
    objects (need parallel checking)
  - Uses concurrent.futures.as_completed() to process results as they
arrive

  2. Eliminated Redundant Status Checks (2x speedup on overall flow)

  Files: curator/actions/deepfreeze/thaw.py

  - Added status caching in both do_check_status() and
do_check_all_status()
  - Modified _display_thaw_status() to accept optional cache parameter
  - Previously called check_restore_status() twice per repository (once
    for logic, once for display)
  - Now caches results from first check and reuses for display

  3. Added Progress Indicators (UX improvement)

  Files: curator/actions/deepfreeze/thaw.py

  - Shows "Checking repository X of Y..." as each repository is
processed
  - Gives users real-time feedback instead of appearing frozen
  - Uses existing rich library for clean terminal output

  4. Code Quality

  - All changes pass black formatting
  - All changes pass ruff linting
  - Backward compatible - no API changes

  Expected Performance Improvement

  Before: ~11 minutes (660 seconds)
  After: ~1-2 minutes (60-120 seconds)

  Overall speedup: 5-10x faster!

  Breakdown:

  - S3 API calls: 16 minutes → ~1 minute (15x faster)
  - Redundant checks eliminated: Cut remaining time in half
  - Total: 11 minutes → 1-2 minutes

  The exact improvement depends on:
  - Number of thaw requests
  - Number of repositories per request
  - Number of objects per repository
  - Network latency to AWS S3
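
The parallel restore check described above could look roughly like the sketch below. It is not the code in utilities.py. To stay runnable without AWS access it takes a `head_object` callable (an assumption) in place of a boto3 client, and the object dictionaries and the returned counts dictionary are illustrative shapes. The `Restore` values follow the S3 header format `ongoing-request="true"` / `ongoing-request="false", expiry-date="..."`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Storage classes that need a restore check (assumed set for this sketch).
GLACIER_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def check_restore_status(objects, head_object, max_workers=15):
    """Count restore states for a list of objects.

    objects: dicts with "Key" and optional "StorageClass".
    head_object: callable(key) -> dict that may contain a "Restore" string.
    """
    counts = {"total": len(objects), "restored": 0, "in_progress": 0, "not_restored": 0}

    # Instant-access objects need no API call; only Glacier objects are checked.
    glacier_keys = []
    for obj in objects:
        if obj.get("StorageClass", "STANDARD") in GLACIER_CLASSES:
            glacier_keys.append(obj["Key"])
        else:
            counts["restored"] += 1

    def classify(key):
        restore = head_object(key).get("Restore")
        if restore is None:
            return "not_restored"
        if 'ongoing-request="true"' in restore:
            return "in_progress"
        return "restored"

    # Run up to max_workers checks at once and tally results as they arrive.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(classify, key) for key in glacier_keys]
        for future in as_completed(futures):
            counts[future.result()] += 1
    return counts
```

Because the counts are updated only in the calling thread, inside the `as_completed` loop, the tally needs no lock. A caller can also keep the returned dictionary and pass it to the display step, which is the idea behind the status cache in item 2.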
Summary of Changes

  1. CLI Command (curator/cli_singletons/deepfreeze.py:344-370)

  Added the -f/--refrozen-retention-days option to the cleanup command:
  - Short flag: -f (mnemonic for "refrozen")
  - Long flag: --refrozen-retention-days
  - Type: integer
  - Default: None (uses config setting, typically 35 days)

  2. Cleanup Action (curator/actions/deepfreeze/cleanup.py)

  - Updated __init__ to accept refrozen_retention_days parameter
  - Modified _cleanup_old_thaw_requests() to use the CLI override if
    provided, otherwise fall back to the settings value
  - Applied same logic to do_dry_run() method for consistent behavior
  - Updated class docstring to document the new parameter

  3. Schema Validation

  Added validation in two places:
  - option_defaults.py: Created refrozen_retention_days() function with
    validation (1-365 days range, None allowed)
  - validators/options.py: Added the option to cleanup's validation
schema
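
A plain-Python sketch of the validation rule and the override fallback described above. The real option is defined through curator's schema machinery in option_defaults.py, which this example does not reproduce. The function names `validate_refrozen_retention_days` and `effective_retention_days` are hypothetical.

```python
def validate_refrozen_retention_days(value):
    """Accept None or an integer from 1 to 365; reject anything else."""
    if value is None:
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("refrozen_retention_days must be an integer or None")
    if not 1 <= value <= 365:
        raise ValueError("refrozen_retention_days must be between 1 and 365")
    return value


def effective_retention_days(cli_value, settings_value=35):
    """Use the CLI override when given, otherwise the configured setting."""
    cli_value = validate_refrozen_retention_days(cli_value)
    return settings_value if cli_value is None else cli_value
```

Applying the same helper in both the real run and the dry run keeps the two code paths consistent, which is the point of item 2.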
  1. Added NotFoundError import (line 7) - imported the specific exception
     type from elasticsearch8 to handle repository not found errors
  2. Added specific exception handling (lines 210-223) - added a new
     exception handler that:
    - Specifically catches NotFoundError before the generic exception
      handler
    - Detects when the error is a repository_missing_exception
      (indicating the repository has already been unmounted)
    - Logs an INFO level message instead of ERROR: "Repository {name}
      has already been unmounted, no indices to delete"
    - Returns gracefully with no indices deleted
    - For other NotFoundError cases, logs a WARNING instead of ERROR
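
The handler ordering described above can be sketched as follows. To keep the example runnable without the Elasticsearch client installed, `NotFoundError` here is a local stand-in for the elasticsearch8 exception, and `delete_repo_indices`, `list_indices`, and the re-raise in the generic handler are assumptions, not the code from this commit.

```python
import logging


class NotFoundError(Exception):
    """Local stand-in for elasticsearch8.NotFoundError (assumption)."""


def delete_repo_indices(repo_name, list_indices, logger):
    """Return the indices found for a repository, or [] if it is already gone.

    list_indices: hypothetical callable(repo_name) -> iterable of index names.
    """
    try:
        indices = list(list_indices(repo_name))
    except NotFoundError as exc:
        # The specific handler must come before the generic one below.
        if "repository_missing_exception" in str(exc):
            logger.info(
                "Repository %s has already been unmounted, no indices to delete",
                repo_name,
            )
        else:
            logger.warning("Not found while listing %s: %s", repo_name, exc)
        return []
    except Exception as exc:
        # Generic handler: unexpected failures are still errors (assumed behavior).
        logger.error("Failed to list indices for %s: %s", repo_name, exc)
        raise
    return indices
```

Python tries `except` clauses in order, so placing `NotFoundError` first is what keeps the already-unmounted case out of the ERROR log.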
Show counts in thaw list output
Detect and fix the situation where a thaw request is submitted and acted
upon by AWS, but then ignored by the requestor. If check-status is run
after AWS has refrozen the data, this detects that and fixes the
metadata to show the request as refrozen so it doesn't languish as a
pending request.
Updated test description to reflect the integration tests' unreliable
nature.
…er guide

- Add detailed overview and architecture documentation
- Document all actions: setup, rotate, status, thaw, refreeze, cleanup, repair-metadata
- Include quick start guide and common workflows
- Add cost optimization and scheduling recommendations
- Document ILM integration and troubleshooting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>