
Commit 6ef6c7b

When files are listed, don't do `list_objects_v2`; size is not needed anymore; revert retrieval mode; count successful and failed downloads instead of comparing amounts of bytes; zstd compression.

1 parent 3a56ba0 commit 6ef6c7b

9 files changed

Lines changed: 10427 additions & 289 deletions

File tree

README.md

Lines changed: 55 additions & 21 deletions
@@ -1,35 +1,54 @@
 # S3 Log Compressor
 
-A simple AWS Lambda function that downloads multiple files from S3 and creates a single zip archive. Perfect for archiving log files or consolidating multiple files into a single downloadable package. The zip uses zstd for compression (mainly for paths and metadata).
+A simple AWS Lambda function that downloads multiple files from S3 and creates a single zip archive, or retrieves a single file from an archive. Perfect for archiving log files or consolidating multiple files into a single downloadable package. The zip uses zstd for compression (mainly for paths and metadata).
 
 ## Features
 
-- **Batch Processing**: Process thousands of files from an S3 manifest
-- **Efficient Zipping**: Create zip archives with progress logging
-- **Cross-Bucket Support**: Works with files from multiple S3 buckets
-- **Optional Cleanup**: Delete source files after successful archiving
-- **Progress Tracking**: Logs progress every 10,000 files processed and 50,000 files listed
-- **KMS Encryption**: Supports server-side encryption with KMS keys
-- **Concurrent Downloads**: Configurable worker threads for parallel processing
-- **Flexible Path Structure**: Option to include or exclude S3 bucket names in zip paths
+- **Dual-Mode Operation**: Supports both compressing files into an archive and decompressing a single file from an archive.
+- **Batch Processing**: Process thousands of files from an S3 manifest.
+- **Asynchronous Cleanup**: Optionally deletes source files asynchronously after successful archiving.
+- **Robust Validation**: Halts on any file download failure to ensure archive integrity.
+- **Optimized Manifest Parsing**: Reduces S3 API calls by distinguishing files from directories in the manifest.
+- **Concurrent Downloads**: Configurable worker threads for parallel processing.
+- **KMS Encryption**: Supports server-side encryption with KMS keys.
 
 ## Event Structure
 
+The `operation` field determines the function's behavior.
+
+### Compress Operation
+
 ```json
 {
+  "operation": "compress",
   "input_s3_manifest_url": "s3://bucket/manifest.txt",
   "output_s3_url": "s3://bucket/archive.zip",
   "delete_source_files": false,
-  "include_s3_name": true
+  "include_s3_name": true,
+  "max_workers": 256
+}
+```
+
+### Decompress Operation
+
+```json
+{
+  "operation": "decompress",
+  "source_s3_url": "s3://bucket/archive.zip",
+  "file_to_extract": "path/in/zip/to/file.txt"
 }
 ```
 
 ### Parameters
 
-- `input_s3_manifest_url`: S3 URL to a text file containing a list of files to archive
-- `output_s3_url`: S3 URL where the final zip archive will be stored
-- `delete_source_files`: Whether to delete source files after successful archiving (default: false)
-- `include_s3_name`: Whether to include S3 bucket names in the zip archive paths (default: true)
+- `operation`: `compress` or `decompress`.
+- `input_s3_manifest_url`: (Compress) S3 URL to a text file containing a list of files/directories to archive.
+- `output_s3_url`: (Compress) S3 URL where the final zip archive will be stored.
+- `delete_source_files`: (Compress) Whether to delete source files after successful archiving (default: `false`).
+- `include_s3_name`: (Compress) Whether to include S3 bucket names in the zip archive paths (default: `true`).
+- `max_workers`: (Compress) Maximum number of concurrent workers for downloading files (default: 256).
+- `source_s3_url`: (Decompress) S3 URL of the source zip archive.
+- `file_to_extract`: (Decompress) The full path of the file to extract from the archive.
 
 When `include_s3_name` is `true`, files will be stored in the zip with paths like `bucket-name/path/to/file.txt`. When `false`, files will be stored with just their S3 key path, like `path/to/file.txt`.
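This path rule can be sketched with a tiny helper (hypothetical, not part of the repo's code):

```python
def zip_entry_path(bucket: str, key: str, include_s3_name: bool = True) -> str:
    """Build the path a downloaded S3 object gets inside the zip archive."""
    return f"{bucket}/{key}" if include_s3_name else key

print(zip_entry_path("bucket-name", "path/to/file.txt", include_s3_name=True))   # bucket-name/path/to/file.txt
print(zip_entry_path("bucket-name", "path/to/file.txt", include_s3_name=False))  # path/to/file.txt
```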

@@ -58,14 +77,29 @@ sam build && sam deploy
 ## How It Works
 
-1. Downloads a manifest file from S3 containing a list of files to archive, by default with 256 workers (this is optimal; more will cause OS error 24, "Too many open files")
-2. Checks accessibility of all buckets mentioned in the manifest
-3. Lists all files to be processed and calculates total size
-4. Downloads files concurrently and adds them to a zip archive
-5. Uploads the final zip file to the specified S3 location
-6. Optionally deletes source files if requested
+### Compression
+
+1. Downloads a manifest file from S3 containing a list of files and directories to archive.
+2. Checks accessibility of all buckets mentioned in the manifest.
+3. Lists all files to be processed, exploring directories as needed.
+4. Downloads files concurrently and adds them to a zip archive. If any download fails, the process aborts.
+5. Uploads the final zip file to the specified S3 location.
+6. Optionally starts an asynchronous process to delete source files.
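The function itself is Rust; as a rough illustration of steps 3–6, here is a Python sketch with stand-ins for everything AWS-specific: a dict replaces S3 downloads, `ZIP_DEFLATED` replaces zstd, and `threading.Lock` plays the role of the mutex around the shared writer (`compress_all` is a hypothetical name, not the repo's API):

```python
import io
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor

def compress_all(files: dict, max_workers: int = 8) -> bytes:
    """Concurrently 'download' each file and add it to a shared zip archive.
    `files` maps archive paths to contents (a stand-in for S3 objects)."""
    buf = io.BytesIO()
    zip_lock = threading.Lock()
    failures = []

    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        def add_one(path):
            try:
                data = files[path]        # real code would GET the S3 object here
            except Exception:
                failures.append(path)     # count failures; don't compare byte totals
                return
            with zip_lock:                # only one thread writes the archive at a time
                zf.writestr(path, data)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(add_one, files))

    if failures:
        raise RuntimeError(f"aborting: {len(failures)} of {len(files)} downloads failed")
    return buf.getvalue()
```

If any "download" fails, the sketch raises instead of returning an archive, mirroring the abort-on-failure behavior described in step 4.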
+
+### Decompression
+
+1. Downloads the specified zip archive from S3.
+2. Extracts the requested file from the archive in memory.
+3. Returns the file content as a base64-encoded string.
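Assuming the archive bytes have already been fetched from S3, the in-memory extraction in steps 2–3 can be sketched as follows (`extract_one` is a hypothetical helper name):

```python
import base64
import io
import zipfile

def extract_one(archive_bytes: bytes, file_to_extract: str) -> str:
    """Read one member of an in-memory zip and return it base64-encoded."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        data = zf.read(file_to_extract)  # KeyError if the path is not in the archive
    return base64.b64encode(data).decode("ascii")
```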
+
+## Technical Details
+
+The compression engine relies on a few key Rust concepts to work safely and efficiently:
+
+- **`Arc<Mutex<...>>`**: To handle many concurrent file downloads, the core `ZipWriter` is wrapped in an `Arc` (atomic reference counter) and a `Mutex` (mutual exclusion lock).
+  - `Arc` lets multiple download tasks safely share ownership of the writer.
+  - `Mutex` ensures that only one task can write to the zip file at a time, preventing data corruption.
+- **`ZipWriter`**: A utility from the `zip` crate that handles the low-level details of creating a valid `.zip` archive structure.
+- **`BufWriter`**: To improve performance, file writes are sent through a `BufWriter`, an in-memory buffer that collects smaller writes into fewer, larger writes to the filesystem, reducing I/O overhead.
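Python's `io.BufferedWriter` behaves analogously to Rust's `BufWriter`, which makes the effect easy to demonstrate; the counting sink below is illustrative, not from the repo:

```python
import io

class CountingSink(io.RawIOBase):
    """Raw byte sink that counts how many low-level writes it receives."""
    def __init__(self):
        self.writes = 0
    def writable(self):
        return True
    def write(self, b):
        self.writes += 1
        return len(b)

sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=64 * 1024)
for _ in range(10_000):
    buffered.write(b"x" * 16)  # 10,000 tiny writes into the buffer...
buffered.flush()
print(sink.writes)             # ...reach the sink as only a handful of large writes
```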
 
 ## Environment Variables
 
-- `MAX_WORKERS`: Number of concurrent download workers (optional, default: 256)
 - `KMS_KEY_ID`: KMS key ID for server-side encryption (optional)

dev/generate_manifest.py

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# s3://s3-log-compressor-sourcebucket-rkeoqdxsxu2w/mock-logs/log_0000.json.gz
+# for log_0000 to log_9999
+# Base S3 bucket path
+bucket_path = "s3://s3-log-compressor-sourcebucket-rkeoqdxsxu2w/mock-logs/"
+
+# Generate manifest entries
+manifest_entries = []
+
+for i in range(10000):
+    log_filename = f"log_{i:04d}.json.gz"
+    s3_path = bucket_path + log_filename
+    manifest_entries.append(s3_path)
+
+# Output manifest as plaintext, one entry per line
+with open("manifest.txt", "w") as f:
+    for entry in manifest_entries:
+        f.write(entry + "\n")
+
+print(f"Generated manifest with {len(manifest_entries)} entries")
