|
1 | 1 | # S3 Log Compressor |
2 | 2 |
|
3 | | -A simple AWS Lambda function that downloads multiple files from S3 and creates a single zip archive. Perfect for archiving log files or consolidating multiple files into a single downloadable package. The zip uses zstd for compression (mainly for paths and metadata) |
| 3 | +A simple AWS Lambda function that downloads multiple files from S3 and creates a single zip archive, or retrieves a single file from an existing archive. Perfect for archiving log files or consolidating multiple files into a single downloadable package. The zip uses zstd for compression (mainly for paths and metadata). |
4 | 4 |
|
5 | 5 | ## Features |
6 | 6 |
|
7 | | -- **Batch Processing**: Process thousands of files from S3 manifest |
8 | | -- **Efficient Zipping**: Create zip archives with progress logging |
9 | | -- **Cross-Bucket Support**: Works with files from multiple S3 buckets |
10 | | -- **Optional Cleanup**: Delete source files after successful archiving |
11 | | -- **Progress Tracking**: Logs progress every 10,000 files processed and 50,000 files listed |
12 | | -- **KMS Encryption**: Supports server-side encryption with KMS keys |
13 | | -- **Concurrent Downloads**: Configurable worker threads for parallel processing |
14 | | -- **Flexible Path Structure**: Option to include or exclude S3 bucket names in zip paths |
| 7 | +- **Dual-Mode Operation**: Supports both compressing files into an archive and decompressing a single file from an archive. |
| 8 | +- **Batch Processing**: Process thousands of files from an S3 manifest. |
| 9 | +- **Asynchronous Cleanup**: Optionally deletes source files asynchronously after successful archiving. |
| 10 | +- **Robust Validation**: Halts on any file download failure to ensure archive integrity. |
| 11 | +- **Optimized Manifest Parsing**: Reduces S3 API calls by intelligently distinguishing files from directories in the manifest. |
| 12 | +- **Concurrent Downloads**: Configurable worker threads for parallel processing. |
| 13 | +- **KMS Encryption**: Supports server-side encryption with KMS keys. |
15 | 14 |
|
16 | 15 | ## Event Structure |
17 | 16 |
|
| 17 | +The `operation` field determines the function's behavior. |
| 18 | + |
| 19 | +### Compress Operation |
| 20 | + |
18 | 21 | ```json |
19 | 22 | { |
| 23 | + "operation": "compress", |
20 | 24 | "input_s3_manifest_url": "s3://bucket/manifest.txt", |
21 | 25 | "output_s3_url": "s3://bucket/archive.zip", |
22 | 26 | "delete_source_files": false, |
23 | | - "include_s3_name": true |
| 27 | + "include_s3_name": true, |
| 28 | + "max_workers": 256 |
| 29 | +} |
| 30 | +``` |
| 31 | + |
| 32 | +### Decompress Operation |
| 33 | + |
| 34 | +```json |
| 35 | +{ |
| 36 | + "operation": "decompress", |
| 37 | + "source_s3_url": "s3://bucket/archive.zip", |
| 38 | + "file_to_extract": "path/in/zip/to/file.txt" |
24 | 39 | } |
25 | 40 | ``` |
26 | 41 |
|
27 | 42 | ### Parameters |
28 | 43 |
|
29 | | -- `input_s3_manifest_url`: S3 URL to a text file containing a list of files to archive |
30 | | -- `output_s3_url`: S3 URL where the final zip archive will be stored |
31 | | -- `delete_source_files`: Whether to delete source files after successful archiving (default: false) |
32 | | -- `include_s3_name`: Whether to include S3 bucket names in the zip archive paths (default: true) |
| 44 | +- `operation`: `compress` or `decompress`. |
| 45 | +- `input_s3_manifest_url`: (Compress) S3 URL to a text file containing a list of files/directories to archive. |
| 46 | +- `output_s3_url`: (Compress) S3 URL where the final zip archive will be stored. |
| 47 | +- `delete_source_files`: (Compress) Whether to delete source files after successful archiving (default: `false`). |
| 48 | +- `include_s3_name`: (Compress) Whether to include S3 bucket names in the zip archive paths (default: `true`). |
| 49 | +- `max_workers`: (Compress) Maximum number of concurrent workers for downloading files (default: 256). |
| 50 | +- `source_s3_url`: (Decompress) S3 URL of the source zip archive. |
| 51 | +- `file_to_extract`: (Decompress) The full path of the file to extract from the archive. |
33 | 52 |
|
34 | 53 | When `include_s3_name` is `true`, files will be stored in the zip with paths like `bucket-name/path/to/file.txt`. When `false`, files will be stored with just their S3 key path like `path/to/file.txt`. |
35 | 54 |
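The path rule above can be sketched as a small function. This is only an illustration of the documented behavior: `archive_path` is a hypothetical helper name, not necessarily what the function's source uses.

```rust
/// Builds the path an object would get inside the zip archive.
/// Hypothetical helper: the name and signature are assumptions for illustration.
fn archive_path(bucket: &str, key: &str, include_s3_name: bool) -> String {
    if include_s3_name {
        // Prefix the S3 key with the bucket name.
        format!("{}/{}", bucket, key)
    } else {
        // Keep only the S3 key.
        key.to_string()
    }
}
```

Calling `archive_path("bucket-name", "path/to/file.txt", true)` yields `bucket-name/path/to/file.txt`; with `false` it yields `path/to/file.txt`.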
|
@@ -58,14 +77,29 @@ sam build && sam deploy |
58 | 77 |
|
59 | 78 | ## How It Works |
60 | 79 |
|
61 | | -1. Downloads a manifest file from S3 containing a list of files to archive, by default with 256 workers (this is optimal, more will cause os error 24 "Too many open files") |
62 | | -2. Checks accessibility of all buckets mentioned in the manifest |
63 | | -3. Lists all files to be processed and calculates total size |
64 | | -4. Downloads files concurrently and adds them to a zip archive |
65 | | -5. Uploads the final zip file to the specified S3 location |
66 | | -6. Optionally deletes source files if requested |
| 80 | +### Compression |
| 81 | +1. Downloads a manifest file from S3 containing a list of files and directories to archive. |
| 82 | +2. Checks accessibility of all buckets mentioned in the manifest. |
| 83 | +3. Lists all files to be processed, expanding directory entries from the manifest into the objects they contain.
| 84 | +4. Downloads files concurrently and adds them to a zip archive. If any download fails, the process aborts. |
| 85 | +5. Uploads the final zip file to the specified S3 location. |
| 86 | +6. Optionally starts an asynchronous process to delete source files. |
| 87 | + |
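The abort-on-failure behavior in step 4 can be sketched with Rust's `Result` collection, which stops at the first error. This is a minimal sketch under assumptions: `collect_or_abort` is a hypothetical name, downloads are modeled as already-finished `Result` values, and errors are plain `String`s rather than whatever error type the function actually uses.

```rust
/// Turns a list of per-file download results into either all file bodies
/// or the first error encountered.
/// Hypothetical helper: names and types are assumptions for illustration.
fn collect_or_abort(results: Vec<Result<Vec<u8>, String>>) -> Result<Vec<Vec<u8>>, String> {
    // Collecting an iterator of `Result`s into a `Result<Vec<_>, _>`
    // short-circuits on the first `Err`, so no partial set is returned.
    results.into_iter().collect()
}
```

With this shape, a single failed download means no archive contents are returned at all, which matches the stated goal of never producing an incomplete archive.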
| 88 | +### Decompression |
| 89 | +1. Downloads the specified zip archive from S3. |
| 90 | +2. Extracts the requested file from the archive in memory. |
| 91 | +3. Returns the file content as a base64-encoded string. |
| 92 | + |
| 93 | +## Technical Details |
| 94 | + |
| 95 | +The compression engine uses a few key Rust concepts to work safely and efficiently: |
| 96 | + |
| 97 | +- **`Arc<Mutex<...>>`**: To handle many concurrent file downloads, the core `ZipWriter` is wrapped in an `Arc` (Atomically Reference Counted pointer) and a `Mutex` (mutual exclusion lock).
| 98 | + - `Arc` allows multiple download tasks to safely share ownership of the writer. |
| 99 | + - `Mutex` ensures that only one task can write to the zip file at a time, preventing data corruption. |
| 100 | +- **`ZipWriter`**: This is a utility from the `zip` crate that handles the low-level details of creating a valid `.zip` archive structure. |
| 101 | +- **`BufWriter`**: To improve performance, file writes are sent through a `BufWriter`. It acts as an in-memory buffer, collecting smaller writes into a single larger, more efficient write to the filesystem, reducing I/O overhead. |
67 | 102 |
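The sharing pattern described above can be sketched with the standard library alone. This is not the function's actual code: it uses OS threads instead of the real download tasks, and a `BufWriter<Vec<u8>>` stands in for the `ZipWriter` so the example needs no external crates. `write_concurrently` is a hypothetical name.

```rust
use std::io::{BufWriter, Write};
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns one thread per payload; each thread appends its payload to a
/// shared, buffered, in-memory writer. Returns everything that was written.
/// Stand-in sketch: a real archive writer would replace `BufWriter<Vec<u8>>`.
fn write_concurrently(payloads: Vec<Vec<u8>>) -> Vec<u8> {
    // `Arc` shares ownership across threads; `Mutex` serializes the writes.
    let writer = Arc::new(Mutex::new(BufWriter::new(Vec::new())));

    let handles: Vec<_> = payloads
        .into_iter()
        .map(|payload| {
            let writer = Arc::clone(&writer);
            thread::spawn(move || {
                // Only one thread holds the lock at a time, so each payload
                // is written contiguously.
                let mut guard = writer.lock().expect("writer mutex poisoned");
                guard.write_all(&payload).expect("write failed");
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // All worker clones are dropped by now, so the writer can be taken back out.
    let mutex = Arc::try_unwrap(writer).expect("writer still shared");
    let buffered = mutex.into_inner().expect("writer mutex poisoned");
    // `into_inner` flushes the buffer before handing back the underlying Vec.
    buffered.into_inner().expect("flush failed")
}
```

The order in which payloads land in the output depends on which thread acquires the lock first, but no payload is ever interleaved with another.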
|
68 | 103 | ## Environment Variables |
69 | 104 |
|
70 | | -- `MAX_WORKERS`: Number of concurrent download workers (optional, default: 256) |
71 | 105 | - `KMS_KEY_ID`: KMS key ID for server-side encryption (optional) |