
Commit 6ef6c7b

When files are listed, don't do `list_objects_v2`; size is not needed anymore; revert retrieval mode; count successful and failed downloads instead of comparing amounts of bytes; zstd compression.

1 parent 3a56ba0 commit 6ef6c7b

9 files changed

Lines changed: 10427 additions & 289 deletions

File tree

README.md

Lines changed: 55 additions & 21 deletions
@@ -1,35 +1,54 @@
 # S3 Log Compressor
 
-A simple AWS Lambda function that downloads multiple files from S3 and creates a single zip archive. Perfect for archiving log files or consolidating multiple files into a single downloadable package. The zip uses zstd for compression (mainly for paths and metadata).
+A simple AWS Lambda function that downloads multiple files from S3 and creates a single zip archive, or retrieves a single file from an archive. Perfect for archiving log files or consolidating multiple files into a single downloadable package. The zip uses zstd for compression (mainly for paths and metadata).
 
 ## Features
 
-- **Batch Processing**: Process thousands of files from an S3 manifest
-- **Efficient Zipping**: Create zip archives with progress logging
-- **Cross-Bucket Support**: Works with files from multiple S3 buckets
-- **Optional Cleanup**: Delete source files after successful archiving
-- **Progress Tracking**: Logs progress every 10,000 files processed and 50,000 files listed
-- **KMS Encryption**: Supports server-side encryption with KMS keys
-- **Concurrent Downloads**: Configurable worker threads for parallel processing
-- **Flexible Path Structure**: Option to include or exclude S3 bucket names in zip paths
+- **Dual-Mode Operation**: Supports both compressing files into an archive and decompressing a single file from an archive.
+- **Batch Processing**: Process thousands of files from an S3 manifest.
+- **Asynchronous Cleanup**: Optionally deletes source files asynchronously after successful archiving.
+- **Robust Validation**: Halts on any file download failure to ensure archive integrity.
+- **Optimized Manifest Parsing**: Reduces S3 API calls by distinguishing files from directories in the manifest.
+- **Concurrent Downloads**: Configurable worker threads for parallel processing.
+- **KMS Encryption**: Supports server-side encryption with KMS keys.
 
 ## Event Structure
 
+The `operation` field determines the function's behavior.
+
+### Compress Operation
+
 ```json
 {
+  "operation": "compress",
   "input_s3_manifest_url": "s3://bucket/manifest.txt",
   "output_s3_url": "s3://bucket/archive.zip",
   "delete_source_files": false,
-  "include_s3_name": true
+  "include_s3_name": true,
+  "max_workers": 256
+}
+```
+
+### Decompress Operation
+
+```json
+{
+  "operation": "decompress",
+  "source_s3_url": "s3://bucket/archive.zip",
+  "file_to_extract": "path/in/zip/to/file.txt"
 }
 ```
 
 ### Parameters
 
-- `input_s3_manifest_url`: S3 URL to a text file containing a list of files to archive
-- `output_s3_url`: S3 URL where the final zip archive will be stored
-- `delete_source_files`: Whether to delete source files after successful archiving (default: false)
-- `include_s3_name`: Whether to include S3 bucket names in the zip archive paths (default: true)
+- `operation`: `compress` or `decompress`.
+- `input_s3_manifest_url`: (Compress) S3 URL to a text file containing a list of files/directories to archive.
+- `output_s3_url`: (Compress) S3 URL where the final zip archive will be stored.
+- `delete_source_files`: (Compress) Whether to delete source files after successful archiving (default: `false`).
+- `include_s3_name`: (Compress) Whether to include S3 bucket names in the zip archive paths (default: `true`).
+- `max_workers`: (Compress) Maximum number of concurrent workers for downloading files (default: 256).
+- `source_s3_url`: (Decompress) S3 URL of the source zip archive.
+- `file_to_extract`: (Decompress) The full path of the file to extract from the archive.
 
 When `include_s3_name` is `true`, files will be stored in the zip with paths like `bucket-name/path/to/file.txt`. When `false`, files will be stored with just their S3 key path, like `path/to/file.txt`.
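This path rule can be sketched with a tiny helper (hypothetical, not part of the repo's code):

```python
def zip_entry_path(bucket: str, key: str, include_s3_name: bool = True) -> str:
    """Build the path a downloaded S3 object gets inside the zip archive."""
    return f"{bucket}/{key}" if include_s3_name else key

print(zip_entry_path("bucket-name", "path/to/file.txt", include_s3_name=True))   # bucket-name/path/to/file.txt
print(zip_entry_path("bucket-name", "path/to/file.txt", include_s3_name=False))  # path/to/file.txt
```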

@@ -58,14 +77,29 @@ sam build && sam deploy
 ## How It Works
 
-1. Downloads a manifest file from S3 containing a list of files to archive, by default with 256 workers (this is optimal; more will cause OS error 24, "Too many open files")
-2. Checks accessibility of all buckets mentioned in the manifest
-3. Lists all files to be processed and calculates total size
-4. Downloads files concurrently and adds them to a zip archive
-5. Uploads the final zip file to the specified S3 location
-6. Optionally deletes source files if requested
+### Compression
+
+1. Downloads a manifest file from S3 containing a list of files and directories to archive.
+2. Checks accessibility of all buckets mentioned in the manifest.
+3. Lists all files to be processed, exploring directories as needed.
+4. Downloads files concurrently and adds them to a zip archive. If any download fails, the process aborts.
+5. Uploads the final zip file to the specified S3 location.
+6. Optionally starts an asynchronous process to delete source files.
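The function itself is Rust; as a rough illustration of steps 3–6, here is a Python sketch with stand-ins for everything AWS-specific: a dict replaces S3 downloads, `ZIP_DEFLATED` replaces zstd, and `threading.Lock` plays the role of the mutex around the shared writer (`compress_all` is a hypothetical name, not the repo's API):

```python
import io
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor

def compress_all(files: dict, max_workers: int = 8) -> bytes:
    """Concurrently 'download' each file and add it to a shared zip archive.
    `files` maps archive paths to contents (a stand-in for S3 objects)."""
    buf = io.BytesIO()
    zip_lock = threading.Lock()
    failures = []

    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        def add_one(path):
            try:
                data = files[path]        # real code would GET the S3 object here
            except Exception:
                failures.append(path)     # count failures; don't compare byte totals
                return
            with zip_lock:                # only one thread writes the archive at a time
                zf.writestr(path, data)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(add_one, files))

    if failures:
        raise RuntimeError(f"aborting: {len(failures)} of {len(files)} downloads failed")
    return buf.getvalue()
```

If any "download" fails, the sketch raises instead of returning an archive, mirroring the abort-on-failure behavior described in step 4.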
+
+### Decompression
+
+1. Downloads the specified zip archive from S3.
+2. Extracts the requested file from the archive in memory.
+3. Returns the file content as a base64-encoded string.
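Assuming the archive bytes have already been fetched from S3, the in-memory extraction in steps 2–3 can be sketched as follows (`extract_one` is a hypothetical helper name):

```python
import base64
import io
import zipfile

def extract_one(archive_bytes: bytes, file_to_extract: str) -> str:
    """Read one member of an in-memory zip and return it base64-encoded."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        data = zf.read(file_to_extract)  # KeyError if the path is not in the archive
    return base64.b64encode(data).decode("ascii")
```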
+
+## Technical Details
+
+The compression engine relies on a few key Rust concepts to work safely and efficiently:
+
+- **`Arc<Mutex<...>>`**: To handle many concurrent file downloads, the core `ZipWriter` is wrapped in an `Arc` (atomic reference counter) and a `Mutex` (mutual exclusion lock).
+  - `Arc` lets multiple download tasks safely share ownership of the writer.
+  - `Mutex` ensures that only one task can write to the zip file at a time, preventing data corruption.
+- **`ZipWriter`**: A utility from the `zip` crate that handles the low-level details of creating a valid `.zip` archive structure.
+- **`BufWriter`**: To improve performance, file writes are sent through a `BufWriter`, an in-memory buffer that collects smaller writes into fewer, larger writes to the filesystem, reducing I/O overhead.
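Python's `io.BufferedWriter` behaves analogously to Rust's `BufWriter`, which makes the effect easy to demonstrate; the counting sink below is illustrative, not from the repo:

```python
import io

class CountingSink(io.RawIOBase):
    """Raw byte sink that counts how many low-level writes it receives."""
    def __init__(self):
        self.writes = 0
    def writable(self):
        return True
    def write(self, b):
        self.writes += 1
        return len(b)

sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=64 * 1024)
for _ in range(10_000):
    buffered.write(b"x" * 16)  # 10,000 tiny writes into the buffer...
buffered.flush()
print(sink.writes)             # ...reach the sink as only a handful of large writes
```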
 
 ## Environment Variables
 
-- `MAX_WORKERS`: Number of concurrent download workers (optional, default: 256)
 - `KMS_KEY_ID`: KMS key ID for server-side encryption (optional)

dev/generate_manifest.py

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# s3://s3-log-compressor-sourcebucket-rkeoqdxsxu2w/mock-logs/log_0000.json.gz
+# for log_0000 to log_9999
+# Base S3 bucket path
+bucket_path = "s3://s3-log-compressor-sourcebucket-rkeoqdxsxu2w/mock-logs/"
+
+# Generate manifest entries
+manifest_entries = []
+
+for i in range(10000):
+    log_filename = f"log_{i:04d}.json.gz"
+    s3_path = bucket_path + log_filename
+    manifest_entries.append(s3_path)
+
+# Output manifest as plaintext, one entry per line
+with open("manifest.txt", "w") as f:
+    for entry in manifest_entries:
+        f.write(entry + "\n")
+
+print(f"Generated manifest with {len(manifest_entries)} entries")
