-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(log-viewer-webui): Add support for loading IR/JSON streams from S3. #673
Conversation
Co-authored-by: Devin Gibson <[email protected]>
WalkthroughThe pull request introduces enhancements to the log viewer web UI's S3 storage integration. Changes span multiple components, including configuration updates in Changes
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (3)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. some of the handling may change after we convert this component to TypeScript. i will validate the changes later
"StreamFilesS3Region": "", | ||
"StreamFilesS3PathPrefix": "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make more sense to group these together
- StreamFilesDir
- StreamFilesS3Region
- StreamFilesS3PathPrefix
- StreamTargetUncompressedSize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean put them next to each other in the file, or do you mean introducing another level in the json file for these 4 members
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Junhao means putting them next to each other.
import S3Manager from "../S3Manager.js"; | ||
|
||
|
||
const S3_MANAGER = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having this as a global variable, can we implement S3Manager as a plugin, to make the s3Manager accessible from Fastify instances?
(similar to DbManager
export default fastifyPlugin(async (app, options) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the proposed timeline for S3 support, don't think I have time to implement and test this. Let's make it a future PR and use this to track the progress
https://app.asana.com/0/1209113033187881/1209185782944111
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (2)
131-138
: Add error handling for S3 configuration.Consider adding error handling to validate the S3 configuration when
StorageType.S3
is used.# Get S3 config s3_config: S3Config enable_s3_upload = False storage_config = worker_config.stream_output.storage if StorageType.S3 == storage_config.type: + if not isinstance(storage_config, StreamS3Storage): + logger.error("Invalid storage configuration for S3") + return report_task_failure( + sql_adapter=sql_adapter, + task_id=task_id, + start_time=start_time, + ) s3_config = storage_config.s3_config enable_s3_upload = True
166-207
: Consider batch processing for S3 uploads.The current implementation processes streams sequentially. For better performance, consider:
- Collecting all valid streams first
- Using concurrent uploads with a thread pool
- Implementing retry logic for failed uploads
if enable_s3_upload and QueryTaskStatus.SUCCEEDED == task_results.status: logger.info(f"Uploading streams to S3...") - upload_error = False + streams_to_upload = [] for line in task_stdout_str.splitlines(): try: stream_stats = json.loads(line) except json.decoder.JSONDecodeError: logger.exception(f"`{line}` cannot be decoded as JSON") - upload_error = True continue stream_path_str = stream_stats.get("path") if stream_path_str is None: logger.error(f"`path` is not a valid key in `{line}`") - upload_error = True continue stream_path = Path(stream_path_str) + streams_to_upload.append((stream_path, stream_path.name)) + with ThreadPoolExecutor(max_workers=4) as executor: + futures = [] + for stream_path, stream_name in streams_to_upload: + futures.append(executor.submit( + s3_put, s3_config, stream_path, stream_name + )) + + upload_error = False + for future in as_completed(futures): + try: + future.result() + except Exception as err: + logger.error(f"Failed to upload stream: {err}") + upload_error = True + for stream_path, _ in streams_to_upload: + stream_path.unlink()components/clp-py-utils/clp_py_utils/clp_config.py (2)
402-416
: Add docstrings to storage classes.Consider adding docstrings to explain:
- The purpose of each storage class
- The significance of default paths
- The relationship between FS and S3 storage variants
class ArchiveFsStorage(FsStorage): + """Filesystem storage for archives with default path at var/data/archives.""" directory: pathlib.Path = CLP_DEFAULT_DATA_DIRECTORY_PATH / "archives" class StreamFsStorage(FsStorage): + """Filesystem storage for streams with default path at var/data/streams.""" directory: pathlib.Path = CLP_DEFAULT_DATA_DIRECTORY_PATH / "streams" class ArchiveS3Storage(S3Storage): + """S3 storage for archives with local staging at var/data/staged-archives.""" staging_directory: pathlib.Path = CLP_DEFAULT_DATA_DIRECTORY_PATH / "staged-archives" class StreamS3Storage(S3Storage): + """S3 storage for streams with local staging at var/data/staged-streams.""" staging_directory: pathlib.Path = CLP_DEFAULT_DATA_DIRECTORY_PATH / "staged-streams"
418-438
: Add more specific type hints.Consider using Union with specific storage types instead of the base classes for better type safety.
-def _get_directory_from_storage_config(storage_config: Union[FsStorage, S3Storage]) -> pathlib.Path: +def _get_directory_from_storage_config( + storage_config: Union[ArchiveFsStorage, StreamFsStorage, ArchiveS3Storage, StreamS3Storage] +) -> pathlib.Path: -def _set_directory_for_storage_config( - storage_config: Union[FsStorage, S3Storage], directory -) -> None: +def _set_directory_for_storage_config( + storage_config: Union[ArchiveFsStorage, StreamFsStorage, ArchiveS3Storage, StreamS3Storage], + directory: pathlib.Path +) -> None:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
(5 hunks)components/clp-py-utils/clp_py_utils/clp_config.py
(11 hunks)components/clp-py-utils/clp_py_utils/s3_utils.py
(1 hunks)components/core/src/clp/clo/CommandLineArguments.cpp
(1 hunks)components/core/src/clp_s/CommandLineArguments.cpp
(2 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- components/core/src/clp_s/clp-s.cpp
- components/clp-py-utils/clp_py_utils/s3_utils.py
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/clo/CommandLineArguments.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/CommandLineArguments.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (9)
components/core/src/clp/clo/CommandLineArguments.cpp (1)
182-192
: LGTM! Clear and well-documented command-line options.The changes enhance the command-line interface by:
- Adding a default value for the
target-size
option- Introducing a new
print-ir-stats
option with clear documentationcomponents/core/src/clp_s/CommandLineArguments.cpp (2)
415-418
: LGTM! Well-documented command-line option.The new
print-ordered-chunk-stats
option follows the established pattern and includes clear documentation.
493-509
: LGTM! Comprehensive validation logic.The validation logic ensures that:
target-ordered-chunk-size
is only used withordered
argumentprint-ordered-chunk-stats
is only used withordered
argument- MongoDB URI is only used with ordered decompression
The error messages are clear and descriptive.
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (3)
2-2
: LGTM!The added imports are necessary for the new S3 functionality.
Also applies to: 9-9, 11-11
32-32
: LGTM!The addition of the
print_stream_stats
parameter and corresponding flags enables capturing stream statistics for S3 uploads.Also applies to: 57-58, 78-79
Line range hint
155-165
: LGTM!The modification to capture stdout is necessary for processing stream statistics.
components/clp-py-utils/clp_py_utils/clp_config.py (3)
313-328
: LGTM!The S3Credentials class properly encapsulates AWS credentials with appropriate validation.
335-335
: LGTM!The S3Config changes provide a clean interface for credential management and retrieval.
Also applies to: 357-361
441-441
: LGTM!The updates to configuration classes properly integrate S3 support while maintaining backward compatibility.
Also applies to: 484-484, 571-571, 601-603
Co-authored-by: Junhao Liao <[email protected]>
…into webui_s3_chunks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
942-952
: Consider adding error handling for missing AWS credentials.While the code correctly handles the presence of AWS credentials, it silently continues if they are missing. Consider logging a warning message to inform users that the Credential Provider Chain will be used.
access_key_id, secret_access_key = s3_config.get_credentials() if access_key_id is not None and secret_access_key is not None: container_cmd_extra_opts.extend( [ "-e", f"AWS_ACCESS_KEY_ID={access_key_id}", "-e", f"AWS_SECRET_ACCESS_KEY={secret_access_key}", ] ) + else: + logger.warning("AWS credentials not provided. Using AWS Credential Provider Chain.")components/log-viewer-webui/server/src/S3Manager.js (1)
30-54
: Consider adding input validation for empty or undefined URI.While the URL constructor handles invalid URLs (as noted in the learnings), it's good practice to validate that the input is not empty or undefined.
async getPreSignedUrl (s3uriString) { + if (!s3uriString) { + throw new Error('S3 URI string is required'); + } + const s3uri = new URL(s3uriString); const command = new GetObjectCommand({ Bucket: s3uri.hostname, Key: s3uri.pathname.substring(1), }); try { return await getSignedUrl( this.#s3Client, command, { expiresIn: PRE_SIGNED_URL_EXPIRY_TIME_SECONDS, } ); } catch (error) { throw new Error(`Failed to generate pre-signed URL: ${error.message}`); } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
components/log-viewer-webui/server/package-lock.json
is excluded by!**/package-lock.json
📒 Files selected for processing (5)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
(2 hunks)components/log-viewer-webui/server/package.json
(1 hunks)components/log-viewer-webui/server/settings.json
(1 hunks)components/log-viewer-webui/server/src/S3Manager.js
(1 hunks)components/log-viewer-webui/server/src/routes/query.js
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- components/log-viewer-webui/server/settings.json
- components/log-viewer-webui/server/package.json
- components/log-viewer-webui/server/src/routes/query.js
🧰 Additional context used
📓 Path-based instructions (1)
components/log-viewer-webui/server/src/S3Manager.js (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/log-viewer-webui/server/src/S3Manager.js (1)
Learnt from: haiqi96
PR: y-scope/clp#673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (7)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)
931-932
: LGTM! Good use of empty lines for readability.The empty lines help separate different sections of the code, making it more readable.
Also applies to: 933-933
934-941
: LGTM! Clean S3 configuration handling.The code properly checks for S3 storage type and updates the settings with the required S3 configuration.
974-975
: LGTM! Clean integration of container options.The code properly extends the container command with the additional options.
components/log-viewer-webui/server/src/S3Manager.js (4)
1-6
: LGTM! Clean import declarations.The imports are properly organized and specific to the required functionality.
8-12
: LGTM! Good use of constants.The pre-signed URL expiry time is properly defined as a constant with clear documentation.
13-28
: LGTM! Well-documented class implementation.The class is properly documented and follows best practices:
- Clear class description
- Private field using
#
prefix- Properly typed constructor parameter
57-57
: LGTM! Clean export declaration.The class is properly exported as the default export.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestions.
"@aws-sdk/client-s3": "^3.731.0", | ||
"@aws-sdk/s3-request-presigner": "^3.731.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"@aws-sdk/client-s3": "^3.731.0", | |
"@aws-sdk/s3-request-presigner": "^3.731.0", | |
"@aws-sdk/client-s3": "^3.731.1", | |
"@aws-sdk/s3-request-presigner": "^3.731.1", |
New release.
Co-authored-by: kirkrodrigues <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
934-953
: Consider using AWS credential provider chain as the primary method.The S3 configuration handling is well-implemented. However, consider promoting the use of the AWS credential provider chain (e.g., IAM roles) over environment variables for better security.
Apply this diff to add a comment about credential best practices:
if access_key_id is not None and secret_access_key is not None: + # TODO: Consider using AWS credential provider chain (e.g., IAM roles) + # instead of environment variables for better security container_cmd_extra_opts.extend( [ "-e",Also applies to: 975-976
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
(2 hunks)components/log-viewer-webui/server/src/S3Manager.js
(1 hunks)components/log-viewer-webui/server/src/routes/query.js
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/log-viewer-webui/server/src/routes/query.js
🧰 Additional context used
📓 Path-based instructions (1)
components/log-viewer-webui/server/src/S3Manager.js (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/log-viewer-webui/server/src/S3Manager.js (1)
Learnt from: haiqi96
PR: y-scope/clp#673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build (macos-latest)
🔇 Additional comments (5)
components/log-viewer-webui/server/src/S3Manager.js (4)
1-11
: LGTM! Well-structured imports and constants.The imports are correctly structured, and the constant is well-documented with an appropriate name.
13-28
: LGTM! Well-structured class with proper encapsulation.The class is well-documented, uses private field syntax correctly, and the constructor properly initializes the S3 client.
30-55
: LGTM! Well-implemented pre-signed URL generation.The method is well-documented, includes proper error handling, and correctly uses the URL constructor for validation as per the learnings.
58-58
: LGTM! Proper export statement.The class is correctly exported as the default export.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
931-933
: LGTM! Proper initialization of container command options.The list is correctly initialized before the S3 configuration block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/log-viewer-webui/server/src/S3Manager.js (1)
13-15
: Enhance class documentation.Consider adding more details about the class's purpose and responsibilities, such as generating pre-signed URLs for S3 objects.
/** - * Class to manage Simple Storage Service (S3) objects. + * Class to manage Amazon Simple Storage Service (S3) objects. + * Provides functionality to generate pre-signed URLs for temporary access to S3 objects. */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
components/log-viewer-webui/server/package-lock.json
is excluded by!**/package-lock.json
📒 Files selected for processing (5)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py
(2 hunks)components/log-viewer-webui/server/package.json
(1 hunks)components/log-viewer-webui/server/settings.json
(1 hunks)components/log-viewer-webui/server/src/S3Manager.js
(1 hunks)components/log-viewer-webui/server/src/routes/query.js
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- components/log-viewer-webui/server/settings.json
- components/log-viewer-webui/server/package.json
- components/log-viewer-webui/server/src/routes/query.js
🧰 Additional context used
📓 Path-based instructions (1)
components/log-viewer-webui/server/src/S3Manager.js (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/log-viewer-webui/server/src/S3Manager.js (1)
Learnt from: haiqi96
PR: y-scope/clp#673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (6)
components/log-viewer-webui/server/src/S3Manager.js (4)
1-11
: Well-structured imports and constant definition!The imports are specific and the constant for URL expiry time is well-documented.
16-28
: Well-implemented constructor with private field!The constructor is focused, and the use of private field for S3 client is appropriate.
30-55
: Excellent implementation of getPreSignedUrl!The method demonstrates good practices:
- Comprehensive JSDoc documentation
- Proper error handling with descriptive messages
- Use of the constant for URL expiry time
- Efficient URL validation through the URL constructor
58-58
: Clean export statement!The default export is appropriate for this single-class module.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
931-953
: Well-implemented S3 configuration handling!The code demonstrates good practices:
- Clear conditional check for S3 storage type
- Proper handling of AWS credentials
- Clean separation of container command options
974-975
: Clean integration of S3 configuration!The extension of container commands is handled cleanly, maintaining good separation of concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix for my suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PR title, how about:
feat(log-viewer-webui): Add support for loading IR/JSON streams from S3.
can't find the approve button on my phone
Description
Adding support in log_viewer webui for displaying streams from S3.
This PR depends on #662
In this PR, we uses @aws-sdk/client-s3 to create a presigned URL for the streams stored in S3. Also slightly modified the server-client logic to properly populate the "stream path".
The S3client relies on the start_clp.py to supply the AWS credentials as the environment variables. In the case no s3 credentials are provided, it should follow the default Credential Provider Chain to look for credentials.
Note:
To support loading streams from S3, the bucket storing the stream must be configured with cross region resouce sharing with the following configs:
Note that if a user wants to run the clp webui on a different port, he/she should also update the CORS config.
Validation performed
Tested
Summary by CodeRabbit
New Features
Dependencies
Improvements