BFD-3664: Pipeline job for SAMHSA tag backfill. #2506

Open
wants to merge 48 commits into
base: feature/samhsa2.0

dondevun
Contributor

@dondevun dondevun commented Dec 3, 2024

JIRA Ticket:
BFD-3664

What Does This PR Do?

This PR adds a pipeline job that backfills the SAMHSA tags tables. The job can run concurrently with the RDA pipeline job, but it is disabled on CCW pipeline instances.

The job processes all of the tables that could contain SAMHSA codes concurrently, each with its own entityManager. Some tradeoffs were made for the sake of performance. The biggest is that the job does not construct entities from the SQL query results. Instead, each query returns arrays of objects, and the code must know the type at each array position. This is not ideal: if the array elements are not read in the correct order, the result can be a ClassCastException. Without an entity class to map the columns to types, however, this is unavoidable for this implementation.
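To make the positional-array tradeoff concrete, here is a minimal sketch of one way to keep the casts in a single place. It is not the code in this PR: the `ClaimRow` class, its two-column layout, and the sample values are all hypothetical, and the row is a plain `Object[]` standing in for whatever the query actually returns.

```java
/** Hypothetical typed view of one positional result row; the column layout is assumed. */
final class ClaimRow {
    final String claimId;
    final String code;

    private ClaimRow(String claimId, String code) {
        this.claimId = claimId;
        this.code = code;
    }

    /** Validates the column count and types in one place before any cast happens. */
    static ClaimRow fromRow(Object[] row) {
        if (row == null || row.length != 2) {
            throw new IllegalArgumentException("expected 2 columns: [claimId, code]");
        }
        return new ClaimRow(column(row, 0, String.class), column(row, 1, String.class));
    }

    /** Returns the value at index as the requested type, or fails with a descriptive message. */
    private static <T> T column(Object[] row, int index, Class<T> type) {
        Object value = row[index];
        if (!type.isInstance(value)) {
            throw new IllegalArgumentException(
                "column " + index + " expected " + type.getSimpleName() + " but was "
                    + (value == null ? "null" : value.getClass().getSimpleName()));
        }
        return type.cast(value);
    }

    public static void main(String[] args) {
        // Placeholder values, not real claim data.
        ClaimRow ok = ClaimRow.fromRow(new Object[] {"claim-1", "example-code"});
        System.out.println(ok.claimId + " " + ok.code);
        try {
            // A row whose first column has an unexpected type.
            ClaimRow.fromRow(new Object[] {42L, "example-code"});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The design choice here is only that a type mismatch is reported with the column index instead of surfacing as a bare ClassCastException somewhere downstream; it does not remove the need to keep the column order in sync with the query.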

What Should Reviewers Watch For?

If you're reviewing this PR, please check for these things in particular:

What Security Implications Does This PR Have?

Please indicate if this PR does any of the following:

  • Adds any new software dependencies

  • Modifies any security controls

  • Adds new transmission or storage of data

  • Any other changes that could possibly affect security?

  • I have considered the above security implications as they relate to this PR. (If one or more of the above apply, it cannot be merged without the ISSO or team security engineer's (@sb-benohe) approval.)

Validation

Have you fully verified and tested these changes? Are the acceptance criteria met? Please provide reproducible testing instructions, code snippets, or screenshots as applicable.

@dondevun dondevun changed the title from "Bfd 3664" to "BFD-3664: Pipeline job for SAMHSA tag backfill." Dec 4, 2024
@dondevun dondevun marked this pull request as ready for review December 5, 2024 19:25
ConfigLoader config, boolean ccwPipelineEnabled) {
boolean enabled = config.booleanOption(SSM_PATH_SAMHSA_BACKFILL_ENABLED).orElse(false);
// We don't want to run if we're on a CCW Pipeline instance
if (!enabled || ccwPipelineEnabled) {
Contributor

It is a bit confusing that this runs on the RDA pipeline, but I understand the reasoning. Ideally this could run as its own pipeline, but that would increase the complexity a fair bit. I think this is fine for now and we can revisit once we're running in ECS.
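For readers following the thread, the gating condition in the hunk above can be read as a small truth table. The sketch below is illustrative only: `BackfillGate` and `shouldRunBackfill` are hypothetical names, and an `Optional<Boolean>` stands in for the config lookup, whose real API is not reproduced here.

```java
import java.util.Optional;

/** Sketch of the enablement rule; the class and method names are hypothetical. */
final class BackfillGate {

    /**
     * @param enabledOption the configured flag, empty when the setting is absent
     * @param ccwPipelineEnabled true on a CCW pipeline instance
     */
    static boolean shouldRunBackfill(Optional<Boolean> enabledOption, boolean ccwPipelineEnabled) {
        boolean enabled = enabledOption.orElse(false); // an absent setting means "off"
        // Logical negation of the skip condition `!enabled || ccwPipelineEnabled`.
        return enabled && !ccwPipelineEnabled;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean enabled : flags) {
            for (boolean ccw : flags) {
                System.out.printf("enabled=%b ccw=%b -> run=%b%n",
                    enabled, ccw, shouldRunBackfill(Optional.of(enabled), ccw));
            }
        }
        System.out.println("unset -> run=" + shouldRunBackfill(Optional.<Boolean>empty(), false));
    }
}
```

Under these assumptions the job runs in exactly one of the four combinations: the flag is on and the instance is not a CCW pipeline instance.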

String queryStr =
strSub.replace(
startingClaim.isPresent() ? QUERY_WITH_STARTING_CLAIM : QUERY_WITH_NO_STARTING_CLAIM);
return entityManager.createNativeQuery(queryStr, tableEntry.getClaimClass());
Contributor

Would it be possible to use createQuery instead of createNativeQuery? That will at least perform some type checking to ensure the type is assignable to the query result. I believe that would remove the need for select * as well.

Contributor Author

Yeah, we can do that. The check for existing tags may be a bit tricky, but I'll see what I can do.

Contributor Author

I think these changes should work fine. Testing performance now.

@@ -152,6 +152,8 @@
/bfd/${env}/pipeline/nonsensitive/rda/cleanup/run_size: UNDEFINED
/bfd/${env}/pipeline/nonsensitive/rda/cleanup/transaction_size: UNDEFINED
/bfd/${env}/pipeline/nonsensitive/rda/instance_type: m6a.large
/bfd/${env}/pipeline/nonsensitive/rda/samhsa/backfill/enabled: "false"
/bfd/${env}/pipeline/nonsensitive/rda/samhsa/backfill/batch_size: 15000
Contributor

Might want to test with a larger batch size here to see if it helps. 100,000 or so shouldn't be a problem.
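As a rough illustration of why batch size matters, the sketch below counts how many fetches a cursor-style loop needs to walk a table at two batch sizes. Everything in it is an assumption for illustration: the in-memory list stands in for a table, `fetchBatch` stands in for a query that resumes after the last claim seen, and the 300,000-row count is made up. It says nothing about real query time, memory use, or transaction cost, which still need to be measured.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of batched processing with a "last seen id" cursor; all names are hypothetical. */
final class BatchSketch {

    /** Returns up to batchSize ids greater than lastSeen; a null lastSeen starts from the beginning. */
    static List<Long> fetchBatch(List<Long> sortedIds, Long lastSeen, int batchSize) {
        List<Long> batch = new ArrayList<>();
        for (Long id : sortedIds) {
            if (lastSeen == null || id > lastSeen) {
                batch.add(id);
                if (batch.size() == batchSize) {
                    break;
                }
            }
        }
        return batch;
    }

    /** Counts fetches until a batch comes back short, which signals the end of the table. */
    static int countFetches(List<Long> sortedIds, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int fetches = 0;
        Long lastSeen = null;
        while (true) {
            List<Long> batch = fetchBatch(sortedIds, lastSeen, batchSize);
            fetches++;
            if (batch.size() < batchSize) {
                return fetches;
            }
            lastSeen = batch.get(batch.size() - 1); // resume after the last id processed
        }
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 300_000; i++) { // hypothetical row count
            ids.add(i);
        }
        System.out.println("batch 15000:  " + countFetches(ids, 15_000) + " fetches");
        System.out.println("batch 100000: " + countFetches(ids, 100_000) + " fetches");
    }
}
```

Fewer, larger batches reduce the number of round trips, at the cost of holding more rows in memory per batch.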

@dondevun
Contributor Author

image

@dondevun dondevun marked this pull request as draft January 6, 2025 16:18
@dondevun dondevun force-pushed the BFD-3664 branch 2 times, most recently from 890887b to b75b481 January 7, 2025 19:06
@dondevun dondevun marked this pull request as ready for review January 10, 2025 14:32
2 participants