feat: ALS-based i2i similarity experiment #325

Open
em3s wants to merge 2 commits into feat/step-dsl-313 from feat/als-i2i-experiment


em3s (Contributor) commented May 13, 2026

Summary

ALS i2i similarity experiment dogfooding the Step DSL on V2 edge JSON. First real workload that drove the Source/Flow/Merge/Split/Sink spec revision in #313.

Closes none — sits on top of #314 (Step DSL).

Changes

  • AlsFlow (Flow 1→1): wraps spark.ml.recommendation.ALS; emits item factors. Implicit-feedback defaults.
  • TopKSimilarityFlow (Flow 1→1): cosine similarity self-cross-join + Window top-K. Output schema (item_id long, similar_item_id long, score double, rank int).
  • pipeline/conf/als-i2i-experiment.yaml: production-shape workflow — FileSource(json) → SqlMerge(prep) → AlsFlow → TopKSimilarityFlow → FileSink(parquet).
  • Spark MLlib added to pipeline module (compileOnly + testImplementation).
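The top-K step described above can be illustrated with a minimal plain-Python sketch. It is not the Spark implementation in this PR: the function names, the exclusion of self-pairs, and the tie-break on the smaller item id are assumptions made for illustration. The 0.0 score for zero-vector pairs follows the behavior stated in the commit notes.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Mirrors the stated output schema: (item_id, similar_item_id, score, rank).
Row = Tuple[int, int, float, int]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that yields 0.0 (not NaN) when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def top_k_similar(factors: Dict[int, Sequence[float]], k: int) -> List[Row]:
    """Score every item against every other item and keep the k best per item.

    Assumptions (not confirmed by the PR): self-pairs are excluded and
    ties are broken by the smaller similar_item_id.
    """
    rows: List[Row] = []
    for item_id, vec in sorted(factors.items()):
        scored = [
            (other_id, cosine(vec, other_vec))
            for other_id, other_vec in factors.items()
            if other_id != item_id
        ]
        # Highest score first, then smallest id; rank starts at 1 like row_number().
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        for rank, (other_id, score) in enumerate(scored[:k], start=1):
            rows.append((item_id, other_id, score, rank))
    return rows


# Hypothetical item factors; item 4 is a zero vector.
factors = {1: [1.0, 0.0], 2: [2.0, 0.0], 3: [0.0, 1.0], 4: [0.0, 0.0]}
rows = top_k_similar(factors, k=1)
```

Like a self-cross-join, the all-pairs loop is quadratic in the number of items, so the sketch is only meant for reading, not for real data volumes.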

How to Test

  • ./gradlew :pipeline:test runs AlsFlowTest (2), TopKSimilarityFlowTest (5), and AlsI2iWorkflowTest (end-to-end on synthetic two-cluster edges; verifies each item's top-1 neighbor is its cluster sibling).
  • For real data: drop V2 edge JSON dumps under the path: entry in the YAML and run via spark-submit --class com.kakao.actionbase.pipeline.jobs.StepsRunnerJob.
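To make the end-to-end check concrete, here is a rough sketch of what a synthetic two-cluster fixture and its top-1 assertion could look like. The item ids, cluster sizes, and helper names are hypothetical, and the neighbor ranking uses raw user overlap as a stand-in for ALS item factors, so this is not the logic of AlsI2iWorkflowTest itself.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]  # (user_id, item_id), treated as implicit feedback


def make_two_cluster_edges(users_per_cluster: int = 5) -> List[Edge]:
    """Build edges where each user interacts only with the items of one cluster."""
    clusters = {0: [1, 2], 1: [3, 4]}  # hypothetical item ids, two siblings per cluster
    edges: List[Edge] = []
    for cluster, items in clusters.items():
        for offset in range(users_per_cluster):
            user_id = cluster * users_per_cluster + offset
            edges.extend((user_id, item_id) for item_id in items)
    return edges


def top1_by_user_overlap(edges: List[Edge]) -> Dict[int, int]:
    """Pick each item's nearest neighbor by shared-user count (stand-in for ALS)."""
    users_by_item: Dict[int, Set[int]] = {}
    for user_id, item_id in edges:
        users_by_item.setdefault(item_id, set()).add(user_id)
    best: Dict[int, int] = {}
    for item_id, users in users_by_item.items():
        # Largest overlap wins; negating the id makes the smaller id win ties.
        candidates = [
            (len(users & other_users), -other_id)
            for other_id, other_users in users_by_item.items()
            if other_id != item_id
        ]
        _, negated_id = max(candidates)
        best[item_id] = -negated_id
    return best


edges = make_two_cluster_edges()
neighbors = top1_by_user_overlap(edges)
```

With this fixture, a correct pipeline should pair item 1 with item 2 and item 3 with item 4, which is the property the end-to-end test is described as verifying.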

AI Assistance

  • This PR was written largely with AI assistance.
    • Tool / model: Claude Code (Opus 4.7, 1M context)

dosubot (bot) added the labels size:L (This PR changes 100-499 lines, ignoring generated files) and enhancement (New feature or request) May 13, 2026
em3s and others added 2 commits May 14, 2026 00:08
Dogfoods the Step DSL (#313) on a real recommendation workload: V2 edge
JSON dumps treated as implicit feedback, ALS-fit, then top-K cosine
similarity over the item factors written as Parquet.

- AlsTransform: wraps spark.ml.recommendation.ALS as a Transform. Default
  hyperparams target implicit feedback (implicitPrefs=true,
  coldStartStrategy="drop"); seed pinned for reproducibility.
- TopKSimilarityTransform: cosine similarity via self-cross-join + Window
  row_number(). Output schema (item_id long, similar_item_id long, score
  double, rank int). Returns 0.0 for zero-vector pairs instead of NaN.
- Adds Spark MLlib dependency (compileOnly + testImplementation).
- pipeline/conf/als-i2i-experiment.yaml: production-shape workflow YAML
  driving FileSource → SqlTransform prep → AlsTransform → TopK → FileSink.
- AlsI2iWorkflowTest: end-to-end on synthetic two-cluster edges,
  verifying ALS recovers the structure and TopK returns the correct
  sibling for each item.

Depends on feat/step-dsl-313 (Step DSL itself).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Step DSL spec was refactored to Source/Flow/Merge/Split/Sink (commit
e147235 on feat/step-dsl-313). This commit reclassifies the experiment's
built-ins onto the new traits:

- AlsTransform → AlsFlow (Flow, 1→1; emits item factors only)
- TopKSimilarityTransform → TopKSimilarityFlow (Flow, 1→1)
- YAML and end-to-end test references updated (SqlTransform → SqlMerge,
  AlsTransform → AlsFlow, TopKSimilarityTransform → TopKSimilarityFlow)

CacheTransform multi-input require-check is dropped since Flow's signature
guarantees a single input by construction.

AlsFlow doc notes that dual-factor (item + user) belongs to a future
AlsFactorSplit (1→2 Split) once Split built-ins land.
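
The arity argument above (a Flow takes exactly one input, so no runtime multi-input check is needed) can be sketched in plain Python. The class and method names below are hypothetical stand-ins, not the Step DSL's actual traits, and lists of ints stand in for DataFrames.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, Tuple, TypeVar

T = TypeVar("T")


class Flow(ABC, Generic[T]):
    """1→1 step: the signature accepts exactly one input."""

    @abstractmethod
    def apply(self, data: T) -> T: ...


class Merge(ABC, Generic[T]):
    """N→1 step: the only shape that accepts several inputs."""

    @abstractmethod
    def apply(self, inputs: List[T]) -> T: ...


class Split(ABC, Generic[T]):
    """1→2 step: one input, two outputs (e.g. item and user factors)."""

    @abstractmethod
    def apply(self, data: T) -> Tuple[T, T]: ...


class DoubleFlow(Flow[List[int]]):
    def apply(self, data: List[int]) -> List[int]:
        return [2 * x for x in data]


class ConcatMerge(Merge[List[int]]):
    def apply(self, inputs: List[List[int]]) -> List[int]:
        return [x for part in inputs for x in part]


class EvenOddSplit(Split[List[int]]):
    def apply(self, data: List[int]) -> Tuple[List[int], List[int]]:
        evens = [x for x in data if x % 2 == 0]
        odds = [x for x in data if x % 2 != 0]
        return evens, odds


# Passing two inputs to a Flow is rejected by the signature itself,
# so the step body needs no explicit "exactly one input" check.
try:
    DoubleFlow().apply([1], [2])  # type: ignore[call-arg]
    rejected = False
except TypeError:
    rejected = True
```

In a statically typed language the same mistake would be caught at compile time; in this Python sketch it surfaces as a TypeError at call time or as a type-checker error.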

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
em3s force-pushed the feat/als-i2i-experiment branch from b6784e2 to 0befd43 May 13, 2026 15:11