@mohitmundhragithub
Contributor

No description provided.

@mohitmundhragithub mohitmundhragithub requested review from a team and anhappdev as code owners August 26, 2025 05:41
@github-actions

github-actions bot commented Aug 26, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

…lemented performance benchmark for LLM pipeline
…y input and issue_query only handles output tokens
@farook-edev farook-edev changed the title Feat llm LLM pipeline implementation Sep 2, 2025
@farook-edev farook-edev linked an issue Sep 2, 2025 that may be closed by this pull request
Contributor Author

@mohitmundhragithub mohitmundhragithub left a comment

.

@anhappdev
Collaborator

This PR should resolve the issue with the iOS build: #1064
However, the Windows build still fails. Here's the log: 2025-10-07-windows.log

@farook-edev
Contributor

> This PR should resolve the issue with the iOS build: #1064
> However, the Windows build still fails. Here's the log: 2025-10-07-windows.log

Thanks a ton! I'll look into the Windows issue.

@freedomtan
Contributor

@freedomtan to test the app (and the accuracy of tinyMMLU).

@freedomtan
Contributor

for performance:

  • time to first token
  • tokens/s

for accuracy:

  • tinyMMLU
  • ifeval
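
The two performance metrics listed above can be derived from per-token timestamps. The following is a minimal sketch, assuming the backend records when the query was issued and when each output token was emitted. The helper name `llm_latency_metrics` and the choice to exclude the first token from the tokens/s rate are assumptions made for illustration; the PR does not specify either.

```python
from typing import Sequence


def llm_latency_metrics(issue_time_s: float, token_times_s: Sequence[float]) -> dict:
    """Compute TTFT and decode throughput from timestamps (hypothetical helper).

    issue_time_s  -- when the query was issued, in seconds
    token_times_s -- emission time of each output token, in seconds, ascending
    """
    if not token_times_s:
        raise ValueError("at least one output token is required")

    # Time to first token: delay between issuing the query and the first token.
    ttft_s = token_times_s[0] - issue_time_s

    # Assumption: tokens/s covers only the decode phase, i.e. the tokens
    # produced after the first one, so prefill time does not skew the rate.
    decode_tokens = len(token_times_s) - 1
    decode_time_s = token_times_s[-1] - token_times_s[0]
    tokens_per_s = decode_tokens / decode_time_s if decode_time_s > 0 else float("nan")

    return {"ttft_s": ttft_s, "tokens_per_s": tokens_per_s}


if __name__ == "__main__":
    # Illustrative timestamps only, not measurements from this PR.
    print(llm_latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

Whether tokens/s should include the first token is a definitional choice that the benchmark rules would need to settle; the sketch only shows one option.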

@Mostelk

Mostelk commented Oct 22, 2025

@farook-edev please share link to all assets used for LLM benchmark, TFRecords for datasets and models

@freedomtan
Contributor

Confirmed that I can run the APK (https://github.com/mlcommons/mobile_app_open/actions/runs/18638579937/artifacts/4313148049) on a Pixel 10.

However, there appears to be no TTFT or tokens/s information in mlperf_log_summary.txt. I think we need that information in both the detail log and the summary log.

The following is the mlperf_log_summary.txt from performance mode. The value performance_sample_count : 1 looks odd too.

mustang:/sdcard/Android/data/org.mlcommons.android.mlperfbench/files/logs/2025-10-22T16-49-11.233245/llm-performance $ cat mlperf_log_summary.txt
================================================
MLPerf Results Summary
================================================
SUT name : TFLite
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 5324192452
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Skipped
  Early stopping satisfied: Yes
Early Stopping Result:
 * Processed at least 64 queries (77).
 * Would discard 0 highest latency queries.
 * Early stopping 90th percentile estimate: 6144549223
 * Not enough queries processed for 99th percentile
 early stopping estimate (would need to process at
 least 662 total queries).

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 0.25
QPS w/o loadgen overhead        : 0.25

Min latency (ns)                : 3479648544
Max latency (ns)                : 6144549223
Mean latency (ns)               : 4035065622
50.00 percentile latency (ns)   : 3731987816
90.00 percentile latency (ns)   : 5324192452
95.00 percentile latency (ns)   : 5677467322
97.00 percentile latency (ns)   : 5766271567
99.00 percentile latency (ns)   : 6144549223
99.90 percentile latency (ns)   : 6144549223

================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 1000
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 60000
max_duration (ms): 300000
min_query_count : 100
max_query_count : 0
qsl_rng_seed : 3066443479025735752
sample_index_rng_seed : 10688027786191513374
schedule_rng_seed : 14962580496156340209
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 1

No warnings encountered during test.

No errors encountered during test.
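
To compare runs, the summary above can be read programmatically. The following is a minimal sketch, assuming the file keeps the `key : value` layout shown in this log; `parse_summary` is a hypothetical helper, not part of LoadGen or the app. It also cross-checks the reported QPS against the mean latency, on the assumption that SingleStream issues queries back to back.

```python
import re


def parse_summary(text: str) -> dict:
    """Parse 'key : value' lines of an mlperf_log_summary.txt into a dict.

    Numeric values become int or float; everything else stays a string.
    Lines without a value after the colon (section titles) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        # Optional leading '*' covers the bulleted early-stopping lines.
        match = re.match(r"^\s*\*?\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if not match:
            continue
        key, value = match.groups()
        try:
            fields[key] = int(value)
        except ValueError:
            try:
                fields[key] = float(value)
            except ValueError:
                fields[key] = value
    return fields


# Excerpt copied from the summary in this comment.
SAMPLE = """\
QPS w/o loadgen overhead        : 0.25
Mean latency (ns)               : 4035065622
90.00 percentile latency (ns)   : 5324192452
performance_sample_count : 1
"""

summary = parse_summary(SAMPLE)
mean_latency_s = summary["Mean latency (ns)"] / 1e9
# Assumption: with back-to-back queries, throughput is roughly 1 / mean latency.
implied_qps = 1.0 / mean_latency_s
print(f"mean latency: {mean_latency_s:.3f} s, implied QPS: {implied_qps:.2f}")
```

The implied value rounds to the 0.25 QPS reported in the log, which suggests the summary currently measures whole-query latency only and carries no per-token timing.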

@farook-edev
Contributor

> @farook-edev please share link to all assets used for LLM benchmark, TFRecords for datasets and models

You should be able to find them here.

@freedomtan
Contributor

@farook-edev please help get meaningful numbers into mlperf_log_summary.txt so that we can determine the constraints for performance mode (minimum running time, number of output tokens).

@anhappdev: we always update to the latest LoadGen version when there is a new release. Please do so here. Of course, we should test it carefully.
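
For the performance-mode constraints mentioned above (minimum running time, number of output tokens), a rough budget can be worked out once TTFT and tokens/s are available. The following is a back-of-the-envelope sketch, not LoadGen's scheduling logic; both helper names are hypothetical, and the TTFT, token count, and tokens/s values are illustrative placeholders rather than measurements. Only the 60 s minimum duration is taken from the `min_duration (ms): 60000` setting in the log.

```python
import math


def estimate_query_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Rough per-query latency: time to first token plus decode time for the rest."""
    if output_tokens < 1 or tokens_per_s <= 0:
        raise ValueError("output_tokens must be >= 1 and tokens_per_s must be > 0")
    return ttft_s + (output_tokens - 1) / tokens_per_s


def queries_to_fill(min_duration_s: float, query_latency_s: float) -> int:
    """Number of back-to-back queries needed to run for at least min_duration_s."""
    return math.ceil(min_duration_s / query_latency_s)


# Illustrative placeholders: 0.5 s TTFT, 129 output tokens, 32 tokens/s decode.
latency_s = estimate_query_latency_s(ttft_s=0.5, output_tokens=129, tokens_per_s=32.0)
print(latency_s)                          # 0.5 + 128 / 32 = 4.5
print(queries_to_fill(60.0, latency_s))   # ceil(60 / 4.5) = 14
```

Fixing the number of output tokens makes per-query latency predictable, which in turn makes it easier to choose a minimum running time that yields enough queries.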

@freedomtan
Contributor

freedomtan commented Oct 28, 2025

Disable the C++ exception handling when building Eigen for iOS.

@freedomtan
Contributor

@mohitmundhragithub: please provide the link to the discussed mlperf_client testing method.

@mohitmundhragithub
Contributor Author

mohitmundhragithub commented Oct 28, 2025

@freedomtan
Contributor

Note on samples:

  • reproducible results: when the random number seeds are changed, the results must remain comparable.

@mohitmundhragithub, @Mostelk
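
One way to make "comparable" in the note above checkable is to run the benchmark under several seeds and bound the relative spread of a metric. The following is a minimal sketch under that assumption; the helper names and the 5% default tolerance are hypothetical, and the sample values are illustrative rather than measured.

```python
from statistics import mean
from typing import Mapping


def relative_spread(metric_by_seed: Mapping[int, float]) -> float:
    """Return (max - min) / mean of a metric measured under different seeds."""
    values = list(metric_by_seed.values())
    if len(values) < 2:
        raise ValueError("need results from at least two seeds")
    return (max(values) - min(values)) / mean(values)


def is_comparable(metric_by_seed: Mapping[int, float], tolerance: float = 0.05) -> bool:
    """True if the results differ by no more than `tolerance` (hypothetical threshold)."""
    return relative_spread(metric_by_seed) <= tolerance


# Illustrative tokens/s values keyed by seed, not measurements from this PR.
runs = {1: 100.0, 2: 102.0, 3: 98.0}
print(relative_spread(runs))   # (102 - 98) / 100 = 0.04
print(is_comparable(runs))     # True with the 5% default tolerance
```

The same check applies to accuracy scores when the seed changes which samples are drawn.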

Development

Successfully merging this pull request may close these issues.

Master issue: LLM Benchmark

6 participants