upload new benchmark table (#1669)
* upload new benchmark table

* upload llm benchmark table

* update typo

* Update benchmark.md
akrztrk authored Dec 19, 2024
1 parent c873925 commit d2281bb
Showing 2 changed files with 54 additions and 1 deletion.
28 changes: 28 additions & 0 deletions docs/en/benchmark.md
@@ -659,6 +659,34 @@ deid_pipeline = Pipeline().setStages([

PS: Pipelines with the same stage types can still have different costs because their NER models differ in the number of layers and because of the hardcoded regexes in Deidentification.


- ZeroShot Deidentification Pipelines Speed Comparison

- **[clinical_deidentification](https://nlp.johnsnowlabs.com/2024/03/27/clinical_deidentification_en.html)** 2 NER, 1 clinical embedding, 13 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_medium](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_medium_en.html)** 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_medium_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_medium_wip_en.html)** 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_large](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_large_en.html)** 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_large_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_large_wip_en.html)** 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

- CPU Testing:

{:.table-model-big.db}

| partition | clinical deidentification | clinical deidentification <br> zeroshot_medium | clinical deidentification <br> docwise_medium_wip | clinical deidentification <br> zeroshot_large | clinical deidentification <br> docwise_large_wip |
|-----------|---------------------------|-------------------------------------------|----------------------------------------------|------------------------------------------|---------------------------------------------|
| 4 | 295.8 | 520.8 | 862.7 | 1537.9 | 1832.4 |
| 8 | 195.0 | 345.6 | 577.0 | 1013.9 | 1228.3 |
| 16 | 133.3 | 227.2 | 401.8 | 666.2 | 835.2 |
| 32 | 109.5 | 160.9 | 305.3 | 456.9 | 614.7 |
| 64 | 92.0 | 166.8 | 291.5 | 465.0 | 584.9 |
| 100 | 79.3 | 174.1 | 274.8 | 495.3 | 587.8 |
| 1000 | 56.3 | 181.4 | 270.7 | 502.4 | 556.4 |
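The `partition` column is the number of Spark partitions the input DataFrame was split into before each run. As a usage reference, here is a minimal sketch of such a timing run; it assumes a licensed Spark NLP for Healthcare session, and the sample text, dataset size, and `noop` sink are illustrative, not taken from the benchmark itself:

```python
import time

import sparknlp_jsl
from sparknlp.pretrained import PretrainedPipeline

# Assumes a valid license secret is available (illustrative variable name).
spark = sparknlp_jsl.start(license_key)

# Load one of the pretrained deidentification pipelines compared above.
deid_pipeline = PretrainedPipeline(
    "clinical_deidentification_zeroshot_medium", "en", "clinical/models"
)

# Illustrative input; the benchmark used a real clinical dataset.
df = spark.createDataFrame(
    [("Dr. John Lee saw Mary Smith on 2024-01-15 at Boston General.",)] * 1000,
    ["text"],
)

# Repartition to each partition count in the table, then time a full pass.
for n in [4, 8, 16, 32, 64, 100, 1000]:
    start = time.time()
    deid_pipeline.transform(df.repartition(n)) \
        .write.format("noop").mode("overwrite").save()
    print(f"{n:>5} partitions: {time.time() - start:.1f}s")
```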


</div><div class="h3-box" markdown="1">

### Deidentification Pipelines Cost Benchmarks
27 changes: 26 additions & 1 deletion docs/en/benchmark_llm.md
@@ -10,6 +10,31 @@ show_nav: true
sidebar:
nav: sparknlp-healthcare
---
<div class="h3-box" markdown="1">

## Medical Benchmarks

### Benchmarking

{:.table-model-big.db}

| Model | Average | MedMCQA | MedQA | MMLU <br>anatomy | MMLU<br>clinical<br>knowledge | MMLU<br>college<br>biology | MMLU<br>college<br>medicine | MMLU<br>medical<br>genetics | MMLU<br>professional<br>medicine | PubMedQA |
|-----------------|---------|---------|--------|------------------|-------------------------------|----------------------------|------------------------------|------------------------------|-----------------------------------|----------|
| jsl_medm_q4_v3 | 0.6884 | 0.6421 | 0.6889 | 0.7333 | 0.834 | 0.8681 | 0.7514 | 0.9 | 0.8493 | 0.782 |
| jsl_medm_q8_v3 | 0.6947 | 0.6416 | 0.707 | 0.7556 | 0.8377 | 0.9097 | 0.7688 | 0.9 | 0.8713 | 0.79 |
| jsl_medm_q16_v3 | 0.6964 | 0.6436 | 0.7117 | 0.7481 | 0.8453 | 0.9028 | 0.7688 | 0.87 | 0.8676 | 0.794 |
| jsl_meds_q4_v3 | 0.5522 | 0.5104 | 0.48 | 0.6444 | 0.7472 | 0.8333 | 0.6532 | 0.68 | 0.6691 | 0.752 |
| jsl_meds_q8_v3 | 0.5727 | 0.53 | 0.4933 | 0.6593 | 0.7623 | 0.8681 | 0.6301 | 0.76 | 0.7647 | 0.762 |
| jsl_meds_q16_v3 | 0.5793 | 0.5482 | 0.4839 | 0.637 | 0.7585 | 0.8403 | 0.6532 | 0.77 | 0.7022 | 0.766 |
</div><div class="h3-box" markdown="1">

### Benchmark Summary

We evaluated six John Snow Labs LLMs across nine task categories: MedMCQA, MedQA, MMLU Anatomy, MMLU Clinical Knowledge, MMLU College Biology, MMLU College Medicine, MMLU Medical Genetics, MMLU Professional Medicine, and PubMedQA.

Each model's performance was measured based on accuracy, reflecting how well it handled medical reasoning, clinical knowledge, and biomedical question answering.
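As a usage reference, here is a sketch of how one of the benchmarked models could be loaded and queried through the `MedicalLLM` annotator in Spark NLP for Healthcare; the parameter values and the sample question are illustrative assumptions, not part of this commit:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp_jsl.annotator import MedicalLLM

document_assembler = (
    DocumentAssembler().setInputCol("text").setOutputCol("document")
)

# jsl_medm_q8_v3 is one of the six quantized models benchmarked above.
medical_llm = (
    MedicalLLM.pretrained("jsl_medm_q8_v3", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("completions")
    .setBatchSize(1)
    .setNPredict(100)     # max tokens to generate (illustrative)
    .setTemperature(0.0)  # deterministic decoding, as used for benchmarking
)

pipeline = Pipeline(stages=[document_assembler, medical_llm])

# Assumes `spark` was started with sparknlp_jsl.start(...).
question = (
    "A 45-year-old man presents with crushing chest pain radiating to the "
    "left arm. What is the most likely diagnosis?"
)
df = spark.createDataFrame([(question,)], ["text"])
pipeline.fit(df).transform(df).select("completions.result").show(truncate=False)
```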

</div><div class="h3-box" markdown="1">


@@ -204,4 +229,4 @@ GPT4o demonstrates strength in Clinical Relevance, especially in Biomedical and
Neutral and "None" ratings across categories highlight areas for further optimization for both models.
This analysis underscores the strengths of JSL-MedM in producing concise and factual outputs, while GPT4o shows a stronger contextual understanding in certain specialized tasks.

</div>
