upload new benchmark table (#1669)
* upload new benchmark table

* upload llm benchmark table

* update typo

* Update benchmark.md
akrztrk authored Dec 19, 2024
1 parent c873925 commit d2281bb
Showing 2 changed files with 54 additions and 1 deletion.
28 changes: 28 additions & 0 deletions docs/en/benchmark.md
@@ -659,6 +659,34 @@ deid_pipeline = Pipeline().setStages([

PS: Pipelines with the same stage types can still have different costs because their NER models differ in the number of layers and because of the hardcoded regexes in Deidentification.


- ZeroShot Deidentification Pipelines Speed Comparison

- **[clinical_deidentification](https://nlp.johnsnowlabs.com/2024/03/27/clinical_deidentification_en.html)** 2 NER, 1 clinical embedding, 13 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_medium](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_medium_en.html)** 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_medium_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_medium_wip_en.html)** 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_large](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_large_en.html)** 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_large_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_large_wip_en.html)** 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

- CPU Testing:

{:.table-model-big.db}

| partition | clinical deidentification | clinical deidentification <br> zeroshot_medium | clinical deidentification <br> docwise_medium_wip | clinical deidentification <br> zeroshot_large | clinical deidentification <br> docwise_large_wip |
|-----------|---------------------------|-------------------------------------------|----------------------------------------------|------------------------------------------|---------------------------------------------|
| 4 | 295.8 | 520.8 | 862.7 | 1537.9 | 1832.4 |
| 8 | 195.0 | 345.6 | 577.0 | 1013.9 | 1228.3 |
| 16 | 133.3 | 227.2 | 401.8 | 666.2 | 835.2 |
| 32 | 109.5 | 160.9 | 305.3 | 456.9 | 614.7 |
| 64 | 92.0 | 166.8 | 291.5 | 465.0 | 584.9 |
| 100 | 79.3 | 174.1 | 274.8 | 495.3 | 587.8 |
| 1000 | 56.3 | 181.4 | 270.7 | 502.4 | 556.4 |
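The `partition` column is the number of Spark partitions the input DataFrame was split into before each run. As a usage reference, here is a minimal sketch of such a timing run; it assumes a licensed Spark NLP for Healthcare session, and the sample text, dataset size, and `noop` sink are illustrative, not taken from the benchmark itself:

```python
import time

import sparknlp_jsl
from sparknlp.pretrained import PretrainedPipeline

# Assumes a valid license secret is available (illustrative variable name).
spark = sparknlp_jsl.start(license_key)

# Load one of the pretrained deidentification pipelines compared above.
deid_pipeline = PretrainedPipeline(
    "clinical_deidentification_zeroshot_medium", "en", "clinical/models"
)

# Illustrative input; the benchmark used a real clinical dataset.
df = spark.createDataFrame(
    [("Dr. John Lee saw Mary Smith on 2024-01-15 at Boston General.",)] * 1000,
    ["text"],
)

# Repartition to each partition count in the table, then time a full pass.
for n in [4, 8, 16, 32, 64, 100, 1000]:
    start = time.time()
    deid_pipeline.transform(df.repartition(n)) \
        .write.format("noop").mode("overwrite").save()
    print(f"{n:>5} partitions: {time.time() - start:.1f}s")
```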


</div><div class="h3-box" markdown="1">

### Deidentification Pipelines Cost Benchmarks
27 changes: 26 additions & 1 deletion docs/en/benchmark_llm.md
@@ -10,6 +10,31 @@ show_nav: true
sidebar:
nav: sparknlp-healthcare
---
<div class="h3-box" markdown="1">

## Medical Benchmarks

### Benchmarking

{:.table-model-big.db}

| Model | Average | MedMCQA | MedQA | MMLU <br>anatomy | MMLU<br>clinical<br>knowledge | MMLU<br>college<br>biology | MMLU<br>college<br>medicine | MMLU<br>medical<br>genetics | MMLU<br>professional<br>medicine | PubMedQA |
|-----------------|---------|---------|--------|------------------|-------------------------------|----------------------------|------------------------------|------------------------------|-----------------------------------|----------|
| jsl_medm_q4_v3 | 0.6884 | 0.6421 | 0.6889 | 0.7333 | 0.834 | 0.8681 | 0.7514 | 0.9 | 0.8493 | 0.782 |
| jsl_medm_q8_v3 | 0.6947 | 0.6416 | 0.707 | 0.7556 | 0.8377 | 0.9097 | 0.7688 | 0.9 | 0.8713 | 0.79 |
| jsl_medm_q16_v3 | 0.6964 | 0.6436 | 0.7117 | 0.7481 | 0.8453 | 0.9028 | 0.7688 | 0.87 | 0.8676 | 0.794 |
| jsl_meds_q4_v3 | 0.5522 | 0.5104 | 0.48 | 0.6444 | 0.7472 | 0.8333 | 0.6532 | 0.68 | 0.6691 | 0.752 |
| jsl_meds_q8_v3 | 0.5727 | 0.53 | 0.4933 | 0.6593 | 0.7623 | 0.8681 | 0.6301 | 0.76 | 0.7647 | 0.762 |
| jsl_meds_q16_v3 | 0.5793 | 0.5482 | 0.4839 | 0.637 | 0.7585 | 0.8403 | 0.6532 | 0.77 | 0.7022 | 0.766 |
</div><div class="h3-box" markdown="1">

### Benchmark Summary

We evaluated six John Snow Labs LLMs across nine task categories: MedMCQA, MedQA, MMLU Anatomy, MMLU Clinical Knowledge, MMLU College Biology, MMLU College Medicine, MMLU Medical Genetics, MMLU Professional Medicine, and PubMedQA.

Each model's performance was measured based on accuracy, reflecting how well it handled medical reasoning, clinical knowledge, and biomedical question answering.
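As a usage reference, here is a sketch of how one of the benchmarked models could be loaded and queried through the `MedicalLLM` annotator in Spark NLP for Healthcare; the parameter values and the sample question are illustrative assumptions, not part of this commit:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp_jsl.annotator import MedicalLLM

document_assembler = (
    DocumentAssembler().setInputCol("text").setOutputCol("document")
)

# jsl_medm_q8_v3 is one of the six quantized models benchmarked above.
medical_llm = (
    MedicalLLM.pretrained("jsl_medm_q8_v3", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("completions")
    .setBatchSize(1)
    .setNPredict(100)     # max tokens to generate (illustrative)
    .setTemperature(0.0)  # deterministic decoding, as used for benchmarking
)

pipeline = Pipeline(stages=[document_assembler, medical_llm])

# Assumes `spark` was started with sparknlp_jsl.start(...).
question = (
    "A 45-year-old man presents with crushing chest pain radiating to the "
    "left arm. What is the most likely diagnosis?"
)
df = spark.createDataFrame([(question,)], ["text"])
pipeline.fit(df).transform(df).select("completions.result").show(truncate=False)
```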

</div><div class="h3-box" markdown="1">


@@ -204,4 +229,4 @@ GPT4o demonstrates strength in Clinical Relevance, especially in Biomedical and
Neutral and "None" ratings across categories highlight areas for further optimization for both models.
This analysis underscores the strengths of JSL-MedM in producing concise and factual outputs, while GPT4o shows a stronger contextual understanding in certain specialized tasks.

</div>
