From d2281bbc04b1d26d5055c1d8967c6ac33938adfe Mon Sep 17 00:00:00 2001
From: Akar <67700732+akrztrk@users.noreply.github.com>
Date: Thu, 19 Dec 2024 20:34:25 +0100
Subject: [PATCH] upload new benchmark table (#1669)

* upload new benchmark table
* upload llm benchmark table
* update typo
* Update benchmark.md
---
 docs/en/benchmark.md     | 28 ++++++++++++++++++++++++++++
 docs/en/benchmark_llm.md | 27 ++++++++++++++++++++++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/docs/en/benchmark.md b/docs/en/benchmark.md
index 0a8ef32720..51effc2445 100644
--- a/docs/en/benchmark.md
+++ b/docs/en/benchmark.md
@@ -659,6 +659,34 @@ deid_pipeline = Pipeline().setStages([

PS: Pipelines with the same stages can still have different costs because of the number of layers in their NER models and the hardcoded regexes in Deidentification.

- ZeroShot Deidentification Pipelines Speed Comparison (a loading sketch follows the CPU table below)

  - **[clinical_deidentification](https://nlp.johnsnowlabs.com/2024/03/27/clinical_deidentification_en.html)** 2 NER, 1 clinical embedding, 13 Rule-based NER, 3 chunk merger, 1 Deidentification

  - **[clinical_deidentification_zeroshot_medium](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_medium_en.html)** 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

  - **[clinical_deidentification_docwise_medium_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_medium_wip_en.html)** 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

  - **[clinical_deidentification_zeroshot_large](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_large_en.html)** 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

  - **[clinical_deidentification_docwise_large_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_large_wip_en.html)** 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

- CPU Testing:

{:.table-model-big.db}

| partition | clinical_deidentification | clinical_deidentification<br>zeroshot_medium | clinical_deidentification<br>docwise_medium_wip | clinical_deidentification<br>zeroshot_large | clinical_deidentification<br>docwise_large_wip |
|-----------|---------------------------|----------------------------------------------|-------------------------------------------------|---------------------------------------------|------------------------------------------------|
| 4         | 295.8                     | 520.8                                        | 862.7                                           | 1537.9                                      | 1832.4                                         |
| 8         | 195.0                     | 345.6                                        | 577.0                                           | 1013.9                                      | 1228.3                                         |
| 16        | 133.3                     | 227.2                                        | 401.8                                           | 666.2                                       | 835.2                                          |
| 32        | 109.5                     | 160.9                                        | 305.3                                           | 456.9                                       | 614.7                                          |
| 64        | 92.0                      | 166.8                                        | 291.5                                           | 465.0                                       | 584.9                                          |
| 100       | 79.3                      | 174.1                                        | 274.8                                           | 495.3                                       | 587.8                                          |
| 1000      | 56.3                      | 181.4                                        | 270.7                                           | 502.4                                       | 556.4                                          |
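Below is a minimal, hedged sketch (not part of the patch itself) of how one of the pipelines compared above can be loaded and applied. It assumes a licensed Spark NLP for Healthcare session already bound to `spark`; the sample text, the `text` column name, and the 32-partition choice are illustrative assumptions.

```python
# Minimal sketch: load a pretrained deidentification pipeline and run it.
# Assumes a licensed Spark NLP for Healthcare session is available as `spark`.
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline(
    "clinical_deidentification_zeroshot_medium", "en", "clinical/models"
)

# Quick single-document check on a plain string.
result = deid_pipeline.annotate(
    "Dr. John Lee examined Maria Garcia at Bethlehem Hospital on 2024-01-15."
)

# For throughput runs like the table above, set the partition count before
# transform(); per that table, the zero-shot pipelines stop gaining much
# beyond roughly 32 partitions, while clinical_deidentification keeps scaling.
df = spark.createDataFrame(
    [("Record for Maria Garcia, seen on 2024-01-15.",)], ["text"]
).repartition(32)
deidentified_df = deid_pipeline.model.transform(df)
```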
### Deidentification Pipelines Cost Benchmarks

diff --git a/docs/en/benchmark_llm.md b/docs/en/benchmark_llm.md
index 05fdc17a99..817532dcd9 100644
--- a/docs/en/benchmark_llm.md
+++ b/docs/en/benchmark_llm.md
@@ -10,6 +10,31 @@ show_nav: true
sidebar:
  nav: sparknlp-healthcare
---

## Medical Benchmarks

### Benchmarking

{:.table-model-big.db}

| Model           | Average | MedMCQA | MedQA  | MMLU<br>anatomy | MMLU<br>clinical<br>knowledge | MMLU<br>college<br>biology | MMLU<br>college<br>medicine | MMLU<br>medical<br>genetics | MMLU<br>professional<br>medicine | PubMedQA |
|-----------------|---------|---------|--------|-----------------|-------------------------------|----------------------------|-----------------------------|-----------------------------|----------------------------------|----------|
| jsl_medm_q4_v3  | 0.6884  | 0.6421  | 0.6889 | 0.7333          | 0.834                         | 0.8681                     | 0.7514                      | 0.9                         | 0.8493                           | 0.782    |
| jsl_medm_q8_v3  | 0.6947  | 0.6416  | 0.707  | 0.7556          | 0.8377                        | 0.9097                     | 0.7688                      | 0.9                         | 0.8713                           | 0.79     |
| jsl_medm_q16_v3 | 0.6964  | 0.6436  | 0.7117 | 0.7481          | 0.8453                        | 0.9028                     | 0.7688                      | 0.87                        | 0.8676                           | 0.794    |
| jsl_meds_q4_v3  | 0.5522  | 0.5104  | 0.48   | 0.6444          | 0.7472                        | 0.8333                     | 0.6532                      | 0.68                        | 0.6691                           | 0.752    |
| jsl_meds_q8_v3  | 0.5727  | 0.53    | 0.4933 | 0.6593          | 0.7623                        | 0.8681                     | 0.6301                      | 0.76                        | 0.7647                           | 0.762    |
| jsl_meds_q16_v3 | 0.5793  | 0.5482  | 0.4839 | 0.637           | 0.7585                        | 0.8403                     | 0.6532                      | 0.77                        | 0.7022                           | 0.766    |
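As a usage sketch (again not part of the patch), the models in this table can be pulled and queried through the `LLMLoader` interface that recent `sparknlp_jsl` releases provide; the prompt below is illustrative, and treating the q4/q8/q16 suffixes as quantization variants is an assumption.

```python
# Hedged sketch: load one of the benchmarked medical LLMs and query it.
# Assumes a sparknlp_jsl version that ships LLMLoader and a licensed `spark`.
from sparknlp_jsl.llm import LLMLoader

medm = LLMLoader(spark).pretrained("jsl_medm_q8_v3", "en", "clinical/models")

# The accuracies above come from multiple-choice questions of this shape.
prompt = (
    "Answer with the letter of the correct option only.\n"
    "Which vitamin deficiency causes scurvy?\n"
    "A) Vitamin A  B) Vitamin B12  C) Vitamin C  D) Vitamin D"
)
print(medm.generate(prompt))
```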
### Benchmark Summary

We evaluated six John Snow Labs LLMs across nine task categories: MedMCQA, MedQA, MMLU Anatomy, MMLU Clinical Knowledge, MMLU College Biology, MMLU College Medicine, MMLU Medical Genetics, MMLU Professional Medicine, and PubMedQA.

Each model's performance was measured by accuracy, reflecting how well it handled medical reasoning, clinical knowledge, and biomedical question answering.
@@ -204,4 +229,4 @@ GPT4o demonstrates strength in Clinical Relevance, especially in Biomedical and
Neutral and "None" ratings across categories highlight areas for further optimization for both models.

This analysis underscores the strengths of JSL-MedM in producing concise and factual outputs, while GPT4o shows a stronger contextual understanding in certain specialized tasks.
-</div>
\ No newline at end of file
+</div>