diff --git a/docs/_posts/Meryem1425/2025-01-16-clinical_deidentification_docwise_benchmark_en.md b/docs/_posts/Meryem1425/2025-01-16-clinical_deidentification_docwise_benchmark_en.md new file mode 100644 index 0000000000..304dc0feac --- /dev/null +++ b/docs/_posts/Meryem1425/2025-01-16-clinical_deidentification_docwise_benchmark_en.md @@ -0,0 +1,166 @@ +--- +layout: model +title: Clinical Deidentification Pipeline (Document Wise - Benchmark) +author: John Snow Labs +name: clinical_deidentification_docwise_benchmark +date: 2025-01-16 +tags: [licensed, en, deidentification, deid, pipeline, clinical, docwise, benchmark] +task: [De-identification, Pipeline Healthcare] +language: en +edition: Healthcare NLP 5.5.1 +spark_version: 3.4 +supported: true +annotator: PipelineModel +article_header: + type: cover +use_language_switcher: "Python-Scala-Java" +--- + +## Description + +This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `NAME`, `IDNUM`, `CONTACT`, `LOCATION`, `AGE`, `DATE` entities. +**This pipeline is prepared for benchmarking with cloud providers.** + +## Predicted Entities + +`NAME`, `IDNUM`, `CONTACT`, `LOCATION`, `AGE`, `DATE` + +{:.btn-box} + + +[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_docwise_benchmark_en_5.5.1_3.4_1737046494582.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} +[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_docwise_benchmark_en_5.5.1_3.4_1737046494582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} + +## How to use + + + +
+{% include programmingLanguageSelectScalaPythonNLU.html %} + +```python + +from sparknlp.pretrained import PretrainedPipeline + +deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. +Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +# fullAnnotate returns a list with one result dict per input text +print(''.join([i.result for i in deid_result[0]['mask_entity']])) +print(''.join([i.result for i in deid_result[0]['obfuscated']])) + +``` + +{:.jsl-block} +```python + +from johnsnowlabs import nlp + +deid_pipeline = nlp.PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. +Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +# fullAnnotate returns a list with one result dict per input text +print(''.join([i.result for i in deid_result[0]['mask_entity']])) +print(''.join([i.result for i in deid_result[0]['obfuscated']])) + +``` +```scala + +import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline + +val deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +val deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. 
+Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +println(deid_result("mask_entity").map(_.result).mkString("")) +println(deid_result("obfuscated").map(_.result).mkString("")) + +``` +
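As a rough illustration of what "masked with entity labels" means in the description above, the sketch below replaces detected character spans with their entity label in angle brackets. This is not the pipeline's implementation: `mask_with_labels`, the `Span` tuple layout, and the end-exclusive offsets are assumptions made for the example only.

```python
from typing import List, Tuple

# Hypothetical span layout for this sketch: (begin, end_exclusive, label).
Span = Tuple[int, int, str]


def mask_with_labels(text: str, spans: List[Span]) -> str:
    """Replace each non-overlapping span with its label in angle brackets."""
    pieces = []
    cursor = 0
    for begin, end, label in sorted(spans):
        pieces.append(text[cursor:begin])  # untouched text before the span
        pieces.append(f"<{label}>")        # entity label instead of the PHI
        cursor = end
    pieces.append(text[cursor:])           # trailing text after the last span
    return "".join(pieces)


sample = "Dr. John Green, ID: 1231511863"
spans = [(4, 14, "NAME"), (20, 30, "IDNUM")]
masked = mask_with_labels(sample, spans)
print(masked)
```

Obfuscation follows the same span-replacement idea, except that each span is swapped for a realistic surrogate value rather than a label.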
+ +## Results + +```bash + +Masked with entity labels +------------------------------ +Name : , Record date: , # . +Dr. , ID: , IP . +He is a male was admitted to the for cystectomy on . +Patient's VIN : , SSN , Driver's license . +Phone , , , E-MAIL: . + + +Obfuscated +------------------------------ +Name : Lawrnce Pretzel, Record date: 2093-01-24, # 486302. +Dr. Carolina Cid, ID: 5875955427, IP 089.708.009.79. +He is a 65-year-old male was admitted to the South Benjaminside for cystectomy on 01/24/93. +Patient's VIN : 0OZUO50MYTQ018397, SSN #888-11-3333, Driver's license YZ:Z881100W. +Phone (546) 920-7669, Traceyburgh, 1441 Eastlake Avenue, E-MAIL: UIEZD@OIMEH.KGI. + +``` + +{:.model-param} +## Model Information + +{:.table-model} +|---|---| +|Model Name:|clinical_deidentification_docwise_benchmark| +|Type:|pipeline| +|Compatibility:|Healthcare NLP 5.5.1+| +|License:|Licensed| +|Edition:|Official| +|Language:|en| +|Size:|2.5 GB| + +## Included Models + +- DocumentAssembler +- InternalDocumentSplitter +- TokenizerModel +- WordEmbeddingsModel +- MedicalNerModel +- NerConverterInternalModel +- MedicalNerModel +- MedicalNerModel +- MedicalNerModel +- NerConverterInternalModel +- NerConverterInternalModel +- NerConverterInternalModel +- PretrainedZeroShotNER +- NerConverterInternalModel +- MedicalNerModel +- NerConverterInternalModel +- ContextualEntityRuler +- ChunkMergeModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- TextMatcherInternalModel +- TextMatcherInternalModel +- ContextualParserModel +- RegexMatcherInternalModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- RegexMatcherInternalModel +- RegexMatcherInternalModel +- ChunkMergeModel +- ChunkMergeModel +- LightDeIdentification +- LightDeIdentification diff --git a/docs/_posts/akrztrk/2025-01-16-clinical_deidentification_docwise_benchmark_en.md 
b/docs/_posts/akrztrk/2025-01-16-clinical_deidentification_docwise_benchmark_en.md new file mode 100644 index 0000000000..535ed52229 --- /dev/null +++ b/docs/_posts/akrztrk/2025-01-16-clinical_deidentification_docwise_benchmark_en.md @@ -0,0 +1,166 @@ +--- +layout: model +title: Clinical Deidentification Pipeline (Document Wise - Benchmark) +author: John Snow Labs +name: clinical_deidentification_docwise_benchmark +date: 2025-01-16 +tags: [licensed, en, deidentification, deid, pipeline, clinical, docwise, benchmark] +task: [De-identification, Pipeline Healthcare] +language: en +edition: Healthcare NLP 5.5.1 +spark_version: 3.2 +supported: true +annotator: PipelineModel +article_header: + type: cover +use_language_switcher: "Python-Scala-Java" +--- + +## Description + +This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `NAME`, `IDNUM`, `CONTACT`, `LOCATION`, `AGE`, `DATE` entities. +**This pipeline is prepared for benchmarking with cloud providers.** + +## Predicted Entities + +`NAME`, `IDNUM`, `CONTACT`, `LOCATION`, `AGE`, `DATE` + +{:.btn-box} + + +[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_docwise_benchmark_en_5.5.1_3.2_1737048679338.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} +[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_docwise_benchmark_en_5.5.1_3.2_1737048679338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} + +## How to use + + + +
+{% include programmingLanguageSelectScalaPythonNLU.html %} + +```python + +from sparknlp.pretrained import PretrainedPipeline + +deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. +Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +# fullAnnotate returns a list with one result dict per input text +print(''.join([i.result for i in deid_result[0]['mask_entity']])) +print(''.join([i.result for i in deid_result[0]['obfuscated']])) + +``` + +{:.jsl-block} +```python + +from johnsnowlabs import nlp + +deid_pipeline = nlp.PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. +Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +# fullAnnotate returns a list with one result dict per input text +print(''.join([i.result for i in deid_result[0]['mask_entity']])) +print(''.join([i.result for i in deid_result[0]['obfuscated']])) + +``` +```scala + +import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline + +val deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +val deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. 
+Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +println(deid_result("mask_entity").map(_.result).mkString("")) +println(deid_result("obfuscated").map(_.result).mkString("")) + +``` +
+ +## Results + +```bash + +Masked with entity labels +------------------------------ +Name : , Record date: , # . +Dr. , ID: , IP . +He is a male was admitted to the for cystectomy on . +Patient's VIN : , SSN , Driver's license . +Phone , , , E-MAIL: . + + +Obfuscated +------------------------------ +Name : Laray Platt, Record date: 2093-02-17, # 264180. +Dr. Tedd Favorite, ID: 1431511083, IP 534.253.554.24. +He is a 71-year-old male was admitted to the 900 Hospital Drive for cystectomy on 02/17/93. +Patient's VIN : 7HSNH27FRMJ785064, SSN #999-22-4444, Driver's license RS:S114433P. +Phone (546) 920-7669, 830 Kempsville Road, 624 N Second, E-MAIL: AOKFJ@UOSKN.QMO. + +``` + +{:.model-param} +## Model Information + +{:.table-model} +|---|---| +|Model Name:|clinical_deidentification_docwise_benchmark| +|Type:|pipeline| +|Compatibility:|Healthcare NLP 5.5.1+| +|License:|Licensed| +|Edition:|Official| +|Language:|en| +|Size:|2.5 GB| + +## Included Models + +- DocumentAssembler +- InternalDocumentSplitter +- TokenizerModel +- WordEmbeddingsModel +- MedicalNerModel +- NerConverterInternalModel +- MedicalNerModel +- MedicalNerModel +- MedicalNerModel +- NerConverterInternalModel +- NerConverterInternalModel +- NerConverterInternalModel +- PretrainedZeroShotNER +- NerConverterInternalModel +- MedicalNerModel +- NerConverterInternalModel +- ContextualEntityRuler +- ChunkMergeModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- TextMatcherInternalModel +- TextMatcherInternalModel +- ContextualParserModel +- RegexMatcherInternalModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- RegexMatcherInternalModel +- RegexMatcherInternalModel +- ChunkMergeModel +- ChunkMergeModel +- LightDeIdentification +- LightDeIdentification diff --git a/docs/_posts/gokhanturer/2025-01-16-clinical_deidentification_docwise_benchmark_en.md 
b/docs/_posts/gokhanturer/2025-01-16-clinical_deidentification_docwise_benchmark_en.md new file mode 100644 index 0000000000..c4809c6523 --- /dev/null +++ b/docs/_posts/gokhanturer/2025-01-16-clinical_deidentification_docwise_benchmark_en.md @@ -0,0 +1,167 @@ +--- +layout: model +title: Clinical Deidentification Pipeline (Document Wise - Benchmark) +author: John Snow Labs +name: clinical_deidentification_docwise_benchmark +date: 2025-01-16 +tags: [licensed, en, deidentification, deid, pipeline, clinical, docwise, benchmark] +task: [De-identification, Pipeline Healthcare] +language: en +edition: Healthcare NLP 5.5.1 +spark_version: 3.0 +supported: true +annotator: PipelineModel +article_header: + type: cover +use_language_switcher: "Python-Scala-Java" +--- + +## Description + +This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `NAME`, `IDNUM`, `CONTACT`, `LOCATION`, `AGE`, `DATE` entities. +**This pipeline is prepared for benchmarking with cloud providers.** + +## Predicted Entities + +`NAME`, `IDNUM`, `CONTACT`, `LOCATION`, `AGE`, `DATE` + +{:.btn-box} + + +[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_docwise_benchmark_en_5.5.1_3.0_1737051714368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} +[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_docwise_benchmark_en_5.5.1_3.0_1737051714368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} + +## How to use + + + +
+{% include programmingLanguageSelectScalaPythonNLU.html %} + +```python + +from sparknlp.pretrained import PretrainedPipeline + +deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. +Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +# fullAnnotate returns a list with one result dict per input text +print(''.join([i.result for i in deid_result[0]['mask_entity']])) +print(''.join([i.result for i in deid_result[0]['obfuscated']])) + +``` + +{:.jsl-block} +```python + +from johnsnowlabs import nlp + +deid_pipeline = nlp.PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. +Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +# fullAnnotate returns a list with one result dict per input text +print(''.join([i.result for i in deid_result[0]['mask_entity']])) +print(''.join([i.result for i in deid_result[0]['obfuscated']])) + +``` +```scala + +import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline + +val deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models") + +val deid_result = deid_pipeline.fullAnnotate("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. +Dr. John Green, ID: 1231511863, IP 203.120.223.13. +He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. +Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. 
+Phone (302) 786-5227, Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") + +println(deid_result("mask_entity").map(_.result).mkString("")) +println(deid_result("obfuscated").map(_.result).mkString("")) + +``` +
+ +## Results + +```bash + +Masked with entity labels +------------------------------ +Name : , Record date: , # . +Dr. , ID: , IP . +He is a male was admitted to the for cystectomy on . +Patient's VIN : , SSN , Driver's license . +Phone , , , E-MAIL: . + + +Obfuscated +------------------------------ +Name : Luberta Ruse, Record date: 2093-02-15, # 264180. +Dr. Brannon Calamity, ID: 7097177649, IP 867.586.887.57. +He is a 64-year-old male was admitted to the 1316 E Seventh St for cystectomy on 02/15/93. +Patient's VIN : 0EPKE50COJG018397, SSN #777-00-2222, Driver's license YZ:Z881100W. +Phone (768) 142-9881, Anthonyland, 100 Kenyon Ave, E-MAIL: YMIDH@SMQIL.OKM. + + +``` + +{:.model-param} +## Model Information + +{:.table-model} +|---|---| +|Model Name:|clinical_deidentification_docwise_benchmark| +|Type:|pipeline| +|Compatibility:|Healthcare NLP 5.5.1+| +|License:|Licensed| +|Edition:|Official| +|Language:|en| +|Size:|2.5 GB| + +## Included Models + +- DocumentAssembler +- InternalDocumentSplitter +- TokenizerModel +- WordEmbeddingsModel +- MedicalNerModel +- NerConverterInternalModel +- MedicalNerModel +- MedicalNerModel +- MedicalNerModel +- NerConverterInternalModel +- NerConverterInternalModel +- NerConverterInternalModel +- PretrainedZeroShotNER +- NerConverterInternalModel +- MedicalNerModel +- NerConverterInternalModel +- ContextualEntityRuler +- ChunkMergeModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- TextMatcherInternalModel +- TextMatcherInternalModel +- ContextualParserModel +- RegexMatcherInternalModel +- ContextualParserModel +- ContextualParserModel +- ContextualParserModel +- RegexMatcherInternalModel +- RegexMatcherInternalModel +- ChunkMergeModel +- ChunkMergeModel +- LightDeIdentification +- LightDeIdentification diff --git a/docs/_posts/gpirge/2025-01-16-assertion_genomic_abnormality_wip_en.md 
b/docs/_posts/gpirge/2025-01-16-assertion_genomic_abnormality_wip_en.md new file mode 100644 index 0000000000..29deee64d6 --- /dev/null +++ b/docs/_posts/gpirge/2025-01-16-assertion_genomic_abnormality_wip_en.md @@ -0,0 +1,250 @@ +--- +layout: model +title: "Genomic Assertion Status Model: Classifying Normal, Affected, and Variant Entities" +author: John Snow Labs +name: assertion_genomic_abnormality_wip +date: 2025-01-16 +tags: [en, clinical, licensed, assertion, gene, normal, variant, affected] +task: Assertion Status +language: en +edition: Healthcare NLP 5.5.1 +spark_version: 3.0 +supported: true +annotator: AssertionDLModel +article_header: + type: cover +use_language_switcher: "Python-Scala-Java" +--- + +## Description + +This assertion status detection model is trained to classify entities (Gene and MPG) extracted by the NER model `ner_genes_phenotypes` into three categories: + +`Normal`, for genes and molecules part of normal physiology; + +`Affected`, for molecules or proteins impacted by genetic mutations; + +`Variant`, for genes that are abnormal or of a variant type, enabling precise characterization of genomic and molecular states. + +## Predicted Entities + +`Normal`, `Affected`, `Variant` + +{:.btn-box} + + +[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_genomic_abnormality_wip_en_5.5.1_3.0_1737034731887.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} +[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_genomic_abnormality_wip_en_5.5.1_3.0_1737034731887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} + +## How to use + + + +
+{% include programmingLanguageSelectScalaPythonNLU.html %} +```python +from sparknlp.base import DocumentAssembler +from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel +from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal, AssertionDLModel +from pyspark.ml import Pipeline +from pyspark.sql.types import StringType + +document_assembler = DocumentAssembler()\ + .setInputCol("text")\ + .setOutputCol("document") + +sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ + .setInputCols(["document"])\ + .setOutputCol("sentence") + +tokenizer = Tokenizer()\ + .setInputCols(["sentence"])\ + .setOutputCol("token") + +clinical_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\ + .setInputCols(["sentence", "token"])\ + .setOutputCol("embeddings") + +ner_model = MedicalNerModel.pretrained('ner_genes_phenotypes', "en", "clinical/models")\ + .setInputCols(["sentence", "token","embeddings"])\ + .setOutputCol("ner") + +ner_converter = NerConverterInternal()\ + .setInputCols(['sentence', 'token', 'ner'])\ + .setOutputCol('ner_chunk')\ + .setWhiteList(['Gene', 'MPG']) + +assertion = AssertionDLModel.pretrained("assertion_genomic_abnormality_wip", "en", "clinical/models")\ + .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ + .setOutputCol("assertion") + +pipeline = Pipeline(stages=[ + document_assembler, + sentence_detector, + tokenizer, + clinical_embeddings, + ner_model, + ner_converter, + assertion + ]) + +sample_texts = [""" +The ATP7B gene provides instructions for a copper-transporting ATPase essential for copper homeostasis. Mutations in the ATP7B gene cause Wilson disease, an autosomal recessive disorder of copper metabolism. +Over 500 mutations have been identified, including missense, nonsense, and splice site mutations. The variant ATP7B protein leads to impaired copper excretion and accumulation in various organs, particularly the liver and brain. +Clinical presentations of Wilson disease include hepatic dysfunction, neurological symptoms (e.g., tremors, dystonia), and psychiatric disturbances. +Kayser-Fleischer rings, copper deposits in the cornea, are a characteristic sign. 
Gene-environment interactions are significant, with dietary copper intake and other environmental factors influencing disease progression. +Diagnosis involves a combination of clinical symptoms, low serum ceruloplasmin, high urinary copper, and genetic testing. +Treatment focuses on reducing copper accumulation through chelation therapy with drugs like penicillamine or trientine, and zinc supplementation to block copper absorption. +Liver transplantation may be necessary in severe cases. The worldwide prevalence of Wilson disease is estimated at 1 in 30,000, with higher rates in certain isolated populations. +"""] + +data = spark.createDataFrame(sample_texts, StringType()).toDF("text") + +result = pipeline.fit(data).transform(data) +``` + +{:.jsl-block} +```python +from johnsnowlabs import nlp, medical +from pyspark.sql.types import StringType + +document_assembler = nlp.DocumentAssembler()\ + .setInputCol("text")\ + .setOutputCol("document") + +sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ + .setInputCols(["document"])\ + .setOutputCol("sentence") + +tokenizer = nlp.Tokenizer()\ + .setInputCols(["sentence"])\ + .setOutputCol("token") + +clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\ + .setInputCols(["sentence", "token"])\ + .setOutputCol("embeddings") + +ner_model = medical.NerModel.pretrained('ner_genes_phenotypes', "en", "clinical/models")\ + .setInputCols(["sentence", "token","embeddings"])\ + .setOutputCol("ner") + +ner_converter = medical.NerConverterInternal()\ + .setInputCols(['sentence', 'token', 'ner'])\ + .setOutputCol('ner_chunk')\ + .setWhiteList(['Gene', 'MPG']) + +assertion = medical.AssertionDLModel.pretrained("assertion_genomic_abnormality_wip", "en", "clinical/models")\ + .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ + .setOutputCol("assertion") + +pipeline = nlp.Pipeline(stages=[ + document_assembler, + sentence_detector, + tokenizer, + clinical_embeddings, + ner_model, + ner_converter, + assertion + ]) + 
+sample_texts = [""" +The ATP7B gene provides instructions for a copper-transporting ATPase essential for copper homeostasis. Mutations in the ATP7B gene cause Wilson disease, an autosomal recessive disorder of copper metabolism. +Over 500 mutations have been identified, including missense, nonsense, and splice site mutations. The variant ATP7B protein leads to impaired copper excretion and accumulation in various organs, particularly the liver and brain. +Clinical presentations of Wilson disease include hepatic dysfunction, neurological symptoms (e.g., tremors, dystonia), and psychiatric disturbances. +Kayser-Fleischer rings, copper deposits in the cornea, are a characteristic sign. Gene-environment interactions are significant, with dietary copper intake and other environmental factors influencing disease progression. +Diagnosis involves a combination of clinical symptoms, low serum ceruloplasmin, high urinary copper, and genetic testing. +Treatment focuses on reducing copper accumulation through chelation therapy with drugs like penicillamine or trientine, and zinc supplementation to block copper absorption. +Liver transplantation may be necessary in severe cases. The worldwide prevalence of Wilson disease is estimated at 1 in 30,000, with higher rates in certain isolated populations. 
+"""] + +data = spark.createDataFrame(sample_texts, StringType()).toDF("text") + +result = pipeline.fit(data).transform(data) +``` +```scala +import spark.implicits._ // required for Seq(...).toDF + +val document_assembler = new DocumentAssembler() + .setInputCol("text") + .setOutputCol("document") + +val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","en","clinical/models") + .setInputCols("document") + .setOutputCol("sentence") + +val tokenizer = new Tokenizer() + .setInputCols("sentence") + .setOutputCol("token") + +val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") + .setInputCols(Array("sentence", "token")) + .setOutputCol("embeddings") + +val ner_model = MedicalNerModel.pretrained("ner_genes_phenotypes", "en", "clinical/models") + .setInputCols(Array("sentence", "token","embeddings")) + .setOutputCol("ner") + +val ner_converter = new NerConverterInternal() + .setInputCols(Array("sentence", "token", "ner")) + .setOutputCol("ner_chunk") + .setWhiteList(Array("Gene", "MPG")) + +val assertion = AssertionDLModel.pretrained("assertion_genomic_abnormality_wip", "en", "clinical/models") + .setInputCols(Array("sentence", "ner_chunk", "embeddings")) + .setOutputCol("assertion") + +val pipeline = new Pipeline().setStages(Array( + document_assembler, + sentenceDetector, + tokenizer, + clinical_embeddings, + ner_model, + ner_converter, + assertion +)) + +val sample_texts = Seq("""The ATP7B gene provides instructions for a copper-transporting ATPase essential for copper homeostasis. Mutations in the ATP7B gene cause Wilson disease, an autosomal recessive disorder of copper metabolism. +Over 500 mutations have been identified, including missense, nonsense, and splice site mutations. The variant ATP7B protein leads to impaired copper excretion and accumulation in various organs, particularly the liver and brain. 
+Clinical presentations of Wilson disease include hepatic dysfunction, neurological symptoms (e.g., tremors, dystonia), and psychiatric disturbances. +Kayser-Fleischer rings, copper deposits in the cornea, are a characteristic sign. Gene-environment interactions are significant, with dietary copper intake and other environmental factors influencing disease progression. +Diagnosis involves a combination of clinical symptoms, low serum ceruloplasmin, high urinary copper, and genetic testing. +Treatment focuses on reducing copper accumulation through chelation therapy with drugs like penicillamine or trientine, and zinc supplementation to block copper absorption. +Liver transplantation may be necessary in severe cases. The worldwide prevalence of Wilson disease is estimated at 1 in 30,000, with higher rates in certain isolated populations. +""").toDF("text") + +val result = pipeline.fit(sample_texts).transform(sample_texts) +``` +
+ +## Results + +```bash ++-------------+-----+---+---------+---------+----------+ +|chunk |begin|end|ner_label|assertion|confidence| ++-------------+-----+---+---------+---------+----------+ +|ATP7B gene |5 |14 |MPG |Normal |0.9835 | +|ATPase |64 |69 |MPG |Normal |0.9979 | +|ATP7B gene |122 |131|MPG |Affected |0.9974 | +|ATP7B protein|319 |331|MPG |Affected |0.9713 | +|ceruloplasmin|873 |885|MPG |Affected |0.9707 | ++-------------+-----+---+---------+---------+----------+ +``` + +{:.model-param} +## Model Information + +{:.table-model} +|---|---| +|Model Name:|assertion_genomic_abnormality_wip| +|Compatibility:|Healthcare NLP 5.5.1+| +|License:|Licensed| +|Edition:|Official| +|Input Labels:|[document, ner_chunk, embeddings]| +|Output Labels:|[assertion_pred]| +|Language:|en| +|Size:|944.2 KB| + +## References + +In-house annotated case reports. + +## Benchmarking + +```bash + label precision recall f1-score support + Affected 0.84 0.82 0.83 342 + Normal 0.82 0.86 0.84 315 + Variant 0.88 0.84 0.86 94 + accuracy - - 0.84 751 + macro-avg 0.85 0.84 0.84 751 +weighted-avg 0.84 0.84 0.84 751 +```
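
The averaged rows of the benchmarking table can be sanity-checked from the per-class rows. The sketch below recomputes the macro and support-weighted averages from the rounded figures in the table; the names `per_class`, `macro_avg`, and `weighted_avg` are illustrative and not part of any library. Because the inputs are already rounded to two decimals, small differences from a report computed on raw predictions are possible in general.

```python
# Per-class scores copied from the benchmarking table above.
per_class = {
    "Affected": {"precision": 0.84, "recall": 0.82, "f1": 0.83, "support": 342},
    "Normal":   {"precision": 0.82, "recall": 0.86, "f1": 0.84, "support": 315},
    "Variant":  {"precision": 0.88, "recall": 0.84, "f1": 0.86, "support": 94},
}
METRICS = ("precision", "recall", "f1")


def macro_avg(metric: str) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in per_class.values()) / len(per_class)


def weighted_avg(metric: str) -> float:
    """Mean over classes weighted by each class's support."""
    total = sum(row["support"] for row in per_class.values())
    return sum(row[metric] * row["support"] for row in per_class.values()) / total


total_support = sum(row["support"] for row in per_class.values())
macro = {m: round(macro_avg(m), 2) for m in METRICS}
weighted = {m: round(weighted_avg(m), 2) for m in METRICS}

print("support:", total_support)
print("macro-avg:", macro)
print("weighted-avg:", weighted)
```

The macro average treats the small `Variant` class (94 examples) the same as the larger classes, which is why macro precision comes out slightly above the weighted figure.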