deploy: 47fa80b

JohnSnowLabs · Jan 20, 2025 · fdcaf04
1 parent 0951332
commit fdcaf04

Showing 9 changed files with 1,228 additions and 1,188 deletions.
diff --git a/docs/en/licensed_annotators.html b/docs/en/licensed_annotators.html
@@ -2289,6 +2289,15 @@ <h2 id="assertiondl">AssertionDL</h2>
       <li>
         <p><code class="language-plaintext highlighter-rouge">datasetInfo</code> <em>(Str)</em>: Descriptive information about the dataset being used.</p>
       </li>
+      <li>
+        <p><code class="language-plaintext highlighter-rouge">blackList</code> <em>(list[str])</em>: If defined, list of entities to ignore. The rest will be processed.</p>
+      </li>
+      <li>
+        <p><code class="language-plaintext highlighter-rouge">whiteList</code> <em>(list[str])</em>: If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix in labels.</p>
+      </li>
+      <li>
+        <p><code class="language-plaintext highlighter-rouge">caseSensitive</code> <em>(Bool)</em>: Determines whether the definitions of the white listed and black listed entities are case sensitive. Default: True.</p>
+      </li>
     </ul>
 
     <p>For pretrained models please see the
@@ -4981,9 +4990,15 @@ <h2 id="bertsentencechunkembeddings">BertSentenceChunkEmbeddings</h2>
       <li>
         <p><code class="language-plaintext highlighter-rouge">caseSensitive</code>: Determines whether the definitions of the white listed entities are case sensitive.</p>
       </li>
+      <li>
+        <p><code class="language-plaintext highlighter-rouge">strategy</code>: Strategy for computing embeddings. Supported strategies are: <code class="language-plaintext highlighter-rouge">sentence_average</code>, <code class="language-plaintext highlighter-rouge">scope_average</code>, <code class="language-plaintext highlighter-rouge">chunk_only</code>, <code class="language-plaintext highlighter-rouge">scope_only</code>. The default is <code class="language-plaintext highlighter-rouge">sentence_average</code>.</p>
+      </li>
+      <li>
+        <p><code class="language-plaintext highlighter-rouge">scopeWindow</code>: Scope window used to calculate scope embeddings. The scope window is defined by two non-negative integers: the first is the number of tokens before the chunk and the second is the number of tokens after the chunk. The default is [0, 0], which means only the chunk embeddings are used.</p>
+      </li>
     </ul>
 
-    <p>All the parameters can be set using the corresponding set method in camel case. For example, <code class="language-plaintext highlighter-rouge">.setInputcols()</code>.</p>
+    <p>All the parameters can be set using the corresponding set method in camel case. For example, <code class="language-plaintext highlighter-rouge">.setInputCols()</code>.</p>
 
     <blockquote>
       <p>For more information and examples of <code class="language-plaintext highlighter-rouge">BertSentenceChunkEmbeddings</code> annotator, you can check the <a href="https://github.com/JohnSnowLabs/spark-nlp-workshop">Spark NLP Workshop</a>, and in particular, the notebook <a href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb">24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb</a>.</p>
@@ -10128,6 +10143,8 @@ <h2 id="contextualentityfilterer">ContextualEntityFilterer</h2>
           <li><code class="language-plaintext highlighter-rouge">blackListWords</code>: The black list of words. If a word from this list appears within the scope window, the chunk will be filtered out.</li>
           <li><code class="language-plaintext highlighter-rouge">whiteListWords</code>: The white list of words. If a word from this list appears within the scope window, the chunk will be kept.</li>
           <li><code class="language-plaintext highlighter-rouge">confidenceThreshold</code>: The confidence threshold to filter the chunks. Filtering is only applied if the confidence of the chunk is below the threshold.</li>
+          <li><code class="language-plaintext highlighter-rouge">possibleRegexContext</code>: A regex describing a possible context, used to filter the chunks. If the regex is found in the context (chunk), the chunk is kept.</li>
+          <li><code class="language-plaintext highlighter-rouge">impossibleRegexContext</code>: A regex describing an impossible context, used to filter the chunks. If the regex is found in the context (chunk), the chunk is removed.</li>
         </ul>
       </li>
     </ul>
@@ -10339,11 +10356,11 @@ <h2 id="contextualentityruler">ContextualEntityRuler</h2>
     <p>Parameters:</p>
 
     <ul>
-      <li><code class="language-plaintext highlighter-rouge">setCaseSensitive</code>: Whether to perform case-sensitive matching. Default is False.</li>
-      <li><code class="language-plaintext highlighter-rouge">setAllowPunctuationInBetween</code>: Whether to allow punctuation between prefix/suffix patterns and the entity. Default is True.</li>
-      <li><code class="language-plaintext highlighter-rouge">setDropEmptyChunks</code>: If True, removes chunks with empty content after applying rules. Default is False.</li>
-      <li><code class="language-plaintext highlighter-rouge">setCaseSensitive</code>: If True, it is case sensitive while checking the context. Default is False.</li>
-      <li><code class="language-plaintext highlighter-rouge">setMergeOverlapping</code>: If False, it returns both modified entities and the original entities at the same time. Default is True.</li>
+      <li><code class="language-plaintext highlighter-rouge">caseSensitive</code>: Whether to perform case-sensitive matching. Default is False.</li>
+      <li><code class="language-plaintext highlighter-rouge">allowPunctuationInBetween</code>: Whether to allow punctuation between prefix/suffix patterns and the entity. Default is True.</li>
+      <li><code class="language-plaintext highlighter-rouge">allowTokensInBetween</code>: Whether to allow tokens between prefix/suffix patterns and the entity. Default is False.</li>
+      <li><code class="language-plaintext highlighter-rouge">dropEmptyChunks</code>: If True, removes chunks with empty content after applying rules. Default is False.</li>
+      <li><code class="language-plaintext highlighter-rouge">mergeOverlapping</code>: If False, it returns both modified entities and the original entities at the same time. Default is True.</li>
       <li><code class="language-plaintext highlighter-rouge">rules</code>: The updating rules. Each rule is a dictionary with the following keys:
         <ul>
           <li><code class="language-plaintext highlighter-rouge">entity</code>: The target entity label to modify.<br />
@@ -10364,9 +10381,7 @@ <h2 id="contextualentityruler">ContextualEntityRuler</h2>
     Example: <code class="language-plaintext highlighter-rouge">["\\b(old|young)\\b"]</code> matches words like “old” or “young” as suffixes.</li>
           <li><code class="language-plaintext highlighter-rouge">replaceEntity</code>: Optional string specifying the new entity label to replace with the target entity label.<br />
     Example: <code class="language-plaintext highlighter-rouge">"MODIFIED_AGE"</code> replaces <code class="language-plaintext highlighter-rouge">"AGE"</code> with <code class="language-plaintext highlighter-rouge">"MODIFIED_AGE"</code> in matching cases.</li>
-          <li><code class="language-plaintext highlighter-rouge">mode</code>: Specifies the operational mode for the rules.<br />
-    Possible values depend on the use case (e.g., <code class="language-plaintext highlighter-rouge">"include"</code>, <code class="language-plaintext highlighter-rouge">"exclude"</code>).
-    Default: <code class="language-plaintext highlighter-rouge">"include"</code></li>
+          <li><code class="language-plaintext highlighter-rouge">mode</code>: Specifies the operational mode for the rules. Options: <code class="language-plaintext highlighter-rouge">include</code>, <code class="language-plaintext highlighter-rouge">exclude</code>, or <code class="language-plaintext highlighter-rouge">replace_label_only</code>. Default is <code class="language-plaintext highlighter-rouge">include</code>.</li>
         </ul>
       </li>
     </ul>
@@ -10444,7 +10459,6 @@ <h2 id="contextualentityruler">ContextualEntityRuler</h2>
                 <span class="s">"replaceEntity"</span> <span class="p">:</span> <span class="s">"Modified_Date"</span><span class="p">,</span>
                 <span class="s">"mode"</span> <span class="p">:</span> <span class="s">"include"</span>
             <span class="p">}</span>
-
         <span class="p">]</span>
 
 <span class="n">contextual_entity_ruler</span> <span class="o">=</span> <span class="n">medical</span><span class="p">.</span><span class="n">ContextualEntityRuler</span><span class="p">()</span> \
@@ -11595,6 +11609,12 @@ <h2 id="deidentification">DeIdentification</h2>
 If False, the month will be modified along with the year and day.
 Default: False.</p>
       </li>
+      <li>
+        <p><code class="language-plaintext highlighter-rouge">keepTextSizeForObfuscation</code>: Whether to keep the text length the same when obfuscating entities. If <code class="language-plaintext highlighter-rouge">True</code>, the output text keeps the same length when a fake value of the same length is available; otherwise the length might vary.</p>
+      </li>
+      <li>
+        <p><code class="language-plaintext highlighter-rouge">fakerLengthOffset</code>: Specifies how much length deviation is accepted during obfuscation when <code class="language-plaintext highlighter-rouge">keepTextSizeForObfuscation</code> is enabled. It must be greater than 0.</p>
+      </li>
     </ul>
 
     <p>To create a configured DeIdentificationModel, please see the example of DeIdentification.</p>
@@ -16215,6 +16235,9 @@ <h2 id="fewshotassertionclassifiermodel">FewShotAssertionClassifierModel</h2>
       <li><code class="language-plaintext highlighter-rouge">batchSize</code> <em>(Int)</em>: Batch size</li>
       <li><code class="language-plaintext highlighter-rouge">caseSensitive</code> <em>(Bool)</em>: Whether the classifier is sensitive to text casing</li>
       <li><code class="language-plaintext highlighter-rouge">maxSentenceLength</code> <em>(Int)</em>: The maximum length of the input text</li>
+      <li><code class="language-plaintext highlighter-rouge">blackList</code> <em>(list[str])</em>: If defined, list of entities to ignore. The rest will be processed.</li>
+      <li><code class="language-plaintext highlighter-rouge">whiteList</code> <em>(list[str])</em>: If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix in labels.</li>
+      <li><code class="language-plaintext highlighter-rouge">caseSensitive</code> <em>(Bool)</em>: Determines whether the definitions of the white listed and black listed entities are case sensitive. Default: True.</li>
     </ul>
 
     <p><strong>Input Annotator Types:</strong> <code class="language-plaintext highlighter-rouge">DOCUMENT, CHUNK</code></p>

diff --git a/en/licensed_annotator_entries/AssertionDL.md b/en/licensed_annotator_entries/AssertionDL.md
@@ -33,6 +33,12 @@ Parameters:
 
 - `datasetInfo` *(Str)*: Descriptive information about the dataset being used.
 
+- `blackList` *(list[str])*: If defined, list of entities to ignore. The rest will be processed.
+
+- `whiteList` *(list[str])*: If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix in labels.
+
+- `caseSensitive` *(Bool)*: Determines whether the definitions of the white listed and black listed entities are case sensitive. Default: True.
+
 For pretrained models please see the
 [Models Hub](https://nlp.johnsnowlabs.com/models?task=Assertion+Status) for available models.
 {%- endcapture -%}
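
The interaction of `blackList`, `whiteList`, and `caseSensitive` described in this hunk can be sketched in plain Python. This is only an illustration under stated assumptions, not the annotator's implementation: `filter_entities` is a hypothetical helper, and the behavior when both lists are defined (white list applied first, then black list) is an assumption that the documentation does not confirm.

```python
from typing import Iterable, List, Optional


def filter_entities(
    labels: Iterable[str],
    white_list: Optional[List[str]] = None,
    black_list: Optional[List[str]] = None,
    case_sensitive: bool = True,
) -> List[str]:
    """Hypothetical helper: return the entity labels that would still be processed."""

    def norm(label: str) -> str:
        # With case_sensitive=False, labels are compared in lower case.
        return label if case_sensitive else label.lower()

    allowed = {norm(x) for x in white_list} if white_list else None
    blocked = {norm(x) for x in black_list} if black_list else set()

    kept = []
    for label in labels:
        key = norm(label)
        if allowed is not None and key not in allowed:
            continue  # a white list is defined and this label is not on it
        if key in blocked:
            continue  # this label is explicitly black listed
        kept.append(label)
    return kept


labels = ["PROBLEM", "TEST", "TREATMENT"]
print(filter_entities(labels, white_list=["problem"], case_sensitive=False))
print(filter_entities(labels, black_list=["TEST"]))
```

Labels are written without an IOB prefix (`PROBLEM`, not `B-PROBLEM`), matching the note on `whiteList`.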

diff --git a/en/licensed_annotator_entries/BertSentenceChunkEmbeddings.md b/en/licensed_annotator_entries/BertSentenceChunkEmbeddings.md
@@ -21,7 +21,11 @@ Parameters:
 
 - `caseSensitive`: Determines whether the definitions of the white listed entities are case sensitive.
 
-All the parameters can be set using the corresponding set method in camel case. For example, `.setInputcols()`.
+- `strategy`: Strategy for computing embeddings. Supported strategies are: `sentence_average`, `scope_average`, `chunk_only`, `scope_only`. The default is `sentence_average`.
+
+- `scopeWindow`: Scope window used to calculate scope embeddings. The scope window is defined by two non-negative integers: the first is the number of tokens before the chunk and the second is the number of tokens after the chunk. The default is [0, 0], which means only the chunk embeddings are used.
+
+All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.
 
 > For more information and examples of `BertSentenceChunkEmbeddings` annotator, you can check the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop), and in particular, the notebook [24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb).
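
To make the `scopeWindow` description concrete, the following sketch shows which tokens a window of `[before, after]` would cover around a chunk. It is a plain-Python illustration, not the annotator's code: `scope_tokens` is a hypothetical helper, the chunk is given by inclusive token indices, and clipping the window at the start and end of the token list is an assumption.

```python
from typing import List, Sequence, Tuple


def scope_tokens(
    tokens: Sequence[str],
    chunk_start: int,
    chunk_end: int,
    scope_window: Tuple[int, int] = (0, 0),
) -> List[str]:
    """Hypothetical helper: tokens covered by the chunk plus its scope window.

    chunk_start and chunk_end are inclusive token indices of the chunk.
    """
    before, after = scope_window
    if before < 0 or after < 0:
        raise ValueError("scope window values must be non-negative")
    start = max(0, chunk_start - before)            # clip at the first token
    end = min(len(tokens), chunk_end + 1 + after)   # clip at the last token
    return list(tokens[start:end])


tokens = "The patient was given metformin 500 mg twice daily".split()
print(scope_tokens(tokens, 4, 4))          # default [0, 0]: the chunk only
print(scope_tokens(tokens, 4, 4, (2, 1)))  # two tokens before, one after
```

With the default window the result is just the chunk token, which is consistent with the statement that `[0, 0]` uses only the chunk embeddings.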
 

diff --git a/en/licensed_annotator_entries/ContextualEntityFilterer.md b/en/licensed_annotator_entries/ContextualEntityFilterer.md
@@ -24,6 +24,8 @@ Parameters:
   - `blackListWords`: The black list of words. If a word from this list appears within the scope window, the chunk will be filtered out.
   - `whiteListWords`: The white list of words. If a word from this list appears within the scope window, the chunk will be kept.
   - `confidenceThreshold`: The confidence threshold to filter the chunks. Filtering is only applied if the confidence of the chunk is below the threshold.
+  - `possibleRegexContext`: A regex describing a possible context, used to filter the chunks. If the regex is found in the context (chunk), the chunk is kept.
+  - `impossibleRegexContext`: A regex describing an impossible context, used to filter the chunks. If the regex is found in the context (chunk), the chunk is removed.
 
 {%- endcapture -%}
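
The two regex parameters added above can be read as a keep/remove decision on the context text. The sketch below is a hedged illustration with Python's `re` module, not the annotator's logic: `regex_decision` is a hypothetical helper, and checking the impossible pattern before the possible one is an assumption, since the documentation does not say which takes precedence when both match.

```python
import re
from typing import Optional


def regex_decision(
    context: str,
    possible: Optional[str] = None,
    impossible: Optional[str] = None,
) -> Optional[bool]:
    """Hypothetical helper: True = keep, False = remove, None = no regex matched."""
    if impossible is not None and re.search(impossible, context):
        return False  # impossible context found: the chunk is removed
    if possible is not None and re.search(possible, context):
        return True   # possible context found: the chunk is kept
    return None       # neither pattern matched; other rules would decide


print(regex_decision("no history of diabetes", impossible=r"\bno history of\b"))
print(regex_decision("diagnosed with diabetes", possible=r"\bdiagnosed with\b"))
print(regex_decision("diabetes", possible=r"\bdiagnosed with\b"))
```

Returning `None` when nothing matches keeps the sketch from guessing how the remaining rules (`blackListWords`, `whiteListWords`, `confidenceThreshold`) would treat the chunk.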
 

diff --git a/en/licensed_annotator_entries/ContextualEntityRuler.md b/en/licensed_annotator_entries/ContextualEntityRuler.md
@@ -14,11 +14,11 @@ It is particularly useful for refining entity recognition results according to s
 
 Parameters:
 
-- `setCaseSensitive`: Whether to perform case-sensitive matching. Default is False.
-- `setAllowPunctuationInBetween`: Whether to allow punctuation between prefix/suffix patterns and the entity. Default is True.
-- `setDropEmptyChunks`: If True, removes chunks with empty content after applying rules. Default is False.
-- `setCaseSensitive`: If True, it is case sensitive while checking the context. Default is False.
-- `setMergeOverlapping`: If False, it returns both modified entities and the original entities at the same time. Default is True.
+- `caseSensitive`: Whether to perform case-sensitive matching. Default is False.
+- `allowPunctuationInBetween`: Whether to allow punctuation between prefix/suffix patterns and the entity. Default is True.
+- `allowTokensInBetween`: Whether to allow tokens between prefix/suffix patterns and the entity. Default is False.
+- `dropEmptyChunks`: If True, removes chunks with empty content after applying rules. Default is False.
+- `mergeOverlapping`: If False, it returns both modified entities and the original entities at the same time. Default is True.
 - `rules`: The updating rules. Each rule is a dictionary with the following keys:
   - `entity`: The target entity label to modify.  
         Example: `"AGE"`.
@@ -38,9 +38,8 @@ Parameters:
         Example: `["\\b(old|young)\\b"]` matches words like "old" or "young" as suffixes.
   - `replaceEntity`: Optional string specifying the new entity label to replace with the target entity label.  
         Example: `"MODIFIED_AGE"` replaces `"AGE"` with `"MODIFIED_AGE"` in matching cases.
-  - `mode`: Specifies the operational mode for the rules.  
-        Possible values depend on the use case (e.g., `"include"`, `"exclude"`).
-        Default: `"include"` 
+  - `mode`: Specifies the operational mode for the rules. Options: `include`, `exclude`, or `replace_label_only`. Default is `include`.
+
   {%- endcapture -%}
 
 {%- capture model_input_anno -%}
@@ -101,7 +100,6 @@ rules = [   {
                 "replaceEntity" : "Modified_Date",
                 "mode" : "include"
             }
-
         ]
 
 contextual_entity_ruler = medical.ContextualEntityRuler() \

diff --git a/en/licensed_annotator_entries/DeIdentification.md b/en/licensed_annotator_entries/DeIdentification.md
@@ -123,6 +123,10 @@ If True, the month will remain unchanged during the obfuscation process.
 If False, the month will be modified along with the year and day.
 Default: False.
 
+- `keepTextSizeForObfuscation`: Whether to keep the text length the same when obfuscating entities. If `True`, the output text keeps the same length when a fake value of the same length is available; otherwise the length might vary.
+
+- `fakerLengthOffset`: Specifies how much length deviation is accepted during obfuscation when `keepTextSizeForObfuscation` is enabled. It must be greater than 0.
+
 
 To create a configured DeIdentificationModel, please see the example of DeIdentification.
 {%- endcapture -%}
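
One way to picture how `keepTextSizeForObfuscation` and `fakerLengthOffset` could work together is a length-aware choice among candidate fake values. This is a rough sketch under assumptions, not the DeIdentification implementation: `pick_fake` and its candidate list are hypothetical, and both the preference for the closest length and the fallback to the first candidate are illustrative choices.

```python
from typing import List


def pick_fake(
    original: str,
    candidates: List[str],
    keep_text_size: bool = True,
    length_offset: int = 1,
) -> str:
    """Hypothetical helper: choose a fake value for `original`."""
    if not candidates:
        raise ValueError("at least one candidate fake value is required")
    if keep_text_size:
        if length_offset <= 0:
            raise ValueError("length_offset must be greater than 0")
        target = len(original)
        # Candidates whose length deviates from the original by at most the offset.
        within = [c for c in candidates if abs(len(c) - target) <= length_offset]
        if within:
            # Prefer the closest length; an exact match wins when available.
            return min(within, key=lambda c: abs(len(c) - target))
    # No suitable candidate (or size keeping disabled): the length may vary.
    return candidates[0]


print(pick_fake("John Smith", ["Al", "Mary Jones", "Robert Brown"]))
print(pick_fake("John Smith", ["Al", "Bob Stone"], length_offset=1))
print(pick_fake("John Smith", ["Al", "Bo"]))
```

The last call shows the "otherwise length might vary" case: no candidate is within the accepted deviation, so the sketch falls back to a value of a different length.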

diff --git a/en/licensed_annotator_entries/FewShotAssertionClassifier.md b/en/licensed_annotator_entries/FewShotAssertionClassifier.md
@@ -16,6 +16,9 @@ Parameters:
 - `batchSize` *(Int)*: Batch size
 - `caseSensitive` *(Bool)*: Whether the classifier is sensitive to text casing
 - `maxSentenceLength` *(Int)*: The maximum length of the input text
+- `blackList` *(list[str])*: If defined, list of entities to ignore. The rest will be processed.
+- `whiteList` *(list[str])*: If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix in labels.
+- `caseSensitive` *(Bool)*: Determines whether the definitions of the white listed and black listed entities are case sensitive. Default: True.
 
 
 {%- endcapture -%}

diff --git a/feed.xml b/feed.xml
@@ -1,4 +1,4 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2025-01-20T18:46:09+00:00</updated><id>/feed.xml</id><title type="html">Spark NLP</title><subtitle>High Performance NLP with Apache Spark
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2025-01-20T18:53:47+00:00</updated><id>/feed.xml</id><title type="html">Spark NLP</title><subtitle>High Performance NLP with Apache Spark
 </subtitle><author><name>{&quot;type&quot;=&gt;nil, &quot;name&quot;=&gt;nil, &quot;url&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;facebook&quot;=&gt;nil, &quot;twitter&quot;=&gt;nil, &quot;weibo&quot;=&gt;nil, &quot;googleplus&quot;=&gt;nil, &quot;telegram&quot;=&gt;nil, &quot;medium&quot;=&gt;nil, &quot;zhihu&quot;=&gt;nil, &quot;douban&quot;=&gt;nil, &quot;linkedin&quot;=&gt;nil, &quot;github&quot;=&gt;nil, &quot;npm&quot;=&gt;nil}</name></author><entry><title type="html">Clinical Deidentification Pipeline (Document Wise - Benchmark)</title><link href="/2025/01/16/clinical_deidentification_docwise_benchmark_en.html" rel="alternate" type="text/html" title="Clinical Deidentification Pipeline (Document Wise - Benchmark)" /><published>2025-01-16T00:00:00+00:00</published><updated>2025-01-16T00:00:00+00:00</updated><id>/2025/01/16/clinical_deidentification_docwise_benchmark_en</id><content type="html" xml:base="/2025/01/16/clinical_deidentification_docwise_benchmark_en.html">## Description
 
 This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `NAME`, `IDNUM`, `CONTACT`, `LOCATION`, `AGE`, `DATE` entities.