From 79982b00a7f47637fa2bf7eb8d13b245444df17e Mon Sep 17 00:00:00 2001
From: Anton Rubin
Date: Mon, 28 Apr 2025 12:15:42 +0100
Subject: [PATCH 1/2] updating standard analyzer docs

Signed-off-by: Anton Rubin
---
 _analyzers/supported-analyzers/standard.md | 88 +++++++++++++---------
 1 file changed, 53 insertions(+), 35 deletions(-)

diff --git a/_analyzers/supported-analyzers/standard.md b/_analyzers/supported-analyzers/standard.md
index d5c3650d5d..20af96b22e 100644
--- a/_analyzers/supported-analyzers/standard.md
+++ b/_analyzers/supported-analyzers/standard.md
@@ -7,17 +7,20 @@ nav_order: 50
 
 # Standard analyzer
 
-The `standard` analyzer is the default analyzer used when no other analyzer is specified. It is designed to provide a basic and efficient approach to generic text processing.
+The `standard` analyzer is the built-in default analyzer used for general-purpose full-text search in OpenSearch and Elasticsearch. It is designed to provide consistent, language-agnostic text processing by efficiently breaking down text into searchable terms.
 
-This analyzer consists of the following tokenizers and token filters:
+The `standard` analyzer performs the following operations:
 
-- `standard` tokenizer: Removes most punctuation and splits text on spaces and other common delimiters.
-- `lowercase` token filter: Converts all tokens to lowercase, ensuring case-insensitive matching.
-- `stop` token filter: Removes common stopwords, such as "the", "is", and "and", from the tokenized output.
+- **Tokenization**: It uses the [`standard`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/standard/) tokenizer, which splits text into words based on Unicode text segmentation rules, handling spaces, punctuation, and common delimiters.
+- **Lowercasing**: It applies the [`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/lowercase/) token filter to convert all tokens to lowercase, ensuring consistent matching regardless of input case.
 
-## Example
+This combination makes the `standard` analyzer ideal for indexing a wide range of natural language content without needing language-specific customizations.
 
-Use the following command to create an index named `my_standard_index` with a `standard` analyzer:
+---
+
+## Example: Creating an index with the standard analyzer
+
+You can assign the `standard` analyzer to a text field when creating an index:
 
 ```json
 PUT /my_standard_index
@@ -26,7 +29,7 @@ PUT /my_standard_index
     "properties": {
       "my_field": {
         "type": "text",
-        "analyzer": "standard"
+        "analyzer": "standard" 
       }
     }
   }
@@ -34,33 +37,33 @@ PUT /my_standard_index
 ```
 {% include copy-curl.html %}
 
-## Parameters
+---
 
-You can configure a `standard` analyzer with the following parameters.
+## Parameters
 
-Parameter | Required/Optional | Data type | Description
-:--- | :--- | :--- | :---
-`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`.
-`stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`.
-`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stop words.
+The `standard` analyzer supports the following parameters:
+
+| Parameter | Type | Default | Description |
+|:----------|:-----|:--------|:------------|
+| `max_token_length` | Integer | `255` | Sets the maximum length of a token before it is split. |
+| `stopwords` | List or String | None | A custom list of stopwords or a predefined stopword set, such as `_english_`, to remove during analysis. |
+| `stopwords_path` | String | None | The path to a file containing stopwords to be used during analysis. |
 
-## Configuring a custom analyzer
+## Example: Analyzer with parameters
 
-Use the following command to configure an index with a custom analyzer that is equivalent to the `standard` analyzer:
+The following example creates an index named `animals` and configures the `max_token_length` and `stopwords` parameters:
 
 ```json
-PUT /my_custom_index
+PUT /animals
 {
   "settings": {
     "analysis": {
       "analyzer": {
-        "my_custom_analyzer": {
-          "type": "custom",
-          "tokenizer": "standard",
-          "filter": [
-            "lowercase",
-            "stop"
+        "my_manual_stopwords_analyzer": {
+          "type": "standard",
+          "max_token_length": 10,
+          "stopwords": [
+            "the", "is", "and", "but", "an", "a", "it"
           ]
         }
       }
@@ -70,28 +73,43 @@ PUT /my_custom_index
 ```
 {% include copy-curl.html %}
 
-## Generated tokens
-
-Use the following request to examine the tokens generated using the analyzer:
+Use the following `_analyze` API request to see how the `my_manual_stopwords_analyzer` processes text:
 
 ```json
-POST /my_custom_index/_analyze
+POST /animals/_analyze
 {
-  "analyzer": "my_custom_analyzer",
-  "text": "The slow turtle swims away"
+  "analyzer": "my_manual_stopwords_analyzer",
+  "text": "The Turtle is Large but it is Slow"
 }
 ```
 {% include copy-curl.html %}
 
-The response contains the generated tokens:
+The returned tokens are split on whitespace, lowercased, and stripped of the configured stopwords:
 
 ```json
 {
   "tokens": [
-    {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
-    {"token": "turtle","start_offset": 9,"end_offset": 15,"type": "<ALPHANUM>","position": 2},
-    {"token": "swims","start_offset": 16,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
-    {"token": "away","start_offset": 22,"end_offset": 26,"type": "<ALPHANUM>","position": 4}
+    {
+      "token": "turtle",
+      "start_offset": 4,
+      "end_offset": 10,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "large",
+      "start_offset": 14,
+      "end_offset": 19,
+      "type": "<ALPHANUM>",
+      "position": 3
+    },
+    {
+      "token": "slow",
+      "start_offset": 30,
+      "end_offset": 34,
+      "type": "<ALPHANUM>",
+      "position": 7
+    }
   ]
 }
 ```

From 72c65e17efdb3f158fd8e719f51bc8a7b4b01155 Mon Sep 17 00:00:00 2001
From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Date: Mon, 28 Apr 2025 09:21:30 -0400
Subject: [PATCH 2/2] Update _analyzers/supported-analyzers/standard.md

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
---
 _analyzers/supported-analyzers/standard.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_analyzers/supported-analyzers/standard.md b/_analyzers/supported-analyzers/standard.md
index 20af96b22e..3f31f35a17 100644
--- a/_analyzers/supported-analyzers/standard.md
+++ b/_analyzers/supported-analyzers/standard.md
@@ -7,7 +7,7 @@ nav_order: 50
 
 # Standard analyzer
 
-The `standard` analyzer is the built-in default analyzer used for general-purpose full-text search in OpenSearch and Elasticsearch. It is designed to provide consistent, language-agnostic text processing by efficiently breaking down text into searchable terms.
+The `standard` analyzer is the built-in default analyzer used for general-purpose full-text search in OpenSearch. It is designed to provide consistent, language-agnostic text processing by efficiently breaking down text into searchable terms.
 
 The `standard` analyzer performs the following operations:
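The table in the first patch also documents a `stopwords_path` parameter that none of the examples exercise. Below is a minimal sketch of how it could be used. The index name `articles`, the analyzer name, and the file `analysis/stopwords.txt` (a plain-text file with one stopword per line under the OpenSearch config directory) are assumptions for illustration and are not part of the patches:

```json
PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stopwords_analyzer": {
          "type": "standard",
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_file_stopwords_analyzer"
      }
    }
  }
}
```

Terms listed in the referenced file would then be removed at analysis time, in the same way as the inline `stopwords` array in the `animals` example.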