diff --git a/articles/search/index-add-custom-analyzers.md b/articles/search/index-add-custom-analyzers.md index 919659ec10b..bc59a6ff535 100644 --- a/articles/search/index-add-custom-analyzers.md +++ b/articles/search/index-add-custom-analyzers.md @@ -9,20 +9,20 @@ ms.service: azure-ai-search ms.custom: - ignite-2023 ms.topic: how-to -ms.date: 05/23/2024 +ms.date: 01/16/2025 --- # Add custom analyzers to string fields in an Azure AI Search index -A *custom analyzer* is a user-defined combination of one tokenizer, one or more token filters, and one or more character filters. A custom analyzer is specified within a search index, and then referenced by name on field definitions that require custom analysis. A custom analyzer is invoked on a per-field basis. Attributes on the field will determine whether it's used for indexing, queries, or both. +A *custom analyzer* is a component of lexical analysis over plain text content. It's a user-defined combination of one tokenizer, one or more token filters, and one or more character filters. A custom analyzer is specified within a search index, and then referenced by name on field definitions that require custom analysis. A custom analyzer is invoked on a per-field basis. Attributes on the field determine whether it's used for indexing, queries, or both. -In a custom analyzer, character filters prepare the input text before it's processed by the tokenizer (for example, removing markup). Next, the tokenizer breaks text into tokens. Finally, token filters modify the tokens emitted by the tokenizer. For concepts and examples, see [Analyzers in Azure AI Search](search-analyzers.md). +In a custom analyzer, character filters prepare the input text before it's processed by the tokenizer (for example, removing markup). Next, the tokenizer breaks text into tokens. Finally, token filters modify the tokens emitted by the tokenizer. For concepts and examples, see [Analyzers in Azure AI Search](search-analyzers.md) and [Tutorial: Create a custom analyzer for phone numbers](tutorial-create-custom-analyzer.md). ## Why use a custom analyzer? -A custom analyzer gives you control over the process of converting text into indexable and searchable tokens by allowing you to choose which types of analysis or filtering to invoke, and the order in which they occur. +A custom analyzer gives you control over the process of converting plain text into indexable and searchable tokens by allowing you to choose which types of analysis or filtering to invoke, and the order in which they occur. -Create and assign a custom analyzer if none of default (Standard Lucence), built-in, or language analyzers are sufficient for your needs. You might also create a custom analyzer if you want to use a built-in analyzer with custom options. For example, if you wanted to change the maxTokenLength on Standard, you would create a custom analyzer, with a user-defined name, to set that option. +Create and assign a custom analyzer if none of default (Standard Lucene), built-in, or language analyzers are sufficient for your needs. You might also create a custom analyzer if you want to use a built-in analyzer with custom options. For example, if you wanted to change the `maxTokenLength` on Standard Lucene, you would create a custom analyzer, with a user-defined name, to set that option. Scenarios where custom analyzers can be helpful include: @@ -39,7 +39,7 @@ Scenarios where custom analyzers can be helpful include: - ASCII folding. 
Add the Standard ASCII folding filter to normalize diacritics like ö or ê in search terms. > [!NOTE] -> Custom analyzers aren't exposed in the Azure portal. The only way to add a custom analyzer is through code that defines an index. +> Custom analyzers aren't exposed in the Azure portal. The only way to add a custom analyzer is through code that [creates an index schema](/rest/api/searchservice/indexes/create-or-update). ## Create a custom analyzer @@ -47,7 +47,7 @@ To create a custom analyzer, specify it in the `analyzers` section of an index a An analyzer definition includes a name, type, one or more character filters, a maximum of one tokenizer, and one or more token filters for post-tokenization processing. Character filters are applied before tokenization. Token filters and character filters are applied from left to right. -- Names in a custom analyzer must be unique and can't be the same as any of the built-in analyzers, tokenizers, token filters, or characters filters. Names consist of letters, digits, spaces, dashes or underscores. Names must start and end with plain text characters. Names must be under 128 characters in length. +- Names in a custom analyzer must be unique and can't be the same as any of the built-in analyzers, tokenizers, token filters, or characters filters. Names consist of letters, digits, spaces, dashes, or underscores. Names must start and end with plain text characters. Names must be under 128 characters in length. - Type must be #Microsoft.Azure.Search.CustomAnalyzer. @@ -224,7 +224,7 @@ Azure AI Search supports character filters in the following list. More informati |[mapping](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html)|MappingCharFilter|A char filter that applies mappings defined with the mappings option. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string.

**Options**

mappings (type: string array) - A list of mappings of the following format: `a=>b` (all occurrences of the character `a` are replaced with character `b`). Required.| |[pattern_replace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilter.html)|PatternReplaceCharFilter|A char filter that replaces characters in the input string. It uses a regular expression to identify character sequences to preserve and a replacement pattern to identify characters to replace. For example, input text = `aa bb aa bb`, pattern=`(aa)\\\s+(bb)` replacement=`$1#$2`, result = `aa#bb aa#bb`.

**Options**

pattern (type: string) - Required.

replacement (type: string) - Required.| - 1 Char Filter Types are always prefixed in code with `#Microsoft.Azure.Search` such that `MappingCharFilter` would actually be specified as `#Microsoft.Azure.Search.MappingCharFilter`. We removed the prefix to reduce the width of the table, but please remember to include it in your code. Notice that char_filter_type is only provided for filters that can be customized. If there are no options, as is the case with html_strip, there's no associated #Microsoft.Azure.Search type. + 1 Char Filter Types are always prefixed in code with `#Microsoft.Azure.Search` such that `MappingCharFilter` would actually be specified as `#Microsoft.Azure.Search.MappingCharFilter`. We removed the prefix to reduce the width of the table, but remember to include it in your code. Notice that char_filter_type is only provided for filters that can be customized. If there are no options, as is the case with html_strip, there's no associated #Microsoft.Azure.Search type. @@ -237,20 +237,20 @@ Azure AI Search supports tokenizers in the following list. More information abou |**tokenizer_name**|**tokenizer_type** 1|**Description and Options**| |------------------|-------------------------------|---------------------------| |[classic](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html)|ClassicTokenizer|Grammar based tokenizer that is suitable for processing most European-language documents.

**Options**

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.| -|[edgeNGram](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html)|EdgeNGramTokenizer|Tokenizes the input from an edge into n-grams of given size(s).

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values:
`letter`, `digit`, `whitespace`, `punctuation`, `symbol`. Defaults to an empty array - keeps all characters. | +|[edgeNGram](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html)|EdgeNGramTokenizer|Tokenizes the input from an edge into n-grams of given sizes.

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values:
`letter`, `digit`, `whitespace`, `punctuation`, `symbol`. Defaults to an empty array - keeps all characters. | |[keyword_v2](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordTokenizer.html)|KeywordTokenizerV2|Emits the entire input as a single token.

**Options**

maxTokenLength (type: int) - The maximum token length. Default: 256, maximum: 300. Tokens longer than the maximum length are split.| |[letter](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LetterTokenizer.html)|(type applies only when options are available) |Divides text at non-letters. Tokens that are longer than 255 characters are split.| |[lowercase](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseTokenizer.html)|(type applies only when options are available) |Divides text at non-letters and converts them to lower case. Tokens that are longer than 255 characters are split.| | microsoft_language_tokenizer| MicrosoftLanguageTokenizer| Divides text using language-specific rules.

**Options**

maxTokenLength (type: int) - The maximum token length, default: 255, maximum: 300. Tokens longer than the maximum length are split. Tokens longer than 300 characters are first split into tokens of length 300 and then each of those tokens is split based on the maxTokenLength set.

isSearchTokenizer (type: bool) - Set to true if used as the search tokenizer, set to false if used as the indexing tokenizer.

language (type: string) - Language to use, default `english`. Allowed values include:
`bangla`, `bulgarian`, `catalan`, `chineseSimplified`, `chineseTraditional`, `croatian`, `czech`, `danish`, `dutch`, `english`, `french`, `german`, `greek`, `gujarati`, `hindi`, `icelandic`, `indonesian`, `italian`, `japanese`, `kannada`, `korean`, `malay`, `malayalam`, `marathi`, `norwegianBokmaal`, `polish`, `portuguese`, `portugueseBrazilian`, `punjabi`, `romanian`, `russian`, `serbianCyrillic`, `serbianLatin`, `slovenian`, `spanish`, `swedish`, `tamil`, `telugu`, `thai`, `ukrainian`, `urdu`, `vietnamese` | | microsoft_language_stemming_tokenizer | MicrosoftLanguageStemmingTokenizer| Divides text using language-specific rules and reduces words to their base forms. This tokenizer performs lemmatization.

**Options**

maxTokenLength (type: int) - The maximum token length, default: 255, maximum: 300. Tokens longer than the maximum length are split. Tokens longer than 300 characters are first split into tokens of length 300 and then each of those tokens is split based on the maxTokenLength set.

isSearchTokenizer (type: bool) - Set to true if used as the search tokenizer, set to false if used as the indexing tokenizer.

language (type: string) - Language to use, default `english`. Allowed values include:
`arabic`, `bangla`, `bulgarian`, `catalan`, `croatian`, `czech`, `danish`, `dutch`, `english`, `estonian`, `finnish`, `french`, `german`, `greek`, `gujarati`, `hebrew`, `hindi`, `hungarian`, `icelandic`, `indonesian`, `italian`, `kannada`, `latvian`, `lithuanian`, `malay`, `malayalam`, `marathi`, `norwegianBokmaal`, `polish`, `portuguese`, `portugueseBrazilian`, `punjabi`, `romanian`, `russian`, `serbianCyrillic`, `serbianLatin`, `slovak`, `slovenian`, `spanish`, `swedish`, `tamil`, `telugu`, `turkish`, `ukrainian`, `urdu` | -|[nGram](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html)|NGramTokenizer|Tokenizes the input into n-grams of the given size(s).

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values: `letter`, `digit`, `whitespace`, `punctuation`, `symbol`. Defaults to an empty array - keeps all characters. | +|[nGram](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html)|NGramTokenizer|Tokenizes the input into n-grams of the given sizes.

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values: `letter`, `digit`, `whitespace`, `punctuation`, `symbol`. Defaults to an empty array - keeps all characters. | |[path_hierarchy_v2](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html)|PathHierarchyTokenizerV2|Tokenizer for path-like hierarchies. **Options**

delimiter (type: string) - Default: '/'.

replacement (type: string) - If set, replaces the delimiter character. Default same as the value of delimiter.

maxTokenLength (type: int) - The maximum token length. Default: 300, maximum: 300. Paths longer than maxTokenLength are ignored.

reverse (type: bool) - If true, generates tokens in reverse order. Default: false.

skip (type: int) - The number of initial tokens to skip. The default is 0.| |[pattern](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizer.html)|PatternTokenizer|This tokenizer uses regex pattern matching to construct distinct tokens.

**Options**

[pattern](https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html) (type: string) - Regular expression pattern to match token separators. The default is `\W+`, which matches non-word characters.

[flags](https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary) (type: string) - Regular expression flags. The default is an empty string. Allowed values: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES

group (type: int) - Which group to extract into tokens. The default is -1 (split).| |[standard_v2](https://lucene.apache.org/core/6_6_1/core/org/apache/lucene/analysis/standard/StandardTokenizer.html)|StandardTokenizerV2|Breaks text following the [Unicode Text Segmentation rules](https://unicode.org/reports/tr29/).

**Options**

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.| |[uax_url_email](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.html)|UaxUrlEmailTokenizer|Tokenizes urls and emails as one token.

**Options**

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.| |[whitespace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html)|(type applies only when options are available) |Divides text at whitespace. Tokens that are longer than 255 characters are split.| - 1 Tokenizer Types are always prefixed in code with `#Microsoft.Azure.Search` such that `ClassicTokenizer` would actually be specified as `#Microsoft.Azure.Search.ClassicTokenizer`. We removed the prefix to reduce the width of the table, but please remember to include it in your code. Notice that tokenizer_type is only provided for tokenizers that can be customized. If there are no options, as is the case with the letter tokenizer, there's no associated #Microsoft.Azure.Search type. + 1 Tokenizer Types are always prefixed in code with `#Microsoft.Azure.Search` such that `ClassicTokenizer` would actually be specified as `#Microsoft.Azure.Search.ClassicTokenizer`. We removed the prefix to reduce the width of the table, but remember to include it in your code. Notice that tokenizer_type is only provided for tokenizers that can be customized. If there are no options, as is the case with the letter tokenizer, there's no associated #Microsoft.Azure.Search type. @@ -258,7 +258,7 @@ Azure AI Search supports tokenizers in the following list. More information abou A token filter is used to filter out or modify the tokens generated by a tokenizer. For example, you can specify a lowercase filter that converts all characters to lowercase. You can have multiple token filters in a custom analyzer. Token filters run in the order in which they're listed. -In the table below, the token filters that are implemented using Apache Lucene are linked to the Lucene API documentation. +In the following table, the token filters that are implemented using Apache Lucene are linked to the Lucene API documentation. |**token_filter_name**|**token_filter_type** 1|**Description and Options**| |-|-|-| @@ -266,11 +266,11 @@ In the table below, the token filters that are implemented using Apache Lucene a |[apostrophe](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilter.html)|(type applies only when options are available) |Strips all characters after an apostrophe (including the apostrophe itself). | |[asciifolding](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html)|AsciiFoldingTokenFilter|Converts alphabetic, numeric, and symbolic Unicode characters which aren't in the first 127 ASCII characters (the `Basic Latin` Unicode block) into their ASCII equivalents, if one exists.

**Options**

preserveOriginal (type: bool) - If true, the original token is kept. The default is false.| |[cjk_bigram](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html)|CjkBigramTokenFilter|Forms bigrams of CJK terms that are generated from StandardTokenizer.

**Options**

ignoreScripts (type: string array) - Scripts to ignore. Allowed values include: `han`, `hiragana`, `katakana`, `hangul`. The default is an empty list.

outputUnigrams (type: bool) - Set to true if you always want to output both unigrams and bigrams. The default is false.| -|[cjk_width](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html)|(type applies only when options are available) |Normalizes CJK width differences. Folds full width ASCII variants into the equivalent basic latin and half-width Katakana variants into the equivalent kana. | +|[cjk_width](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html)|(type applies only when options are available) |Normalizes CJK width differences. Folds full width ASCII variants into the equivalent basic Latin and half-width Katakana variants into the equivalent kana. | |[classic](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilter.html)|(type applies only when options are available) |Removes the English possessives, and dots from acronyms. | |[common_grams](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html)|CommonGramTokenFilter|Construct bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid.

**Options**

commonWords (type: string array) - The set of common words. The default is an empty list. Required.

ignoreCase (type: bool) - If true, matching is case insensitive. The default is false.

queryMode (type: bool) - Generates bigrams then removes common words and single terms followed by a common word. The default is false.| |[dictionary_decompounder](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html)|DictionaryDecompounderTokenFilter|Decomposes compound words found in many Germanic languages.

**Options**

wordList (type: string array) - The list of words to match against. The default is an empty list. Required.

minWordSize (type: int) - Only words longer than this are processed. The default is 5.

minSubwordSize (type: int) - Only subwords longer than this are output. The default is 2.

maxSubwordSize (type: int) - Only subwords shorter than this are output. The default is 15.

onlyLongestMatch (type: bool) - Add only the longest matching subword to output. The default is false.| -|[edgeNGram_v2](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html)|EdgeNGramTokenFilterV2|Generates n-grams of the given size(s) from starting from the front or the back of an input token.

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum 300. Must be greater than minGram.

side (type: string) - Specifies which side of the input the n-gram should be generated from. Allowed values: `front`, `back` | +|[edgeNGram_v2](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html)|EdgeNGramTokenFilterV2|Generates n-grams of the given sizes starting from the front or the back of an input token.

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

side (type: string) - Specifies which side of the input the n-gram should be generated from. Allowed values: `front`, `back` | |[elision](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html)|ElisionTokenFilter|Removes elisions. For example, `l'avion` (the plane) is converted to `avion` (plane).

**Options**

articles (type: string array) - A set of articles to remove. The default is an empty list. If there's no list of articles set, by default all French articles are removed.| |[german_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)|(type applies only when options are available) |Normalizes German characters according to the heuristics of the [German2 snowball algorithm](https://snowballstem.org/algorithms/german2/stemmer.html) .| |[hindi_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizationFilter.html)|(type applies only when options are available) |Normalizes text in Hindi to remove some differences in spelling variations. | @@ -282,7 +282,7 @@ In the table below, the token filters that are implemented using Apache Lucene a |[length](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html)|LengthTokenFilter|Removes words that are too long or too short.

**Options**

min (type: int) - The minimum number of characters. Default: 0, maximum: 300.

max (type: int) - The maximum number of characters. Default: 300, maximum: 300.| |[limit](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html)|Microsoft.Azure.Search.LimitTokenFilter|Limits the number of tokens while indexing.

**Options**

maxTokenCount (type: int) - Max number of tokens to produce. The default is 1.

consumeAllTokens (type: bool) - Whether all tokens from the input must be consumed even if maxTokenCount is reached. The default is false.| |[lowercase](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html)|(type applies only when options are available) |Normalizes token text to lower case. | -|[nGram_v2](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html)|NGramTokenFilterV2|Generates n-grams of the given size(s).

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum 300. Must be greater than minGram.| +|[nGram_v2](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html)|NGramTokenFilterV2|Generates n-grams of the given sizes.

**Options**

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.| |[pattern_capture](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html)|PatternCaptureTokenFilter|Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.

**Options**

patterns (type: string array) - A list of patterns to match against each token. Required.

preserveOriginal (type: bool) - Set to true to return the original token even if one of the patterns matches, default: true | |[pattern_replace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceFilter.html)|PatternReplaceTokenFilter|A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.

**Options**

pattern (type: string) - Required.

replacement (type: string) - Required.| |[persian_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizationFilter.html)|(type applies only when options are available) |Applies normalization for Persian. | @@ -291,20 +291,20 @@ In the table below, the token filters that are implemented using Apache Lucene a |[reverse](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html)|(type applies only when options are available) |Reverses the token string. | |[scandinavian_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)|(type applies only when options are available) |Normalizes use of the interchangeable Scandinavian characters. | |[scandinavian_folding](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)|(type applies only when options are available) |Folds Scandinavian characters `åÅäæÄÆ`into `a` and `öÖøØ`into `o`. It also discriminates against use of double vowels `aa`, `ae`, `ao`, `oe` and `oo`, leaving just the first one. | -|[shingle](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html)|ShingleTokenFilter|Creates combinations of tokens as a single token.

**Options**

maxShingleSize (type: int) - Defaults to 2.

minShingleSize (type: int) - Defaults to 2.

outputUnigrams (type: bool) - if true, the output stream contains the input tokens (unigrams) as well as shingles. The default is true.

outputUnigramsIfNoShingles (type: bool) - If true, override the behavior of outputUnigrams==false for those times when no shingles are available. The default is false.

tokenSeparator (type: string) - The string to use when joining adjacent tokens to form a shingle. The default is a single empty space ` `.

filterToken (type: string) - The string to insert for each position for which there is no token. The default is `_`.| +|[shingle](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html)|ShingleTokenFilter|Creates combinations of tokens as a single token.

**Options**

maxShingleSize (type: int) - Defaults to 2.

minShingleSize (type: int) - Defaults to 2.

outputUnigrams (type: bool) - If true, the output stream contains the input tokens (unigrams) as well as shingles. The default is true.

outputUnigramsIfNoShingles (type: bool) - If true, overrides the behavior of outputUnigrams==false when no shingles are available. The default is false.

tokenSeparator (type: string) - The string to use when joining adjacent tokens to form a shingle. The default is a single space character ` `.

filterToken (type: string) - The string to insert for each position for which there's no token. The default is `_`.| |[snowball](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html)|SnowballTokenFilter|Snowball Token Filter.

**Options**

language (type: string) - Allowed values include: `armenian`, `basque`, `catalan`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `german2`, `hungarian`, `italian`, `kp`, `lovins`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `spanish`, `swedish`, `turkish`| |[sorani_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizationFilter.html)|SoraniNormalizationTokenFilter|Normalizes the Unicode representation of `Sorani` text.

**Options**

None.| |stemmer|StemmerTokenFilter|Language-specific stemming filter.

**Options**

language (type: string) - Allowed values include:
- [`arabic`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ar/ArabicStemmer.html)
- [`armenian`](https://snowballstem.org/algorithms/armenian/stemmer.html)
- [`basque`](https://snowballstem.org/algorithms/basque/stemmer.html)
- [`brazilian`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemmer.html)
- `bulgarian`
- [`catalan`](https://snowballstem.org/algorithms/catalan/stemmer.html)
- [`czech`](https://portal.acm.org/citation.cfm?id=1598600)
- [`danish`](https://snowballstem.org/algorithms/danish/stemmer.html)
- [`dutch`](https://snowballstem.org/algorithms/dutch/stemmer.html)
- [`dutchKp`](https://snowballstem.org/algorithms/kraaij_pohlmann/stemmer.html)
- [`english`](https://snowballstem.org/algorithms/porter/stemmer.html)
- [`lightEnglish`](https://ciir.cs.umass.edu/pubfiles/ir-35.pdf)
- [`minimalEnglish`](https://www.researchgate.net/publication/220433848_How_effective_is_suffixing)
- [`possessiveEnglish`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilter.html)
- [`porter2`](https://snowballstem.org/algorithms/english/stemmer.html)
- [`lovins`](https://snowballstem.org/algorithms/lovins/stemmer.html)
- [`finnish`](https://snowballstem.org/algorithms/finnish/stemmer.html)
- `lightFinnish`
- [`french`](https://snowballstem.org/algorithms/french/stemmer.html)
- [`lightFrench`](https://dl.acm.org/citation.cfm?id=1141523)
- [`minimalFrench`](https://dl.acm.org/citation.cfm?id=318984)
- `galician`
- `minimalGalician`
- [`german`](https://snowballstem.org/algorithms/german/stemmer.html)
- [`german2`](https://snowballstem.org/algorithms/german2/stemmer.html)
- [`lightGerman`](https://dl.acm.org/citation.cfm?id=1141523)
- `minimalGerman`
- [`greek`](https://sais.se/mthprize/2007/ntais2007.pdf)
- `hindi`
- [`hungarian`](https://snowballstem.org/algorithms/hungarian/stemmer.html)
- [`lightHungarian`](https://dl.acm.org/citation.cfm?id=1141523&dl=ACM&coll=DL&CFID=179095584&CFTOKEN=80067181)
- [`indonesian`](https://eprints.illc.uva.nl/741/2/MoL-2003-03.text.pdf)
- [`irish`](https://snowballstem.org/algorithms/irish/stemmer.html)
- [`italian`](https://snowballstem.org/algorithms/italian/stemmer.html)
- [`lightItalian`](https://www.ercim.eu/publication/ws-proceedings/CLEF2/savoy.pdf)
- [`sorani`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ckb/SoraniStemmer.html)
- [`latvian`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/lv/LatvianStemmer.html)
- [`norwegian`](https://snowballstem.org/algorithms/norwegian/stemmer.html)
- [`lightNorwegian`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/no/NorwegianLightStemmer.html)
- [`minimalNorwegian`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/no/NorwegianMinimalStemmer.html)
- [`lightNynorsk`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/no/NorwegianLightStemmer.html)
- [`minimalNynorsk`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/no/NorwegianMinimalStemmer.html)
- [`portuguese`](https://snowballstem.org/algorithms/portuguese/stemmer.html)
- [`lightPortuguese`](https://dl.acm.org/citation.cfm?id=1141523&dl=ACM&coll=DL&CFID=179095584&CFTOKEN=80067181)
- [`minimalPortuguese`](https://web.archive.org/web/20230425141918/https://www.inf.ufrgs.br/~buriol/papers/Orengo_CLEF07.pdf)
- [`portugueseRslp`](https://web.archive.org/web/20230422082818/https://www.inf.ufrgs.br/~viviane/rslp/index.htm)
- [`romanian`](https://snowballstem.org/otherapps/romanian/)
- [`russian`](https://snowballstem.org/algorithms/russian/stemmer.html)
- [`lightRussian`](https://doc.rero.ch/lm.php?url=1000%2C43%2C4%2C20091209094227-CA%2FDolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf)
- [`spanish`](https://snowballstem.org/algorithms/spanish/stemmer.html)
- [`lightSpanish`](https://www.ercim.eu/publication/ws-proceedings/CLEF2/savoy.pdf)
- [`swedish`](https://snowballstem.org/algorithms/swedish/stemmer.html)
- `lightSwedish`
- [`turkish`](https://snowballstem.org/algorithms/turkish/stemmer.html)| |[stemmer_override](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html)|StemmerOverrideTokenFilter|Any dictionary-Stemmed terms are marked as keywords, which prevents stemming down the chain. Must be placed before any stemming filters.

**Options**

rules (type: string array) - Stemming rules in the following format `word => stem` for example `ran => run`. The default is an empty list. Required.| |[stopwords](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html)|StopwordsTokenFilter|Removes stop words from a token stream. By default, the filter uses a predefined stop word list for English.

**Options**

stopwords (type: string array) - A list of stopwords. Can't be specified if a stopwordsList is specified.

stopwordsList (type: string) - A predefined list of stopwords. Can't be specified if `stopwords` is specified. Allowed values include: `arabic`, `armenian`, `basque`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `english`, `finnish`, `french`, `galician`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `latvian`, `norwegian`, `persian`, `portuguese`, `romanian`, `russian`, `sorani`, `spanish`, `swedish`, `thai`, `turkish`. The default is `english`.

ignoreCase (type: bool) - If true, all words are lower cased first. The default is false.

removeTrailing (type: bool) - If true, ignore the last search term if it's a stop word. The default is true. -|[synonym](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilter.html)|SynonymTokenFilter|Matches single or multi word synonyms in a token stream.

**Options**

synonyms (type: string array) - Required. List of synonyms in one of the following two formats:

-incredible, unbelievable, fabulous => amazing - all terms on the left side of => symbol are replaced with all terms on its right side.

-incredible, unbelievable, fabulous, amazing - A comma-separated list of equivalent words. Set the expand option to change how this list is interpreted.

ignoreCase (type: bool) - Case-folds input for matching. The default is false.

expand (type: bool) - If true, all words in the list of synonyms (if => notation is not used) map to one another.
The following list: incredible, unbelievable, fabulous, amazing is equivalent to: incredible, unbelievable, fabulous, amazing => incredible, unbelievable, fabulous, amazing

- If false, the following list: incredible, unbelievable, fabulous, amazing are equivalent to: incredible, unbelievable, fabulous, amazing => incredible.| +|[synonym](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/synonym/SynonymFilter.html)|SynonymTokenFilter|Matches single or multi word synonyms in a token stream.

**Options**

synonyms (type: string array) - Required. List of synonyms in one of the following two formats:

-incredible, unbelievable, fabulous => amazing - all terms on the left side of the => symbol are replaced with all terms on its right side.

-incredible, unbelievable, fabulous, amazing - A comma-separated list of equivalent words. Set the expand option to change how this list is interpreted.

ignoreCase (type: bool) - Case-folds input for matching. The default is false.

expand (type: bool) - If true, all words in the list of synonyms (if => notation isn't used) map to one another.
The following list: incredible, unbelievable, fabulous, amazing is equivalent to: incredible, unbelievable, fabulous, amazing => incredible, unbelievable, fabulous, amazing

- If false, the following list: incredible, unbelievable, fabulous, amazing is equivalent to: incredible, unbelievable, fabulous, amazing => incredible.| |[trim](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html)|(type applies only when options are available) |Trims leading and trailing whitespace from tokens. | |[truncate](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html)|TruncateTokenFilter|Truncates the terms into a specific length.

**Options**

length (type: int) - Default: 300, maximum: 300. Required.| |[unique](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html)|UniqueTokenFilter|Filters out tokens with same text as the previous token.

**Options**

onlyOnSamePosition (type: bool) - If set, remove duplicates only at the same position. The default is true.| |[uppercase](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/UpperCaseFilter.html)|(type applies only when options are available) |Normalizes token text to upper case. | |[word_delimiter](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html)|WordDelimiterTokenFilter|Splits words into subwords and performs optional transformations on subword groups.

**Options**

generateWordParts (type: bool) - Causes parts of words to be generated, for example `AzureSearch` becomes `Azure` `Search`. The default is true.

generateNumberParts (type: bool) - Causes number subwords to be generated. The default is true.

catenateWords (type: bool) - Causes maximum runs of word parts to be catenated, for example `Azure-Search` becomes `AzureSearch`. The default is false.

catenateNumbers (type: bool) - Causes maximum runs of number parts to be catenated, for example `1-2` becomes `12`. The default is false.

catenateAll (type: bool) - Causes all subword parts to be catenated, for example `Azure-Search-1` becomes `AzureSearch1`. The default is false.

splitOnCaseChange (type: bool) - If true, splits words on caseChange, for example `AzureSearch` becomes `Azure` `Search`. The default is true.

preserveOriginal (type: bool) - Causes original words to be preserved and added to the subword list. The default is false.

splitOnNumerics (type: bool) - If true, splits on numbers, for example `Azure1Search` becomes `Azure` `1` `Search`. The default is true.

stemEnglishPossessive (type: bool) - Causes trailing `'s` to be removed for each subword. The default is true.

protectedWords (type: string array) - Tokens to protect from being delimited. The default is an empty list.| - 1 Token Filter Types are always prefixed in code with `#Microsoft.Azure.Search` such that `ArabicNormalizationTokenFilter` would actually be specified as `#Microsoft.Azure.Search.ArabicNormalizationTokenFilter`. We removed the prefix to reduce the width of the table, but please remember to include it in your code. + 1 Token Filter Types are always prefixed in code with `#Microsoft.Azure.Search` such that `ArabicNormalizationTokenFilter` would actually be specified as `#Microsoft.Azure.Search.ArabicNormalizationTokenFilter`. We removed the prefix to reduce the width of the table, but remember to include it in your code. ## See also diff --git a/articles/search/index-add-language-analyzers.md b/articles/search/index-add-language-analyzers.md index 272d414f45e..15e93314d64 100644 --- a/articles/search/index-add-language-analyzers.md +++ b/articles/search/index-add-language-analyzers.md @@ -9,7 +9,7 @@ ms.service: azure-ai-search ms.custom: - ignite-2023 ms.topic: how-to -ms.date: 05/23/2024 +ms.date: 01/16/2025 --- # Add language analyzers to string fields in an Azure AI Search index diff --git a/articles/search/search-analyzers.md b/articles/search/search-analyzers.md index 81b9cd8b41f..fed0b4af219 100644 --- a/articles/search/search-analyzers.md +++ b/articles/search/search-analyzers.md @@ -7,7 +7,7 @@ manager: nitinme ms.author: heidist ms.service: azure-ai-search ms.topic: conceptual -ms.date: 05/23/2024 +ms.date: 01/16/2025 ms.custom: - devx-track-csharp - ignite-2023 @@ -22,9 +22,9 @@ An *analyzer* is a component of the [full text search engine](search-lucene-quer + Lower-case any upper-case words + Reduce words into primitive root forms for storage efficiency and so that matches can be found regardless of tense -Analysis applies to `Edm.String` fields that are marked as "searchable", which indicates full text search. +The output of a lexical analyzer is a sequence of [tokens](https://suif.stanford.edu/dragonbook/lecture-notes/Stanford-CS143/03-Lexical-Analysis.pdf). -For fields of this configuration, analysis occurs during indexing when tokens are created, and then again during query execution when queries are parsed and the engine scans for matching tokens. A match is more likely to occur when the same analyzer is used for both indexing and queries, but you can set the analyzer for each workload independently, depending on your requirements. +Lexical analysis applies to `Edm.String` fields that are marked as "searchable", which indicates full text search. For fields of this configuration, analysis occurs during indexing when tokens are created, and then again during query execution when queries are parsed and the engine scans for matching tokens. A match is more likely to occur when the same analyzer is used for both indexing and queries, but you can set the analyzer for each workload independently, depending on your requirements. Query types that are *not* full text search, such as filters or fuzzy search, don't go through the analysis phase on the query side. Instead, the parser sends those strings directly to the search engine, using the pattern that you provide as the basis for the match. Typically, these query forms require whole-string tokens to make pattern matching work. To ensure whole term tokens are preserved during indexing, you might need [custom analyzers](index-add-custom-analyzers.md). 
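To illustrate that point, here's a minimal sketch of an index fragment that assigns a custom analyzer built from components described in the preceding tables: the built-in `keyword_v2` tokenizer emits the entire field value as a single token, and the `lowercase` token filter keeps matching case-insensitive. The index, field, and analyzer names (`hotels-example`, `productCode`, `keyword-lowercase`) are hypothetical and shown only for illustration.

```json
{
  "name": "hotels-example",
  "fields": [
    {
      "name": "hotelId",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "productCode",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "analyzer": "keyword-lowercase"
    }
  ],
  "analyzers": [
    {
      "name": "keyword-lowercase",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters": [],
      "tokenizer": "keyword_v2",
      "tokenFilters": [ "lowercase" ]
    }
  ]
}
```

Because the field keeps whole-term tokens at indexing time, filters and other whole-string query forms against `productCode` can match the complete value.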
For more information about when and why query terms are analyzed, see [Full text search in Azure AI Search](search-lucene-query-architecture.md). diff --git a/articles/search/search-api-versions.md b/articles/search/search-api-versions.md index 72b572966a4..1cd9742f516 100644 --- a/articles/search/search-api-versions.md +++ b/articles/search/search-api-versions.md @@ -14,7 +14,7 @@ ms.custom: - devx-track-python - ignite-2023 ms.topic: conceptual -ms.date: 06/24/2024 +ms.date: 01/16/2025 --- # API versions in Azure AI Search @@ -57,33 +57,31 @@ Support for the above-listed versions ended on October 15, 2020. If you have cod The following table provides links to more recent SDK versions. -| SDK version | Status | Description | -|-------------|--------|------------------------------| -| [Azure.Search.Documents 11](/dotnet/api/overview/azure/search.documents-readme) | Active | New client library from the Azure .NET SDK team, initially released July 2020. See the [Change Log](https://github.com/Azure/azure-sdk-for-net/blob/Azure.Search.Documents_11.3.0/sdk/search/Azure.Search.Documents/CHANGELOG.md) for information about minor releases. | -| [Microsoft.Azure.Search 10](https://www.nuget.org/packages/Microsoft.Azure.Search/) | Retired | Released May 2019. This is the last version of the Microsoft.Azure.Search package and it's now deprecated. It's succeeded by Azure.Search.Documents. | -| [Microsoft.Azure.Management.Search 4.0.0](https://www.nuget.org/packages/Microsoft.Azure.Management.Search/4.0.0) | Active | Targets the Management REST api-version=2020-08-01. | -| [Microsoft.Azure.Management.Search 3.0.0](https://www.nuget.org/packages/Microsoft.Azure.Management.Search/3.0.0) | Retired | Targets the Management REST api-version=2015-08-19. | +| SDK version | Status | Change log | Description | +|-------------|--------|------------ |-----------------| +| [Azure.Search.Documents 11](/dotnet/api/overview/azure/search.documents-readme) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/search/Azure.Search.Documents/CHANGELOG.md) | APIs for data plane operations on a service, such as read-write operations on content and objects. | +| [Azure.ResourceManager.Search](https://www.nuget.org/packages/Microsoft.Azure.Management.Search/4.0.0) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/search/Azure.ResourceManager.Search/CHANGELOG.md) | APIs for control plane operations on the search service. | ## Azure SDK for Java -| SDK version | Status | Description | -|-------------|--------|------------------------------| -| [Java azure-search-documents 11](/java/api/overview/azure/search-documents-readme) | Active | Use the `azure-search-documents` client library for data plane operations. | -| [Java Management Client 1.35.0](/java/api/overview/azure/search/management) | Active | Use the `azure-mgmt-search` client library for control plane operations. | +| SDK version | Status | Change log | Description | +|-------------|--------|------------|-----------------| +| [azure-search-documents 11](/java/api/overview/azure/search-documents-readme) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/search/azure-search-documents/CHANGELOG.md) Use the `azure-search-documents` client library for data plane operations. 
| +| [azure-resourcemanager-search 2](/java/api/overview/azure/resourcemanager-search-readme) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/resourcemanager/azure-resourcemanager-search/CHANGELOG.md) | Use the `azure-resourcemanager-search` client library for control plane operations. | ## Azure SDK for JavaScript -| SDK version | Status | Description | -|-------------|--------|------------------------------| -| [JavaScript @azure/search-documents 11.0](/javascript/api/overview/azure/search-documents-readme) | Active | Use the `@azure/search-documents` client library for data plane operations. | -| [JavaScript @azure/arm-search](https://www.npmjs.com/package/@azure/arm-search) | Active | Use the `@azure/arm-search` client library for control plane operations. | +| SDK version | Status | Change log | Description | +|-------------|--------|------------|------------------| +| [@azure/search-documents 12](/javascript/api/overview/azure/search-documents-readme) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/search/search-documents/CHANGELOG.md) | Use the `@azure/search-documents` client library for data plane operations. | +| [@azure/arm-search 4](/javascript/api/overview/azure/arm-search-readme) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/search/arm-search/CHANGELOG.md) | Use the `@azure/arm-search` package for control plane operations. | ## Azure SDK for Python -| SDK version | Status | Description | -|-------------|--------|------------------------------| -| [Python azure-search-documents 11.0](/python/api/azure-search-documents) | Active | Use the `azure-search-documents` client library for data plane operations. | -| [Python azure-mgmt-search 8.0](https://pypi.org/project/azure-mgmt-search/) | Active | Use the `azure-mgmt-search` client library for control plane operations. | +| SDK version | Status | Change log | Description | +|-------------|--------|------------|------------------| +| [azure-search-documents 11](/python/api/overview/azure/search-documents-readme) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/search/azure-search-documents/CHANGELOG.md) | Use the `azure-search-documents` client library for data plane operations. | +| [azure-mgmt-search 9](https://pypi.org/project/azure-mgmt-search/) | Active | [Change Log](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/search/azure-mgmt-search/CHANGELOG.md) | Use the `azure-mgmt-search` client library for control plane operations. 
| ## All Azure SDKs diff --git a/articles/search/search-faceted-navigation.md b/articles/search/search-faceted-navigation.md index 279f082e4b0..af75307aba2 100644 --- a/articles/search/search-faceted-navigation.md +++ b/articles/search/search-faceted-navigation.md @@ -8,7 +8,7 @@ author: HeidiSteen ms.author: heidist ms.service: azure-ai-search ms.topic: conceptual -ms.date: 10/31/2024 +ms.date: 01/16/2025 --- # Add faceted navigation to a search app diff --git a/articles/search/search-faq-frequently-asked-questions.yml b/articles/search/search-faq-frequently-asked-questions.yml index 9cbe836c3dd..2c8cf7b91c6 100644 --- a/articles/search/search-faq-frequently-asked-questions.yml +++ b/articles/search/search-faq-frequently-asked-questions.yml @@ -9,7 +9,7 @@ metadata: ms.author: heidist ms.service: azure-ai-search ms.topic: faq - ms.date: 05/28/2024 + ms.date: 01/16/2025 title: Azure AI Search Frequently Asked Questions summary: Find answers to commonly asked questions about Azure AI Search. @@ -36,7 +36,7 @@ sections: answer: | For vectors, the embedding models you use determines the linguistic experience. - For nonvector strings and numbers, the default analyzer used for tokenization is standard Lucene and it is language agnostic. Otherwise, language support is expressed through [language analyzers](index-add-language-analyzers.md#supported-language-analyzers) that apply linguistic rules to inbound (indexing) and outbound (queries) content. Some features, such as [speller](speller-how-to-add.md#supported-languages), are limited to a subset of languages. + For nonvector strings and numbers, the default analyzer used for tokenization is standard Lucene and it's language agnostic. Otherwise, language support is expressed through [language analyzers](index-add-language-analyzers.md#supported-language-analyzers) that apply linguistic rules to inbound (indexing) and outbound (queries) content. Some features, such as [speller](speller-how-to-add.md#supported-languages) and [query rewrite](semantic-how-to-query-rewrite.md), are limited to a subset of languages. - question: | How do I integrate search into my solution? @@ -101,12 +101,12 @@ sections: - question: | Does Azure AI Search support vector search? answer: | - Azure AI Search supports vector indexing and retrieval. It can vectorize query strings and content if you use the preview and beta libraries. + Azure AI Search supports vector indexing and retrieval. It can chunk and vectorize query strings and content if you use [integrated vectorization](vector-search-integrated-vectorization.md) and take a dependency on indexers and skillsets. - question: | How does vector search work in Azure AI Search? answer: | - With standalone vector search, you first use an embedding model to transform content into a vector representation within an embedding space. You can then provide these vectors in a document payload to the search index for indexing. To serve search requests, you use the same deep neural network (DNN) from indexing to transform the search query into a vector representation, and vector search finds the most similar vectors and return the corresponding documents. + With standalone vector search, you first use an embedding model to transform content into a vector representation within an embedding space. You can then provide these vectors in a document payload to the search index for indexing. 
To serve search requests, you use the same embedding model to transform the search query into a vector representation, and vector search finds the most similar vectors and return the corresponding documents. In Azure AI Search, you can index vector data as fields in documents alongside textual and other types of content. There are [multiple data types](/rest/api/searchservice/supported-data-types#edm-data-types-for-vector-fields) for vector fields. @@ -130,7 +130,7 @@ sections: - question: | Why do I see different vector index size limits between my new search services and existing search services? answer: | - We're rolling out improved vector index size limits worldwide for new search services, but we're still building out infrastructure capacity in certain regions. New search services created in supported regions will see increased vector index size limits. Unfortunately, we can't migrate existing services to the new limits. + Azure AI Search rolled out improved vector index size limits worldwide for new search services, but [some regions experience capacity constraints](search-region-support.md), and some regions don't have the required infrastructure. New search services created in supported regions should see increased vector index size limits. Unfortunately, we can't migrate existing services to the new limits. Also, only vector indexes that use the Hierarchical Navigable Small World (HNSW) algorithm report on vector index size in the Azure portal. If your index uses exhaustive KNN, vector index size is reported as zero, even though the index contains vectors. - question: | How do I enable vector search on a search index? @@ -141,7 +141,7 @@ sections: * Add a "vectorSearch" section to the index schema specifying the configuration used by vector search fields, including the parameters of the Approximate Nearest Neighbor algorithm used, like HNSW. - * Use the latest stable version[**2024-07-01**](/rest/api/searchservice), or an Azure SDK to create or update the index, load documents, and issue queries. + * Use the latest stable version[**2024-07-01**](/rest/api/searchservice), or an Azure SDK to create or update the index, load documents, and issue queries. For more information, see [Create a vector index](vector-search-how-to-create-index.md). - name: Queries questions: diff --git a/articles/search/search-howto-concurrency.md b/articles/search/search-howto-concurrency.md index 049313cf452..c4d7c4e8ab2 100644 --- a/articles/search/search-howto-concurrency.md +++ b/articles/search/search-howto-concurrency.md @@ -8,7 +8,7 @@ author: HeidiSteen ms.author: heidist ms.service: azure-ai-search ms.topic: how-to -ms.date: 04/23/2024 +ms.date: 01/16/2025 ms.custom: - devx-track-csharp - ignite-2023 diff --git a/articles/search/search-howto-index-json-blobs.md b/articles/search/search-howto-index-json-blobs.md index ded344c3780..2f061541a95 100644 --- a/articles/search/search-howto-index-json-blobs.md +++ b/articles/search/search-howto-index-json-blobs.md @@ -11,7 +11,7 @@ ms.service: azure-ai-search ms.custom: - ignite-2023 ms.topic: how-to -ms.date: 06/25/2024 +ms.date: 01/16/2025 --- # Index JSON blobs and files in Azure AI Search @@ -39,7 +39,6 @@ Within the indexer definition, you can optionally set [field mappings](search-in > [!NOTE] > When a JSON parsing mode is used, Azure AI Search assumes that all blobs use the same parser (either for **`json`**, **`jsonArray`** or **`jsonLines`**). 
If you have a mix of different file types in the same data source, consider using [file extension filters](search-blob-storage-integration.md#controlling-which-blobs-are-indexed) to control which files are imported. - The following sections describe each mode in more detail. If you're unfamiliar with indexer clients and concepts, see [Create a search indexer](search-howto-create-indexers.md). You should also be familiar with the details of [basic blob indexer configuration](search-howto-indexing-azure-blob-storage.md), which isn't repeated here. @@ -76,8 +75,7 @@ api-key: [admin key] ``` > [!NOTE] -> As with all indexers, if fields do not clearly match, you should expect to explicitly specify individual [field mappings](search-indexer-field-mappings.md) unless you are using the implicit fields mappings available for blob content and metadata, as described in [basic blob indexer configuration](search-howto-indexing-azure-blob-storage.md). - +> As with all indexers, if fields don't clearly match, you should expect to explicitly specify individual [field mappings](search-indexer-field-mappings.md) unless you're using the implicit fields mappings available for blob content and metadata, as described in [basic blob indexer configuration](search-howto-indexing-azure-blob-storage.md). ### json example (single hotel JSON files) @@ -208,7 +206,7 @@ You can also refer to individual array elements by using a zero-based index. For ``` > [!NOTE] -> If "sourceFieldName" refers to a property that doesn't exist in the JSON blob, that mapping is skipped without an error. This behavior allows indexing to continue for JSON blobs that have a different schema (which is a common use case). Because there is no validation check, check the mappings carefully for typos so that you aren't losing documents for the wrong reason. +> If "sourceFieldName" refers to a property that doesn't exist in the JSON blob, that mapping is skipped without an error. This behavior allows indexing to continue for JSON blobs that have a different schema (which is a common use case). Because there's no validation check, check the mappings carefully for typos so that you aren't losing documents for the wrong reason. > ## Next steps diff --git a/articles/search/search-howto-managed-identities-storage.md b/articles/search/search-howto-managed-identities-storage.md index 8dcab234763..ba499a3aa45 100644 --- a/articles/search/search-howto-managed-identities-storage.md +++ b/articles/search/search-howto-managed-identities-storage.md @@ -8,7 +8,7 @@ manager: nitinme ms.service: azure-ai-search ms.topic: how-to -ms.date: 06/03/2024 +ms.date: 01/16/2025 ms.custom: - subject-rbac-steps - ignite-2023 @@ -43,9 +43,9 @@ You can use a system-assigned managed identity or a user-assigned managed identi | ADLS Gen2 indexing using an indexer | Add **Storage Blob Data Reader** | | Table indexing using an indexer | Add **Reader and Data Access** | | File indexing using an indexer | Add **Reader and Data Access** | - | Write to a knowledge store | Add **Storage Blob DataContributor** for object and file projections, and **Reader and Data Access** for table projections. | - | Write to an enrichment cache | Add **Storage Blob Data Contributor** | - | Save debug session state | Add **Storage Blob Data Contributor** | + | Write to a [knowledge store](knowledge-store-concept-intro.md) | Add **Storage Blob Data Contributor** for object and file projections, and **Reader and Data Access** for table projections. 
## Next steps
diff --git a/articles/search/search-howto-managed-identities-storage.md b/articles/search/search-howto-managed-identities-storage.md
index 8dcab234763..ba499a3aa45 100644
--- a/articles/search/search-howto-managed-identities-storage.md
+++ b/articles/search/search-howto-managed-identities-storage.md
@@ -8,7 +8,7 @@ manager: nitinme
ms.service: azure-ai-search
ms.topic: how-to
-ms.date: 06/03/2024
+ms.date: 01/16/2025
ms.custom:
  - subject-rbac-steps
  - ignite-2023
@@ -43,9 +43,9 @@ You can use a system-assigned managed identity or a user-assigned managed identi
   | ADLS Gen2 indexing using an indexer | Add **Storage Blob Data Reader** |
   | Table indexing using an indexer | Add **Reader and Data Access** |
   | File indexing using an indexer | Add **Reader and Data Access** |
-  | Write to a knowledge store | Add **Storage Blob DataContributor** for object and file projections, and **Reader and Data Access** for table projections. |
-  | Write to an enrichment cache | Add **Storage Blob Data Contributor** |
-  | Save debug session state | Add **Storage Blob Data Contributor** |
+  | Write to a [knowledge store](knowledge-store-concept-intro.md) | Add **Storage Blob Data Contributor** for object and file projections, and **Reader and Data Access** for table projections. |
+  | Write to an [enrichment cache](cognitive-search-incremental-indexing-conceptual.md) | Add **Storage Blob Data Contributor** |
+  | Save [debug session state](cognitive-search-debug-session.md) | Add **Storage Blob Data Contributor** |

1. Select **Next**.

@@ -59,7 +59,7 @@ You can use a system-assigned managed identity or a user-assigned managed identi
Once you have a role assignment, you can set up a connection to Azure Storage that operates under that role.
-Indexers use a data source object for connections to an external data source. This section explains how to specify a system-assigned managed identity or a user-assigned managed identity on a data source connection string. You can find more [connection string examples](search-howto-managed-identities-data-sources.md#connection-string-examples) in the managed identity article.
+[Indexers](search-indexer-overview.md) use a data source object for connections to an external data source. This section explains how to specify a system-assigned managed identity or a user-assigned managed identity on a data source connection string. You can find more [connection string examples](search-howto-managed-identities-data-sources.md#connection-string-examples) in the managed identity article.
> [!TIP]
> You can create a data source connection to Azure Storage in the Azure portal, specifying either a system or user-assigned managed identity, and then view the JSON definition to see how the connection string is formulated.
@@ -70,7 +70,7 @@ You must have a [system-assigned managed identity already configured](search-how
For connections made using a system-assigned managed identity, the only change to the [data source definition](/rest/api/searchservice/data-sources/create) is the format of the `credentials` property.
-Provide a `ResourceId` that has no account key or password. The `ResourceId` must include the subscription ID of the storage account, the resource group of the storage account, and the storage account name.
+Provide a connection string that contains a `ResourceId`, with no account key or password. The `ResourceId` must include the subscription ID of the storage account, the resource group of the storage account, and the storage account name.
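As a sketch, the resulting `credentials` property might look like the following; the subscription ID, resource group, and storage account name are placeholders:

```json
"credentials": {
  "connectionString": "ResourceId=/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-resource-group/providers/Microsoft.Storage/storageAccounts/mystorageaccount;"
}
```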
```http
POST https://[service name].search.windows.net/datasources?api-version=2024-07-01
@@ -91,11 +91,11 @@ POST https://[service name].search.windows.net/datasources?api-version=2024-07-0
You must have a [user-assigned managed identity already configured](search-howto-managed-identities-data-sources.md) and associated with your search service, and the identity must have a role assignment on Azure Storage.
-Connections made through user-assigned managed identities use the same credentials as a system-assigned managed identity, plus an extra identity property that contains the collection of user-assigned managed identities. Only one user-assigned managed identity should be provided when creating the data source. Set `userAssignedIdentity` to the user-assigned managed identity.
+Connections made through user-assigned managed identities use the same credentials as a system-assigned managed identity, plus an extra identity property that contains the collection of user-assigned managed identities. Only one user-assigned managed identity should be provided when creating the data source.
-Provide a `ResourceId` that has no account key or password. The `ResourceId` must include the subscription ID of the storage account, the resource group of the storage account, and the storage account name.
+Provide a connection string that contains a `ResourceId`, with no account key or password. The `ResourceId` must include the subscription ID of the storage account, the resource group of the storage account, and the storage account name.
-Provide an `identity` using the syntax shown in the following example.
+Provide an `identity` using the syntax shown in the following example. Set `userAssignedIdentity` to the user-assigned managed identity.
```http
POST https://[service name].search.windows.net/datasources?api-version=2024-07-01
diff --git a/articles/search/search-performance-analysis.md b/articles/search/search-performance-analysis.md
index 7c89c591cd3..5f7a50ba30b 100644
--- a/articles/search/search-performance-analysis.md
+++ b/articles/search/search-performance-analysis.md
@@ -6,7 +6,7 @@ author: mattgotteiner
ms.author: magottei
ms.service: azure-ai-search
ms.topic: conceptual
-ms.date: 06/06/2024
+ms.date: 01/16/2025
---

# Analyze performance in Azure AI Search

@@ -17,7 +17,7 @@ This article describes the tools, behaviors, and approaches for analyzing query
In any large implementation, it's critical to do a performance benchmarking test of your Azure AI Search service before you roll it into production. You should test both the search query load that you expect and the expected data ingestion workloads (if possible, run both workloads simultaneously). Having benchmark numbers helps to validate the proper [search tier](search-sku-tier.md), [service configuration](search-capacity-planning.md), and expected [query latency](search-performance-analysis.md#average-query-latency).
-To develop benchmarks, we recommend the [azure-search-performance-testing (GitHub)](https://github.com/Azure-Samples/azure-search-performance-testing) tool.
+
To isolate the effects of a distributed service architecture, try testing on service configurations of one replica and one partition.
@@ -59,8 +59,8 @@ AzureDiagnostics
Examining throttling over a specific time period can help you identify the times when throttling might occur more frequently. In the following example, a time series chart is used to show the number of throttled queries that occurred over a specified time frame. In this case, the throttled queries correlated with the times at which the performance benchmarking was performed.
```kusto
-let ['_startTime']=datetime('2021-02-25T20:45:07Z');
-let ['_endTime']=datetime('2021-03-03T20:45:07Z');
+let ['_startTime']=datetime('2024-02-25T20:45:07Z');
+let ['_endTime']=datetime('2024-03-03T20:45:07Z');
let intervalsize = 1m;
AzureDiagnostics
| where TimeGenerated > ago(7d)
@@ -122,8 +122,8 @@ In the below query, an interval size of 1 minute is used to show the average lat
```kusto
let intervalsize = 1m;
-let _startTime = datetime('2021-02-23 17:40');
-let _endTime = datetime('2021-02-23 18:00');
+let _startTime = datetime('2024-02-23 17:40');
+let _endTime = datetime('2024-02-23 18:00');
AzureDiagnostics
| where TimeGenerated between(['_startTime']..['_endTime']) // Time range filtering
| summarize AverageQueryLatency = avgif(DurationMs, OperationName in ("Query.Search", "Query.Suggest", "Query.Lookup", "Query.Autocomplete"))
@@ -139,8 +139,8 @@ The following query looks at the average number of queries per minute to ensure
```kusto
let intervalsize = 1m;
-let _startTime = datetime('2021-02-23 17:40');
-let _endTime = datetime('2021-02-23 18:00');
+let _startTime = datetime('2024-02-23 17:40');
+let _endTime = datetime('2024-02-23 18:00');
AzureDiagnostics
| where TimeGenerated between(['_startTime'] .. ['_endTime']) // Time range filtering
| summarize QueriesPerMinute=bin(countif(OperationName in ("Query.Search", "Query.Suggest", "Query.Lookup", "Query.Autocomplete"))/(intervalsize/1m), 0.01)
@@ -158,8 +158,8 @@ From this insight, we can see that it took about 3 minutes for the search servic
```kusto
let intervalsize = 1m;
-let _startTime = datetime('2021-02-23 17:40');
-let _endTime = datetime('2021-02-23 18:00');
+let _startTime = datetime('2024-02-23 17:40');
+let _endTime = datetime('2024-02-23 18:00');
AzureDiagnostics
| where TimeGenerated between(['_startTime'] .. ['_endTime']) // Time range filtering
| summarize IndexingOperationsPerSecond=bin(countif(OperationName == "Indexing.Index")/ (intervalsize/1m), 0.01)
@@ -171,7 +171,7 @@ AzureDiagnostics
## Background service processing
-It isn't unusual to see periodic spikes in query or indexing latency. Spikes might occur in response to indexing or high query rates, but could also occur during merge operations. Search indexes are stored in chunks - or shards. Periodically, the system merges smaller shards into large shards, which can help optimize service performance. This merge process also cleans up documents that have previously been marked for deletion from the index, resulting in the recovery of storage space.
+It's common to see occasional spikes in query or indexing latency. Spikes might occur in response to indexing or high query rates, but could also occur during merge operations. Search indexes are stored in chunks, or shards. Periodically, the system merges smaller shards into larger shards, which can help optimize service performance. This merge process also cleans up documents that have previously been marked for deletion from the index, resulting in the recovery of storage space.
Merging shards is fast, but also resource intensive and thus has the potential to degrade service performance. If you notice short bursts of query latency, and those bursts coincide with recent changes to indexed content, you can assume the latency is due to shard merge operations.
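To check whether latency bursts coincide with indexing activity, you can chart both measures on one timeline. The following Kusto query is a sketch that combines the latency and indexing aggregations shown earlier in this article; the time range and interval values are placeholders:

```kusto
// Sketch: plot average query latency alongside indexing activity to spot correlated spikes.
let intervalsize = 1m;
let _startTime = datetime('2024-02-23 17:40');
let _endTime = datetime('2024-02-23 18:00');
AzureDiagnostics
| where TimeGenerated between(['_startTime'] .. ['_endTime']) // Time range filtering
| summarize AverageQueryLatency = avgif(DurationMs, OperationName in ("Query.Search", "Query.Suggest", "Query.Lookup", "Query.Autocomplete")),
            IndexingOperations = countif(OperationName == "Indexing.Index")
    by bin(TimeGenerated, intervalsize)
| render timechart
```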