adding more_like_this query dsl docs #9746
Open: AntonEliatra wants to merge 6 commits into opensearch-project:main from AntonEliatra:adding-more_like_this-dsl-query-docs
Commits (6):
747efd7 adding more_like_this query dsl docs (AntonEliatra)
732bfb1 addressing the PR comments (AntonEliatra)
52f4c73 addressing the PR comments (AntonEliatra)
5dfcd0b Apply suggestions from code review (AntonEliatra)
7f6e1c9 addressing PR comments (AntonEliatra)
5a94ce0 addressing PR comments (AntonEliatra)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,286 @@ | ||
--- | ||
layout: default | ||
title: More like this | ||
parent: Specialized queries | ||
nav_order: 45 | ||
has_math: false | ||
--- | ||
|
||
# More like this | ||
|
||
Use a `more_like_this` query to find documents that are similar to one or more given documents. This is useful for recommendation engines, content discovery, and identifying related items in a dataset. | ||
|
||
The `more_like_this` query analyzes the input documents or texts and selects terms that best characterize them. It then searches for other documents that contain those significant terms. | ||
|
||
## Prerequisites | ||
|
||
Before you use a `more_like_this` query, ensure that the fields you target are indexed and their data type is either [`text`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/text/) or [`keyword`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/keyword/). | ||
|
||
If you reference documents in the `like` section, OpenSearch needs access to their content. This is typically done through the `_source` field, which is enabled by default. If `_source` is disabled, you must either store the fields individually or configure them to save [`term_vector`]({{site.url}}{{site.baseurl}}/field-types/mapping-parameters/term-vector/) data. | ||
|
||
Saving [`term_vector`]({{site.url}}{{site.baseurl}}/field-types/mapping-parameters/term-vector/) information when indexing documents can greatly accelerate `more_like_this` queries, because the engine can directly retrieve the important terms without re-analyzing the field text at query time. | ||
{: .note} | ||
|
||
## Example: No term vector optimization | ||
|
||
Create an index named `articles-basic` using the following mapping: | ||
|
||
```json | ||
PUT /articles-basic | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"title": { "type": "text" }, | ||
"content": { "type": "text" } | ||
} | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
Add sample documents: | ||
|
||
```json | ||
POST /articles-basic/_bulk | ||
{ "index": { "_id": 1 }} | ||
{ "title": "Exploring the Sahara Desert", "content": "Sand dunes and vast landscapes." } | ||
{ "index": { "_id": 2 }} | ||
{ "title": "Amazon Rainforest Tour", "content": "Dense jungle and exotic wildlife." } | ||
{ "index": { "_id": 3 }} | ||
{ "title": "Mountain Adventures", "content": "Snowy peaks and hiking trails." } | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
Query using the following request: | ||
|
||
```json | ||
GET /articles-basic/_search | ||
{ | ||
"query": { | ||
"more_like_this": { | ||
"fields": ["content"], | ||
"like": "jungle wildlife", | ||
"min_term_freq": 1, | ||
"min_doc_freq": 1 | ||
} | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
The `more_like_this` query searches for the terms `jungle` and `wildlife` in the `content` field, which matches only one document: | ||
|
||
```json | ||
{ | ||
... | ||
"hits": { | ||
"total": { | ||
"value": 1, | ||
"relation": "eq" | ||
}, | ||
"max_score": 1.9616582, | ||
"hits": [ | ||
{ | ||
"_index": "articles-basic", | ||
"_id": "2", | ||
"_score": 1.9616582, | ||
"_source": { | ||
"title": "Amazon Rainforest Tour", | ||
"content": "Dense jungle and exotic wildlife." | ||
} | ||
} | ||
] | ||
} | ||
} | ||
``` | ||
|
||
## Example: Term vector optimization | ||
|
||
Create an index named `articles-optimized` using the following mapping: | ||
|
||
```json | ||
PUT /articles-optimized | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"name": { | ||
"type": "text", | ||
"term_vector": "with_positions_offsets" | ||
}, | ||
"alias": { | ||
"type": "text", | ||
"term_vector": "with_positions_offsets" | ||
}, | ||
"quote": { | ||
"type": "text", | ||
"term_vector": "with_positions_offsets" | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
Insert sample documents into the optimized index: | ||
|
||
```json | ||
POST /articles-optimized/_bulk | ||
{ "index": { "_id": "a1" } } | ||
{ "name": "Diana", "alias": "Wonder Woman", "quote": "Justice will come when it is deserved." } | ||
{ "index": { "_id": "a2" } } | ||
{ "name": "Clark", "alias": "Superman", "quote": "Even in the darkest times, hope cuts through." } | ||
{ "index": { "_id": "a3" } } | ||
{ "name": "Bruce", "alias": "Batman", "quote": "I am vengeance. I am the night. I am Batman!" } | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
Find documents similar to `dark night` using the following request: | ||
|
||
```json | ||
GET /articles-optimized/_search | ||
{ | ||
"query": { | ||
"more_like_this": { | ||
"fields": ["quote"], | ||
"like": "dark night", | ||
"min_term_freq": 1, | ||
"min_doc_freq": 1 | ||
} | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
The `more_like_this` query searches for the terms `dark` and `night` and returns the following hit: | ||
|
||
```json | ||
{ | ||
... | ||
"hits": { | ||
"total": { | ||
"value": 1, | ||
"relation": "eq" | ||
}, | ||
"max_score": 1.2363393, | ||
"hits": [ | ||
{ | ||
"_index": "articles-optimized", | ||
"_id": "a3", | ||
"_score": 1.2363393, | ||
"_source": { | ||
"name": "Bruce", | ||
"alias": "Batman", | ||
"quote": "I am vengeance. I am the night. I am Batman!" | ||
} | ||
} | ||
] | ||
} | ||
} | ||
``` | ||
|
||
## Example: Using multiple documents and text input | ||
|
||
The `more_like_this` query allows you to provide multiple sources in the `like` parameter. You can combine free text with documents from the index. This is useful if you want the search to combine relevance signals from several examples. | ||
|
||
In the following example, a custom document is provided directly. Additionally, an existing document with the ID `5` from the `heroes` index is included (the example assumes that this index and document exist): | ||
|
||
```json | ||
GET /articles-optimized/_search | ||
{ | ||
"query": { | ||
"more_like_this": { | ||
"fields": ["name", "alias"], | ||
"like": [ | ||
{ | ||
"doc": { | ||
"name": "Diana", | ||
"alias": "Wonder Woman", | ||
"quote": "Courage is not the absence of fear, but the triumph over it." | ||
} | ||
}, | ||
{ | ||
"_index": "heroes", | ||
"_id": "5" | ||
} | ||
], | ||
"min_term_freq": 1, | ||
"min_doc_freq": 1, | ||
"max_query_terms": 25 | ||
} | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
The returned results contain the documents most similar to the `name` and `alias` values provided in the query: | ||
|
||
```json | ||
{ | ||
... | ||
"hits": { | ||
"total": { | ||
"value": 2, | ||
"relation": "eq" | ||
}, | ||
"max_score": 2.140194, | ||
"hits": [ | ||
{ | ||
"_index": "articles-optimized", | ||
"_id": "a1", | ||
"_score": 2.140194, | ||
"_source": { | ||
"name": "Diana", | ||
"alias": "Wonder Woman", | ||
"quote": "Justice will come when it is deserved." | ||
} | ||
}, | ||
{ | ||
"_index": "articles-optimized", | ||
"_id": "a2", | ||
"_score": 1.1596459, | ||
"_source": { | ||
"name": "Clark", | ||
"alias": "Superman", | ||
"quote": "Even in the darkest times, hope cuts through." | ||
} | ||
} | ||
] | ||
} | ||
} | ||
``` | ||
|
||
Use this pattern when you want to boost results based on a new concept that is not yet fully indexed, but also want to combine it with knowledge from existing indexed documents. | ||
{: .note} | ||
|
||
## Parameters | ||
|
||
The only required parameter for a `more_like_this` query is `like`. All other parameters are optional and have default values that you can adjust to fine-tune the query. Parameters fall into the following main categories. | ||
|
||
## Document input parameters | ||
|
||
The following table specifies document input parameters. | ||
|
||
|
||
| Parameter | Required/Optional | Data type | Description | | ||
| :--- | :--- | :--- | :--- | | ||
| `like`| Required| Array of strings or objects | Defines the text or documents to find similar documents for. You can input free text, real documents from the index, or artificial documents. The analyzer associated with the field processes the text unless overridden. | | ||
| `unlike`| Optional| Array of strings or objects | Provides text or documents whose terms should be *excluded* from influencing the query. Useful for specifying negative examples.| | ||
| `fields`| Optional| Array of strings| Lists fields to use when analyzing text. If not specified, all fields are used. | | ||
|
||
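As an illustration of how `like`, `unlike`, and `fields` can be combined, the following request is a minimal sketch that assumes the `articles-basic` index from the earlier example. The input texts are illustrative only. Terms found in `unlike` (here, `sand` and `dunes`) are excluded from term selection, so the query is driven by the remaining `like` terms:

```json
GET /articles-basic/_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "content"],
      "like": ["jungle wildlife", "sand dunes"],
      "unlike": ["sand dunes"],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```
{% include copy-curl.html %}

The low `min_term_freq` and `min_doc_freq` values are used only because the sample index is very small; on a larger index, the defaults may be more appropriate.
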
## Term selection parameters | ||
|
||
| Parameter | Required/Optional | Data type| Description| | ||
| :--- | :--- | :--- | :--- | | ||
| `max_query_terms` | Optional| Integer| Sets the maximum number of terms to select from the input. A higher value increases precision but slows down execution. Default is `25`. | | ||
| `min_term_freq` | Optional| Integer| Terms appearing fewer times than this in the input will be ignored. Default is `2`.| | ||
| `min_doc_freq`| Optional| Integer| Terms that appear in fewer documents than this value will be ignored. Default is `5`.| | ||
| `max_doc_freq`| Optional| Integer| Terms appearing in more documents than this limit are ignored. Useful for avoiding very common words. Default is unlimited (2<sup>31</sup> - 1). | | ||
Check failure on line 272 in _query-dsl/specialized/more-like-this.md
|
||
| `min_word_length` | Optional| Integer| Ignore words shorter than this value. Default is `0`.| | ||
| `max_word_length` | Optional| Integer| Ignore words longer than this value. Default is unlimited. | | ||
| `stop_words`| Optional| Array of strings | Defines a list of words that are ignored completely when selecting terms.| | ||
| `analyzer`| Optional| String | The custom analyzer to use for processing input text. Defaults to the analyzer of the first field listed in `fields`.| | ||
|
||
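The following request is a sketch of how several term selection parameters might be combined, again assuming the `articles-basic` index from the earlier example. The specific values are illustrative assumptions, not recommendations: short words and the listed stop word are dropped, terms that appear in more than two documents are ignored, and at most 10 terms are selected:

```json
GET /articles-basic/_search
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like": "Dense jungle and exotic wildlife",
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_doc_freq": 2,
      "min_word_length": 4,
      "stop_words": ["and"],
      "max_query_terms": 10
    }
  }
}
```
{% include copy-curl.html %}
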
## Query formation parameters | ||
|
||
| Parameter | Required/Optional | Data type | Description | | ||
| :--- | :--- | :--- | :--- | | ||
| `minimum_should_match`| Optional | String | Specifies the minimum number of terms that must match in the final query. The value can be a percentage or a fixed number. Helps fine-tune the balance between recall and precision. Default is `30%`. | | ||
| `fail_on_unsupported_field` | Optional | Boolean | Determines whether to throw an error if one of the target fields is not of a compatible type (`text` or `keyword`). Set to `false` to silently skip unsupported fields. Default is `true`. | | ||
| `boost_terms` | Optional | Float | Applies a boost to selected terms based on their TF-IDF weight. Any value greater than `0` activates term boosting with the specified factor. Default is `0`. | | ||
| `include` | Optional | Boolean | If `true`, the source documents provided in `like` are included in the result hits. Default is `false`. | | ||
| `boost` | Optional | Float | Multiplies the relevance score of the entire `more_like_this` query. Default is `1.0`. | |
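
The following request is a sketch of how the query formation parameters might be used together. It assumes the `articles-optimized` index and the document `a3` from the earlier example, and the values shown are illustrative only. At least half of the selected terms must match, selected terms are boosted by their TF-IDF weight, the input document itself is returned in the hits, and the overall query score is multiplied by `1.5`:

```json
GET /articles-optimized/_search
{
  "query": {
    "more_like_this": {
      "fields": ["quote"],
      "like": [
        { "_index": "articles-optimized", "_id": "a3" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "minimum_should_match": "50%",
      "boost_terms": 2,
      "include": true,
      "boost": 1.5
    }
  }
}
```
{% include copy-curl.html %}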