Add score normalization and combination documentation #4985
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
--- | ||
layout: default | ||
title: Hybrid | ||
parent: Compound queries | ||
grand_parent: Query DSL | ||
nav_order: 70 | ||
--- | ||
|
||
# Hybrid query | ||
|
||
Use a hybrid query to combine relevance scores from multiple queries into one score for a given document. A hybrid query contains a list of one or more queries and calculates document scores at the shard level independently for each subquery. The subquery rewriting is done at the coordinating node level to avoid duplicate computations. | ||
|
||
## Example | ||
Review comment: Can we have a full example of how to do hybrid search here? An example including the creation of the search pipeline and then the hybrid query.
Reply: Added a link to the example in the normalization processor documentation so it can be maintained in one place. |
||
|
||
The following example request combines a score from a regular `match` query clause with a score from a `neural` query clause. It uses a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/) with a [normalization processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/), which specifies the techniques to normalize and combine query clause relevance scores: | ||
|
||
```json | ||
POST flicker-index/_search?search_pipeline=normalizationPipeline | ||
{ | ||
"query": { | ||
"hybrid": { | ||
"queries": [ | ||
{ | ||
"neural": { | ||
"passage_embedding": { | ||
"query_text": "Girl with Brown Hair", | ||
"model_id": "ABCBMODELID", | ||
"k": 20 | ||
} | ||
} | ||
}, | ||
{ | ||
"match": { | ||
"passage_text": "Girl Brown hair" | ||
} | ||
} | ||
] | ||
} | ||
} | ||
} | ||
``` | ||
{% include copy-curl.html %} | ||
|
||
To learn more about the normalization processor, see [Normalization processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/). | ||
|
||
## Parameters | ||
|
||
The following table lists all top-level parameters supported by `hybrid` queries. | ||
|
||
Parameter | Description | ||
:--- | :--- | ||
`queries` | An array of one or more query clauses that are used to match documents. A document must match at least one query clause to be returned in the results. The documents' relevance scores from all query clauses are combined into one score by applying a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/). The maximum number of query clauses is 5. Required. |
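
As a rough illustration of the `queries` parameter, the following Python sketch assembles a hybrid query body and enforces the one-to-five clause limit described in the table. The helper name `build_hybrid_query` and the constant `MAX_HYBRID_CLAUSES` are hypothetical and are not part of any OpenSearch client; the field names and model ID are copied from the example request above.

```python
import json

MAX_HYBRID_CLAUSES = 5  # limit stated in the parameters table; constant name is hypothetical


def build_hybrid_query(clauses):
    """Wrap a list of query clauses in a hybrid query body (illustrative helper)."""
    if not 1 <= len(clauses) <= MAX_HYBRID_CLAUSES:
        raise ValueError(
            f"hybrid query needs 1 to {MAX_HYBRID_CLAUSES} clauses, got {len(clauses)}"
        )
    return {"query": {"hybrid": {"queries": list(clauses)}}}


body = build_hybrid_query([
    {"neural": {"passage_embedding": {
        "query_text": "Girl with Brown Hair",
        "model_id": "ABCBMODELID",
        "k": 20,
    }}},
    {"match": {"passage_text": "Girl Brown hair"}},
])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request that names a search pipeline, as in the example above.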
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -14,9 +14,10 @@ You can use _search pipelines_ to build new or reuse existing result rerankers, | |||||
|
||||||
The following is a list of search pipeline terminology: | ||||||
|
||||||
* _Search request processor_: A component that takes a search request (the query and the metadata passed in the request), performs an operation with or on the search request, and returns a search request. | ||||||
* _Search response processor_: A component that takes a search response and search request (the query, results, and metadata passed in the request), performs an operation with or on the search response, and returns a search response. | ||||||
* _Processor_: Either a search request processor or a search response processor. | ||||||
* [_Search request processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/search-processors#search-request-processors): A component that takes a search request (the query and the metadata passed in the request), performs an operation with or on the search request, and returns a search request. | ||||||
* [_Search response processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/search-processors#search-response-processors): A component that takes a search response and search request (the query, results, and metadata passed in the request), performs an operation with or on the search response, and returns a search response. | ||||||
* [_Search phase results processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/search-processors#search-phase-results-processors): A component that runs between search phases at the coordinating node level. A search phase results processor takes the results retrieved from one search phase and transforms them before passing them to the next search phase. | ||||||
|
||||||
* [_Processor_]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/search-processors/): A search request processor, a search response processor, or a search phase results processor. | ||||||
* _Search pipeline_: An ordered list of processors that is integrated into OpenSearch. The pipeline intercepts a query, performs processing on the query, sends it to OpenSearch, intercepts the results, performs processing on the results, and returns them to the calling application, as shown in the following diagram. | ||||||
|
||||||
[Figure: Search processor diagram] | ||||||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,103 @@ | ||||||
--- | ||||||
layout: default | ||||||
title: Normalization | ||||||
nav_order: 15 | ||||||
has_children: false | ||||||
parent: Search processors | ||||||
grand_parent: Search pipelines | ||||||
--- | ||||||
|
||||||
# Normalization processor | ||||||
|
||||||
The `normalization_processor` is a search phase results processor that runs between the query and fetch phases of search. It intercepts the query phase results and then normalizes and combines the document scores from different query clauses before passing the documents to the fetch phase. | ||||||
|
||||||
|
||||||
## Score normalization and combination | ||||||
|
||||||
Many applications require both keyword matching and semantic understanding. For example, BM25 accurately provides relevant search results for a query containing keywords, and neural networks perform well when a query requires natural language understanding. Thus, you might want to combine BM25 search results with the results of k-NN or neural search. However, BM25 and k-NN search use different scales to calculate relevance scores for the matching documents. Before combining the scores from multiple queries, it is necessary to normalize those scores so they are on the same scale. For further reading about score normalization and combination, including benchmarks and discussion of various techniques, see this [semantic search blog](https://opensearch.org/blog/semantic-science-benchmarks/). | ||||||
Review comment: I'm not sure it's necessary; rather, experimental data shows that the final results have better information retrieval metrics. Not a strong opinion, but it may be worth checking whether we can formulate this in a more relaxed way. |
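
To make the scale mismatch concrete, here is a minimal Python sketch of min-max normalization followed by a weighted arithmetic mean. It is not the processor's implementation: the function names, the sample scores, the choice to map identical scores to 1.0, and the choice to count a missing score as 0.0 are all assumptions made for illustration.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; identical scores all map to 1.0 (sketch assumption)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_arithmetic_mean(score_maps, weights):
    """Combine per-query scores per document; a missing score counts as 0.0 (sketch assumption)."""
    docs = set().union(*score_maps)
    total = sum(weights)
    return {
        doc: sum(w * m.get(doc, 0.0) for m, w in zip(score_maps, weights)) / total
        for doc in docs
    }


bm25 = {"a": 12.0, "b": 6.0, "c": 3.0}  # made-up keyword scores on an unbounded scale
knn = {"a": 0.5, "b": 0.9, "c": 0.7}    # made-up similarity scores on a different scale

combined = weighted_arithmetic_mean(
    [min_max_normalize(bm25), min_max_normalize(knn)], weights=[0.3, 0.7]
)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)  # ['b', 'c', 'a']
```

Without normalization, the raw BM25 values would dominate any sum of the two score sets. After rescaling, the weights decide how much each query contributes.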
||||||
|
||||||
## Query then fetch | ||||||
|
||||||
OpenSearch supports two search types: `query_then_fetch` and `dfs_query_then_fetch`. The following diagram outlines the query then fetch process that includes a normalization processor. | ||||||
|
||||||
|
||||||
[Figure: Normalization processor flow diagram] | ||||||
|
||||||
When you send a search request to a node, this node becomes a _coordinating node_. During the first phase of search, the _query phase_, the coordinating node routes the search request to all shards in the index, including primary and replica shards. Each shard then runs the search query locally and returns metadata about the matching documents, which includes their doc IDs and relevance scores. The `normalization_processor` then normalizes and combines scores from different query clauses. The coordinating node merges and sorts the local result lists, compiling a global list of top documents that match the query. After that, search enters a _fetch phase_, in which the coordinating node requests the documents in the global list from the shards where they reside. Each shard returns the documents' `_source` to the coordinating node. Finally, the coordinating node sends a search response containing the results back to you. | ||||||
Review comment: The first sentence reads as though the request becomes a node... Reworded.
Review comment: Fifth sentence: "local lists of results" instead of "result lists"? |
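
The merge step performed by the coordinating node can be sketched in Python as follows. This is a simplified illustration, not OpenSearch code: `merge_shard_results` is a hypothetical helper, the document IDs and scores are invented, and the scores are assumed to be already normalized and combined.

```python
import heapq


def merge_shard_results(shard_results, size):
    """Merge per-shard (doc_id, score) lists into one global top-`size` list."""
    all_hits = (hit for hits in shard_results for hit in hits)
    # nlargest returns the hits sorted from highest to lowest score
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[1])


shard_results = [
    [("doc1", 0.92), ("doc4", 0.40)],  # local results from shard 0
    [("doc7", 0.85), ("doc2", 0.10)],  # local results from shard 1
]
top = merge_shard_results(shard_results, size=3)
print(top)  # [('doc1', 0.92), ('doc7', 0.85), ('doc4', 0.4)]
```

In the fetch phase, only the documents in this global list would be requested from the shards that hold them.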
||||||
|
||||||
## Request fields | ||||||
|
||||||
The following table lists all available request fields. | ||||||
|
||||||
Field | Data type | Description | ||||||
:--- | :--- | :--- | ||||||
`normalization.technique` | String | The technique for normalizing scores. Valid values are `min_max` and `l2`. Optional. Default is `min_max`. | ||||||
`combination.technique` | String | The technique for combining scores. Valid values are `harmonic_mean`, `arithmetic_mean`, and `geometric_mean`. Optional. Default is `arithmetic_mean`. | ||||||
`combination.parameters.weights` | Array of floating-point values | Specifies the weights to use for each query. Valid values are in the [0.0, 1.0] range and signify decimal percentages. The closer the weight is to 1.0, the more weight is given to a query. The number of values in the `weights` array must equal the number of queries. The sum of the values in the array must equal 1.0. Optional. If not provided, all queries are given equal weight. | ||||||
`tag` | String | The processor's identifier. Optional. | ||||||
`description` | String | A description of the processor. Optional. | ||||||
`ignore_failure` | Boolean | If `true`, OpenSearch [ignores a failure]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/#ignoring-processor-failures) of this processor and continues to run the remaining processors in the search pipeline. Optional. Default is `false`. | ||||||
Review comment: this setting is hardcoded to |
||||||
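
The remaining techniques and the `weights` constraints from the table can be sketched in Python using their textbook definitions. This is an assumption-laden illustration rather than the processor's code: all function names are hypothetical, and skipping non-positive scores in the harmonic and geometric means is a choice made for this sketch.

```python
import math


def l2_normalize(scores):
    """Divide each score by the Euclidean norm of all scores from the same query."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm if norm else 0.0 for s in scores]


def validate_weights(weights, num_queries):
    """Check the table's constraints: one weight per query, each in [0, 1], summing to 1.0."""
    if len(weights) != num_queries:
        raise ValueError("number of weights must equal the number of queries")
    if any(not 0.0 <= w <= 1.0 for w in weights):
        raise ValueError("each weight must be in the [0.0, 1.0] range")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1.0")


def harmonic_mean(scores, weights):
    """Weighted harmonic mean over positive scores only (sketch assumption)."""
    pairs = [(s, w) for s, w in zip(scores, weights) if s > 0]
    denom = sum(w / s for s, w in pairs)
    return sum(w for _, w in pairs) / denom if denom else 0.0


def geometric_mean(scores, weights):
    """Weighted geometric mean over positive scores only (sketch assumption)."""
    pairs = [(s, w) for s, w in zip(scores, weights) if s > 0]
    total = sum(w for _, w in pairs)
    return math.exp(sum(w * math.log(s) for s, w in pairs) / total) if total else 0.0


validate_weights([0.3, 0.7], num_queries=2)  # passes silently
print(l2_normalize([3.0, 4.0]))
print(harmonic_mean([1.0, 0.5], [0.5, 0.5]))
print(geometric_mean([1.0, 0.25], [0.5, 0.5]))
```

Calling `validate_weights([0.4, 0.7], num_queries=2)` raises a `ValueError` because the weights sum to 1.1.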
|
||||||
## Example | ||||||
|
||||||
The following example demonstrates using a search pipeline with a `normalization_processor`. | ||||||
|
||||||
### Creating a search pipeline | ||||||
|
||||||
The following request creates a search pipeline with a `normalization_processor` that uses the `min_max` normalization technique and the `arithmetic_mean` combination technique: | ||||||
|
||||||
```json | ||||||
PUT /_search/pipeline/my_pipeline | ||||||
{ | ||||||
"phase_results_processors" : [ | ||||||
{ | ||||||
"normalization-processor" : { | ||||||
"normalization": { | ||||||
"technique": "min_max" | ||||||
}, | ||||||
"combination": { | ||||||
"technique" : "arithmetic_mean", | ||||||
"parameters" : { | ||||||
"weights" : [0.3, 0.7] | ||||||
Review comment: the weights as originally submitted (`[0.4, 0.7]`) will fail because they do not sum to 1.0. |
||||||
} | ||||||
} | ||||||
} | ||||||
} | ||||||
] | ||||||
} | ||||||
``` | ||||||
{% include copy-curl.html %} | ||||||
|
||||||
### Using a search pipeline | ||||||
|
||||||
Provide the query clauses that you want to combine in a `hybrid` query and apply the search pipeline created in the previous section so the scores are combined using the chosen techniques: | ||||||
Review comment: combine "into"? |
||||||
|
||||||
```json | ||||||
POST flicker-index/_search?search_pipeline=my_pipeline | ||||||
{ | ||||||
"query": { | ||||||
"hybrid": { | ||||||
"queries": [ | ||||||
{ | ||||||
"neural": { | ||||||
"passage_embedding": { | ||||||
"query_text": "Girl with Brown Hair", | ||||||
"model_id": "ABCBMODELID", | ||||||
"k": 20 | ||||||
} | ||||||
} | ||||||
}, | ||||||
{ | ||||||
"match": { | ||||||
"passage_text": "Girl Brown hair" | ||||||
} | ||||||
} | ||||||
] | ||||||
} | ||||||
} | ||||||
} | ||||||
``` | ||||||
{% include copy-curl.html %} | ||||||
|
||||||
For more information, see [Hybrid query]({{site.url}}{{site.baseurl}}/query-dsl/compound/hybrid/). | ||||||
Review comment: @kolchfa-aws Can we please add a section (or include in one of the existing sections) with the information below. Search tuning: We have identified some recommendations on tuning search relevancy.
If you're not seeing some results that are expected from a hybrid query, that can be due to the size being too small for each of the subqueries. Only results returned by each individual subquery are passed to the normalization processor; it does not perform additional sampling. |
||||||
|
||||||
The `normalization_processor` does not produce consistent results for a cluster with one node and one shard. | ||||||
Review comment: We actually developed a workaround for the one-shard case in hybrid search. It's more a limitation on the core side due to an optimization: for one shard, core takes shortcuts and the actual fetch phase is executed before the normalization_processor. I'm not sure where to put this warning, as it's not specific to hybrid search or the normalization processor; it's part of the processor for search pipelines in core. @navneet1v what do you think?
Review comment: @kolchfa-aws we should remove this warning. This warning is no longer applicable. Plus we don't need to mention that we did a workaround. |
||||||
{: .warning} |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -13,9 +13,12 @@ Search processors can be of the following types: | |||||
|
||||||
- [Search request processors](#search-request-processors) | ||||||
- [Search response processors](#search-response-processors) | ||||||
- [Search phase results processors](#search-phase-results-processors) | ||||||
|
||||||
## Search request processors | ||||||
|
||||||
A search request processor takes a search request (the query and the metadata passed in the request) and performs an operation on the search request before submitting the search request to the index. | ||||||
|
||||||
|
||||||
The following table lists all supported search request processors. | ||||||
|
||||||
Processor | Description | Earliest available version | ||||||
|
@@ -25,13 +28,25 @@ Processor | Description | Earliest available version | |||||
|
||||||
## Search response processors | ||||||
|
||||||
A search response processor performs an operation on the search response and returns a search response. | ||||||
Review comment: Confirm that the articles before the last two instances of "search response" are correct. |
||||||
|
||||||
The following table lists all supported search response processors. | ||||||
|
||||||
Processor | Description | Earliest available version | ||||||
:--- | :--- | :--- | ||||||
[`rename_field`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/rename-field-processor/)| Renames an existing field. | 2.8 | ||||||
[`personalize_search_ranking`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/personalize-search-ranking/) | Uses [Amazon Personalize](https://aws.amazon.com/personalize/) to rerank search results (requires setting up the Amazon Personalize service). | 2.9 | ||||||
|
||||||
## Search phase results processors | ||||||
|
||||||
A search phase results processor runs between search phases at the coordinating node level. It takes the results retrieved from one search phase and transforms them before passing them to the next search phase. | ||||||
|
||||||
|
||||||
The following table lists all supported search phase results processors. | ||||||
|
||||||
Processor | Description | Earliest available version | ||||||
:--- | :--- | :--- | ||||||
[`normalization_processor`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/) | Intercepts the query phase results and normalizes and combines the document scores before passing the documents to the fetch phase. | 2.10 | ||||||
|
||||||
## Viewing available processor types | ||||||
|
||||||
You can use the Nodes Search Pipelines API to view the available processor types: | ||||||
|