Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce neural highlighter framework #1193

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

junqiu-lei
Copy link
Member

@junqiu-lei junqiu-lei commented Feb 19, 2025

Description

This PR introduces the Neural Highlighter framework from OpenSearch core Highlighter, enhancing the semantic highlighting capabilities. Key changes include:

  • Added NeuralHighlighter with core functionality.
  • Updated NeuralSearch and NeuralKNNQuery for integration.
  • Added unit test for NeuralHighlighter (IT will be added when all functionality is ready).

PR opensearch-project/ml-commons#3600 adds the functionality in ml-commons to use sentence highlighting model.

Related Issues

Part of #1182

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
@junqiu-lei junqiu-lei force-pushed the highlighter-framework branch from 1bc7505 to 9704f6c Compare February 19, 2025 22:41
public static final String NAME = "neural";
private static final String MODEL_ID_FIELD = "model_id";

private static MLCommonsClientAccessor mlCommonsClient;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This static pattern is not ideal for unit test. Can we receive accessor thought constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack


@Override
public boolean canHighlight(MappedFieldType fieldType) {
// TODO: Implement actual condition check in subsequent PR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be better to restrict it to TEXT field type and expand it later in subsequent PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


Map<String, Object> options = fieldContext.field.fieldOptions().options();
String modelId = getModelId(options);
log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log will be flooded and obscure other important logs. Let't move it to debug level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to add a temp placeholder here, since client api function was planed in another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadoc and a comment of all the important steps this method is trying to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
log.info("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic

// TODO: Implement actual highlighting logic in subsequent PR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all these TODO, it seems like this PR should not be merged to main now.
Let me stop here and review it when the code is fully completed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. My original purpose was trying to make the PR more clear by different component(highlighter framework, ml client, text processing logic...). I could also combine all code in one big PR.


private String formatHighlight(String text) {
// TODO: Implement user provided format options in subsequent PR
return "<em>" + text + "</em>";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use String format for safer concatenation

}
}

return query.toString().replaceAll("\\w+:", "").replaceAll("\\s+", " ").trim();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comment describing what you are trying to extract from string here

// For now, return a basic highlight of the field text
Text[] fragments = new Text[] { new Text(formatHighlight(fieldText)) };
return new HighlightField(fieldContext.fieldName, fragments);
} catch (Exception e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to narrow cases when we can have exceptions? Having global level try/catch is not that great for performance on search path

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, let me try to adjust to narrow exception catch cases.

}

private String getFieldText(FieldHighlightContext fieldContext) {
Object value = fieldContext.hitContext.sourceLookup().extractValue(fieldContext.fieldName, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this second argument a fallback in case requested key is missing? If yes can't we return empty string here and avoid the null check on a line below?


Map<String, Object> options = fieldContext.field.fieldOptions().options();
String modelId = getModelId(options);
log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadoc and a comment of all the important steps this method is trying to do.

Map<String, Object> options = fieldContext.field.fieldOptions().options();
String modelId = getModelId(options);
log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
log.info("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.info("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic
log.debug("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why do we need this log this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment along with this line code


private String extractOriginalQuery(Query query) {
if (query instanceof NeuralKNNQuery neuralQuery) {
String originalText = neuralQuery.getOriginalQueryText();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String originalText = neuralQuery.getOriginalQueryText();
String originalQuery = neuralQuery.getOriginalQueryText();

this.knnQuery = knnQuery;
this.originalQueryText = originalQueryText;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add BWC test to validate that it does not break anything in bwc environment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will have a check on bwc side

@@ -107,6 +109,11 @@ public Builder rescoreContext(RescoreContext rescoreContext) {
return this;
}

public Builder originalQueryText(String originalQueryText) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadoc. We already have an issue for reducing the warning logs #1095

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible please add the javadoc on all public methods of this class.

@junqiu-lei
Copy link
Member Author

junqiu-lei commented Mar 3, 2025

For reference in this PR, in ml-commons we have PR for sentence highlighting model support :
opensearch-project/ml-commons#3600

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Features Introduces a new unit of functionality that satisfies a requirement neural-search v3.0.0 v3.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants