Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ITs to check embeddings are updated when using TextEmbeddingProcessors in document reindex/update scenarios #1205

Open
q-andy opened this issue Feb 28, 2025 · 0 comments

Comments

@q-andy
Copy link

q-andy commented Feb 28, 2025

Is your feature request related to a problem?

Currently we are facing an issue from OS core while testing #1191 where document field updates are not resulting in regenerating embeddings when using TextImageEmbeddingProcessor. The root cause seems to be the ingestion isn't hitting the processor at all.

In this case the root cause is OS core, but we should have tests for TextEmbeddingProcessor/TextImageEmbeddingProcessor to verify that embeddings are updated correctly as well to catch and verified these problems earlier

@junqiu-lei verified issue with the following:

public void testBulkUpdateUpdatesEmbeddings() throws Exception {
      // Upload and load the model
      String modelId = uploadTextEmbeddingModel();
      loadModel(modelId);

      // Create pipeline with text embedding processor that maps "text" to "text_embedding"
      String pipelineConfig = "{\n"
          + "  \"description\": \"text embedding pipeline for bulk update test\",\n"
          + "  \"processors\": [\n"
          + "    {\n"
          + "      \"text_embedding\": {\n"
          + "        \"model_id\": \"%s\",\n"
          + "        \"field_map\": {\n"
          + "          \"text\": \"text_embedding\"\n"
          + "        }\n"
          + "      }\n"
          + "    }\n"
          + "  ]\n"
          + "}\n";

      createPipelineProcessor(String.format(pipelineConfig, modelId), PIPELINE_NAME, modelId, null);

      // Create index with pipeline
      createIndexWithPipeline(INDEX_NAME, "IndexMappings.json", PIPELINE_NAME);

      // Index original document
      String documentId = "bulk-update-test";
      String originalText = "{ \"text\": \"hello world\" }";
      ingestDocument(INDEX_NAME, originalText, documentId);

      // Get the original embedding
      Map<String, Object> originalDoc = (Map<String, Object>) getDocById(INDEX_NAME, documentId).get("_source");
      List<Double> originalEmbedding = (List<Double>) originalDoc.get("text_embedding");
      assertNotNull(originalEmbedding);

      // Update document using _bulk API with pipeline parameter
      String updatedText = "bye world";
      String bulkRequest = "{ \"update\": { \"_index\": \""
          + INDEX_NAME
          + "\", \"_id\": \""
          + documentId
          + "\", \"pipeline\": \""
          + PIPELINE_NAME
          + "\" } }\n"
          + "{ \"doc\" : { \"text\": \""
          + updatedText
          + "\" } }\n";

      Map<String, String> params = new HashMap<>();
      params.put("refresh", "true");
      makeRequest(
          client(),
          "POST",
          "_bulk",
          params,
          toHttpEntity(bulkRequest),
          ImmutableList.of(new BasicHeader(HttpHeaders.USER_AGENT, "Kibana"))
      );

      // Get the updated embedding
      Map<String, Object> updatedDoc = (Map<String, Object>) getDocById(INDEX_NAME, documentId).get("_source");
      List<Double> updatedEmbedding = (List<Double>) updatedDoc.get("text_embedding");
      assertNotNull(updatedEmbedding);

      // Verify the text was updated

      assertEquals(updatedText, updatedDoc.get("text"));

      // Verify the embedding has changed
      assertNotEquals(originalEmbedding, updatedEmbedding);
  }

Test fails with

java.lang.AssertionError: Values should be different. Actual: [0.44230038, ....

What solution would you like?

Adding additional checks like the above in reindex integration tests for TextEmbeddingProcessor and TextImageEmbeddingProcessor added in #1075 to verify document embeddings are updated when the documents are updated

What alternatives have you considered?

Not having these tests, adding tests in core instead

Do you have any additional context?

Will update with core issue when it is cut or resolved

@martin-gaievski martin-gaievski changed the title [FEATURE] Add ITs to check embeddings are updated when using TextEmbeddingProcessors in document reindex/update scenarios Add ITs to check embeddings are updated when using TextEmbeddingProcessors in document reindex/update scenarios Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant