
Deal with Control Characters in DynamoDB #5437

Merged

Conversation

@Galactus22625 (Member) commented on Feb 14, 2025:

Description

Manually parse DDB records instead of converting them through EnhancedDocument JSON.

  • Tested by adding control characters to test_writeSingleRecordToBuffer_with_other_data in StreamRecordConverterTest.
  • Tested by running the DynamoDB source pipeline locally and successfully ingesting records with control characters.
  • Tested by running the DynamoDB source pipeline locally and ingesting records with more complex data structures (lists, maps, lists and maps nested inside lists, and lists and maps nested inside maps) as well as every other DDB data type, then comparing the output to that of unmodified Data Prepper.

Issues Resolved

Resolves #5027. Fixes a bug ingesting control characters from DynamoDB.

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Maxwell Brown <mxwelwbr@amazon.com>
@Galactus22625 (Member Author) commented:

Not sure if I need to make the same change in the Export Record Converter, since I'm not sure what the export option does for the DynamoDB source.

@@ -67,6 +68,7 @@ public StreamRecordConverter(final BufferAccumulator<org.opensearch.dataprepper.
      this.bytesProcessedSummary = pluginMetrics.summary(BYTES_PROCESSED);
      this.streamConfig = streamConfig;

+     MAPPER.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, true);
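
For context, a minimal standalone sketch of what this Jackson feature changes (the payload and variable names are illustrative): an unescaped control character inside a JSON string is normally a parse error, and the flag relaxes that check.

import java.util.Map;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, true);
// "\t" compiles to a raw tab byte, so this JSON string contains an
// unescaped control character; without the flag, readValue() throws.
Map<String, Object> parsed = mapper.readValue(
        "{\"message\":\"before\tafter\"}",
        new TypeReference<Map<String, Object>>() {});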
Member commented:

Thank you @Galactus22625 for working on this.

I tend to think this is not the right approach. Here are a few reasons.

  1. The DDB records do in fact contain properly escaped control characters. So the ultimate bug appears to be that we are unescaping them (probably through the EnhancedDocument).
  2. This may result in allowing leniency that we don't actually want to allow. Again, this is not a problem with the input data.
  3. There may be other issues with the serialization we miss from this.
  4. Looking at the code, there is a bit of overhead in this EnhancedDocument approach anyway. We take the DDB record, make a JSON string, then deserialize it.

See this comment in particular:

#5027 (comment)

It appears that the best options are:

  1. Contact the DDB team to see if this is the expected behavior of EnhancedDocument. Perhaps by opening a GitHub issue.
  2. Or, change the implementation to parse the DDB record using the approach I took in the proof-of-concept that I shared in the comment above (a rough sketch follows below).
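
A rough sketch of what option 2 could look like, assuming the AWS SDK v2 AttributeValue API; the method names are illustrative, not the final implementation. The idea is to walk the attribute map directly instead of round-tripping through EnhancedDocument JSON, so string values (and any control characters in them) are never re-serialized:

import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

// Inside the converter class:
private Map<String, Object> convertData(final Map<String, AttributeValue> data) {
    final Map<String, Object> result = new HashMap<>();
    data.forEach((attributeName, attributeValue) ->
            result.put(attributeName, processAttributeValue(attributeValue)));
    return result;
}

private Object processAttributeValue(final AttributeValue attributeValue) {
    switch (attributeValue.type()) {
        case S:
            return attributeValue.s();   // control characters pass through untouched
        case N:
            return new BigDecimal(attributeValue.n());
        case BOOL:
            return attributeValue.bool();
        case NUL:
            return null;
        case L:
            return attributeValue.l().stream()
                    .map(this::processAttributeValue)
                    .collect(Collectors.toList());
        case M:
            return convertData(attributeValue.m());
        default:
            throw new IllegalArgumentException("Unsupported attribute type: " + attributeValue.type());
    }
}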

Member Author replied:

OK, I will try implementing a solution in the style of the proof-of-concept from the comment above.

Signed-off-by: Maxwell Brown <mxwelwbr@amazon.com>
- private Map<String, Object> convertData(Map<String, AttributeValue> data) throws JsonProcessingException {
-     String jsonData = EnhancedDocument.fromAttributeValueMap(data).toJson();
-     return MAPPER.readValue(jsonData, MAP_TYPE_REFERENCE);
+ private Map<String, Object> convertData(Map<String, AttributeValue> data) {
Member commented:

Please add unit tests to cover all these data types to verify that we handle them correctly.
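
A minimal example of the kind of test being requested, assuming JUnit 5; the converter instance and method access shown are hypothetical (the real test may need to go through the public record-conversion path):

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Map;
import org.junit.jupiter.api.Test;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

@Test
void nested_types_and_control_characters_convert_correctly() {
    final Map<String, AttributeValue> data = Map.of(
            "text", AttributeValue.builder().s("tab\there").build(),
            "nested", AttributeValue.builder().m(Map.of(
                    "flag", AttributeValue.builder().bool(true).build())).build());

    // 'converter' is a hypothetical instance exposing the conversion logic
    final Map<String, Object> result = converter.convertData(data);

    assertEquals("tab\there", result.get("text"));
    assertEquals(Map.of("flag", true), result.get("nested"));
}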

Member Author replied:

Unit tests added.

    break;
case BS: // BS for Binary Set
    result.put(attributeName, attributeValue.bs().stream()
            .map(buffer -> BASE64_ENCODER.encodeToString(buffer.asByteArray()))
Member commented:

Is this the current behavior - base64 encoding the string?

Member Author replied:

Yes, I made sure the output matches what was being output before.

Member Author added:

Also, DynamoDB only accepts base64-encoded binary values.
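
For context, assuming BASE64_ENCODER is java.util.Base64.getEncoder(), the encoding in question looks like this:

import java.util.Base64;
import software.amazon.awssdk.core.SdkBytes;

// DynamoDB represents binary (B/BS) values as base64 strings in JSON,
// so encoding here reproduces what the old JSON round-trip emitted.
SdkBytes bytes = SdkBytes.fromUtf8String("hello");
String encoded = Base64.getEncoder().encodeToString(bytes.asByteArray());
// encoded -> "aGVsbG8="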


data.forEach(((attributeValue) -> {
    // dealing with all DynamoDB attribute types
    switch (attributeValue.type()) {
Member commented:

You could consolidate this code with the code above.

private Object processAttributeValue(AttributeValue attributeValue) {
    switch (attributeValue.type()) {
        case N:
            return new BigDecimal(attributeValue.n());
        case B:
            return BASE64_ENCODER.encodeToString(attributeValue.b().asByteArray());
        // ... remaining cases
    }
}

And then use it. The following should work (or be close to working):

private List<Object> convertListData(List<AttributeValue> data) {
    return data.stream().map(this::processAttributeValue).collect(Collectors.toList());
}

private Map<String, Object> convertMapData(Map<String, AttributeValue> data) {
    return data.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, entry -> processAttributeValue(entry.getValue())));
}

Member Author replied:

OK, will consolidate.

        result.add(null);
        break;
    default:
        throw new IllegalStateException("Unsupported attribute type: " + attributeValue.type());
Member commented:

This should probably be InvalidArgumentException.

Member Author replied:

OK, will change. I was matching the error message that was being thrown before.

Signed-off-by: Maxwell Brown <mxwelwbr@amazon.com>
@dlvenable (Member) left a review comment:

Very nice. Thank you for this improvement!

@san81 merged commit 528e032 into opensearch-project:main on Feb 18, 2025
44 of 47 checks passed
@Galactus22625 deleted the DDB-source-character-parsing branch on February 18, 2025 at 20:51
divbok pushed a commit to divbok/data-prepper that referenced this pull request Feb 24, 2025
* Deal with Control Characters in DynamoDB

Signed-off-by: Maxwell Brown <mxwelwbr@amazon.com>
Development

Successfully merging this pull request may close these issues.

[BUG] DynamoDB Source doesn't support parsing data with Control Characters