[RFC] Handling OpenSearch geo_point Data in Spark #1053
Comments
Did you consider mapping to a geo_point (UDT) type in Spark?
Added that option to 3.4. I did not see much benefit in making it a UDT.
High-level question: is this targeted at unifying the data type for the PPL language, or do we see requirements from the real world (e.g. CWL)?
Maybe we could learn from Apache Sedona (https://github.com/apache/sedona), which provides spatial computing on top of Apache Spark.
We don't have specific requirements yet. This is part of supporting the OpenSearch sample data sets (they include geo_point fields).
That is a very good point. Apache Sedona uses a UDT and stores geometry data as the Geometry class (serialized to Array[Byte]) provided by the JTS Topology Suite. It also provides lots of UDFs (ref) to allow geographical calculation, aggregation, etc.
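Purely as an illustration of the Sedona-style UDT approach described above, a minimal sketch follows. The `GeoPoint` case class, the `GeoPointUDT` name, and the package placement are hypothetical, and it serializes to an `ARRAY<DOUBLE>` rather than JTS Geometry bytes:

```scala
// Sketch only: UserDefinedType is not a public Spark API, so third-party UDTs
// (e.g. Sedona's GeometryUDT) typically live under an org.apache.spark.sql.* package.
package org.apache.spark.sql.geopoint

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// Hypothetical value class for a geo_point.
case class GeoPoint(lon: Double, lat: Double)

// Minimal UDT: stores the point as ARRAY<DOUBLE> [lon, lat] under the hood,
// analogous to Sedona serializing JTS Geometry to Array[Byte].
class GeoPointUDT extends UserDefinedType[GeoPoint] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(obj: GeoPoint): Any =
    new GenericArrayData(Array[Any](obj.lon, obj.lat))

  override def deserialize(datum: Any): GeoPoint = datum match {
    case a: ArrayData => GeoPoint(a.getDouble(0), a.getDouble(1))
  }

  override def userClass: Class[GeoPoint] = classOf[GeoPoint]
}
```

Sedona then layers UDFs such as ST_Point and ST_Distance on top of its UDT; a geo_point UDT would presumably need a similar function library to be useful beyond storage.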
1. Introduction
1.1 Background
When using OpenSearch as a data source for Apache Spark, special handling is required for OpenSearch-specific data types such as geo_point. Spark does not have a native equivalent for geo_point, so a standard approach must be established for efficient querying and geospatial analysis.
Related Issue: #1047
1.2 Objective
This RFC proposes an approach for integrating OpenSearch geo_point fields into Spark DataFrames, ensuring proper data representation and efficient querying without relying on external geospatial libraries.
2. Problem Statement
2.1 OpenSearch geo_point Representation
OpenSearch stores geo_point fields in multiple formats, including:
- Array: [longitude, latitude]
- String: "lat,lon" or "POINT(longitude latitude)" (WKT)
- Object: { "lat": value, "lon": value }
(ref: https://opensearch.org/docs/latest/field-types/supported-field-types/geo-point/)
Spark does not natively support geo_point, so an appropriate transformation is needed when retrieving data.
2.2 Challenges
- Spark has no native type for geo_point, making data retrieval non-trivial.
- Querying geo_point fields efficiently is critical for large datasets.
3. Proposed Solution
3.1 Mapping OpenSearch geo_point to Spark DataFrames
We propose normalizing geo_point fields into one of the following Spark representations:

| OpenSearch format | Spark representation |
| --- | --- |
| [lon, lat] (Array) | STRUCT<lon: DOUBLE, lat: DOUBLE> |
| {"lat": x, "lon": y} (Object) | STRUCT<lon: DOUBLE, lat: DOUBLE> |
| "lat,lon" (String) | STRUCT<lon: DOUBLE, lat: DOUBLE> |
| "POINT(lon lat)" (WKT) | STRING (WKT format) |
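As a rough sketch of what this normalization could look like on the Spark side (the helper and column names below are hypothetical, not part of any existing connector API), the array and "lat,lon" string formats could both be coerced into the proposed struct:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Hypothetical helpers that coerce raw geo_point values into the proposed
// STRUCT<lon: DOUBLE, lat: DOUBLE> representation.

// Array format: [lon, lat]
def lonLatArrayToStruct(c: Column): Column =
  struct(
    c.getItem(0).cast("double").as("lon"),
    c.getItem(1).cast("double").as("lat")
  )

// String format: "lat,lon" (note the reversed order compared to the array form)
def latLonStringToStruct(c: Column): Column = {
  val parts = split(c, ",")
  struct(
    trim(parts.getItem(1)).cast("double").as("lon"),
    trim(parts.getItem(0)).cast("double").as("lat")
  )
}

// Usage sketch, assuming a DataFrame `df` with a raw string column "location":
// val normalized = df.withColumn("location", latLonStringToStruct(col("location")))
```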
3. Alternatives Considered
3.1 Keeping Original Representation
3.2 Keeping geo_point as JSON String
3.3 Using WKT Representation
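For comparison, the WKT alternative could keep the field as a plain STRING and parse it on demand without any geospatial library. A minimal sketch (the regex and function names are illustrative only):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Extract coordinates from a "POINT(lon lat)" WKT string into STRUCT<lon, lat>.
def wktPointToStruct(c: Column): Column = {
  val pattern = """POINT\s*\(\s*([-+0-9.eE]+)\s+([-+0-9.eE]+)\s*\)"""
  struct(
    regexp_extract(c, pattern, 1).cast("double").as("lon"),
    regexp_extract(c, pattern, 2).cast("double").as("lat")
  )
}

// And the reverse: format a STRUCT<lon, lat> back into WKT.
def structToWkt(c: Column): Column =
  concat(
    lit("POINT("),
    c.getField("lon").cast("string"), lit(" "),
    c.getField("lat").cast("string"), lit(")")
  )
```

The trade-off is that every query needing coordinates has to parse the string, whereas the STRUCT form exposes lon/lat directly.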
3.4 Using UDT
4. Open Questions
- … geo_point data?
Discussion & Feedback
Please provide feedback by commenting on this RFC or suggesting changes via a pull request.