[BUG] - OpenSearchHadoopIllegalArgumentException: invalid pattern given t4-nearby-dealers-v2
#391
Labels: bug
What is the bug?
When using the Spark OpenSearch connector opensearch-spark-30_2.12-1.0.1-20240108.222620-77 with:
Spark version - 3.3
Scala - 2.12
OpenSearch version - 2.11 (AWS managed)
I am getting the error below:
```
24/01/08 22:17:59 INFO OpenSearchDataFrameWriter: Writing to [t4-nearby-dealers-v2/https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com]
24/01/08 22:17:59 INFO OpenSearchDataFrameWriter: Writing to [t4-nearby-dealers-v2/https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com]
24/01/08 22:17:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 2)
org.opensearch.hadoop.OpenSearchHadoopIllegalArgumentException: invalid pattern given t4-nearby-dealers-v2/https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com
	at org.opensearch.hadoop.util.Assert.isTrue(Assert.java:70)
	at org.opensearch.hadoop.serialization.field.AbstractIndexExtractor.compile(AbstractIndexExtractor.java:69)
	at org.opensearch.hadoop.rest.RestService.createWriter(RestService.java:604)
	at org.opensearch.spark.rdd.OpenSearchRDDWriter.write(OpenSearchRDDWriter.scala:77)
	at org.opensearch.spark.sql.OpenSearchSparkSQL$.$anonfun$saveToOpenSearch$1(OpenSearchSparkSQL.scala:113)
	at org.opensearch.spark.sql.OpenSearchSparkSQL$.$anonfun$saveToOpenSearch$1$adapted(OpenSearchSparkSQL.scala:113)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
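For context, the resource string in the exception is the save path the repro script builds as `"{index_name}/{index_type}"`; the cluster URL occupies the type slot, and its extra `/` segments mean the path no longer looks like `index` or `index/type`. A minimal sketch of how that string is formed (variable names taken from the script below; this is illustration, not connector code):

```python
# Reconstructing the failing resource from the script's save() call.
index_name = "t4-nearby-dealers-v2"
index_type = "https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com"
resource = "{0}/{1}".format(index_name, index_type)
print(resource)  # "t4-nearby-dealers-v2/https://..." -- the "invalid pattern" above
```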
How can one reproduce the bug?
With the cluster and tooling versions above, running the following code to index into OpenSearch results in the error:
```python
import sys

from pyspark.sql import SparkSession

gcs_file = sys.argv[1]
index_name = sys.argv[2]
index_type = sys.argv[3]

spark = (
    SparkSession.builder.appName("Indexing_{0}".format(index_type))
    .config("opensearch.port", "443")
    .config("opensearch.nodes", "https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com")
    .config("opensearch.nodes.wan.only", "true")
    .config("opensearch.index.auto.create", "yes")
    .config("opensearch.batch.size.bytes", "25mb")
    .config("opensearch.batch.size.entries", "0")
    .config("opensearch.net.ssl", "true")
    .config("spark.es.batch.write.retry.count", "-1")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.json("{0}".format(gcs_file)).repartition(2)
df.write.format("org.opensearch.spark.sql").mode("Overwrite").save("{0}/{1}".format(index_name, index_type))
spark.stop()  # stop the Spark session after the write
```
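If the goal is a single fixed index, a minimal sketch that should pass the connector's resource validation keeps the endpoint in `opensearch.nodes` and hands `save()` only the index name (reusing `spark` and `df` from the snippet above):

```python
# Sketch: the resource is just the index; the endpoint is configured via
# opensearch.nodes and must not appear in the save path.
df.write.format("org.opensearch.spark.sql").mode("overwrite").save("t4-nearby-dealers-v2")
```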
What is the expected behavior?
Create a new index on opensearch.
What is your host/environment?
Command to run the above code:

```sh
spark-submit --jars "/Users/opensearch-spark-30_2.12-1.0.1-20240108.222620-77.jar" \
  emr_utils/spark_indexer_os.py \
  "/Users/Downloads/v2_2024_0108_data_000000000000.jsonl.gz" \
  "t4-nearby-dealers-v2" \
  "https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com"
```
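Note that the third positional argument here is the cluster URL, so `sys.argv[3]` (`index_type`) ends up inside the save path, which matches the resource shown in the error above. A hypothetical corrected invocation would pass a plain type name instead; the `_doc` value below is an assumption for illustration, not taken from the report:

```sh
# Hypothetical: the last argument is a type name, not the cluster endpoint.
spark-submit --jars "/Users/opensearch-spark-30_2.12-1.0.1-20240108.222620-77.jar" \
  emr_utils/spark_indexer_os.py \
  "/Users/Downloads/v2_2024_0108_data_000000000000.jsonl.gz" \
  "t4-nearby-dealers-v2" \
  "_doc"
```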
Do you have any screenshots?
If applicable, add screenshots to help explain your problem.
Do you have any additional context?
Is the OpenSearch Hadoop connector only compatible with OpenSearch 2.7, and not with later releases?