
[BUG] - OpenSearchHadoopIllegalArgumentException: invalid pattern given t4-nearby-dealers-v2 #391

Closed
elangovankrishna opened this issue Jan 9, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@elangovankrishna

What is the bug?

When using the Spark OpenSearch connector opensearch-spark-30_2.12-1.0.1-20240108.222620-77 with:

Spark version: 3.3
Scala version: 2.12
OpenSearch version: 2.11 (AWS managed)

I get the error below:

24/01/08 22:17:59 INFO OpenSearchDataFrameWriter: Writing to [t4-nearby-dealers-v2/https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com]
24/01/08 22:17:59 INFO OpenSearchDataFrameWriter: Writing to [t4-nearby-dealers-v2/https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com]
24/01/08 22:17:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 2)
org.opensearch.hadoop.OpenSearchHadoopIllegalArgumentException: invalid pattern given t4-nearby-dealers-v2/https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com
    at org.opensearch.hadoop.util.Assert.isTrue(Assert.java:70)
    at org.opensearch.hadoop.serialization.field.AbstractIndexExtractor.compile(AbstractIndexExtractor.java:69)
    at org.opensearch.hadoop.rest.RestService.createWriter(RestService.java:604)
    at org.opensearch.spark.rdd.OpenSearchRDDWriter.write(OpenSearchRDDWriter.scala:77)
    at org.opensearch.spark.sql.OpenSearchSparkSQL$.$anonfun$saveToOpenSearch$1(OpenSearchSparkSQL.scala:113)
    at org.opensearch.spark.sql.OpenSearchSparkSQL$.$anonfun$saveToOpenSearch$1$adapted(OpenSearchSparkSQL.scala:113)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

How can one reproduce the bug?

With a cluster running the versions above, and Spark and Scala installed, running the code below to index into OpenSearch results in the error:

```python
import sys

from pyspark.sql import SparkSession

gcs_file = sys.argv[1]
index_name = sys.argv[2]
index_type = sys.argv[3]

spark = (
    SparkSession.builder.appName("Indexing_{0}".format(index_type))
    .config("opensearch.port", "443")
    .config("opensearch.nodes", "https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com")
    .config("opensearch.nodes.wan.only", "true")
    .config("opensearch.index.auto.create", "yes")
    .config("opensearch.batch.size.bytes", "25mb")
    .config("opensearch.batch.size.entries", "0")
    .config("opensearch.net.ssl", "true")
    .config("spark.es.batch.write.retry.count", "-1")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.json(gcs_file).repartition(2)

# index_type is appended to the index name here -- see the discussion below
df.write.format("org.opensearch.spark.sql").mode("Overwrite").save("{0}/{1}".format(index_name, index_type))

spark.stop()  # was df.stop(); stop() belongs to the SparkSession, not the DataFrame
```

What is the expected behavior?

A new index is created on OpenSearch.

What is your host/environment?

Command used to run the above code:

spark-submit --jars "/Users/opensearch-spark-30_2.12-1.0.1-20240108.222620-77.jar" \
  emr_utils/spark_indexer_os.py \
  "/Users/Downloads/v2_2024_0108_data_000000000000.jsonl.gz" \
  "t4-nearby-dealers-v2" \
  "https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com"


Do you have any additional context?

Is the opensearch-hadoop connector only compatible with OpenSearch version 2.7, and not with later releases?

elangovankrishna added the bug, untriaged labels on Jan 9, 2024
Xtansia (Collaborator) commented on Jan 9, 2024

Hi @elangovankrishna

At first glance, based on how you're running your code, you're passing the domain URI as the "index_type" argument; it gets appended to the index name, which is most likely what's causing this issue (see the sketch below).
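To make that concrete, here is a small illustrative sketch. The values are hypothetical, taken from the spark-submit command in the report rather than from connector internals, and show the resource string the script ends up handing to `save()`:

```python
# Positional arguments as passed in the spark-submit command above:
gcs_file = "/Users/Downloads/v2_2024_0108_data_000000000000.jsonl.gz"           # sys.argv[1]
index_name = "t4-nearby-dealers-v2"                                             # sys.argv[2]
index_type = "https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com" # sys.argv[3]: the domain URI

# The script builds the save() target as "<index>/<type>":
resource = "{0}/{1}".format(index_name, index_type)
print(resource)
# t4-nearby-dealers-v2/https://vpc-osprdcore-kvxhwlq2gogz6l3.us-east-1.es.amazonaws.com
#
# This is exactly the string rejected in the stack trace:
#   OpenSearchHadoopIllegalArgumentException: invalid pattern given t4-nearby-dealers-v2/https://...
```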

I'd also point out that OpenSearch v2+ deprecated the index "types".
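A minimal corrected write would therefore target only the index name; this sketch matches the fix the reporter confirms in the next comment:

```python
# Write with just the index name as the resource; no "/<type>" segment.
df.write.format("org.opensearch.spark.sql").mode("overwrite").save(index_name)
```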

Xtansia removed the untriaged label on Jan 9, 2024
elangovankrishna (Author) commented on Jan 9, 2024

@Xtansia Thanks for pointing that out. I've removed that line of code; it's now `df.write.format("org.opensearch.spark.sql").mode("Overwrite").save("{0}".format(index_name))`, and that fixed the error.

Xtansia (Collaborator) commented on Jan 9, 2024

Glad to hear it's working for you now 👍

Xtansia closed this as completed on Jan 9, 2024