
Generate a detailed report for the write ops #1544

Merged
amahussein merged 2 commits into NVIDIA:dev from rapids-tools-1536 on Feb 19, 2025

Conversation

Collaborator

@amahussein amahussein commented Feb 13, 2025

Signed-off-by: Ahmed Hussein (amahussein) a@ahussein.me

Fixes #1536

This commit adds a new report containing the write operations.
The generated report is mostly functional for InsertIntoHadoopFsRelationCommand; more incremental improvements will follow.

New report:

No text format is generated for this report. Including the path and the node description in plain text is not readable and would add unnecessary overhead.

The generated report is called write_operations.csv and it looks like the following:

  • sqlPlanVersion: the number of the plan version. This is helpful if we want to generate the write operations for all the plans when AQE is enabled.
  • fromFinalPlan: true when the plan is final.
  • If a field cannot be extracted, it is set to "unknown".
  • fullDescription: contains the actual node description, truncated at 500 characters.
  • outputColumns: lists the columns separated by semicolons. In case of a truncated schema, the column will include something like "... and 24 more fields" (see the parsing sketch after the sample rows below).
appIndex	sqlID	sqlPlanVersion	nodeId	fromFinalPlan	execName	format	location	tableName	dataBase	outputColumns	writeMode	fullDescription
1	4	0	0	true	InsertIntoHadoopFsRelationCommand	 Parquet	gs://path/to/store/database/table_name	table_name	database	station_id;date;element;value;mflag;qflag;sflag;time;year	Append	Execute InsertIntoHadoopFsRelationCommand gs://path/to/store/database/table_name, false, Parquet, [serialization.format=1, mergeschema=false, __hive_compatible_bucketed_table_insertion__=true], Append, `spark_catalog`.`weather`.`table_name`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, org.apache.spark.sql.execution.datasources.InMemoryFileIndex(gs://path/to/store/database/table_name), [station_id, date, element, value, mflag, qflag, sflag, time, year]
1	7	2	1	true	InsertIntoHadoopFsRelationCommand	 Parquet	gs://path/to/store/database/table_name_2	table_name_2	database	station_id;date;element;value;mflag;qflag;sflag;time;station_latitude;station_longitude;station_elevation;station_state;station_name	Append	Execute InsertIntoHadoopFsRelationCommand gs://path/to/store/database/table_name_2, false, Parquet, [serialization.format=1, mergeschema=false, __hive_compatible_bucketed_table_insertion__=true], Append, `spark_catalog`.`weather`.`table_name_2`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, org.apache.spark.sql.execution.datasources.InMemoryFileIndex(gs://path/to/store/database/table_name_2), [station_id, date, element, value, mflag, qflag, sflag, time, station_latitude, station_longitude, station_elevation, station_state, station_name]
1	10	0	0	true	InsertIntoHadoopFsRelationCommand	 Parquet	gs://path/to/store/database/temp_valid_data	temp_valid_data	database	station_id;date;element;value;mflag;qflag;sflag;time;station_latitude;station_longitude;station_elevation;station_state;station_name	Append	Execute InsertIntoHadoopFsRelationCommand gs://path/to/store/database/temp_valid_data, false, Parquet, [serialization.format=1, mergeschema=false, __hive_compatible_bucketed_table_insertion__=true], Append, `spark_catalog`.`weather`.`temp_valid_data`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, org.apache.spark.sql.execution.datasources.InMemoryFileIndex(gs://path/to/store/database/temp_valid_data), [station_id, date, element, value, mflag, qflag, sflag, time, station_latitude, station_longitude, station_elevation, station_state, station_name]
1	11	4	1	true	InsertIntoHadoopFsRelationCommand	 Parquet	gs://path/to/store/database/weather_result	weather_result	database	station_state;element;record_count;avg_value;max_value;min_value;rank;state_name	Overwrite	Execute InsertIntoHadoopFsRelationCommand gs://path/to/store/database/weather_result, false, Parquet, [serialization.format=1, mergeschema=false, __hive_compatible_bucketed_table_insertion__=true], Overwrite, `spark_catalog`.`weather`.`weather_result`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, org.apache.spark.sql.execution.datasources.InMemoryFileIndex(gs://path/to/store/database/weather_result), [station_state, element, record_count, avg_value, max_value, min_value, rank, state_name]
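For context, here is a minimal sketch (not part of this PR; the object name and output path are hypothetical) of how the generated write_operations.csv could be consumed downstream with Spark, splitting the semicolon-separated outputColumns field back into individual column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

object WriteOpsReportReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-ops-report-reader")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Spark's CSV reader handles the quoted fullDescription field, which
    // itself contains commas.
    val report = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/tool/output/write_operations.csv") // hypothetical location

    // Keep only rows from the final (AQE) plan version and split the
    // semicolon-separated outputColumns field back into individual columns.
    report
      .filter($"fromFinalPlan" === true)
      .select($"sqlID", $"execName", $"format", $"writeMode",
        split($"outputColumns", ";").as("columns"))
      .show(truncate = false)

    spark.stop()
  }
}
```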

What is missing in this PR:

  • The report does not include the metrics of the write op. We can consider adding those metrics if needed.
  • Add a unit test to evaluate the generated CSV file.
  • Evaluate the behavior with Hive operations like "InsertIntoHiveTable".
  • Extract the metadata from DeltaLake write operations.
  • Test with eventlogs generated by Spark 2.x.
  • We need a follow-up to remove the redundancy resulting from creating a ToolsPlanGraph outside the SqlPlanModel. This refactor will improve the memory utilization and reduce the overhead of building the graph multiple times.

Detailed description of the changes:

In addition to the new report, this pull request includes several changes to enhance the functionality and maintainability of the codebase, particularly in the planparser and profiling modules. The most important changes involve the addition of new metadata extraction methods, the introduction of a new utility class, and updates to existing classes to accommodate these changes.

Enhancements in metadata extraction and utility usage:

Introduction of StringUtils utility class:

Enhancements in profiling:

These changes collectively improve the code's robustness, maintainability, and functionality, particularly in handling write operations and profiling.
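As an illustration only (this is not the code added by the PR; the class and method names below are hypothetical), the metadata extraction amounts to pulling fields such as the location, format, and write mode out of a node description like the one shown in the fullDescription column:

```scala
// Hypothetical sketch of parsing an InsertIntoHadoopFsRelationCommand node
// description; the real planparser implementation is more complete.
case class WriteOpMetadata(
    execName: String,
    format: String,
    location: String,
    writeMode: String)

object WriteOpMetadata {
  private val Unknown = "unknown"

  // The node description looks like:
  //   "Execute InsertIntoHadoopFsRelationCommand <path>, false, Parquet, [...], Append, ..."
  def parseNodeDescr(nodeDescr: String): WriteOpMetadata = {
    val execName =
      if (nodeDescr.contains("InsertIntoHadoopFsRelationCommand")) {
        "InsertIntoHadoopFsRelationCommand"
      } else {
        Unknown
      }
    // Drop the leading "Execute <execName> " prefix, then split the argument list.
    val argsPart = nodeDescr.dropWhile(_ != ' ').trim.dropWhile(_ != ' ').trim
    val args = argsPart.split(",\\s+")
    val location = args.headOption.getOrElse(Unknown)
    val format = args.lift(2).map(_.trim).getOrElse(Unknown)
    // The write mode follows the options list; scan for known values instead of
    // relying on a fixed position, and fall back to "unknown" if none is found.
    val writeMode = Seq("Append", "Overwrite")
      .find(mode => args.exists(_.trim == mode))
      .getOrElse(Unknown)
    WriteOpMetadata(execName, format, location, writeMode)
  }
}
```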

@amahussein amahussein added the core_tools (Scope the core module (scala)) and API change (A change affecting the output (add/remove/rename files, add/remove/rename columns)) labels Feb 13, 2025
@amahussein amahussein self-assigned this Feb 13, 2025
Collaborator

@sayedbilalbari sayedbilalbari left a comment


No changes suggested. Just some clarifying questions.

Collaborator Author

@amahussein amahussein left a comment


Thanks @sayedbilalbari
I will add samples of the Node content to the code.

Collaborator Author

@amahussein amahussein left a comment


Thanks @sayedbilalbari for the offline discussion.
I addressed all your questions and comments.

Collaborator

@sayedbilalbari sayedbilalbari left a comment


LGTM!

@amahussein amahussein merged commit 78cab00 into NVIDIA:dev Feb 19, 2025
13 checks passed
@amahussein amahussein deleted the rapids-tools-1536 branch February 19, 2025 23:36
Labels
API change: A change affecting the output (add/remove/rename files, add/remove/rename columns); core_tools: Scope the core module (scala)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Generate a detailed report for the write ops
2 participants