Calculate task metric aggregates on-the-fly to reduce memory usage #1543
Conversation
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
Was it possible to do some workarounds to keep that functionality for now? If we can avoid removing the feature in this PR, it would be better; otherwise, we will have to fix the CI/CD and the QA tests.
Is the runtime evaluation done through the benchmarkSuite? What is meant here by "Memory utilisation"? Does it mean the minimum required heap to run the eventlog, or the total allocated bytes as measured by Java's profiler?
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
@amahussein Thanks for the review!
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
Thanks @sayedbilalbari
This is going to be a great milestone in the optimizations.
There are a few things we need to address:
- When evaluating the memory savings, I would like to see the `dev` mode used to show the heap size after reading the eventlog; then we compare that with the dev branch. This method is far more reliable and consistent, and it helps us set a reference for subsequent changes in the tools metrics. In addition, not all devs/reviewers would be using the same Java profiler tool. While JProfiler is handy, we want to converge on standard means to evaluate performance.
- It is a bonus if there is a script that builds in `dev` mode to gather these numbers, but I understand if we want to keep that in a different PR.
- We want to review how incomplete stages are handled. IIRC, taskEnd events are triggered before the stageCompletion. Also, we create stageModel only during the handling of stageSubmitted. We can do that in an offline discussion.
- Since we are doing average instead of median, we may consider renaming the fields to average instead. We can also discuss that offline.
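On the average-vs-median point: min, max, sum, and count can all be folded into a running aggregate one task update at a time in constant memory, whereas an exact median requires retaining every task value, which is exactly what this PR stops doing. A minimal illustrative sketch (hypothetical names, not the PR's actual code):

```scala
// Rolling aggregate: constant memory per accumulator, one fold per task update.
// An exact median would need all values retained, defeating the memory savings,
// which is why average is the natural on-the-fly replacement.
case class RollingStats(min: Long, max: Long, total: Long, count: Long) {
  def update(v: Long): RollingStats =
    RollingStats(math.min(min, v), math.max(max, v), total + v, count + 1)
  def average: Long = if (count == 0) 0L else total / count
}

object RollingStats {
  // Seed the aggregate from the first observed value.
  def fromValue(v: Long): RollingStats = RollingStats(v, v, v, 1)
}
```

Folding a stream of task updates is then a single `foldLeft`, e.g. `updates.foldLeft(RollingStats.fromValue(first))(_.update(_))`.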
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
@amahussein Thanks for the review again!
Overall, looks great!
Some comments to make it even better :)
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
LGTM! Thanks @sayedbilalbari!
Let's move forward with this to make it easy to investigate the correctness of metrics.
Signed-off-by: Sayed Bilal Bari (sayedbilalbari)
Fixes #1545
This PR updates the AccumInfo logic to stop storing the updates associated with each task and instead maintain rolling statistics at the stage level. This improves the overall memory usage of the qualification run.
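The stage-level rolling approach described above can be sketched roughly as follows. This is an illustrative sketch only: `AccumStore`, `StageStats`, and `onTaskUpdate` are hypothetical names, standing in for the PR's `AccumInfo`/`stageValuesMap` design, where each taskEnd folds its accumulator update into the stage's entry instead of appending to a per-task collection.

```scala
import scala.collection.mutable

// Per-stage rolling statistics, mutated in place on each task update.
class StageStats(var min: Long, var max: Long, var total: Long, var count: Long) {
  def add(v: Long): Unit = {
    min = math.min(min, v)
    max = math.max(max, v)
    total += v
    count += 1
  }
}

// Hypothetical stand-in for AccumInfo's stage-level map: memory is
// O(number of stages), not O(number of tasks).
class AccumStore {
  private val stageValues = mutable.Map[Int, StageStats]()

  def onTaskUpdate(stageId: Int, value: Long): Unit =
    stageValues.get(stageId) match {
      case Some(s) => s.add(value)                                  // fold into existing entry
      case None    => stageValues(stageId) = new StageStats(value, value, value, 1)
    }

  def statsFor(stageId: Int): Option[StageStats] = stageValues.get(stageId)
}
```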
Why this change?
- The `HashMap` storing these update values can become very large.

Change
- `AccumInfo`: removed `taskUpdateMap`. `stageUpdateMap` now stores a `StatisticMetric` object that is updated with the latest task metrics on a rolling basis.
- `AppSQLPlanAnalyzer`: updated to use the stage-level metrics object.
- `Profiler`: updated `TimelineGeneration` to remove the taskUpdate usage.

Testing
Improvement
Metrics are for a 34G Databricks ZSTD eventlog.
BenchmarkSuite output before change:
BenchmarkSuite output after change:
Metrics from running with the dev checkpoint enabled. Ran with JVM params `-Xms16G -Xmx16G`:
Before -
After -
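For the heap-size comparison discussed in review, one simple way to snapshot used heap right after the eventlog is parsed is a `Runtime`-based delta. This is a hypothetical helper for illustration; the tools' actual dev-mode hook may measure differently.

```scala
// Best-effort snapshot of currently used heap, in MB. A System.gc() call
// first reduces noise from garbage that is collectable but not yet collected;
// it is only a hint to the JVM, so readings remain approximate.
object HeapSnapshot {
  def usedHeapMB(): Long = {
    val rt = Runtime.getRuntime
    System.gc()
    (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L)
  }
}
```

Calling this at the same point on the dev branch and on this branch (same eventlog, same `-Xms`/`-Xmx`) gives a comparable before/after number without depending on any particular profiler.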
This pull request includes multiple changes aimed at improving the performance and accuracy of the Spark RAPIDS tool. The most significant changes involve the removal of unused methods, updates to the handling of accumulator statistics, and improvements to the profiling and analysis components.
Codebase simplification and removal of unused methods:
- Removed the `getStageTaskIds` and `filterAccumTaskUpdatesForStage` methods from the `AppSQLPlanAnalyzer` class as they are no longer needed. [1] [2] [3] [4]

Enhancements to accumulator statistics:
- Updated the `AccumInfo` class to use `stageValuesMap` for storing stage-level metrics and added methods to calculate and retrieve statistics. This change improves the accuracy of the accumulator statistics by leveraging stage-level metrics. [1] [2] [3]
- Changed the `StatisticsMetrics` case class to use mutable fields for `max` and `total` to support incremental updates.

Profiling and analysis improvements:
- Updated the `GenerateTimeline` object to use `stageValuesMap` for calculating semaphore wait times and other metrics, improving the accuracy of the profiling data. [1] [2]
- Updated the `AppSparkMetricsAnalyzer` class to leverage stage-level metrics for peak memory and shuffle write time calculations, enhancing the profiling of Photon applications.

Miscellaneous updates:
These changes collectively improve the performance, accuracy, and maintainability of the Spark RAPIDS tool.