
Calculate task metric aggregates on-the-fly to reduce memory usage #1543

Merged · 16 commits · Feb 21, 2025

Conversation

@sayedbilalbari (Collaborator) commented Feb 13, 2025

Signed-off-by: Sayed Bilal Bari (sayedbilalbari)

Fixes #1545

This PR updates the AccumInfo logic to stop storing the updates associated with each task and instead maintain rolling statistics at the stage level. This reduces the overall memory usage of a qualification run.

Why this change?

  • This change aims to reduce overall heap usage of the tools qualification run.
  • The qualification tool parses Spark event logs, collates data in memory, and outputs aggregate metrics (SQL metrics, Job level metrics, etc.) to generate a recommendation report.
  • Major bottleneck identified:
    • The tool stores all update values coming in for each Task towards an Accumulable.
    • Because Tasks are numerous, the HashMap storing these update values can become very large.
      • For example, out of 14G of usage for an event log, the update map uses ~10G.
    • These update values are primarily used to calculate min, median, max, and total contribution of a stage towards an Accumulable (e.g., total shuffle).
  • Since the final usage is just min, median, max, and total, this PR changes the previous behavior:
    • On-the-fly calculation of these metrics without storing all update values in memory.
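
The rolling aggregation described above can be sketched as follows. This is a minimal illustration in Python, not the tool's actual Scala code; all names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RollingStats:
    """Rolling summary of task update values; replaces storing every update."""
    count: int = 0
    min_v: int = 0
    max_v: int = 0
    total: int = 0

    def update(self, value: int) -> None:
        # Fold one task's accumulable update into the summary in O(1) space.
        if self.count == 0:
            self.min_v = self.max_v = value
        else:
            self.min_v = min(self.min_v, value)
            self.max_v = max(self.max_v, value)
        self.total += value
        self.count += 1

    @property
    def mean(self) -> float:
        # Stands in for the strict median in the original design.
        return self.total / self.count if self.count else 0.0
```

Memory stays constant per accumulable regardless of how many tasks report updates, which is the core of the saving.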

Change

  1. AccumInfo:
    • Updated to remove the taskUpdateMap.
    • The stageUpdateMap now maintains, for each stage, a StatisticMetric object that is updated with each incoming task metric on a rolling basis, instead of storing every task's update value.
  2. AppSQLPlanAnalyzer:
    • Previously, there was significant aggregation at the stage level.
    • Updated to use a new method for calculating metrics at the stage level.
    • Removed other methods to favor the StageLevel metrics object.
  3. Profiler:
    • Timeline generation is retained for now; its removal will be tackled in a follow-up PR.
    • Updated the TimelineGeneration code to remove its dependence on the per-task update map.
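
As a rough sketch of the AccumInfo change (illustrative Python; the real implementation is Scala, and the names here are assumptions, not the tool's API):

```python
class AccumInfoSketch:
    """Keeps one small rolling summary per stage instead of per-task updates."""

    def __init__(self):
        # stageId -> [count, min, max, total]
        self.stage_values_map = {}

    def add_task_update(self, stage_id, value):
        stats = self.stage_values_map.get(stage_id)
        if stats is None:
            self.stage_values_map[stage_id] = [1, value, value, value]
        else:
            stats[0] += 1
            stats[1] = min(stats[1], value)
            stats[2] = max(stats[2], value)
            stats[3] += value

    def stage_stats(self, stage_id):
        count, mn, mx, total = self.stage_values_map[stage_id]
        # The rolling average stands in for the strict median.
        return {"min": mn, "max": mx, "total": total, "median": total / count}
```

The map grows with the number of stages, not the number of tasks, which is what removes the dominant memory cost.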

Testing

  • The output reports generated by this change were compared against the original output reports:
    • No change in the recommendations.
    • The only difference in raw metrics is in the median values, since a rolling average is now used instead of a strict median.
      • No major outliers or disparities were observed.
  • Local test cases fail only due to the slight change in median values; the differences observed are very small.
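
A tiny synthetic example of why the median values shift (the numbers are made up for illustration):

```python
import statistics

# Hypothetical per-task update values for one accumulable.
task_updates = [10, 12, 11, 15, 13]

strict_median = statistics.median(task_updates)       # old behavior
rolling_mean = sum(task_updates) / len(task_updates)  # new behavior
```

Here the strict median is 12 while the rolling mean is 12.2 — a small shift of exactly the kind the failing local tests pick up.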

Improvement

Metrics are for a 34G Databricks ZSTD event log:

  • Total runtime improved by 37%.
  • Peak heap usage improved by 68%, from 15G to 5G.
  • Minimum heap requirement improved by 85%, from 14G to 3G.


BenchmarkSuite output before change:

Benchmark :                               Best Time(ms)   Avg Time(ms)   Stdev(ms)      Avg GC Time(ms)       Avg GC Count     Stdev GC Count    Max GC Time(ms)       Max GC Count   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Enable_Per_SQL_Arg_Qualification                 722587         724549        2774              20098.0              364.0                 63              20601                396      1.00X

BenchmarkSuite output after change:

Benchmark :                               Best Time(ms)   Avg Time(ms)   Stdev(ms)      Avg GC Time(ms)       Avg GC Count     Stdev GC Count    Max GC Time(ms)       Max GC Count   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Enable_Per_SQL_Arg_Qualification                 449096         450413        1862               2951.0              198.0                  1               2955                199      1.00X
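
A quick arithmetic check on the Avg Time(ms) columns in the two tables above (plain Python, just for verification):

```python
before_avg_ms = 724549  # Avg Time(ms) before the change
after_avg_ms = 450413   # Avg Time(ms) after the change

# Fractional reduction in average benchmark runtime.
improvement = 1 - after_avg_ms / before_avg_ms
print(f"avg runtime improvement: {improvement:.1%}")
```

That works out to roughly 38% on average time, consistent with the ~37% total-runtime improvement quoted above.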

Metrics from running with the dev checkpoint enabled, using JVM params:
-Xms16G
-Xmx16G

Before -

Memory Marker: Post processing events, 
jvm.heap.max -> 17.17G
jvm.heap.total -> 17.17G
jvm.heap.free -> 4.2G

After -

Memory Marker: Post processing events, 
jvm.heap.max -> 17.17G
jvm.heap.total -> 17.17G
jvm.heap.free -> 14.77G
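
Reading the markers above, the heap actually in use (total minus free) drops sharply; a quick derivation:

```python
heap_total_gb = 17.17

# Used heap = total heap minus free heap, taken from the markers above.
used_before_gb = heap_total_gb - 4.2    # before the change
used_after_gb = heap_total_gb - 14.77   # after the change
```

Roughly 12.97G in use before versus about 2.4G after, in line with the peak-heap improvement reported earlier.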

This pull request includes multiple changes aimed at improving the performance and accuracy of the Spark RAPIDS tool. The most significant changes involve the removal of unused methods, updates to the handling of accumulator statistics, and improvements to the profiling and analysis components.

Codebase simplification and removal of unused methods:

  • Removed the getStageTaskIds and filterAccumTaskUpdatesForStage methods from the AppSQLPlanAnalyzer class, as they are no longer needed.

Enhancements to accumulator statistics:

  • Updated the AccumInfo class to use stageValuesMap for storing stage-level metrics, and added methods to calculate and retrieve statistics. This change improves the accuracy of the accumulator statistics by leveraging stage-level metrics.
  • Modified the StatisticsMetrics case class to use mutable fields for max and total, to support incremental updates.

Profiling and analysis improvements:

  • Updated the GenerateTimeline object to use stageValuesMap for calculating semaphore wait times and other metrics, improving the accuracy of the profiling data.
  • Adjusted the AppSparkMetricsAnalyzer class to leverage stage-level metrics for peak memory and shuffle write time calculations, enhancing the profiling of Photon applications.

Miscellaneous updates:

  • Updated the copyright year in several files to 2025.
  • Modified the profiling expectations CSV file to reflect the changes in the accumulator statistics.

These changes collectively improve the performance, accuracy, and maintainability of the Spark RAPIDS tool.

Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
@amahussein added the labels core_tools (Scope the core module (scala)) and performance (performance and scalability of tools) on Feb 13, 2025
@amahussein (Collaborator) commented Feb 13, 2025

Profiler - timeline generation is no longer used. Removed code for TimelineGeneration

Was it possible to use some workarounds to keep that functionality for now? If we can avoid removing the feature in this PR, that would be better. Otherwise, we will have to fix the CI/CD and the QA tests.

Metrics mentioned are for a 34G databricks zstd log
Total runtime comes down 27% ( from 9 min 20 sec to 6 min 50 sec )
Memory utilisation comes down 68% ( from 15G to 5G )

Is the runtime evaluation done through the benchmarkSuite?

What is meant here by "Memory utilisation"? Does it mean the minimum heap required to run the event log, or the total allocated bytes as measured by a Java profiler?

Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
@sayedbilalbari (Collaborator, Author)

@amahussein Thanks for the review!

  • Updated the code to keep GenerateTimeline for now.
  • Added the benchmarkSuite runtime and updated the exact improvement metric.
  • Added precise memory utilisation metrics as well (peak heap usage and min heap usage).

Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
@sayedbilalbari sayedbilalbari marked this pull request as ready for review February 14, 2025 01:46
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
@sayedbilalbari sayedbilalbari changed the title Memory Optimization - AccumInfo Refactor + Dead Code Removal Memory Optimization - AccumInfo Refactor Feb 14, 2025
@amahussein (Collaborator) left a comment

Thanks @sayedbilalbari
This is going to be a great milestone in the optimizations.

There are a few things we need to address:

  • When evaluating the memory savings, I would like to see the dev mode used to show the heap size after reading the event log, and then compare that against the dev branch. This method is far more reliable and consistent, and it helps us set a reference point for subsequent changes in the tools metrics. In addition, not all devs/reviewers use the same Java profiler; while JProfiler is handy, we want to converge on standard means of evaluating performance.
    • It would be a bonus to have a script that builds in dev mode to gather these numbers, but I understand if we want to keep that for a different PR.
  • We want to review how incomplete stages are handled. IIRC, taskEnd events are triggered before stageCompletion, and we create the stageModel only while handling stageSubmitted. We can do that in an offline discussion.
  • Since we are computing an average instead of a median, we should probably consider renaming the fields to average. We can also discuss that offline.

@sayedbilalbari (Collaborator, Author)

@amahussein Thanks for the review again!

  • Added the metrics from running the dev checkpoint to the description. Regarding the script, we can add that in a separate PR.
  • For handling missing StageCompleted events: on each TaskEnd, the stageUpdate map is updated with the cumulative update values.
  • We create the StageModel when we get a StageSubmitted event and update it on StageCompleted.
  • If we miss the StageCompleted event, the only metric we would miss is the completionTime; the rest is already populated from the StageSubmitted event.
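
The event-ordering behavior described above could look roughly like this (a sketch of the idea with hypothetical names, not the tool's actual code):

```python
class StageModelSketch:
    """Created on StageSubmitted; TaskEnd updates are folded in cumulatively."""

    def __init__(self, stage_id, submission_time):
        self.stage_id = stage_id
        self.submission_time = submission_time
        self.completion_time = None  # only filled if StageCompleted arrives
        self.task_updates_total = 0

    def on_task_end(self, update):
        # Cumulative and order-independent: no StageCompleted event required.
        self.task_updates_total += update

    def on_stage_completed(self, completion_time):
        self.completion_time = completion_time
```

If StageCompleted never arrives, only completion_time is missing; the rolling totals are already complete from the TaskEnd events.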

@amahussein (Collaborator) left a comment

Overall, looks great!
Some comments to make it even better :)

sayedbilalbari and others added 2 commits February 20, 2025 16:50
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
@amahussein amahussein changed the title Memory Optimization - AccumInfo Refactor Calculate task metric aggregates on-the-fly to reduce memory usage Feb 21, 2025
@amahussein (Collaborator) left a comment

LGTM! Thanks @sayedbilalbari!
Let's move forward with this to make it easy to investigate the correctness of the metrics.

@amahussein amahussein merged commit bf1e55e into NVIDIA:dev Feb 21, 2025
13 checks passed
Labels: core_tools (Scope the core module (scala)), performance (performance and scalability of tools)
Successfully merging this pull request may close this issue: [TASK] Improve core's memory utilization by reducing storage of Accum task updates
2 participants