
TaskModel + AccumInfo refactor for memory optimisation #1540

Closed · wants to merge 3 commits

Conversation


@sayedbilalbari sayedbilalbari commented Feb 11, 2025

Solves #815

Changes -

This PR introduces the following refactors -

  1. core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumInfo.scala - previously AccumInfo used a HashMap to maintain the task-to-update-value mapping. HashMap boxes its data, i.e. each primitive long is converted to a java.lang.Long, which adds extra memory overhead (object references, hash codes) and inflates each value from 8B to 24B. LongMap avoids this by internally using arrays of primitive types.
  2. core/src/main/scala/org/apache/spark/sql/rapids/tool/store/TaskModel.scala - TaskModel is a class that maintains details related to a TaskInfo. On inspection, the data in TaskModel turned out to be quite sparse (metrics like bytesRead and bytesWritten are present in only a few tasks). Hence the structure is updated to keep the long-valued metrics in a map and store them only when they are non-zero.
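A minimal sketch of both ideas (hypothetical names, not this PR's actual code): a `LongMap` keeps its keys in a primitive `Array[Long]`, so the task-to-value mapping avoids boxing, and an `immutable.IntMap[Long]` keyed by a metric id holds only the non-zero metrics of a sparse task record.

```scala
import scala.collection.{immutable, mutable}

object SparseTaskSketch {
  // Idea 1: mutable.LongMap stores keys in a primitive array internally,
  // avoiding the java.lang.Long boxing that HashMap[Long, Long] incurs.
  val taskUpdates = mutable.LongMap.empty[Long]

  def addUpdate(taskId: Long, update: Long): Unit = {
    taskUpdates(taskId) = taskUpdates.getOrElse(taskId, 0L) + update
  }

  // Idea 2: keep sparse metrics in an IntMap keyed by a metric id,
  // storing an entry only when the value is non-zero.
  final val BYTES_READ = 0    // hypothetical metric ids
  final val BYTES_WRITTEN = 1

  var taskMetrics = immutable.IntMap.empty[Long]

  def storeIfNonZero(field: Int, value: Long): Unit = {
    if (value != 0L) {
      taskMetrics = taskMetrics.updated(field, value)
    }
  }

  // Reads fall back to zero for absent metrics.
  def metric(field: Int): Long = taskMetrics.getOrElse(field, 0L)
}
```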

Improvements -

This refactor improves the shallow memory usage by ~35%. Attached are the screenshots from the peak heap dump -

Old heap dump -

Screenshot 2025-02-10 at 5 32 33 PM

Post refactor dump -

Screenshot 2025-02-10 at 5 27 57 PM

Testing -

This has been tested against a Databricks event log file and reduces the minimum heap required by ~15%, down from 14GB to 12GB.

Although the heap analysis shows peak usage going down to ~9GB, the extra heap buffer is still needed because LongMap and IntMap reallocate by doubling the current buffer size, even when the full capacity is not required.
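One possible mitigation within the standard library (a hypothetical sketch, not part of this PR): when the number of tasks is known up front, pre-sizing the LongMap via its `initialBufferSize` constructor avoids the repeated doubling reallocations described above.

```scala
import scala.collection.mutable

// Hypothetical helper: allocate the backing buffer once, sized for the
// expected task count, instead of growing from the default by doubling.
def preSizedUpdates(expectedTasks: Int): mutable.LongMap[Long] =
  new mutable.LongMap[Long](initialBufferSize = expectedTasks)

val updates = preSizedUpdates(1024)
updates(1L) = 10L
```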

Optimisation -

A further optimisation would be to use third-party map implementations that can better leverage the remaining heap space.

Update on further testing

On further testing and discussion, this result turned out not to be very consistent, mainly because of the way LongMap and IntMap function (over-allocation of fixed buffer memory).

A better long-term solution would be to get rid of TaskModel altogether and leverage the Accumulable information instead. TaskModel is only used to compute the task-level metrics (min, median, max, sum), and all of these can be retrieved from the accumulable information we are already storing.
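As a rough illustration of that direction (hypothetical helper, not code from this repo), the task-level statistics can be folded directly out of the stored per-task accumulable updates, with no per-task model retained:

```scala
// Hypothetical sketch: derive min/median/max/sum for one accumulable
// from its per-task update values.
case class StatsSummary(min: Long, median: Long, max: Long, sum: Long)

def summarize(updates: Seq[Long]): Option[StatsSummary] = {
  if (updates.isEmpty) None
  else {
    val sorted = updates.sorted
    Some(StatsSummary(
      min = sorted.head,
      median = sorted(sorted.length / 2), // upper median for even counts
      max = sorted.last,
      sum = sorted.sum))
  }
}
```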

Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>

@parthosa left a comment


Thanks @bilalbari for this improved logic. Minor nits

storeIfNonZero(EXEC_DESERIALIZE_TIME, metrics.executorDeserializeTime)
storeIfNonZero(EXEC_DESERIALIZE_CPU_TIME, metrics.executorDeserializeCpuTime)
storeIfNonZero(EXEC_RUN_TIME, metrics.executorRunTime)
storeIfNonZero(EXEC_CPU_TIME, metrics.executorCpuTime)

nit: Can we retain some of the comments that were present in the original metrics? (e.g. nanoseconds)

output_recordsWritten: Long)




nit: Is there an extra line before the object?


var taskMetrics = immutable.IntMap.empty[Long]

def storeIfNonZero(field: Int, value: Long): Unit =

nit: Can we use {} instead of inlined statements, and add a new line after the function definition?
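For illustration, the braced form the reviewer is asking for might look like this (a sketch against the snippet above, not the final code):

```scala
import scala.collection.immutable

var taskMetrics = immutable.IntMap.empty[Long]

// Braced body and a blank line after the definition, per the nit above.
def storeIfNonZero(field: Int, value: Long): Unit = {
  if (value != 0L) {
    taskMetrics = taskMetrics.updated(field, value)
  }
}
```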

@sayedbilalbari (Collaborator, Author)

This PR is outdated due to the changes in the incoming PR - #1543.
Closing this
