-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TaskModel + AccumInfo refactor for memory optimisation #1540
Conversation
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bilalbari for this improved logic. Minor nits
storeIfNonZero(EXEC_DESERIALIZE_TIME, metrics.executorDeserializeTime) | ||
storeIfNonZero(EXEC_DESERIALIZE_CPU_TIME, metrics.executorDeserializeCpuTime) | ||
storeIfNonZero(EXEC_RUN_TIME, metrics.executorRunTime) | ||
storeIfNonZero(EXEC_CPU_TIME, metrics.executorCpuTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we retain some of the comments that were present in the original metrics? (e.g. nanoseconds)
output_recordsWritten: Long) | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is there an extra line before the object?
|
||
var taskMetrics = immutable.IntMap.empty[Long] | ||
|
||
def storeIfNonZero(field: Int, value: Long): Unit = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include{}
instead of inlined statements and a new line after the function definition?
This PR is outdated due to the changes in the incoming PR - #1543. |
Solves for #815
Changes -
This PR introduces the following refactors -
TaskModel
is a class that maintains details related to a TaskInfo. On observation the data related to TaskModel was pretty sparse( metrics like bytesread, byteswritten are only present in a few ). Hence updating the structure to maintain the long value metrics in a map and storing them only when required non zero.Improvements -
This refactor improves the shallow memory usage by ~35%. Attached are the screenshots from the peak heap dump -
Old heap dump -
Post refactor dump -
Testing -
This has been tested against a databricks event log file and improves the minimum heap required by 15% down from 12GB to 14GB
Although the heap analysis show the peak usage going down till 9GB, the reason it still needs the extra heap buffer is the usage of LongMap and IntMap which do reallocation where it double the current Buffer size even though it does not need it.
Optimisation -
Optimisation of this can be using 3rd party Map implementations which can further leverage the remaining heap space
Update on further testing
On further testing and discussion, this result is not very consistent though. Mainly because of the way LongMap and IntMap function ( over allocation of fixed buffer memory ).
A better long term solution would be to get rid of TaskModel altogether and leverage the Accumulable to get this information. This is just used to get the Task level metrics ( min, med, max, sum) and all of these can be retrieved from the accumulable information that we are storing.