[GraphBolt] modify preprocess_ondisk_dataset()
#6986
Please take a look at your convenience, @frozenbugs @czkkkkkk @Rhett-Ying.
Please check whether this peak memory issue also occurs in the current implementation: #7086
Benchmarking results are pasted in the description. @Rhett-Ying @czkkkkkk The good news is that the new implementation is generally faster than the old one and uses less memory on homogeneous datasets (ogbn-products). However, it uses more memory on datasets with a heterograph (ogbn-mag). I am going through the code to figure out why, but I have no lead yet. Any ideas on this issue would be appreciated.
Tried to change the dtype of
After change
Version: old | Dataset: ogbn-mag | include_EID: True
Memory Usage: 8.008 GB | Execution Time: 14.99524 seconds
Version: new | Dataset: ogbn-mag | include_EID: True
Memory Usage: 8.682 GB | Execution Time: 13.81134 seconds
Version: old | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.976 GB | Execution Time: 13.61416 seconds
Version: new | Dataset: ogbn-mag | include_EID: False
Memory Usage: 8.053 GB | Execution Time: 12.39217 seconds
After change
@Skeleton003 the default or previous dtype of
According to the
Yes, default or previous dtype of
Latest benchmark:
Version: old | Dataset: ogbn-mag | include_EID: True
Memory Usage: 7.983 GB | Execution Time: 14.86984 seconds
Version: new | Dataset: ogbn-mag | include_EID: True
Memory Usage: 6.674 GB | Execution Time: 14.08306 seconds
Version: old | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.969 GB | Execution Time: 13.24852 seconds
Version: new | Dataset: ogbn-mag | include_EID: False
Memory Usage: 6.674 GB | Execution Time: 12.80191 seconds
Version: old | Dataset: ogbn-products | include_EID: True
Memory Usage: 14.054 GB | Execution Time: 41.93387 seconds
Version: new | Dataset: ogbn-products | include_EID: True
Memory Usage: 10.922 GB | Execution Time: 38.63654 seconds
Version: old | Dataset: ogbn-products | include_EID: False
Memory Usage: 14.054 GB | Execution Time: 36.94148 seconds
Version: new | Dataset: ogbn-products | include_EID: False
Memory Usage: 10.922 GB | Execution Time: 35.49174 seconds
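To put the peak memory numbers from the latest benchmark in relative terms, here is a small sketch that computes the fraction saved by the new implementation. The figures are copied from the log lines above; the helper name `relative_saving` is made up for illustration and is not part of DGL. The saving comes out at roughly 16% for ogbn-mag and 22% for ogbn-products.

```python
# Peak memory in GB as (old, new), copied from the "Latest benchmark" log above.
# Keys are (dataset, include_EID).
results = {
    ("ogbn-mag", True): (7.983, 6.674),
    ("ogbn-mag", False): (7.969, 6.674),
    ("ogbn-products", True): (14.054, 10.922),
    ("ogbn-products", False): (14.054, 10.922),
}


def relative_saving(old_gb, new_gb):
    """Fraction of peak memory saved by the new version (hypothetical helper)."""
    return (old_gb - new_gb) / old_gb


for (dataset, include_eid), (old_gb, new_gb) in results.items():
    saving = relative_saving(old_gb, new_gb)
    print(f"{dataset} | include_EID: {include_eid} | saving: {saving:.1%}")
```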
So the remaining work items are
Benchmark on a self-made dataset with graph features, containing a heterograph with 3 million nodes and 30 million edges:
Version: old | Dataset: selfmade | include_EID: True
Memory Usage: 12.004 GB | Execution Time: 26.74319 seconds
Version: new | Dataset: selfmade | include_EID: True
Memory Usage: 10.524 GB | Execution Time: 26.45344 seconds
Version: old | Dataset: selfmade | include_EID: False
Memory Usage: 12.004 GB | Execution Time: 25.80169 seconds
Version: new | Dataset: selfmade | include_EID: False
Memory Usage: 10.524 GB | Execution Time: 25.55227 seconds
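The thread does not show how the memory and time figures were collected, so the following is only a minimal sketch of one way to produce log lines in the same shape, using the standard library. Everything here is an assumption: `benchmark` and `build_edge_ids` are hypothetical names, the workload is a stand-in rather than `preprocess_ondisk_dataset()`, and `tracemalloc` only sees allocations made through Python's allocator, so it would undercount memory held by native tensor libraries. Reproducing the numbers above would more likely require a process-level measurement.

```python
import time
import tracemalloc


def benchmark(fn, *args, **kwargs):
    """Run fn once and return (result, peak_bytes, elapsed_seconds).

    Hypothetical helper: peak_bytes covers Python-level allocations only.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, peak_bytes, elapsed


def build_edge_ids(num_edges):
    # Stand-in workload (an assumption): materialize one ID per edge.
    return list(range(num_edges))


result, peak_bytes, elapsed = benchmark(build_edge_ids, 100_000)
print(
    f"Memory Usage: {peak_bytes / 1024**3:.3f} GB | "
    f"Execution Time: {elapsed:.5f} seconds"
)
```

Running the old and new code paths in separate processes keeps one run's allocations from affecting the other's peak.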
Description
#5820
I am planning to handle issues regarding int32 in another PR.
Update on Feb 5:
A simple benchmark:
Update on Feb 6:
Benchmark results on the latest commit:
With the new implementation, peak memory usage is the same whether include_original_edge_id is True or False.
Checklist
Please feel free to remove inapplicable items for your PR.
Changes