
[GraphBolt] modify preprocess_ondisk_dataset() #6986

Merged
merged 29 commits into dmlc:master on Feb 17, 2024

Conversation

Skeleton003
Collaborator

@Skeleton003 Skeleton003 commented Jan 21, 2024

Description

#5820

I am planning to handle issues regarding int32 in another PR.

Update on Feb 5:

A simple benchmarking:

  • Code:
import multiprocessing
import os
from time import time

# `old` and `new` are the two versions of preprocess_ondisk_dataset() being
# compared; `get_peak_mem()` returns this process's peak memory usage in GB.
def bench(func, args, result_queue):
    t0 = time()
    result = func(**args)
    t1 = time()
    peak_mem = get_peak_mem()
    result_queue.put((result, peak_mem, t1 - t0))

for dataset_dir in ["ogbn-mag", "ogbn-products"]:
    for include_original_edge_id in [True, False]:
        for func in [old, new]:
            # Run each trial in a fresh process so peak-memory readings
            # do not carry over between runs.
            result_queue = multiprocessing.Queue()
            print(
                f"Version: {func.__name__} | Dataset: {dataset_dir} "
                f"| include_EID: {include_original_edge_id}"
            )
            process = multiprocessing.Process(
                target=bench,
                args=(
                    func,
                    {
                        "dataset_dir": os.path.join(
                            "/home/ubuntu/dgl/datasets", dataset_dir
                        ),
                        "include_original_edge_id": include_original_edge_id,
                        "force_preprocess": True,
                    },
                    result_queue,
                ),
            )
            process.start()
            # Read the result before join(): if the child's queued data is
            # large, joining first can deadlock.
            result, memory_usage, exec_time = result_queue.get()
            process.join()
            print(
                f"Memory Usage: {memory_usage:.3f} GB "
                f"| Execution Time: {exec_time:.5f} seconds"
            )
  • Result:
Version: old | Dataset: ogbn-mag | include_EID: True
Memory Usage: 8.000 GB | Execution Time: 15.10166 seconds
Version: new | Dataset: ogbn-mag | include_EID: True
Memory Usage: 9.159 GB | Execution Time: 14.50847 seconds
 
Version: old | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.969 GB | Execution Time: 13.55839 seconds
Version: new | Dataset: ogbn-mag | include_EID: False
Memory Usage: 8.530 GB | Execution Time: 13.03771 seconds
 
Version: old | Dataset: ogbn-products | include_EID: True
Memory Usage: 14.054 GB | Execution Time: 42.97557 seconds
Version: new | Dataset: ogbn-products | include_EID: True
Memory Usage: 10.922 GB | Execution Time: 39.46200 seconds
 
Version: old | Dataset: ogbn-products | include_EID: False
Memory Usage: 14.054 GB | Execution Time: 37.75033 seconds
Version: new | Dataset: ogbn-products | include_EID: False
Memory Usage: 10.922 GB | Execution Time: 36.29849 seconds

Update on Feb 6:

Benchmark results on the latest commit:

Version: old | Dataset: ogbn-mag | include_EID: True
Memory Usage: 7.983 GB | Execution Time: 14.86984 seconds
Version: new | Dataset: ogbn-mag | include_EID: True
Memory Usage: 6.674 GB | Execution Time: 14.08306 seconds
 
Version: old | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.969 GB | Execution Time: 13.24852 seconds
Version: new | Dataset: ogbn-mag | include_EID: False
Memory Usage: 6.674 GB | Execution Time: 12.80191 seconds
 
Version: old | Dataset: ogbn-products | include_EID: True
Memory Usage: 14.054 GB | Execution Time: 41.93387 seconds
Version: new | Dataset: ogbn-products | include_EID: True
Memory Usage: 10.922 GB | Execution Time: 38.63654 seconds
 
Version: old | Dataset: ogbn-products | include_EID: False
Memory Usage: 14.054 GB | Execution Time: 36.94148 seconds
Version: new | Dataset: ogbn-products | include_EID: False
Memory Usage: 10.922 GB | Execution Time: 35.49174 seconds

Now we can say that for the new implementation, the value of include_original_edge_id does not affect peak memory usage.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small. Read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Jan 21, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Jan 21, 2024

Commit ID: a07696e

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 21, 2024

Commit ID: 4a83ddb

Build ID: 2

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@Skeleton003
Collaborator Author

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented Jan 21, 2024

Commit ID: 4a83ddb

Build ID: 3

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 21, 2024

Commit ID: 25bee10

Build ID: 4

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 21, 2024

Commit ID: 0a36a87

Build ID: 5

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 22, 2024

Commit ID: 660df83

Build ID: 6

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 22, 2024

Commit ID: 66dbeeb

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 22, 2024

Commit ID: 7584583

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 22, 2024

Commit ID: fc53d7d

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Skeleton003
Collaborator Author

Please take a look at your convenience @frozenbugs @czkkkkkk @Rhett-Ying

@dgl-bot
Collaborator

dgl-bot commented Feb 5, 2024

Commit ID: 7fd3348ecb32621271c8599afa050dd666950775

Build ID: 21

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Rhett-Ying
Collaborator

Please check whether such a peak memory issue happens in the current implementation. #7086

@Skeleton003
Collaborator Author

Benchmarking results are pasted in the description. @Rhett-Ying @czkkkkkk

The good news is that the new implementation is generally faster than the old one, and saves more memory on homogeneous datasets (ogbn-products).

But it is worth noting that the new implementation uses more memory on datasets with a heterograph (ogbn-mag). I am going through the code to figure out why, but I don't have a clue right now. Any insight on this issue would be appreciated.

@dgl-bot
Collaborator

dgl-bot commented Feb 5, 2024

Commit ID: 2b8e056391997cc0ace357e2a8e0e7eb8e35e528

Build ID: 22

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 5, 2024

Commit ID: 903d2762243cd0d67ec16aee398655a6810defbe

Build ID: 23

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Skeleton003
Collaborator Author

Tried changing the dtype of type_per_edge for the hetero graph.

After changing type_per_edge to torch.int32:

Version: old | Dataset: ogbn-mag | include_EID: True
Memory Usage: 8.008 GB | Execution Time: 14.99524 seconds
Version: new | Dataset: ogbn-mag | include_EID: True
Memory Usage: 8.682 GB | Execution Time: 13.81134 seconds
 
Version: old | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.976 GB | Execution Time: 13.61416 seconds
Version: new | Dataset: ogbn-mag | include_EID: False
Memory Usage: 8.053 GB | Execution Time: 12.39217 seconds

After changing type_per_edge to torch.int16:

Version: old | Dataset: ogbn-mag | include_EID: True
Memory Usage: 8.008 GB | Execution Time: 15.08433 seconds
Version: new | Dataset: ogbn-mag | include_EID: True
Memory Usage: 8.448 GB | Execution Time: 13.50126 seconds
 
Version: old | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.976 GB | Execution Time: 13.57649 seconds
Version: new | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.819 GB | Execution Time: 12.00695 seconds
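The dtype experiments above come down to simple storage arithmetic: narrowing type_per_edge from int64 shrinks its buffer proportionally, as long as every edge-type id fits in the narrower dtype. A quick illustrative sketch (the tensor size is made up, not the real ogbn-mag edge count):

```python
import torch

num_edges = 1_000_000  # illustrative size only
type_per_edge = torch.randint(0, 4, (num_edges,), dtype=torch.int64)

# int64 -> int32 halves the storage; int16 quarters it, provided the
# number of edge types fits in the narrower dtype.
int64_bytes = type_per_edge.element_size() * num_edges
int32_bytes = type_per_edge.to(torch.int32).element_size() * num_edges
int16_bytes = type_per_edge.to(torch.int16).element_size() * num_edges

print(int64_bytes, int32_bytes, int16_bytes)  # 8000000 4000000 2000000
```

This is why the int16 runs above show the largest drop for the new version: type_per_edge is a per-edge attribute, so its footprint scales linearly with both edge count and element size.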

@Rhett-Ying
Collaborator

@Skeleton003 Is the default or previous dtype of type_per_edge torch.int64? Could you paste the results for ogbn-products as well? And how do you measure the memory usage?

@Rhett-Ying
Collaborator

According to the ogbn-mag results, storing edge attributes has an impact on memory usage.

@Skeleton003
Collaborator Author

> @Skeleton003 Is the default or previous dtype of type_per_edge torch.int64? Could you paste the results for ogbn-products as well? And how do you measure the memory usage?

Yes, the default/previous dtype of type_per_edge is torch.int64. The code and results are pasted in the description.
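The definition of the get_peak_mem() helper used in the benchmark is not shown in this PR. On Linux, one common way to read a process's peak resident set size is resource.getrusage, so a hypothetical stand-in might look like:

```python
import resource

def get_peak_mem():
    # Hypothetical sketch, not the actual helper from the PR.
    # ru_maxrss is the peak resident set size: kilobytes on Linux
    # (bytes on macOS). Convert to GB to match the benchmark output.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb / (1024 ** 2)
```

Because the benchmark runs each trial in its own multiprocessing.Process, this per-process counter would reflect only that trial's peak.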

@dgl-bot
Collaborator

dgl-bot commented Feb 6, 2024

Commit ID: 315b1a6b8aa2635f2e12ca23a233f34bca5fc962

Build ID: 24

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@Skeleton003
Collaborator Author

Latest benchmark:

Version: old | Dataset: ogbn-mag | include_EID: True
Memory Usage: 7.983 GB | Execution Time: 14.86984 seconds
Version: new | Dataset: ogbn-mag | include_EID: True
Memory Usage: 6.674 GB | Execution Time: 14.08306 seconds
 
Version: old | Dataset: ogbn-mag | include_EID: False
Memory Usage: 7.969 GB | Execution Time: 13.24852 seconds
Version: new | Dataset: ogbn-mag | include_EID: False
Memory Usage: 6.674 GB | Execution Time: 12.80191 seconds
 
Version: old | Dataset: ogbn-products | include_EID: True
Memory Usage: 14.054 GB | Execution Time: 41.93387 seconds
Version: new | Dataset: ogbn-products | include_EID: True
Memory Usage: 10.922 GB | Execution Time: 38.63654 seconds
 
Version: old | Dataset: ogbn-products | include_EID: False
Memory Usage: 14.054 GB | Execution Time: 36.94148 seconds
Version: new | Dataset: ogbn-products | include_EID: False
Memory Usage: 10.922 GB | Execution Time: 35.49174 seconds

@Rhett-Ying @czkkkkkk

@dgl-bot
Collaborator

dgl-bot commented Feb 6, 2024

Commit ID: 23bbbd1ea1484026a358ae6b0e917350aa067dc9

Build ID: 25

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 6, 2024

Commit ID: 94c0a34ae6b9632c668c5b39dc99ec5e99d4a110

Build ID: 26

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Rhett-Ying
Collaborator

So the remaining work items are:

  • profile memory footprint of feature data.
  • re-preprocess datasets and run an end-to-end example to check that accuracy is as expected (just to make sure the new processing works well).

@dgl-bot
Collaborator

dgl-bot commented Feb 17, 2024

Commit ID: 500d5622fe20d85ee5818ccf477720123baa1dca

Build ID: 27

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Skeleton003
Collaborator Author

Benchmark on a self-made dataset with graph features, containing a heterograph with 3 million nodes and 30 million edges.

Version: old | Dataset: selfmade | include_EID: True
Memory Usage: 12.004 GB | Execution Time: 26.74319 seconds
Version: new | Dataset: selfmade | include_EID: True
Memory Usage: 10.524 GB | Execution Time: 26.45344 seconds
 
Version: old | Dataset: selfmade | include_EID: False
Memory Usage: 12.004 GB | Execution Time: 25.80169 seconds
Version: new | Dataset: selfmade | include_EID: False
Memory Usage: 10.524 GB | Execution Time: 25.55227 seconds

@Skeleton003
Collaborator Author

  • profile memory footprint of feature data.
  • re-preprocess datasets and run an end-to-end example to check that accuracy is as expected (just to make sure the new processing works well).

@Skeleton003 Skeleton003 merged commit 458b938 into dmlc:master Feb 17, 2024
2 checks passed
@Skeleton003 Skeleton003 deleted the 1preprocess branch March 6, 2024 00:50