
[GraphBolt][PyG] Modify PyG example with to_pyg_data #7123

Merged: 15 commits merged into dmlc:master on Mar 1, 2024

Conversation

@RamonZhou
Collaborator

RamonZhou commented Feb 19, 2024

Description

Apply to_pyg_data in the example and implement layer-wise inference.

For one train epoch:

For inference:

  • The original PyG example takes 1m51s
  • Our example takes 1m28s

Accuracy after 10 epochs:

  • The original PyG example: 0.7807
  • Our example: 0.7808
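
For reference, a minimal sketch of what the converted training step looks like. This is an illustration rather than the exact example code: it assumes a GraphBolt DataLoader named dataloader that yields MiniBatch objects, a PyG model taking (x, edge_index), and that to_pyg_data() returns a torch_geometric.data.Data whose leading rows correspond to the seed nodes.

import torch.nn.functional as F

def train_one_epoch(model, dataloader, optimizer):
    # Hypothetical loop; the assumptions stated above apply.
    model.train()
    for minibatch in dataloader:
        data = minibatch.to_pyg_data()  # MiniBatch -> torch_geometric.data.Data
        out = model(data.x, data.edge_index)
        # Seed (output) nodes are assumed to come first in the batch.
        loss = F.cross_entropy(out[: data.y.shape[0]], data.y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()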

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • The related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Feb 19, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Feb 19, 2024

Commit ID: 7dbbc7fa37b84f5a193ca55649f61737a8e88f4a

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 20, 2024

Commit ID: 99b09b16c47b2a1aa9048fc9d8f082d8c01578b6

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@frozenbugs
Collaborator

Did you test the performance compared to the original PyG example?

@dgl-bot
Collaborator

dgl-bot commented Feb 22, 2024

Commit ID: 6c8072d7b2d3b2a337128f9ddb3354c113d14db0

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@RamonZhou
Collaborator Author

> Did you test the performance compared to the original PyG example?

I have updated the description.

@dgl-bot
Collaborator

dgl-bot commented Feb 22, 2024

Commit ID: e77506366b3fba19628ae3ae034235debb1c26bc

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

Collaborator

@mfbalin left a comment


LGTM overall. What is the performance like compared to the previous version of this example? @RamonZhou

@mfbalin
Collaborator

mfbalin commented Feb 27, 2024

What is the blocker for this PR?

@mfbalin
Collaborator

mfbalin commented Feb 27, 2024

There were some changes in master; we should update the branch and resolve any potential issues w.r.t. the dtype use of int32.

Note: there will be such issues after #7127 is merged.

@mfbalin
Collaborator

mfbalin commented Feb 27, 2024

@RamonZhou It might help improve end-to-end performance if we move the copy_to operation so that it is before fetch_feature but after sample_neighbor as discussed in #7005. We will need to pin the features in this case. I see close to 2x speedup just by doing this.
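
For illustration, the proposed reordering would look roughly like this as a GraphBolt datapipe (the variable names and fanouts are assumptions based on the standard examples, not this PR's exact code):

import dgl.graphbolt as gb

datapipe = gb.ItemSampler(train_set, batch_size=1024, shuffle=True)
datapipe = datapipe.sample_neighbor(graph, [10, 10, 10])
datapipe = datapipe.copy_to(device)  # moved: after sampling, before feature fetch
datapipe = datapipe.fetch_feature(features, node_feature_keys=["feat"])
dataloader = gb.DataLoader(datapipe)

With this order, the features must be pinned, as noted above.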

Another thing is to imitate the inference function in our official node_classification example so that we use the dataloader's functionality to load the features. Below, we update the feature store to be the next layer's features, so no manual feature-fetch indexing is needed.

def inference(self, graph, features, dataloader, storage_device):
    """Conduct layer-wise inference to get all the node embeddings."""
    pin_memory = storage_device == "pinned"
    buffer_device = torch.device("cpu" if pin_memory else storage_device)
    for layer_idx, layer in enumerate(self.layers):
        is_last_layer = layer_idx == len(self.layers) - 1
        y = torch.empty(
            graph.total_num_nodes,
            self.out_size if is_last_layer else self.hidden_size,
            dtype=torch.float32,
            device=buffer_device,
            pin_memory=pin_memory,
        )
        for data in tqdm(dataloader):
            # len(blocks) = 1
            hidden_x = layer(data.blocks[0], data.node_features["feat"])
            if not is_last_layer:
                hidden_x = F.relu(hidden_x)
                hidden_x = self.dropout(hidden_x)
            # By design, our output nodes are contiguous.
            y[data.seed_nodes[0] : data.seed_nodes[-1] + 1] = hidden_x.to(
                buffer_device
            )
        if not is_last_layer:
            features.update("node", None, "feat", y)
    return y

@dgl-bot
Collaborator

dgl-bot commented Feb 27, 2024

Commit ID: 08ed47ef7214b0da15181b9263798da64475c60d

Build ID: 5

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 28, 2024

Commit ID: c1df000b77c7e25ca864675c15d3f33778dfbff6

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin
Collaborator

mfbalin commented Feb 28, 2024

Proposed modification to to_pyg_data. Note that it only works when fanout values are identical across layers.

    def to_pyg_data(self):
        """Construct a PyG Data from `MiniBatch`. This function only supports
        node classification task on a homogeneous graph and the number of
        features cannot be more than one.
        """
        from torch_geometric.data import Data

        if self.sampled_subgraphs is None:
            edge_index = None
        else:
            col_nodes = []
            row_nodes = []
            prev_num_edges = 0
            for subgraph in reversed(self.sampled_subgraphs):
                if subgraph is None:
                    continue
                sampled_csc = subgraph.sampled_csc
                indptr = sampled_csc.indptr
                indices = sampled_csc.indices
                expanded_indptr = expand_indptr(
                    indptr, dtype=indices.dtype, output_size=len(indices)
                )
                # Drop the edges sampled for dst nodes in later layers like PyG.
                my_slice = slice(prev_num_edges, expanded_indptr.size(0))
                prev_num_edges = indices.size(0)
                col_nodes.append(expanded_indptr[my_slice])
                row_nodes.append(indices[my_slice])
            col_nodes = torch.cat(col_nodes)
            row_nodes = torch.cat(row_nodes)
            edge_index = torch.stack((row_nodes, col_nodes))
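
To make the indptr expansion concrete, here is a small made-up illustration (the numbers are hypothetical, not from the example):

# One sampled layer in CSC form:
#   indptr  = [0, 2, 3, 3]   # column pointers for 3 dst nodes
#   indices = [5, 7, 2]      # src ids of the 3 sampled edges
# expand_indptr(indptr, output_size=3) repeats each dst id by its degree:
#   -> [0, 0, 1]
# Stacking then yields edge_index = [[5, 7, 2],   # row (src)
#                                    [0, 0, 1]]   # col (dst)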

@RamonZhou
Collaborator Author

> @RamonZhou It might help improve end-to-end performance if we move the copy_to operation so that it is before fetch_feature but after sample_neighbor as discussed in #7005. We will need to pin the features in this case. I see close to 2x speedup just by doing this.
>
> Another thing is to imitate the inference function in our official node_classification example so that we use the dataloader's functionality to load the features. Below, we update the feature store to be the next layer's features, so no manual feature-fetch indexing is needed.


This suggestion could probably improve inference performance, but here we'd prefer not to modify the PyG model too much: if PyG users want to use our GraphBolt dataloader, they won't want to make many changes to their original model.

@dgl-bot
Collaborator

dgl-bot commented Feb 28, 2024

Commit ID: b74ea50e70e11b171b01433c4ed9a4e1029524b1

Build ID: 7

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin
Collaborator

mfbalin commented Feb 28, 2024

The original PyG example uses different fanouts than this example:
https://github.com/pyg-team/pytorch_geometric/blob/25b2f208e671eeec285bfafa2e246ea0a234b312/examples/ogbn_products_sage.py#L20-L28

Using [15, 10, 5] will be much faster than [10, 10, 10].
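
As a sketch, matching PyG's fanouts in a GraphBolt neighbor sampler would look like this (the exact call site in the example is an assumption):

datapipe = datapipe.sample_neighbor(graph, [15, 10, 5])  # instead of [10, 10, 10]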

@mfbalin
Collaborator

mfbalin commented Feb 28, 2024

If we are moving the features to the GPU, then why not do the same for the graph as well? Usually, the graph takes even less space than the features. @frozenbugs @Rhett-Ying

@dgl-bot
Collaborator

dgl-bot commented Feb 28, 2024

Commit ID: 65589e4f9273a9766040f3a42e13d91de6de8d68

Build ID: 8

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Rhett-Ying
Collaborator

> If we are moving the features to the GPU, then why not do the same for the graph as well? Usually, the graph takes even less space than the features. @frozenbugs @Rhett-Ying

Leaving the graph on the CPU is consistent with the canonical PyG example?

@mfbalin
Collaborator

mfbalin commented Feb 28, 2024

> If we are moving the features to the GPU, then why not do the same for the graph as well? Usually, the graph takes even less space than the features. @frozenbugs @Rhett-Ying
>
> Leaving the graph on the CPU is consistent with the canonical PyG example?

Then we could still leave it on the CPU and pin its memory? I also tested and saw no performance difference between pinning the features and moving them to the GPU. @RamonZhou what performance difference do you see between pinning the features vs. moving them to the GPU? I believe we can simply pin both the graph and the features, and the performance should already be better than the original PyG example. Compared to PyG, we can advertise that we use much less GPU memory, because we use pinned system memory, while being faster.
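
A minimal sketch of that setup, assuming GraphBolt exposes in-place pin_memory_() on both the graph and the feature store (the method names here are assumptions):

graph = dataset.graph.pin_memory_()        # keep the graph in pinned host memory
features = dataset.feature.pin_memory_()   # keep the features in pinned host memory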

@dgl-bot
Collaborator

dgl-bot commented Feb 29, 2024

Commit ID: b1d5bbcb2c8d9879b6c6951a1c0d07211d162b97

Build ID: 9

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 29, 2024

Commit ID: a2f12d6

Build ID: 10

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 29, 2024

Commit ID: 4a468e90195312a19b7ecec6b7df615632ea9505

Build ID: 11

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 29, 2024

Commit ID: c5f617abe0b5ea2bb845f0a210901cdaf0859dda

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 29, 2024

Commit ID: 9eeee0e80b9da987e547058ec6079dc06ecc69b7

Build ID: 13

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@RamonZhou merged commit d4c8a04 into dmlc:master on Mar 1, 2024
2 checks passed
Rhett-Ying pushed a commit that referenced this pull request Mar 1, 2024
Co-authored-by: Muhammed Fatih BALIN <m.f.balin@gmail.com>
mfbalin added a commit to mfbalin/dgl-1 that referenced this pull request Mar 2, 2024
Co-authored-by: Muhammed Fatih BALIN <m.f.balin@gmail.com>