matmul fails with PCC for some inputs #18748

Open
bbradelTT opened this issue Mar 6, 2025 · 0 comments
Labels: bug (Something isn't working), op_cat: mm, P1

bbradelTT (Contributor) commented:

Repro:

Run pytest with:

from loguru import logger
import pytest
import torch
import math
import ttnn
        
from tests.ttnn.utils_for_testing import assert_with_pcc
        
def test_interleaved_input_sharded_output_matmul(device):
    torch.manual_seed(0)
    pcc = 0.99
    # Block sharded output on a 1x1 core grid
    torch_input_tensor_a = torch.randn([1, 1, 256, 32], dtype=torch.bfloat16)
    torch_input_tensor_b = torch.randn([1, 1, 32, 256], dtype=torch.bfloat16)
    torch_output_tensor = torch.matmul(torch_input_tensor_a, torch_input_tensor_b)

    input_tensor_a = ttnn.from_torch(torch_input_tensor_a, layout=ttnn.TILE_LAYOUT, device=device)
    input_tensor_b = ttnn.from_torch(torch_input_tensor_b, layout=ttnn.TILE_LAYOUT, device=device)
    
    out_mem_config = ttnn.create_sharded_memory_config(
        shape=(256, 256),
        core_grid=ttnn.CoreGrid(x=1, y=1),
        strategy=ttnn.ShardStrategy.BLOCK,
        orientation=ttnn.ShardOrientation.ROW_MAJOR,
    )
    
    output2 = ttnn.matmul(input_tensor_a, input_tensor_b, memory_config=out_mem_config)
    output_tensor = ttnn.to_torch(output2)
    assert_with_pcc(torch_output_tensor, output_tensor, pcc=pcc)

Expected:

  • passes

Actual:

  • fails with a low PCC (~0.17)

Also, it would be good to understand what is going on with the shapes:

Running

pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_interleaved_input_sharded_output_matmul 

for the block sharded case, the debug logs are:

                     Op | DEBUG    | Started C++ ttnn operation: ttnn::matmul
                     Op | DEBUG    | Started C++ ttnn operation: ttnn::prim::old_infra_device_operation
                     Op | DEBUG    | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=16,out_block_w=16,per_core_M=16,per_core_N=16,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
                     Op | DEBUG    | Launching Device Operation: "Matmul"
                     Op | DEBUG    | Program Hash: 0
                     Op | DEBUG    | Program Cache Hit: false
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    | 	program_config = std::nullopt
                     Op | DEBUG    | 	bcast_batch = 1
                     Op | DEBUG    | 	output_mem_config = MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED,buffer_type=BufferType::L1,shard_spec=ShardSpec(grid={[(x=0,y=0) - (x=0,y=0)]},shape={256, 256},orientation=ShardOrientation::ROW_MAJOR,mode=ShardMode::PHYSICAL,physical_shard_shape=std::nullopt))
                     Op | DEBUG    | 	output_dtype = DataType::BFLOAT16
                     Op | DEBUG    | 	compute_kernel_config = WormholeComputeKernelConfig(math_fidelity=HiFi2,math_approx_mode=0,fp32_dest_acc_en=0,packer_l1_acc=1,dst_full_sync_en=0)
                     Op | DEBUG    | 	untilize_out = false
                     Op | DEBUG    | 	user_core_coord = std::nullopt
                     Op | DEBUG    | 	user_fused_activation = std::nullopt
                     Op | DEBUG    | 	user_run_batched = false
                     Op | DEBUG    | 	transpose_a = false
                     Op | DEBUG    | 	transpose_b = false
                     Op | DEBUG    | 	output_tile = Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4)
                     Op | DEBUG    | 	global_cb = std::nullopt
                     Op | DEBUG    | Tensors Args:
                     Op | DEBUG    | 	0: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),tensor_spec=TensorSpec(logical_shape=Shape([1, 1, 32, 32]),tensor_layout=TensorLayout(dtype=DataType::BFLOAT16,page_config=PageConfig(config=TilePageConfig(tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))),memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt),alignment=Alignment([32, 32]))))
                     Op | DEBUG    | 	1: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),tensor_spec=TensorSpec(logical_shape=Shape([1, 1, 32, 256]),tensor_layout=TensorLayout(dtype=DataType::BFLOAT16,page_config=PageConfig(config=TilePageConfig(tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))),memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt),alignment=Alignment([32, 32]))))
                     Op | DEBUG    | 
                     Op | DEBUG    | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=8,out_block_w=8,per_core_M=8,per_core_N=8,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
                     Op | DEBUG    | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=8,out_block_w=8,per_core_M=8,per_core_N=8,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
                     Op | DEBUG    | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=8,out_block_w=8,per_core_M=8,per_core_N=8,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
                     Op | DEBUG    | CB 0 :: PS = 2048, NP = 8, TOTAL = 16384
                     Op | DEBUG    | CB 1 :: PS = 2048, NP = 8, TOTAL = 16384
                     Op | DEBUG    | CB 4 :: PS = 2048, NP = 64, TOTAL = 131072
                     Op | DEBUG    | Finished C++ ttnn operation: ttnn::prim::old_infra_device_operation
                     Op | DEBUG    | Finished C++ ttnn operation: ttnn::matmul

Given that the output of the logged run is 32x256, the output memory config shard shape would be expected to be 32x256 instead of 256x256.
