Repro:
Run pytest with the following test:
```python
from loguru import logger
import pytest
import torch
import math
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc


def test_interleaved_input_sharded_output_matmul(device):
    torch.manual_seed(0)
    pcc = 0.99

    # Width sharded
    torch_input_tensor_a = torch.randn([1, 1, 256, 32], dtype=torch.bfloat16)
    torch_input_tensor_b = torch.randn([1, 1, 32, 256], dtype=torch.bfloat16)
    torch_output_tensor = torch.matmul(torch_input_tensor_a, torch_input_tensor_b)

    input_tensor_a = ttnn.from_torch(torch_input_tensor_a, layout=ttnn.TILE_LAYOUT, device=device)
    input_tensor_b = ttnn.from_torch(torch_input_tensor_b, layout=ttnn.TILE_LAYOUT, device=device)

    out_mem_config = ttnn.create_sharded_memory_config(
        shape=(256, 256),
        core_grid=ttnn.CoreGrid(x=1, y=1),
        strategy=ttnn.ShardStrategy.BLOCK,
        orientation=ttnn.ShardOrientation.ROW_MAJOR,
    )

    output2 = ttnn.matmul(input_tensor_a, input_tensor_b, memory_config=out_mem_config)
    output_tensor = ttnn.to_torch(output2)
    assert_with_pcc(torch_output_tensor, output_tensor, pcc=pcc)
```
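(If the snippet above is saved as, say, `test_repro.py` — a hypothetical filename — it can be invoked with `pytest test_repro.py::test_interleaved_input_sharded_output_matmul`.)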
Expected:

Actual:
Also, it would be good to understand what is going on with the shapes:
Running

```
pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_interleaved_input_sharded_output_matmul
```

the debug logs for the block sharded case are:
```
Op | DEBUG | Started C++ ttnn operation: ttnn::matmul
Op | DEBUG | Started C++ ttnn operation: ttnn::prim::old_infra_device_operation
Op | DEBUG | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=16,out_block_w=16,per_core_M=16,per_core_N=16,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
Op | DEBUG | Launching Device Operation: "Matmul"
Op | DEBUG | Program Hash: 0
Op | DEBUG | Program Cache Hit: false
Op | DEBUG | Attributes:
Op | DEBUG | program_config = std::nullopt
Op | DEBUG | bcast_batch = 1
Op | DEBUG | output_mem_config = MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED,buffer_type=BufferType::L1,shard_spec=ShardSpec(grid={[(x=0,y=0) - (x=0,y=0)]},shape={256, 256},orientation=ShardOrientation::ROW_MAJOR,mode=ShardMode::PHYSICAL,physical_shard_shape=std::nullopt))
Op | DEBUG | output_dtype = DataType::BFLOAT16
Op | DEBUG | compute_kernel_config = WormholeComputeKernelConfig(math_fidelity=HiFi2,math_approx_mode=0,fp32_dest_acc_en=0,packer_l1_acc=1,dst_full_sync_en=0)
Op | DEBUG | untilize_out = false
Op | DEBUG | user_core_coord = std::nullopt
Op | DEBUG | user_fused_activation = std::nullopt
Op | DEBUG | user_run_batched = false
Op | DEBUG | transpose_a = false
Op | DEBUG | transpose_b = false
Op | DEBUG | output_tile = Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4)
Op | DEBUG | global_cb = std::nullopt
Op | DEBUG | Tensors Args:
Op | DEBUG | 0: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),tensor_spec=TensorSpec(logical_shape=Shape([1, 1, 32, 32]),tensor_layout=TensorLayout(dtype=DataType::BFLOAT16,page_config=PageConfig(config=TilePageConfig(tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))),memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt),alignment=Alignment([32, 32]))))
Op | DEBUG | 1: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),tensor_spec=TensorSpec(logical_shape=Shape([1, 1, 32, 256]),tensor_layout=TensorLayout(dtype=DataType::BFLOAT16,page_config=PageConfig(config=TilePageConfig(tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))),memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt),alignment=Alignment([32, 32]))))
Op | DEBUG |
Op | DEBUG | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=8,out_block_w=8,per_core_M=8,per_core_N=8,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
Op | DEBUG | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=8,out_block_w=8,per_core_M=8,per_core_N=8,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
Op | DEBUG | Auto generated program config: MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8,y=8),in0_block_w=1,out_subblock_h=1,out_subblock_w=2,out_block_h=8,out_block_w=8,per_core_M=8,per_core_N=8,transpose_mcast=0,fused_activation=std::nullopt,fuse_batch=0)
Op | DEBUG | CB 0 :: PS = 2048, NP = 8, TOTAL = 16384
Op | DEBUG | CB 1 :: PS = 2048, NP = 8, TOTAL = 16384
Op | DEBUG | CB 4 :: PS = 2048, NP = 64, TOTAL = 131072
Op | DEBUG | Finished C++ ttnn operation: ttnn::prim::old_infra_device_operation
Op | DEBUG | Finished C++ ttnn operation: ttnn::matmul
```
Given that the output is 32x256, the output memory config shard shape would be expected to be 32x256 rather than 256x256.
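For reference, a minimal sketch of what such a config might look like, assuming the shard shape should match the unit test's 32x256 logical output and reusing the 1x1 core grid and block sharding from the logs above (this expected config is an assumption, not something confirmed by the logs):

```python
import ttnn

# Hypothetical sketch: a sharded memory config whose shard shape matches the
# 32x256 output of [1, 1, 32, 32] @ [1, 1, 32, 256] on the 1x1 core grid
# reported in the logs, instead of the 256x256 shape that appears there.
expected_out_mem_config = ttnn.create_sharded_memory_config(
    shape=(32, 256),  # assumed expected shard shape for the 32x256 output
    core_grid=ttnn.CoreGrid(x=1, y=1),
    strategy=ttnn.ShardStrategy.BLOCK,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)
```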