
GEMM NN GPU fails even with the under-transfer fix #138

Closed
abouteiller opened this issue Mar 11, 2025 · 0 comments · Fixed by #139
Assignees
Labels
bug Something isn't working

Comments

@abouteiller
Contributor

abouteiller commented Mar 11, 2025

Looks like the failure in GEMM is an as-yet undiscovered prior issue.

RESOLUTION: Found the issue in DPLASMA: a missing dplasma_add2arena for the gpuNN GEMM.

OMPI_MCA_mpi_abort_print_stack=true OMPI_MCA_mpi_abort_delay=-1  PMIX_MCA_psec='' SLURM_TIMELIMIT=10 salloc -wleconte -n8 -N1  /usr/bin/srun "-n" "2"  "tests/testing_sgemm" -c 4 "-N" "1940" "-t" "320" "-v=5" "-g" "1" "-P" "1" "--" "--mca" "device_cuda_memory_number_of_blocks" "21" --mca comm_verbose 20

...
d@00001 MPI:    Retrieve datatype with mask 0x1 (remote_dep_get_datatypes) remote size 16384 @remote_dep_get_datatypes:929
[leconte:1373806] [0] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libopen-pal.so.80(opal_backtrace_buffer+0x23) [0x7f5a103d5783]
[leconte:1373806] [1] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libmpi.so.40(ompi_mpi_abort+0x117) [0x7f5a108bd2f7]
[leconte:1373806] [2] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libmpi.so.40(ompi_mpi_errors_are_fatal_comm_handler+0xda) [0x7f5a108acd0a]
[leconte:1373806] [3] func:/apps/spacks/2025-02-03/opt/spack/linux-rocky9-x86_64/gcc-11.4.1/openmpi-5.0.5-ms6djs2dvfi7hy3aisvr35t6opaznt5j/lib/libmpi.so.40(ompi_errhandler_invoke+0x165) [0x7f5a108ac0a5]
[leconte:1373806] [4] func:/home/bouteill/parsec/dplasma-master/build.cuda/parsec/parsec/libparsec.so.4(remote_dep_mpi_retrieve_datatype+0x511) [0x7f5a5d49ffa1]

In gdb:

            if(output->data.remote.dst_datatype!=PARSEC_DATATYPE_NULL) MPI_Type_get_name(output->data.remote.dst_datatype, type_name_dst, &len);

(From the backtrace)
 if( PARSEC_ITERATE_STOP == ontask(es, &nc, (const parsec_task_t *)this_task, &flow_of_sgemm_NN_gpu_READ_B_for_B_dep2_atline_200, &data, rank_src, rank_dst, vpid_dst, successor_repo, successor_repo_key, ontask_arg) )

We have output->data.remote.dst_datatype == NULL, which is not equal to PARSEC_DATATYPE_NULL (MPI_DATATYPE_NULL), so we proceed to call MPI_Type_get_name and crash MPI.

Two issues here:

  1. dst_datatype should not be NULL: presumably that flow has a type and we should have retrieved it. Explanation: this comes from GLOBAL_BARRIER Y, which is a CTL flow and thus has no type. This looks like a bug in get_datatype with CTL flows; the arena_datatypes in GEMM_NN_GPU were not filled.
  2. Should we compare against NULL instead of MPI_DATATYPE_NULL, or both? This should not crash MPI but instead raise a clean error in parsec. Tracked as "Issue parsec_fatal when the datatype_arenas have not been set in the PTG" parsec#739

It is not immediately clear why or whether this is related to the PR, or whether we simply fixed another issue that was masking this one.

Originally posted by @abouteiller in ICLDisco/parsec#733 (comment)

abouteiller added a commit to abouteiller/dplasma that referenced this issue Mar 12, 2025
@abouteiller abouteiller linked a pull request Mar 12, 2025 that will close this issue
@abouteiller abouteiller self-assigned this Mar 12, 2025
@abouteiller abouteiller added the bug Something isn't working label Mar 12, 2025
abouteiller added a commit to abouteiller/parsec that referenced this issue Mar 12, 2025
abouteiller added a commit to abouteiller/dplasma that referenced this issue Mar 13, 2025
abouteiller added a commit to abouteiller/dplasma that referenced this issue Mar 13, 2025
abouteiller added a commit to abouteiller/dplasma that referenced this issue Mar 13, 2025
abouteiller added a commit to abouteiller/dplasma that referenced this issue Mar 13, 2025
abouteiller added a commit that referenced this issue Mar 27, 2025
The Gemm NN GPU is missing declaration of datatype #138