Merge pull request #115 from bdmoore1/bdmoore1-gh-pages
doc: updates for github pages 2021.12
nikitaxgusev authored Apr 15, 2024
2 parents fef07e2 + 81ea90e commit 2db7ff8
Showing 57 changed files with 1,599 additions and 106 deletions.
Selection of Collective Algorithms
==================================

|product_short| supports manual selection of collective algorithms for different message size ranges.

Refer to :ref:`Collective Algorithms Selection <collective-algorithms-selection>` section for details.
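
For instance, a sketch of such a configuration could pin ``allreduce`` to different algorithms per message size range through an environment variable (the algorithm names and range syntax below are assumptions; the linked section is authoritative):

.. code::

   # Illustrative only: use one allreduce algorithm for messages up to 64 KiB
   # and another one for larger messages. The algorithm names and the range
   # syntax are assumptions; see the linked section.
   export CCL_ALLREDUCE="rabenseifner:0-65536;ring:65537-max"
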
For BF16 <-> FP32 conversion |product_short| provides ``AVX512F`` and ``AVX512_BF16``-based implementations.
For FP16 <-> FP32 conversion |product_short| provides ``F16C`` and ``AVX512F``-based implementations.
Both implementations require GCC 4.9 or higher.

Refer to :ref:`Low-precision datatypes <low-precision-datatypes>` for details about relevant environment variables.
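
As an illustrative sketch, the implementation could be selected through environment variables (the variable names and values below are assumptions; the linked section lists the supported settings):

.. code::

   # Illustrative only: request the AVX512_BF16-based BF16 conversion and the
   # F16C-based FP16 conversion. Names and values are assumptions.
   export CCL_BF16=avx512bf
   export CCL_FP16=f16c
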
10 changes: 5 additions & 5 deletions _sources/advanced-configuration/operation-fusion.rst
Fusion of Communication Operations
==================================

In some cases, it may be beneficial to postpone the execution of communication operations and execute them all together as a single operation in batch mode.
This can reduce operation setup overhead and improve interconnect saturation.

|product_short| provides several knobs to enable and control such optimization:

- The fusion is enabled by :ref:`CCL_FUSION`.
- The advanced configuration is controlled by:

* :ref:`CCL_FUSION_BYTES_THRESHOLD <CCL_FUSION_BYTES_THRESHOLD>`
* :ref:`CCL_FUSION_COUNT_THRESHOLD <CCL_FUSION_COUNT_THRESHOLD>`
* :ref:`CCL_FUSION_CYCLE_MS <CCL_FUSION_CYCLE_MS>`

.. note::
For now, this functionality is supported for ``allreduce`` operations only.
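
As a sketch, the knobs above could be combined as follows (the threshold values are arbitrary illustrations, not recommendations):

.. code::

   # Enable fusion and control when a pending batch is flushed: by total
   # accumulated bytes, by the number of pending operations, or by a timeout
   # in milliseconds. The values below are arbitrary examples.
   export CCL_FUSION=1
   export CCL_FUSION_BYTES_THRESHOLD=16384
   export CCL_FUSION_COUNT_THRESHOLD=256
   export CCL_FUSION_CYCLE_MS=0.2
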
4 changes: 2 additions & 2 deletions _sources/advanced-configuration/operation-prioritization.rst
Prioritization of Communication Operations
==========================================

|product_short| supports prioritization of communication operations, which controls the order in which individual communication operations are executed.
This allows you to postpone the execution of non-urgent operations so that urgent operations complete earlier, which may be beneficial for many use cases.

Communication prioritization is controlled by a priority value. The priority must be a non-negative number, where a higher number means a higher priority.
There are the following prioritization modes:
- Direct - you explicitly specify the priority using the ``priority`` field in the operation attribute.
- LIFO (Last In, First Out) - priority is implicitly increased on each operation call. In this case, you do not have to specify priority.

The prioritization mode is controlled by :ref:`CCL_PRIORITY <CCL_PRIORITY>`.
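
As a sketch, selecting the LIFO mode could look like this (the value name is an assumption; see :ref:`CCL_PRIORITY <CCL_PRIORITY>` for the supported values):

.. code::

   # Illustrative only: let each successive operation implicitly receive a
   # higher priority. The value "lifo" is an assumption.
   export CCL_PRIORITY=lifo
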
236 changes: 236 additions & 0 deletions _sources/benchmark.rst
oneCCL Benchmark User Guide
=====================================

The oneCCL benchmark provides performance measurements for the collective operations in oneCCL, such as:

* ``allreduce``
* ``reduce``
* ``allgather``
* ``alltoall``
* ``alltoallv``
* ``reduce-scatter``
* ``broadcast``

The benchmark is distributed with the oneCCL package. You can find it in the examples directory within the oneCCL installation path.


Build oneCCL Benchmark
***********************

CPU-Only
^^^^^^^^^

To build the benchmark, complete the following steps:

1. Configure your environment. Source the installed oneCCL library for CPU-only support:

   .. code::

      source <ccl installation dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu

2. Navigate to ``<oneCCL install location>/share/doc/ccl/examples``.
3. Build the benchmark with the following command:

   .. code::

      cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install

CPU-GPU
^^^^^^^^

1. Configure your environment.

* Source the Intel(R) oneAPI DPC++/C++ Compiler. See the `documentation <https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/get-started-guide/current/overview.html>`_ for the instructions.
* Source the installed oneCCL library for CPU-GPU support:

  .. code::

     source <ccl installation dir>/ccl/latest/env/vars.sh --ccl-configuration=cpu_gpu_dpcpp

2. Navigate to ``<oneCCL install location>/share/doc/ccl/examples``.
3. Build the SYCL benchmark with the following command:

   .. code::

      cmake -S . -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_INSTALL_PREFIX=$(pwd)/build/_install && cmake --build build -j $(nproc) -t install

Run oneCCL Benchmark
*********************

To run the benchmark, use the following command:

.. code::

   mpirun -np <N> -ppn <P> benchmark [arguments]

Where:

* ``N`` is the overall number of processes
* ``P`` is the number of processes within a node

The benchmark reports:

* ``#bytes`` - the message size in the number of bytes
* ``#repetitions`` - the number of iterations
* ``t_min`` - the average time across iterations of the fastest process in each iteration
* ``t_max`` - the average time across iterations of the slowest process in each iteration
* ``t_avg`` - the average time across processes and iterations
* ``stddev`` - standard deviation
* ``wait_t_avg`` - the average wait time after the collective call returns and until it completes. To enable, use the ``-x`` option.

Notice that ``t_min``, ``t_max``, and ``t_avg`` measure the total collective execution time: the timer starts before the oneCCL API call and ends once the collective completes.
``wait_t_avg`` measures only the wait time: the timer starts after the collective API call returns control to the host/calling thread and ends once the collective completes.
Thus, ``wait_t_avg`` does not include the time spent in the oneCCL API call, while ``t_min``, ``t_max``, and ``t_avg`` include that time. Time is reported in ``μsec``.
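
For instance, to have the benchmark also report ``wait_t_avg``, pass ``-x on`` (the process counts below are placeholders):

.. code::

   mpirun -np 4 -ppn 2 benchmark -x on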


Benchmark Arguments
^^^^^^^^^^^^^^^^^^^^^

To see the benchmark arguments, use the ``--help`` argument.

The benchmark accepts the following arguments:

.. list-table::
:widths: 25 50 25
:header-rows: 1

* - Option
- Description
- Default Value
* - ``-b``, ``--backend``
- Specify the backend. The possible values are ``host`` and ``sycl``. For a CPU-only build, the backend is automatically set to ``host``, and only that option is available.
For a CPU-GPU build, both ``host`` and ``sycl`` are available, and ``sycl`` is the default. The ``host`` value allocates buffers in host (CPU) memory, while the ``sycl`` value allocates buffers in device (GPU) memory.
- ``sycl``
* - ``-i``, ``--iters``
- Specify the number of iterations executed by the benchmark.
- ``16``
* - ``-w``, ``--warmup_iters``
- Specify the number of warmup iterations, that is, the number of iterations the benchmark runs before it starts timing the iterations specified with the ``-i`` argument.
- ``16``
* - ``-j``, ``--iter_policy``
- Specify the iteration policy. Possible values are ``off`` and ``auto``.
When the iteration policy is ``off``, the number of iterations is the same across the message sizes.
When the iteration policy is ``auto``, the number of iterations is reduced based on the message size of the collective operation.
- ``auto``
* - ``-n``, ``--buf_count``
- Specify the number of collective operations the benchmark calls in each iteration. Each collective uses different ``send`` and ``receive`` buffers.
Explicit wait calls are placed for each collective after all of them have been called.
- ``1``
* - ``-f``, ``--min_elem_count``
- Specify the minimum number of elements used for the collective.
- ``1``
* - ``-t``, ``--max_elem_count``
- Specify the maximum number of elements used for the collective.
- ``128``
* - ``-y``, ``--elem_counts``
- Specify a list with the numbers of elements used for the collective, such as ``-y 4,8,32,131072``.
- ``[1, 2, 4, 8, 16, 32, 64, 128]``
* - ``-c``, ``--check``
- Check for correctness. The possible values are ``off`` (disable checking), ``last`` (check the last iteration), and ``all`` (check all the iterations).
- ``last``
* - ``-p``, ``--cache``
- Specify whether to use persistent collectives (``p=1``) or not (``p=0``).

.. note:: A collective is persistent when the same collective is called with the same parameters multiple times. oneCCL generates a schedule for each collective it runs and can apply optimizations when persistent collectives are used:
the schedule is generated once and reused across subsequent invocations, saving the time required to generate it.

- ``1``
* - ``-q``, ``--inplace``
- Specify whether oneCCL uses in-place (``1``) or out-of-place (``0``) buffers. With in-place buffers, the send and receive buffers used by the collective are the same.
With out-of-place buffers, they are different.
- ``0``
* - ``-a``, ``--sycl_dev_type``
- Specify the type of the SYCL device. The possible values are ``host``, ``cpu``, and ``gpu``.
- ``gpu``
* - ``-g``, ``--sycl_root_dev``
- Specify whether to use root devices (``0``) or sub-devices (``1``).
- ``0``
* - ``-m``, ``--sycl_mem_type``
- Specify the type of SYCL memory. The possible values are ``usm`` (unified shared memory) and ``buf`` (buffers).
- ``usm``
* - ``-u``, ``--sycl_usm_type``
- Specify the type of SYCL USM. The possible values are ``device`` or ``shared``.
- ``device``
* - ``-e``, ``--sycl_queue_type``
- Specify the type of the SYCL queue. The possible values are ``in_order`` and ``out_order``.
- ``out_order``
* - ``-l``, ``--coll``
- Specify the collective to run. Accepts a comma-separated list of collectives, without whitespace characters. The available collectives are ``allreduce``, ``reduce``, ``alltoallv``, ``alltoall``, ``allgatherv``, ``reduce-scatter``, and ``broadcast``.
- ``allreduce``
* - ``-d``, ``--dtype``
- Specify the datatype. Accepts a comma-separated list of datatypes to benchmark, without whitespace characters. The available types are ``int8``, ``int32``, ``int64``, ``uint64``, ``float16``, ``float32``, and ``bfloat16``.
- ``float32``
* - ``-r``, ``--reduction``
- Specify the type of the reduction. Accepts a comma-separated list of reduction operations to run, without whitespace characters. The available operations are ``sum``, ``prod``, ``min``, and ``max``.
- ``sum``
* - ``-o``, ``--csv_filepath``
- Specify the path to a CSV file where the benchmark stores its CSV-formatted output.
-
* - ``-x``, ``--ext``
- Specify whether to show additional information. The possible values are ``off``, ``auto``, and ``on``. With ``on``, the benchmark also displays the average wait time.
- ``auto``
* - ``-h``, ``--help``
- Show all of the supported options.
-

.. note::

The ``-t`` and ``-f`` options specify the count in number of elements, so the total number of bytes is obtained by multiplying the number of elements by the size in bytes of the datatype the collective uses.
For instance, with ``-f 128`` and the ``float32`` datatype, the total is 512 bytes (128 elements * 4 bytes per element).
The benchmark runs and reports time for the message sizes that correspond to the ``-t`` and ``-f`` arguments and all message sizes that are powers of two in between these two numbers.
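
For instance, the following invocation (process counts are placeholders) sweeps ``float32`` element counts from 128 to 1024, so the benchmark reports message sizes of 512, 1024, 2048, and 4096 bytes:

.. code::

   mpirun -np 2 -ppn 1 benchmark -l allreduce -d float32 -f 128 -t 1024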


Example
********

GPU
^^^^

The following example shows how to run the benchmark with the GPU buffers:

.. code::

   mpirun -n <N> -ppn <P> benchmark -a gpu -m usm -u device -l allreduce -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0 -e in_order

The above command runs:

* The ``allreduce`` collective operation
* With a total of ``N`` processes
* With ``P`` processes per node allocating the memory in the GPU
* Using SYCL Unified Shared Memory (USM) of the device type
* 20 iterations
* With the element count from 1024 to 67108864 (the benchmark runs with all the powers of two in that range) of the float32 datatype, assuming the collective is not persistent and using a SYCL in-order queue


Similarly, for ``allreduce`` and ``reduce_scatter``:

.. code::

   mpirun -n <N> -ppn <P> benchmark -a gpu -m usm -u device -l allreduce,reduce_scatter -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0 -e in_order

.. note:: In this case, the reported time is the accumulated execution time of ``allreduce`` and ``reduce_scatter``.

CPU
^^^^

.. code::

   mpirun -n <N> -ppn <P> benchmark -b host -l allreduce -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0

The above command runs:

* The ``allreduce`` collective operation
* With a total of ``N`` processes
* With ``P`` processes per node
* 20 iterations
* With the element count from 1024 to 67108864 (the benchmark runs with all the powers of two in that range) of the float32 datatype, assuming the collective is not persistent


Similarly, for ``allreduce`` and ``reduce_scatter``:

.. code::

   mpirun -n <N> -ppn <P> benchmark -b host -l allreduce,reduce_scatter -i 20 -j off -f 1024 -t 67108864 -d float32 -p 0
