Unexpected performance penalty when using amrex::ParallelFor #3799

Closed
ankithadas opened this issue Mar 12, 2024 · 24 comments · Fixed by #3855

@ankithadas
Contributor

Using amrex::ParallelFor instead of the AMREX_HOST_DEVICE_PARALLEL_FOR_4D macro results in an unexpected performance penalty.

Using this

amrex::ParallelFor(
    bx, AMREX_SPACEDIM,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) noexcept -> void {
        amrex::Real under_relax_term = under_relax_eval<under_relax>(undr_relx_coef, vel_fab(i, j, k, n), Ap_fab(i, j, k));
        amrex::Real deferred_correction_term = deferred_correction<hos, tvds>(i, j, k, n,
                AMREX_D_DECL(uxface_fab, uyface_fab, uzface_fab),
                vel_fab, dxinv, blend_factor);
        Usrc_fab(i, j, k, n) = - gp_fab(i, j, k, n) + under_relax_term
                            + deferred_correction_term;
    }
);

instead of

AMREX_HOST_DEVICE_PARALLEL_FOR_4D(bx, AMREX_SPACEDIM, i, j, k, n,
{
    amrex::Real under_relax_term = (under_relax_eval<under_relax>(undr_relx_coef, vel_fab(i, j, k, n), Ap_fab(i, j, k)));
    amrex::Real deferred_correction_term = (deferred_correction<hos, tvds>(i, j, k, n, 
            AMREX_D_DECL(uxface_fab, uyface_fab, uzface_fab), vel_fab, dxinv, blend_factor));
    Usrc_fab(i, j, k, n) = - gp_fab(i, j, k, n) + under_relax_term
                        + deferred_correction_term;
});

results in a 6.6x slowdown. Ideally, both should produce the same machine code.

Examining the decompiled binary with Ghidra showed that the lambda passed to ParallelFor was not inlined.

Profiler Comparison

NavierStokes::computeSourceTerm-TVD-MinMod()        2919      0.6683        8.38 % : With macro 
NavierStokes::computeSourceTerm-TVD-MinMod()        2919      4.4066       38.05 % : With amrex::ParallelFor

I am using gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0. I observed similar results with gcc 12 as well.

@WeiqunZhang
Member

What optimization flags did you use? -O3 or -O0?

@ankithadas
Contributor Author

-O3 -DNDEBUG -Ofast -march=native -ffast-math -Werror=return-type with AMReX_IPO turned on

@ankithadas
Contributor Author

Is there a reason why amrex::ParallelFor is not force-inlined? If it is supposed to be the same as AMREX_HOST_DEVICE_PARALLEL_FOR_4D, it should be force-inlined.

@WeiqunZhang
Member

Is there a reason why amrex::ParallelFor is not force-inlined?

I don't think it matters whether ParallelFor itself is inlined or not. What matters is the functions it calls.

Do you have a reproducer? I tried some simple tests and did not find any meaningful performance difference between the two. It's probably that the functions used here are big enough that the compiler decides not to inline them. If I have a reproducer, I can experiment with it.
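
To make that concrete, here is a minimal standalone sketch (hypothetical functions, not AMReX code): even when a thin wrapper is inlined into its caller, inlining the large function it calls is a separate decision governed by the compiler's size heuristics.

#include <cstdio>

// A callee large enough that gcc's default size heuristics
// (see -finline-limit) may decline to inline it.
static double big_callee (double x)
{
    double s = 0.0;
    for (int i = 0; i < 64; ++i) { s += x / (i + 1.0); }
    return s;
}

// The wrapper is trivially inlined into its caller, but that says
// nothing about whether big_callee() is inlined into the wrapper.
inline double wrapper (double x) { return big_callee(x); }

int main ()
{
    std::printf("%g\n", wrapper(3.0));
    return 0;
}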

@ankithadas
Contributor Author

Compiling this on my machine with GCC results in a massive performance difference.

Total times ----------------------------------------------------------------------
                               Function Name      NCalls        Time     Percent %
                Deferred Correction Function           1      2.1986       88.27 %
                   Deferred Correction Macro           1      0.2817       11.31 %
              The_Pinned_Arena::Initialize()           1      0.0044        0.18 %
                          FabArray::setVal()           5      0.0036        0.14 %
                        amrex_mempool_init()           1      0.0024        0.09 %
  DistributionMapping::SFCProcessorMapDoIt()           1      0.0000        0.00 %
           DistributionMapping::Distribute()           1      0.0000        0.00 %
        DistributionMapping::LeastUsedCPUs()           1      0.0000        0.00 %
         ParallelDescriptor::Gather(TsT1si)d           0      0.0000        0.00 %
==================================================================================
#include <AMReX.H>
#include <AMReX_MultiFabUtil.H>
#include <AMReX_TagBox.H>
#include <AMReX_ParmParse.H>

using namespace amrex;

enum class HOScheme {
    TVD
};

enum class TVDScheme
{
    UMIST,
};

template<typename T>
[[nodiscard]]
AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
T upwind_scheme (int i, int j, int, int n,
                    amrex::Array4<T const> const& uface_x,
                    amrex::Array4<T const> const& uface_y,
                    amrex::Array4<T const> const& u)
{
    constexpr T zero = T(0.0);
    const T upwind_hi_x = amrex::max(uface_x(i+1, j, 0), zero) * u(i  , j, 0, n) +
                          amrex::min(uface_x(i+1, j, 0), zero) * u(i+1, j, 0, n);
    const T upwind_lo_x = amrex::max(uface_x(i, j, 0), zero) * u(i-1, j, 0, n) +
                          amrex::min(uface_x(i, j, 0), zero) * u(i  , j, 0, n);
    const T upwind_term_x = upwind_hi_x - upwind_lo_x;

    const T upwind_hi_y = amrex::max(uface_y(i, j+1, 0), zero) * u(i, j  , 0, n) +
                          amrex::min(uface_y(i, j+1, 0), zero) * u(i, j+1, 0, n);
    const T upwind_lo_y = amrex::max(uface_y(i, j, 0), zero) * u(i, j-1, 0, n) +
                          amrex::min(uface_y(i, j, 0), zero) * u(i, j  , 0, n);
    const T upwind_term_y = upwind_hi_y - upwind_lo_y;

    return upwind_term_x + upwind_term_y;
}

#define EPSILON 1e-13

// None case
template<HOScheme hos, TVDScheme tvds>
struct limiter { };


// UMIST case
template<>
struct limiter<HOScheme::TVD, TVDScheme::UMIST> {
    template<typename T>
    AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
    constexpr T operator()(T num, T den) noexcept {
        den += EPSILON;
        const T r = num / den;
        using RT = amrex::Real;
        return den * amrex::max<RT>(0, amrex::min<RT>(2, 2 * r, 0.25 * (1 + 3.0 * r), 0.25 * (3.0 + r)));
    }
};


template<HOScheme hos, TVDScheme tvds>
AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
constexpr limiter<hos, tvds> getScheme() noexcept
{
    return limiter<hos, tvds>{};
}

template<HOScheme hos, TVDScheme tvds, typename T>
[[nodiscard]]
AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
T tvd_scheme (int i, int j, int k, int n,
                amrex::Array4<T const> const& uface_x,
                amrex::Array4<T const> const& uface_y,
                amrex::Array4<T const> const& u) noexcept
{
    constexpr T zero = T(0.0);

    auto limiter = getScheme<hos, tvds>();

    const T tvd_hi_x = amrex::max(uface_x(i+1, j, k), zero) * (
                            u(i  , j, k, n) + 0.5 * limiter(u(i  , j, k, n) - u(i-1, j, k, n), u(i+1, j, k, n) - u(i  , j, k, n))
                        )
                        +
                        amrex::min(uface_x(i+1, j, k), zero) * (
                            u(i+1, j, k, n) + 0.5 * limiter(u(i+1, j, k, n) - u(i+2, j, k, n), u(i  , j, k, n) - u(i+1, j, k, n))
                        );

    const T tvd_lo_x = amrex::max(uface_x(i, j, k), zero) * (
                            u(i-1, j, k, n) + 0.5 * limiter(u(i-1, j, k, n) - u(i-2, j, k, n), u(i  , j, k, n) - u(i-1, j, k, n))
                        )
                        +
                        amrex::min(uface_x(i, j, k), zero) * (
                            u(i, j, k, n) + 0.5 * limiter(u(i  , j, k, n) - u(i+1, j, k, n), u(i-1, j, k, n) - u(i  , j, k, n))
                        );

    const T tvd_term_x = tvd_hi_x - tvd_lo_x;

    const T tvd_hi_y = amrex::max(uface_y(i, j+1, k), zero) * (
                            u(i, j  , k, n) + 0.5 * limiter(u(i, j  , k, n) - u(i, j-1, k, n), u(i, j+1, k, n) - u(i, j  , k, n))
                        )
                        +
                        amrex::min(uface_y(i, j+1, k), zero) * (
                            u(i, j+1, k, n) + 0.5 * limiter(u(i, j+1, k, n) - u(i, j+2, k, n), u(i, j  , k, n) - u(i, j+1, k, n))
                        );

    const T tvd_lo_y = amrex::max(uface_y(i, j, k), zero) * (
                            u(i, j-1, k, n) + 0.5 * limiter(u(i, j-1, k, n) - u(i, j-2, k, n), u(i, j  , k, n) - u(i, j-1, k, n))
                        )
                        +
                        amrex::min(uface_y(i, j, k), zero) * (
                            u(i, j, k, n) + 0.5 * limiter(u(i, j  , k, n) - u(i, j+1, k, n), u(i, j-1, k, n) - u(i, j  , k, n))
                        );
    const T tvd_term_y = tvd_hi_y - tvd_lo_y;

    return tvd_term_x + tvd_term_y;
}

namespace {

namespace test{

template<HOScheme hos, TVDScheme tvds, typename T>
struct deferred_correction {
    [[nodiscard]]
    AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
    static
    T impl (int i, int j, int k, int n,
                amrex::Array4<T const> const& uface_x,
                amrex::Array4<T const> const& uface_y,
                amrex::Array4<T const> const& u,
                amrex::GpuArray<T,AMREX_SPACEDIM> const& dxinv,
                T blend_factor) noexcept
    {
        // static_assert(hos == HOScheme::TVD, "Invalid combination of schemes");
        const T ho_term = tvd_scheme<hos, tvds>(i, j, k, n, uface_x, uface_y, u);
        const T upwind_term  = upwind_scheme (i, j, k, n, uface_x, uface_y, u);

        return blend_factor * dxinv[n] * (upwind_term - ho_term);
    }
};

} // namespace test
} // End anonymous namespace

template <HOScheme hos, TVDScheme tvds, typename T>
[[nodiscard]]
AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
T deferred_correction (int i, int j, int k, int n,
                    amrex::Array4<T const> const& uface_x,
                    amrex::Array4<T const> const& uface_y,
                    amrex::Array4<T const> const& u,
                    amrex::GpuArray<T,AMREX_SPACEDIM> const& dxinv,
                    T blend_factor) noexcept
{
    return test::deferred_correction<hos, tvds, T>::impl(i, j, k, n, uface_x, uface_y, u, dxinv, blend_factor);
}

#undef EPSILON


void testParallelFor()
{

    int n_cell = 512;
    int max_grid_size = 32;

    Geometry geom;
    BoxArray grids;
    DistributionMapping dmap;
    {
        RealBox rb ({AMREX_D_DECL(0.0, 0.0, 0.0)}, {AMREX_D_DECL(1.0, 1.0, 1.0)});
        Array<int,AMREX_SPACEDIM> is_periodic {AMREX_D_DECL(0,0,0)}; // Not periodic
        Geometry::Setup(&rb, 0, is_periodic.data());
        Box domain(IntVect(AMREX_D_DECL(0,0,0)), IntVect(AMREX_D_DECL(n_cell-1,n_cell-1,n_cell-1)));
        geom.define(domain);

        grids.define(domain);
        grids.maxSize(max_grid_size);
        dmap.define(grids);
    }

    MultiFab vel;
    MultiFab rhs;
    MultiFab gp;
    MultiFab Usrc;
    Array<MultiFab,AMREX_SPACEDIM> velface;

    // Fill vel, rhs, gp, velface with random values
    {
        vel.define(grids, dmap, AMREX_SPACEDIM, 2);
        rhs.define(grids, dmap, AMREX_SPACEDIM, 0);
        gp.define(grids, dmap, AMREX_SPACEDIM, 0);
        Usrc.define(grids, dmap, AMREX_SPACEDIM, 0);
        for (int idim = 0; idim < AMREX_SPACEDIM; ++idim) {
            velface[idim].define(amrex::convert(grids, IntVect::TheDimensionVector(idim)), dmap, 1, 0);
        }

        vel.setVal(0.0);
        rhs.setVal(0.0);
        gp.setVal(0.0);
        for (int idim = 0; idim < AMREX_SPACEDIM; ++idim) {
            velface[idim].setVal(0.0);
        }

        for (MFIter mfi(vel); mfi.isValid(); ++mfi) {
            const Box& bx = mfi.validbox();
            auto const& vel_arr = vel.array(mfi);
            auto const& rhs_arr = rhs.array(mfi);
            auto const& gp_arr = gp.array(mfi);
            for (int idim = 0; idim < AMREX_SPACEDIM; ++idim) {
                auto const& velface_arr = velface[idim].array(mfi);
                amrex::ParallelFor(bx, AMREX_SPACEDIM,
                    [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) noexcept
                    {
                        vel_arr(i,j,k,n) = amrex::Random();
                        rhs_arr(i,j,k,n) = amrex::Random();
                        gp_arr(i,j,k,n) = amrex::Random();
                        velface_arr(i,j,k) = amrex::Random();
                    }
                );
            }
        }
    }

    BL_PROFILE_VAR("Deferred Correction Function", amrex_parallel_for_fn);

    for (int i = 0; i < 100; ++i) {
        const auto& dxinv = geom.InvCellSizeArray();
        const Real blend_factor = 0.5;
        AMREX_D_TERM(auto& uxface = velface[0];,
                    auto& uyface = velface[1];,
                    auto& uzface = velface[2];);

        MFItInfo mfi_info;
        if (Gpu::notInLaunchRegion()) { mfi_info.EnableTiling().SetDynamic(true); }
#ifdef AMREX_USE_OMP
#pragma omp parallel if (Gpu::notInLaunchRegion())
#endif
        for (MFIter mfi(vel, mfi_info); mfi.isValid(); ++mfi)
        {
            const Box& bx = mfi.tilebox();
            const auto& vel_fab = vel.const_array(mfi);
            const auto& gp_fab = gp.const_array(mfi);
            const auto& Usrc_fab = Usrc.array(mfi);

            AMREX_D_TERM(const auto& uxface_fab = uxface.const_array(mfi);,
                        const auto& uyface_fab = uyface.const_array(mfi);,
                        const auto& uzface_fab = uzface.const_array(mfi););
            amrex::ParallelFor(
                bx, AMREX_SPACEDIM,
                [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) noexcept -> void {
                    amrex::Real deferred_correction_term = deferred_correction<HOScheme::TVD, TVDScheme::UMIST>(i, j, k, n,
                            AMREX_D_DECL(uxface_fab, uyface_fab, uzface_fab),
                            vel_fab, dxinv, blend_factor);
                    Usrc_fab(i, j, k, n) = - gp_fab(i, j, k, n) + deferred_correction_term;
                }
            );
        }
    }
    BL_PROFILE_VAR_STOP(amrex_parallel_for_fn);

    BL_PROFILE_VAR("Deferred Correction Macro", amrex_parallel_for_macro);

    for (int i = 0; i < 100; ++i) {
        const auto& dxinv = geom.InvCellSizeArray();
        const Real blend_factor = 0.5;
        AMREX_D_TERM(auto& uxface = velface[0];,
                    auto& uyface = velface[1];,
                    auto& uzface = velface[2];);

        MFItInfo mfi_info;
        if (Gpu::notInLaunchRegion()) { mfi_info.EnableTiling().SetDynamic(true); }
#ifdef AMREX_USE_OMP
#pragma omp parallel if (Gpu::notInLaunchRegion())
#endif
        for (MFIter mfi(vel, mfi_info); mfi.isValid(); ++mfi)
        {
            const Box& bx = mfi.tilebox();
            const auto& vel_fab = vel.const_array(mfi);
            const auto& gp_fab = gp.const_array(mfi);
            const auto& Usrc_fab = Usrc.array(mfi);

            AMREX_D_TERM(const auto& uxface_fab = uxface.const_array(mfi);,
                        const auto& uyface_fab = uyface.const_array(mfi);,
                        const auto& uzface_fab = uzface.const_array(mfi););

            AMREX_HOST_DEVICE_PARALLEL_FOR_4D(bx, AMREX_SPACEDIM, i, j, k, n,
            {
                amrex::Real deferred_correction_term = (deferred_correction<HOScheme::TVD, TVDScheme::UMIST>(i, j, k, n,
                        AMREX_D_DECL(uxface_fab, uyface_fab, uzface_fab), vel_fab, dxinv, blend_factor));
                Usrc_fab(i, j, k, n) = - gp_fab(i, j, k, n) + deferred_correction_term;
            });
        }
    }
    BL_PROFILE_VAR_STOP(amrex_parallel_for_macro);
}

int main(int argc, char* argv[])
{
    amrex::Initialize(argc, argv);

    amrex::Print() << "Running ParallelFor\n";

    testParallelFor();
    amrex::Finalize();
}

Sorry for the long comment. For some reason, I can't attach a .cpp file...

@WeiqunZhang
Member

The code does not compile. If I make the following changes, it compiles.

diff --git a/lambda_vs_macro/main.cpp b/lambda_vs_macro/main.cpp
index ef31aa4..3545ffb 100644
--- a/lambda_vs_macro/main.cpp
+++ b/lambda_vs_macro/main.cpp
@@ -1,6 +1,5 @@
 #include <AMReX.H>
 #include <AMReX_MultiFabUtil.H>
-#include <AMReX_TagBox.H>
 #include <AMReX_ParmParse.H>
 
 using namespace amrex;
@@ -252,7 +251,7 @@ void testParallelFor()
                 [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) noexcept -> void {
                     amrex::Real deferred_correction_term = deferred_correction<HOScheme::TVD, TVDScheme::UMIST>(i, j, k, n,
                             AMREX_D_DECL(uxface_fab, uyface_fab, uzface_fab),
-                            vel_fab, dxinv, blend_factor);
+                            dxinv, blend_factor);
                     Usrc_fab(i, j, k, n) = - gp_fab(i, j, k, n) + deferred_correction_term;
                 }
             );
@@ -288,7 +287,7 @@ void testParallelFor()
             AMREX_HOST_DEVICE_PARALLEL_FOR_4D(bx, AMREX_SPACEDIM, i, j, k, n,
             {
                 amrex::Real deferred_correction_term = (deferred_correction<HOScheme::TVD, TVDScheme::UMIST>(i, j, k, n,
-                        AMREX_D_DECL(uxface_fab, uyface_fab, uzface_fab), vel_fab, dxinv, blend_factor));
+                        AMREX_D_DECL(uxface_fab, uyface_fab, uzface_fab), dxinv, blend_factor));
                 Usrc_fab(i, j, k, n) = - gp_fab(i, j, k, n) + deferred_correction_term;
             });
         }

However, when I run it, there is a segfault.

@ankithadas
Contributor Author

Sorry, I forgot to add: please build with AMReX_SPACEDIM=2.

@WeiqunZhang
Member

This seems to be a gcc issue, especially with recent versions. For this test, https://github.com/WeiqunZhang/amrex-devtests/tree/main/lambda_vs_macro, I tried a number of compilers on my desktop computer and got:

| Compiler  | macro | lambda |
|-----------|-------|--------|
| gcc-9     | 1.355 |  1.720 |
| gcc-12    | 0.485 |  2.156 |
| clang-17  | 0.915 |  0.990 |
| icpx-2024 | 0.849 |  0.847 |

gcc-12 is very fast with the macro, but very slow with the lambda.

@WeiqunZhang
Member

With -finline-limit=100000, the gcc-12 number for lambda goes down to 0.483.

@WeiqunZhang
Member

Not sure what's a good number for us. We could even set -finline-limit=0 meaning unlimited.

@WeiqunZhang
Member

WeiqunZhang commented Mar 22, 2024

@ankithadas Could you verify that -finline-limit=100000 solves the performance issue for you? Then we can think about what's a good number to use.

@WeiqunZhang
Member

WeiqunZhang commented Mar 23, 2024

I must have made a mistake during testing: 0 seems to mean noinline, not unlimited. I did more investigation. Here is what I got for WarpX in terms of executable size and compile time; gcc -O3 -g was used.

-finline-limit=n
|      n | size (MB) | compile time |
|--------|-----------|--------------|
|   none |       395 | 6m50s        |
|  10000 |       429 | 7m26s        |
| 100000 |       429 | 7m30s        |

Maybe 100000 is a good number.

@WeiqunZhang
Member

Or the meaning of 0 has changed; it seems to mean different things in different gcc versions, so it's better to avoid it.

@BenWibking
Contributor

Maybe amrex::ParallelFor could be marked with __attribute__((flatten)) (e.g.: https://awesomekling.github.io/Smarter-C++-inlining-with-attribute-flatten/)?
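
For illustration, a minimal sketch of what the attribute does (hypothetical functions, not the AMReX API): flatten asks GCC to inline, recursively, every call made inside the annotated function's body, independent of the usual size heuristics.

#include <cstdio>

// Hypothetical callee that the default heuristics might leave out-of-line.
static double kernel_body (double x)
{
    double s = 0.0;
    for (int i = 0; i < 64; ++i) { s += x * (i + 1.0); }
    return s;
}

// With flatten, GCC attempts to inline every call inside launch(),
// recursively, unless a callee is explicitly marked noinline.
__attribute__((flatten))
double launch (double x) { return kernel_body(x); }

int main ()
{
    std::printf("%g\n", launch(2.0));
    return 0;
}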

@WeiqunZhang
Member

Nice! The flatten attribute works.

@WeiqunZhang
Member

The danger might be that it is too aggressive: it will also inline functions introduced by inlining, unless a callee is explicitly marked noinline.

@WeiqunZhang
Member

MSVC has another attribute, [[msvc::forceinline_calls]], in addition to flatten: https://learn.microsoft.com/en-us/cpp/cpp/attributes?view=msvc-170#msvcforceinline_calls

GCC does not seem to have an equivalent.

@WeiqunZhang
Member

We could add a new macro AMREX_FLATTEN, but using it on every ParallelFor might be dangerous.

@WeiqunZhang
Member

I added a new macro AMREX_FLATTEN in #3840.
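
A rough sketch of what such a macro can look like; the actual definition is in #3840 and may differ, so DEMO_FLATTEN below is only a stand-in.

#include <cstdio>

// Stand-in for the macro added in #3840: expand to the flatten
// attribute where the compiler supports it, and to nothing elsewhere.
#if defined(__GNUC__)
#  define DEMO_FLATTEN __attribute__((flatten))
#else
#  define DEMO_FLATTEN
#endif

static double work (double x) { return x * x + 1.0; }

// On GCC/Clang, calls inside run() are inlined into its body.
DEMO_FLATTEN double run (double x) { return work(x); }

int main ()
{
    std::printf("%g\n", run(4.0));
    return 0;
}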

@WeiqunZhang
Member

In #3841, -finline-limit=43210 is added by default.

@ankithadas
Contributor Author

I did some performance testing (without -finline-limit) and it looks like gcc-13.1 (g++ (Ubuntu 13.1.0-8ubuntu1~22.04)) is not much better either

| Compiler  | macro | lambda |
|-----------|-------|--------|
| gcc-11    | 0.4744| 1.061  |
| gcc-12    | 0.4726| 1.982  |
| gcc-13    | 1.694 | 2.043  |

Using gcc-11, without -finline-limit=10000

Total Timers     =       9.48 seconds.


Total times ------------------------------------------------------------------------
                                 Function Name      NCalls        Time     Percent %
  NavierStokes::computeSourceTerm-TVD-MinMod()        2402      2.6888       28.37 %
                    MLABecLaplacian::Fsmooth()       22136      0.9474        9.99 %
                  MLAConBecLaplacian::Fapply()        8490      0.9132        9.63 %
                               PhysBCFunct::()       10239      0.7486        7.90 %
                            FabArray::setVal()       37874      0.7017        7.40 %
                                 amrex::Copy()       19148      0.3881        4.09 %
                             FabArray::Saxpy()       30084      0.3274        3.45 %
         MLAConBecLaplacian::fillMatrxCoeffs()        2216      0.2720        2.87 %
         NavierStokes::calculateFaceVelocity()        2216      0.2699        2.85 %

With -finline-limit=10000

Total Timers     =       7.56 seconds.


Total times ------------------------------------------------------------------------
                                 Function Name      NCalls        Time     Percent %
                    MLABecLaplacian::Fsmooth()       22136      0.9508       12.58 %
                  MLAConBecLaplacian::Fapply()        8490      0.8931       11.82 %
  NavierStokes::computeSourceTerm-TVD-MinMod()        2402      0.7421        9.82 %
                               PhysBCFunct::()       10239      0.7339        9.71 %
                            FabArray::setVal()       37874      0.6996        9.26 %
                                 amrex::Copy()       19148      0.3842        5.08 %
                             FabArray::Saxpy()       30086      0.3269        4.33 %
         MLAConBecLaplacian::fillMatrxCoeffs()        2216      0.3216        4.26 %

In short, setting -finline-limit to a big enough number seems to fix this issue. However, I think this lambda should be inlined regardless of its size (after all, it is an rvalue lambda).

I agree with @BenWibking's suggestion of marking amrex::ParallelFor with __attribute__((flatten)). The name ParallelFor implies that parallel optimizations such as vectorization, inlining, and loop unrolling will be applied to the lambda being passed. If these optimizations are not expected, the user should use amrex::For instead.
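
For reference, a minimal sketch of the two entry points (my understanding: on CPU builds amrex::ParallelFor applies AMREX_PRAGMA_SIMD to the loop body while amrex::For emits a plain loop; both accept the same lambda):

#include <AMReX_FArrayBox.H>
#include <AMReX_GpuLaunch.H>

void sketch ()
{
    amrex::Box bx(amrex::IntVect(AMREX_D_DECL(0,0,0)),
                  amrex::IntVect(AMREX_D_DECL(63,63,63)));
    amrex::FArrayBox fab(bx, 1);
    auto const& a = fab.array();

    // ParallelFor: the CPU loop carries a SIMD pragma.
    amrex::ParallelFor(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept {
            a(i,j,k) = 1.0;
        });

    // For: the same iteration space, without the SIMD pragma.
    amrex::For(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept {
            a(i,j,k) += 1.0;
        });
}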

@WeiqunZhang
Member

We could add a build flag for people to opt in.

However, I think this lambda should be inlined regardless of its size (after all, it is an rvalue lambda).

I don't think its value category is important. We could do:

auto f = [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept { /* ... */ };
ParallelFor(box, f); // f is an lvalue

So ParallelFor is not a function template that takes only rvalues.
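
A standalone sketch of why both forms compile (a stand-in, not the real signature): ParallelFor-style templates take the callable as a forwarding reference, which binds lvalues and rvalues alike.

#include <cstdio>

// Minimal stand-in for ParallelFor's callable parameter: F&& here is
// a forwarding reference, so it accepts lvalues and rvalues alike.
template <typename F>
void parallel_for_like (F&& f) { f(0, 0, 0, 0); }

int main ()
{
    auto f = [] (int i, int, int, int) { std::printf("i = %d\n", i); };
    parallel_for_like(f);                          // lvalue callable
    parallel_for_like([] (int, int, int, int) {}); // rvalue callable
    return 0;
}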

@ankithadas
Contributor Author

Sorry, I forgot the rules about forwarding references. I think a compiler flag to opt in sounds like a good idea.

@WeiqunZhang linked a pull request Mar 27, 2024 that will close this issue
@WeiqunZhang
Member

#3860: Looks like having it on by default might make compilation too slow.
