
[BUG]: Runtime is not as expected #918

Open
hansenidden18 opened this issue Feb 6, 2025 · 8 comments

Comments

@hansenidden18

Problem description

Dear developer,

I was trying to compare the interface overhead of nanobind vs. pybind11 vs. pure Python with a simple numpy.sum() replication. The nanobind result was not what I expected, even after I had already passed the value by reference.

[Image: benchmark timings]

The attached results show that nanobind runs much slower than pybind11 and even Python itself. I would expect nanobind to run at least as fast as pybind11. Here is my code for the nanobind sum; am I misusing it?

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
namespace nb = nanobind;

int process_array(const nb::ndarray<const int32_t>& input) { 
   int sum = 0;
   for(size_t i = 0; i < input.size(); i++) {
       sum += input.data()[i];
   }
   return sum;
}

NB_MODULE(array_module, m) {
   m.def("process_array", &process_array, "Sum elements in int32 array");
}

Thank you.

@wojdyr
Contributor

wojdyr commented Feb 6, 2025

Note that using data() like this will only work for contiguous arrays.

Perhaps it would be faster with view(); see https://nanobind.readthedocs.io/en/latest/ndarray.html#fast-array-views

@wjakob
Owner

wjakob commented Feb 8, 2025

Thanks @wojdyr, that's right.

C++ cannot vectorize over an arbitrary nb::ndarray since the representation is too generic. You need several things to get comparable performance:

  1. Specify the array dimension (e.g. nb::shape<-1>).
  2. Expect contiguous arrays (e.g. nb::c_contig).
  3. Get a view and then loop over that.

If you are comparing performance across projects, you will also want to add the NOMINSIZE CMake flag to nanobind_add_module. @hansenidden18 Can you please incorporate these changes and then report back?

@hansenidden18
Author

Thank you for all of the responses.

For @wojdyr: I have tried using view() and added the -O3 flag in the compilation to optimize it. The result indeed improves significantly, but it is still not as fast as pybind11.

[Image: benchmark timings with view() and -O3]

As for @wjakob's suggestions, I tried those as well, but unfortunately the result is not as expected.

[Image: benchmark timings after the suggested changes]

Somehow, the performance is better with just view() and the -O3 flag. Am I misusing your suggestions? Below are my CMakeLists.txt and my source code.

Thank you.

CMakeLists.txt

cmake_minimum_required(VERSION 3.15...3.26)
project(array_module)
find_package(Python 3.10 COMPONENTS Interpreter Development REQUIRED)
find_package(nanobind CONFIG REQUIRED)
nanobind_add_module(array_module array.cpp NOMINSIZE)

add_compile_options(-O3)

array.cpp

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
namespace nb = nanobind;

int process_array(const nb::ndarray<const int32_t, nb::shape<-1>, nb::c_contig>& input) { 
   int sum = 0;
   auto v = input.view<const int32_t, nb::ndim<1>>();
   for(size_t i = 0; i < v.shape(0); i++) {
       sum += v(i);
   }
   return sum;
}

NB_MODULE(array_module, m) {
   m.def("process_array", &process_array, "Sum elements in int32 array");
}

@hpkfft
Contributor

hpkfft commented Feb 11, 2025

Just some quick thoughts:

  1. Are you using cmake -DCMAKE_BUILD_TYPE=Release?
  2. If you are (and you should be), I think the default is -O3, so you probably don't need add_compile_options(-O3).
  3. Since you now have nb::c_contig, you can use data() if you like:

     const int32_t* a = input.data();
     const size_t size = input.shape(0);
     int sum = 0;
     for (size_t i = 0; i < size; i++) {
         sum += a[i];
     }
     return sum;

  4. How are you measuring performance from Python? In particular, is the array already of the correct dtype? If not, it may be converted for you (which is expensive). So, in Python, are you doing something like a = np.zeros(4096, dtype=np.int32)? Also, if conversion is the problem, you might like

     m.def("process_array", &process_array, nb::arg().noconvert(), "Sum elements..........");
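Picking up on the dtype point, a quick way to sanity-check the array on the Python side before benchmarking (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

# An int32 array matches the int32_t binding and is passed through as-is;
# a default float64 array would trigger an implicit conversion (or a
# TypeError once nb::arg().noconvert() is set on the binding).
a = np.zeros(4096, dtype=np.int32)
b = np.zeros(4096)  # dtype defaults to float64

print(a.dtype)  # int32
print(b.dtype)  # float64
```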

Anyway, I hope something above is useful or sparks an idea....

@hansenidden18
Author

Thank you for the responses again.

Your suggestions indeed bring the performance close to pybind11's.

[Image: updated benchmark timings]

On the Python side, I use a = np.ones(1024, dtype=np.int32) and have also tried a = np.random.randint(0, 100, size=1024, dtype=np.int32).
My expectation is that nanobind will run as fast as or faster than pybind11, which is why I created this issue in the first place.
Thank you.

@wjakob
Owner

wjakob commented Feb 14, 2025

Hi @hansenidden18,

It's still weird that nanobind would be 200 ms slower :-/. Can you post a full reproducer, including your Python driver script?

@wjakob
Owner

wjakob commented Feb 25, 2025

ping @hansenidden18

@hansenidden18
Author

Hi @wjakob, sorry for the late reply.

Here is all of the code that I used.
test.py

import numpy as np
import time
import array_module
import math_module

def py_add(arr) -> int:
    return np.sum(arr)

def benchmark(iterations: int = 1000000):
    #ini = np.ones(1024, dtype=np.int32)
    ini = np.random.randint(0, 100, size=1024, dtype=np.int32)  # Random integers between 0 and 100

    # Warmup
    print(f"Nanobind result: {array_module.process_array(ini)}")
    print(f"Python result: {py_add(ini)}")
    print(f"Pybind result: {math_module.process_array(ini)}")    
    start = time.perf_counter()
    for _ in range(iterations):
        array_module.process_array(ini)
    nanobind_time = time.perf_counter() - start
    
    start = time.perf_counter()
    for _ in range(iterations):
        math_module.process_array(ini)
    pybind_time = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(iterations):
        py_add(ini)
    python_time = time.perf_counter() - start
    

    print(f"Pybind:   {pybind_time:.6f} seconds")
    print(f"Nanobind: {nanobind_time:.6f} seconds")
    print(f"Python:   {python_time:.6f} seconds")
    print(f"Nanobind Overhead: {(nanobind_time/python_time - 1)*100:.1f}%")
    print(f"Pybind Overhead: {(pybind_time/python_time - 1)*100:.1f}%")

if __name__ == "__main__":
    benchmark()
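As an aside on the driver script: for microbenchmarks of this size, timeit.repeat with the minimum over several runs is usually less noisy than a single perf_counter loop. A hedged sketch, using np.sum as a stand-in for the bound modules:

```python
import timeit

import numpy as np

arr = np.random.randint(0, 100, size=1024, dtype=np.int32)

# Substitute array_module.process_array / math_module.process_array for
# np.sum once the extension modules are built; taking the minimum of
# several repeats filters out scheduling noise.
best = min(timeit.repeat(lambda: np.sum(arr), number=10_000, repeat=5))
print(f"best of 5 runs: {best:.6f} s ({best / 10_000 * 1e9:.1f} ns/call)")
```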

And here is the nanobind implementation

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
namespace nb = nanobind;

int process_array(const nb::ndarray<const int32_t, nb::shape<-1>, nb::c_contig>& input) {
   const int32_t* a = input.data();
   const size_t size = input.shape(0);
   int sum = 0;
   for(size_t i = 0; i < size; i++) {
       sum += a[i];
   }
   return sum;
}

NB_MODULE(array_module, m) {
   m.def("process_array", &process_array, nb::arg().noconvert(), "Sum elements..........");
}

Lastly, this is my CMakeLists.txt

cmake_minimum_required(VERSION 3.15...3.26)
project(array_module)
find_package(Python 3.10 COMPONENTS Interpreter Development REQUIRED)
find_package(nanobind CONFIG REQUIRED)
nanobind_add_module(array_module array.cpp NOMINSIZE)

Thank you.
