Support basic import/export for parquet format. (#1446)
### What problem does this PR solve?

Support basic import/export for the Parquet format.
Remaining work:
1. Support exporting data with sparse vectors and tensors.
2. Support the Python and HTTP APIs.
3. Add test cases.

Issue link: #1330

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
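
The workflow changes in this commit add the same group of Arrow options to every CMake configure step. A minimal sketch of assembling that configure command for a local build follows. The option names are copied from the workflow files, while `SRC_DIR`, `BUILD_DIR`, `arrow_cmake_flags`, and `configure_cmd` are hypothetical names introduced for illustration. The script only prints the command so it can be reviewed before running.

```shell
# Sketch: assemble the Arrow-related CMake options used by the CI workflows in this commit.
# SRC_DIR and BUILD_DIR are hypothetical local paths, not taken from the commit.
SRC_DIR="${SRC_DIR:-$PWD}"
BUILD_DIR="${BUILD_DIR:-$SRC_DIR/cmake-build-release}"

arrow_cmake_flags() {
    # One option per line, copied from the workflow files.
    printf '%s\n' \
        -DARROW_BUILD_SHARED=OFF \
        -DARROW_ENABLE_TIMING_TESTS=OFF \
        -DARROW_GGDB_DEBUG=OFF \
        -DARROW_PARQUET=ON \
        -DARROW_DEPENDENCY_USE_SHARED=OFF
}

configure_cmd() {
    # Print the configure command instead of running it.
    # The unquoted substitution is intentional: each flag becomes its own word.
    echo cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo $(arrow_cmake_flags) -S "$SRC_DIR" -B "$BUILD_DIR"
}

configure_cmd
```

To run the printed command, pipe it to a shell, for example `configure_cmd | sh`, provided the paths contain no spaces.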

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Ognimalf and JinHai-CN authored Jul 21, 2024
1 parent 7de87bf commit f25f591
Showing 1,001 changed files with 379,447 additions and 40,513 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -76,7 +76,7 @@ jobs:
- name: Build release version
run: |
sed -i "s/^version = \".*\"/version = \"$(echo $RELEASE_TAG | cut -c2-)\"/" pyproject.toml
- sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-release && mkdir -p cmake-build-release && cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCPACK_PACKAGE_VERSION=${{ env.RELEASE_TAG }} -DCPACK_DEBIAN_PACKAGE_ARCHITECTURE=amd64 -DCMAKE_JOB_POOLS:STRING='link=1' -S /infinity -B /infinity/cmake-build-release && cmake --build /infinity/cmake-build-release --target infinity"
+ sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-release && mkdir -p cmake-build-release && cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DARROW_BUILD_SHARED=OFF -DARROW_ENABLE_TIMING_TESTS=OFF -DARROW_GGDB_DEBUG=OFF -DARROW_PARQUET=ON -DARROW_DEPENDENCY_USE_SHARED=OFF -DCPACK_PACKAGE_VERSION=${{ env.RELEASE_TAG }} -DCPACK_DEBIAN_PACKAGE_ARCHITECTURE=amd64 -DCMAKE_JOB_POOLS:STRING='link=1' -S /infinity -B /infinity/cmake-build-release && cmake --build /infinity/cmake-build-release --target infinity"
- name: Download resources
run: rm -rf resource && git clone --depth=1 https://github.com/infiniflow/resource.git
2 changes: 1 addition & 1 deletion .github/workflows/slow_test.yml
@@ -37,7 +37,7 @@ jobs:
- name: Build release version
if: ${{ !cancelled() && !failure() }}
- run: sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-release && mkdir -p cmake-build-release && cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_JOB_POOLS:STRING=link=4 -S /infinity -B /infinity/cmake-build-release && cmake --build /infinity/cmake-build-release"
+ run: sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-release && mkdir -p cmake-build-release && cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DARROW_BUILD_SHARED=OFF -DARROW_ENABLE_TIMING_TESTS=OFF -DARROW_GGDB_DEBUG=OFF -DARROW_PARQUET=ON -DARROW_DEPENDENCY_USE_SHARED=OFF -DCMAKE_JOB_POOLS:STRING=link=4 -S /infinity -B /infinity/cmake-build-release && cmake --build /infinity/cmake-build-release"

- name: Download resources
run: rm -rf resource && git clone --depth=1 https://github.com/infiniflow/resource.git
4 changes: 2 additions & 2 deletions .github/workflows/tests.yml
@@ -58,7 +58,7 @@ jobs:
- name: Build debug version
if: ${{ !cancelled() && !failure() }}
- run: sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-debug && mkdir -p cmake-build-debug && cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_JOB_POOLS:STRING=link=4 -S /infinity -B /infinity/cmake-build-debug && cmake --build /infinity/cmake-build-debug --target infinity test_main"
+ run: sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-debug && mkdir -p cmake-build-debug && cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_SHARED=OFF -DARROW_ENABLE_TIMING_TESTS=OFF -DARROW_GGDB_DEBUG=OFF -DARROW_PARQUET=ON -DARROW_DEPENDENCY_USE_SHARED=OFF -DENABLE_JEMALLOC=OFF -DCMAKE_JOB_POOLS:STRING=link=4 -S /infinity -B /infinity/cmake-build-debug && cmake --build /infinity/cmake-build-debug --target infinity test_main"

- name: Unit test debug version
if: ${{ !cancelled() && !failure() }}
@@ -146,7 +146,7 @@ jobs:
- name: Build release version
if: ${{ !cancelled() && !failure() }}
- run: sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-release && mkdir -p cmake-build-release && cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_JOB_POOLS:STRING=link=4 -S /infinity -B /infinity/cmake-build-release && cmake --build /infinity/cmake-build-release --target infinity test_main knn_import_benchmark knn_query_benchmark"
+ run: sudo docker exec ${BUILDER_CONTAINER} bash -c "git config --global safe.directory \"*\" && cd /infinity && rm -fr cmake-build-release && mkdir -p cmake-build-release && cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DARROW_BUILD_SHARED=OFF -DARROW_ENABLE_TIMING_TESTS=OFF -DARROW_GGDB_DEBUG=OFF -DARROW_PARQUET=ON -DARROW_DEPENDENCY_USE_SHARED=OFF -DENABLE_JEMALLOC=OFF -DCMAKE_JOB_POOLS:STRING=link=4 -S /infinity -B /infinity/cmake-build-release && cmake --build /infinity/cmake-build-release --target infinity test_main knn_import_benchmark knn_query_benchmark"

- name: Unit test release version
if: ${{ !cancelled() && !failure() }}
35 changes: 21 additions & 14 deletions CMakeLists.txt
@@ -116,18 +116,22 @@ elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "Debug")
set(CMAKE_CXX_FLAGS "-O0 -g")
set(CMAKE_C_FLAGS "-O0 -g")

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-stack-protector -fno-var-tracking ")
add_compile_options(-fsanitize=address -fsanitize-recover=all -fsanitize=leak)
add_link_options(-fsanitize=address -fsanitize-recover=all -fsanitize=leak)
if(NOT ENABLE_JEMALLOC)

add_compile_options("-fno-omit-frame-pointer")
add_link_options("-fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-stack-protector -fno-var-tracking ")
add_compile_options(-fsanitize=address -fsanitize-recover=all -fsanitize=leak)
add_link_options(-fsanitize=address -fsanitize-recover=all -fsanitize=leak)

# add_compile_options("-fsanitize=undefined")
# add_link_options("-fsanitize=undefined")
add_compile_options("-fno-omit-frame-pointer")
add_link_options("-fno-omit-frame-pointer")

# add_compile_options("-fsanitize=thread")
# add_link_options("-fsanitize=thread")
# add_compile_options("-fsanitize=undefined")
# add_link_options("-fsanitize=undefined")

# add_compile_options("-fsanitize=thread")
# add_link_options("-fsanitize=thread")

endif()

set(CMAKE_DEBUG_POSTFIX "")

@@ -166,12 +170,15 @@ endif()
find_package(Lz4 REQUIRED)

# You can disable jemalloc by passing the `-DENABLE_JEMALLOC=OFF` option to CMake.
option(ENABLE_JEMALLOC "Enable jemalloc support" ON)
if(ENABLE_JEMALLOC AND NOT "${CMAKE_BUILD_TYPE}" STREQUAL "Debug")
option(ENABLE_JEMALLOC "Enable jemalloc support" OFF)
if(ENABLE_JEMALLOC)
find_package(jemalloc REQUIRED)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DENABLE_JEMALLOC_PROF")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DENABLE_JEMALLOC_PROF")
endif()
set(JEMALLOC_STATIC_LIB "jemalloc.a")
if(NOT "${CMAKE_BUILD_TYPE}" STREQUAL "Debug")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DENABLE_JEMALLOC_PROF")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DENABLE_JEMALLOC_PROF")
endif ()
endif ()

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC")

106 changes: 82 additions & 24 deletions benchmark/local_infinity/CMakeLists.txt
@@ -6,21 +6,32 @@ add_executable(infinity_benchmark
target_include_directories(infinity_benchmark PUBLIC "${CMAKE_SOURCE_DIR}/src")
target_link_libraries(
infinity_benchmark
infinity_core
benchmark_profiler
infinity_core
sql_parser
onnxruntime_mlas
zsv_parser
newpfor
fastpfor
# profiler
jma
opencc
dl
parquet.a
arrow.a
thrift.a
thriftnb.a
lz4.a
atomic.a
event.a
c++.a
c++abi.a
jma
opencc
${JEMALLOC_STATIC_LIB}
)

target_link_directories(infinity_benchmark PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(infinity_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

# ########################################
# knn
# import benchmark
@@ -38,14 +49,25 @@ target_link_libraries(
zsv_parser
newpfor
fastpfor
jma
opencc
dl
lz4.a
atomic.a
event.a
c++.a
c++abi.a
jma
opencc
# profiler
parquet.a
arrow.a
thrift.a
thriftnb.a
${JEMALLOC_STATIC_LIB}
)

target_link_directories(knn_import_benchmark BEFORE PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(knn_import_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

# query benchmark
add_executable(knn_query_benchmark
./knn/knn_query_benchmark.cpp
@@ -61,14 +83,23 @@ target_link_libraries(
zsv_parser
newpfor
fastpfor
jma
opencc
dl
lz4.a
atomic.a
c++.a
c++abi.a
jma
opencc
parquet.a
arrow.a
thrift.a
thriftnb.a
${JEMALLOC_STATIC_LIB}
)

target_link_directories(knn_query_benchmark BEFORE PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(knn_query_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

# ########################################
# fulltext
# import benchmark
@@ -86,14 +117,23 @@ target_link_libraries(
zsv_parser
newpfor
fastpfor
jma
opencc
dl
lz4.a
atomic.a
c++.a
c++abi.a
jma
opencc
parquet.a
arrow.a
thrift.a
thriftnb.a
${JEMALLOC_STATIC_LIB}
)

target_link_directories(fulltext_benchmark BEFORE PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(fulltext_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

# ########################################
add_executable(sparse_benchmark
./sparse/sparse_benchmark.cpp
@@ -109,14 +149,23 @@ target_link_libraries(
zsv_parser
newpfor
fastpfor
jma
opencc
dl
lz4.a
atomic.a
jma
c++.a
c++abi.a
opencc
parquet.a
arrow.a
thrift.a
thriftnb.a
${JEMALLOC_STATIC_LIB}
)

target_link_directories(sparse_benchmark BEFORE PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(sparse_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

add_executable(bmp_benchmark
./sparse/bmp_benchmark.cpp
)
@@ -131,14 +180,23 @@ target_link_libraries(
zsv_parser
newpfor
fastpfor
jma
opencc
dl
lz4.a
atomic.a
jma
c++.a
c++abi.a
opencc
parquet.a
arrow.a
thrift.a
thriftnb.a
${JEMALLOC_STATIC_LIB}
)

target_link_directories(bmp_benchmark BEFORE PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(bmp_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

add_executable(hnsw_benchmark
./knn/hnsw_benchmark.cpp
)
@@ -153,23 +211,23 @@ target_link_libraries(
zsv_parser
newpfor
fastpfor
jma
opencc
dl
lz4.a
atomic.a
jma

c++.a
c++abi.a
opencc
parquet.a
arrow.a
thrift.a
thriftnb.a
${JEMALLOC_STATIC_LIB}
)

if(ENABLE_JEMALLOC)
target_link_libraries(infinity_benchmark jemalloc.a)
target_link_libraries(knn_import_benchmark jemalloc.a)
target_link_libraries(knn_query_benchmark jemalloc.a)
target_link_libraries(fulltext_benchmark jemalloc.a)
target_link_libraries(sparse_benchmark jemalloc.a)
target_link_libraries(bmp_benchmark jemalloc.a)
target_link_libraries(hnsw_benchmark jemalloc.a)
endif()
target_link_directories(hnsw_benchmark BEFORE PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(hnsw_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

# add_definitions(-march=native)
# add_definitions(-msse4.2 -mfma)
13 changes: 7 additions & 6 deletions benchmark/remote_infinity/CMakeLists.txt
@@ -11,6 +11,7 @@ target_include_directories(remote_query_benchmark PUBLIC "${CMAKE_SOURCE_DIR}/sr
target_include_directories(remote_query_benchmark PUBLIC "${CMAKE_SOURCE_DIR}/third_party/thrift/lib/cpp/src")
target_include_directories(remote_query_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/thrift/")
target_link_directories(remote_query_benchmark PUBLIC "${CMAKE_BINARY_DIR}/lib")
target_link_directories(remote_query_benchmark PUBLIC "${CMAKE_BINARY_DIR}/third_party/arrow/")

target_link_libraries(
remote_query_benchmark
@@ -21,19 +22,19 @@ target_link_libraries(
zsv_parser
newpfor
fastpfor
jma
opencc
dl
lz4.a
atomic.a
thrift.a
c++.a
c++abi.a
jma
opencc
parquet.a
arrow.a
${JEMALLOC_STATIC_LIB}
)

if(ENABLE_JEMALLOC)
target_link_libraries(remote_query_benchmark jemalloc.a)
endif()

# add_definitions(-march=native)
# add_definitions(-msse4.2 -mfma)
# add_definitions(-mavx2 -mf16c -mpopcnt)
14 changes: 14 additions & 0 deletions docs/getstarted/build_from_source.md
@@ -136,6 +136,20 @@ sudo ln -s /usr/bin/clang-format-18 /usr/bin/clang-format
sudo ln -s /usr/bin/clang-tidy-18 /usr/bin/clang-tidy
sudo ln -s /usr/bin/llvm-symbolizer-18 /usr/bin/llvm-symbolizer
sudo ln -s /usr/lib/llvm-18/include/x86_64-pc-linux-gnu/c++/v1/__config_site /usr/lib/llvm-18/include/c++/v1/__config_site
sudo apt install -y -V ca-certificates lsb-release
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y -V libarrow-dev libparquet-dev
wget https://github.com/infiniflow/arrow/archive/refs/heads/main.zip -O arrow.zip
unzip arrow.zip
cd arrow-main && cd cpp && mkdir build && cd build
export CC=/usr/bin/clang-18
export CXX=/usr/bin/clang++-18
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_ENABLE_TIMING_TESTS=OFF -DARROW_GGDB_DEBUG=OFF -DARROW_PARQUET=ON ..
ninja -j 0 arrow_static parquet_static
sudo cp ./release/libarrow.a /usr/lib/x86_64-linux-gnu/libarrow.a
sudo cp ./release/libparquet.a /usr/lib/x86_64-linux-gnu/libparquet.a
cd ../../../ && rm -rf arrow-main
```
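
Before moving on, it may help to confirm that both static libraries landed where the `cp` commands put them, since a missing archive would otherwise surface only as a link error later. A minimal sketch, assuming the install directory used in the commands above; `check_arrow_libs` is a hypothetical helper name, not part of the repository.

```shell
# Sketch: confirm the static Arrow and Parquet libraries exist before building Infinity.
# check_arrow_libs is a hypothetical helper; the default directory matches the cp commands above.
check_arrow_libs() {
    lib_dir="${1:-/usr/lib/x86_64-linux-gnu}"
    missing=0
    for lib in libarrow.a libparquet.a; do
        if [ -f "$lib_dir/$lib" ]; then
            echo "found:   $lib_dir/$lib"
        else
            echo "missing: $lib_dir/$lib" >&2
            missing=1
        fi
    done
    # Return non-zero if any library is absent.
    return "$missing"
}

check_arrow_libs || echo "re-run the ninja and cp steps before building Infinity" >&2
```

Pass a different directory as the first argument if the libraries were copied elsewhere.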

### Step2 Download Source Code
4 changes: 4 additions & 0 deletions python/README.md
@@ -41,6 +41,10 @@ Build the debug version of infinity-sdk in the target location `cmake-build-debu
```shell
pip install . -v --config-settings=cmake.build-type="Debug" --config-settings=build-dir="cmake-build-debug"
```
Note: If you run the release version with the jemalloc compile flag turned on, you must preload jemalloc through the `LD_PRELOAD` environment variable, for example:
```shell
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so python3 example/simple_example.py
```
Note: If you run the debug version, you must preload **libasan** through the `LD_PRELOAD` environment variable, for example:
```shell
LD_PRELOAD=/usr/lib/llvm-18/lib/clang/18/lib/x86_64-pc-linux-gnu/libclang_rt.asan.so python3 example/simple_example.py