Added 0.1.x quickstart #1167

Closed · wants to merge 9 commits
6 changes: 2 additions & 4 deletions .github/workflows/slow_test.yml
@@ -1,8 +1,6 @@
name: slow_tests

on:
schedule:
- cron: '0 21 * * *' # This schedule runs daily at 21:00:00Z (05:00:00+08:00)
# The "create tags" trigger is specifically focused on the creation of new tags, while the "push tags" trigger is activated when tags are pushed, including both new tag creations and updates to existing tags.
create:
tags:
@@ -31,7 +29,7 @@ jobs:
if: ${{ !cancelled() && !failure() }}
run: |
TZ=$(readlink -f /etc/localtime | awk -F '/zoneinfo/' '{print $2}')
sudo docker rm -f infinity_build && sudo docker run -d --privileged --name infinity_build -e TZ=$TZ -v $PWD:/infinity -v /boot:/boot infiniflow/infinity_builder:ubuntu2310
sudo docker rm -f infinity_build && sudo docker run -d --privileged --name infinity_build -e TZ=$TZ -v $PWD:/infinity -v /boot:/boot infiniflow/infinity_builder:centos7

- name: Build release version
if: ${{ !cancelled() && !failure() }}
@@ -43,7 +41,7 @@ jobs:

- name: Prepare dataset
if: ${{ !cancelled() && !failure() }}
run: sudo mkdir -p test/data/benchmark && sudo ln -s $HOME/benchmark_dataset/dbpedia-entity test/data/benchmark/dbpedia-entity && sudo ln -s $HOME/benchmark_dataset/sift1M test/data/benchmark/sift_1m
run: sudo mkdir -p test/data/benchmark && sudo ln -s $HOME/benchmark/dbpedia-entity test/data/benchmark/dbpedia-entity && sudo ln -s $HOME/benchmark/sift1M test/data/benchmark/sift_1m

- name: benchmark test
if: ${{ !cancelled() && !failure() }}
5 changes: 4 additions & 1 deletion .github/workflows/tests.yml
@@ -2,7 +2,10 @@ name: tests

on:
push:
branches: [ main, libcxx ]
branches:
- 'main'
- '*.*.*'
- 'libcxx'
paths-ignore:
- 'docs/**'
- '*.md'
4 changes: 2 additions & 2 deletions README.md
@@ -57,7 +57,7 @@ Supports a wide range of data types including strings, numerics, vectors, and more.
```bash
sudo mkdir -p /var/infinity && sudo chown -R $USER /var/infinity
docker pull infiniflow/infinity:nightly
docker run -d --name infinity -v /var/infinity/:/var/infinity --network=host infiniflow/infinity:nightly
docker run -d --name infinity -v /var/infinity/:/var/infinity --ulimit nofile=500000:500000 --network=host infiniflow/infinity:nightly
```

#### Deploy Infinity using binary package on Linux x86_64
@@ -84,7 +84,7 @@ See [Build from Source](docs/build_from_source.md).
`infinity-sdk` requires Python 3.10+.

```bash
pip3 install infinity-sdk
pip3 install infinity-sdk==0.1.0
```

### Import necessary modules
8 changes: 0 additions & 8 deletions docs/_category_.json

This file was deleted.

8 changes: 8 additions & 0 deletions docs/getstarted/_category_.json
@@ -0,0 +1,8 @@
{
"label": "Get started",
"position": 0,
"link": {
"type": "generated-index",
"description": "quickstart and more"
}
}
docs/getstarted/build_from_source.md
@@ -1,6 +1,6 @@
---
sidebar_position: 1
slug: /
sidebar_position: 2
slug: /build_from_source
---

# Build from Source
108 changes: 108 additions & 0 deletions docs/getstarted/quickstart.md
@@ -0,0 +1,108 @@
---
sidebar_position: 1
slug: /
---

# Quickstart

## Prerequisites

- CPU: >= 4 cores, with FMA and SSE4_2 support
- RAM: >= 16 GB
- Disk: >= 50 GB
- OS: Linux x86_64 or aarch64
- glibc: >= 2.17
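
A quick way to verify these requirements on the target machine before installing, using only standard Linux commands (a sketch; adjust the disk path to wherever you plan to store data):

```bash
grep -o -E 'fma|sse4_2' /proc/cpuinfo | sort -u   # should print both: fma, sse4_2
nproc                                             # core count, expect >= 4
free -g                                           # total RAM, expect >= 16 GB
df -h /var                                        # free disk, expect >= 50 GB
ldd --version | head -n1                          # glibc version, expect >= 2.17
```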

## Deploy Infinity database

### Deploy Infinity using Docker on Linux x86_64 and macOS x86_64

```bash
sudo mkdir -p /var/infinity && sudo chown -R $USER /var/infinity
docker pull infiniflow/infinity:nightly
docker run -d --name infinity -v /var/infinity/:/var/infinity --ulimit nofile=500000:500000 --network=host infiniflow/infinity:nightly
```
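
To confirm the container came up cleanly, standard Docker commands suffice (the service name matches the `--name infinity` flag above):

```bash
docker ps --filter name=infinity   # container should be listed as Up
docker logs --tail 20 infinity     # inspect startup output for errors
```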

### Deploy Infinity using binary package on Linux x86_64

You can download the binary package (deb, rpm, or tgz) for your host operating system from https://github.com/infiniflow/infinity/releases. The prebuilt packages are compatible with Linux distributions with glibc 2.17 or later, for example, RHEL 7, Debian 8, and Ubuntu 14.04.

Fedora/RHEL/CentOS/openSUSE
```bash
sudo rpm -i infinity-0.2.0-dev-x86_64.rpm
sudo systemctl start infinity
```

Ubuntu/Debian
```bash
sudo dpkg -i infinity-0.2.0-dev-x86_64.deb
sudo systemctl start infinity
```
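
The tgz package mentioned above has no install command here. A starting point is sketched below; the archive name mirrors the rpm/deb naming and its internal layout is an assumption, not confirmed by this PR, so list the contents before doing anything else:

```bash
# Inspect the archive layout first (filename pattern assumed from the rpm/deb names)
tar -tzf infinity-0.2.0-dev-x86_64.tgz | head
tar -zxvf infinity-0.2.0-dev-x86_64.tgz
# Run the extracted binary directly, or wire it into systemd yourself
```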
### 🛠️ Build from Source

See [Build from Source](./build_from_source.md).

## Install a Python client

`infinity-sdk` requires Python 3.10+.

```bash
pip3 install infinity-sdk
```
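
Installing into a virtual environment keeps the SDK and its dependencies isolated; this is standard Python tooling, not specific to Infinity:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip3 install infinity-sdk
```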

## Import necessary modules

```python
import infinity
import infinity.index as index
from infinity.common import REMOTE_HOST
from infinity.common import ConflictType
```



## Connect to the remote server

```python
infinity_obj = infinity.connect(REMOTE_HOST)
```
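
`REMOTE_HOST` is a convenience constant for a local server. To reach a server elsewhere, the SDK's `NetworkAddress` (imported by the benchmark client later in this PR) can be passed instead; the host below is a placeholder, and 23817 is the assumed default port:

```python
from infinity import NetworkAddress

# Placeholder host; 23817 is assumed to be the default Infinity port
infinity_obj = infinity.connect(NetworkAddress("192.168.1.10", 23817))
```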


## Get a database

```python
db = infinity_obj.get_database("default_db")
```


## Create a table

```python
# Drop my_table if it already exists
db.drop_table("my_table", ConflictType.Ignore)
# Create a table named "my_table"
table = db.create_table(
"my_table", {
"num": {"type": "integer"},
"body": {"type": "varchar"},
"vec": {"type": "vector, 4, float"}
})
```
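
Note that the imports above bring in `infinity.index`, which the quickstart never uses. For reference, creating an HNSW index on the vector column looks roughly like the sketch below; the parameter names and values are assumptions based on the v0.1.x SDK and should be checked against the Python API Reference:

```python
# Sketch only: parameter names ("M", "ef_construction", "metric") are assumptions
table.create_index(
    "my_index",
    [index.IndexInfo("vec",
                     index.IndexType.Hnsw,
                     [index.InitParameter("M", "16"),
                      index.InitParameter("ef_construction", "50"),
                      index.InitParameter("metric", "l2")])],
    ConflictType.Error)
```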


## Insert two records

```python
table.insert([{"num": 1, "body": "unnecessary and harmful", "vec": [1.0, 1.2, 0.8, 0.9]}])
table.insert([{"num": 2, "body": "Office for Harmful Blooms", "vec": [4.0, 4.2, 4.3, 4.5]}])
```
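
`insert()` takes a list of records, so the two calls above can presumably be combined into one batch, which matters once row counts grow:

```python
# Both rows in a single call; insert() already accepts a list of records
table.insert([
    {"num": 1, "body": "unnecessary and harmful", "vec": [1.0, 1.2, 0.8, 0.9]},
    {"num": 2, "body": "Office for Harmful Blooms", "vec": [4.0, 4.2, 4.3, 4.5]},
])
```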


## Execute a vector search

```python
res = table.output(["*"]).knn("vec", [3.0, 2.8, 2.7, 3.1], "float", "ip", 2).to_pl()
print(res)
```
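
The `knn()` arguments are column name, query vector, element type, distance metric, and top-k; `"ip"` requests inner-product similarity. Swapping in Euclidean distance is assumed to work the same way (`"l2"` is the name used elsewhere in this PR's benchmark code, but verify against the API reference):

```python
# Same query with Euclidean distance instead of inner product ("l2" assumed)
res = table.output(["num", "body"]).knn("vec", [3.0, 2.8, 2.7, 3.1], "float", "l2", 2).to_pl()
print(res)
```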

> 💡 For more information about the Python API, see the [Python API Reference](../references/pysdk_api_reference.md).
2 changes: 1 addition & 1 deletion docs/references/_category_.json
@@ -1,6 +1,6 @@
{
"label": "References",
"position": 2,
"position": 3,
"link": {
"type": "generated-index",
"description": "miscellaneous references"
19 changes: 18 additions & 1 deletion docs/references/benchmark.md
@@ -3,6 +3,23 @@ sidebar_position: 1
slug: /benchmark
---
# Benchmark
This document compares the following key specifications of Elasticsearch, Qdrant, and Infinity:

- QPS
- Recall
- Time to insert & build index
- Time to import & build index
- Disk usage
- Peak memory usage

## Versions
| | Version |
| ----------------- |---------|
| **Elasticsearch** | v8.13.0 |
| **Qdrant** | v1.8.2 |
| **Infinity** | v0.1.0 |

## Run Benchmark

1. Install necessary dependencies.
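
The dependency list itself is elided in this view, but the client modules in this PR import `elasticsearch`, `h5py`, and `numpy` alongside the Infinity SDK, so an install along these lines is assumed:

```bash
pip3 install infinity-sdk elasticsearch h5py numpy
```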

@@ -173,4 +190,4 @@ python remote_benchmark_knn.py -t 16 -r 1 -d gist_1m
- **Dataset**: SIFT1M; **topk**: 100; **recall**: 97%+
- **P99 QPS**: 15,688 (16 clients)
- **P99 Latency**: 0.36 ms
- **Memory usage**: 408 MB
- **Memory usage**: 408 MB
23 changes: 17 additions & 6 deletions python/benchmark/clients/base_client.py
@@ -1,12 +1,11 @@
import argparse
from abc import abstractmethod
from typing import Any, List, Optional, Dict, Union
from enum import Enum
from typing import Any
import subprocess
import sys
import os
from urllib.parse import urlparse
import time
import logging


class BaseClient:
"""
@@ -25,14 +24,21 @@ def __init__(self,
"""
pass

@abstractmethod
def upload(self):
"""
Upload data and build indexes (parameters are parsed by __init__).
"""
pass

@abstractmethod
def search(self) -> list[list[Any]]:
"""
Execute the corresponding query tasks (vector search, full-text search, hybrid search) based on the parsed parameters.
Returns a list of id lists (one inner list per query).
"""
pass

def download_data(self, url, target_path):
"""
Download the dataset and extract it to the target path.
@@ -59,6 +65,11 @@ def run_experiment(self, args):
"""
Run the experiment and save the results.
"""
if args.import_data:
start_time = time.time()
self.upload()
finish_time = time.time()
logging.info(f"upload finish, cost time = {finish_time - start_time}")
if args.query:
results = self.search()
self.check_and_save_results(results)
self.check_and_save_results(results)
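
The contract introduced here is small: subclasses implement `upload()` and `search()`, and the timing wrapper added to `run_experiment()` measures them. A minimal self-contained sketch of that shape (the dummy client and its numbers are illustrative only, not part of this PR):

```python
import logging
import time
from abc import ABC, abstractmethod
from typing import Any

class BaseClient(ABC):
    """Trimmed-down stand-in for python/benchmark/clients/base_client.py."""

    @abstractmethod
    def upload(self):
        """Upload data and build indexes."""

    @abstractmethod
    def search(self) -> list[list[Any]]:
        """Run queries; each result is [(id, score), ..., latency_ms]."""

class DummyClient(BaseClient):
    # Hypothetical client, used only to illustrate the contract.
    def upload(self):
        time.sleep(0.01)  # stand-in for bulk import plus index build

    def search(self) -> list[list[Any]]:
        return [[(0, 1.0), (1, 0.9), 0.42]]  # two hits plus latency in ms

logging.basicConfig(level=logging.INFO)
client = DummyClient()
start_time = time.time()
client.upload()
logging.info(f"upload finish, cost time = {time.time() - start_time}")
print(client.search())
```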
14 changes: 7 additions & 7 deletions python/benchmark/clients/elasticsearch_client.py
@@ -3,12 +3,12 @@
from elasticsearch import Elasticsearch, helpers
import json
import time
from typing import List, Optional
from typing import List
import os
import h5py
import uuid
import numpy as np
import csv
import logging

from .base_client import BaseClient

@@ -74,7 +74,7 @@ def upload(self):
for i, line in enumerate(data_file):
row = line.strip().split('\t')
if len(row) != len(headers):
print(f"row = {i}, row_len = {len(row)}, not equal headers len, skip")
logging.info(f"row = {i}, row_len = {len(row)}, not equal headers len, skip")
continue
row_dict = {header: value for header, value in zip(headers, row)}
current_batch.append({"_index": self.collection_name, "_id": uuid.UUID(int=i).hex, "_source": row_dict})
@@ -133,7 +133,7 @@ def search(self) -> list[list[Any]]:
Returns a list of id lists (one inner list per query).
"""
query_path = os.path.join(self.path_prefix, self.data["query_path"])
print(query_path)
logging.info(query_path)
results = []
_, ext = os.path.splitext(query_path)
if ext == '.json' or ext == '.jsonl':
@@ -184,7 +184,7 @@ def search(self) -> list[list[Any]]:
latency = (end - start) * 1000
result = [(uuid.UUID(hex=hit['_id']).int, hit['_score']) for hit in result['hits']['hits']]
result.append(latency)
print(f"{line[:-1]}, {latency}")
logging.info(f"{line[:-1]}, {latency}")
results.append(result)
else:
raise TypeError("Unsupported file type")
@@ -214,7 +214,7 @@ def check_and_save_results(self, results: List[List[Any]]):
precisions.append(precision)
latencies.append(result[-1])

print(
logging.info(
f'''mean_time: {np.mean(latencies)}, mean_precisions: {np.mean(precisions)},
std_time: {np.std(latencies)}, min_time: {np.min(latencies)}, \n
max_time: {np.max(latencies)}, p95_time: {np.percentile(latencies, 95)},
@@ -223,7 +223,7 @@ def check_and_save_results(self, results: List[List[Any]]):
latencies = []
for result in results:
latencies.append(result[-1])
print(
logging.info(
f'''mean_time: {np.mean(latencies)}, std_time: {np.std(latencies)},
max_time: {np.max(latencies)}, min_time: {np.min(latencies)},
p95_time: {np.percentile(latencies, 95)}, p99_time: {np.percentile(latencies, 99)}''')
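
One caveat with the `print` → `logging.info` migration in this file: the root logger defaults to the WARNING level, so these calls emit nothing unless the benchmark entry point configures logging (where that happens is not shown in this diff). A one-line setup along these lines is assumed:

```python
import logging

# Without this (or equivalent), every logging.info(...) call above is silent
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
```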
13 changes: 6 additions & 7 deletions python/benchmark/clients/infinity_client.py
@@ -5,13 +5,12 @@
import time
import numpy as np
from typing import Any, List
import logging

import infinity
import infinity.index as index
from infinity import NetworkAddress
from .base_client import BaseClient
import infinity.remote_thrift.infinity_thrift_rpc.ttypes as ttypes
import csv

class InfinityClient(BaseClient):
def __init__(self,
@@ -93,9 +92,9 @@ def upload(self):
for i, line in enumerate(data_file):
row = line.strip().split('\t')
if (i % 100000 == 0):
print(f"row {i}")
logging.info(f"row {i}")
if len(row) != len(headers):
print(f"row = {i}, row_len = {len(row)}, not equal headers len, skip")
logging.info(f"row = {i}, row_len = {len(row)}, not equal headers len, skip")
continue
row_dict = {header: value for header, value in zip(headers, row)}
current_batch.append(row_dict)
@@ -166,7 +165,7 @@ def search(self) -> list[list[Any]]:
latency = (time.time() - start) * 1000
result = [(row_id[0], score) for row_id, score in zip(res['ROW_ID'], res['SCORE'])]
result.append(latency)
print(f"{query}, {latency}")
logging.info(f"{query}, {latency}")
results.append(result)
else:
raise TypeError("Unsupported file type")
@@ -197,7 +196,7 @@ def check_and_save_results(self, results: List[List[Any]]):
precisions.append(precision)
latencies.append(result[-1])

print(
logging.info(
f'''mean_time: {np.mean(latencies)}, mean_precisions: {np.mean(precisions)},
std_time: {np.std(latencies)}, min_time: {np.min(latencies)},
max_time: {np.max(latencies)}, p95_time: {np.percentile(latencies, 95)},
@@ -206,7 +205,7 @@ def check_and_save_results(self, results: List[List[Any]]):
latencies = []
for result in results:
latencies.append(result[-1])
print(
logging.info(
f'''mean_time: {np.mean(latencies)}, std_time: {np.std(latencies)},
max_time: {np.max(latencies)}, min_time: {np.min(latencies)},
p95_time: {np.percentile(latencies, 95)}, p99_time: {np.percentile(latencies, 99)}''')