[BladeLLM] Support dispatch feature for BladeLLM (#86)
Co-authored-by: Xinyi-ECNU <1668529909@qq.com>
KuilongCui and Xinyi-ECNU authored Dec 18, 2024
1 parent 4029b00 commit 156ce24
Showing 50 changed files with 1,221 additions and 196 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
# Proto files
*_pb2.py
*_pb2_grpc.py

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
28 changes: 27 additions & 1 deletion Makefile
@@ -17,7 +17,7 @@ init:

.PHONY: install
install:
@pip install -e .
@pip install -e .[vllm]

.PHONY: lint
lint: check_pylint_installed check_pytest_installed
@@ -27,6 +27,30 @@ lint: check_pylint_installed check_pytest_installed
--disable=protected-access,super-init-not-called,unused-argument,redefined-outer-name,invalid-name \
-s n --jobs=128 ./tests

.PHONY: clean
clean: proto-clean

###################################### proto begin ######################################

.PHONY: proto
proto:
@find . -type d -name "proto" | while read dir; do \
dir_base=$$(dirname $$dir); \
find $$dir -name "*.proto" | while read proto_file; do \
echo "Compiling $$proto_file"; \
PYTHONWARNINGS="ignore::DeprecationWarning" python -m grpc_tools.protoc --proto_path=. --python_out=. --grpc_python_out=. $$proto_file; \
done; \
done;

.PHONY: proto-clean
proto-clean:
@find . -name "*_pb2_grpc.py" | xargs rm -f
@find . -name "*_pb2.py" | xargs rm -f

####################################### proto end #######################################
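The `proto` target above compiles every `.proto` file found under a `proto/` directory into the `*_pb2.py` / `*_pb2_grpc.py` modules that the new `.gitignore` entries exclude, and `proto-clean` removes them again. As a rough illustration only (assuming `grpcio-tools` is installed; the Makefile rule remains the canonical entry point), the same compile step could be driven from Python like this:

```python
# Illustrative sketch of what `make proto` does, using grpc_tools directly.
# Assumes grpcio-tools is installed; the Makefile target is the real entry point.
import pathlib
from grpc_tools import protoc

for proto_file in pathlib.Path(".").rglob("proto/*.proto"):
    print(f"Compiling {proto_file}")
    # Mirrors: python -m grpc_tools.protoc --proto_path=. --python_out=. --grpc_python_out=. <file>
    ret = protoc.main([
        "grpc_tools.protoc",
        "--proto_path=.",
        "--python_out=.",
        "--grpc_python_out=.",
        str(proto_file),
    ])
    if ret != 0:
        raise RuntimeError(f"protoc failed for {proto_file}")
```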

###################################### test begin #######################################

.PHONY: test
test: check_pytest_installed
@pytest -v --ignore=third_party/ --ignore=tests/e2e_test --disable-warnings
@@ -55,6 +79,8 @@ bench_test:
migration_test:
@pytest -v -x -s --tb=long ./tests/e2e_test/test_migration.py

####################################### test end ########################################

#################### pygloo install for gloo migration backend begin ####################

BAZEL_CMD = bazel
22 changes: 22 additions & 0 deletions configs/bladellm.yml
@@ -0,0 +1,22 @@
SERVER:
RAY_CLUSTER_PORT: 6379
LAUNCH_RAY_CLUSTER: True
REQUEST_OUTPUT_QUEUE_TYPE: "rayqueue"

MANAGER:
DISABLE_FIXED_NODE_INIT_INSTANCE: False
DISABLE_INIT_INSTANCE_BY_MANAGER: True

LOAD_METRIC: 'remaining_steps'
DISPATCH_POLICY: 'load'

ENABLE_MIGRATION: False
ENABLE_DEFRAG: True
REQUEST_MIGRATION_POLICY: 'SR'

MIGRATION_BACKEND: 'grpc'
MIGRATION_BUFFER_BLOCKS: 512

ENABLE_SCALING: False

LOG_INSTANCE_INFO: False
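Each key in this new BladeLLM config mirrors a server or manager argument: `SERVER` entries map to the entrypoint options, and `MANAGER` entries map to the lower-cased `EngineManagerArgs` fields (e.g. `DISPATCH_POLICY` to `dispatch_policy`). A minimal reading sketch, assuming PyYAML is available; llumnix itself presumably consumes the file through its `config_file` entrypoint argument rather than loading it by hand:

```python
# Illustrative only: shows how the YAML keys correspond to llumnix argument names.
# Assumes PyYAML; llumnix presumably loads the file via its config_file argument.
import yaml

with open("configs/bladellm.yml") as f:
    cfg = yaml.safe_load(f)

# e.g. DISPATCH_POLICY: 'load' corresponds to the dispatch_policy manager argument
manager_overrides = {key.lower(): value for key, value in cfg["MANAGER"].items()}
print(manager_overrides["dispatch_policy"], manager_overrides["migration_backend"])
```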
File renamed without changes.
24 changes: 20 additions & 4 deletions docs/Arguments.md
@@ -32,8 +32,11 @@ usage: -m llumnix.entrypoints.vllm.api_server [-h]
[--profiling-result-file-path PROFILING_RESULT_FILE_PATH]
[--gpu-type GPU_TYPE]
[--polling-interval POLLING_INTERVAL]
[--migration-backend {gloo,nccl,rpc}]
[--migration-backend {gloo,nccl,rayrpc,grpc,kvtransfer}]
[--migration-buffer-blocks MIGRATION_BUFFER_BLOCKS]
[--migration-backend-transfer-type {cuda_ipc,rdma,}]
[--migration-backend-kvtransfer-naming-url MIGRATION_BACKEND_KVTRANSFER_NAMING_URL]
[--migration-backend-server-address MIGRATION_BACKEND_SERVER_ADDRESS]
[--migration-backend-init-timeout MIGRATION_BACKEND_INIT_TIMEOUT]
[--migration-num-layers MIGRATION_NUM_LAYERS]
[--last-stage-max-blocks LAST_STAGE_MAX_BLOCKS]
@@ -144,11 +147,24 @@

`--migration-backend`
- Communication backend of migration.
- Possible choices: gloo, rpc
- Default: "rpc"
- Possible choices: gloo, rayrpc, nccl, grpc, kvtransfer. [gloo, rayrpc, nccl] are available for vLLM, and [grpc, kvtransfer] are available for BladeLLM.
- Default: "gloo"

`--migration-backend-transfer-type`
- Transfer type for the kvTransfer migration backend.
- Possible choices: cuda_ipc, rdma
- Default: "rdma"

`--migration-backend-server-address`
- Address of the gRPC server used by the migration backend.
- Default: "127.0.0.1:50051"

`--migration-backend-kvtransfer-naming-url`
- URL of the naming server used by the kvtransfer migration backend.
- Default: "file:/tmp/llumnix/naming/"

`--migration-buffer-blocks`
- Number of cache blocks in migration.
- Number of buffer blocks in migration.
- Default: 512

`--migration-backend-init-timeout`
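Putting the new options together, here is a small sketch of parsing the BladeLLM-oriented migration flags through `EngineManagerArgs.add_cli_args`; the values below are placeholders, not recommended settings:

```python
# Sketch: parse the migration flags added in this commit; values are placeholders.
import argparse
from llumnix.arg_utils import EngineManagerArgs

parser = argparse.ArgumentParser()
parser = EngineManagerArgs.add_cli_args(parser)
args = parser.parse_args([
    "--migration-backend", "kvtransfer",
    "--migration-backend-transfer-type", "rdma",
    "--migration-backend-kvtransfer-naming-url", "file:/tmp/llumnix/naming/",
    "--migration-buffer-blocks", "512",
])
print(args.migration_backend, args.migration_backend_transfer_type)
```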
2 changes: 1 addition & 1 deletion docs/Quickstart.md
@@ -22,7 +22,7 @@ cd llumnix
make install
```

The default migration backend is RPC. If you want to use NCCL as the migration backend, run `make cupy-cuda` to install [cupy-cuda](https://pypi.org/search/?q=cupy-cuda) manually, as it is related to the CUDA version.
The default migration backend is rayrpc. If you want to use NCCL as the migration backend, run `make cupy-cuda` to install [cupy-cuda](https://pypi.org/search/?q=cupy-cuda) manually, as it is related to the CUDA version.

If you want to use Gloo as migration backend, **in addition to installing cupy-cuda**, please refer to [this link](https://github.com/ZeldaHuang/pygloo/blob/main/.github/workflows/ubuntu_basic.yml#L24C1-L26C1) to install [Bazel](https://github.com/bazelbuild/bazel) >= 5.1.0. Then, run `make pygloo` to install [pygloo](https://github.com/ZeldaHuang/pygloo).

4 changes: 2 additions & 2 deletions examlpes/offline_inference.py
@@ -6,7 +6,7 @@

from llumnix import launch_ray_cluster, connect_to_ray_cluster, init_manager, init_llumlets
from llumnix import (SamplingParams, ServerInfo, EngineManagerArgs, LLMEngineManager, Llumlet,
EngineArgs, QueueType)
EngineArgs, QueueType, BackendType)
from llumnix.utils import random_uuid
from llumnix.queue.ray_queue_server import RayQueueServer

@@ -40,7 +40,7 @@
llumlets: List[Llumlet] = None
llumlet_ids, llumlets = init_llumlets(
manager_args, engine_args, ray.get_runtime_context().get_node_id(),
QueueType("rayqueue")
QueueType("rayqueue"), BackendType.VLLM, 1,
)


23 changes: 17 additions & 6 deletions llumnix/__init__.py
@@ -11,9 +11,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import vllm
from vllm import *

from llumnix.server_info import ServerInfo
from llumnix.entrypoints.setup import (launch_ray_cluster,
connect_to_ray_cluster,
@@ -23,8 +20,8 @@
from llumnix.llm_engine_manager import LLMEngineManager
from llumnix.llumlet.llumlet import Llumlet
from llumnix.queue.queue_type import QueueType

from .version import __version__
from llumnix.backends.backend_interface import BackendType
from llumnix.version import __version__

__all__ = [
"__version__",
@@ -37,6 +34,20 @@
"LLMEngineManager",
"Llumlet",
"QueueType",
"BackendType",
]

__all__.extend(getattr(vllm, "__all__", []))
try:
import vllm
from vllm import *
__all__.extend(getattr(vllm, "__all__", []))
except ImportError:
pass

# TODO(KuilongCui): import blade_llm after cuda is ready
# try:
# import blade_llm
# from blade_llm import *
# __all__.extend(getattr(blade_llm, "__all__", []))
# except ImportError:
# pass
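With vLLM now an optional extra (`pip install -e .[vllm]` in the Makefile) and its symbols re-exported only when importable, downstream scripts can guard their own imports the same way; a small hedged sketch:

```python
# Hedged sketch: mirror the guarded import so scripts degrade gracefully when
# the vLLM extra is not installed (SamplingParams is re-exported from vllm).
try:
    from llumnix import SamplingParams
    HAS_VLLM = True
except ImportError:
    SamplingParams = None
    HAS_VLLM = False

if not HAS_VLLM:
    print("vLLM backend unavailable; install it with `pip install -e .[vllm]`.")
```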
41 changes: 32 additions & 9 deletions llumnix/arg_utils.py
@@ -51,7 +51,7 @@ class LlumnixEntrypointsArgs:
request_output_queue_port: int = None
disable_log_requests_server: bool = None
log_request_timestamps: bool = None
config_file: bool = None
config_file: str = None

def __post_init__(self):
for attr in dataclasses.fields(self):
@@ -132,9 +132,12 @@ class EngineManagerArgs:
log_instance_info: bool = None
profiling_result_file_path: str = None

migration_backend_kvtransfer_naming_url: str = None
migration_backend_server_address: str = None
migration_backend_init_timeout: float = None
migration_backend: str = None
migration_buffer_blocks: int = None
migration_backend_transfer_type: str = None
migration_num_layers: int = None
last_stage_max_blocks: int = None
max_stages: int = None
@@ -177,7 +180,10 @@ def create_migration_config(self) -> MigrationConfig:
self.migration_num_layers,
self.last_stage_max_blocks,
self.max_stages,
self.migration_backend_init_timeout)
self.migration_backend_init_timeout,
self.migration_backend_transfer_type,
self.migration_backend_server_address,
self.migration_backend_kvtransfer_naming_url)
return migration_config

@classmethod
@@ -194,16 +200,23 @@ def check_args(cls, args: 'EngineManagerArgs', parser: argparse.ArgumentParser):
# pylint: disable=protected-access
for action in parser._optionals._actions:
if hasattr(action, 'choices') and action.choices is not None and hasattr(args, action.dest):
assert getattr(args, action.dest) in action.choices, f"{action.dest} should be one of {action.choices}."
cur_arg = getattr(args, action.dest)
assert cur_arg in action.choices, f"{action.dest} should be one of {action.choices}, but {cur_arg} is set."

# vllm only
assert args.migration_backend != 'gloo' or (args.migration_backend == 'gloo' \
and not args.disable_init_instance_by_manager and not args.disable_fixed_node_init_instance), \
("When using gloo as migration backend, "
"do not set --disable-init-instance-by-manager and --disable-fixed-node-init-instance.")

# bladellm only
assert args.migration_backend not in ['kvtransfer'] or (args.migration_backend == 'kvtransfer' \
and args.migration_backend_transfer_type), \
("When using kvTransfer as migration backend, "
"do not set --migration-backend-transfer-type as empty.")

@staticmethod
def add_cli_args(
parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
parser.add_argument('--disable-fixed-node-init-instance',
action='store_true',
help='disable fixing the placement of instance to current node')
@@ -302,17 +315,27 @@ def add_cli_args(
parser.add_argument('--profiling-result-file-path',
type=str,
help='profiling result file path')

parser.add_argument('--migration-backend',
type=str,
choices=['gloo', 'nccl', 'rpc'],
help='communication backend of migration')
choices=['gloo','nccl','rayrpc','grpc','kvtransfer'],
help='communication backend of migration, [gloo, rayrpc, nccl] are available for vllm \
and [grpc, kvtransfer] are available for bladellm')
parser.add_argument('--migration-backend-transfer-type',
type=str,
choices=['cuda_ipc','rdma', ''],
help='transfer type for migration backend grpc and kvTransfer')
parser.add_argument('--migration-backend-server-address',
type=str,
help='address of grpc server for migration backend')
parser.add_argument('--migration-backend-kvtransfer-naming-url',
type=str,
help='url of naming server for kvtransfer migration backend')
parser.add_argument('--migration-backend-init-timeout',
type=float,
help='timeout(s) for initializing migration backend')
parser.add_argument('--migration-buffer-blocks',
type=int,
help='number of cache blocks in migration')
help='number of buffer blocks in migration')
parser.add_argument('--migration-num-layers',
type=int,
help='number of kv-cache layers to transfer in each round during migration')
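The new fields feed straight into `create_migration_config`; below is a hedged sketch of building a gRPC-backend migration config from `EngineManagerArgs`. The field names come from this diff, and fields left unset fall back to llumnix's defaults:

```python
# Hedged sketch: construct EngineManagerArgs with the BladeLLM-related migration
# fields added in this commit and derive a MigrationConfig from them.
from llumnix.arg_utils import EngineManagerArgs

manager_args = EngineManagerArgs(
    migration_backend="grpc",
    migration_backend_transfer_type="cuda_ipc",
    migration_backend_server_address="127.0.0.1:50051",
    migration_buffer_blocks=512,
)
migration_config = manager_args.create_migration_config()
print(migration_config)
```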
8 changes: 2 additions & 6 deletions llumnix/backends/backend_interface.py
@@ -27,13 +27,15 @@ class EngineState(str, Enum):
class BackendType(str, Enum):
VLLM = "VLLM"
SIM_VLLM = "SIM_VLLM"
BLADELLM = "BLADELLM"

@staticmethod
def is_sim_backend(status: "BackendType") -> bool:
return status in [
BackendType.SIM_VLLM,
]

# TODO(KuilongCui): separate backend interface into two parts: DispatchBackendInterface and MigrationBackendInterface
class BackendInterface(ABC):
# Methods for inference
@abstractmethod
@@ -67,12 +69,6 @@ def abort_request(self, request_id: Union[str, Iterable[str]]) -> None:
"""
raise NotImplementedError

@abstractmethod
async def _start_engine_step_loop(self) -> None:
"""Start step loop of backend engine.
"""
raise NotImplementedError

# Methods for migration
@abstractmethod
def get_request_incremental_blocks(self, backend_request: LlumnixRequest, pre_stage_num_blocks: int) -> List[int]:
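`BLADELLM` joins the `BackendType` enum alongside `VLLM` and `SIM_VLLM`, while `is_sim_backend` continues to flag only the simulated backend. A tiny sketch of branching on the extended enum follows; the per-backend descriptions are illustrative and not llumnix's actual factory logic:

```python
# Minimal sketch of using the extended BackendType enum; the string descriptions
# are illustrative and not part of llumnix.
from llumnix.backends.backend_interface import BackendType

def describe(backend_type: BackendType) -> str:
    if BackendType.is_sim_backend(backend_type):
        return "simulated backend (no real engine)"
    if backend_type == BackendType.BLADELLM:
        return "BladeLLM engine"
    return "vLLM engine"

for bt in (BackendType.VLLM, BackendType.SIM_VLLM, BackendType.BLADELLM):
    print(bt.value, "->", describe(bt))
```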
12 changes: 12 additions & 0 deletions llumnix/backends/bladellm/__init__.py
@@ -0,0 +1,12 @@
# Copyright (c) 2024, Alibaba Group;
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.