[Deployment] Support global launch in addition to local launch #88

Merged
merged 92 commits on Jan 10, 2025
Commits (92)
034d76a
demo
s5u13b Dec 17, 2024
b3ded9f
Serve demo done
s5u13b Dec 17, 2024
7159ca6
Add entrypoints actor unit test
s5u13b Dec 18, 2024
a2d4dd2
Change unit test time limit
s5u13b Dec 18, 2024
2e2b994
Fix pylint hang due to jobs concurrency
s5u13b Dec 18, 2024
fed2f59
placement group demo done
s5u13b Dec 19, 2024
7aa6cda
Doing manager service demo
s5u13b Dec 19, 2024
f6ac47a
Add pg api test
s5u13b Dec 23, 2024
075394b
New service codes done & Pylint
s5u13b Dec 23, 2024
c94b544
Refine codes
s5u13b Dec 23, 2024
1c76309
Test pg ready
s5u13b Dec 23, 2024
76c10b3
Test threading uvicorn run
s5u13b Dec 23, 2024
1bbe1e5
Refine
s5u13b Dec 23, 2024
39b885d
Fix
s5u13b Dec 23, 2024
1abb28e
Add actor demo
s5u13b Dec 23, 2024
135e4ab
Specify get_actor namespace
s5u13b Dec 23, 2024
21c1552
Fix TODO
s5u13b Dec 23, 2024
217c3e4
Testing service
s5u13b Dec 24, 2024
6f2a809
Fix _check_deployment_correctness_loop
s5u13b Dec 24, 2024
4fe40b8
Func _check_deployment_states_correctness_loop done
s5u13b Dec 24, 2024
2764e53
Test restart done
s5u13b Dec 24, 2024
907050c
Simplify deployment
s5u13b Dec 24, 2024
fb68879
Rename manager
s5u13b Dec 24, 2024
aa9db56
Fix unit test of llumlet
s5u13b Dec 24, 2024
c3bb8d0
Fix backends unit test
s5u13b Dec 24, 2024
c4acd92
Fix offline test
s5u13b Dec 24, 2024
cb9137f
Fix e2e test
s5u13b Dec 24, 2024
08b8c19
Remove pg
s5u13b Dec 24, 2024
eda4e1c
Fix lint & Support pg management in manager
s5u13b Dec 24, 2024
fca07ae
Fix offline test
s5u13b Dec 24, 2024
adb7576
Add TODO & Fix get_instance_name
s5u13b Dec 24, 2024
31da3f5
Fix backends unit test
s5u13b Dec 24, 2024
8a97178
Fix lint
s5u13b Dec 24, 2024
e87ef82
Use pg when use simulator
s5u13b Dec 25, 2024
2aa1f02
Fix global scheduler unit test
s5u13b Dec 25, 2024
467af60
Fix offline test
s5u13b Dec 25, 2024
55dc598
Refine logger
s5u13b Dec 26, 2024
8a7feeb
Fix lint
s5u13b Dec 26, 2024
85275c2
Move MANAGER_NAME to utils
s5u13b Dec 26, 2024
b219120
Refactor deployment and actor construction for supporting global depl…
s5u13b Jan 2, 2025
d89dfb8
Refine actor construction args
s5u13b Jan 2, 2025
54a7680
Fix _connect_to_instances
s5u13b Jan 2, 2025
221281d
Minors
s5u13b Jan 2, 2025
5c2c276
Pass lint, unit, e2e, offline test
s5u13b Jan 6, 2025
a4358e6
Refine api server unit test
s5u13b Jan 6, 2025
e39fc90
Refine api server unit test
s5u13b Jan 6, 2025
027325a
Fix api server unit test
s5u13b Jan 6, 2025
af1e148
Done api server unit test
s5u13b Jan 6, 2025
641cfed
Remove demo dir
s5u13b Jan 6, 2025
77a2c89
Pass test_init_server_and_instance
s5u13b Jan 6, 2025
ba4a1ca
Done test_clear_instance_ray_resources
s5u13b Jan 6, 2025
43886a8
Done test_auto_scale_up_loop_and_get_curr_deployment
s5u13b Jan 7, 2025
30d13b8
Refine global deployment test
s5u13b Jan 7, 2025
c6fe8fe
Done test_check_deployment_states_loop_and_auto_scale_up_loop
s5u13b Jan 7, 2025
5429aad
Add global deployment mode in bench test
s5u13b Jan 7, 2025
e895c09
Fix bench test
s5u13b Jan 7, 2025
98e95d4
Fix lint
s5u13b Jan 7, 2025
9121e19
Fix port increment
s5u13b Jan 7, 2025
0e97551
Updata ray requirements
s5u13b Jan 7, 2025
2561691
Fix cr comments
s5u13b Jan 7, 2025
8b4f257
Fix test_engine_step_exception
s5u13b Jan 7, 2025
de82d8f
Fix _check_deployment_states_loop
s5u13b Jan 7, 2025
fdb5485
Fix test_engine_step_exception
s5u13b Jan 7, 2025
e92b6b8
Minors
s5u13b Jan 7, 2025
a738d7c
Refine arguments
s5u13b Jan 8, 2025
c8f772c
Fix backends unit test
s5u13b Jan 8, 2025
0d76812
Fix global scheduler unit test
s5u13b Jan 8, 2025
7500be4
Fix offline test
s5u13b Jan 8, 2025
deb25b9
Add watch instance deployment time
s5u13b Jan 8, 2025
546d3df
Decrease WATCH_DEPLOYMENT_INTERVAL
s5u13b Jan 8, 2025
f0c86aa
Change TODOs
s5u13b Jan 8, 2025
3f31c58
Consider scheduling pg state
s5u13b Jan 8, 2025
207ba57
Minor
s5u13b Jan 8, 2025
d3cf1df
Fix benchmark
s5u13b Jan 8, 2025
7728307
Fix error raise
s5u13b Jan 8, 2025
2fa1a5f
Minor
s5u13b Jan 9, 2025
e86be3e
Update Quickstart
s5u13b Jan 9, 2025
658ae55
Add disable-keep-serve-process-alive
s5u13b Jan 9, 2025
33231e5
Fix typos in readme
s5u13b Jan 9, 2025
1f22a61
Simplify kill
s5u13b Jan 9, 2025
846eef9
Simplify initialize_placement_group
s5u13b Jan 9, 2025
0502668
Add one log
s5u13b Jan 9, 2025
f1af4e4
Call scale_up in init_instances
s5u13b Jan 9, 2025
8d59d29
Simplify FastAPIServerActor run
s5u13b Jan 9, 2025
2a4639c
Rename FastAPIServer to FastAPIServerActor
s5u13b Jan 9, 2025
ae89c4e
Rename deployment to launch
s5u13b Jan 9, 2025
3ef7ca0
Support port_offset kv store
s5u13b Jan 9, 2025
99ee561
Fix lint
s5u13b Jan 9, 2025
5dcefb9
Minor
s5u13b Jan 9, 2025
a27f9e8
Refine key value store function log
s5u13b Jan 9, 2025
fb1b841
FIx manager unit test
s5u13b Jan 9, 2025
8d8984d
Refine simulator mode
s5u13b Jan 10, 2025
2 changes: 1 addition & 1 deletion .github/workflows/unit_test.yml
@@ -20,7 +20,7 @@ jobs:
unit_tests:
needs: cancel_previous_workflows
runs-on: [self-hosted]
timeout-minutes: 30
timeout-minutes: 45
steps:
- name: Checkout
uses: actions/checkout@v4
8 changes: 4 additions & 4 deletions Makefile
@@ -21,8 +21,8 @@ install:

.PHONY: lint
lint: check_pylint_installed check_pytest_installed
@pylint --rcfile=.pylintrc -s n --jobs=128 ./llumnix
@pylint --rcfile=.pylintrc -s n --jobs=128 ./llumnix

@pylint --rcfile=.pylintrc \
--disable=protected-access,super-init-not-called,unused-argument,redefined-outer-name,invalid-name \
-s n --jobs=128 ./tests
@@ -53,15 +53,15 @@ proto-clean:

.PHONY: test
test: check_pytest_installed
@pytest -v --ignore=third_party/ --ignore=tests/e2e_test --disable-warnings
@pytest -v --ignore=third_party --ignore=tests/e2e_test --disable-warnings
@python examlpes/offline_inference.py
@pytest -v -x -s --tb=long ./tests/e2e_test/test_e2e.py
@pytest -v -x -s --tb=long ./tests/e2e_test/test_bench.py
@pytest -v -x -s --tb=long ./tests/e2e_test/test_migration.py

.PHONY: unit_test
unit_test: check_pytest_installed
@pytest -v --ignore=third_party/ --ignore=tests/e2e_test --disable-warnings
@pytest -v --ignore=third_party --ignore=tests/e2e_test --disable-warnings

.PHONY: offline_test
offline_test:
2 changes: 1 addition & 1 deletion README.md
@@ -47,7 +47,7 @@ Llumnix is easy to use with:

## Getting Started

If you are already utilizing vLLM for multi-instance LLM serving deployments, simply replace the vLLM serving deployment command `python -m vllm.entrypoints.api_server ...` for each instance with the command provided below:
If you are already utilizing vLLM for multi-instance LLM serving deployments, simply replace the vLLM serving deployment command `python -m entrypoints.vllm.api_server ...` for each instance with the command provided below:
```
python -m llumnix.entrypoints.vllm.api_server \
--host $HOST \
135 changes: 94 additions & 41 deletions docs/Arguments.md
@@ -6,17 +6,28 @@ Note: since Llumnix is still in alpha stage, the interface and arguments are *su

```
usage: -m llumnix.entrypoints.vllm.api_server [-h]
[--host HOST]
[--port PORT]
[--ssl-keyfile SSL_KEYFILE]
[--ssl-certfile SSL_CERTFILE]
[--log-level {debug,info,warning,error}]
[--launch-ray-cluster]
[--ray-cluster-port RAY_CLUSTER_PORT]
[--request-output-queue-type {rayqueue,zmq}]
[--request-output-queue-port REQUEST_OUTPUT_QUEUE_PORT]
[--disable-log-requests-server]
[--log-request-timestamps]
[--config-file CONFIG_FILE]
[--initial-instances INITIAL_INSTANCES]
[--load-metric {remaining_steps,usage_ratio}]
[--polling-interval POLLING_INTERVAL]
[--dispatch-policy {balanced,load,queue,rr}]
[--enable-migration]
[--enable-defrag]
[--pair-migration-frequency PAIR_MIGRATION_FREQUENCY]
[--pair-migration-policy {balanced,defrag_constrained,defrag_relaxed}]
[--migrate-out-threshold MIGRATE_OUT_THRESHOLD]
[--request-migration-policy {LCR,SR,LR,FCW,FCWSR}]
[--enable-defrag ENABLE_DEFRAG]
[--enable-scaling]
[--min-instances MIN_INSTANCES]
[--max-instances MAX_INSTANCES]
@@ -27,26 +38,69 @@ usage: -m llumnix.entrypoints.vllm.api_server [-h]
[--disable-log-requests-manager]
[--log-instance-info]
[--log-filename LOG_FILENAME]
[--simulator-mode]
[--profiling-result-file-path PROFILING_RESULT_FILE_PATH]
[--gpu-type GPU_TYPE]
[--polling-interval POLLING_INTERVAL]
[--migration-backend {gloo,nccl,rayrpc,grpc,kvtransfer}]
[--migration-buffer-blocks MIGRATION_BUFFER_BLOCKS]
[--migration-backend-transfer-type {cuda_ipc,rdma,}]
[--migration-backend-kvtransfer-naming-url MIGRATION_BACKEND_KVTRANSFER_NAMING_URL]
[--migration-backend-server-address MIGRATION_BACKEND_SERVER_ADDRESS]
[--migration-backend-init-timeout MIGRATION_BACKEND_INIT_TIMEOUT]
[--migration-num-layers MIGRATION_NUM_LAYERS]
[--last-stage-max-blocks LAST_STAGE_MAX_BLOCKS]
[--migration-backend-init-timeout MIGRATION_BACKEND_INIT_TIMEOUT]
[--migration-backend-transfer-type {cuda_ipc,rdma,}]
[--grpc-migration-backend-server-address GRPC_MIGRATION_BACKEND_SERVER_ADDRESS]
[--kvtransfer-migration-backend-naming-url KVTRANSFER_MIGRATION_BACKEND_NAMING_URL]
[--max-stages MAX_STAGES]
[--last-stage-max-blocks LAST_STAGE_MAX_BLOCKS]
[--enable-pd-disagg]
[--num-dispatch-instances NUM_DISPATCH_INSTANCES]
[--log-request-timestamps]

[--enable-port-increment]
```
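
For orientation, the following is one hypothetical way to combine a few of the options documented below; it is a sketch rather than a recommended configuration, and every value is a placeholder (engine options such as the model path are passed through to vLLM and omitted here):

```
python -m llumnix.entrypoints.vllm.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --launch-ray-cluster \
    --request-output-queue-type zmq \
    --initial-instances 1 \
    --dispatch-policy load \
    --enable-migration \
    --migration-backend gloo
```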

`--host`
- Hostname of the server.
- Default: "localhost"

`--port`
- Port number of the server.
- Default: 8000

`--ssl-keyfile`
- Path to SSL key file.
- Default: None

`--ssl-certfile`
- Path to SSL certificate file.
- Default: None

`--log-level`
- Log level for the server.
- Possible choices: debug, info, warning, error
- Default: "info"

`--launch-ray-cluster`
- Whether to launch a Ray cluster.

`--ray-cluster-port`
- Ray cluster port.
- Default: 6379

`--request-output-queue-type`
- Queue type for request output queue.
- Possible choices: rayqueue, zmq
- Default: "rayqueue"

`--request-output-queue-port`
- Port number for the zmq request output queue.
- Default: 1234

`--disable-log-requests-server`
- Disable logging requests in server.

`--log-request-timestamps`
- Whether to log request timestamps.

`--config-file`
- Path to config file.
- Path to config file of arguments.
- Default: None

`--initial-instances`
- Number of instances created at initialization.
@@ -69,6 +123,9 @@ usage: -m llumnix.entrypoints.vllm.api_server [-h]
`--enable-migration`
- Enable migrate requests between instances.

`--enable-defrag`
- Enable defragmentation through migration based on virtual usage.

`--pair-migration-frequency`
- Pair migration frequency.
- Default: 1
@@ -87,10 +144,6 @@ usage: -m llumnix.entrypoints.vllm.api_server [-h]
- Possible choices: LCR, SR, LR, FCW, FCWSR
- Default: "SR"

`--enable-defrag`
- Enable defragmentation through migration based on virtual usage.
- Default: False

`--enable-scaling`
- Enable auto scaling.

@@ -129,60 +182,60 @@ usage: -m llumnix.entrypoints.vllm.api_server [-h]
- Log filename.
- Default: "server.log"

`--profiling-result-file-path`
- Profiling result file path.
- Default: ""
`--simulator-mode`
- Enable simulator mode.

`--gpu-type`
- GPU type specified when using simulator.
- Default: "a10"
`--profiling-result-file-path`
- Profiling result file path when using simulator.
- Default: None

`--migration-backend`
- Communication backend of migration.
- Possible choices: gloo, rayrpc, nccl, grpc, kvtransfer. [gloo, rayrpc, nccl] are available for vllm and [grpc, kvtransfer] are available for bladellm.
- Default: "gloo"

`--migration-backend-transfer-type`
- Transfer type for migration backend kvTransfer.
- Possible choices: cuda_ipc, rdma
- Default: "rdma"

`--migration-backend-server-address`
- Address of grpc server for migration backend
- Default: "127.0.0.1:50051"

`--migration-backend-kvtransfer-naming-url`
- URL of naming server for kvtransfer migration backend
- Default: "file:/tmp/llumnix/naming/"

`--migration-buffer-blocks`
- Number of buffer blocks in migration.
- Default: 512

`--migration-num-layers`
- Number of kv-cache layers to transfer in each round during migration.
- Default: 1

`--migration-backend-init-timeout`
- Timeout(s) for initializing migration backend.
- Default: 10.0

`--migration-num-layers`
- Number of kv-cache layers to transfer in each round during migration.
- Default: 1
`--migration-backend-transfer-type`
- Transfer type for migration backend grpc and kvTransfer.
- Possible choices: cuda_ipc, rdma
- Default: "rdma"

`--last-stage-max-blocks`
- If the number of remaining blocks < last_stage_max_blocks, do last stage migration.
- Default: 4
`--grpc-migration-backend-server-address`
- Address of grpc server for migration backend
- Default: "127.0.0.1:50051"

`--kvtransfer-migration-backend-naming-url`
- URL of naming server for kvtransfer migration backend
- Default: "file:/tmp/llumnix/naming/"

`--max-stages`
- Drop migration if the number of stages > max_stages.
- Default: 3

`--log-request-timestamps`
- Enable logging request timestamps.
`--last-stage-max-blocks`
- If the number of remaining blocks < last_stage_max_blocks, do last stage migration.
- Default: 16

`--enable-pd-disagg`
- Enable prefill decoding disaggregation.

`--num-dispatch-instances`
- Number of available instances for dispatch.
- Default: math.inf

`--enable-port-increment`
- Enable port increment when deploying multiple servers.
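
For illustration, when deploying multiple servers through the centralized serve entrypoint (see the Quickstart), the flag might be combined with a base port along these lines; this is a hypothetical sketch, assuming the serve module accepts the same arguments as the API server:

```
python -m llumnix.entrypoints.vllm.serve \
    --config-file $CONFIG_PATH \
    --port 8000 \
    --enable-port-increment
```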

# Unsupported vLLM feature options

21 changes: 18 additions & 3 deletions docs/Quickstart.md
@@ -34,7 +34,7 @@ After installation, you can follow this guide to use Llumnix for multi-instance

## Migrating from Existing Deployments

Inference engines like vLLM provide an API server user interface, e.g., `python -m vllm.entrypoints.api_server`. To deploy multiple instances, people start multiple such API servers, each corresponding to one instance, on multiple nodes / containers / k8s pods.
Inference engines like vLLM provide an API server user interface, e.g., `python -m entrypoints.vllm.api_server`. To deploy multiple instances, people start multiple such API servers, each corresponding to one instance, on multiple nodes / containers / k8s pods.

Llumnix provides a similar user interface to enable seamless integration with such existing multi-instance deployments.
You only need two simple steps to migrate from a deployed vLLM service to Llumnix:
@@ -62,11 +62,25 @@ export HEAD_NODE=1

During the execution of serving deployment, Llumnix will:
- Initiate the Ray cluster for distributed execution.
- Start Llumnix actor components, including LLMEngineManager, Llumlet, among others.
- Start Llumnix actor components, including Manager, Llumlet, among others.
- Launch the vLLM engine instances.

Following these steps, Llumnix acts as the request scheduling layer situated behind the multiple frontend API servers and above the multiple backend vLLM engine instances. This positioning allows Llumnix to significantly enhance serving performance through its dynamic, fine-grained, and KV-cache-aware request scheduling and rescheduling across instances.

## Centralized Deployment

Llumnix also supports deploying multiple servers and instances at once by running `python -m llumnix.entrypoints.vllm.serve`, which is referred to as centralized deployment.

```
python -m llumnix.entrypoints.vllm.serve \
--config-file $CONFIG_PATH \
# vLLM arguments ...
# Llumnix arguments ...
...
```

Centralized deployment assumes that the user has already launched a Ray cluster. Upon running the serve module, Llumnix automatically connects to the existing Ray cluster, starts the Llumnix components, and deploys servers and instances to the Ray cluster until no more GPUs or CPUs are available.
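
As a sketch of the assumed workflow (using the standard Ray CLI; the head-node IP and port are placeholders):

```
# On the head node:
ray start --head --port=6379

# On every worker node, join the existing cluster:
ray start --address=$HEAD_NODE_IP:6379

# Then run the serve module from a node inside the cluster:
python -m llumnix.entrypoints.vllm.serve --config-file $CONFIG_PATH ...
```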

## Ray Cluster Notice
When you include the --launch-ray-cluster option in Llumnix's serving deployment command, Llumnix automatically builds a Ray cluster during the execution of serving deployment. This action will overwrite any existing Ray cluster. If this behavior is not desired, simply omit the --launch-ray-cluster option, and Llumnix will initiate its actor components within the current Ray cluster.

@@ -84,7 +98,8 @@ HEAD_NODE=1 python -m llumnix.entrypoints.vllm.api_server \
--model $MODEL_PATH \
--engine-use-ray \
--worker-use-ray \
--max-model-len 4096
--max-model-len 4096 \
--migration-backend rayrpc
```
`CONFIG_PATH` is the path to the configuration file for Llumnix, and we give an example configuration file [here](../configs/base.yml). `MODEL_PATH` defines the location of your model. `INITIAL_INSTANCES` determines the number of instances to be launched on the current node,

22 changes: 9 additions & 13 deletions examlpes/offline_inference.py
@@ -5,8 +5,9 @@
import ray

from llumnix import launch_ray_cluster, connect_to_ray_cluster, init_manager
from llumnix import (SamplingParams, ServerInfo, EngineManagerArgs, LLMEngineManager, Llumlet,
EngineArgs, QueueType, BackendType)
from llumnix import (ManagerArgs, EngineArgs, Manager,
Llumlet, ServerInfo, QueueType, BackendType,
SamplingParams)
from llumnix.utils import random_uuid
from llumnix.queue.ray_queue_server import RayQueueServer

@@ -33,23 +34,18 @@
connect_to_ray_cluster(port=ray_cluster_port)

# Set manager args and engine args.
manager_args = EngineManagerArgs()
manager_args = ManagerArgs()
engine_args = EngineArgs(model="facebook/opt-125m", worker_use_ray=True,
trust_remote_code=True, max_model_len=370)

# Create a manager. If the manager is created first, and then the llumlets are created, manager.scale_up
# need to be called to add the newly created llumlets to the management of the manager.
manager: LLMEngineManager = init_manager(manager_args)
# Create a manager. The manager is created first, and then the instances are created.
manager: Manager = init_manager(manager_args)
ray.get(manager.is_ready.remote())

# Create llumlets.
# Create instances.
instance_ids: List[str] = None
llumlets: List[Llumlet] = None
instance_ids, llumlets = ray.get(manager.init_llumlets.remote(
engine_args, QueueType("rayqueue"), BackendType.VLLM, 1,
))

ray.get(manager.scale_up.remote(instance_ids, llumlets))
instances: List[Llumlet] = None
instance_ids, instances = ray.get(manager.init_instances.remote(QueueType("rayqueue"), BackendType.VLLM, engine_args))

# The requests' outputs will be put into the request_output_queue no matter which instance they are running on.
server_id = random_uuid()
8 changes: 4 additions & 4 deletions llumnix/__init__.py
@@ -15,8 +15,8 @@
from llumnix.entrypoints.setup import (launch_ray_cluster,
connect_to_ray_cluster,
init_manager)
from llumnix.arg_utils import EngineManagerArgs
from llumnix.llm_engine_manager import LLMEngineManager
from llumnix.arg_utils import ManagerArgs
from llumnix.manager import Manager
from llumnix.llumlet.llumlet import Llumlet
from llumnix.queue.queue_type import QueueType
from llumnix.backends.backend_interface import BackendType
@@ -28,8 +28,8 @@
"launch_ray_cluster",
"connect_to_ray_cluster",
"init_manager",
"EngineManagerArgs",
"LLMEngineManager",
"ManagerArgs",
"Manager",
"Llumlet",
"QueueType",
"BackendType",