PyTorch serve with custom translators returning Unknown exception #3032
Comments
@samruds, looking at the logs:
It seems like the model load failed because:
A few follow-up questions to help debug further:
Could you share your inference.py implementation?

```python
"""This module is for SageMaker inference.py."""

logger = logging.getLogger(__name__)

inference_spec = None

def model_fn(model_dir):

def input_fn(input_data, content_type):

def predict_fn(input_data, predict_callable):

def output_fn(predictions, accept_type):

def _run_preflight_diagnostics():

def _py_vs_parity_check():

def _pickle_file_integrity_check():

# on import, execute
_run_preflight_diagnostics()
```
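Per the traceback in the error logs, the worker dies while inference.py is still being imported (the top-level `from sagemaker.serve.validations.check_integrity import ...` line), so `model_fn` never runs. A minimal sketch of a preflight check that could be run inside the serving image to surface missing module-level imports before deploying; the helper name and the module list are illustrative, not part of any SDK:

```python
import importlib.util

def missing_modules(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    missing = []
    for name in names:
        try:
            # find_spec returns None for a missing top-level module, and raises
            # ModuleNotFoundError when a dotted name's parent package is absent.
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except ModuleNotFoundError:
            missing.append(name)
    return missing

# In the real image one would check the imports inference.py actually makes,
# e.g. "sagemaker.serve.validations.check_integrity"; stdlib names shown here.
print(missing_modules(["json", "logging"]))
```

Running this inside the container (rather than locally) matters, because the failure here was specific to the serving image's environment.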
The issue was resolved on our end by bringing in extra dependencies for the PyTorch servers.
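Since the resolution was adding missing dependencies, one common way to do that with the SageMaker PyTorch serving containers is a `requirements.txt` placed next to `inference.py` in the model archive's `code/` directory, which the container installs before the workers load the handler. The package below is an assumption inferred from the failing import in the traceback, not a confirmed fix:

```text
# /opt/ml/model/code/requirements.txt -- installed at container startup.
# Assumed missing dependency, based on the traceback's failing
# "from sagemaker.serve.validations.check_integrity import ..." line:
sagemaker
```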
🐛 Describe the bug
I am deploying a model with a new PyTorch image. The model artifacts build and a server worker is assigned a PID; however, the worker crashes without a clear exception (I have fixed all import errors in the implementation interface).
Error logs
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,529 [INFO ] W-9003-model_1.0-stdout MODEL_LOG - service = model_loader.load( AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,530 [INFO ] W-9003-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9003-model_1.0-stdout AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,530 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - user_module = importlib.import_module(user_module_name) AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,530 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/importlib/init.py", line 126, in import_module AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,531 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - return _bootstrap._gcd_import(name[level:], package, level) AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,531 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "", line 1050, in _gcd_import AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,531 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "", line 1027, in _find_and_load AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,531 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "", line 1006, in _find_and_load_unlocked AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,531 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "", line 688, in _load_unlocked AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,532 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "", line 883, in exec_module AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,532 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "", line 241, in _call_with_frames_removed AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,532 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "/opt/ml/model/code/inference.py", line 10, in AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,532 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - from sagemaker.serve.validations.check_integrity import perform_integrity_check AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,532 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker/serve/init.py", line 5, in AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,533 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - from sagemaker.serve.builder.model_builder import ModelBuilder AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,545 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9002 Worker disconnected. WORKER_STARTED AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,546 [WARN ] W-9002-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,547 [INFO ] W-9002-model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1710913851547 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,547 [WARN ] W-9002-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9002-model_1.0-stderr AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,547 [WARN ] W-9002-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9002-model_1.0-stdout AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,547 [INFO ] W-9002-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/sagemaker/serve/builder/model_builder.py", line 23, in AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.648Z 2024-03-20T05:50:51,548 [INFO ] W-9002-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9002-model_1.0-stdout AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,547 [INFO ] W-9002-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9002 in 1 seconds. AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,885 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,885 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,889 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,890 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,890 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,890 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,891 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1710913851891 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,892 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1.0-stderr AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,892 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1.0-stdout AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,892 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:51.898Z 2024-03-20T05:50:51,891 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,893 [INFO ] W-9000-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1.0-stdout AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,914 [INFO ] W-9001-model_1.0-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,915 [INFO ] W-9001-model_1.0-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,915 [INFO ] W-9001-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,916 [INFO ] W-9001-model_1.0-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,916 [INFO ] W-9001-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,915 [INFO ] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - 9001 Worker disconnected. WORKER_STARTED AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:52.149Z 2024-03-20T05:50:51,917 [WARN ] W-9001-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0cb527de76ac4f29a
Installation instructions
Install TorchServe from source: No, it is an image provided through SageMaker
Docker: No
Model Packaging
aws/sagemaker-python-sdk#4502
config.properties
2024-03-20T05:50:45.547Z Torchserve version: 0.8.2 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z TS Home: /opt/conda/lib/python3.10/site-packages AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Current directory: / AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Temp directory: /home/model-server/tmp AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Metrics config path: /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Number of GPUs: 0 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Number of CPUs: 4 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Max heap size: 4008 M AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Python executable: /opt/conda/bin/python3.10 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Config file: /etc/sagemaker-ts.properties AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Inference address: http://0.0.0.0:8080 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Management address: http://0.0.0.0:8080 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Metrics address: http://127.0.0.1:8082 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Model Store: /.sagemaker/ts/models AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Initial Models: model=/opt/ml/model AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Log dir: /logs AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Metrics dir: /logs AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Netty threads: 0 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Netty client threads: 0 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Default workers per model: 4 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Blacklist Regex: N/A AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Maximum Response Size: 6553500 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Maximum Request Size: 6553500 AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Limit Maximum Image Pixels: true AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Prefer direct buffer: false AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Allowed Urls: [file://.|http(s)?://.] AllTraffic/i-0cb527de76ac4f29a
2024-03-20T05:50:45.547Z Custom python dependency for model allowed: false
Versions
Torchserve version: 0.8.2
Python runtime: 3.10.9
Repro instructions
N/A
Possible Solution
No response