Converting SmolVLM to ONNX #60

Open
bharathsivaram10 opened this issue Feb 18, 2025 · 0 comments

Hello!

I'm trying to convert a fine-tuned SmolVLM to ONNX and then quantize it to int8 so it can run on CPU. Two questions:

  1. Has anyone tried this? I'd love to hear whether I'm going about this the right way.
  2. My ONNX conversion code is shown below. It takes a very long time to run and eventually crashes because it runs out of RAM (on Colab). I also don't think the architecture is supported by Optimum.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")

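# Create a dictionary of regular tensors.
# Keys follow the order of the model's forward() arguments (input_ids,
# attention_mask, pixel_values, pixel_attention_mask), because input_names
# are matched to the exported graph inputs by position.
tensor_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
    "pixel_values": inputs["pixel_values"].to(torch.float32),
    "pixel_attention_mask": inputs["pixel_attention_mask"].to(torch.float32)
}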

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.float32  # float32 for ONNX compatibility; matches the float32 pixel_values above
)
model.eval()

# Dynamic axes for variable batch size and sequence length
dynamic_axes = {
    "pixel_values": {0: "batch_size"},
    "pixel_attention_mask": {0: "batch_size"},
    "input_ids": {0: "batch_size", 1: "sequence_length"},
    "attention_mask": {0: "batch_size", 1: "sequence_length"},
    "output": {0: "batch_size", 1: "sequence_length"}
}

# Export to ONNX
torch.onnx.export(
    model,
    (tensor_inputs,),  # A trailing dict in args is passed to forward() as keyword arguments
    "smolvlm.onnx",
    input_names=list(tensor_inputs.keys()),
    output_names=["output"],
    dynamic_axes=dynamic_axes,
    opset_version=13,
    do_constant_folding=True,
    export_params=True
)

print("ONNX model saved as smolvlm.onnx")
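
For the int8 step mentioned above, one possible route is ONNX Runtime's dynamic quantization, which rewrites the weights as int8 and needs no calibration data. This is only a minimal sketch: the helper names `int8_path` and `quantize_to_int8` are made up for illustration, and it assumes the export to `smolvlm.onnx` succeeded and that the float32 model is larger than 2 GB (hence `use_external_data_format=True`). Whether every SmolVLM operator quantizes cleanly has not been verified here.

```python
from pathlib import Path


def int8_path(fp32_path: str) -> str:
    """Derive an output name, e.g. 'smolvlm.onnx' -> 'smolvlm.int8.onnx' (hypothetical helper)."""
    p = Path(fp32_path)
    return str(p.with_name(f"{p.stem}.int8{p.suffix}"))


def quantize_to_int8(fp32_path: str) -> str:
    """Run ONNX Runtime dynamic quantization and return the path of the int8 model."""
    # Imported here so that int8_path() stays usable without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = int8_path(fp32_path)
    quantize_dynamic(
        model_input=fp32_path,
        model_output=out_path,
        weight_type=QuantType.QInt8,  # store weights as signed int8
        # Assumption: the float32 export exceeds the 2 GB protobuf limit,
        # so the weights are kept in external data files.
        use_external_data_format=True,
    )
    return out_path
```

Calling `quantize_to_int8("smolvlm.onnx")` would write `smolvlm.int8.onnx` next to the original file. Dynamic quantization is used in this sketch because it only needs the exported model; static quantization would additionally require a calibration dataset.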