harry's blog

Using Qwen2-VL as an example, this post shows how to deploy a large model with vLLM and serve it through an OpenAI-compatible API.

Environment setup

There are a few pitfalls. For Qwen2-VL, transformers must be pinned to a specific commit.

# Any other version fails with a rather strange error: https://github.com/QwenLM/Qwen3-VL/issues/247
git clone https://github.com/huggingface/transformers
cd transformers
git checkout 21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install .
pip install "vllm==0.6.1.post2" -i https://pypi.mirrors.ustc.edu.cn/simple/

pyairports has to be downloaded from GitHub and installed from source (https://github.com/NICTA/pyairports); the default 0.0.1 version raises an error.

git clone https://github.com/NICTA/pyairports
cd pyairports
pip install .
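
Before launching the server, it can help to confirm that the environment matches the pins described above. The following is a minimal sketch, not part of the original setup: it assumes the installed distribution names are `vllm` and `pyairports`, and the helper names `installed_versions` and `check_environment` are made up for illustration. The transformers pin is a git commit rather than a release, so it is not checked here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names, lookup=version):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = lookup(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def check_environment(versions):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The post pins vLLM to 0.6.1.post2.
    if versions.get("vllm") != "0.6.1.post2":
        problems.append(f"vllm is {versions.get('vllm')}, expected 0.6.1.post2")
    # The post reports that the default pyairports 0.0.1 raises an error.
    if versions.get("pyairports") in (None, "0.0.1"):
        problems.append(
            f"pyairports is {versions.get('pyairports')}, install it from GitHub"
        )
    return problems
```

Calling `check_environment(installed_versions(["vllm", "pyairports"]))` in the target environment should return an empty list when both packages are installed as described.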

Here is the deployment script:

#!/usr/bin/env bash
# Launch Qwen2-VL with vLLM as an OpenAI-compatible API service
# Usage: bash deploy_vllm_qwen2vl.sh

# ======= Configurable Parameters =======
export CUDA_VISIBLE_DEVICES=0          # Specify which GPUs to use
MODEL_PATH="/aiarena/group/mmitgroup/wangyh/models/Qwen/Qwen2-VL-7B-Instruct"
HOST=0.0.0.0
PORT=8000                              # Service port
GPU_MEM_UTIL=0.95                      # GPU memory utilization ratio (0–1)
# Tensor-parallel size = number of comma-separated GPU IDs in CUDA_VISIBLE_DEVICES
TP_SIZE=$(echo "${CUDA_VISIBLE_DEVICES}" | tr ',' '\n' | wc -l)
echo "[vLLM] Detected number of available GPUs: ${TP_SIZE}"

# ======= Start the Server =======
echo "[vLLM] Serving model: ${MODEL_PATH} on GPUs: ${CUDA_VISIBLE_DEVICES}"
python -m vllm.entrypoints.openai.api_server \
  --model "${MODEL_PATH}" \
  --served-model-name qwen2vl7b \
  --dtype auto \
  --max-model-len 4096 \
  --tensor-parallel-size "${TP_SIZE}" \
  --gpu-memory-utilization "${GPU_MEM_UTIL}" \
  --host "${HOST}" \
  --port "${PORT}" \
  --trust-remote-code

# Notes:
# --trust-remote-code: required for some multimodal models with custom preprocessing
# OpenAI-compatible endpoint: http://localhost:${PORT}/v1/
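
Loading the model takes a while, so a client started right after the script may hit a connection error. One way to handle this is to poll the OpenAI-compatible `/v1/models` route until it answers, which also confirms the name set by `--served-model-name`. This is a rough sketch under the assumption that the server runs without an API key requirement, as in the script above; `fetch_json` and `wait_for_models` are illustrative names, and the attempt count and delay are arbitrary.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_models(base_url, attempts=30, delay=2.0, fetch=fetch_json, sleep=time.sleep):
    """Poll {base_url}/models until it answers; return the served model ids."""
    last_error = None
    for _ in range(attempts):
        try:
            payload = fetch(f"{base_url}/models")
            return [m["id"] for m in payload.get("data", [])]
        except OSError as e:  # connection refused, timeouts, HTTP errors
            last_error = e
            sleep(delay)
    raise TimeoutError(f"server at {base_url} not ready: {last_error}")
```

With the script's defaults, `wait_for_models("http://127.0.0.1:8000/v1")` would be expected to return a list containing `qwen2vl7b` once the server is up.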

Here is a script that calls the API; it can be embedded directly in a Python program.

import os
import sys
import base64
from typing import Optional

sys.path.append(os.getcwd())
sys.path.append("..")

from openai import OpenAI


def _encode_image_to_b64(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


class VllmOpenAIClient:
    def __init__(self, base_url="http://127.0.0.1:8000/v1", api_key="token-abc123", timeout=None):
        self.base_url = base_url
        self.api_key = api_key
        self.timeout = timeout
        self.client = None
        self.model_name = None

    def initialize_llm(self, checkpoint: str):
        self.client = OpenAI(
            base_url=self.base_url,
            api_key=self.api_key,
            timeout=self.timeout
        )
        self.model_name = checkpoint

    def run_llm(self, query, image_path, sys_message="You are an AI assistant that helps people describe images."):
        if self.client is None or self.model_name is None:
            raise RuntimeError("Client or model not initialized. Call initialize_llm(checkpoint) first.")

        b64 = _encode_image_to_b64(image_path)
        data_url = f"data:image/jpeg;base64,{b64}"
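        messages = [
            # System prompt that sets the assistant's behavior
            {"role": "system", "content": sys_message},
            {
                "role": "user",
                "content": [
                    # The image is sent inline as a base64 data URL
                    {
                        "type": "image_url",
                        "image_url": {"url": data_url}
                    },
                    {"type": "text", "text": query},
                ],
            },
        ]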

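        # Retry a bounded number of times instead of looping forever
        # when the server is unreachable or keeps returning errors.
        max_retries = 3
        for attempt in range(1, max_retries + 1):
            try:
                resp = self.client.chat.completions.create(
                    model=self.model_name,
                    messages=messages,
                    temperature=0.0,
                    max_tokens=512,
                )
                # content may be None, so fall back to an empty string
                text = resp.choices[0].message.content if resp and resp.choices else None
                return text or ""
            except Exception as e:
                print(f"Error in vllm {self.model_name} (attempt {attempt}/{max_retries}): {e}")
                if attempt == max_retries:
                    raise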

if __name__ == "__main__":
    api_client = VllmOpenAIClient(
        base_url="http://127.0.0.1:8000/v1",
        api_key="token-abc123",
    )
    api_client.initialize_llm(checkpoint="qwen2vl7b")

    answer = api_client.run_llm(
        query="What does this image show?",
        image_path="asset/1.jpg",
        sys_message="You are an AI assistant that explains image content clearly and concisely."
    )
    print(answer)
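
The client above labels every image as `image/jpeg` in the data URL, whatever the file actually is. If PNG or WebP inputs are expected, the MIME type could be guessed from the file extension instead. The helper below is a small sketch of that idea; `image_to_data_url` is a hypothetical name, and falling back to `image/jpeg` simply mirrors the hardcoded value in the script.

```python
import base64
import mimetypes


def image_to_data_url(image_path, default_mime="image/jpeg"):
    """Encode an image file as a data URL, guessing the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime  # fall back to the type the client hardcodes
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

The return value can be used wherever `run_llm` builds `data_url`.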

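This article was written by Yonghui Wang and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, articles on this site are original work or translations by this site; please credit the author before reposting.
Last edited: Nov 04, 2025 11:06 pm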

Deploying large models with vLLM