『Zhipu Qingyan』 CogVLM2 Deployment in Practice

A brief record of one part of my first task after joining Zhipu.

Prerequisites

One Ubuntu host with CUDA 12.1

Note: the CUDA version should match the torch build. In my tests, CUDA 12.0 also ran without problems.
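
The pairing between CUDA and torch can be made explicit with a small helper. This is only a sketch: the function name `torch_index_url` is hypothetical, and it assumes the `cu<major><minor>` naming that the PyTorch wheel index uses (the same pattern as the cu121 URL used in the install step of this post). Not every CUDA version has a matching wheel index, so treat the result as a starting point.

```python
def torch_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as "12.1" to a PyTorch wheel index URL.

    Assumption: the index follows the cu<major><minor> naming pattern.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


# Build the index URL for the CUDA version used in this post
print(torch_index_url("12.1"))
```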

Download the TGI-adapted CogVLM2 model; its local directory is referred to as /model below:


from modelscope import snapshot_download

# Download the model snapshot; the returned path is the local model directory
model_dir = snapshot_download('ZhipuAI/cogvlm2-llama3-chinese-chat-19B-tgi', cache_dir="/data/cogvlm2-llama3-chinese-chat-19B")
print(model_dir)

Download the CogVLM2-adapted TGI package to the path /tgi: https://github.com/leizhao1234/cogvlm2

Main procedure

# Download and install conda (Miniconda)
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# Create and activate the conda environment
conda create -n tgi python=3.10
conda activate tgi

# Speed up the Rust download by using a mirror
export RUSTUP_DIST_SERVER=https://mirrors.ustc.edu.cn/rust-static
export RUSTUP_UPDATE_ROOT=https://mirrors.ustc.edu.cn/rust-static/rustup
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Download and install protobuf (protoc)
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

# Note: the torch build must match the CUDA version!
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121

# Install the OpenSSL headers, gcc, and pkg-config
sudo apt-get install -y libssl-dev gcc
sudo apt-get install -y pkg-config

# Install cmake
sudo apt-get install -y cmake

The base environment is now in place; next comes compilation.
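
Before compiling, it can help to confirm that the tools installed so far are actually visible on PATH. The following is a minimal sketch, not part of the original workflow: the helper names and the `REQUIRED_TOOLS` list are my own assumptions based on the packages installed in the previous steps.

```python
import shutil

# Tools installed in the previous steps, by the names they use on PATH
# (this list is an assumption derived from the install commands).
REQUIRED_TOOLS = ["conda", "rustc", "cargo", "protoc", "cmake", "pkg-config", "gcc"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def torch_cuda_version():
    """Return the CUDA version torch was built for, or None if torch is absent or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


print("missing tools:", missing_tools(REQUIRED_TOOLS))
print("torch built for CUDA:", torch_cuda_version())
```

An empty list for the first line and a CUDA version for the second suggest the environment is ready for the build.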

First open /tgi/rust-toolchain.toml and set the version to 1.79.0 or later, as follows:

[toolchain]
# Released on: 13 June, 2024
# https://releases.rs/docs/1.79.0/
channel = "1.79.0"
components = ["rustfmt", "clippy"]
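
The version requirement can also be checked programmatically. This is a small sketch under my own assumptions: the function names are hypothetical, and it uses a regular expression rather than a TOML parser because the `tomllib` module is not available in the Python 3.10 environment created earlier.

```python
import re


def toolchain_channel(toml_text: str):
    """Extract the channel value from rust-toolchain.toml text, or None if absent."""
    match = re.search(r'^\s*channel\s*=\s*"([^"]+)"', toml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def channel_at_least(toml_text: str, minimum=(1, 79, 0)) -> bool:
    """Return True if the pinned channel is a numeric version >= minimum."""
    channel = toolchain_channel(toml_text)
    if channel is None or not re.fullmatch(r"\d+(\.\d+){1,2}", channel):
        return False  # e.g. "stable" or "nightly" cannot be compared numerically
    parts = tuple(int(p) for p in channel.split("."))
    parts += (0,) * (3 - len(parts))  # pad "1.79" to (1, 79, 0)
    return parts >= minimum


sample = '[toolchain]\nchannel = "1.79.0"\ncomponents = ["rustfmt", "clippy"]\n'
print(channel_at_least(sample))
```

To check the real file, pass the contents of /tgi/rust-toolchain.toml instead of `sample`.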

Next, go to the TGI root directory /tgi:

# Enter the text-generation-inference directory and build the base components
cd /tgi
BUILD_EXTENSIONS=True make install

cd server
make install-vllm-cuda 
make build-apex 
make install-flash-attention-v2-cuda

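Note: the Makefile contains commands that pull the vllm and apex packages from GitHub. If the server cannot reach GitHub, download vllm (https://github.com/Narsil/vllm) and apex (https://github.com/NVIDIA/apex) by other means (the master branch is fine) and place them under /tgi/server, so that the final layout is /tgi/server/vllm and /tgi/server/apex. Then delete the git-related commands from the Makefile-apex and Makefile-vllm files under /tgi, and run the make commands above to compile.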

Testing and deployment

Check that TGI compiled successfully:

text-generation-launcher --version
# Output: text-generation-launcher 2.0.5-dev0

Deploy the model with TGI. For the meaning of the other options, see the launcher reference (https://huggingface.co/docs/text-generation-inference/en/reference/launcher):

# Replace /model with the actual model directory
CUDA_VISIBLE_DEVICES=6,7 text-generation-launcher \
    --model-id /model \
    --num-shard 2 \
    --port 8081 \
    --max-concurrent-requests 409600 \
    --max-input-length 8190 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 8192 \
    --trust-remote-code \
    --max-waiting-tokens 2 \
    --waiting-served-ratio 0.2 \
    --cuda-memory-fraction 0.97 \
    --cuda-graphs 0 \
    --dtype bfloat16

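CUDA_VISIBLE_DEVICES=6,7: IDs of the GPUs to use
--num-shard 2: number of GPUs (shards) to use
--port 8081: port the service listens on

If no errors appear along the way, the launch succeeded!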
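
The relationship between the GPU list, the shard count, and the token limits can be sketched in Python. This is an illustration under stated assumptions, not the author's tooling: `build_launch` is a hypothetical helper, it covers only a subset of the flags from the command above, and the check that the input length stays below the total token limit reflects my understanding of the launcher's own validation.

```python
def build_launch(model_dir: str, gpu_ids, port: int,
                 max_input_length: int = 8190, max_total_tokens: int = 8192):
    """Return (env, argv) for text-generation-launcher, mirroring the command above."""
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    # Expose only the selected GPUs to the launcher
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "text-generation-launcher",
        "--model-id", model_dir,
        "--num-shard", str(len(gpu_ids)),  # one shard per visible GPU
        "--port", str(port),
        "--max-input-length", str(max_input_length),
        "--max-total-tokens", str(max_total_tokens),
        "--trust-remote-code",
        "--dtype", "bfloat16",
    ]
    return env, argv


env, argv = build_launch("/model", [6, 7], 8081)
print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

To actually start the server from Python, the pair could be passed to `subprocess.run(argv, env={**os.environ, **env})`. Deriving `--num-shard` from the GPU list keeps the two values from drifting apart.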

For testing, refer to https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chinese-chat-19B-tgi. The exposed route should be /generate, so change the url variable in that code to url = 'http://localhost:{port}/generate'.
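
For a quick smoke test without the ModelScope script, a text-only request can be sent with the standard library. This is a minimal sketch: the helper names are hypothetical, the request body follows the generic TGI /generate JSON shape (`inputs` plus `parameters`), and image inputs for CogVLM2 should follow the format shown on the model page, which is not reproduced here.

```python
import json
import urllib.request


def generate_url(port: int, host: str = "localhost") -> str:
    """Build the /generate endpoint URL for the given port."""
    return f"http://{host}:{port}/generate"


def build_payload(prompt: str, max_new_tokens: int = 64) -> bytes:
    """Encode a text-only request body in the generic TGI /generate JSON shape."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return json.dumps(body).encode("utf-8")


def generate(prompt: str, port: int = 8081, timeout: float = 60.0) -> str:
    """POST the prompt to a running server and return the generated text."""
    request = urllib.request.Request(
        generate_url(port),
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]


# Example (requires the server from the previous step to be running):
# print(generate("Hello"))
```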