Deployment traceability for our models turned into a thorny problem in production. Developers update the service code while ML engineers iterate on the model files. Initially we relied on a manual process: an ML engineer dropped a new model onto shared storage, updated the path in a config file by hand, and triggered a code release. The cracks showed quickly: which version of the code was actually running which version of the model? When a release failed, should we roll back the code or the model file? And when predictions drifted in production, we could not quickly and reliably reproduce the exact code-plus-model combination that caused it.
What we needed was an automated pipeline that understands both code versions (Git) and data versions (model files), and binds, builds, and deploys the two atomically. The goal is full reproducibility: every deployment must be traceable to exactly one Git commit hash and one DVC data hash.
After some research and trade-off analysis, we settled on a somewhat unconventional but very pragmatic stack: Express.js for a lightweight model-serving API, DVC (Data Version Control) to track model versions, Tekton as the cloud-native CI/CD engine orchestrating the whole flow, and our Docker Swarm cluster as the deployment target. Why Swarm rather than Kubernetes? For our current mid-sized fleet of stateless services, Swarm provides enough orchestration capability at a fraction of the operational complexity and resource overhead of K8s. It is a classic engineering trade-off.
Initial Setup: Application and Data Layout
Our starting point is a standard Node.js project that wraps a machine learning model with Express. For this walkthrough, assume the model is an ONNX file, though it could be any binary artifact.
The project layout:
.
├── .dvc/
├── .dvcignore
├── data/
│   ├── model.onnx
│   └── model.onnx.dvc
├── tekton/
│   ├── 01-task-git-clone-dvc.yaml
│   ├── 02-task-build-push.yaml
│   ├── 03-task-deploy-swarm.yaml
│   └── 04-pipeline.yaml
├── .dockerignore
├── app.js
├── Dockerfile
├── package.json
└── dvc.yaml
1. The Express.js Model Service
The implementation of app.js is straightforward: it starts an HTTP server, loads the model from /app/data/model.onnx, and exposes a /predict endpoint. A real project would add more elaborate model loading, preprocessing, and postprocessing logic here.
// app.js
const express = require('express');
const fs = require('fs');
const path = require('path');
const app = express();
app.use(express.json());
const PORT = process.env.PORT || 3000;
const MODEL_PATH = path.join(__dirname, 'data', 'model.onnx');
let model; // In a real app, this would be an ONNX runtime session or similar
let modelMetadata = {
  path: MODEL_PATH,
  size: 0,
  lastModified: null,
  gitCommit: process.env.GIT_COMMIT || 'unknown',
  dvcDataHash: process.env.DVC_DATA_HASH || 'unknown'
};
/**
 * Loads the model from disk. In a production scenario, this would involve
 * initializing an inference session, which can be time-consuming.
 */
function loadModel() {
  console.log(`Attempting to load model from: ${MODEL_PATH}`);
  try {
    if (fs.existsSync(MODEL_PATH)) {
      const stats = fs.statSync(MODEL_PATH);
      modelMetadata.size = stats.size;
      modelMetadata.lastModified = stats.mtime;
      // Simulate loading the model into memory
      model = fs.readFileSync(MODEL_PATH);
      console.log(`Model loaded successfully. Size: ${modelMetadata.size} bytes.`);
      console.log(`Running with Git Commit: ${modelMetadata.gitCommit}`);
      console.log(`Running with DVC Data Hash: ${modelMetadata.dvcDataHash}`);
    } else {
      console.error('FATAL: Model file not found!');
      // This should cause the container to fail, which is desirable.
      process.exit(1);
    }
  } catch (error) {
    console.error('FATAL: Failed to load model!', error);
    process.exit(1);
  }
}
// Endpoint for health checks
app.get('/health', (req, res) => {
  res.status(200).send({ status: 'ok' });
});
// Endpoint to get model metadata
app.get('/info', (req, res) => {
  res.status(200).json(modelMetadata);
});
// The actual prediction endpoint
app.post('/predict', (req, res, next) => {
  // Input validation
  if (!req.body || !req.body.features) {
    return res.status(400).json({ error: 'Missing "features" in request body.' });
  }
  try {
    // In a real application, you would run inference using the loaded model.
    // Here we just simulate a successful prediction.
    console.log(`Received prediction request with features:`, req.body.features);
    const prediction = Math.random();
    res.json({ prediction });
  } catch (e) {
    console.error('Prediction failed:', e);
    // Pass to the application-level error handler
    next(e);
  }
});
// Application-level error handler. Express only routes errors to handlers
// registered AFTER the routes that raise them, so this must come last.
app.use((err, req, res, next) => {
  console.error('[Global Error Handler]', err.stack);
  res.status(500).send({ error: 'Something went wrong!' });
});
app.listen(PORT, () => {
  console.log(`Model serving API listening on port ${PORT}`);
  loadModel();
});
The service receives its version information through the GIT_COMMIT and DVC_DATA_HASH environment variables and exposes it via the /info endpoint, which is essential for production troubleshooting and observability.
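As a sketch of how this metadata supports release verification: the script below compares an /info payload against the versions the pipeline claims to have deployed. The JSON response and the expected values are stubbed for illustration; in practice the response would come from something like curl -s http://model-api:3000/info.

```shell
#!/bin/sh
# Hypothetical post-deploy check. The values and the response are stubs;
# a real check would fetch the response from the live /info endpoint.
EXPECTED_COMMIT="0a1b2c3"
EXPECTED_DVC_HASH="d41d8cd98f00b204"

# Stubbed response standing in for: curl -s http://model-api:3000/info
RESPONSE='{"gitCommit":"0a1b2c3","dvcDataHash":"d41d8cd98f00b204","size":1024}'

# Extract the two fields with sed to keep the check dependency-free (no jq)
ACTUAL_COMMIT=$(echo "$RESPONSE" | sed -n 's/.*"gitCommit":"\([^"]*\)".*/\1/p')
ACTUAL_HASH=$(echo "$RESPONSE" | sed -n 's/.*"dvcDataHash":"\([^"]*\)".*/\1/p')

if [ "$ACTUAL_COMMIT" = "$EXPECTED_COMMIT" ] && [ "$ACTUAL_HASH" = "$EXPECTED_DVC_HASH" ]; then
  echo "deployment verified: $ACTUAL_COMMIT / $ACTUAL_HASH"
else
  echo "version mismatch!" >&2
  exit 1
fi
```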
2. DVC Integration
We use DVC to track data/model.onnx. First, initialize DVC and point it at an S3-compatible remote store (MinIO, for example).
# Initialize DVC
dvc init
# Configure remote storage (replace with your actual endpoint)
dvc remote add -d storage s3://dvc-bucket/models
dvc remote modify storage endpointurl http://minio.example.com:9000
# Use --local so credentials go to .dvc/config.local, which is gitignored
dvc remote modify --local storage access_key_id <your-key>
dvc remote modify --local storage secret_access_key <your-secret>
# Track the model file
dvc add data/model.onnx
# This creates `data/model.onnx.dvc` and adds `data/model.onnx` to `.gitignore`
# Now, commit the .dvc file to Git
git add data/.gitignore data/model.onnx.dvc
git commit -m "feat: Track initial model with DVC"
# Push data to remote storage
dvc push
The model.onnx.dvc file now contains the model's hash and location. It is a small plain-text file that is safe to commit to Git, while the model binary itself is uploaded to S3.
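To make the pointer-file idea concrete, here is what a .dvc file looks like; the md5 and size below are fabricated for illustration:

```shell
#!/bin/sh
# Illustrative contents of data/model.onnx.dvc (the md5 and size are made up).
# This tiny pointer file is what Git tracks instead of the model binary.
mkdir -p data
cat > data/model.onnx.dvc <<'EOF'
outs:
- md5: 3f2c9a0d8e7b6c5d4e3f2a1b0c9d8e7f
  size: 52428800
  path: model.onnx
EOF
cat data/model.onnx.dvc
```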
3. Dockerfile: A Data-Aware Multi-Stage Build
The Dockerfile design is the linchpin of the whole flow. It must pull the correct model version from the DVC remote at build time. A common mistake is installing the DVC toolchain in the final image, which bloats it unnecessarily; we use a multi-stage build to avoid that.
# syntax=docker/dockerfile:1
# Dockerfile
# --- Stage 1: The Builder with DVC ---
# Use a base image that has Python for DVC. A slim image is sufficient.
FROM python:3.9-slim AS builder
# Install DVC with S3 support (quoted so the shell ignores the brackets)
RUN pip install "dvc[s3]"
# Set up the application directory
WORKDIR /app
# Copy only the files needed to pull data
COPY .dvc .dvc
COPY data/model.onnx.dvc data/model.onnx.dvc
COPY dvc.yaml dvc.yaml
COPY .dvcignore .dvcignore
# Pull the data from remote storage.
# Credentials (.dvc/config.local) are injected by the Tekton build process
# and only ever exist in this builder stage, whose layers are discarded.
# This command will fail if credentials are not correctly configured.
RUN dvc pull -f
# --- Stage 2: The Final Application Image ---
# Use an official Node.js image for the final application.
FROM node:18-alpine
WORKDIR /app
# Copy package files and install dependencies
COPY package*.json ./
RUN npm install --production --no-optional && npm cache clean --force
# Copy the application code, then the model data from the builder stage
COPY . .
COPY --from=builder /app/data ./data
# Environment variables for version tracking, will be passed by Tekton
ARG GIT_COMMIT=unspecified
ARG DVC_DATA_HASH=unspecified
ENV GIT_COMMIT=$GIT_COMMIT
ENV DVC_DATA_HASH=$DVC_DATA_HASH
EXPOSE 3000
CMD [ "node", "app.js" ]
A note on credentials: when building locally with BuildKit you can mount them for a single step with RUN --mount=type=secret,id=dvc-config,... so they never leak into any image layer. Kaniko, which we use below, does not support BuildKit secret mounts, so in the Tekton task the DVC credentials are instead mounted from a Kubernetes Secret and written into the build context as .dvc/config.local. They then appear only in builder-stage layers, which are not part of the final image.
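For illustration, the credentials file carried by that Secret might look like the following; the key values are placeholders, and the remote name matches the storage remote configured earlier:

```shell
#!/bin/sh
# Sketch of the .dvc/config.local file the dvc-config Secret would hold
# (placeholder credentials). It is used only during `dvc pull` in the
# builder stage and is never committed to Git.
mkdir -p .dvc
cat > .dvc/config.local <<'EOF'
['remote "storage"']
    access_key_id = EXAMPLE_KEY_ID
    secret_access_key = EXAMPLE_SECRET
EOF
# Sanity check: the section name matches the remote we configured earlier
grep -q 'remote "storage"' .dvc/config.local && echo "config.local ready"
```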
The Tekton Pipeline: Orchestrating Everything
Tekton's core primitives are the Task and the Pipeline. A Task is the smallest unit of execution (one or more containerized steps), and a Pipeline chains multiple Tasks together.
Task 1: Clone Code and Prepare Data (01-task-git-clone-dvc.yaml)
The standard git-clone task is not enough for our needs because it knows nothing about DVC. We need a custom task that first clones the Git repository and then parses the data hash out of the .dvc file.
# tekton/01-task-git-clone-dvc.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git-clone-dvc
spec:
  workspaces:
    - name: source
      description: The workspace to clone the source code into.
  params:
    - name: repoUrl
      description: The git repository URL to clone from.
    - name: revision
      description: The git revision to clone.
      default: "main"
    - name: dvcFilePath
      description: "Path to the .dvc file to extract hash from."
      default: "data/model.onnx.dvc"
  results:
    - name: commit
      description: The precise commit SHA that was fetched.
    - name: dvc-data-hash
      description: The MD5 hash of the data file tracked by DVC.
  steps:
    - name: git-clone
      image: alpine/git:v2.36.1
      workingDir: $(workspaces.source.path)
      script: |
        #!/bin/sh
        set -ex
        git clone $(params.repoUrl) .
        git checkout $(params.revision)
        COMMIT_SHA=$(git rev-parse HEAD | tr -d '\n')
        echo -n "$COMMIT_SHA" > $(results.commit.path)
    - name: extract-dvc-hash
      image: mikefarah/yq:4.30.5 # A lightweight YAML processor
      workingDir: $(workspaces.source.path)
      script: |
        #!/bin/sh
        set -ex
        # DVC files are YAML. We extract the md5 hash from the 'outs' array.
        # This is more robust than using grep/sed.
        DATA_HASH=$(yq '.outs[0].md5' $(params.dvcFilePath))
        echo "Extracted DVC data hash: $DATA_HASH"
        echo -n "$DATA_HASH" > $(results.dvc-data-hash.path)
This Task has two steps: the first uses a standard git image to clone the code and emit the commit SHA; the second uses yq, a small YAML processor, to parse the DVC file safely and publish the model's MD5 hash as a Task result.
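If pulling a dedicated yq image is undesirable, the same hash can be extracted with POSIX tools alone, since .dvc files are flat, predictable YAML. A sketch, run here against a fabricated .dvc file with a made-up hash:

```shell
#!/bin/sh
# Fallback extraction for minimal images without yq: pull the first md5
# out of a .dvc file with sed. The sample file and hash are fabricated.
cat > /tmp/sample.dvc <<'EOF'
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 1024
  path: model.onnx
EOF
# Match the '- md5: <hash>' line and strip everything before the hash
DATA_HASH=$(sed -n 's/^- *md5: *//p' /tmp/sample.dvc | head -n1)
echo "Extracted DVC data hash: $DATA_HASH"
```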
Task 2: Build and Push the Image (02-task-build-push.yaml)
We use Kaniko to build images safely inside the cluster; it requires no access to a Docker daemon.
# tekton/02-task-build-push.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko-build-dvc
spec:
  workspaces:
    - name: source
  params:
    - name: imageUrl
      description: Name of the image to build and push.
    - name: gitCommit
      description: The git commit SHA to be baked into the image.
    - name: dvcDataHash
      description: The DVC data hash to be baked into the image.
  steps:
    # Place the DVC remote credentials into the build context so the
    # builder stage can run `dvc pull`. They only reach builder-stage
    # layers, which are discarded from the final image.
    - name: inject-dvc-credentials
      image: busybox:1.36
      script: |
        #!/bin/sh
        set -e
        cp /secrets/dvc-config/config.local $(workspaces.source.path)/.dvc/config.local
      volumeMounts:
        - name: dvc-config-volume
          mountPath: /secrets/dvc-config
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      command:
        - /kaniko/executor
      args:
        - --dockerfile=$(workspaces.source.path)/Dockerfile
        - --context=dir://$(workspaces.source.path)
        - --destination=$(params.imageUrl):$(params.gitCommit)-$(params.dvcDataHash)
        - --build-arg=GIT_COMMIT=$(params.gitCommit)
        - --build-arg=DVC_DATA_HASH=$(params.dvcDataHash)
      env:
        # Kaniko reads registry credentials from $DOCKER_CONFIG/config.json
        - name: DOCKER_CONFIG
          value: /kaniko/.docker
      volumeMounts:
        - name: docker-config-volume
          mountPath: /kaniko/.docker
  volumes:
    # This secret contains the DVC remote storage credentials
    - name: dvc-config-volume
      secret:
        secretName: dvc-remote-credentials
    # This secret contains the docker registry credentials
    - name: docker-config-volume
      secret:
        secretName: registry-credentials
        items:
          - key: .dockerconfigjson
            path: config.json
This task takes the code and data hashes as parameters and passes them to the Dockerfile as build args. The image tag is the commit-datahash pair, which makes every image unique and traceable. Note that it relies on two Kubernetes Secrets: one holding the DVC remote storage credentials and one holding the registry credentials (the Secret names used in the Task are illustrative).
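As a quick sanity check on this tag scheme: a full Git SHA (40 characters) plus an MD5 (32) plus the separator is 73 characters, well under Docker's 128-character tag limit, and both hashes use only tag-safe characters. The sketch below, with fabricated hashes, builds and validates such a tag:

```shell
#!/bin/sh
# Sketch of the <commit>-<dvc-md5> tag scheme with fabricated values.
GIT_COMMIT="9f8e7d6c5b4a39281706f5e4d3c2b1a098765432"
DVC_DATA_HASH="3f2c9a0d8e7b6c5d4e3f2a1b0c9d8e7f"
IMAGE_TAG="${GIT_COMMIT}-${DVC_DATA_HASH}"

# Validate length (Docker tags max out at 128 characters)
TAG_LEN=${#IMAGE_TAG}
[ "$TAG_LEN" -le 128 ] || { echo "tag too long" >&2; exit 1; }

# Validate character set: tags allow letters, digits, '_', '.', '-'
case "$IMAGE_TAG" in
  *[!A-Za-z0-9_.-]*) echo "invalid tag characters" >&2; exit 1 ;;
esac

echo "my-registry/model-api:${IMAGE_TAG}"
```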
Task 3: Deploy to Docker Swarm (03-task-deploy-swarm.yaml)
This is the part that talks to Swarm. Tekton runs on Kubernetes while our target is Swarm, so the key is a Task that can drive Swarm remotely. We do this by configuring the Docker client to connect to the Swarm manager over a TCP socket.
# tekton/03-task-deploy-swarm.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: deploy-to-swarm
spec:
  params:
    - name: serviceName
      description: The name of the service to update in Docker Swarm.
    - name: newImage
      description: The full tag of the new image to deploy.
    - name: swarmHost
      description: The Docker Swarm manager host (e.g., tcp://swarm-manager:2375).
  steps:
    - name: update-service
      image: docker:20.10
      script: |
        #!/bin/sh
        set -ex
        echo "Connecting to Swarm manager at $(params.swarmHost)..."
        export DOCKER_HOST=$(params.swarmHost)
        echo "Updating service $(params.serviceName) with image $(params.newImage)..."
        # The --with-registry-auth flag is crucial for private registries.
        # It sends the registry authentication details from the client to the Swarm nodes.
        docker service update \
          --image $(params.newImage) \
          --with-registry-auth \
          --detach=true \
          $(params.serviceName)
        echo "Service update command sent successfully."
The Task is deliberately simple: it uses the official docker image and points the DOCKER_HOST environment variable at the Swarm manager; docker service update then triggers Swarm's rolling update. The security consideration here is that the manager's Docker API port must never be exposed to the public internet. In our environment, the Tekton cluster and the Swarm cluster sit in the same VPC with tightly controlled network access.
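After an update, it is worth verifying that the service actually runs the expected image. The real command is docker service inspect with a Go-template format string; in the sketch below docker is stubbed with a shell function so the logic can be shown without a Swarm cluster, and the image names are fabricated:

```shell
#!/bin/sh
# Post-deploy verification sketch. The real invocation would be:
#   docker service inspect --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' model-api
# Here `docker` is stubbed so the comparison logic is runnable standalone.
docker() {
  echo "my-registry/model-api:0a1b2c3-d41d8cd98f00b204@sha256:deadbeef"
}

EXPECTED_IMAGE="my-registry/model-api:0a1b2c3-d41d8cd98f00b204"
RUNNING_IMAGE=$(docker service inspect --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' model-api)

# Swarm records the image with a resolved digest suffix; strip it first
RUNNING_TAG=${RUNNING_IMAGE%@*}
if [ "$RUNNING_TAG" = "$EXPECTED_IMAGE" ]; then
  echo "service is running the expected image"
else
  echo "unexpected image: $RUNNING_TAG" >&2
  exit 1
fi
```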
The Final Pipeline (04-pipeline.yaml)
Now we chain all the Tasks together.
# tekton/04-pipeline.yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: model-serving-pipeline
spec:
  workspaces:
    - name: shared-workspace
  params:
    - name: repoUrl
      description: Git repository URL
    - name: revision
      description: Git revision
    - name: serviceName
      description: Docker Swarm service name
      default: "model-api"
    - name: imageUrl
      description: Base image URL for registry
      default: "my-registry/model-api"
    - name: dvcFilePath
      description: Path to DVC file
      default: "data/model.onnx.dvc"
  tasks:
    - name: fetch-source-and-data-info
      taskRef:
        name: git-clone-dvc
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: repoUrl
          value: $(params.repoUrl)
        - name: revision
          value: $(params.revision)
        - name: dvcFilePath
          value: $(params.dvcFilePath)
    - name: build-image
      runAfter: [fetch-source-and-data-info]
      taskRef:
        name: kaniko-build-dvc
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: imageUrl
          value: $(params.imageUrl)
        - name: gitCommit
          value: $(tasks.fetch-source-and-data-info.results.commit)
        - name: dvcDataHash
          value: $(tasks.fetch-source-and-data-info.results.dvc-data-hash)
    - name: deploy
      runAfter: [build-image]
      taskRef:
        name: deploy-to-swarm
      params:
        - name: serviceName
          value: $(params.serviceName)
        - name: swarmHost
          value: "tcp://docker-swarm-manager.internal:2375" # Use internal DNS
        - name: newImage
          value: "$(params.imageUrl):$(tasks.fetch-source-and-data-info.results.commit)-$(tasks.fetch-source-and-data-info.results.dvc-data-hash)"
The Pipeline definition makes the data flow explicit: the results of the fetch-source-and-data-info task (commit hash and DVC hash) feed the downstream build-image and deploy tasks. Everything is parameterized, so the pipeline adapts easily to other services and repositories.
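For completeness, here is a sketch of a PipelineRun that would kick off this Pipeline; the repository URL and the PVC name shared-pvc are placeholders, and in practice such a manifest is generated by a Tekton Trigger or applied with kubectl create -f:

```shell
#!/bin/sh
# Generate an illustrative PipelineRun manifest for model-serving-pipeline.
# Repo URL and claimName are placeholders for this environment.
cat > pipelinerun.yaml <<'EOF'
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: model-serving-run-
spec:
  pipelineRef:
    name: model-serving-pipeline
  params:
    - name: repoUrl
      value: "https://git.example.com/ml/model-api.git"
    - name: revision
      value: "main"
  workspaces:
    - name: shared-workspace
      persistentVolumeClaim:
        claimName: shared-pvc
EOF
echo "PipelineRun manifest written"
```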
Here is the whole flow as a Mermaid diagram:
graph TD
    subgraph "Developer Workflow"
        A[Git Push on 'main' branch] --> B{Webhook};
        C[Update model & DVC Push] --> A;
    end
    subgraph "Tekton on Kubernetes"
        B --> D[Tekton EventListener];
        D --> E[Trigger & PipelineRun];
        E --> F(Task: git-clone-dvc);
        F -- commit / dvc_hash --> G(Task: kaniko-build-dvc);
        F -- source code --> G;
        G -- image_tag --> H(Task: deploy-to-swarm);
    end
    subgraph "External Systems"
        I[Git Repository];
        J[DVC S3 Storage];
        K[Docker Registry];
        L[Docker Swarm Cluster];
    end
    F --> I;
    G -- Pulls data --> J;
    G -- Pushes image --> K;
    H -- Sends 'docker service update' --> L;
Limitations and Future Directions
This system solved our original traceability problem, but it is not perfect. First, the deploy-to-swarm task depends on an exposed Docker TCP socket. Even inside a private network, that remains a security surface requiring strict control. A safer alternative would be a dedicated agent service running on the Swarm manager that accepts deploy instructions over a more secure API.
Second, the pipeline is currently triggered only by Git webhooks. When only the model changes (dvc push), a developer still has to commit the updated .dvc file to Git to kick off a run. A more advanced system could watch the DVC S3 bucket and start a Tekton pipeline automatically whenever a new data version appears, truly decoupling code triggers from data triggers.
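The core of such a data-change trigger is a simple comparison: does the md5 in the current .dvc file differ from the hash we last deployed? A runnable sketch with fabricated hashes and an assumed state/last-deployed-hash file:

```shell
#!/bin/sh
# Model-change detector sketch. File names and hashes are illustrative;
# a real watcher would persist state elsewhere and create a PipelineRun
# when a difference is found.
mkdir -p state data
echo "aaaa1111aaaa1111aaaa1111aaaa1111" > state/last-deployed-hash
cat > data/model.onnx.dvc <<'EOF'
outs:
- md5: bbbb2222bbbb2222bbbb2222bbbb2222
  size: 2048
  path: model.onnx
EOF

CURRENT=$(sed -n 's/^- *md5: *//p' data/model.onnx.dvc | head -n1)
LAST=$(cat state/last-deployed-hash)

if [ "$CURRENT" != "$LAST" ]; then
  echo "model changed: $LAST -> $CURRENT (would trigger a PipelineRun)"
else
  echo "model unchanged"
fi
```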
Finally, as the number of services and their interdependencies grow, Docker Swarm's service discovery and configuration management may hit their limits. At that point, migrating to Kubernetes and managing deployment state with a GitOps tool such as ArgoCD is the natural evolution. For now, though, this combination of Tekton, DVC, and Swarm gives our MLOps practice a solid, pragmatic foundation: relatively simple to operate, with strong end-to-end traceability.