Cost Optimization for AI Applications: Tokens, Latency, and Budgets

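The Cost Reality of AI Applications

Many developers prototype with the most capable model and find the cost negligible. Once the application launches and traffic grows, the bill can come as a shock.

Prototype: 100 calls/day × $0.01/call = $1/day → barely noticeable
In production: 100,000 calls/day × $0.01/call = $1,000/day → $30,000/month

Understanding the cost structure and knowing how to optimize it is essential for every AI application developer.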

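Understanding Token Pricing

What a Token Is

APIs bill by token, not by character or word, and token efficiency varies by language. The counts below are approximate and depend on the model's tokenizer:

English: "Hello, how are you?" → 6 tokens
Chinese: "你好，你怎么样？" → 8-10 tokens (Chinese usually takes more tokens for the same content)
Code: "console.log('hello')" → 5 tokens

Pricing Comparison of Mainstream Models (early 2026)

Model	Input price ($/1M tokens)	Output price ($/1M tokens)	Context window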
Claude Opus 4	$15.00	$75.00	200K
Claude Sonnet 4	$3.00	$15.00	200K
Claude Haiku 3.5	$0.80	$4.00	200K
GPT-4o	$2.50	$10.00	128K
GPT-4o mini	$0.15	$0.60	128K
o3	$10.00	$40.00	200K
o3-mini	$1.10	$4.40	200K

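Note: for the models in the table above, output tokens cost 4 to 5 times as much as input tokens, because generating tokens is more compute-intensive than reading them.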

A Worked Cost Calculation

def calculate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    calls_per_day: int = 1
) -> dict:
    """Compute the cost of API calls; prices are in USD per million tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    cost_per_call = input_cost + output_cost

    return {
        "per_call": f"${cost_per_call:.4f}",
        "daily": f"${cost_per_call * calls_per_day:.2f}",
        "monthly": f"${cost_per_call * calls_per_day * 30:.2f}",  # 30-day month
        "yearly": f"${cost_per_call * calls_per_day * 365:.2f}",
    }

# Scenario: a customer-service bot
# Average conversation: 800 input tokens, 400 output tokens
# 5,000 conversations per day

# With Claude Sonnet
sonnet_cost = calculate_cost(800, 400, 3.00, 15.00, 5000)
print("Claude Sonnet:", sonnet_cost)
# per call: $0.0084, daily: $42.00, monthly: $1260.00

# With Claude Haiku
haiku_cost = calculate_cost(800, 400, 0.80, 4.00, 5000)
print("Claude Haiku:", haiku_cost)
# per call: $0.0022, daily: $11.20, monthly: $336.00

# Difference: Haiku is about 73% cheaper
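
The same arithmetic can be run across every row of the pricing table to see how wide the spread is for one workload. The sketch below is illustrative only: the prices are copied from the table in this article, `monthly_cost` and `rank_models` are names introduced here, and a 30-day month is assumed.

```python
# Prices in USD per million tokens (input, output), copied from the pricing table above.
PRICES = {
    "Claude Opus 4": (15.00, 75.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude Haiku 3.5": (0.80, 4.00),
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "o3": (10.00, 40.00),
    "o3-mini": (1.10, 4.40),
}


def monthly_cost(input_tokens, output_tokens, calls_per_day, prices, days=30):
    """Return the projected monthly cost in USD for one model."""
    input_price, output_price = prices
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_call * calls_per_day * days


def rank_models(input_tokens, output_tokens, calls_per_day):
    """Return (model, monthly cost) pairs sorted from cheapest to most expensive."""
    costs = {
        name: monthly_cost(input_tokens, output_tokens, calls_per_day, prices)
        for name, prices in PRICES.items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # The customer-service scenario: 800 input tokens, 400 output tokens, 5,000 calls/day
    for name, cost in rank_models(800, 400, 5000):
        print(f"{name:<18} ${cost:>10,.2f}/month")
```

A ranking like this only compares price. Whether the cheaper model is good enough for the task still has to be checked separately.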

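Tiered Model Selection

Not every task needs the most capable model. A sensible tiering strategy can cut costs substantially.

Model Selection Matrix

Task type	Recommended tier	Rationale
Simple classification/extraction	Haiku / GPT-4o mini	A small model is sufficient
Customer-service Q&A	Sonnet / GPT-4o	Balances quality and cost
Code generation	Sonnet / GPT-4o	Needs fairly strong reasoning
Complex reasoning/analysis	Opus / o3	Needs the strongest capability
Content moderation	Haiku / GPT-4o mini	Speed comes first
Translation	Sonnet	Moderate quality requirements

Routing: Choose a Model by Task Complexity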

interface ModelConfig {
  name: string;
  inputPrice: number;  // per million tokens
  outputPrice: number;
  maxTokens: number;
}

const MODELS: Record<string, ModelConfig> = {
  fast: {
    name: "claude-3-5-haiku-20241022",
    inputPrice: 0.80,
    outputPrice: 4.00,
    maxTokens: 4096,
  },
  balanced: {
    name: "claude-sonnet-4-20250514",
    inputPrice: 3.00,
    outputPrice: 15.00,
    maxTokens: 8192,
  },
  powerful: {
    name: "claude-opus-4-20250514",
    inputPrice: 15.00,
    outputPrice: 75.00,
    maxTokens: 4096,
  },
};

function selectModel(task: {
  type: string;
  complexity: "low" | "medium" | "high";
  latencyRequirement: "realtime" | "normal" | "batch";
}): ModelConfig {
  // Realtime + low complexity → fast model
  if (task.latencyRequirement === "realtime" && task.complexity === "low") {
    return MODELS.fast;
  }

  // High complexity → powerful model
  if (task.complexity === "high") {
    return MODELS.powerful;
  }

  // Everything else → balanced model
  return MODELS.balanced;
}

Cascading: Try the Small Model First, Escalate Only When Needed

// callModel, evaluateQuality, and calculateCost are assumed helpers defined elsewhere.
async function cascadeGenerate(
  prompt: string,
  qualityThreshold: number = 0.8
): Promise<{ response: string; model: string; cost: number }> {
  // Step 1: try the small model
  const fastResponse = await callModel(MODELS.fast, prompt);
  const fastCost = calculateCost(prompt, fastResponse, MODELS.fast);

  // Score the answer (simple heuristic rules are enough)
  const quality = evaluateQuality(fastResponse);

  if (quality >= qualityThreshold) {
    return {
      response: fastResponse,
      model: MODELS.fast.name,
      cost: fastCost,
    };
  }

  // Step 2: quality is too low, so escalate to the stronger model
  const balancedResponse = await callModel(MODELS.balanced, prompt);
  return {
    response: balancedResponse,
    model: MODELS.balanced.name,
    // The rejected fast call was still billed, so count both calls
    cost: fastCost + calculateCost(prompt, balancedResponse, MODELS.balanced),
  };
}
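
The cascade depends on two things the code above leaves open: how the quality score is computed, and whether escalation pays off on average. The Python sketch below is one hypothetical way to fill both gaps. The rules inside `evaluate_quality` (a length check, a short list of refusal phrases, an optional JSON check) are illustrative assumptions, not a validated quality metric, and `expected_cascade_cost` assumes every request pays for the fast call while only rejected answers pay for the stronger one.

```python
import json


def evaluate_quality(response: str, expect_json: bool = False) -> float:
    """Score a response between 0.0 and 1.0 using cheap, illustrative rules."""
    text = response.strip()
    if not text:
        return 0.0

    score = 1.0
    # Very short answers are often incomplete.
    if len(text) < 20:
        score -= 0.4
    # Phrases suggesting the model declined or was unsure (example list only).
    refusal_markers = ("i'm not sure", "i cannot", "i don't know")
    if any(marker in text.lower() for marker in refusal_markers):
        score -= 0.5
    # When JSON was requested, unparseable output is a hard failure.
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return 0.0
    return max(score, 0.0)


def expected_cascade_cost(fast_cost: float, strong_cost: float, pass_rate: float) -> float:
    """Expected cost per request: the fast call is always paid, the strong
    call only when the fast answer is rejected."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return fast_cost + (1.0 - pass_rate) * strong_cost
```

With the per-call costs from the customer-service example ($0.00224 for Haiku, $0.0084 for Sonnet), a 70% pass rate gives an expected $0.00476 per request, about 43% less than always calling Sonnet. If fewer than roughly 27% of fast answers pass, the cascade costs more than calling Sonnet directly.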

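Prompt Compression Techniques

The fewer input tokens you send, the lower the cost. Prompt compression is the most direct optimization.

1. Trim the System Prompt

❌ Verbose version (about 200 tokens):
"You are a highly professional, very experienced customer-service assistant. Your job is to help users solve
all the various problems they run into. You should always stay friendly, patient, and professional.
When a user asks a question, analyze it carefully and then give a clear, accurate answer.
If you are not sure of the answer, tell the user honestly..."

✅ Concise version (under 50 tokens):
"You are a customer-service assistant. Answer product questions in a friendly, professional way. Say so when you are unsure."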

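2. Compress the Context

// Compress the conversation history.
// estimateTokens and summarize are assumed helpers defined elsewhere.
function compressHistory(
  messages: Message[],
  maxTokens: number
): Message[] {
  const estimated = estimateTokens(messages);

  if (estimated <= maxTokens) {
    return messages;
  }

  // Step 1: keep the most recent 6 messages (3 conversation turns) verbatim
  const recentMessages = messages.slice(-6);

  // Step 2: replace everything earlier with a summary
  const earlyMessages = messages.slice(0, -6);
  if (earlyMessages.length === 0) {
    return recentMessages; // nothing older to summarize
  }
  const summary = summarize(earlyMessages);

  return [
    { role: "system", content: `Summary of the earlier conversation: ${summary}` },
    ...recentMessages,
  ];
}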
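
The function above returns a summary plus six recent messages without re-checking the result against `maxTokens`, so a few long messages can still exceed the budget. The Python sketch below shows one way to enforce the limit strictly: walk backwards from the newest message and keep only what fits. The names `trim_to_budget` and `rough_token_count` are introduced here for illustration, and the 4-characters-per-token estimate is a crude assumption that should be replaced by a real tokenizer count in practice.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_token_count(message: Message) -> int:
    """Crude placeholder: about one token per 4 characters plus a small
    per-message overhead. Both figures are assumptions."""
    return len(message["content"]) // 4 + 4


def trim_to_budget(
    messages: List[Message],
    max_tokens: int,
    count_tokens: Callable[[Message], int] = rough_token_count,
) -> List[Message]:
    """Keep the newest messages that fit within max_tokens, preserving order."""
    kept: List[Message] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

Trimming like this can be applied after summarization as a final safety net, so the request stays within budget even when the summary or the recent messages turn out longer than expected.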

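3. Use Structured Output to Reduce Output Tokens

# ❌ Free-form output (about 200 tokens)
prompt_verbose = "Analyze the problems in this code and explain each one in detail."

# ✅ Structured output (about 50 tokens)
prompt_structured = """Analyze the code for problems and reply in JSON:
{"issues": [{"line": <line number>, "type": "<type>", "fix": "<suggested fix>"}]}
Output only the JSON, nothing else."""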
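
Structured output only saves money if the reply can be consumed without a retry. Models sometimes wrap the JSON in extra text even when told not to, so a tolerant parser helps. The sketch below is a minimal example under that assumption; `parse_issues` is a hypothetical helper, and the field names match the prompt above.

```python
import json
from typing import Any, Dict, List


def parse_issues(raw: str) -> List[Dict[str, Any]]:
    """Extract the "issues" list from a model reply, tolerating extra text
    before or after the JSON object. Returns [] when nothing usable is found."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    issues = data.get("issues", [])
    if not isinstance(issues, list):
        return []
    # Keep only entries that carry all three expected fields.
    required = {"line", "type", "fix"}
    return [
        issue for issue in issues
        if isinstance(issue, dict) and required.issubset(issue.keys())
    ]
```

An empty result can be treated as a signal to retry or to escalate to a stronger model, rather than passing malformed data downstream.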

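4. Keep Few-shot Examples Short

# ❌ Verbose few-shot (100+ tokens per example)
examples_verbose = """
Example 1:
Input: This product is fantastic. I really love its design and features, and I strongly recommend it to everyone!
Analysis: This review expresses the user's high satisfaction with the product, covering both design and features...
Sentiment: positive

Example 2:
...
"""

# ✅ Compact few-shot (under 30 tokens per example)
examples_compact = """
Examples:
"Great product, recommended!" → positive
"Poor quality, returned it" → negative
"It's okay, nothing special" → neutral
"""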

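Caching Strategies

Caching is one of the most effective ways to cut costs: identical or similar requests do not need a fresh API call.

1. Exact-match Caching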

import { createHash } from "crypto";

class ResponseCache {
  private cache = new Map<string, {
    response: string;
    timestamp: number;
    ttl: number;
  }>();

  private generateKey(prompt: string, model: string): string {
    return createHash("sha256")
      .update(`${model}:${prompt}`)
      .digest("hex");
  }

  get(prompt: string, model: string): string | null {
    const key = this.generateKey(prompt, model);
    const entry = this.cache.get(key);

    if (!entry) return null;
    if (Date.now() - entry.timestamp > entry.ttl) {
      this.cache.delete(key);
      return null;
    }

    return entry.response;
  }

  set(prompt: string, model: string, response: string, ttl = 3600000) {
    const key = this.generateKey(prompt, model);
    this.cache.set(key, { response, timestamp: Date.now(), ttl });
  }
}

// API call backed by the cache
const cache = new ResponseCache();

async function cachedGenerate(prompt: string, model: string): Promise<string> {
  const cached = cache.get(prompt, model);
  // Compare against null so an empty cached response still counts as a hit
  if (cached !== null) {
    console.log("Cache hit: saved one API call");
    return cached;
  }

  const response = await callAPI(prompt, model);
  cache.set(prompt, model, response);
  return response;
}
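
The in-memory map above only removes an entry when that entry is read after it has expired, so a long-running process with many distinct prompts can grow without bound. The Python sketch below is a variation rather than a port of the class above: it adds a size cap with least-recently-used eviction. The class name, the default limits, and the injectable `clock` parameter are all choices made here for illustration.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class BoundedResponseCache:
    """Exact-match cache with a TTL and least-recently-used eviction."""

    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._entries: "OrderedDict[str, Tuple[str, float]]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time

    @staticmethod
    def _key(prompt: str, model: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()

    def get(self, prompt: str, model: str) -> Optional[str]:
        key = self._key(prompt, model)
        entry = self._entries.get(key)
        if entry is None:
            return None
        response, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return response

    def set(self, prompt: str, model: str, response: str) -> None:
        key = self._key(prompt, model)
        self._entries[key] = (response, self._clock())
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

For a service running on several instances, a shared store would usually replace the in-process dictionary, but the key scheme and the eviction idea stay the same.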

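2. Anthropic Prompt Caching

Anthropic offers native Prompt Caching, which is very effective when many requests share the same long System Prompt:

import anthropic

client = anthropic.Anthropic()

# Using Prompt Caching
# The long System Prompt is still sent with every request, but once it is
# cached the API reads the stored prefix instead of processing it again
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "A very long System Prompt goes here, containing a large product knowledge base...",
            "cache_control": {"type": "ephemeral"}  # mark this block as cacheable
        }
    ],
    messages=[{"role": "user", "content": "The user's question"}]
)

# On a cache hit, the cached input tokens are billed at about 90% less
# (the request that writes the cache is billed at a premium over normal input)
# For applications that carry a lot of shared context, the savings are substantial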
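
To see when the discount pays for itself, it helps to put numbers on a shared prefix. The sketch below is a back-of-the-envelope model, not billing logic. The multipliers (1.25× base input price for the request that writes the cache, 0.10× for requests that read it) are assumptions that should be checked against the provider's current pricing page, and the function assumes the cache never expires between calls.

```python
def cached_prefix_cost(
    prefix_tokens: int,
    calls: int,
    input_price_per_million: float,
    write_multiplier: float = 1.25,  # assumed premium for the call that writes the cache
    read_multiplier: float = 0.10,   # assumed rate for calls that read the cache
) -> dict:
    """Compare the input cost of a shared prefix with and without caching.

    Simplification: the first call writes the cache and every later call
    hits it, i.e. the cache never expires between calls.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    base = prefix_tokens / 1_000_000 * input_price_per_million
    without_cache = base * calls
    with_cache = base * write_multiplier + base * read_multiplier * (calls - 1)
    return {
        "without_cache": without_cache,
        "with_cache": with_cache,
        "savings_ratio": 1 - with_cache / without_cache,
    }
```

Under these assumptions, a 10,000-token prefix at $3.00 per million tokens costs $3.00 across 100 calls without caching and about $0.33 with it. A prefix that is used only once costs more with caching than without, so caching suits prompts that are reused frequently within the cache lifetime.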

3. Semantic Caching

For queries that mean the same thing but are worded differently, use embeddings to match on meaning:

from numpy import dot
from numpy.linalg import norm

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.entries = []  # [(embedding, prompt, response)]
        self.threshold = similarity_threshold

    def cosine_similarity(self, a, b):
        return dot(a, b) / (norm(a) * norm(b))

    def get(self, query_embedding):
        best_match = None
        best_score = 0

        for emb, prompt, response in self.entries:
            score = self.cosine_similarity(query_embedding, emb)
            if score > best_score:
                best_score = score
                best_match = response

        if best_score >= self.threshold:
            return best_match
        return None

    def set(self, embedding, prompt, response):
        self.entries.append((embedding, prompt, response))
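
The threshold is what separates a useful hit from a wrong answer served from the cache, so it is worth seeing how it behaves. The sketch below re-implements the lookup in plain Python so that it runs without NumPy, and it adds a guard for zero-length vectors. The three-dimensional vectors and the cached answers are invented for illustration. Real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def lookup(
    entries: List[Tuple[Sequence[float], str]],
    query: Sequence[float],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the cached response closest to the query, or None if even the
    best match falls below the threshold."""
    best_response, best_score = None, -1.0
    for embedding, response in entries:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_response, best_score = response, score
    return best_response if best_score >= threshold else None


# Toy "embeddings" and answers, made up for this example
entries = [
    ([1.0, 0.0, 0.0], "Refunds take 5 business days."),
    ([0.0, 1.0, 0.0], "Shipping is free over $50."),
]

if __name__ == "__main__":
    print(lookup(entries, [0.99, 0.05, 0.0]))  # near the first entry: cache hit
    print(lookup(entries, [0.7, 0.7, 0.0]))    # halfway between both: miss
```

A query that sits between two cached questions is rejected rather than answered with either one, which is the behavior a high threshold is meant to guarantee. Lowering the threshold raises the hit rate at the price of more mismatched answers.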

Batching and Rate Control

Batching: Reduce Per-call Overhead

// Process multiple requests in batches
async function batchProcess(
  items: string[],
  batchSize: number = 10
): Promise<string[]> {
  const results: string[] = [];

  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);

    // Merge several small requests into one larger request
    const combinedPrompt = batch
      .map((item, idx) => `[${idx + 1}] ${item}`)
      .join("\n");

    const response = await callAPI(
      `Handle each of the following ${batch.length} requests separately and label each answer with its [number]:\n${combinedPrompt}`
    );

    // Parse the batched response
    const parsed = parseBatchResponse(response, batch.length);
    results.push(...parsed);
  }

  return results;
}
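
Merging requests only works if the combined reply can be split back into per-item answers, which the code above delegates to an undefined `parseBatchResponse`. The Python sketch below is one possible implementation, assuming the model follows the `[number]` labelling convention from the prompt. It returns `None` for items the model skipped, so the caller can retry those individually.

```python
import re
from typing import List, Optional


def parse_batch_response(response: str, expected: int) -> List[Optional[str]]:
    """Split a reply labelled with [1], [2], ... into per-item answers.

    Returns a list of length `expected`; items the model skipped are None.
    """
    results: List[Optional[str]] = [None] * expected
    # Markers must start a line, so "[2]" in the middle of an answer does not split it.
    pattern = re.compile(r"^\[(\d+)\][ \t]*", re.MULTILINE)
    matches = list(pattern.finditer(response))
    for i, match in enumerate(matches):
        index = int(match.group(1))
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        answer = response[match.end():end].strip()
        # Ignore out-of-range labels and keep the first answer for each index.
        if 1 <= index <= expected and results[index - 1] is None:
            results[index - 1] = answer
    return results
```

Because one malformed reply can affect every item in the batch, it is worth counting the `None` entries and falling back to single requests when too many are missing.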

Rate Limiting: Stay Within Quota

class RateLimiter {
  private tokens: number;
  private lastRefill: number;
  private readonly maxTokens: number;
  private readonly refillRate: number; // tokens per second

  constructor(maxTokensPerMinute: number) {
    this.maxTokens = maxTokensPerMinute;
    this.tokens = maxTokensPerMinute;
    this.lastRefill = Date.now();
    this.refillRate = maxTokensPerMinute / 60;
  }

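  async waitForTokens(needed: number): Promise<void> {
    // A request larger than the bucket can never be served; fail fast instead of looping forever
    if (needed > this.maxTokens) {
      throw new Error(`Requested ${needed} tokens, but the bucket holds at most ${this.maxTokens}`);
    }
    while (true) {
      this.refill();
      if (this.tokens >= needed) {
        this.tokens -= needed;
        return;
      }
      // Wait until enough tokens have been refilled
      const waitTime = ((needed - this.tokens) / this.refillRate) * 1000;
      await new Promise(resolve => setTimeout(resolve, waitTime));
    }
  }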

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }
}

Cost Monitoring and Budget Alerts

Real-time Cost Tracking

class CostTracker {
  private dailyCost = 0;
  private monthlyCost = 0;
  private readonly dailyBudget: number;
  private readonly monthlyBudget: number;

  constructor(dailyBudget: number, monthlyBudget: number) {
    this.dailyBudget = dailyBudget;
    this.monthlyBudget = monthlyBudget;
  }

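  recordUsage(inputTokens: number, outputTokens: number, model: ModelConfig) {
    const cost =
      (inputTokens / 1_000_000) * model.inputPrice +
      (outputTokens / 1_000_000) * model.outputPrice;

    const prevDaily = this.dailyCost;
    const prevMonthly = this.monthlyCost;
    this.dailyCost += cost;
    this.monthlyCost += cost;

    // Alert once, when spend first crosses 80% of a budget, not on every later call.
    // Note: the totals are never reset here; production code needs a day/month rollover.
    const dailyLimit = this.dailyBudget * 0.8;
    const monthlyLimit = this.monthlyBudget * 0.8;
    if (prevDaily <= dailyLimit && this.dailyCost > dailyLimit) {
      this.alert("80% of the daily budget has been used", this.dailyCost, this.dailyBudget);
    }
    if (prevMonthly <= monthlyLimit && this.monthlyCost > monthlyLimit) {
      this.alert("80% of the monthly budget has been used", this.monthlyCost, this.monthlyBudget);
    }

    return cost;
  }

  private alert(message: string, current: number, budget: number) {
    console.warn(
      `⚠ Cost alert: ${message} ($${current.toFixed(2)} / $${budget.toFixed(2)})`
    );
    // Send a notification (email, Slack, etc.)
  }

  getReport(): string {
    return `Daily cost: $${this.dailyCost.toFixed(2)} / $${this.dailyBudget}
Monthly cost: $${this.monthlyCost.toFixed(2)} / $${this.monthlyBudget}`;
  }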
}
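
The tracker above accumulates totals for as long as the process runs and never resets them, so its "daily" figure stops being daily after the first midnight. The Python sketch below shows one way to handle the rollover and to expose a hard-stop check in addition to the warning. `RollingCostTracker`, its method names, and the injectable `today` function are introduced here for illustration, and state is kept in memory only, so a restart loses the totals.

```python
from datetime import date
from typing import Callable


class RollingCostTracker:
    """Accumulates spend and resets the daily/monthly totals on calendar rollover."""

    def __init__(self, daily_budget: float, monthly_budget: float,
                 today: Callable[[], date] = date.today):
        self.daily_budget = daily_budget
        self.monthly_budget = monthly_budget
        self._today = today  # injectable so tests can control the date
        self._day = today()
        self.daily_cost = 0.0
        self.monthly_cost = 0.0

    def _roll_over(self) -> None:
        current = self._today()
        # Reset the monthly total when the calendar month changes.
        if (current.year, current.month) != (self._day.year, self._day.month):
            self.monthly_cost = 0.0
        # Reset the daily total when the date changes.
        if current != self._day:
            self.daily_cost = 0.0
            self._day = current

    def record(self, cost: float) -> None:
        self._roll_over()
        self.daily_cost += cost
        self.monthly_cost += cost

    def over_budget(self) -> bool:
        """True once either budget is exhausted; callers can use this to
        reject requests or switch to a cheaper model."""
        self._roll_over()
        return (self.daily_cost >= self.daily_budget
                or self.monthly_cost >= self.monthly_budget)
```

Checking `over_budget()` before each call turns the budget from something that is only reported into something that is enforced.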

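Comparing the Optimization Strategies

Strategy	Expected savings	Implementation difficulty	Best suited for
Model downgrade	50-90%	Low	Simple tasks
Prompt compression	20-50%	Low	All scenarios
Response caching	30-70%	Medium	Many repeated queries
Prompt Caching	50-90% (input portion)	Low	Long System Prompts
Batching	20-40%	Medium	Bulk tasks
Cascading	40-60%	High	Workloads with widely varying complexity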
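
When several strategies are stacked, the percentages in the table do not add up, because each strategy only reduces what the previous ones left over. The sketch below combines savings multiplicatively. It assumes the strategies act independently of each other, which is an idealization: caching, for example, saves less in absolute terms once a cheaper model is already in use.

```python
from typing import Iterable


def combined_savings(savings: Iterable[float]) -> float:
    """Combine fractional savings multiplicatively: each strategy cuts
    only the cost that remains after the previous ones."""
    remaining = 1.0
    for s in savings:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each saving must be a fraction between 0 and 1")
        remaining *= 1.0 - s
    return 1.0 - remaining
```

For example, a 50% saving from a model downgrade followed by a 30% saving from response caching gives `combined_savings([0.5, 0.3])`, which is 65%, not 80%.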

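Summary

Cost optimization for AI applications is a systems effort that has to be tackled at several levels:

Understand the token pricing model and estimate costs accurately
Choose a model tier that matches the task's complexity
Use prompt compression to reduce input and output tokens
Use caching to avoid repeated computation
Set up cost monitoring and budget alerts

The most expensive API call is not the one with the highest price, but the one that was unnecessary. The first step in optimizing cost is to work out which calls you actually need.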
