OpenAI API Integration Tutorial / 09 - Whisper Speech-to-Text

Chapter 09 · Whisper Speech-to-Text API

Whisper is OpenAI's open-source speech recognition model; through the API you can easily convert audio to text. This chapter covers the transcription endpoint, the translation endpoint, multi-language support, and real-time transcription approaches.


9.1 API Overview

Endpoints

- Transcriptions (POST /v1/audio/transcriptions): audio → text in the original language
- Translations (POST /v1/audio/translations): audio → English text

Supported audio formats

- MP3 (.mp3): the most common
- M4A (.m4a): common on Apple devices
- WAV (.wav): lossless, larger files
- WEBM (.webm): browser recording format
- MP4 (.mp4): video files (the audio track is extracted)
- FLAC (.flac): lossless compression
- OGG (.ogg): open format

Limits

- Maximum file size: 25 MB
- Supported languages: 57+
- Model: whisper-1
- Pricing: $0.006 / minute
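
Since billing is by audio duration, the pricing above turns into a one-line estimate (a minimal sketch; any rounding OpenAI applies per request is not modeled here):

```python
def estimate_cost(duration_seconds: float, rate_per_minute: float = 0.006) -> float:
    """Rough Whisper cost in USD at $0.006 per minute of audio."""
    return duration_seconds / 60 * rate_per_minute

print(f"1 hour of audio ≈ ${estimate_cost(3600):.2f}")  # → $0.36
```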

9.2 Basic Transcription

Python example

from openai import OpenAI

client = OpenAI()

def transcribe_file(file_path: str, language: str | None = None) -> str:
    """Transcribe an audio file to text."""
    with open(file_path, "rb") as audio_file:
        kwargs = {
            "model": "whisper-1",
            "file": audio_file,
            "response_format": "text",
        }
        if language:
            kwargs["language"] = language

        response = client.audio.transcriptions.create(**kwargs)

    return response if isinstance(response, str) else response.text

# Usage
text = transcribe_file("meeting.mp3", language="zh")
print(text)

curl example

curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F language="zh" \
  -F response_format="text"

9.3 Response Formats

Values for the response_format parameter:

- text: plain text; for simple use
- json: JSON object; the default
- verbose_json: detailed JSON; when you need timestamps or segments
- srt: SRT subtitles; for video subtitle files
- vtt: WebVTT subtitles; for web subtitles

Verbose JSON example

with open("speech.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],
    )

# Segment-level information (the v1 SDK returns model objects, so use attribute access)
for segment in response.segments:
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s] {segment.text}")

# Word-level timestamps
for word in response.words:
    print(f"  {word.word} [{word.start:.2f}s - {word.end:.2f}s]")

SRT subtitle generation

def generate_srt(file_path: str, output_path: str):
    """Generate an SRT subtitle file."""
    with open(file_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )

    srt_lines = []
    for i, segment in enumerate(response.segments, 1):
        start = format_srt_time(segment.start)
        end = format_srt_time(segment.end)
        srt_lines.append(f"{i}")
        srt_lines.append(f"{start} --> {end}")
        srt_lines.append(segment.text)
        srt_lines.append("")

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(srt_lines))

def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
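
The table in 9.3 also lists WebVTT; its timestamps carry the same fields but use a period before the milliseconds instead of a comma. A variant sketch mirroring the helper above:

```python
def format_vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm, period not comma)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

print(format_vtt_time(3661.5))  # → 01:01:01.500
```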

9.4 Audio Translation

Translate audio in any language into English:

def translate_to_english(file_path: str) -> str:
    """Translate audio into English text."""
    with open(file_path, "rb") as audio_file:
        response = client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
        )
    return response if isinstance(response, str) else response.text

# Chinese audio → English text
english_text = translate_to_english("chinese_speech.mp3")
print(english_text)

9.5 Chunking Large Files

Files over 25 MB must be split into chunks:

from pydub import AudioSegment
import tempfile
import os

def transcribe_large_file(file_path: str, chunk_minutes: int = 10) -> str:
    """Transcribe a large file by splitting it into chunks."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    total_ms = len(audio)

    full_text = []
    for start in range(0, total_ms, chunk_ms):
        end = min(start + chunk_ms, total_ms)
        chunk = audio[start:end]

        # Export the chunk to a temporary file
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            chunk.export(tmp.name, format="mp3")
            tmp_path = tmp.name

        try:
            with open(tmp_path, "rb") as f:
                response = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    response_format="text",
                    language="zh",
                )
            full_text.append(response if isinstance(response, str) else response.text)
            print(f"Finished minutes {start//60000}-{end//60000}")
        finally:
            os.unlink(tmp_path)

    return "\n".join(full_text)
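
Whether the chunked path is even needed can be decided up front from the file size (a small helper sketch; 25 MB is the limit from section 9.1):

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB upload limit

def needs_chunking(file_path: str) -> bool:
    """True when the file exceeds the upload limit and must be split first."""
    return os.path.getsize(file_path) > MAX_UPLOAD_BYTES
```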

9.6 Node.js Implementation

import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI();

async function transcribeFile(filePath, language = 'zh') {
  const response = await client.audio.transcriptions.create({
    model: 'whisper-1',
    file: fs.createReadStream(filePath),
    language: language,
    response_format: 'verbose_json',
  });

  console.log(`Text: ${response.text}`);
  console.log(`Duration: ${response.duration}s`);
  console.log(`Language: ${response.language}`);

  return response;
}

transcribeFile('meeting.mp3');

9.7 Multi-language Support

Whisper supports 57+ languages. Common language codes:

- Chinese: zh / English: en
- Japanese: ja / Korean: ko
- French: fr / German: de
- Spanish: es / Russian: ru
- Arabic: ar / Portuguese: pt
- Thai: th / Vietnamese: vi

Tip: specifying the language parameter improves both accuracy and speed.


9.8 Voice Chat Pipeline

Combine Whisper + Chat Completions + TTS to build a voice conversation:

def voice_chat(audio_path: str) -> tuple[str, str]:
    """Voice chat: listen → think → speak."""
    # 1. Speech to text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f, language="zh",
        )
    user_text = transcript if isinstance(transcript, str) else transcript.text
    print(f"User said: {user_text}")

    # 2. Generate a reply
    chat_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a friendly voice assistant; keep answers concise."},
            {"role": "user", "content": user_text},
        ],
        max_tokens=300,
    )
    reply_text = chat_response.choices[0].message.content
    print(f"AI said: {reply_text}")

    # 3. Text to speech
    speech_response = client.audio.speech.create(
        model="tts-1", voice="nova", input=reply_text,
    )
    speech_response.stream_to_file("reply.mp3")

    return user_text, reply_text

9.9 Real-time Transcription

Live recording in the browser plus transcription on the backend:

Backend API

from fastapi import FastAPI, UploadFile, File

app = FastAPI()

@app.post("/api/transcribe")
async def transcribe_audio(file: UploadFile = File(...)):
    """Receive an audio chunk and transcribe it."""
    audio_bytes = await file.read()
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=(file.filename, audio_bytes),  # (name, bytes) so the SDK can infer the format
        language="zh",
        response_format="verbose_json",
    )
    return {
        "text": response.text,
        "duration": response.duration,
        "language": response.language,
    }

Frontend recording

async function startLiveTranscription() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Record in self-contained 5-second takes. Restarting the recorder for
  // each take matters: with one long recording, only the first timeslice
  // blob carries the WebM container header, and later fragments cannot be
  // decoded on their own by the transcription backend.
  const recordOneChunk = () => {
    const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });

    mediaRecorder.ondataavailable = async (e) => {
      if (e.data.size === 0) return;

      const formData = new FormData();
      formData.append('file', e.data, 'chunk.webm');

      const res = await fetch('/api/transcribe', {
        method: 'POST',
        body: formData,
      });
      const data = await res.json();
      document.getElementById('output').textContent += data.text + ' ';
    };

    mediaRecorder.start();
    setTimeout(() => {
      mediaRecorder.stop();  // flush this take, then start the next one
      recordOneChunk();
    }, 5000);
  };

  recordOneChunk();
}

9.10 Business Scenarios

- Meeting notes: verbose_json + segments; a full record with timestamps
- Video subtitles: srt/vtt format; usable directly as subtitle files
- Voice search: text format; plain text for search indexing
- Call-center QA: segments + sentiment analysis; assess call quality
- Interview transcription: word-level timestamps; pinpoint every utterance
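
The scenarios above can be captured as small request presets (a sketch; the preset names and dict are my own, not part of the API):

```python
# Suggested transcription settings per scenario (based on the list above)
SCENARIO_PRESETS = {
    "meeting_notes": {"response_format": "verbose_json",
                      "timestamp_granularities": ["segment"]},
    "video_subtitles": {"response_format": "srt"},
    "voice_search": {"response_format": "text"},
    "interview": {"response_format": "verbose_json",
                  "timestamp_granularities": ["word"]},
}

def request_kwargs(scenario: str) -> dict:
    """kwargs to pass to client.audio.transcriptions.create for a scenario."""
    return {"model": "whisper-1", **SCENARIO_PRESETS[scenario]}

print(request_kwargs("video_subtitles"))
```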

9.11 Notes

  1. File size: files over 25 MB must be chunked
  2. Audio quality: background noise hurts accuracy; consider denoising as a preprocessing step
  3. Language detection: without a language parameter Whisper auto-detects, which can be wrong
  4. Cost: billed by audio duration, $0.006/minute ≈ $0.36/hour
  5. Chunking long audio: aim for 10-15 minutes per chunk and avoid cutting mid-sentence
  6. Format choice: MP3 is small and upload-friendly; WAV is higher quality but larger
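
The chunk-size guidance in note 5 reduces to simple arithmetic when planning an upload (sketch):

```python
import math

def chunk_count(total_minutes: float, chunk_minutes: int = 10) -> int:
    """Number of pieces a recording splits into at the given chunk length."""
    return math.ceil(total_minutes / chunk_minutes)

print(chunk_count(95))  # → 10 chunks for a 95-minute recording
```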

9.12 Further Reading


Next chapter: 10 - TTS Speech Synthesis: text-to-speech, streaming synthesis, and voice selection.