OpenAI API Integration Tutorial / 09 - Whisper Speech-to-Text

Chapter 09 · Whisper Speech-to-Text API

Whisper is OpenAI's open-source speech recognition model; through the API you can easily convert audio to text. This chapter covers the transcription endpoint, the translation endpoint, multi-language support, and real-time transcription approaches.


9.1 API Overview

Endpoints

- Transcriptions (POST /v1/audio/transcriptions): audio → text in the original language
- Translations (POST /v1/audio/translations): audio → English text

Supported audio formats

- MP3 (.mp3): the most common
- M4A (.m4a): common on Apple devices
- WAV (.wav): lossless, larger files
- WEBM (.webm): browser recording format
- MP4 (.mp4): video files (the audio track is extracted)
- FLAC (.flac): lossless compression
- OGG (.ogg): open format

Limits

- Maximum file size: 25 MB
- Supported languages: 57+
- Model: whisper-1
- Pricing: $0.006 / minute
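
Since billing is by audio duration, the pricing above turns into a one-line estimate (a minimal sketch; any rounding OpenAI applies per request is not modeled here):

```python
def estimate_cost(duration_seconds: float, rate_per_minute: float = 0.006) -> float:
    """Rough Whisper cost in USD at $0.006 per minute of audio."""
    return duration_seconds / 60 * rate_per_minute

print(f"1 hour of audio ≈ ${estimate_cost(3600):.2f}")  # → $0.36
```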

9.2 Basic Transcription

Python example

from openai import OpenAI

client = OpenAI()

def transcribe_file(file_path: str, language: str | None = None) -> str:
    """Transcribe an audio file to text."""
    with open(file_path, "rb") as audio_file:
        kwargs = {
            "model": "whisper-1",
            "file": audio_file,
            "response_format": "text",
        }
        if language:
            kwargs["language"] = language

        response = client.audio.transcriptions.create(**kwargs)

    return response if isinstance(response, str) else response.text

# Usage
text = transcribe_file("meeting.mp3", language="zh")
print(text)

curl example

curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F language="zh" \
  -F response_format="text"

9.3 Response Formats

Values for the response_format parameter:

- text: plain text; for simple use
- json: JSON object; the default
- verbose_json: detailed JSON; when you need timestamps or segments
- srt: SRT subtitles; for video subtitle files
- vtt: WebVTT subtitles; for web subtitles

Verbose JSON example

with open("speech.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],
    )

# Segment-level information (the v1 SDK returns model objects, so use attribute access)
for segment in response.segments:
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s] {segment.text}")

# Word-level timestamps
for word in response.words:
    print(f"  {word.word} [{word.start:.2f}s - {word.end:.2f}s]")

SRT subtitle generation

def generate_srt(file_path: str, output_path: str):
    """Generate an SRT subtitle file."""
    with open(file_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )

    srt_lines = []
    for i, segment in enumerate(response.segments, 1):
        start = format_srt_time(segment.start)
        end = format_srt_time(segment.end)
        srt_lines.append(f"{i}")
        srt_lines.append(f"{start} --> {end}")
        srt_lines.append(segment.text)
        srt_lines.append("")

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(srt_lines))

def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
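
The table in 9.3 also lists WebVTT; its timestamps carry the same fields but use a period before the milliseconds instead of a comma. A variant sketch mirroring the helper above:

```python
def format_vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm, period not comma)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

print(format_vtt_time(3661.5))  # → 01:01:01.500
```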

9.4 Audio Translation

Translate audio in any language into English:

def translate_to_english(file_path: str) -> str:
    """Translate audio into English text."""
    with open(file_path, "rb") as audio_file:
        response = client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
        )
    return response if isinstance(response, str) else response.text

# Chinese audio → English text
english_text = translate_to_english("chinese_speech.mp3")
print(english_text)

9.5 Chunking Large Files

Files over 25 MB must be split into chunks:

from pydub import AudioSegment
import tempfile
import os

def transcribe_large_file(file_path: str, chunk_minutes: int = 10) -> str:
    """Transcribe a large file by splitting it into chunks."""
    audio = AudioSegment.from_file(file_path)
    chunk_ms = chunk_minutes * 60 * 1000
    total_ms = len(audio)

    full_text = []
    for start in range(0, total_ms, chunk_ms):
        end = min(start + chunk_ms, total_ms)
        chunk = audio[start:end]

        # Export the chunk to a temporary file
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            chunk.export(tmp.name, format="mp3")
            tmp_path = tmp.name

        try:
            with open(tmp_path, "rb") as f:
                response = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    response_format="text",
                    language="zh",
                )
            full_text.append(response if isinstance(response, str) else response.text)
            print(f"Finished minutes {start//60000}-{end//60000}")
        finally:
            os.unlink(tmp_path)

    return "\n".join(full_text)
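
Whether the chunked path is even needed can be decided up front from the file size (a small helper sketch; 25 MB is the limit from section 9.1):

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB upload limit

def needs_chunking(file_path: str) -> bool:
    """True when the file exceeds the upload limit and must be split first."""
    return os.path.getsize(file_path) > MAX_UPLOAD_BYTES
```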

9.6 Node.js Implementation

import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI();

async function transcribeFile(filePath, language = 'zh') {
  const response = await client.audio.transcriptions.create({
    model: 'whisper-1',
    file: fs.createReadStream(filePath),
    language: language,
    response_format: 'verbose_json',
  });

  console.log(`Text: ${response.text}`);
  console.log(`Duration: ${response.duration}s`);
  console.log(`Language: ${response.language}`);

  return response;
}

transcribeFile('meeting.mp3');

9.7 Multi-language Support

Whisper supports 57+ languages. Common language codes:

- Chinese: zh / English: en
- Japanese: ja / Korean: ko
- French: fr / German: de
- Spanish: es / Russian: ru
- Arabic: ar / Portuguese: pt
- Thai: th / Vietnamese: vi

Tip: specifying the language parameter improves both accuracy and speed.


9.8 Voice Chat Pipeline

Combine Whisper + Chat Completions + TTS to build a voice conversation:

def voice_chat(audio_path: str) -> tuple[str, str]:
    """Voice chat: listen → think → speak."""
    # 1. Speech to text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f, language="zh",
        )
    user_text = transcript if isinstance(transcript, str) else transcript.text
    print(f"User said: {user_text}")

    # 2. Generate a reply
    chat_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a friendly voice assistant; keep answers concise."},
            {"role": "user", "content": user_text},
        ],
        max_tokens=300,
    )
    reply_text = chat_response.choices[0].message.content
    print(f"AI said: {reply_text}")

    # 3. Text to speech
    speech_response = client.audio.speech.create(
        model="tts-1", voice="nova", input=reply_text,
    )
    speech_response.stream_to_file("reply.mp3")

    return user_text, reply_text

9.9 Real-time Transcription

Live recording in the browser plus transcription on the backend:

Backend API

from fastapi import FastAPI, UploadFile, File

app = FastAPI()

@app.post("/api/transcribe")
async def transcribe_audio(file: UploadFile = File(...)):
    """Receive an audio chunk and transcribe it."""
    audio_bytes = await file.read()
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=(file.filename, audio_bytes),  # (name, bytes) so the SDK can infer the format
        language="zh",
        response_format="verbose_json",
    )
    return {
        "text": response.text,
        "duration": response.duration,
        "language": response.language,
    }

Frontend recording

async function startLiveTranscription() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Record in self-contained 5-second takes. Restarting the recorder for
  // each take matters: with one long recording, only the first timeslice
  // blob carries the WebM container header, and later fragments cannot be
  // decoded on their own by the transcription backend.
  const recordOneChunk = () => {
    const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });

    mediaRecorder.ondataavailable = async (e) => {
      if (e.data.size === 0) return;

      const formData = new FormData();
      formData.append('file', e.data, 'chunk.webm');

      const res = await fetch('/api/transcribe', {
        method: 'POST',
        body: formData,
      });
      const data = await res.json();
      document.getElementById('output').textContent += data.text + ' ';
    };

    mediaRecorder.start();
    setTimeout(() => {
      mediaRecorder.stop();  // flush this take, then start the next one
      recordOneChunk();
    }, 5000);
  };

  recordOneChunk();
}

9.10 Business Scenarios

- Meeting notes: verbose_json + segments; a full record with timestamps
- Video subtitles: srt/vtt format; usable directly as subtitle files
- Voice search: text format; plain text for search indexing
- Call-center QA: segments + sentiment analysis; assess call quality
- Interview transcription: word-level timestamps; pinpoint every utterance
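
The scenarios above can be captured as small request presets (a sketch; the preset names and dict are my own, not part of the API):

```python
# Suggested transcription settings per scenario (based on the list above)
SCENARIO_PRESETS = {
    "meeting_notes": {"response_format": "verbose_json",
                      "timestamp_granularities": ["segment"]},
    "video_subtitles": {"response_format": "srt"},
    "voice_search": {"response_format": "text"},
    "interview": {"response_format": "verbose_json",
                  "timestamp_granularities": ["word"]},
}

def request_kwargs(scenario: str) -> dict:
    """kwargs to pass to client.audio.transcriptions.create for a scenario."""
    return {"model": "whisper-1", **SCENARIO_PRESETS[scenario]}

print(request_kwargs("video_subtitles"))
```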

9.11 Notes

  1. File size: files over 25 MB must be chunked
  2. Audio quality: background noise hurts accuracy; consider denoising as a preprocessing step
  3. Language detection: without a language parameter Whisper auto-detects, which can be wrong
  4. Cost: billed by audio duration, $0.006/minute ≈ $0.36/hour
  5. Chunking long audio: aim for 10-15 minutes per chunk and avoid cutting mid-sentence
  6. Format choice: MP3 is small and upload-friendly; WAV is higher quality but larger
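
The chunk-size guidance in note 5 reduces to simple arithmetic when planning an upload (sketch):

```python
import math

def chunk_count(total_minutes: float, chunk_minutes: int = 10) -> int:
    """Number of pieces a recording splits into at the given chunk length."""
    return math.ceil(total_minutes / chunk_minutes)

print(chunk_count(95))  # → 10 chunks for a 95-minute recording
```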

9.12 Further Reading


Next chapter: 10 - TTS Speech Synthesis: text-to-speech, streaming synthesis, and voice selection.