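LM Studio Local Model Guide / 05 - Local Server

Local Server

Use LM Studio's local API server to give your applications an OpenAI-compatible LLM endpoint.

5.1 Local Server Overview

What is the local server?

LM Studio's Local Server feature turns your computer into an API server that exposes an OpenAI-compatible interface. Applications built on the OpenAI SDK can usually switch to a local model by changing only the base URL and the model name.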

Architecture:

┌──────────────┐     HTTP      ┌──────────────────┐
│  Your app    │ ──────────→   │  LM Studio       │
│  (Python/JS) │  localhost    │  Local Server    │
│              │  :1234        │  (OpenAI-compat) │
└──────────────┘               └────────┬─────────┘
                                        │
                                        ▼
                               ┌──────────────────┐
                               │  Local model     │
                               │  (GGUF)          │
                               └──────────────────┘

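Core features

Feature	Description
OpenAI compatible	Compatible with the OpenAI Chat Completions API
Streaming responses	Supports streamed output via Server-Sent Events (SSE)
Model switching	The model can be changed while the server is running
Concurrency	Multiple clients can connect at the same time
CORS support	Lets browser-based applications call the server directly
Custom port	Any available port can be configured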

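5.2 Starting the Server

Steps

1. Click the 🖥️ "Local Server" icon in the left navigation bar of LM Studio
2. Select the model to load at the top
3. Configure the server parameters (port, concurrency, and so on)
4. Click the "Start Server" button
5. Wait for the model to finish loading and the status to change to "Running"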
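
Once the status shows "Running", a script can confirm that the server is reachable before sending real work to it. The following is a minimal sketch, assuming the default address `http://localhost:1234/v1` and the `/v1/models` endpoint described in 5.3; the helper names `fetch_models` and `wait_for_server` are illustrative and not part of LM Studio.

```python
import json
import time
import urllib.request


def fetch_models(base_url: str) -> list[str]:
    """Return the model ids reported by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


def wait_for_server(base_url: str = "http://localhost:1234/v1",
                    attempts: int = 10,
                    delay: float = 1.0,
                    fetch=fetch_models) -> list[str]:
    """Poll the server until it answers, then return the model ids.

    `fetch` is injectable so the polling loop can be exercised without
    a running server.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except OSError as exc:  # covers refused connections and timeouts
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"server at {base_url} did not respond") from last_error
```

Calling `wait_for_server()` at the start of a script gives a clear error when the server was never started, instead of a failure deep inside the first chat request.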

Server interface

┌──────────────────────────────────────────────────────────┐
│ Local Server                                 [● Running] │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Model: [Qwen2.5-7B-Instruct ▾]                          │
│                                                          │
│  ┌───────────────────────────────────────────────────┐   │
│  │ Server Status: Running                            │   │
│  │ URL: http://localhost:1234                        │   │
│  │ Model: qwen2.5-7b-instruct-q4_k_m               │   │
│  │ GPU: NVIDIA RTX 4070 (12 GB)                     │   │
│  │ Requests served: 42                               │   │
│  │ Avg latency: 1.2s                                 │   │
│  └───────────────────────────────────────────────────┘   │
│                                                          │
│  ┌───────────────────────────────────────────────────┐   │
│  │ Settings                                          │   │
│  │ Port: [1234]                                      │   │
│  │ CORS: [✅ Enabled]                                │   │
│  │ Verbose logging: [☐]                              │   │
│  └───────────────────────────────────────────────────┘   │
│                                                          │
│  ┌───────────────────────────────────────────────────┐   │
│  │ Request Log                                       │   │
│  │ 12:34:56 POST /v1/chat/completions 200 1.23s     │   │
│  │ 12:35:01 POST /v1/chat/completions 200 0.89s     │   │
│  │ 12:35:15 GET /v1/models 200 0.01s                │   │
│  └───────────────────────────────────────────────────┘   │
│                                                          │
└──────────────────────────────────────────────────────────┘

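5.3 OpenAI-Compatible API

Supported endpoints

Endpoint	Method	Description
`/v1/chat/completions`	POST	Chat completions (most commonly used)
`/v1/completions`	POST	Text completions (legacy interface)
`/v1/models`	GET	List available models
`/v1/embeddings`	POST	Get text embedding vectors

Chat Completions API

Non-streaming request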

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant"},
      {"role": "user", "content": "What is quantum computing? Explain it in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Response format:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "qwen2.5-7b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing is a technology that uses the principles of quantum mechanics to perform computation..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 35,
    "completion_tokens": 128,
    "total_tokens": 163
  }
}
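
To make the response shape concrete, here is a small sketch that pulls the commonly used fields out of a parsed response body. It assumes a payload shaped like the example above; `ChatResult` and `parse_chat_response` are illustrative names, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class ChatResult:
    text: str
    finish_reason: str
    total_tokens: int

    @property
    def truncated(self) -> bool:
        # "length" means generation stopped at the max_tokens limit
        return self.finish_reason == "length"


def parse_chat_response(payload: dict) -> ChatResult:
    """Extract the reply text, finish reason, and token usage."""
    choice = payload["choices"][0]
    usage = payload.get("usage", {})
    return ChatResult(
        text=choice["message"]["content"],
        finish_reason=choice.get("finish_reason", ""),
        total_tokens=usage.get("total_tokens", 0),
    )
```

Checking `truncated` is a cheap way to notice that `max_tokens` was too small for the answer.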

Streaming request

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
      {"role": "user", "content": "Tell a short story"}
    ],
    "stream": true
  }'

Streaming response format (Server-Sent Events):

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Once"},"index":0}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":" upon"},"index":0}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":" a time"},"index":0}]}

data: [DONE]
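
The stream format above can be decoded with a few lines of code. The sketch below assumes each event arrives as a complete `data: ...` line, as in the example; `iter_sse_content` is an illustrative helper name. It yields the text fragments in order and stops at the `[DONE]` marker.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a chat completion SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments with `"".join(...)` reconstructs the full reply.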

Listing models

curl http://localhost:1234/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "qwen2.5-7b-instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "lm-studio"
    }
  ]
}

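5.4 Port Configuration

Changing the default port

Default port: 1234

To change it:
1. Find the Port setting in the Local Server view
2. Enter the new port number (for example, 8080)
3. Restart the server for the change to take effect

Notes:
├── Port range: 1024-65535 (ports below 1024, such as 80 and 443, are typically reserved for system services)
├── Avoid ports already used by other services (such as 3306 for MySQL)
├── Make sure the firewall allows the port
└── No special configuration is needed when the server is used only on the local machine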
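
Before entering a new port number, the notes above can be checked programmatically. This is a minimal sketch using only the standard library; `is_valid_port` and `is_port_free` are illustrative names, and the check reflects the state of the port only at the moment it runs.

```python
import socket


def is_valid_port(port: int) -> bool:
    """True if the port lies in the recommended 1024-65535 range."""
    return 1024 <= port <= 65535


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # another process already holds the port
    return True
```

For example, `is_valid_port(8080) and is_port_free(8080)` indicates whether 8080 is a reasonable choice right now.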

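Running multiple instances

To serve several models at the same time:
1. The first instance uses the default port 1234
2. The second instance uses a different port (for example, 1235)
3. Watch memory usage: each loaded model needs its own memory

Example configuration:
├── http://localhost:1234 → Qwen2.5-7B (general chat)
├── http://localhost:1235 → DeepSeek-Coder-V2 (code)
└── http://localhost:1236 → Llama-3.1-8B (English tasks)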
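
On the client side, the example configuration can be captured as a small routing table. The sketch below is one possible arrangement; the task labels (`chat`, `code`, `english`) and the helper `base_url_for` are hypothetical, chosen only for illustration.

```python
# Task label -> base URL of the instance serving that kind of request.
# The labels are hypothetical; the ports follow the example configuration.
ENDPOINTS = {
    "chat": "http://localhost:1234/v1",     # Qwen2.5-7B
    "code": "http://localhost:1235/v1",     # DeepSeek-Coder-V2
    "english": "http://localhost:1236/v1",  # Llama-3.1-8B
}


def base_url_for(task: str, default: str = "chat") -> str:
    """Return the base URL for a task, falling back to the default instance."""
    return ENDPOINTS.get(task, ENDPOINTS[default])
```

The returned URL can be passed as `base_url` when constructing an OpenAI SDK client, so the rest of the application does not need to know which port serves which model.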

5.5 Python Integration

Using the OpenAI SDK

from openai import OpenAI

# Create a client that points at the local LM Studio server
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # the local server does not need a real key
)

# Non-streaming call
def chat(message: str) -> str:
    response = client.chat.completions.create(
        model="qwen2.5-7b-instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": message}
        ],
        temperature=0.7,
        max_tokens=1024
    )
    return response.choices[0].message.content

# Streaming call
def chat_stream(message: str):
    stream = client.chat.completions.create(
        model="qwen2.5-7b-instruct",
        messages=[
            {"role": "user", "content": message}
        ],
        stream=True
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip them
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()  # final newline

# Test
print(chat("Explain machine learning in one sentence"))
chat_stream("Write a short poem about spring")

Using the requests library

import json
from typing import Optional

import requests

def chat(message: str, system_prompt: Optional[str] = None) -> str:
    """Call the local LM Studio API with requests."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": message})

    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        json={
            "model": "qwen2.5-7b-instruct",
            "messages": messages,
            "temperature": 0.7
        },
        timeout=120  # local generation can be slow; avoid hanging forever
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["choices"][0]["message"]["content"]

# Streaming request
def chat_stream(message: str):
    """Print the reply as it is generated."""
    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "qwen2.5-7b-instruct",
            "messages": [{"role": "user", "content": message}],
            "stream": True
        },
        stream=True,
        timeout=120
    )
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            line = line.decode("utf-8")
            # Each event is a "data: <json>" line; "[DONE]" ends the stream
            if line.startswith("data: ") and line != "data: [DONE]":
                data = json.loads(line[6:])
                content = data["choices"][0]["delta"].get("content", "")
                if content:
                    print(content, end="", flush=True)
    print()

Managing multi-turn conversations

from typing import Optional

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)

class ChatSession:
    """Session class that keeps the history of a multi-turn conversation."""

    def __init__(self, model: str = "qwen2.5-7b-instruct",
                 system_prompt: Optional[str] = None):
        self.model = model
        self.messages = []
        if system_prompt:
            self.messages.append({
                "role": "system",
                "content": system_prompt
            })

    def send(self, message: str) -> str:
        """Send a message and return the reply."""
        self.messages.append({"role": "user", "content": message})

        try:
            response = client.chat.completions.create(
                model=self.model,
                messages=self.messages,
                temperature=0.7
            )
        except Exception:
            self.messages.pop()  # drop the unanswered user message
            raise

        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def get_history(self) -> list:
        """Return a copy of the conversation history."""
        return list(self.messages)

    def clear(self):
        """Clear the history but keep the system prompt."""
        self.messages = [m for m in self.messages if m["role"] == "system"]


# Usage example
session = ChatSession(
    system_prompt="You are a Python expert who helps users solve programming problems"
)

# Multi-turn conversation
print(session.send("What is a decorator?"))
print(session.send("Can you give a practical example?"))
print(session.send("How do I implement a timing decorator?"))

# Inspect the history
for msg in session.get_history():
    print(f"[{msg['role']}] {msg['content'][:50]}...")
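
Because the session resends the entire history on every turn, a long conversation can eventually exceed the model's context window. One simple mitigation, sketched here as an assumption rather than something LM Studio prescribes, is to keep the system prompt plus only the most recent turns. `trim_history` is an illustrative helper and expects a history made of complete user/assistant pairs.

```python
def trim_history(messages: list[dict], max_turns: int) -> list[dict]:
    """Keep system messages plus the last `max_turns` user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # A slice of [-0:] would return everything, so handle zero explicitly
    recent = dialogue[-2 * max_turns:] if max_turns > 0 else []
    return system + recent
```

A session could apply this after each completed exchange, for example `self.messages = trim_history(self.messages, max_turns=10)`; the right limit depends on the model's context length and the size of each message.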

5.6 JavaScript Integration

Node.js + OpenAI SDK

const OpenAI = require('openai');

const client = new OpenAI({
  baseURL: 'http://localhost:1234/v1',
  apiKey: 'lm-studio',
});

// Non-streaming call
async function chat(message) {
  const response = await client.chat.completions.create({
    model: 'qwen2.5-7b-instruct',
    messages: [
      { role: 'system', content: 'You are a helpful assistant' },
      { role: 'user', content: message },
    ],
    temperature: 0.7,
  });
  return response.choices[0].message.content;
}

// Streaming call
async function chatStream(message) {
  const stream = await client.chat.completions.create({
    model: 'qwen2.5-7b-instruct',
    messages: [{ role: 'user', content: message }],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    if (content) {
      process.stdout.write(content);
    }
  }
  console.log();
}

// Usage
(async () => {
  const reply = await chat('What is GraphQL?');
  console.log(reply);

  await chatStream('Write a limerick about programming');
})();

Browser Fetch API

// Call the server directly from the browser (requires CORS to be enabled)
async function chat(message) {
  const response = await fetch('http://localhost:1234/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'qwen2.5-7b-instruct',
      messages: [
        { role: 'user', content: message }
      ],
      stream: false,
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}

// Streaming call
async function chatStream(message, onChunk) {
  const response = await fetch('http://localhost:1234/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'qwen2.5-7b-instruct',
      messages: [{ role: 'user', content: message }],
      stream: true,
    }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';  // holds a partial line until the rest of it arrives

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // stream: true keeps multi-byte characters that span two reads intact
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();  // the last element may be an incomplete line

    for (const rawLine of lines) {
      const line = rawLine.trim();
      if (!line.startsWith('data: ')) continue;
      if (line === 'data: [DONE]') return;
      const data = JSON.parse(line.slice(6));
      const content = data.choices[0]?.delta?.content;
      if (content) onChunk(content);
    }
  }
}

// Usage
chatStream('Tell me a joke', (chunk) => {
  document.getElementById('output').textContent += chunk;
});

5.7 Complete Application Example

Command-line chat application

#!/usr/bin/env python3
"""Command-line chat application backed by LM Studio."""

from openai import OpenAI

def main():
    client = OpenAI(
        base_url="http://localhost:1234/v1",
        api_key="lm-studio"
    )

    messages = [
        {"role": "system", "content": "You are a friendly assistant."}
    ]

    print("=== LM Studio command-line chat ===")
    print("Type 'quit' to exit, 'clear' to reset the history\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "quit":
            break
        if user_input.lower() == "clear":
            messages = messages[:1]  # keep the system prompt
            print("[Conversation history cleared]\n")
            continue
        if not user_input:
            continue

        messages.append({"role": "user", "content": user_input})

        print("AI: ", end="", flush=True)
        full_response = ""

        stream = client.chat.completions.create(
            model="qwen2.5-7b-instruct",
            messages=messages,
            stream=True,
            temperature=0.7
        )

        for chunk in stream:
            if not chunk.choices:
                continue  # skip chunks that carry no choices
            content = chunk.choices[0].delta.content or ""
            print(content, end="", flush=True)
            full_response += content

        print("\n")
        # Store the full reply so the next turn has the complete context
        messages.append({"role": "assistant", "content": full_response})

if __name__ == "__main__":
    main()

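5.8 Notes

Note	Description
Local access only	By default the server listens only on localhost and cannot be reached from other machines
API key	The local server accepts any API key; "lm-studio" is a conventional placeholder
Concurrency limits	Throughput is bounded by model size and hardware; too many concurrent requests slow responses down
Model switching	Requests in flight may fail while the model is being switched
Firewall	For LAN access, the port must be allowed through the firewall, and the server must also listen on more than localhost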
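
Since requests can fail while a model is being switched, a client may want to retry transient failures. The sketch below shows one generic approach with exponential backoff; `with_retries` is an illustrative helper, and the exception types to retry on are an assumption that should be adjusted to whatever your HTTP client or SDK actually raises.

```python
import time


def with_retries(call, attempts: int = 3, base_delay: float = 0.5,
                 retry_on: tuple = (ConnectionError, TimeoutError)):
    """Run `call()` and retry on the given exceptions with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

A call site might look like `with_retries(lambda: chat("hello"))`, where `chat` is any function that sends one request to the server.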

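5.9 Chapter Summary

Key point	Summary
Starting the server	Select a model in the Local Server tab and start the server
API compatibility	Compatible with the OpenAI Chat Completions API
Streaming responses	SSE streaming is supported and improves perceived responsiveness
Python integration	Use the OpenAI SDK or the requests library
JavaScript integration	Use the Node.js SDK or the browser Fetch API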
