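Chapter 3: URL and URI

URLs are something we use every day, yet the syntax and encoding rules behind them are often overlooked. This chapter walks through every part of a URL in detail.

3.1 The Relationship Between URI, URL, and URN

These three terms are often used interchangeably, but they have distinct meanings: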

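URI (Uniform Resource Identifier)  - identifies a resource
├── URL (Uniform Resource Locator) - identifies and locates a resource
└── URN (Uniform Resource Name)    - identifies a resource by name only

Term	Definition	Example
URI	The superset: any string that identifies a resource	`https://example.com/page`, `urn:isbn:978-7-111-63666-6`
URL	Identifies a resource by its location	`https://example.com/api/users/1`
URN	Identifies a resource by name, independent of location	`urn:uuid:550e8400-e29b-41d4-a716-446655440000`

📝 Note: In practice, URL is by far the most widely used of the three terms, and when people say "URI" in everyday conversation they usually mean a URL.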

3.2 URL Syntax

According to RFC 3986, the generic syntax of a URL is:

  scheme://[userinfo@]host[:port]/path[?query][#fragment]
  ──┬──     ────┬──── ─┬─   ─┬─   ─┬─   ──┬──   ───┬───
    │           │      │     │     │      │        │
 protocol   user info host  port  path  query   fragment

Component Breakdown

https://alice:p4ss@api.example.com:8443/v2/users?role=admin&page=1#section-2

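Component	Value	Description
scheme	`https`	Protocol scheme
userinfo	`alice:p4ss`	Username and password (deprecated; avoid in new URLs)
host	`api.example.com`	Host (domain name or IP address)
port	`8443`	Port number (may be omitted when it is the scheme's default)
path	`/v2/users`	Resource path
query	`role=admin&page=1`	Query parameters
fragment	`section-2`	Fragment identifier (never sent to the server)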

Parsing a URL in Python

from urllib.parse import urlparse, urlunparse

url = "https://alice:p4ss@api.example.com:8443/v2/users?role=admin&page=1#section-2"
parsed = urlparse(url)

print(f"scheme:   {parsed.scheme}")      # https
print(f"netloc:   {parsed.netloc}")      # alice:p4ss@api.example.com:8443
print(f"hostname: {parsed.hostname}")    # api.example.com
print(f"port:     {parsed.port}")        # 8443
print(f"username: {parsed.username}")    # alice
print(f"password: {parsed.password}")    # p4ss
print(f"path:     {parsed.path}")        # /v2/users
print(f"query:    {parsed.query}")       # role=admin&page=1
print(f"fragment: {parsed.fragment}")    # section-2

# Rebuild a URL from its components
new_url = urlunparse((
    'https',                    # scheme
    'api.example.com',          # netloc
    '/v3/users',                # path
    '',                         # params (path parameters)
    'page=2&limit=20',          # query
    'top'                       # fragment
))
print(new_url)  # https://api.example.com/v3/users?page=2&limit=20#top

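3.3 Common Schemes

Scheme	Name	Default port	Description
`http`	Hypertext Transfer Protocol	80	Plaintext transport
`https`	HTTP Secure	443	Encrypted with TLS
`ftp`	File Transfer Protocol	21	File transfer
`ws`	WebSocket	80	WebSocket connection
`wss`	WebSocket Secure	443	Encrypted WebSocket
`file`	Local file	-	Access to the local file system
`data`	Data	-	Inline data
`mailto`	Mail	-	Email address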
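
The default ports in the table are what a client falls back to when a URL carries no explicit port. A minimal sketch of that lookup follows; the `DEFAULT_PORTS` mapping and the `effective_port` helper are illustrative names introduced here, not part of any library.

```python
from urllib.parse import urlsplit

# Default ports copied from the table above (illustrative, not exhaustive).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "ws": 80, "wss": 443}

def effective_port(url):
    """Return the explicit port if the URL has one, else the scheme default (or None)."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("https://example.com/page"))      # 443
print(effective_port("http://example.com:8080/page"))  # 8080
print(effective_port("mailto:alice@example.com"))      # None
```

Schemes without a network port, such as `mailto`, simply yield `None` in this sketch.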

# data URL examples
data_url = "data:text/html,<h1>Hello</h1>"
data_url_b64 = "data:text/html;base64,PGgxPkhlbGxvPC9oMT4="

# In JavaScript
# window.location = "mailto:alice@example.com?subject=Hello"

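3.4 URL Encoding (Percent-Encoding)

A URL may contain only a limited set of printable ASCII characters. Spaces, non-ASCII characters, and reserved characters used as plain data cannot appear as-is; they must be percent-encoded.

Encoding Rules

Reserved characters have special meaning and must be encoded when used as ordinary data
Non-ASCII characters (Chinese, Japanese, and so on) must be encoded
Encoding format: the character is converted to bytes (UTF-8 in modern practice), and each byte is written as % followed by its two-digit hexadecimal value, so a single non-ASCII character usually becomes several %XX groups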
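
The rules above can be sketched by hand to show what a library encoder does internally. This is a simplified illustration that assumes UTF-8 and encodes everything outside the RFC 3986 unreserved set; `percent_encode` is a hypothetical helper, and real code should call `urllib.parse.quote` instead.

```python
from urllib.parse import quote

# Unreserved characters per RFC 3986: letters, digits, and - . _ ~
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def percent_encode(text):
    """Encode every byte outside the unreserved set as %XX (UTF-8 assumed)."""
    out = []
    for byte in text.encode("utf-8"):
        if byte in UNRESERVED:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)

print(percent_encode("协议"))   # %E5%8D%8F%E8%AE%AE
print(percent_encode("a b/c"))  # a%20b%2Fc

# The hand-written version agrees with the standard library when nothing is marked safe.
print(percent_encode("a b/c") == quote("a b/c", safe=""))  # True
```

Each of the two Chinese characters occupies three bytes in UTF-8, which is why it expands to three %XX groups.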

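Reserved Characters

Character	Meaning	Encoding
`:`	Separates scheme and port	`%3A`
`/`	Path separator	`%2F`
`?`	Starts the query	`%3F`
`#`	Starts the fragment	`%23`
`[` `]`	Enclose an IPv6 address literal	`%5B` `%5D`
`@`	Userinfo separator	`%40`
`!` `*` `'` `(` `)`	Sub-delimiters	`%21` `%2A` `%27` `%28` `%29`
`+`	Space (in form-encoded queries)	`%2B`
`=`	Key-value separator	`%3D`
`&`	Parameter separator	`%26`
space	Not a reserved character, but never allowed unencoded	`%20`, or `+` in form-encoded queries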
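
The two encodings of a space in the last row are a frequent source of bugs, because `+` means a space only under form encoding. The short sketch below contrasts the two conventions using the standard library; the sample value `a+b c` is made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

value = "a+b c"

# Generic percent-encoding: space becomes %20, a literal + becomes %2B
print(quote(value, safe=""))    # a%2Bb%20c

# Form encoding (application/x-www-form-urlencoded): space becomes +
print(quote_plus(value))        # a%2Bb+c

# Decoding must match the convention used for encoding
print(unquote("a+b%20c"))       # a+b c
print(unquote_plus("a+b%20c"))  # a b c
```

Decoding a form-encoded query with `unquote` leaves stray `+` characters in the data, so it helps to agree on one convention on both ends.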

URL Encoding in Python

from urllib.parse import quote, unquote, urlencode

# Encode a path, keeping / as the separator
path = "/文档/报告 2024.pdf"
encoded_path = quote(path, safe='/')
print(encoded_path)  # /%E6%96%87%E6%A1%A3/%E6%8A%A5%E5%91%8A%202024.pdf

# Decode
print(unquote(encoded_path))  # /文档/报告 2024.pdf

# Encode query parameters
params = {
    "q": "HTTP 协议",
    "lang": "zh",
    "page": "1"
}
query_string = urlencode(params, quote_via=quote)
print(query_string)
# q=HTTP%20%E5%8D%8F%E8%AE%AE&lang=zh&page=1

// URL encoding in JavaScript
const keyword = "HTTP 协议 & 安全";

// encodeURI: meant for a complete URL; leaves reserved characters such as : / ? # & = untouched
console.log(encodeURI(keyword));
// HTTP%20%E5%8D%8F%E8%AE%AE%20&%20%E5%AE%89%E5%85%A8

// encodeURIComponent: meant for a single component; also encodes reserved characters such as &
console.log(encodeURIComponent(keyword));
// HTTP%20%E5%8D%8F%E8%AE%AE%20%26%20%E5%AE%89%E5%85%A8

// Build a URL with query parameters
const base = "https://api.example.com/search";
const params = new URLSearchParams({ q: keyword, page: 1 });
console.log(`${base}?${params}`);
// https://api.example.com/search?q=HTTP+%E5%8D%8F%E8%AE%AE+%26+%E5%AE%89%E5%85%A8&page=1

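3.5 Query Parameters

The query string starts with ? and consists of key-value pairs separated by &.

Syntax

?key1=value1&key2=value2&key3=value3

Encoding Schemes for Complex Parameters

Scenario	Scheme	Example
Array	Repeated keys	`ids=1&ids=2&ids=3`
Array	Brackets	`ids[]=1&ids[]=2&ids[]=3`
Object	Brackets	`filter[name]=alice&filter[age]=25`
Nested	Chained brackets	`filter[address][city]=beijing`
Empty value	Value omitted	`debug` or `debug=`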
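
For the repeated-key scheme in the first row, Python's `urlencode` can produce the query string directly when `doseq=True` is passed. The snippet below is a small illustration with made-up parameter values.

```python
from urllib.parse import urlencode

params = {"ids": [1, 2, 3], "name": "alice"}

# doseq=True expands each sequence into repeated keys
print(urlencode(params, doseq=True))  # ids=1&ids=2&ids=3&name=alice

# Without doseq, the list is stringified and encoded as a single value
print(urlencode(params))              # ids=%5B1%2C+2%2C+3%5D&name=alice
```

The second output shows why forgetting `doseq` tends to produce a value the server cannot interpret as an array.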

from urllib.parse import parse_qs, urlencode

# Parse repeated-key parameters
query = "ids=1&ids=2&ids=3&name=alice"
params = parse_qs(query, keep_blank_values=True)
print(params)
# {'ids': ['1', '2', '3'], 'name': ['alice']}

# Bracket style
query2 = "ids[]=1&ids[]=2&filter[name]=alice&filter[age]=25"
# The standard library does not interpret this style; parse it by hand or rely on a framework
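
As a rough idea of what manual parsing of the bracket style could look like, here is a minimal sketch. `parse_bracket_query` is a hypothetical helper written for this chapter; it only handles the patterns from the table (a trailing `[]` for arrays and named brackets for objects) and is not a substitute for a framework's parser.

```python
import re
from urllib.parse import parse_qsl

def parse_bracket_query(query):
    """Turn 'ids[]=1&filter[name]=alice' into nested lists and dicts."""
    result = {}
    for key, value in parse_qsl(query, keep_blank_values=True):
        head = key.partition("[")[0]
        # "filter[address][city]" -> ["filter", "address", "city"]; "ids[]" -> ["ids", ""]
        segments = [head] + re.findall(r"\[([^\]]*)\]", key)
        node = result
        for i, seg in enumerate(segments[:-1]):
            # An empty next segment ("[]") means this level holds a list
            default = [] if segments[i + 1] == "" else {}
            node = node.setdefault(seg, default)
        last = segments[-1]
        if last == "":
            node.append(value)  # trailing [] appends to the list
        else:
            node[last] = value
    return result

print(parse_bracket_query("ids[]=1&ids[]=2&filter[name]=alice&filter[age]=25"))
# {'ids': ['1', '2'], 'filter': {'name': 'alice', 'age': '25'}}
```

Inputs that mix styles, such as `ids[][x]=1`, are outside what this sketch supports.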

// JavaScript URLSearchParams
const params = new URLSearchParams();
params.append('ids', '1');
params.append('ids', '2');
params.append('ids', '3');
params.set('page', '1');
params.sort();  // sorts the pairs by key

console.log(params.toString()); // ids=1&ids=2&ids=3&page=1
console.log(params.getAll('ids')); // ['1', '2', '3']
console.log(params.get('page'));  // '1'

// Parse parameters from a URL (use window.location.href for the current page)
const url = new URL('https://example.com/search?q=hello&page=1');
console.log(url.searchParams.get('q'));    // 'hello'
console.log(url.searchParams.get('page')); // '1'

3.6 Fragment Identifier

The fragment starts with # and is never sent to the server; it is used only on the client side (the browser).

https://example.com/page#section-2
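
Since the fragment stays on the client, an HTTP client has to separate it from the rest of the URL before building a request. One way to illustrate that split in Python is `urllib.parse.urldefrag`, shown here on the example URL above.

```python
from urllib.parse import urldefrag

# Split the example URL into the part that is requested and the client-only fragment
url, fragment = urldefrag("https://example.com/page#section-2")

print(url)       # https://example.com/page
print(fragment)  # section-2
```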

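// Working with the fragment in the browser
console.log(window.location.hash); // #section-2

window.location.hash = '#section-3';  // changes the fragment without sending a request

// Listen for fragment changes; oldURL and newURL hold the full URLs, not just the fragments
window.addEventListener('hashchange', (event) => {
    console.log('Old URL:', event.oldURL);
    console.log('New URL:', event.newURL);
});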

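Common Uses

Use	Description
In-page anchor	Jump to a specific position on the page
SPA routing	Client-side routing in single-page applications (hash mode)
Media fragment	A specific time range of a video, `#t=10,20`
Text fragment	Highlight specific text on the page, `#:~:text=hello`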

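3.7 Relative and Absolute URLs

HTML can use relative URLs, which the browser resolves against the URL of the current page.

<!-- Absolute URL -->
<a href="https://example.com/page">Absolute link</a>

<!-- Relative URLs -->
<a href="page.html">page.html in the same directory</a>
<a href="./page.html">page.html in the same directory (explicit)</a>
<a href="../other/page.html">other/page.html under the parent directory</a>
<a href="//cdn.example.com/file.js">Protocol-relative URL</a>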

from urllib.parse import urljoin

base = "https://example.com/dir/page.html"

print(urljoin(base, "other.html"))
# https://example.com/dir/other.html

print(urljoin(base, "../other.html"))
# https://example.com/other.html

print(urljoin(base, "/root.html"))
# https://example.com/root.html

print(urljoin(base, "https://other.com/page"))
# https://other.com/page

print(urljoin(base, "//cdn.example.com/file.js"))
# https://cdn.example.com/file.js

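3.8 Use Case: Designing Search Engine URLs

A typical URL design for a search API (non-ASCII values are shown unencoded for readability; on the wire they are percent-encoded):

Base:       /api/v1/search
Text:       /api/v1/search?q=HTTP+协议
Paging:     /api/v1/search?q=HTTP+协议&page=2&limit=20
Sorting:    /api/v1/search?q=HTTP+协议&sort=date&order=desc
Filtering:  /api/v1/search?q=HTTP+协议&tag=tutorial&lang=zh

// Helper class for building search URLs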
class SearchURLBuilder {
    constructor(base) {
        this.url = new URL(base);
    }

    query(q) { this.url.searchParams.set('q', q); return this; }
    page(p) { this.url.searchParams.set('page', p); return this; }
    limit(l) { this.url.searchParams.set('limit', l); return this; }
    sort(field, order = 'asc') {
        this.url.searchParams.set('sort', field);
        this.url.searchParams.set('order', order);
        return this;
    }
    filter(key, value) {
        this.url.searchParams.set(key, value);
        return this;
    }
    build() { return this.url.toString(); }
}

const url = new SearchURLBuilder('https://api.example.com/search')
    .query('HTTP 协议')
    .page(2)
    .limit(20)
    .sort('date', 'desc')
    .filter('lang', 'zh')
    .build();

console.log(url);
// https://api.example.com/search?q=HTTP+%E5%8D%8F%E8%AE%AE&page=2&limit=20&sort=date&order=desc&lang=zh

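⚠️ Caveats

Do not put sensitive information in URLs: URLs are recorded in logs, browser history, and Referer headers
The fragment is not sent to the server: do not rely on it for server-side logic
Keep encoding consistent: the client and the server must use the same encoding scheme
URL length limits: the specification sets no limit, but browsers and servers do in practice, commonly somewhere around 2,000 to 8,000 characters
Avoid userinfo: the user:password@host format is deprecated and poses security risks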
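
To make the first and last caveats concrete, the sketch below scrubs a URL before it is written to a log: it drops the userinfo and the fragment and masks selected query values. The `SENSITIVE_KEYS` set and the `redact_url` helper are assumptions made for illustration; which parameters count as sensitive depends on the application.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameter names to mask; adjust to the application.
SENSITIVE_KEYS = {"token", "password", "api_key"}

def redact_url(url):
    """Return a copy of the URL that is safer to log."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literals need their brackets back
        host = f"[{host}]"
    # Rebuild netloc from host and port only, which drops user:password@
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    query = urlencode([
        (key, "REDACTED" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ])
    # The fragment is dropped as well, since it never reaches the server
    return urlunsplit((parts.scheme, netloc, parts.path, query, ""))

print(redact_url(
    "https://alice:p4ss@api.example.com:8443/v2/users?role=admin&token=abc123#top"
))
# https://api.example.com:8443/v2/users?role=admin&token=REDACTED
```

Masking at log time is only a safety net; the more robust choice is to send secrets in headers or the request body so they never appear in the URL.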

🔗 Further Reading

Next chapter: Chapter 4: Request Methods in Detail, covering the semantics and selection of GET/POST/PUT/DELETE/PATCH/OPTIONS/HEAD