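Commonly Used Third-Party Python Libraries

Python has a rich ecosystem of third-party libraries. This chapter introduces pip and several of the most commonly used libraries so that you can get started with real-world development quickly.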

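The pip Package Manager

Installing Packages

# Install a package
pip install requests

# Install a specific version
pip install requests==2.28.0

# Install a minimum version (quote the specifier so the shell does not treat > as a redirect)
pip install "requests>=2.28.0"

# Install from requirements.txt
pip install -r requirements.txt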

Managing Packages

# List installed packages
pip list

# Show details of a package
pip show requests

# Upgrade a package
pip install --upgrade requests

# Uninstall a package
pip uninstall requests

# Export dependencies
pip freeze > requirements.txt

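Using Virtual Environments

# Create a virtual environment
python -m venv myenv

# Activate the virtual environment
# Windows
myenv\Scripts\activate
# Linux/macOS
source myenv/bin/activate

# Leave the virtual environment
deactivate

Explanation: A virtual environment isolates the dependencies of one project from those of another, which avoids version conflicts. Use a separate virtual environment for each project.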
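
To confirm from inside Python that a virtual environment is active, one common check compares sys.prefix with sys.base_prefix. The sketch below assumes an environment created with venv; the helper name in_virtualenv is illustrative and not part of any library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Virtual environment active: {in_virtualenv()}")
```

If this prints False right after activation, the shell is probably still resolving python to the system interpreter.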

requests - HTTP Request Library

requests is one of the most popular HTTP client libraries for Python. It makes sending HTTP requests simple and readable.

Installation

pip install requests

Basic Usage

GET Requests

import requests

# Basic GET request
response = requests.get('https://api.github.com/events')

# Inspect the response
print(response.text)           # body as text
print(response.json())         # body parsed as JSON
print(response.status_code)    # status code
print(response.headers)        # response headers

# GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://httpbin.org/get', params=params)
print(response.url)  # https://httpbin.org/get?key1=value1&key2=value2

Explanation: requests.get() sends a GET request and returns a Response object. The params argument converts a dict into the URL query string automatically.
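
The conversion that params performs can be approximated with the standard library. This sketch uses urllib.parse.urlencode to build the same query string. It only illustrates the idea and is not a description of how requests works internally; the helper name build_url is hypothetical.

```python
from urllib.parse import urlencode


def build_url(base_url: str, params: dict) -> str:
    """Append a dict of parameters to base_url as a query string."""
    # urlencode percent-encodes keys and values and joins the pairs with '&'.
    return f"{base_url}?{urlencode(params)}"


url = build_url('https://httpbin.org/get', {'key1': 'value1', 'key2': 'value2'})
print(url)  # https://httpbin.org/get?key1=value1&key2=value2
```

Letting the library encode the parameters, rather than concatenating strings by hand, keeps spaces and special characters correctly escaped.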

POST Requests

import json

import requests

# Form data
data = {'username': 'admin', 'password': '123456'}
response = requests.post('https://httpbin.org/post', data=data)

# JSON data
json_data = {'name': '张三', 'age': 25}
response = requests.post(
    'https://httpbin.org/post',
    json=json_data  # sets Content-Type: application/json automatically
)

# Or send the JSON manually
response = requests.post(
    'https://httpbin.org/post',
    data=json.dumps(json_data),
    headers={'Content-Type': 'application/json'}
)

Explanation: The data argument sends form-encoded data, while the json argument serializes the object and sends it as JSON. Using json is the more concise option.
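
The difference between data and json can be inspected without sending anything over the network. The sketch below builds two prepared requests with requests.Request(...).prepare() and prints the headers and body that would be sent; the payload values are made up for illustration.

```python
import json

import requests

url = 'https://httpbin.org/post'

# Prepare (but do not send) a form-encoded request.
form_req = requests.Request(
    'POST', url, data={'username': 'admin', 'password': '123456'}
).prepare()

# Prepare (but do not send) a JSON request.
json_req = requests.Request(
    'POST', url, json={'name': 'Alice', 'age': 25}
).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # username=admin&password=123456
print(json_req.headers['Content-Type'])  # application/json
print(json.loads(json_req.body))         # {'name': 'Alice', 'age': 25}
```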

Custom Request Headers

import requests

headers = {
    'User-Agent': 'MyApp/1.0',
    'Authorization': 'Bearer token123',
    'Accept': 'application/json'
}

response = requests.get('https://api.github.com/user', headers=headers)

Handling Responses

import requests

response = requests.get('https://api.github.com/events')

# Status code
print(f"Status code: {response.status_code}")

# Check whether the request succeeded
if response.ok:  # True when the status code is below 400
    print("Request succeeded")

# Branch on specific status codes
if response.status_code == 200:
    print("OK")
elif response.status_code == 404:
    print("Not Found")

# Raise HTTPError for 4xx and 5xx status codes
response.raise_for_status()

# Response body
print(response.text)           # decoded text
print(response.content)        # raw bytes
print(response.json())         # parsed JSON
print(response.encoding)       # encoding used to decode .text

# Response headers
print(response.headers['Content-Type'])

Uploading and Downloading Files

import requests

# Upload a file (the with block closes the file afterwards)
with open('report.pdf', 'rb') as f:
    response = requests.post('https://httpbin.org/post', files={'file': f})

# Upload multiple files
with open('file1.txt', 'rb') as f1, open('file2.txt', 'rb') as f2:
    files = [('file1', f1), ('file2', f2)]
    response = requests.post('https://httpbin.org/post', files=files)

# Download a file
response = requests.get('https://example.com/image.jpg')
response.raise_for_status()
with open('image.jpg', 'wb') as f:
    f.write(response.content)

# Stream a large file to disk
with requests.get('https://example.com/large_file.zip', stream=True) as response:
    response.raise_for_status()
    with open('large_file.zip', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

Explanation: stream=True defers downloading the body until it is read, which suits large files. iter_content() reads the body in chunks, so the whole file never has to fit in memory.

Sessions

import requests

# Create a session (keeps cookies and connections across requests)
session = requests.Session()

# Set session-level request headers
session.headers.update({'User-Agent': 'MyApp/1.0'})

# Log in
login_data = {'username': 'admin', 'password': '123456'}
session.post('https://example.com/login', data=login_data)

# Later requests automatically carry the cookies set at login
response = session.get('https://example.com/dashboard')

# Close the session
session.close()

# Or use a context manager
with requests.Session() as session:
    session.post('https://example.com/login', data=login_data)
    response = session.get('https://example.com/dashboard')

Explanation: A Session object keeps cookies and a connection pool across requests, which improves performance when talking to the same host repeatedly. It is a good fit for scenarios that require login.
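
Session-level settings are merged into every request the session sends. The sketch below uses Session.prepare_request() to show the merged headers without making a network call; the URL and header values are placeholders.

```python
import requests

with requests.Session() as session:
    session.headers.update({'User-Agent': 'MyApp/1.0'})

    # Build the request the session would send, without sending it.
    request = requests.Request(
        'GET',
        'https://example.com/dashboard',
        headers={'Accept': 'application/json'},
    )
    prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])  # MyApp/1.0 (from the session)
print(prepared.headers['Accept'])      # application/json (from the request)
```

Headers passed on an individual request take precedence over the session defaults with the same name.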

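Timeouts and Exception Handling

import requests
from requests.exceptions import Timeout, ConnectionError, HTTPError

try:
    # Set timeouts (3 seconds to connect, 5 seconds to read)
    response = requests.get(
        'https://api.github.com/events',
        timeout=(3, 5)
    )
    response.raise_for_status()  # raise on HTTP error status codes

except Timeout:
    print("Request timed out")
except ConnectionError:
    print("Connection failed")
except HTTPError as e:
    print(f"HTTP error: {e}")
except Exception as e:
    print(f"Other error: {e}")

Explanation: The timeout argument matters because requests applies no timeout by default, so a stalled server can block the call indefinitely. Always set a timeout in production code.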

Practical Example: Calling a REST API

import requests

class APIClient:
    """Minimal REST API client."""
    
    def __init__(self, base_url, token=None):
        self.base_url = base_url.rstrip('/')
        self.session = requests.Session()
        if token:
            self.session.headers.update({
                'Authorization': f'Bearer {token}'
            })
    
    def get(self, endpoint, **kwargs):
        response = self.session.get(f"{self.base_url}{endpoint}", **kwargs)
        return self._handle_response(response)
    
    def post(self, endpoint, data=None, json=None, **kwargs):
        response = self.session.post(
            f"{self.base_url}{endpoint}",
            data=data,
            json=json,
            **kwargs
        )
        return self._handle_response(response)
    
    def _handle_response(self, response):
        response.raise_for_status()
        return response.json()

# Usage
client = APIClient('https://api.example.com', token='your_token')
users = client.get('/users')
new_user = client.post('/users', json={'name': '张三', 'email': 'user@example.com'})  # placeholder address

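numpy - Numerical Computing Library

NumPy is the foundation of scientific computing in Python. It provides an efficient multidimensional array object and a large set of mathematical functions.

Installation

pip install numpy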

Creating Arrays

import numpy as np

# Create from a list
a = np.array([1, 2, 3, 4, 5])
print(a)                    # [1 2 3 4 5]
print(type(a))              # <class 'numpy.ndarray'>

# Two-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b)
# [[1 2 3]
#  [4 5 6]]

# Special arrays
zeros = np.zeros((3, 4))        # 3x4 array of zeros
ones = np.ones((2, 3, 4))       # 2x3x4 array of ones
empty = np.empty((2, 3))        # uninitialized array (arbitrary contents)
eye = np.eye(3)                 # 3x3 identity matrix

# Sequences
range_arr = np.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)     # [0., 0.25, 0.5, 0.75, 1.]

# Random arrays
random_arr = np.random.rand(3, 4)   # uniform on [0, 1)
normal_arr = np.random.randn(3, 4)  # standard normal distribution
int_arr = np.random.randint(0, 10, (3, 4))  # random integers in [0, 10)

Explanation: A NumPy array (ndarray) is a homogeneous multidimensional container: every element has the same type. The shape attribute gives the size of each dimension and dtype gives the element type.
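
A short sketch of what homogeneous means in practice: when a list mixes integers and floats, NumPy picks one dtype that can hold every element. The variable names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])            # the integers are promoted to float
grid = np.array([[1, 2, 3], [4, 5, 6]])

print(ints.shape)    # (3,)
print(mixed.dtype)   # float64
print(grid.shape)    # (2, 3)
```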

Array Attributes

import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(f"Dimensions: {a.ndim}")      # 2
print(f"Shape: {a.shape}")          # (3, 4)
print(f"Total elements: {a.size}")  # 12
print(f"Data type: {a.dtype}")      # int64 on most platforms (int32 on Windows with NumPy 1.x)
print(f"Item size: {a.itemsize}")   # 8 bytes for int64

Indexing and Slicing

import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Basic indexing
print(a[0, 1])          # 2 (first row, second column)
print(a[1])             # [5 6 7 8] (second row)
print(a[-1])            # [ 9 10 11 12] (last row)

# Slicing
print(a[0:2, 1:3])      # first two rows, second and third columns
# [[2 3]
#  [6 7]]

print(a[:, 1])          # second column of every row
# [ 2  6 10]

print(a[1, :])          # every column of the second row
# [5 6 7 8]

# Boolean indexing
print(a[a > 5])         # [ 6  7  8  9 10 11 12]

# Fancy indexing
print(a[[0, 2]])        # first and third rows

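Array Operations

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Arithmetic (element-wise)
print(a + b)        # [ 6  8 10 12]
print(a - b)        # [-4 -4 -4 -4]
print(a * b)        # [ 5 12 21 32]
print(a / b)        # approximately [0.2 0.333 0.429 0.5]
print(a ** 2)       # [ 1  4  9 16]

# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)        # matrix multiplication
# [[19 22]
#  [43 50]]

print(A.dot(B))     # another way to write matrix multiplication

print(A.T)          # transpose
# [[1 3]
#  [2 4]]

# Broadcasting
a = np.array([1, 2, 3])
b = 2
print(a * b)        # [2 4 6] (the scalar is broadcast across the array)

Explanation: Broadcasting lets NumPy operate on arrays of different but compatible shapes. The smaller array is virtually stretched to match the shape of the larger one, without copying data.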
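
Broadcasting also applies between arrays, not only between an array and a scalar. The sketch below adds a row vector and a column vector to a 2x3 matrix; the variable names are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# The row is reused for every row of the matrix.
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is reused for every column of the matrix.
print(matrix + column)
# [[101 102 103]
#  [204 205 206]]
```

Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or one of them is 1.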

Common Functions

import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Aggregations
print(np.sum(a))        # 15 - sum
print(np.mean(a))       # 3.0 - mean
print(np.max(a))        # 5 - maximum
print(np.min(a))        # 1 - minimum
print(np.std(a))        # standard deviation
print(np.var(a))        # variance

# Axis operations on a 2-D array
b = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sum(b, axis=0))    # [5 7 9] - column sums
print(np.sum(b, axis=1))    # [ 6 15] - row sums

# Mathematical functions
print(np.sqrt(a))       # square root
print(np.exp(a))        # exponential
print(np.log(a))        # natural logarithm
print(np.sin(a))        # sine
print(np.abs(a))        # absolute value

# Sorting and searching
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(np.sort(arr))             # [1 1 2 3 4 5 6 9]
print(np.argsort(arr))          # indices that would sort the array
print(np.where(arr > 3))        # indices where the condition holds

# Reshaping
c = np.arange(12)
print(c.reshape(3, 4))          # change the shape
print(c.reshape(3, -1))         # -1 lets NumPy infer that dimension
print(c.ravel())                # flatten to one dimension

Practical Example: Data Analysis

import numpy as np

# Generate mock data: scores of 100 students in 5 courses
np.random.seed(42)
scores = np.random.randint(60, 100, (100, 5))

# Average score of each student
student_avg = np.mean(scores, axis=1)
print(f"Highest average: {np.max(student_avg):.2f}")
print(f"Lowest average: {np.min(student_avg):.2f}")
print(f"Median average: {np.median(student_avg):.2f}")

# Statistics for each course
course_avg = np.mean(scores, axis=0)
print(f"\nAverage per course: {course_avg}")

# Find the top students (average above 85)
excellent = np.where(student_avg > 85)[0]
print(f"\nNumber of top students: {len(excellent)}")

# Standardize the scores (z-score) per course
scores_normalized = (scores - np.mean(scores, axis=0)) / np.std(scores, axis=0)
print(f"\nFirst 3 students after standardization:\n{scores_normalized[:3]}")

pandas - Data Analysis Library

Pandas is a data analysis library built on top of NumPy. It provides two core data structures, DataFrame and Series.

Installation

pip install pandas

Series - One-Dimensional Data

import pandas as pd

# Create Series
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3})

print(s1)
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5

# Access elements
print(s2['a'])          # 1 (by label)
print(s2.iloc[0])       # 1 (by position; s2[0] is deprecated on a label index)
print(s2[['a', 'c']])   # a    1, c    3

# Attributes
print(s2.index)         # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
print(s2.values)        # [1 2 3 4 5]

DataFrame - Two-Dimensional Tables

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'name': ['张三', '李四', '王五'],
    'age': [25, 30, 35],
    'city': ['北京', '上海', '广州']
})

print(df)
#   name  age city
# 0   张三   25   北京
# 1   李四   30   上海
# 2   王五   35   广州

# Create from a list of rows
df2 = pd.DataFrame(
    [['张三', 25, '北京'], ['李四', 30, '上海']],
    columns=['name', 'age', 'city']
)

# Attributes
print(df.shape)         # (3, 3)
print(df.columns)       # Index(['name', 'age', 'city'], dtype='object')
print(df.index)         # RangeIndex(start=0, stop=3, step=1)
print(df.dtypes)        # data type of each column

Selecting Data

import pandas as pd

df = pd.DataFrame({
    'name': ['张三', '李四', '王五', '赵六'],
    'age': [25, 30, 35, 28],
    'city': ['北京', '上海', '广州', '深圳'],
    'salary': [10000, 15000, 20000, 18000]
})

# Select columns
print(df['name'])               # single column (Series)
print(df[['name', 'age']])      # multiple columns (DataFrame)

# Select by position
print(df.iloc[0])               # first row
print(df.iloc[0:2])             # first two rows
print(df.iloc[0:2, 0:2])        # first two rows, first two columns

# Select by label
print(df.loc[0])                # row with index label 0
print(df.loc[0:1, 'name'])      # name column for labels 0 through 1 (inclusive)
print(df.loc[:, ['name', 'salary']])  # all rows, selected columns

# Filter by condition
print(df[df['age'] > 28])               # age greater than 28
print(df[(df['age'] > 25) & (df['salary'] > 15000)])  # multiple conditions

# Use query
print(df.query('age > 28 and salary > 15000'))
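
The label-based and position-based selectors shown above treat slice endpoints differently, which is a frequent source of off-by-one errors. The sketch below, using a small made-up DataFrame, shows that loc includes the stop label while iloc excludes the stop position.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'age': [25, 30, 35]})

by_label = df.loc[0:1]      # label slice: includes both 0 and 1
by_position = df.iloc[0:1]  # position slice: excludes the stop position

print(len(by_label))     # 2
print(len(by_position))  # 1
```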

Processing Data

import pandas as pd

df = pd.DataFrame({
    'name': ['张三', '李四', '王五', '赵六'],
    'age': [25, 30, 35, 28],
    'city': ['北京', '上海', '广州', '深圳'],
    'salary': [10000, 15000, 20000, 18000]
})

# Add columns
df['bonus'] = df['salary'] * 0.1
df['total'] = df['salary'] + df['bonus']

# Modify a column
df['city'] = df['city'].str.replace('京', '京市')

# Drop a column (both forms are equivalent)
df_dropped = df.drop(columns=['bonus'])
df_dropped = df.drop('bonus', axis=1)

# Rename columns
df_renamed = df.rename(columns={'name': '姓名', 'age': '年龄'})

# Sort
df_sorted = df.sort_values('salary', ascending=False)
df_sorted = df.sort_values(['city', 'salary'], ascending=[True, False])

# Remove duplicates
df_unique = df.drop_duplicates(subset=['city'])

# Handle missing values
df_with_nan = df.copy()
df_with_nan['salary'] = df_with_nan['salary'].astype(float)  # NaN needs a float column
df_with_nan.loc[1, 'salary'] = None  # stored as NaN

print(df_with_nan.isnull())             # detect missing values
print(df_with_nan.dropna())             # drop rows with missing values
print(df_with_nan.fillna(0))            # fill missing values
print(df_with_nan['salary'].fillna(df_with_nan['salary'].mean()))  # fill with the mean

Statistics and Aggregation

import pandas as pd

df = pd.DataFrame({
    'name': ['张三', '李四', '王五', '赵六', '孙七'],
    'department': ['技术', '销售', '技术', '销售', '技术'],
    'salary': [10000, 15000, 20000, 18000, 16000]
})

# Basic statistics
print(df.describe())                    # summary of numeric columns
print(df['salary'].mean())              # mean
print(df['salary'].median())            # median
print(df['salary'].std())               # standard deviation
print(df['salary'].max())               # maximum

# Group statistics
print(df.groupby('department')['salary'].mean())
# department (values rounded)
# 技术    15333.33
# 销售    16500.00

# Several aggregations at once
print(df.groupby('department')['salary'].agg(['mean', 'max', 'min', 'count']))

# Different aggregations per column
print(df.groupby('department').agg({
    'salary': ['mean', 'max'],
    'name': 'count'
}))

# Cross-tabulation
print(pd.crosstab(df['department'], df['salary'] > 15000))

# Pivot table
print(df.pivot_table(values='salary', index='department', aggfunc='mean'))

Reading and Writing Data

import pandas as pd

# Read CSV
df = pd.read_csv('data.csv', encoding='utf-8')
df = pd.read_csv('data.csv', header=0, index_col=0, usecols=['name', 'age'])

# Write CSV
df.to_csv('output.csv', index=False, encoding='utf-8')

# Read Excel (.xlsx files require the openpyxl package)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Write Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)

# Read JSON
df = pd.read_json('data.json')

# Write JSON
df.to_json('output.json', orient='records', force_ascii=False)

# SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)
df.to_sql('users', conn, if_exists='replace', index=False)
conn.close()  # release the database connection

Practical Example: Data Cleaning and Analysis

import pandas as pd
import numpy as np

# Create mock data
np.random.seed(42)
df = pd.DataFrame({
    'user_id': range(1, 101),
    'name': [f'用户{i}' for i in range(1, 101)],
    'age': np.random.randint(18, 60, 100),
    'city': np.random.choice(['北京', '上海', '广州', '深圳'], 100),
    'amount': np.random.randint(100, 10000, 100),
    'join_date': pd.date_range('2023-01-01', periods=100, freq='D')
})

# Introduce some missing values (NaN needs a float column)
df['amount'] = df['amount'].astype(float)
df.loc[10:15, 'amount'] = np.nan

# Overview
print("=== Data overview ===")
df.info()  # prints directly and returns None
print(df.describe())

# Missing values
print("\n=== Missing values per column ===")
print(df.isnull().sum())

# Fill missing values with the median
df['amount'] = df['amount'].fillna(df['amount'].median())

# Group analysis
print("\n=== Spending by city ===")
city_analysis = df.groupby('city').agg({
    'amount': ['mean', 'sum', 'count'],
    'age': 'mean'
}).round(2)
print(city_analysis)

# Age-group analysis
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 45, 100],
                         labels=['青年', '青年中年', '中年', '中老年'])
print("\n=== Spending by age group ===")
print(df.groupby('age_group', observed=True)['amount'].mean().round(2))

# Time-series analysis
df['month'] = df['join_date'].dt.month
print("\n=== New users per month ===")
print(df.groupby('month').size())

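matplotlib - Data Visualization Library

Matplotlib is the foundational plotting library for Python. It can create static, animated, and interactive charts.

Installation

pip install matplotlib

Basic Plotting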

import matplotlib.pyplot as plt
import numpy as np

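# Use a font that can render the Chinese labels
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei ships with Windows; choose another CJK font elsewhere
plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign rendering correctly with that font

# Line chart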
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)', color='blue', linestyle='-', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', color='red', linestyle='--', linewidth=2)

plt.title('正弦和余弦函数')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

Common Chart Types

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
axes[0, 0].bar(categories, values, color=['#ff9999', '#66b3ff', '#99ff99', '#ffcc99'])
axes[0, 0].set_title('柱状图')
axes[0, 0].set_xlabel('类别')
axes[0, 0].set_ylabel('数值')

# 2. Scatter plot
x = np.random.rand(50)
y = np.random.rand(50)
colors = np.random.rand(50)
axes[0, 1].scatter(x, y, c=colors, alpha=0.6, cmap='viridis')
axes[0, 1].set_title('散点图')

# 3. Histogram
data = np.random.randn(1000)
axes[1, 0].hist(data, bins=30, color='steelblue', edgecolor='white')
axes[1, 0].set_title('直方图')
axes[1, 0].set_xlabel('值')
axes[1, 0].set_ylabel('频数')

# 4. Pie chart
sizes = [30, 25, 20, 15, 10]
labels = ['技术部', '销售部', '市场部', '人事部', '财务部']
explode = (0.05, 0, 0, 0, 0)
axes[1, 1].pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
               colors=['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#ff99cc'])
axes[1, 1].set_title('部门人员分布')

plt.tight_layout()
plt.show()

Plotting Pandas Data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Create sample data
df = pd.DataFrame({
    'month': ['1月', '2月', '3月', '4月', '5月', '6月'],
    'sales': [120, 135, 148, 162, 158, 170],
    'profit': [30, 35, 42, 48, 45, 52]
})

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Line chart of sales
axes[0].plot(df['month'], df['sales'], marker='o', linewidth=2, markersize=8)
axes[0].set_title('月度销售额')
axes[0].set_xlabel('月份')
axes[0].set_ylabel('销售额（万元）')
axes[0].grid(True, alpha=0.3)

# Grouped bar chart of sales and profit
x = np.arange(len(df['month']))
width = 0.35
axes[1].bar(x - width/2, df['sales'], width, label='销售额')
axes[1].bar(x + width/2, df['profit'], width, label='利润')
axes[1].set_title('月度销售额与利润')
axes[1].set_xlabel('月份')
axes[1].set_ylabel('金额（万元）')
axes[1].set_xticks(x)
axes[1].set_xticklabels(df['month'])
axes[1].legend()

plt.tight_layout()
plt.show()

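Summary

This chapter covered some of the most commonly used third-party Python tools and libraries:

pip - package manager
requests - HTTP request library that makes network requests easy
numpy - numerical computing library with efficient array operations
pandas - data analysis library with powerful data-processing capabilities
matplotlib - data visualization library for creating many kinds of charts

NumPy, pandas, and Matplotlib form the core of Python's data science ecosystem, while pip and requests are useful in almost any project. Together they cover most everyday data processing and analysis tasks.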

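Exercises

Use requests to call a public API and parse the JSON it returns
Use numpy to generate mock data and compute summary statistics
Use pandas to read a CSV file, then clean and analyze the data
Use matplotlib to visualize the data in charts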
