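Audio Processing

Audio processing is an important part of AI applications. It covers automatic speech recognition (ASR) and text-to-speech (TTS). This chapter introduces the audio features in Spring AI.

Overview

Audio AI models convert between speech and text: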

┌─────────────────────────────────────────────────────────────┐
│                      Audio AI Features                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Speech recognition (ASR)                                   │
│  Audio ───────────────────────> Text                        │
│  (Speech-to-Text)                                           │
│                                                             │
│  Text-to-speech (TTS)                                       │
│  Text ───────────────────────> Audio                        │
│  (Text-to-Speech)                                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

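Supported Models

| Feature | Provider | Model |
| --- | --- | --- |
| Speech recognition | OpenAI | Whisper |
| Speech recognition | Azure | Speech Services |
| Speech recognition | Google | Cloud Speech-to-Text |
| Speech recognition | Ollama | whisper (local) |
| Text-to-speech | OpenAI | TTS-1, TTS-1-HD |
| Text-to-speech | Azure | Speech Services |
| Text-to-speech | Google | Cloud Text-to-Speech |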

Speech Recognition (ASR)

The AudioTranscriptionModel Interface

public interface AudioTranscriptionModel {

    // Transcribes the audio carried by the request
    TranscriptionResponse transcribe(AudioTranscriptionRequest request);
}

OpenAI Whisper Configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        transcription:
          options:
            model: whisper-1
            language: zh            # optional: language of the input audio
            response-format: json   # json, text, srt, verbose_json, or vtt

Basic Transcription

@Service
public class TranscriptionService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Transcribes an audio file and returns the plain text.
     */
    public String transcribe(Resource audioFile) {
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(audioFile)
                .build();

        TranscriptionResponse response = transcriptionModel.transcribe(request);
        return response.getResult().getOutput();
    }

    /**
     * Transcribes an audio file and returns the text together with its metadata.
     */
    public TranscriptionResult transcribeWithDetails(Resource audioFile) {
        // VERBOSE_JSON asks for extra fields such as duration and detected language
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(audioFile)
                .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.VERBOSE_JSON)
                .build();

        TranscriptionResponse response = transcriptionModel.transcribe(request);

        return new TranscriptionResult(
                response.getResult().getOutput(),
                response.getMetadata().getDuration(),
                response.getMetadata().getLanguage()
        );
    }

    record TranscriptionResult(String text, Double duration, String language) {}
}

REST API Example

@RestController
@RequestMapping("/api/transcribe")
public class TranscriptionController {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Uploads an audio file and transcribes it.
     */
    @PostMapping
    public TranscriptionResponse uploadAndTranscribe(
            @RequestParam("file") MultipartFile file) throws IOException {

        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(toResource(file))
                .build();

        return transcriptionModel.transcribe(request);
    }

    /**
     * Transcribes an upload using an explicit language code (for example "zh" or "en").
     */
    @PostMapping("/language/{language}")
    public String transcribeWithLanguage(
            @PathVariable String language,
            @RequestParam("file") MultipartFile file) throws IOException {

        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(toResource(file))
                .language(language)
                .build();

        return transcriptionModel.transcribe(request).getResult().getOutput();
    }

    // Wraps the upload in a Resource that keeps the original file name,
    // so the file extension is still available downstream.
    private Resource toResource(MultipartFile file) throws IOException {
        return new ByteArrayResource(file.getBytes()) {
            @Override
            public String getFilename() {
                return file.getOriginalFilename();
            }
        };
    }
}

Transcription Options

@GetMapping("/transcribe-options")
public String transcribeWithOptions(@RequestParam String audioUrl) throws IOException {
    // Download the audio file (downloadAudio is a helper that is not shown here)
    byte[] audioData = downloadAudio(audioUrl);
    Resource audioResource = new ByteArrayResource(audioData);

    AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioResource)
            .model("whisper-1")
            // Language code of the audio
            .language("zh")
            // Hint that helps recognition, written in the audio's language
            // ("This is a passage about technology")
            .prompt("这是一段关于技术的内容")
            // OpenAI returns timestamps only with the verbose_json format
            .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.VERBOSE_JSON)
            // Randomness, from 0 to 1
            .temperature(0.0)
            // Timestamp granularity
            .timestampGranularities(List.of("word"))
            .build();

    return transcriptionModel.transcribe(request).getResult().getOutput();
}

Supported Audio Formats

| Format | Extensions |
| --- | --- |
| MP3 | .mp3 |
| MP4 | .mp4, .m4a |
| WAV | .wav |
| FLAC | .flac |
| WebM | .webm |
| OGG | .ogg, .oga |
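
Checking the file extension before calling the transcription model avoids a round trip for uploads that would be rejected anyway. The following is a minimal sketch that assumes the extension list from the table above. `AudioFormats` and `isSupported` are hypothetical names for illustration and are not part of Spring AI.

```java
import java.util.Locale;
import java.util.Set;

class AudioFormats {

    // Extensions from the supported-formats table, without the leading dot
    private static final Set<String> SUPPORTED =
            Set.of("mp3", "mp4", "m4a", "wav", "flac", "webm", "ogg", "oga");

    /** Returns true if the file name ends with one of the supported extensions. */
    public static boolean isSupported(String filename) {
        if (filename == null) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return false; // no extension at all
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("meeting.MP3"));
        System.out.println(isSupported("notes.txt"));
    }
}
```

In a controller, this check could run on `file.getOriginalFilename()` and return HTTP 400 for unsupported uploads. An extension check is only a first filter, since it does not inspect the file contents.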

Text-to-Speech (TTS)

Configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        speech:
          options:
            model: tts-1   # tts-1 (standard) or tts-1-hd (high quality)
            voice: alloy   # voice to use
            speed: 1.0     # speaking speed (0.25 - 4.0)
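Available Voices

| Voice | Character |
| --- | --- |
| alloy | Neutral, modern |
| echo | Male, warm |
| fable | British accent |
| onyx | Deep male |
| nova | Female, energetic |
| shimmer | Female, soft |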

Basic Usage

@Service
public class TextToSpeechService {

    @Autowired
    private SpeechModel speechModel;

    /**
     * Converts text to speech and returns the audio bytes.
     */
    public byte[] textToSpeech(String text) {
        // Uses SpeechPrompt, like the other examples in this chapter
        SpeechPrompt prompt = new SpeechPrompt(text);
        prompt.getOptions().setVoice("alloy");

        SpeechResponse response = speechModel.call(prompt);
        return response.getResult().getOutput();
    }

    /**
     * Converts text to speech and saves the result to a file.
     */
    public Path textToSpeechFile(String text, String outputPath) throws IOException {
        byte[] audioData = textToSpeech(text);

        Path path = Paths.get(outputPath);
        Files.write(path, audioData);

        return path;
    }
}

REST API Example

@RestController
@RequestMapping("/api/speech")
public class SpeechController {

    @Autowired
    private SpeechModel speechModel;

    /**
     * Converts text to speech.
     */
    @PostMapping(value = "/synthesize", produces = "audio/mpeg")
    public byte[] synthesize(@RequestBody SpeechRequestDTO request) {
        SpeechPrompt prompt = new SpeechPrompt(request.text());
        prompt.getOptions().setVoice(request.voice());
        prompt.getOptions().setSpeed(request.speed());

        SpeechResponse response = speechModel.call(prompt);
        return response.getResult().getOutput();
    }

    /**
     * Streams the audio as it is generated.
     */
    @PostMapping(value = "/stream", produces = "audio/mpeg")
    public Flux<byte[]> streamSynthesize(@RequestBody SpeechRequestDTO request) {
        SpeechPrompt prompt = new SpeechPrompt(request.text());
        prompt.getOptions().setVoice(request.voice());

        return speechModel.stream(prompt)
                .map(response -> response.getResult().getOutput());
    }

    /**
     * Returns the audio as a file download.
     */
    @PostMapping("/download")
    public ResponseEntity<byte[]> downloadSpeech(
            @RequestBody SpeechRequestDTO request) {

        SpeechPrompt prompt = new SpeechPrompt(request.text());
        prompt.getOptions().setVoice(request.voice());
        prompt.getOptions().setResponseFormat("mp3");

        SpeechResponse response = speechModel.call(prompt);
        byte[] audioData = response.getResult().getOutput();

        return ResponseEntity.ok()
                .header(HttpHeaders.CONTENT_DISPOSITION,
                        "attachment; filename=speech.mp3")
                .contentType(MediaType.APPLICATION_OCTET_STREAM)
                .body(audioData);
    }
}

// Request body. Java has no @default annotation for record components,
// so the compact constructor fills in the defaults when a field is omitted.
record SpeechRequestDTO(String text, String voice, Double speed) {
    SpeechRequestDTO {
        if (voice == null) {
            voice = "alloy";
        }
        if (speed == null) {
            speed = 1.0;
        }
    }
}

Practical Examples

1. Meeting Transcription

@Service
public class MeetingTranscriptionService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Transcribes a meeting recording and builds the meeting minutes.
     */
    public MeetingMinutes transcribeMeeting(Resource audioFile,
                                            String meetingTitle) {
        // Transcribe the recording
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(audioFile)
                .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.VERBOSE_JSON)
                .language("zh")
                .build();

        TranscriptionResponse response = transcriptionModel.transcribe(request);
        String transcript = response.getResult().getOutput();

        // Extract the key information
        return new MeetingMinutes(
                meetingTitle,
                transcript,
                extractKeyPoints(transcript),
                extractActionItems(transcript),
                LocalDateTime.now()
        );
    }

    private List<String> extractKeyPoints(String transcript) {
        // Placeholder: a ChatModel can be used to extract the key points
        return List.of("Key point extraction requires a ChatModel");
    }

    private List<String> extractActionItems(String transcript) {
        // Placeholder: a ChatModel can be used to extract the action items
        return List.of("Action item extraction requires a ChatModel");
    }

    record MeetingMinutes(
            String title,
            String transcript,
            List<String> keyPoints,
            List<String> actionItems,
            LocalDateTime createdAt
    ) {}
}
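
The two extraction methods above are placeholders. One possible way to fill them in is to ask a chat model for a bulleted list and parse the reply. The sketch below covers only the prompt text and the parsing step, so it runs without any model. `KeyPointParser`, `keyPointPrompt`, and `parseBullets` are hypothetical names, and the "one item per line starting with a dash" convention is an assumption that the prompt asks the model to follow.

```java
import java.util.List;

class KeyPointParser {

    /** Builds a prompt that asks for one key point per line, each starting with "- ". */
    public static String keyPointPrompt(String transcript) {
        return "List the key points of the following meeting transcript, "
                + "one per line, each starting with \"- \".\n\n" + transcript;
    }

    /** Keeps only the lines that start with "- " and strips the marker. */
    public static List<String> parseBullets(String reply) {
        return reply.lines()
                .map(String::strip)
                .filter(line -> line.startsWith("- "))
                .map(line -> line.substring(2).strip())
                .filter(item -> !item.isEmpty())
                .toList();
    }

    public static void main(String[] args) {
        String reply = "Here you go:\n- Budget approved\n- Ship in May\n\nThanks";
        System.out.println(parseBullets(reply));
    }
}
```

Lines that do not follow the requested format, such as greetings or closing remarks, are dropped rather than treated as key points.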

2. Podcast Transcription

@Service
public class PodcastService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Transcribes a long recording into timed segments.
     */
    public PodcastTranscript transcribePodcast(Resource audioFile) throws IOException {
        // SRT output carries a timecode for every segment
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(audioFile)
                .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.SRT)
                .language("zh")
                .build();

        TranscriptionResponse response = transcriptionModel.transcribe(request);
        String srtContent = response.getResult().getOutput();

        // Parse the SRT text
        List<SubtitleSegment> segments = parseSrt(srtContent);

        return new PodcastTranscript(
                segments,
                segments.stream()
                        .map(SubtitleSegment::text)
                        .collect(Collectors.joining(" ")),
                calculateDuration(segments)
        );
    }

    private List<SubtitleSegment> parseSrt(String srtContent) {
        // An SRT block is: index line, timecode line, then one or more text lines.
        // Normalize Windows line endings first so that blocks split on blank lines.
        List<SubtitleSegment> segments = new ArrayList<>();
        String[] blocks = srtContent.replace("\r\n", "\n").strip().split("\n{2,}");

        for (String block : blocks) {
            String[] lines = block.split("\n");
            if (lines.length >= 3) {
                segments.add(new SubtitleSegment(
                        lines[1], // timecode
                        Arrays.stream(lines)
                                .skip(2)
                                .collect(Collectors.joining(" "))
                ));
            }
        }

        return segments;
    }

    private Duration calculateDuration(List<SubtitleSegment> segments) {
        // Placeholder: assumes one minute per segment.
        // Parse the timecodes to get the real duration.
        return Duration.ofMinutes(segments.size());
    }

    record PodcastTranscript(
            List<SubtitleSegment> segments,
            String fullText,
            Duration duration
    ) {}

    record SubtitleSegment(String timestamp, String text) {}
}
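
The `calculateDuration` method above is a simplified stand-in. A more faithful version can read the end time from each SRT timecode line and take the latest one. The sketch below assumes the usual `HH:MM:SS,mmm --> HH:MM:SS,mmm` timecode layout. `SrtDuration` and its methods are hypothetical names for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrtDuration {

    // Captures the end time of a timecode line such as
    // "00:00:01,000 --> 00:00:04,500"
    private static final Pattern END_TIME =
            Pattern.compile("-->\\s*(\\d{2,}):(\\d{2}):(\\d{2})[,.](\\d{3})");

    /** Parses the end time of one timecode line, or returns zero if it does not match. */
    public static Duration endOf(String timecode) {
        Matcher m = END_TIME.matcher(timecode);
        if (!m.find()) {
            return Duration.ZERO;
        }
        return Duration.ofHours(Long.parseLong(m.group(1)))
                .plusMinutes(Long.parseLong(m.group(2)))
                .plusSeconds(Long.parseLong(m.group(3)))
                .plusMillis(Long.parseLong(m.group(4)));
    }

    /** The total duration is the latest end time across all timecode lines. */
    public static Duration totalDuration(List<String> timecodes) {
        Duration latest = Duration.ZERO;
        for (String timecode : timecodes) {
            Duration end = endOf(timecode);
            if (end.compareTo(latest) > 0) {
                latest = end;
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String> timecodes = List.of(
                "00:00:00,000 --> 00:00:04,500",
                "00:00:04,500 --> 00:01:10,250");
        System.out.println(totalDuration(timecodes).toMillis() + " ms");
    }
}
```

In `PodcastService`, the list of timecodes would come from `SubtitleSegment::timestamp`. Taking the maximum instead of the last entry keeps the result correct even if segments arrive out of order.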

3. Voice Assistant

@Service
public class VoiceAssistantService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    @Autowired
    private ChatClient chatClient;

    @Autowired
    private SpeechModel speechModel;

    /**
     * Handles one turn of a voice conversation: audio in, audio out.
     */
    public byte[] voiceConversation(byte[] userAudio) {
        // 1. Speech to text
        Resource audioResource = new ByteArrayResource(userAudio);
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(audioResource)
                .language("zh")
                .build();

        String userText = transcriptionModel.transcribe(request)
                .getResult().getOutput();

        // 2. Generate the AI reply
        String aiResponse = chatClient.prompt()
                .user(userText)
                .call()
                .content();

        // 3. Text to speech
        SpeechPrompt speechPrompt = new SpeechPrompt(aiResponse);
        speechPrompt.getOptions().setVoice("nova");

        return speechModel.call(speechPrompt)
                .getResult().getOutput();
    }
}

4. Multilingual Translation

@Service
public class AudioTranslationService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    @Autowired
    private ChatClient chatClient;

    @Autowired
    private SpeechModel speechModel;

    /**
     * Speech translation: takes speech in one language and returns speech in another.
     */
    public byte[] translateSpeech(byte[] audioData,
                                  String sourceLanguage,
                                  String targetLanguage) {
        // 1. Transcribe the original speech
        Resource audioResource = new ByteArrayResource(audioData);
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(audioResource)
                .language(sourceLanguage)
                .build();

        String originalText = transcriptionModel.transcribe(request)
                .getResult().getOutput();

        // 2. Translate the text
        String translatedText = chatClient.prompt()
                .system(String.format(
                        "You are a translation expert. Translate the following %s text into %s and return only the translation.",
                        sourceLanguage, targetLanguage
                ))
                .user(originalText)
                .call()
                .content();

        // 3. Convert the translation to speech in the target language
        SpeechPrompt speechPrompt = new SpeechPrompt(translatedText);

        // Pick a suitable voice
        String voice = selectVoiceForLanguage(targetLanguage);
        speechPrompt.getOptions().setVoice(voice);

        return speechModel.call(speechPrompt)
                .getResult().getOutput();
    }

    private String selectVoiceForLanguage(String language) {
        return switch (language) {
            case "zh" -> "nova";
            case "en" -> "alloy";
            case "ja" -> "echo";
            default -> "alloy";
        };
    }
}
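
The system prompt above is formatted with raw language codes such as "zh" and "en". Spelling out the language names may make the instruction clearer to the chat model. The sketch below uses the JDK's `Locale` to turn a code into an English display name. `LanguageNames`, `displayName`, and `systemPrompt` are hypothetical names for illustration.

```java
import java.util.Locale;

class LanguageNames {

    /** Returns the English name for a language code, or the code itself if no name is known. */
    public static String displayName(String code) {
        String name = Locale.forLanguageTag(code).getDisplayLanguage(Locale.ENGLISH);
        return name.isEmpty() ? code : name;
    }

    /** Builds the translation system prompt using language names instead of codes. */
    public static String systemPrompt(String sourceCode, String targetCode) {
        return String.format(
                "You are a translation expert. Translate the following %s text into %s "
                        + "and return only the translation.",
                displayName(sourceCode), displayName(targetCode));
    }

    public static void main(String[] args) {
        System.out.println(systemPrompt("zh", "en"));
    }
}
```

The transcription request should still receive the original code, since its `language` option expects a code rather than a display name.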

5. Audiobook Generation

@Service
public class AudiobookService {

    @Autowired
    private SpeechModel speechModel;

    /**
     * Converts text into an audiobook, one audio file per paragraph.
     */
    public List<Path> generateAudiobook(String content,
                                        String outputDir,
                                        String voice) throws IOException {
        // Split long text into paragraphs
        List<String> paragraphs = splitIntoParagraphs(content);

        Path dir = Paths.get(outputDir);
        Files.createDirectories(dir);

        List<Path> audioFiles = new ArrayList<>();

        for (int i = 0; i < paragraphs.size(); i++) {
            String paragraph = paragraphs.get(i);

            SpeechPrompt prompt = new SpeechPrompt(paragraph);
            prompt.getOptions().setVoice(voice);
            prompt.getOptions().setSpeed(0.9); // a slightly slower pace suits audiobooks

            byte[] audioData = speechModel.call(prompt)
                    .getResult().getOutput();

            Path audioFile = dir.resolve("chapter_" + i + ".mp3");
            Files.write(audioFile, audioData);
            audioFiles.add(audioFile);
        }

        return audioFiles;
    }

    private List<String> splitIntoParagraphs(String content) {
        // Paragraphs are separated by blank lines
        return Arrays.stream(content.split("\n\n"))
                .filter(p -> !p.isBlank())
                .toList();
    }
}
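
Splitting on blank lines alone can still produce paragraphs that are too long for a single speech request. The sketch below caps each chunk at a maximum length and prefers to break after sentence-ending punctuation. It assumes the provider enforces a per-request character limit (OpenAI documents 4096 characters for its speech endpoint, but check the limit of the model in use). `TextChunker` and `chunk` are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TextChunker {

    /**
     * Splits text into chunks of at most maxChars characters. Breaks after the
     * last sentence end inside the limit, or cuts hard if there is none.
     */
    public static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String rest = text.strip();
        while (rest.length() > maxChars) {
            int cut = lastSentenceEnd(rest, maxChars);
            if (cut <= 0) {
                cut = maxChars; // no sentence boundary in range: hard cut
            }
            chunks.add(rest.substring(0, cut).strip());
            rest = rest.substring(cut).strip();
        }
        if (!rest.isEmpty()) {
            chunks.add(rest);
        }
        return chunks;
    }

    /** Index just after the last sentence-ending character in the first limit characters, or -1. */
    private static int lastSentenceEnd(String s, int limit) {
        for (int i = limit - 1; i >= 0; i--) {
            char c = s.charAt(i);
            if (c == '.' || c == '!' || c == '?' || c == '。' || c == '!' || c == '?') {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(chunk("One. Two. Three.", 10));
    }
}
```

In `AudiobookService`, each paragraph could be passed through `chunk` before the call to the speech model. The hard cut works on `char` positions, so it may split a surrogate pair in text that contains such characters.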

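File Formats

Supported Output Formats

| Format | Description | Typical use |
| --- | --- | --- |
| MP3 | Compressed | General playback |
| AAC | Efficient compression | Streaming |
| FLAC | Lossless compression | High-quality storage |
| WAV | Uncompressed | Professional processing |

// A handler method that belongs in a controller with an injected speechModel
@GetMapping("/synthesize/format/{format}")
public byte[] synthesizeWithFormat(
        @PathVariable String format,
        @RequestParam String text) {

    SpeechPrompt prompt = new SpeechPrompt(text);
    prompt.getOptions().setResponseFormat(format);
    prompt.getOptions().setVoice("alloy");

    return speechModel.call(prompt).getResult().getOutput();
}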
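
The endpoint above returns raw bytes without saying which format they are in, so a client has to guess. One option is to map the requested format to a content type and set it on the response. The sketch below covers the four formats from the table and uses commonly seen MIME types. `AudioMimeTypes` and `mimeTypeFor` are hypothetical names for illustration.

```java
import java.util.Locale;
import java.util.Map;

class AudioMimeTypes {

    private static final String FALLBACK = "application/octet-stream";

    // Commonly used MIME types for the formats in the output-format table
    private static final Map<String, String> MIME_BY_FORMAT = Map.of(
            "mp3", "audio/mpeg",
            "aac", "audio/aac",
            "flac", "audio/flac",
            "wav", "audio/wav");

    /** Returns the MIME type for a format name, or a generic binary type if unknown. */
    public static String mimeTypeFor(String format) {
        if (format == null) {
            return FALLBACK;
        }
        return MIME_BY_FORMAT.getOrDefault(format.toLowerCase(Locale.ROOT), FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor("mp3"));
        System.out.println(mimeTypeFor("unknown"));
    }
}
```

In a Spring controller, the result could be passed to `MediaType.parseMediaType(...)` and set through `ResponseEntity.ok().contentType(...)`.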

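Best Practices

1. Audio Preprocessing

@Service
public class AudioPreprocessingService {

    /**
     * Preprocesses audio to improve transcription quality.
     */
    public byte[] preprocessAudio(byte[] audioData) throws Exception {
        // Use the Java Sound API or FFmpeg to:
        // 1. Convert to mono
        // 2. Normalize the sample rate (16 kHz)
        // 3. Reduce noise

        // Simplified implementation: returns the data unchanged
        return audioData;
    }
}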
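
The first step listed above, converting to mono, can be shown on raw audio samples. The sketch below downmixes interleaved 16-bit little-endian stereo PCM by averaging the left and right sample of each frame. It assumes raw PCM input, so it does not apply to a WAV file with its header still attached or to compressed formats such as MP3, which need decoding first. `MonoDownmix` and `stereoToMono` are hypothetical names for illustration.

```java
class MonoDownmix {

    /**
     * Downmixes interleaved 16-bit little-endian stereo PCM to mono.
     * Each 4-byte stereo frame (left, right) becomes one 2-byte mono sample.
     */
    public static byte[] stereoToMono(byte[] stereo) {
        if (stereo.length % 4 != 0) {
            throw new IllegalArgumentException("expected whole 4-byte stereo frames");
        }
        byte[] mono = new byte[stereo.length / 2];
        for (int in = 0, out = 0; in < stereo.length; in += 4, out += 2) {
            // Rebuild the signed 16-bit samples from their two bytes
            int left = (short) ((stereo[in] & 0xFF) | (stereo[in + 1] << 8));
            int right = (short) ((stereo[in + 2] & 0xFF) | (stereo[in + 3] << 8));
            int mixed = (left + right) / 2;
            // Write the averaged sample back as little-endian
            mono[out] = (byte) mixed;
            mono[out + 1] = (byte) (mixed >> 8);
        }
        return mono;
    }

    public static void main(String[] args) {
        // One frame: left = 1000 (0x03E8), right = 3000 (0x0BB8)
        byte[] frame = {(byte) 0xE8, 0x03, (byte) 0xB8, 0x0B};
        byte[] mono = stereoToMono(frame);
        int sample = (short) ((mono[0] & 0xFF) | (mono[1] << 8));
        System.out.println(sample);
    }
}
```

Resampling and noise reduction are more involved and are usually left to a tool such as FFmpeg.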

2. Long Audio Processing

@Service
public class LongAudioService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Transcribes a long recording in segments.
     */
    public String transcribeLongAudio(Resource audioFile) throws IOException {
        // The OpenAI transcription endpoint accepts files up to 25 MB,
        // so larger recordings must be split first.
        List<Resource> segments = splitAudio(audioFile);

        StringBuilder transcript = new StringBuilder();

        for (Resource segment : segments) {
            AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                    .audio(segment)
                    .build();

            String text = transcriptionModel.transcribe(request)
                    .getResult().getOutput();

            transcript.append(text).append(" ");
        }

        return transcript.toString().trim();
    }

    private List<Resource> splitAudio(Resource audioFile) {
        // Placeholder: split the audio with FFmpeg; returns the file unsplit for now
        return List.of(audioFile);
    }
}
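
The `splitAudio` method above is a placeholder. One way to do the split is FFmpeg's segment muxer, which cuts a file into fixed-length pieces without re-encoding. The sketch below only builds the command line, so it runs without FFmpeg installed. `FfmpegSplit` and `segmentCommand` are hypothetical names, and the 600-second segment length and file names are example values.

```java
import java.util.List;

class FfmpegSplit {

    /** Builds an ffmpeg command that cuts the input into fixed-length pieces without re-encoding. */
    public static List<String> segmentCommand(String input, int segmentSeconds, String outputPattern) {
        if (segmentSeconds <= 0) {
            throw new IllegalArgumentException("segmentSeconds must be positive");
        }
        return List.of(
                "ffmpeg",
                "-i", input,                                     // source file
                "-f", "segment",                                 // use the segment muxer
                "-segment_time", String.valueOf(segmentSeconds), // target length per piece
                "-c", "copy",                                    // copy the stream, no re-encoding
                outputPattern);                                  // e.g. part_%03d.mp3
    }

    public static void main(String[] args) {
        List<String> command = segmentCommand("podcast.mp3", 600, "part_%03d.mp3");
        System.out.println(String.join(" ", command));
        // To run it (requires ffmpeg on the PATH):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Because the stream is copied rather than re-encoded, cuts land on frame boundaries and piece lengths are approximate. Cutting mid-sentence can also hurt transcription near the boundaries, so it may be worth checking the joined transcript there.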

3. Caching Transcription Results

@Service
public class CachedTranscriptionService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    // Caffeine cache keyed by the hash of the audio content
    private final Cache<String, String> transcriptCache =
            Caffeine.newBuilder()
                    .maximumSize(1000)
                    .expireAfterAccess(Duration.ofHours(24))
                    .build();

    public String transcribeWithCache(Resource audioFile) throws IOException {
        String hash = computeHash(audioFile);

        // Transcribe only when the hash is not cached yet
        return transcriptCache.get(hash, key -> {
            AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                    .audio(audioFile)
                    .build();

            return transcriptionModel.transcribe(request)
                    .getResult().getOutput();
        });
    }

    private String computeHash(Resource resource) throws IOException {
        // DigestUtils comes from Apache Commons Codec; close the stream after hashing
        try (InputStream in = resource.getInputStream()) {
            return DigestUtils.md5Hex(in);
        }
    }
}
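
The cache key above comes from a third-party hashing helper. The same idea works with the JDK alone. The sketch below computes a SHA-256 hex digest with `MessageDigest` and `HexFormat` (Java 17 or later). Using SHA-256 here is a substitution for illustration, not what the service above does. `AudioHash` and `sha256Hex` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

class AudioHash {

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Hashes an in-memory byte array. */
    public static String sha256Hex(byte[] data) {
        return HexFormat.of().formatHex(newDigest().digest(data));
    }

    /** Hashes a stream in 8 KB blocks so that large files are never fully loaded. */
    public static String sha256Hex(InputStream in) throws IOException {
        MessageDigest digest = newDigest();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The stream variant suits large recordings, and the caller remains responsible for closing the stream.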

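Summary

In this chapter we covered:

  1. Speech recognition (ASR): converting audio to text
  2. Text-to-speech (TTS): converting text to audio
  3. Supported models: OpenAI Whisper, TTS, and others
  4. Configuration options: language, voice, speed, and more
  5. Practical applications: meeting minutes, podcast transcription, voice assistants
  6. Best practices: preprocessing, segmented processing, caching

Exercises

  1. Implement a service that converts voice messages to text
  2. Build a multilingual speech translation feature
  3. Implement a simple voice chatbot

References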