Audio Processing

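Audio processing is an important part of AI applications. It covers automatic speech recognition (ASR) and text-to-speech (TTS). This chapter introduces the audio processing features in Spring AI.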

Overview

Audio AI models convert between speech and text:

┌────────────────────────────────────────────────────────────┐
│                   Audio AI capabilities                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│   Speech recognition (ASR)                                 │
│   Audio ───────────────────────> Text                      │
│   (Speech-to-Text)                                         │
│                                                            │
│   Text-to-speech (TTS)                                     │
│   Text ───────────────────────> Audio                      │
│   (Text-to-Speech)                                         │
│                                                            │
└────────────────────────────────────────────────────────────┘

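Supported models

Capability	Provider	Model
Speech recognition	OpenAI	Whisper
Speech recognition	Azure	Speech Services
Speech recognition	Google	Cloud Speech-to-Text
Speech recognition	Ollama	whisper (local)
Text-to-speech	OpenAI	TTS-1, TTS-1-HD
Text-to-speech	Azure	Speech Services
Text-to-speech	Google	Cloud Text-to-Speech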

Speech recognition (ASR)

The AudioTranscriptionModel interface

public interface AudioTranscriptionModel {
    
    // Transcribe an audio file
    TranscriptionResponse transcribe(AudioTranscriptionRequest request);
}

OpenAI Whisper configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        transcription:
          options:
            model: whisper-1
            language: zh  # optional: language of the audio (here: Chinese)
            response-format: json  # json, text, srt, verbose_json, or vtt

Basic transcription

@Service
public class TranscriptionService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Transcribe an audio file
     */
    public String transcribe(Resource audioFile) {
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioFile)
            .build();
        
        TranscriptionResponse response = transcriptionModel.transcribe(request);
        return response.getResult().getOutput();
    }

    /**
     * Transcribe and return detailed information
     */
    public TranscriptionResult transcribeWithDetails(Resource audioFile) {
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioFile)
            .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.VERBOSE_JSON)
            .build();
        
        TranscriptionResponse response = transcriptionModel.transcribe(request);
        
        return new TranscriptionResult(
            response.getResult().getOutput(),
            response.getMetadata().getDuration(),
            response.getMetadata().getLanguage()
        );
    }

    record TranscriptionResult(String text, Double duration, String language) {}
}

REST API example

@RestController
@RequestMapping("/api/transcribe")
public class TranscriptionController {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Upload and transcribe audio
     */
    @PostMapping
    public TranscriptionResponse uploadAndTranscribe(
            @RequestParam("file") MultipartFile file) throws IOException {
        
        Resource audioResource = new ByteArrayResource(file.getBytes()) {
            @Override
            public String getFilename() {
                return file.getOriginalFilename();
            }
        };
        
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioResource)
            .build();
        
        return transcriptionModel.transcribe(request);
    }

    /**
     * Transcribe with an explicit language
     */
    @PostMapping("/language/{language}")
    public String transcribeWithLanguage(
            @PathVariable String language,
            @RequestParam("file") MultipartFile file) throws IOException {
        
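        // Preserve the original file name so the API can detect the audio format
        Resource audioResource = new ByteArrayResource(file.getBytes()) {
            @Override
            public String getFilename() {
                return file.getOriginalFilename();
            }
        };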
        
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioResource)
            .language(language)
            .build();
        
        return transcriptionModel.transcribe(request).getResult().getOutput();
    }
}

Transcription options

@GetMapping("/transcribe-options")
public String transcribeWithOptions(@RequestParam String audioUrl) throws IOException {
    // Download the audio file
    byte[] audioData = downloadAudio(audioUrl);
    Resource audioResource = new ByteArrayResource(audioData);
    
    AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
        .audio(audioResource)
        .model("whisper-1")
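        .language("zh")           // language code of the audio (here: Chinese)
        .prompt("这是一段关于技术的内容")  // hint in the audio's language ("this is content about technology") to guide recognition
        .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.VERBOSE_JSON)  // word timestamps require verbose_json
        .temperature(0.0)         // sampling randomness (0 to 1)
        .timestampGranularities(List.of("word"))  // timestamp granularity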
        .build();
    
    return transcriptionModel.transcribe(request).getResult().getOutput();
}

Supported audio formats

Format	Extension
MP3	.mp3
MP4	.mp4, .m4a
WAV	.wav
FLAC	.flac
WebM	.webm
OGG	.ogg, .oga
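
Rejecting files whose extension is not in the table above, before they are uploaded, saves a round trip to the API. The following is a minimal sketch using only the JDK. `AudioFormatCheck` and `isSupported` are hypothetical helper names, not part of Spring AI, and the extension list simply mirrors the table.

```java
import java.util.Locale;
import java.util.Set;

final class AudioFormatCheck {

    // Extensions copied from the table above (lower case, without the dot)
    private static final Set<String> SUPPORTED =
        Set.of("mp3", "mp4", "m4a", "wav", "flac", "webm", "ogg", "oga");

    private AudioFormatCheck() {}

    /** Returns true when the file name ends with one of the listed extensions. */
    static boolean isSupported(String filename) {
        if (filename == null) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return false; // no extension at all
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("meeting.MP3"));
        System.out.println(isSupported("notes.txt"));
    }
}
```

A controller could call `AudioFormatCheck.isSupported(file.getOriginalFilename())` and answer with HTTP 400 when it returns false. The check looks only at the name, so it does not guarantee that the content is valid audio.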

Text-to-speech (TTS)

Configuration

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        speech:
          options:
            model: tts-1  # tts-1 (standard) or tts-1-hd (high definition)
            voice: alloy  # voice to use
            speed: 1.0    # speaking speed (0.25 - 4.0)

Available voices

Voice	Character
`alloy`	Neutral, modern
`echo`	Male, warm
`fable`	British accent
`onyx`	Deep male
`nova`	Energetic female
`shimmer`	Soft female
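
Voice names and speeds usually arrive from clients as free-form values, so validating them before calling the model avoids a failed request. This is a minimal sketch, assuming the six voices listed above and the 0.25 to 4.0 speed range from the configuration. `SpeechParams` is a hypothetical helper, and falling back to `alloy` is an illustrative choice.

```java
import java.util.Set;

final class SpeechParams {

    // Voice names copied from the table above
    private static final Set<String> VOICES =
        Set.of("alloy", "echo", "fable", "onyx", "nova", "shimmer");

    private SpeechParams() {}

    /** Falls back to "alloy" when the requested voice is missing or unknown. */
    static String voiceOrDefault(String voice) {
        return (voice != null && VOICES.contains(voice)) ? voice : "alloy";
    }

    /** Clamps the speed into the 0.25 to 4.0 range; null means normal speed. */
    static double clampSpeed(Double speed) {
        if (speed == null) {
            return 1.0;
        }
        return Math.max(0.25, Math.min(4.0, speed));
    }

    public static void main(String[] args) {
        System.out.println(voiceOrDefault("nova") + " " + clampSpeed(9.0));
    }
}
```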

Basic usage

@Service
public class TextToSpeechService {

    @Autowired
    private SpeechModel speechModel;

    /**
     * Convert text to speech
     */
    public byte[] textToSpeech(String text) {
        SpeechPrompt prompt = new SpeechPrompt(text);
        prompt.getOptions().setVoice("alloy");
        
        SpeechResponse response = speechModel.call(prompt);
        return response.getResult().getOutput();
    }

    /**
     * Convert and save to a file
     */
    public Path textToSpeechFile(String text, String outputPath) throws IOException {
        byte[] audioData = textToSpeech(text);
        
        Path path = Paths.get(outputPath);
        Files.write(path, audioData);
        
        return path;
    }
}

REST API example

@RestController
@RequestMapping("/api/speech")
public class SpeechController {

    @Autowired
    private SpeechModel speechModel;

    /**
     * Text to speech
     */
    @PostMapping(value = "/synthesize", produces = "audio/mpeg")
    public byte[] synthesize(@RequestBody SpeechRequestDTO request) {
        SpeechPrompt prompt = new SpeechPrompt(request.text());
        prompt.getOptions().setVoice(request.voice());
        prompt.getOptions().setSpeed(request.speed());
        
        SpeechResponse response = speechModel.call(prompt);
        return response.getResult().getOutput();
    }

    /**
     * Streaming output
     */
    @PostMapping(value = "/stream", produces = "audio/mpeg")
    public Flux<byte[]> streamSynthesize(@RequestBody SpeechRequestDTO request) {
        SpeechPrompt prompt = new SpeechPrompt(request.text());
        prompt.getOptions().setVoice(request.voice());
        
        return speechModel.stream(prompt)
            .map(response -> response.getResult().getOutput());
    }

    /**
     * Download the audio as a file
     */
    @PostMapping("/download")
    public ResponseEntity<byte[]> downloadSpeech(
            @RequestBody SpeechRequestDTO request) {
        
        SpeechPrompt prompt = new SpeechPrompt(request.text());
        prompt.getOptions().setVoice(request.voice());
        prompt.getOptions().setResponseFormat("mp3");
        
        SpeechResponse response = speechModel.call(prompt);
        byte[] audioData = response.getResult().getOutput();
        
        return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, 
                   "attachment; filename=speech.mp3")
            .contentType(MediaType.APPLICATION_OCTET_STREAM)
            .body(audioData);
    }
}

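record SpeechRequestDTO(String text, String voice, Double speed) {
    // Java has no @default annotation; apply defaults in the compact constructor
    SpeechRequestDTO {
        if (voice == null) voice = "alloy";
        if (speed == null) speed = 1.0;
    }
}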

Practical examples

1. Transcribing meeting recordings

@Service
public class MeetingTranscriptionService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Transcribe a meeting recording
     */
    public MeetingMinutes transcribeMeeting(Resource audioFile, 
                                            String meetingTitle) {
        // Transcribe
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioFile)
            .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.VERBOSE_JSON)
            .language("zh")
            .build();
        
        TranscriptionResponse response = transcriptionModel.transcribe(request);
        String transcript = response.getResult().getOutput();
        
        // Extract the key information
        return new MeetingMinutes(
            meetingTitle,
            transcript,
            extractKeyPoints(transcript),
            extractActionItems(transcript),
            LocalDateTime.now()
        );
    }

    private List<String> extractKeyPoints(String transcript) {
        // Placeholder: a ChatModel can be used to extract the key points
        return List.of("Key point extraction requires a ChatModel");
    }

    private List<String> extractActionItems(String transcript) {
        // Placeholder: a ChatModel can be used to extract the action items
        return List.of("Action item extraction requires a ChatModel");
    }

    record MeetingMinutes(
        String title,
        String transcript,
        List<String> keyPoints,
        List<String> actionItems,
        LocalDateTime createdAt
    ) {}
}

2. Podcast transcription

@Service
public class PodcastService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    /**
     * Transcribe long audio into timestamped segments
     */
    public PodcastTranscript transcribePodcast(Resource audioFile) throws IOException {
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioFile)
            .responseFormat(AudioTranscriptionRequest.TranscriptResponseFormat.SRT)
            .language("zh")
            .build();
        
        TranscriptionResponse response = transcriptionModel.transcribe(request);
        String srtContent = response.getResult().getOutput();
        
        // Parse the SRT output
        List<SubtitleSegment> segments = parseSrt(srtContent);
        
        return new PodcastTranscript(
            segments,
            segments.stream()
                .map(SubtitleSegment::text)
                .collect(Collectors.joining(" ")),
            calculateDuration(segments)
        );
    }

    private List<SubtitleSegment> parseSrt(String srtContent) {
        // Parse SRT subtitles: blocks are separated by blank lines
        List<SubtitleSegment> segments = new ArrayList<>();
        String[] blocks = srtContent.strip().split("\\R{2,}");  // \R matches both \n and \r\n
        
        for (String block : blocks) {
            String[] lines = block.split("\\R");
            if (lines.length >= 3) {
                segments.add(new SubtitleSegment(
                    lines[1],  // timecode line
                    Arrays.stream(lines)
                        .skip(2)
                        .collect(Collectors.joining(" "))
                ));
            }
        }
        
        return segments;
    }

    private Duration calculateDuration(List<SubtitleSegment> segments) {
        // Placeholder: counts one minute per segment instead of reading the timecodes
        return Duration.ofMinutes(segments.size());
    }

    record PodcastTranscript(
        List<SubtitleSegment> segments,
        String fullText,
        Duration duration
    ) {}

    record SubtitleSegment(String timestamp, String text) {}
}
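
The `calculateDuration` method above is only a placeholder. One way to get the real length is to parse the end time of the last segment. The sketch below assumes that the model returns standard SRT timecode lines of the form `HH:mm:ss,SSS --> HH:mm:ss,SSS`. `SrtTime` is a hypothetical helper, not part of Spring AI.

```java
import java.time.Duration;

final class SrtTime {

    private SrtTime() {}

    /** Parses one SRT timestamp of the form HH:mm:ss,SSS. */
    static Duration parseTimestamp(String ts) {
        String[] p = ts.trim().split("[:,]");
        if (p.length != 4) {
            throw new IllegalArgumentException("Not an SRT timestamp: " + ts);
        }
        return Duration.ofHours(Long.parseLong(p[0]))
            .plusMinutes(Long.parseLong(p[1]))
            .plusSeconds(Long.parseLong(p[2]))
            .plusMillis(Long.parseLong(p[3]));
    }

    /** Returns the end time of a timecode line such as "00:00:01,000 --> 00:00:04,500". */
    static Duration endOf(String timecodeLine) {
        String[] range = timecodeLine.split("-->");
        if (range.length != 2) {
            throw new IllegalArgumentException("Not an SRT timecode line: " + timecodeLine);
        }
        return parseTimestamp(range[1]);
    }

    public static void main(String[] args) {
        System.out.println(endOf("00:00:01,000 --> 00:01:02,500").toMillis());
    }
}
```

With this helper, `calculateDuration` could return `SrtTime.endOf(lastSegment.timestamp())` for a non-empty list and `Duration.ZERO` otherwise.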

3. Voice assistant

@Service
public class VoiceAssistantService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;
    
    @Autowired
    private ChatClient chatClient;
    
    @Autowired
    private SpeechModel speechModel;

    /**
     * Voice conversation
     */
    public byte[] voiceConversation(byte[] userAudio) {
        // 1. Speech to text
        Resource audioResource = new ByteArrayResource(userAudio);
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioResource)
            .language("zh")
            .build();
        
        String userText = transcriptionModel.transcribe(request)
            .getResult().getOutput();
        
        // 2. Generate the AI reply
        String aiResponse = chatClient.prompt()
            .user(userText)
            .call()
            .content();
        
        // 3. Text to speech
        SpeechPrompt speechPrompt = new SpeechPrompt(aiResponse);
        speechPrompt.getOptions().setVoice("nova");
        
        return speechModel.call(speechPrompt)
            .getResult().getOutput();
    }
}

4. Multilingual translation

@Service
public class AudioTranslationService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;
    
    @Autowired
    private ChatClient chatClient;
    
    @Autowired
    private SpeechModel speechModel;

    /**
     * Speech translation: takes speech in one language and returns speech in another
     */
    public byte[] translateSpeech(byte[] audioData, 
                                   String sourceLanguage,
                                   String targetLanguage) {
        // 1. Transcribe the original speech
        Resource audioResource = new ByteArrayResource(audioData);
        AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
            .audio(audioResource)
            .language(sourceLanguage)
            .build();
        
        String originalText = transcriptionModel.transcribe(request)
            .getResult().getOutput();
        
        // 2. Translate the text
        String translatedText = chatClient.prompt()
            .system(String.format(
                "You are an expert translator. Translate the following %s text into %s and return only the translation.",
                sourceLanguage, targetLanguage
            ))
            .user(originalText)
            .call()
            .content();
        
        // 3. Convert to speech in the target language
        SpeechPrompt speechPrompt = new SpeechPrompt(translatedText);
        
        // Pick a suitable voice
        String voice = selectVoiceForLanguage(targetLanguage);
        speechPrompt.getOptions().setVoice(voice);
        
        return speechModel.call(speechPrompt)
            .getResult().getOutput();
    }

    private String selectVoiceForLanguage(String language) {
        return switch (language) {
            case "zh" -> "nova";
            case "en" -> "alloy";
            case "ja" -> "echo";
            default -> "alloy";
        };
    }
}

5. Audiobook generation

@Service
public class AudiobookService {

    @Autowired
    private SpeechModel speechModel;

    /**
     * Convert text into an audiobook
     */
    public List<Path> generateAudiobook(String content, 
                                         String outputDir,
                                         String voice) throws IOException {
        // Process long text paragraph by paragraph
        List<String> paragraphs = splitIntoParagraphs(content);
        
        Path dir = Paths.get(outputDir);
        Files.createDirectories(dir);
        
        List<Path> audioFiles = new ArrayList<>();
        
        for (int i = 0; i < paragraphs.size(); i++) {
            String paragraph = paragraphs.get(i);
            
            SpeechPrompt prompt = new SpeechPrompt(paragraph);
            prompt.getOptions().setVoice(voice);
            prompt.getOptions().setSpeed(0.9);  // a slightly slower pace suits audiobooks
            
            byte[] audioData = speechModel.call(prompt)
                .getResult().getOutput();
            
            Path audioFile = dir.resolve("chapter_" + i + ".mp3");
            Files.write(audioFile, audioData);
            audioFiles.add(audioFile);
        }
        
        return audioFiles;
    }

    private List<String> splitIntoParagraphs(String content) {
        return Arrays.stream(content.split("\n\n"))
            .filter(p -> !p.isBlank())
            .toList();
    }
}
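
A single paragraph can still exceed what one speech request accepts: the OpenAI speech endpoint limits `input` to 4096 characters. The sketch below splits an over-long paragraph, preferring sentence boundaries. `TextChunker` and its boundary rule are illustrative assumptions, and a hard cut is used when no sentence end is found.

```java
import java.util.ArrayList;
import java.util.List;

final class TextChunker {

    // Latin terminators plus the CJK full stop, exclamation mark and question mark
    private static final String TERMINATORS = ".!?\u3002\uFF01\uFF1F";

    private TextChunker() {}

    /** Splits text into chunks of at most maxChars characters. */
    static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String rest = text.strip();
        while (rest.length() > maxChars) {
            int cut = lastSentenceEnd(rest, maxChars);
            chunks.add(rest.substring(0, cut).strip());
            rest = rest.substring(cut).strip();
        }
        if (!rest.isEmpty()) {
            chunks.add(rest);
        }
        return chunks;
    }

    // Index just past the last terminator within the first maxChars characters,
    // or maxChars itself (a hard cut) when the window contains none.
    private static int lastSentenceEnd(String s, int maxChars) {
        for (int i = maxChars - 1; i > 0; i--) {
            if (TERMINATORS.indexOf(s.charAt(i)) >= 0) {
                return i + 1;
            }
        }
        return maxChars;
    }

    public static void main(String[] args) {
        System.out.println(chunk("One. Two. Three.", 10));
    }
}
```

In `generateAudiobook`, each paragraph could be passed through `TextChunker.chunk(paragraph, 4096)` before the speech call, writing one audio file per chunk.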

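File formats

Supported output formats

Format	Description	Typical use
MP3	Compressed	General playback
AAC	Efficient compression	Streaming
FLAC	Lossless compression	High-quality storage
WAV	Uncompressed	Professional processing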

@GetMapping("/synthesize/format/{format}")
public byte[] synthesizeWithFormat(
        @PathVariable String format,
        @RequestParam String text) {
    
    SpeechPrompt prompt = new SpeechPrompt(text);
    prompt.getOptions().setResponseFormat(format);
    prompt.getOptions().setVoice("alloy");
    
    return speechModel.call(prompt).getResult().getOutput();
}
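
The endpoint above returns raw bytes without telling the client which format they are in. A small lookup from format name to a commonly used MIME type lets a controller set `Content-Type` to match. This is a sketch: `AudioMimeTypes` is a hypothetical helper, and falling back to `application/octet-stream` for unknown formats is a design assumption.

```java
import java.util.Locale;
import java.util.Map;

final class AudioMimeTypes {

    // Commonly used MIME types for the output formats in the table above
    private static final Map<String, String> BY_FORMAT = Map.of(
        "mp3", "audio/mpeg",
        "aac", "audio/aac",
        "flac", "audio/flac",
        "wav", "audio/wav");

    private AudioMimeTypes() {}

    /** Returns the MIME type for a format name, or a generic binary type if unknown. */
    static String forFormat(String format) {
        if (format == null) {
            return "application/octet-stream";
        }
        return BY_FORMAT.getOrDefault(
            format.toLowerCase(Locale.ROOT), "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(forFormat("MP3"));
    }
}
```

A handler could then return `ResponseEntity.ok().contentType(MediaType.parseMediaType(AudioMimeTypes.forFormat(format))).body(audioData)`.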

Best practices

1. Audio preprocessing

@Service
public class AudioPreprocessingService {

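    /**
     * Preprocess audio to improve transcription quality
     */
    public byte[] preprocessAudio(byte[] audioData) throws Exception {
        // Use the Java Sound API or FFmpeg to:
        // 1. Downmix to mono
        // 2. Normalize the sample rate (16 kHz)
        // 3. Reduce noise
        
        // Simplified implementation: returns the data unchanged
        return audioData;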
    }
}
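
The three preprocessing steps named in the comments above map directly onto FFmpeg options. The sketch below only builds the command line and does not run it. It assumes an `ffmpeg` binary is available on the PATH. `FfmpegCommands` is a hypothetical helper, and `afftdn` is just one of FFmpeg's denoising filters.

```java
import java.util.List;

final class FfmpegCommands {

    private FfmpegCommands() {}

    /** Builds an FFmpeg command that writes a mono, 16 kHz, denoised copy of the input. */
    static List<String> normalizeForAsr(String input, String output) {
        return List.of(
            "ffmpeg", "-y",   // overwrite the output file if it exists
            "-i", input,
            "-ac", "1",       // 1. downmix to mono
            "-ar", "16000",   // 2. resample to 16 kHz
            "-af", "afftdn",  // 3. FFT-based noise reduction
            output);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", normalizeForAsr("in.mp3", "out.wav")));
    }
}
```

The resulting list can be handed to `new ProcessBuilder(command).inheritIO().start()`; the caller should wait for the process and check its exit code before reading the output file.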

2. Handling long audio

@Service
public class LongAudioService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

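    /**
     * Transcribe long audio in segments
     */
    public String transcribeLongAudio(Resource audioFile) throws IOException {
        // Very long recordings have to be processed in segments:
        // the OpenAI Whisper API accepts at most 25 MB per file (about 2.5 hours of audio at low bitrates)
        
        // Larger files must be split first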
        List<Resource> segments = splitAudio(audioFile);
        
        StringBuilder transcript = new StringBuilder();
        
        for (Resource segment : segments) {
            AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(segment)
                .build();
            
            String text = transcriptionModel.transcribe(request)
                .getResult().getOutput();
            
            transcript.append(text).append(" ");
        }
        
        return transcript.toString().trim();
    }

    private List<Resource> splitAudio(Resource audioFile) {
        // Split the audio with FFmpeg (placeholder: returns the file unsplit)
        return List.of(audioFile);
    }
}
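
The `splitAudio` placeholder above needs to know where to cut. A simple approach is to plan fixed-length segments from the total duration. The sketch below does only that planning; `SegmentPlanner` and its `Segment` record are hypothetical names, and choosing a segment length that keeps each piece under the size limit is left to the caller.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class SegmentPlanner {

    /** One planned cut: where it starts and how long it lasts. */
    record Segment(Duration start, Duration length) {}

    private SegmentPlanner() {}

    /** Covers [0, total) with consecutive segments no longer than maxSegment. */
    static List<Segment> plan(Duration total, Duration maxSegment) {
        if (maxSegment.isZero() || maxSegment.isNegative()) {
            throw new IllegalArgumentException("maxSegment must be positive");
        }
        List<Segment> segments = new ArrayList<>();
        Duration start = Duration.ZERO;
        while (start.compareTo(total) < 0) {
            Duration remaining = total.minus(start);
            // The last segment is shorter when the total is not an exact multiple
            Duration length = remaining.compareTo(maxSegment) < 0 ? remaining : maxSegment;
            segments.add(new Segment(start, length));
            start = start.plus(length);
        }
        return segments;
    }

    public static void main(String[] args) {
        plan(Duration.ofMinutes(25), Duration.ofMinutes(10)).forEach(System.out::println);
    }
}
```

Each planned segment corresponds to one FFmpeg invocation with `-ss` (start offset) and `-t` (duration). Cutting at fixed offsets can split a word in half, so overlapping segments slightly or cutting at silences are possible refinements.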

3. Caching transcription results

@Service
public class CachedTranscriptionService {

    @Autowired
    private AudioTranscriptionModel transcriptionModel;

    private final Cache<String, String> transcriptCache = 
        Caffeine.newBuilder()
            .maximumSize(1000)
            .expireAfterAccess(Duration.ofHours(24))
            .build();

    public String transcribeWithCache(Resource audioFile) throws IOException {
        String hash = computeHash(audioFile);
        
        return transcriptCache.get(hash, key -> {
            AudioTranscriptionRequest request = AudioTranscriptionRequest.builder()
                .audio(audioFile)
                .build();
            
            return transcriptionModel.transcribe(request)
                .getResult().getOutput();
        });
    }

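    private String computeHash(Resource resource) throws IOException {
        // The content digest is the cache key; close the stream after hashing
        try (InputStream in = resource.getInputStream()) {
            return DigestUtils.md5Hex(in);
        }
    }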
}
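
The cache key above relies on `DigestUtils` from Apache Commons Codec. If that dependency is not wanted, the same kind of content hash can be computed with the JDK alone. This is a sketch assuming Java 17 or later (for `HexFormat`); `ContentHash` is a hypothetical helper, and SHA-256 is swapped in for MD5 here as a design choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class ContentHash {

    private ContentHash() {}

    /** Hashes a stream incrementally so large audio files are not held in memory. */
    static String sha256Hex(InputStream in) throws IOException {
        MessageDigest digest = newDigest();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    /** Convenience overload for data that is already in memory. */
    static String sha256Hex(byte[] data) {
        return HexFormat.of().formatHex(newDigest().digest(data));
    }

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc".getBytes(java.nio.charset.StandardCharsets.UTF_8)));
    }
}
```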

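Summary

In this chapter we covered:

Speech recognition (ASR): converting audio to text
Text-to-speech (TTS): converting text to audio
Supported models: OpenAI Whisper, TTS, and others
Configuration options: language, voice, speed, and so on
Practical applications: meeting transcription, podcast transcription, voice assistants, speech translation, audiobooks
Best practices: preprocessing, segmenting long audio, caching

Exercises

Implement a service that converts voice messages to text
Build a multilingual speech translation feature
Implement a simple voice chatbot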
