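Multimodal Processing

Multimodal AI can process several kinds of data, such as text, images, and audio, in the same request. This chapter covers the multimodal features in Spring AI, focusing on image understanding and multimodal conversations.

What Is Multimodality?

Multimodal AI models can understand and generate several types of content: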

┌──────────────────────────────────────────────────────────────────────┐
│                      Multimodal AI capabilities                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Input            Output    Use cases                               │
│   ────────────────────────────────────────────────────────────────   │
│   Text + image ──> Text      Image understanding, visual Q&A         │
│   Text         ──> Image     Text-to-image (AI art)                  │
│   Text + image ──> Image     Image editing, style transfer           │
│   Audio        ──> Text      Speech-to-text                          │
│   Text         ──> Audio     Text-to-speech                          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

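Models with Multimodal Support

Model	Provider	Supported modalities
GPT-4o / GPT-4 Vision	OpenAI	Text and image input
Claude 3	Anthropic	Text and image input
Gemini Pro Vision	Google	Text and image input
LLaVA	Ollama (local)	Text and image input
Qwen-VL	Alibaba Cloud	Text and image input

Image Input

Basic Image Understanding

Spring AI represents images and other media content with the Media class: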

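import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.model.Media; // package and constructors vary by Spring AI release; check the Javadoc for your version
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeTypeUtils;

@Service
public class ImageAnalysisService {

    // Assumes a ChatClient bean is defined in your configuration;
    // Spring AI auto-configures only ChatClient.Builder.
    @Autowired
    private ChatClient chatClient;

    /**
     * Analyze the content of an image.
     */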
    public String analyzeImage(String imageUrl, String question) {
        var userMessage = new UserMessage(
            question,
            List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageUrl))
        );
        
        return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
    }
}
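
When an image is passed as raw bytes rather than a URL, it is commonly sent to the provider inline as Base64. The sketch below shows that encoding step on its own, using only the JDK. The class name `ImageDataUrl` is hypothetical, and whether a given provider accepts a `data:` URL is an assumption to check against its documentation.

```java
import java.util.Base64;

/** Hypothetical helper: turns raw image bytes into an inline data URL. */
final class ImageDataUrl {

    private ImageDataUrl() {}

    /** Builds "data:<mimeType>;base64,<payload>" from the given bytes. */
    static String encode(String mimeType, byte[] imageData) {
        if (mimeType == null || mimeType.isBlank()) {
            throw new IllegalArgumentException("mimeType is required");
        }
        String payload = Base64.getEncoder().encodeToString(imageData);
        return "data:" + mimeType + ";base64," + payload;
    }

    /** Base64 output length for n input bytes: 4 characters per 3-byte group, padded. */
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }
}
```

Base64 inflates the payload by about a third, so a 3 MB image becomes roughly 4 MB on the wire. That is worth keeping in mind when checking an image against a provider's size limit.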

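Using the ChatClient Fluent API

@GetMapping("/analyze")
public String analyzeImage(
        @RequestParam String imageUrl,
        @RequestParam(defaultValue = "Describe this image") String question)
        throws MalformedURLException {
    
    // new URL(...) throws a checked exception, so create it outside the lambda
    URL url = new URL(imageUrl);
    
    return chatClient.prompt()
        .user(u -> u.text(question)
            .media(MimeTypeUtils.IMAGE_JPEG, url))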
        .call()
        .content();
}
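
Because this endpoint forwards a caller-supplied URL to the model, it is worth rejecting anything that is not a plain web address before building the request. The following is a minimal sketch using only the JDK. `ImageUrlValidator` is a hypothetical helper, and the http/https-only policy is an assumption you may want to tighten, for example with a host allowlist.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical helper: accepts only absolute http/https URLs that have a host. */
final class ImageUrlValidator {

    private ImageUrlValidator() {}

    static URL parse(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("URL is required");
        }
        try {
            URI uri = new URI(raw.trim());
            String scheme = uri.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                throw new IllegalArgumentException("Only http and https URLs are allowed");
            }
            if (uri.getHost() == null) {
                throw new IllegalArgumentException("URL must contain a host");
            }
            return uri.toURL();
        } catch (URISyntaxException | MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + raw, e);
        }
    }
}
```

In the controller, the result of `ImageUrlValidator.parse(imageUrl)` could be passed to `media(...)` in place of a directly constructed URL.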

Uploading an Image from the Client

@RestController
@RequestMapping("/api/vision")
public class VisionController {

    private final ChatClient chatClient;

    public VisionController(ChatClient.Builder builder) {
        this.chatClient = builder
            .defaultSystem("You are a professional image analysis assistant. Describe and analyze the image content in detail.")
            .build();
    }

    /**
     * Upload and analyze an image.
     */
    @PostMapping("/analyze")
    public String analyzeUploadedImage(
            @RequestParam("file") MultipartFile file,
            @RequestParam(defaultValue = "Describe this image in detail") String question) throws IOException {
        
        // Use the MIME type reported by the client, falling back to JPEG
        String contentType = file.getContentType();
        MimeType mimeType = MimeType.valueOf(contentType != null ? contentType : "image/jpeg");
        
        // Build the user message
        var userMessage = new UserMessage(
            question,
            List.of(new Media(mimeType, file.getBytes()))
        );
        
        return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
    }
}

Reading an Image from the File System

@Service
public class LocalImageService {

    @Autowired
    private ChatClient chatClient;

    public String analyzeLocalImage(Path imagePath, String question) throws IOException {
        byte[] imageData = Files.readAllBytes(imagePath);
        
        // Determine the MIME type from the file extension
        String fileName = imagePath.getFileName().toString();
        MimeType mimeType = determineMimeType(fileName);
        
        var userMessage = new UserMessage(
            question,
            List.of(new Media(mimeType, imageData))
        );
        
        return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
    }

    private MimeType determineMimeType(String fileName) {
        // Compare case-insensitively so that "PHOTO.PNG" is handled too
        String name = fileName.toLowerCase(java.util.Locale.ROOT);
        if (name.endsWith(".png")) return MimeTypeUtils.IMAGE_PNG;
        if (name.endsWith(".gif")) return MimeTypeUtils.IMAGE_GIF;
        if (name.endsWith(".webp")) return MimeType.valueOf("image/webp");
        return MimeTypeUtils.IMAGE_JPEG; // default to JPEG
    }
}
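
A file extension can be missing or wrong, so an alternative is to look at the first bytes of the file, where common image formats carry a fixed signature. The sketch below checks the PNG, JPEG, GIF, and WebP signatures using only the JDK. `ImageTypeSniffer` is a hypothetical helper, not part of Spring AI.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper: guesses an image MIME type from its leading bytes. */
final class ImageTypeSniffer {

    private ImageTypeSniffer() {}

    static Optional<String> detect(byte[] data) {
        if (startsWith(data, 0, new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A})) {
            return Optional.of("image/png");
        }
        if (startsWith(data, 0, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return Optional.of("image/jpeg");
        }
        if (startsWith(data, 0, ascii("GIF87a")) || startsWith(data, 0, ascii("GIF89a"))) {
            return Optional.of("image/gif");
        }
        // WebP is a RIFF container: "RIFF", four size bytes, then "WEBP"
        if (startsWith(data, 0, ascii("RIFF")) && startsWith(data, 8, ascii("WEBP"))) {
            return Optional.of("image/webp");
        }
        return Optional.empty();
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** True when data contains prefix starting at the given offset. */
    private static boolean startsWith(byte[] data, int offset, byte[] prefix) {
        if (data == null || data.length < offset + prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An empty result means the bytes match none of the four signatures, in which case falling back to the extension or rejecting the file are both reasonable choices.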

Multiple Image Input

Analyzing Multiple Images

@PostMapping("/compare")
public String compareImages(
        @RequestParam("file1") MultipartFile file1,
        @RequestParam("file2") MultipartFile file2,
        @RequestParam(defaultValue = "Compare these two images and describe their similarities and differences") String question) throws IOException {
    
    // Assumes both uploads are JPEG; derive the type from getContentType() otherwise
    var userMessage = new UserMessage(
        question,
        List.of(
            new Media(MimeTypeUtils.IMAGE_JPEG, file1.getBytes()),
            new Media(MimeTypeUtils.IMAGE_JPEG, file2.getBytes())
        )
    );
    
    return chatClient.prompt()
        .messages(userMessage)
        .call()
        .content();
}

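Analyzing an Image Sequence

@PostMapping("/analyze-sequence")
public String analyzeSequence(
        @RequestParam("files") List<MultipartFile> files,
        @RequestParam String question) throws IOException {
    
    // Pass the images in upload order so the model sees them as a sequence
    List<Media> mediaList = files.stream()
        .map(file -> {
            try {
                // getContentType() may be null; fall back to JPEG
                String contentType = file.getContentType();
                return new Media(
                    MimeType.valueOf(contentType != null ? contentType : "image/jpeg"),
                    file.getBytes()
                );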
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        })
        .toList();
    
    var userMessage = new UserMessage(question, mediaList);
    
    return chatClient.prompt()
        .messages(userMessage)
        .call()
        .content();
}

Practical Examples

1. OCR Text Recognition

@Service
public class OcrService {

    private final ChatClient ocrClient;

    public OcrService(ChatClient.Builder builder) {
        this.ocrClient = builder
            .defaultSystem("""
                You are an OCR expert.
                Accurately recognize all text in the image.
                Preserve the original formatting and layout.
                Mark any text that is blurry or that you are unsure about.
                """)
            .build();
    }

    /**
     * Recognize the text in an image.
     */
    public OcrResult recognizeText(byte[] imageData, MimeType mimeType) {
        var userMessage = new UserMessage(
            "Recognize all the text in this image",
            List.of(new Media(mimeType, imageData))
        );
        
        String text = ocrClient.prompt()
            .messages(userMessage)
            .call()
            .content();
        
        return new OcrResult(text, calculateConfidence(text));
    }

    private double calculateConfidence(String text) {
        // Placeholder: a fixed value for any non-empty result, not a measured confidence
        return text != null && !text.isEmpty() ? 0.85 : 0.0;
    }

    record OcrResult(String text, double confidence) {}
}

2. Extracting Content from Images

@Service
public class ContentExtractionService {

    private final ChatClient extractionClient;

    public ContentExtractionService(ChatClient.Builder builder) {
        this.extractionClient = builder
            .defaultSystem("""
                You are a content extraction expert.
                Extract structured information from the image and return it as JSON.
                """)
            .build();
    }

    /**
     * Extract structured information from an image.
     */
    public <T> T extractInfo(byte[] imageData, MimeType mimeType, 
                             String extractionPrompt, Class<T> type) {
        var userMessage = new UserMessage(
            extractionPrompt,
            List.of(new Media(mimeType, imageData))
        );
        
        return extractionClient.prompt()
            .messages(userMessage)
            .call()
            .entity(type);
    }

    /**
     * Extract invoice information.
     */
    public InvoiceInfo extractInvoice(byte[] imageData) {
        return extractInfo(imageData, MimeTypeUtils.IMAGE_JPEG, """
            Extract the following information from this invoice image:
            - Invoice number
            - Issue date
            - Company name
            - Total amount
            - Tax amount
            """, InvoiceInfo.class);
    }

    /**
     * Extract business card information.
     */
    public BusinessCardInfo extractBusinessCard(byte[] imageData) {
        return extractInfo(imageData, MimeTypeUtils.IMAGE_JPEG, """
            Extract the following information from this business card:
            - Name
            - Company
            - Job title
            - Phone
            - Email
            - Address
            """, BusinessCardInfo.class);
    }

    record InvoiceInfo(
        String invoiceNumber,
        String date,
        String companyName,
        double totalAmount,
        double taxAmount
    ) {}

    record BusinessCardInfo(
        String name,
        String company,
        String title,
        String phone,
        String email,
        String address
    ) {}
}
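
Values extracted by a model are not guaranteed to be complete or consistent, so it helps to validate them before they reach business logic. The following is a sketch under stated assumptions: `InvoiceChecks` and its rules are hypothetical, and amounts are modeled as `BigDecimal` here (rather than the `double` used in the record above) because binary floating point cannot represent most decimal currency values exactly.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical validation of extracted invoice fields; the rules are illustrative. */
final class InvoiceChecks {

    private InvoiceChecks() {}

    /** Returns a list of problems; an empty list means the values look plausible. */
    static List<String> validate(String invoiceNumber, BigDecimal totalAmount, BigDecimal taxAmount) {
        List<String> problems = new ArrayList<>();
        if (invoiceNumber == null || invoiceNumber.isBlank()) {
            problems.add("invoice number is missing");
        }
        if (totalAmount == null || totalAmount.signum() < 0) {
            problems.add("total amount is missing or negative");
        }
        if (taxAmount == null || taxAmount.signum() < 0) {
            problems.add("tax amount is missing or negative");
        }
        // Only compare the two amounts when both are present
        if (totalAmount != null && taxAmount != null && taxAmount.compareTo(totalAmount) > 0) {
            problems.add("tax amount exceeds total amount");
        }
        return problems;
    }
}
```

A non-empty result is a signal to retry the extraction or route the image to manual review rather than store the values as they are.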

3. Image Classification

@Service
public class ImageClassificationService {

    private final ChatClient classificationClient;

    public ImageClassificationService(ChatClient.Builder builder) {
        this.classificationClient = builder
            .defaultSystem("""
                You are an image classification expert.
                Classify images based on their content.
                Return the classification result as JSON.
                """)
            .build();
    }

    /**
     * Classify an image into one of the given categories.
     */
    public ClassificationResult classify(byte[] imageData, 
                                         MimeType mimeType,
                                         List<String> categories) {
        String prompt = String.format("""
            Classify this image into one of the following categories: %s
            
            Return JSON in this format:
            {
              "category": "the chosen category",
              "confidence": 0.95,
              "reason": "why this category was chosen"
            }
            """, String.join(", ", categories));
        
        var userMessage = new UserMessage(
            prompt,
            List.of(new Media(mimeType, imageData))
        );
        
        return classificationClient.prompt()
            .messages(userMessage)
            .call()
            .entity(ClassificationResult.class);
    }

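    /**
     * Detect the content type of an image.
     */
    public ContentDetection detectContent(byte[] imageData, MimeType mimeType) {
        var userMessage = new UserMessage("""
            Detect the content type of this image and return JSON in this format:
            {
              "type": "photo/illustration/chart/screenshot/document",
              "subject": "description of the main subject",
              "hasText": true/false,
              "hasPeople": true/false,
              "colors": ["dominant colors"]
            }
            """,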
            List.of(new Media(mimeType, imageData))
        );
        
        return classificationClient.prompt()
            .messages(userMessage)
            .call()
            .entity(ContentDetection.class);
    }

    record ClassificationResult(
        String category,
        double confidence,
        String reason
    ) {}

    record ContentDetection(
        String type,
        String subject,
        boolean hasText,
        boolean hasPeople,
        List<String> colors
    ) {}
}
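
The prompt asks the model to choose from a fixed list, but nothing forces the returned `category` to match that list exactly, and the reported `confidence` is only the model's own estimate. A small post-processing step can normalize both. `CategoryMatcher` below is a hypothetical helper that uses only the JDK.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical post-processing for a model-provided classification. */
final class CategoryMatcher {

    private CategoryMatcher() {}

    /** Maps the model's answer onto an allowed category, ignoring case and outer whitespace. */
    static Optional<String> match(String modelCategory, List<String> allowed) {
        if (modelCategory == null) {
            return Optional.empty();
        }
        String wanted = modelCategory.trim().toLowerCase(Locale.ROOT);
        return allowed.stream()
            .filter(c -> c.trim().toLowerCase(Locale.ROOT).equals(wanted))
            .findFirst();
    }

    /** Clamps a reported confidence into [0, 1]; NaN becomes 0. */
    static double clampConfidence(double confidence) {
        if (Double.isNaN(confidence)) {
            return 0.0;
        }
        return Math.max(0.0, Math.min(1.0, confidence));
    }
}
```

When `match` returns an empty result, the answer fell outside the requested categories and can be treated as unclassified.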

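4. Image Question Answering

@RestController
@RequestMapping("/api/image-qa")
public class ImageQAController {

    private final ChatClient qaClient;

    public ImageQAController(ChatClient.Builder builder) {
        this.qaClient = builder
            .defaultSystem("""
                You are an image question-answering assistant.
                Answer the user's question accurately based on the image content.
                If the question goes beyond what the image shows, say so honestly.
                """)
            .build();
    }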

    @PostMapping("/ask")
    public Answer askAboutImage(
            @RequestParam("file") MultipartFile file,
            @RequestParam String question) throws IOException {
        
        var userMessage = new UserMessage(
            question,
            List.of(new Media(
                MimeType.valueOf(file.getContentType()),
                file.getBytes()
            ))
        );
        
        ChatResponse response = qaClient.prompt()
            .messages(userMessage)
            .call()
            .chatResponse();
        
        return new Answer(
            response.getResult().getOutput().getContent(),
            response.getMetadata().getUsage().getTotalTokens()
        );
    }

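    @PostMapping("/batch-ask")
    public List<Answer> batchAsk(
            @RequestParam("file") MultipartFile file,
            // A multipart upload cannot also carry a JSON @RequestBody,
            // so the questions arrive as repeated form parameters
            @RequestParam("questions") List<String> questions) throws IOException {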
        
        byte[] imageData = file.getBytes();
        MimeType mimeType = MimeType.valueOf(file.getContentType());
        
        return questions.stream()
            .map(question -> {
                var userMessage = new UserMessage(
                    question,
                    List.of(new Media(mimeType, imageData))
                );
                
                String answer = qaClient.prompt()
                    .messages(userMessage)
                    .call()
                    .content();
                
                return new Answer(answer, 0);
            })
            .toList();
    }

    // long works whether getTotalTokens() returns Long or Integer in your Spring AI release
    record Answer(String content, long tokens) {}
}

5. Generating Image Descriptions

@Service
public class ImageDescriptionService {

    private final ChatClient descriptionClient;

    public ImageDescriptionService(ChatClient.Builder builder) {
        this.descriptionClient = builder
            .defaultSystem("You are a professional image description expert. Describe image content in clear language.")
            .build();
    }

    /**
     * Generate a brief description.
     */
    public String generateBriefDescription(byte[] imageData, MimeType mimeType) {
        var userMessage = new UserMessage(
            "Describe the main content of this image in one sentence",
            List.of(new Media(mimeType, imageData))
        );
        
        return descriptionClient.prompt()
            .messages(userMessage)
            .call()
            .content();
    }

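    /**
     * Generate a detailed description.
     */
    public ImageDescription generateDetailedDescription(byte[] imageData, 
                                                        MimeType mimeType,
                                                        String language) {
        String prompt = String.format("""
            Describe this image in detail in %s, covering:
            1. Main content and subject
            2. Scene and background
            3. Colors and composition
            4. Overall atmosphere and mood
            
            Return the result as JSON.
            """, language);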
        
        var userMessage = new UserMessage(
            prompt,
            List.of(new Media(mimeType, imageData))
        );
        
        return descriptionClient.prompt()
            .messages(userMessage)
            .call()
            .entity(ImageDescription.class);
    }

    /**
     * Generate tags for an image.
     */
    public List<String> generateTags(byte[] imageData, MimeType mimeType) {
        var userMessage = new UserMessage(
            "Generate 5-10 tags for this image and return them as a JSON array",
            List.of(new Media(mimeType, imageData))
        );
        
        return descriptionClient.prompt()
            .messages(userMessage)
            .call()
            .entity(new ParameterizedTypeReference<List<String>>() {});
    }

    record ImageDescription(
        String mainContent,
        String scene,
        String composition,
        String atmosphere
    ) {}
}

Model Configuration

OpenAI GPT-4 Vision

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o  # a vision-capable model
          max-tokens: 1000

Ollama Local Multimodal Models

spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llava  # or bakllava, moondream

Pull the models:

# Pull the LLaVA model
ollama pull llava

# Pull Moondream (lightweight)
ollama pull moondream

Anthropic Claude

spring:
  ai:
    anthropic:
      api-key: ${ANTHROPIC_API_KEY}
      chat:
        options:
          model: claude-3-5-sonnet-20241022

Image Size and Format Limits

Size Limits

Model	Maximum image size	Recommended size
GPT-4 Vision	20 MB	< 4 MB
Claude 3	5 MB	< 2 MB
LLaVA (local)	Depends on available memory	< 10 MB
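
The limits in the table can be enforced before a request is sent, which fails fast instead of waiting for the provider to reject the upload. The following is a minimal sketch: `ImageSizeGuard` and the provider keys are hypothetical, and the byte limits simply restate the table above, so verify them against current provider documentation.

```java
import java.util.Map;

/** Hypothetical pre-flight check that restates the size table as code. */
final class ImageSizeGuard {

    private static final long MB = 1024L * 1024L;

    // Maximum image size per provider key, taken from the table above
    private static final Map<String, Long> MAX_BYTES = Map.of(
        "openai", 20 * MB,
        "anthropic", 5 * MB
    );

    private ImageSizeGuard() {}

    /** Returns true when the image fits the configured limit for the provider. */
    static boolean fits(String provider, long imageBytes) {
        Long limit = provider == null ? null : MAX_BYTES.get(provider);
        if (limit == null) {
            throw new IllegalArgumentException("No size limit configured for: " + provider);
        }
        return imageBytes <= limit;
    }
}
```

Local models are left out on purpose, since their practical limit depends on the memory of the machine running them.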

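Compressing Images

import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import org.springframework.stereotype.Service;

@Service
public class ImageCompressionService {

    /**
     * Re-encode an image as JPEG, lowering the quality until it fits within maxSizeKB.
     * Always returns JPEG bytes; if the target cannot be reached, returns the last attempt.
     */
    public byte[] compressImage(byte[] imageData, long maxSizeKB) throws IOException {
        BufferedImage source = ImageIO.read(new ByteArrayInputStream(imageData));
        if (source == null) {
            throw new IOException("Unsupported or corrupt image data");
        }
        // JPEG has no alpha channel, so redraw onto an opaque RGB canvas first
        BufferedImage image = new BufferedImage(
            source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        g.drawImage(source, 0, 0, Color.WHITE, null);
        g.dispose();
        
        byte[] result;
        int step = 9; // quality = step / 10, from 0.9 down to 0.1 without float drift
        do {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            
            ImageWriter writer = ImageIO.getImageWritersByFormatName("jpg").next();
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(step / 10f);
            
            try (ImageOutputStream ios = ImageIO.createImageOutputStream(bos)) {
                writer.setOutput(ios);
                writer.write(null, new IIOImage(image, null, null), param);
            } finally {
                writer.dispose();
            }
            
            result = bos.toByteArray();
            step--;
        } while (result.length > maxSizeKB * 1024 && step >= 1);
        
        return result;
    }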
}
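
Lowering JPEG quality is only one lever; reducing the pixel dimensions usually shrinks the payload further. The sketch below scales an image so that its longer side does not exceed a limit, using only `java.awt`. `ImageDownscaler` and the `maxSide` parameter are hypothetical names, and a suitable limit depends on the model you target.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical helper: shrinks an image so that neither side exceeds maxSide. */
final class ImageDownscaler {

    private ImageDownscaler() {}

    static BufferedImage fitWithin(BufferedImage source, int maxSide) {
        if (maxSide <= 0) {
            throw new IllegalArgumentException("maxSide must be positive");
        }
        int width = source.getWidth();
        int height = source.getHeight();
        int longer = Math.max(width, height);
        if (longer <= maxSide) {
            return source; // never upscale
        }
        // Scale both sides by the same factor to keep the aspect ratio
        double scale = (double) maxSide / longer;
        int newWidth = Math.max(1, (int) Math.round(width * scale));
        int newHeight = Math.max(1, (int) Math.round(height * scale));

        BufferedImage scaled = new BufferedImage(newWidth, newHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, newWidth, newHeight, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }
}
```

Downscaling before the quality loop means the compressor starts from fewer pixels, so it tends to reach the target size at a higher quality setting.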

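Best Practices

1. Optimize Image Size

@RestController
public class OptimizedVisionController {

    private final ChatClient chatClient;
    private final ImageCompressionService compressionService;

    public OptimizedVisionController(ChatClient.Builder builder,
                                     ImageCompressionService compressionService) {
        this.chatClient = builder.build();
        this.compressionService = compressionService;
    }

    @PostMapping("/analyze-optimized")
    public String analyzeOptimized(
            @RequestParam("file") MultipartFile file,
            @RequestParam String question) throws IOException {
        
        // Compress the image to under 500 KB
        byte[] compressed = compressionService.compressImage(
            file.getBytes(), 
            500
        );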
        
        var userMessage = new UserMessage(
            question,
            List.of(new Media(MimeTypeUtils.IMAGE_JPEG, compressed))
        );
        
        return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
    }
}

2. Cache Image Analysis Results

@Service
public class CachedImageAnalysisService {

    @Autowired
    private ChatClient chatClient;
    
    private final Cache<String, String> analysisCache = 
        Caffeine.newBuilder()
            .maximumSize(1000)
            .expireAfterAccess(Duration.ofHours(24))
            .build();

    public String analyzeImage(String imageHash, byte[] imageData, String question) {
        // Key on the full question text; String.hashCode() values can collide
        String cacheKey = imageHash + ":" + question;
        
        return analysisCache.get(cacheKey, key -> {
            var userMessage = new UserMessage(
                question,
                List.of(new Media(MimeTypeUtils.IMAGE_JPEG, imageData))
            );
            
            return chatClient.prompt()
                .messages(userMessage)
                .call()
                .content();
        });
    }
}
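
The method above expects the caller to supply `imageHash`. One way to derive it is a SHA-256 digest of the image bytes, so identical uploads map to the same cache entry regardless of file name. The following sketch uses only the JDK; `ImageHasher` is a hypothetical helper.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: content hash for use as the image part of a cache key. */
final class ImageHasher {

    private ImageHasher() {}

    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256
            throw new IllegalStateException(e);
        }
    }
}
```

Hashing the content rather than the file name also means that a renamed copy of the same image still produces a cache hit.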

3. Process Images in Batches

@Service
public class BatchImageService {

    @Autowired
    private ChatClient chatClient;

    @Async // takes effect only when @EnableAsync is present on a configuration class
    public CompletableFuture<List<AnalysisResult>> analyzeBatch(
            List<ImageData> images, 
            String commonQuestion) {
        
        List<AnalysisResult> results = images.stream()
            .map(img -> {
                var userMessage = new UserMessage(
                    commonQuestion,
                    List.of(new Media(img.mimeType(), img.data()))
                );
                
                String result = chatClient.prompt()
                    .messages(userMessage)
                    .call()
                    .content();
                
                return new AnalysisResult(img.id(), result);
            })
            .toList();
        
        return CompletableFuture.completedFuture(results);
    }

    record ImageData(String id, byte[] data, MimeType mimeType) {}
    record AnalysisResult(String imageId, String analysis) {}
}
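
The stream above processes the images one at a time inside a single asynchronous task. When the requests are independent, they can run concurrently under a cap that respects provider rate limits. The following is a sketch under stated assumptions: `BoundedBatch` is hypothetical, the `Function` parameter stands in for the blocking model call, and the pool size is a tuning choice.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper: runs a blocking task per item with bounded parallelism. */
final class BoundedBatch {

    private BoundedBatch() {}

    /** Applies task to every item using at most maxParallel threads; results keep input order. */
    static <I, O> List<O> run(List<I> items, int maxParallel, Function<I, O> task) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            // Submit everything first so the tasks can overlap
            List<CompletableFuture<O>> futures = items.stream()
                .map(item -> CompletableFuture.supplyAsync(() -> task.apply(item), pool))
                .toList();
            // join() rethrows a task failure wrapped in CompletionException
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }
}
```

In the service above, the lambda passed as `task` would wrap the `chatClient` call for a single image, and `maxParallel` would be set with the provider's rate limits in mind.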

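Summary

In this chapter we covered:

Multimodal concepts: AI that processes several data types together
Image input: by URL, upload, or local file
Multi-image processing: comparison and sequence analysis
Practical applications: OCR, information extraction, classification, and Q&A
Model configuration: OpenAI, Ollama, Claude, and others
Best practices: compression, caching, and batch processing

Exercises

Implement an image content moderation system
Create an invoice information extraction service
Implement image search based on content descriptions

Next Steps

Image Generation - learn about AI image generation
Audio Processing - learn about speech recognition and synthesis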
