Mistral AI recently released its Voxtral model family, which combines text and audio processing to support a wide range of applications. The series comprises two models: Voxtral-Mini-3B-2507 and Voxtral-Small-24B-2507. The former is an optimized 3-billion-parameter model suited to fast audio transcription and basic multimodal understanding, while the latter, at 24 billion parameters, supports more complex audio-text intelligence and multilingual processing, making it well suited to enterprise applications.


Both models accept long audio inputs of roughly 30 to 40 minutes, detect the spoken language automatically, and provide a 32,000-token context window. They are released under the Apache 2.0 license, making them suitable for both commercial and research projects, and their multimodal capabilities allow spoken and written communication to be handled in a single workflow.

In this article, we demonstrate how to host the Voxtral models on an Amazon SageMaker AI endpoint using vLLM and the "Bring Your Own Container" (BYOC) approach. vLLM is a high-performance inference library that manages GPU memory efficiently for large language models and supports tensor parallelism across multiple GPUs. SageMaker's BYOC feature lets users deploy custom container images, offering greater flexibility in model optimization and version control.
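To make the BYOC deployment concrete, the sketch below builds the request payloads that would be passed to SageMaker's `create_model` and `create_endpoint_config` APIs. The image URI, model name, instance type, and environment variables here are illustrative placeholders, not values from the article; the environment keys are assumed to be read by the custom container's entrypoint.

```python
# Sketch of SageMaker BYOC deployment payloads for a Voxtral endpoint.
# All names, ARNs, and instance types below are placeholders.

def build_model_payload(image_uri: str, role_arn: str) -> dict:
    """Request body for sagemaker.create_model with a custom vLLM image."""
    return {
        "ModelName": "voxtral-mini-vllm",      # hypothetical model name
        "PrimaryContainer": {
            "Image": image_uri,                # BYOC image pushed to ECR
            "Environment": {
                # vLLM settings passed into the container (assumed to be
                # consumed by the container's own startup script).
                "MODEL_ID": "mistralai/Voxtral-Mini-3B-2507",
                "TENSOR_PARALLEL_SIZE": "1",
            },
        },
        "ExecutionRoleArn": role_arn,
    }

def build_endpoint_config_payload(model_name: str) -> dict:
    """Request body for sagemaker.create_endpoint_config."""
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.g5.2xlarge",   # placeholder GPU instance
        }],
    }

# In practice these payloads would be sent via boto3, e.g.:
#   sm = boto3.client("sagemaker")
#   sm.create_model(**build_model_payload(image_uri, role_arn))
#   sm.create_endpoint_config(**build_endpoint_config_payload("voxtral-mini-vllm"))
```

Separating payload construction from the API calls keeps the configuration easy to inspect and version before anything is deployed.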

The entire deployment is orchestrated from a SageMaker notebook environment, which acts as the central control point: it builds and pushes the custom Docker image to Amazon Elastic Container Registry (ECR) and manages the model configuration and deployment workflow. Amazon S3 stores the key files required by the Voxtral implementation, keeping configuration cleanly separated from the container image.
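The image-push and configuration-upload steps above can be sketched as shell commands. The account ID, region, repository, and bucket names are placeholders, and a dry-run helper only prints each command so the sequence can be reviewed before running it against a real AWS account.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the BYOC push workflow. Account ID, region, repo,
# and bucket below are placeholders -- substitute your own values.
set -euo pipefail

ACCOUNT_ID="123456789012"   # placeholder AWS account
REGION="us-east-1"          # placeholder region
REPO="voxtral-vllm"         # hypothetical ECR repository name
TAG="latest"
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}"

# Dry-run helper: print each command instead of executing it.
run() { echo "+ $*"; }

# Authenticate Docker to ECR, then build and push the vLLM serving image.
run "aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"
run "docker build -t $IMAGE_URI ."
run "docker push $IMAGE_URI"

# Upload serving configuration files to S3 (bucket name is a placeholder).
run "aws s3 cp ./serving_config/ s3://my-voxtral-bucket/config/ --recursive"

echo "$IMAGE_URI"
```

Remove the `run` wrapper (or have it execute its arguments) once the placeholder values have been filled in.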

This solution supports a variety of use cases, including text-only conversational AI, accurate audio file transcription, and complex multimodal applications that combine audio and text intelligence. Users can switch between the Voxtral-Mini and Voxtral-Small models with a simple configuration update. Together, these multimodal capabilities give users flexible and efficient audio and text processing in one deployment.
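The model switch described above can be sketched as a small configuration mapping: changing one key selects the variant and the matching serving parameters. The tensor-parallel sizes and instance types here are assumptions for illustration, not values from the article.

```python
# Illustrative sketch: selecting a Voxtral variant via configuration.
# Parallelism and instance choices below are assumptions.
VOXTRAL_CONFIGS = {
    "mini": {
        "model_id": "mistralai/Voxtral-Mini-3B-2507",
        "tensor_parallel_size": 1,          # 3B fits on a single GPU
        "instance_type": "ml.g5.2xlarge",   # placeholder
    },
    "small": {
        "model_id": "mistralai/Voxtral-Small-24B-2507",
        "tensor_parallel_size": 4,          # shard 24B across GPUs
        "instance_type": "ml.g5.12xlarge",  # placeholder
    },
}

def serving_env(variant: str) -> dict:
    """Container environment variables for the chosen Voxtral variant."""
    cfg = VOXTRAL_CONFIGS[variant]
    return {
        "MODEL_ID": cfg["model_id"],
        "TENSOR_PARALLEL_SIZE": str(cfg["tensor_parallel_size"]),
    }
```

Redeploying with the other variant then only requires passing `serving_env("small")` instead of `serving_env("mini")` when creating the endpoint.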

Key points:

📌 The Voxtral model combines text and audio processing, supporting multiple application scenarios.  

🔧 Amazon SageMaker supports hosting the Voxtral model with custom containers, offering higher flexibility.  

💡 Supports various use cases, including text processing, audio transcription, and complex multimodal applications.