Pixtral 12B is now available on Amazon SageMaker JumpStart


Today, we are excited to announce that Pixtral 12B (pixtral-12b-2409), a state-of-the-art vision language model (VLM) from Mistral AI that excels in both text-only and multimodal tasks, is available for customers through Amazon SageMaker JumpStart. You can try this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference.

In this post, we walk through how to discover, deploy, and use the Pixtral 12B model for a variety of real-world vision use cases.

Pixtral 12B overview

Pixtral 12B represents Mistral’s first VLM and demonstrates strong performance across various benchmarks, outperforming other open models and matching larger models, according to Mistral. Pixtral is trained to understand both images and documents, and shows strong abilities in vision tasks such as chart and figure understanding, document question answering, multimodal reasoning, and instruction following, some of which we demonstrate later in this post with examples. Pixtral 12B is able to ingest images at their natural resolution and aspect ratio. Unlike other open source models, Pixtral doesn’t compromise on text benchmark performance, such as instruction following, coding, and math, to excel in multimodal tasks.

Mistral designed a novel architecture for Pixtral 12B to optimize for both speed and performance. The model has two components: a 400-million-parameter vision encoder, which tokenizes images, and a 12-billion-parameter multimodal transformer decoder, which predicts the next text token given a sequence of text and images. The newly trained vision encoder natively supports variable image sizes, which allows Pixtral to accurately understand complex diagrams, charts, and documents in high resolution, while providing fast inference on small images such as icons, clipart, and equations. This architecture allows Pixtral to process any number of images of arbitrary size, as long as they fit within its large context window of 128,000 tokens.
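
To make the context window figure more concrete, the following is a rough sketch for estimating how many tokens an image might consume. It assumes a 16x16-pixel patch size and one extra marker token per patch row, which is our reading of Mistral's public description of Pixtral and is not stated in this post. The function names are hypothetical, and the real tokenizer may resize images or count tokens differently.

```python
import math

PATCH_SIZE = 16           # assumption: patch edge length in pixels
CONTEXT_WINDOW = 128_000  # context length stated in this post


def estimate_image_tokens(width, height, patch_size=PATCH_SIZE):
    """Rough token estimate for one image at its native resolution."""
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    # One token per patch, plus one row-break or end marker per row (assumed).
    return rows * cols + rows


def fits_in_context(image_sizes, text_tokens=0):
    """Check whether a set of (width, height) images plus text fits the window."""
    total = text_tokens + sum(estimate_image_tokens(w, h) for w, h in image_sizes)
    return total <= CONTEXT_WINDOW


if __name__ == "__main__":
    print(estimate_image_tokens(512, 512))
    print(fits_in_context([(1024, 1024)] * 10, text_tokens=500))
```

Under these assumptions, small images cost only a handful of tokens, whereas a few dozen full-page, high-resolution scans could exhaust the window, so it can be worth budgeting before sending many images in one request.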

License agreements are a critical decision factor when using open-weights models. Similar to other Mistral models, such as Mistral 7B, Mixtral 8x7B, Mixtral 8x22B and Mistral Nemo 12B, Pixtral 12B is released under the commercially permissive Apache 2.0, providing enterprise and startup customers with a high-performing VLM option to build complex multimodal applications.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker Inference instances, including AWS Trainium and AWS Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This enforces data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune the model using SageMaker capabilities, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process. Note that fine-tuning of Pixtral 12B is not yet available on SageMaker JumpStart at the time of writing.

Prerequisites

To try out Pixtral 12B in SageMaker JumpStart, you need the following prerequisites:

Discover Pixtral 12B in SageMaker JumpStart

You can access Pixtral 12B through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an IDE that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio Classic.

  1. In SageMaker Studio, access SageMaker JumpStart by choosing JumpStart in the navigation pane.
  2. Choose HuggingFace to access the Pixtral 12B model.
  3. Search for the Pixtral 12B model.
  4. You can choose the model card to view details about the model such as license, data used to train, and how to use the model.
  5. Choose Deploy to deploy the model and create an endpoint.

Deploy the model in SageMaker JumpStart

Deployment starts when you choose Deploy. When deployment is complete, an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

To deploy using the SDK, we start by selecting the Pixtral 12B model, specified by the model_id with the value huggingface-vlm-mistral-pixtral-12b-2409. You can deploy the model on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel 

accept_eula = True 

model = JumpStartModel(model_id="huggingface-vlm-mistral-pixtral-12b-2409") 
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The end-user license agreement (EULA) value must be explicitly set to True in order to accept the EULA. Also, make sure that your account-level service quota allows one or more ml.p4d.24xlarge or ml.p4de.24xlarge instances for endpoint usage. To request a service quota increase, refer to AWS service quotas. After you deploy the model, you can run inference against the deployed endpoint through the SageMaker predictor.
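
As an illustration of overriding the defaults, the following sketch gathers non-default values in one place before passing them to JumpStartModel and deploy. The helper names build_deploy_settings and deploy_pixtral are hypothetical, and the instance type shown is only an example; choose one that your account quota allows.

```python
MODEL_ID = "huggingface-vlm-mistral-pixtral-12b-2409"


def build_deploy_settings(instance_type="ml.p4d.24xlarge", instance_count=1):
    """Collect non-default settings in one place so they are easy to review."""
    return {
        "model_kwargs": {"model_id": MODEL_ID, "instance_type": instance_type},
        "deploy_kwargs": {
            "accept_eula": True,  # must be set explicitly to accept the EULA
            "initial_instance_count": instance_count,
        },
    }


def deploy_pixtral(settings):
    """Create the JumpStart model and deploy it to a real-time endpoint."""
    # Imported lazily so the settings helper works without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(**settings["model_kwargs"])
    return model.deploy(**settings["deploy_kwargs"])


if __name__ == "__main__":
    settings = build_deploy_settings()
    print(settings)
    # predictor = deploy_pixtral(settings)  # creates billable resources
```

The deploy call is left commented out because it creates billable resources in your account.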

Pixtral 12B use cases

In this section, we provide examples of inference on Pixtral 12B with example prompts.
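
The payloads in this section reference local image files. One way to send such a file to a remote endpoint is to embed it in the request as a base64 data URL. The following is a minimal sketch of that approach; the helper name build_image_payload is hypothetical, and it assumes the endpoint accepts data URLs in the image_url field, which you should verify against the model card.

```python
import base64
import mimetypes


def build_image_payload(prompt, image_path, max_tokens=2000, temperature=0.6, top_p=0.9):
    """Build a chat-style payload with one text part and one inline image part."""
    # Guess the MIME type from the file extension; fall back to JPEG.
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    },
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


# Usage, assuming `predictor` is the object returned by model.deploy():
# payload = build_image_payload("Describe this image.", "Pixtral_data/amazon_s1_2.jpg")
# response = predictor.predict(payload)
```

The structure of the returned response depends on the serving container, so inspect it once before writing code that extracts the generated text.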

OCR

We use the following image as input for OCR.

We use the following prompt:

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract and transcribe all text visible in the image, preserving its exact formatting, layout, and any special characters. Include line breaks and maintain the original capitalization and punctuation.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "Pixtral_data/amazon_s1_2.jpg"
                    },
                },
            ],
        }
    ],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.9,
}

print(response)

We get the following output:

Approximate date of commencement of proposed sale to the public: AS SOON AS PRACTICABLE AFTER THIS REGISTRATION STATEMENT BECOMES EFFECTIVE. 
If any of the securities being registered on this Form are to be offered on a delayed or continuous basis pursuant to Rule 415 under the Securities Act of 1933, check the following box. 
[] If this Form is filed to register additional securities for an offering pursuant to Rule 462(b) under the Securities Act of 1933, check the following box and list the Securities Act registration statement number of the earlier effective registration statement for the same offering. 
[] If this Form is a post-effective amendment filed pursuant to Rule 462(c) under the Securities Act of 1933, check the following box and list the Securities Act registration statement number of the earlier effective registration statement for the same offering. 
[] If delivery of the prospectus is expected to be made pursuant to Rule 434, please check the following box. 
[] **CALCULATION OF REGISTRATION FEE** 
| TITLE OF EACH CLASS OF SECURITIES TO BE REGISTERED | AMOUNT TO BE REGISTERED(1) | PROPOSED MAXIMUM OFFERING PRICE PER SHARE(2) | PROPOSED MAXIMUM AGGREGATE OFFERING PRICE(2) | AMOUNT OF REGISTRATION FEE | 
|----------------------------------------------------|----------------------------|---------------------------------------------|---------------------------------------------|----------------------------| 
| Common Stock, $0.01 par value per share........... | 2,875,000 shares           | $14.00                                      | $40,250,000                                 | $12,197(3)                 | 

(1) Includes 375,000 shares that the Underwriters have the option to purchase to cover over-allotments, if any. 
(2) Estimated solely for the purpose of calculating the registration fee in accordance with Rule 457(c). 
(3) $11,326 of registration fee has been previously paid. ...

Chart understanding and analysis

For chart understanding and analysis, we use the following image as input.

We use the following prompt:

prompt= """
Analyze the attached image of the chart or graph. Your tasks are to:
Identify the type of chart or graph (e.g., bar chart, line graph, pie chart, etc.).
Extract the key data points, including labels, values, and any relevant scales or units.
Identify and describe the main trends, patterns, or significant observations presented in the chart.
Generate a clear and concise paragraph summarizing the extracted data and insights. The summary should highlight the most important information and provide an overview that would help someone understand the chart without seeing it.
Ensure that your summary is well-structured, accurately reflects the data, and is written in a professional tone.
"""
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "Pixtral_data/Amazon_Chart.png"
                    },
                },
            ],
        }
    ],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.9,
}

image_path = "Pixtral_data/Amazon_Chart.png"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print(response)

We get the following output:

The image is a bar chart titled "Segment Results – North America," which presents data on net sales and operating income over several quarters from Q2 2023 to Q2 2024. The chart is divided into two sections: one for net sales and the other for operating income.

### Key Data Points:
- Net Sales:
 - Q2 2023: $82,546 million
 - Q3 2023: Approximately $85,000 million
 - Q4 2023: Approximately $90,000 million
 - Q1 2024: Approximately $85,000 million
 - Q2 2024: $90,033 million
 - Year-over-Year (Y/Y) growth: 9%

- Operating Income:
 - Q2 2023: $3,211 million
 - Q3 2023: Approximately $4,000 million
 - Q4 2023: Approximately $7,000 million
 - Q1 2024: Approximately $5,000 million
 - Q2 2024: $5,065 million
 - Year-over-Year (Y/Y) growth: 58%

- Total Trailing Twelve Months (TTM):
 - Net Sales: $369.8 billion
 - Operating Income: $20.8 billion
...
- **Operating Income:** Operating income shows significant growth, particularly in Q4 2023, where it peaks. There is a notable year-over-year increase of 58%.

### Summary:
The bar chart illustrates the segment results for North America, focusing on net sales and operating income from Q2 2023 to Q2 2024. Net sales demonstrate a steady upward trend, culminating in a 9% year-over-year increase, with the highest value recorded in Q2 2024 at $90,033 million. Operating income exhibits more volatility, with a significant peak in Q4 2023, and an overall substantial year-over-year growth of 58%. The total trailing twelve months (TTM) figures indicate robust performance, with net sales reaching $369.8 billion and operating income at $20.8 billion. This data underscores a positive growth trajectory in both net sales and operating income for the North American segment over the observed period.

Image to code

For an image-to-code example, we use the following image as input.

We use the following prompt:

import re

from IPython.display import HTML, display


def extract_html(text):
    # Capture the contents of the first html-fenced block in the model reply
    pattern = r'```html\s*(.*?)\s*```'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


prompt = "Create HTML and CSS code for a minimalist and futuristic website to purchase luggage. Use the following image as template to create your own design."
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "Pixtral_data/Amazon_Chart.png"
                    },
                },
            ],
        }
    ],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.9,
}

print('Input Image:\n\n')
html_code = extract_html(response)
print(html_code)
display(HTML(html_code))

We get the following output:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Luggage Store</title>
    <link rel="stylesheet" href="styles.css">
</head>
<body>
    <header>
        <h1>Luggage Store</h1>
        <nav>
            <ul>
                <li><a href="#">Home</a></li>
                <li><a href="#">Products</a></li>
                <li><a href="#">About</a></li>
                <li><a href="#">Contact</a></li>
            </ul>
        </nav>
    </header>
...
        <p>&copy; 2023 Luggage Store. All rights reserved.</p>
    </footer>
</body>
</html>
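
To show how the extract_html helper behaves, the following sketch runs it on a made-up model reply. The sample reply is invented for illustration, and the fence marker is assembled from single backticks only to keep the listing easy to read; the pattern is otherwise the same as in the code above.

```python
import re

FENCE = "`" * 3  # three backticks


def extract_html(text):
    """Return the contents of the first html-fenced block, or None if absent."""
    pattern = FENCE + r"html\s*(.*?)\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


# Invented reply that wraps the markup in an html-fenced block.
sample_reply = (
    "Here is the page:\n"
    + FENCE + "html\n<h1>Luggage Store</h1>\n" + FENCE
    + "\nLet me know if you want changes."
)

# Invented reply that was cut off before the closing fence.
truncated_reply = "Here is the page:\n" + FENCE + "html\n<h1>Luggage"

if __name__ == "__main__":
    print(extract_html(sample_reply))
    print(extract_html(truncated_reply))
```

Because the helper returns None when the closing fence is missing, for example when the reply is cut off by max_tokens, it is worth checking the result before passing it to display.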

Clean up

After you are done, delete the SageMaker endpoints using the following code to avoid incurring unnecessary costs:

predictor.delete_model()
predictor.delete_endpoint()
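
If one of these calls fails, the other resource can be left running. The following sketch, using a hypothetical cleanup helper, attempts both deletions and reports any errors instead of stopping at the first one.

```python
def cleanup(predictor):
    """Delete the model and the endpoint, attempting both even if one fails."""
    errors = []
    for step in (predictor.delete_model, predictor.delete_endpoint):
        try:
            step()
        except Exception as exc:  # keep going so the other resource is still removed
            errors.append((step.__name__, exc))
    return errors


# Usage, assuming `predictor` is the object returned by model.deploy():
# for name, exc in cleanup(predictor):
#     print(f"{name} failed: {exc}")
```

An empty list means both calls completed; otherwise, check the SageMaker console to confirm that no endpoint is still running.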

Conclusion

In this post, we showed you how to get started with Mistral’s newest multimodal model, Pixtral 12B, in SageMaker JumpStart and deploy the model for inference. We also explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including other Mistral AI models, such as Mistral 7B and Mixtral 8x22B.

For more information about SageMaker JumpStart, refer to Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.

For more Mistral assets, check out the Mistral-on-AWS repo.


About the Authors

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Niithiyn Vijeaswaran is a GenAI Specialist Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML AWS services, including model offerings from top tier foundation model providers.
