Benchmarking Amazon Nova and GPT-4o models with FloTorch


Based on original post by Dr. Hemant Joshi, CTO, FloTorch.ai

A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI’s GPT-4o.

Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.

Using its enterprise software, FloTorch conducted an extensive comparison between Amazon Nova models and OpenAI’s GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch’s evaluation focused on three critical factors—latency, accuracy, and cost—across five diverse topics.

Key findings from the benchmark study:

  • GPT-4o demonstrated a slight advantage in accuracy over Amazon Nova Pro
  • Amazon Nova Pro outperformed GPT-4o in efficiency, operating 21.97% faster while being 65.26% more cost-effective
  • Amazon Nova Micro and Amazon Nova Lite outperformed GPT-4o mini in accuracy by 4 and 2 percentage points, respectively
  • In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 73.10% and 56.59% cheaper than GPT-4o mini, respectively
  • Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 20.48% and 26.60% improvements, respectively

In this post, we discuss the findings from this benchmarking in more detail.

The growing need for cost-effective AI models

The landscape of generative AI is rapidly evolving. OpenAI launched GPT-4o in May 2024, and Amazon introduced Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.

Large language models (LLMs) are generally proficient in responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.

One of the most critical applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is a crucial requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.

To better serve enterprise customers, the evaluation aimed to answer three key questions:

  • How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
  • How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini in these same metrics?
  • How well do these models handle RAG use cases across different industry domains?

By addressing these questions, the evaluation provides enterprises with actionable insights into selecting the right AI models for their specific needs—whether optimizing for speed, accuracy, or cost-efficiency.

Overview of the CRAG benchmark dataset

The CRAG dataset was released by Meta for testing factual queries across five domains, with eight question types and a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.

| Domain | Question | Question Type |
| --- | --- | --- |
| Sports | Can you carry less than the maximum number of clubs during a round of golf? | simple |
| Music | Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)? | simple_w_condition |
| Open | Can i make cookies in an air fryer? | simple |
| Finance | Did meta have any mergers or acquisitions in 2022? | simple_w_condition |
| Movie | In 2016, which movie was distinguished for its visual effects at the oscars? | simple_w_condition |

The evaluation considered 200 queries from this dataset representing the five domains and two question types: simple and simple_w_condition. Both question types are common from users, and a typical Google search for a query such as “Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)?” will not give you the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top five search result pages for each query. These five webpages act as a knowledge base (source data) to limit the RAG model’s responses. The goal is to index these five webpages dynamically using a common embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.

Evaluation setup

The RAG evaluation pipeline consists of several key components, as illustrated in the following diagram.

In this section, we explore each component in more detail.

Knowledge base

FloTorch used the top five HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. HTML pages were parsed to extract text for the embedding stage.

Chunking strategy

FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (four characters is usually around one token) and a 10% overlap between chunks. Further experiments with different chunking strategies, chunk sizes, and overlap percentages will be conducted in the coming weeks, and this post will be updated with the results.
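As an illustration of this approach, the following is a minimal sketch of fixed-size chunking with a 512-token window and 10% overlap. It uses whitespace tokens as a stand-in for model tokens; the tokenizer FloTorch actually used is not specified in this post.

from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap_pct: float = 0.10) -> List[str]:
    """Split text into fixed-size chunks with a percentage overlap between consecutive chunks."""
    tokens = text.split()  # stand-in tokenizer; swap in the embedding model's tokenizer as needed
    step = max(chunk_size - int(chunk_size * overlap_pct), 1)
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks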

Embedding strategy

FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1024. With the model’s maximum input limit of 8,192 tokens, the system efficiently embedded both the chunks from the knowledge base source data and the short queries from the CRAG dataset. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.
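For readers who want to reproduce this step, the following is a minimal sketch of embedding text with Amazon Titan Text Embeddings V2 through the Amazon Bedrock runtime; the region is illustrative, and error handling and retries are omitted.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # illustrative region

def embed_text(text: str, dimensions: int = 1024, normalize: bool = True) -> list:
    """Return an embedding vector for the given text using Amazon Titan Text Embeddings V2."""
    body = json.dumps({
        "inputText": text,          # the model accepts up to 8,192 input tokens
        "dimensions": dimensions,   # 1,024-dimensional output, as used in this evaluation
        "normalize": normalize,
    })
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]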

Vector database

FloTorch selected Amazon OpenSearch Service as the vector database for its high-performance metrics. The implementation included a provisioned three-node sharded OpenSearch Service cluster, with each node an r7g.4xlarge instance, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HNSW indexing in OpenSearch Service.
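The post does not show the index definition, but an illustrative sketch of creating a 1,024-dimension k-NN index with HNSW in OpenSearch Service using the opensearch-py client might look like the following; the endpoint, credentials, index name, and distance metric are placeholders.

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "your-domain-endpoint.us-east-1.es.amazonaws.com", "port": 443}],  # placeholder endpoint
    http_auth=("user", "password"),  # placeholder credentials
    use_ssl=True,
)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,              # matches the Titan Text Embeddings V2 output size
                "method": {
                    "name": "hnsw",             # HNSW indexing, as used in the evaluation
                    "space_type": "cosinesimil",  # placeholder distance metric
                    "engine": "nmslib",
                },
            },
        }
    },
}

client.indices.create(index="crag_chunks", body=index_body)  # illustrative index name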

Retrieval (and reranking) strategy

FloTorch used a k-nearest neighbors (k-NN) retrieval strategy that retrieves the top five chunks for each query. The experiments excluded reranking algorithms so that the retrieved chunks remained consistent for both models when inferring the answer to a given query. The following code snippet embeds the given query and passes the embeddings to the search function:

import logging
import os
from typing import List

logger = logging.getLogger(__name__)

def search_results(interaction_ids: List[str], queries: List[str], k: int):
    """Retrieve the top-k chunks for each query from its per-query index."""
    results = []
    embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
    normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

    for interaction_id, query in zip(interaction_ids, queries):
        try:
            # Embed the query with Amazon Titan Text Embeddings V2, then run a k-NN search
            # against the index built from that query's five CRAG webpages.
            _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
            results.append(search(interaction_id + '_titan', embedding, k))
        except Exception as e:
            logger.error(f"Error processing query {query}: {e}")
            results.append(None)
    return results
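The search helper called in the snippet above is not shown in this post. A minimal sketch of such a k-NN query, under the same assumptions as the illustrative index definition earlier, could look like the following.

def search(index_name: str, embedding: list, k: int = 5) -> list:
    """Retrieve the top-k chunks closest to the query embedding from an OpenSearch k-NN index."""
    query = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # field name from the illustrative index mapping above
                    "vector": embedding,
                    "k": k,
                }
            }
        },
    }
    response = client.search(index=index_name, body=query)  # `client` as defined in the earlier sketch
    return [
        {"text": hit["_source"]["text"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]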

Inferencing

FloTorch accessed the GPT-4o model through the OpenAI API and the Amazon Nova Pro model through the Amazon Bedrock Converse API. GPT-4o supports a context window of 128,000 tokens, compared to the 300,000-token context window of Amazon Nova Pro. The maximum output token limit of GPT-4o is 16,384 vs. 5,000 for Amazon Nova Pro. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls to both models through the same function and to track token counts and latency metrics uniformly. The inference function code is as follows:

from tqdm import tqdm

def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                       max_tokens: int, search_k: int, system_prompt: str):
    """Generate responses for all queries in the dataset, batch by batch."""
    results = []

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
        interaction_ids = [item["interaction_id"] for item in batch]
        queries = [item["query"] for item in batch]

        # Retrieve the top-k chunks for each query and attach them to the batch items.
        search_results_list = search_results(interaction_ids, queries, search_k)
        for i, item in enumerate(batch):
            item["search_results"] = search_results_list[i]

        # Send the batch through the universal gateway and record the predictions.
        responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)
        for i, response in enumerate(responses):
            results.append({
                "interaction_id": interaction_ids[i],
                "query": queries[i],
                "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
                "response_time": response.get("response_time") if response else None,
                "response": response,
            })

    return results
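Because both models were called through FloTorch’s universal gateway, the same OpenAI-style payload works for both. For readers who prefer to call Amazon Nova Pro directly, a minimal sketch using the Amazon Bedrock Converse API is shown below; the model ID is the on-demand Nova Pro identifier, and the region is illustrative.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # illustrative region

def converse_nova_pro(system_prompt: str, user_message: str, max_tokens: int = 1024) -> str:
    """Send a single-turn request to Amazon Nova Pro through the Bedrock Converse API."""
    response = bedrock_runtime.converse(
        modelId="amazon.nova-pro-v1:0",
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        inferenceConfig={"maxTokens": max_tokens},
    )
    return response["output"]["message"]["content"][0]["text"]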

Evaluation

Both models were evaluated by running batch queries. A batch size of eight was selected to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:

import time
from typing import Dict, List

import requests

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                       system_prompt: str):
    """Send one batch of queries to the inference API and capture per-query latency."""
    headers = {"Authorization": auth_header, "Content-Type": "application/json"}
    responses = []

    for item in batch:
        query = item["query"]
        query_time = item["query_time"]  # present in each CRAG record; not used in the prompt here
        retrieval_results = item.get("search_results") or []  # tolerate failed retrievals

        # Build the grounded prompt: the retrieved chunks followed by the question.
        references = "# References \n" + "\n".join(
            [f"Reference {_idx + 1}:\n{res['text']}\n" for _idx, res in enumerate(retrieval_results)])
        user_message = f"{references}\n------\n\nUsing only the references listed above, answer the following question:\nQuestion: {query}\n"

        payload = {
            "model": model_name,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "max_tokens": max_tokens,
        }

        try:
            start_time = time.time()
            response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
            response.raise_for_status()
            response_json = response.json()
            response_json['response_time'] = time.time() - start_time  # end-to-end latency in seconds
            responses.append(response_json)
        except requests.RequestException as e:
            logger.error(f"API request failed for query: {query}. Error: {e}")
            responses.append(None)

    return responses

Benchmarking on the CRAG dataset

In this section, we discuss the latency, accuracy, and cost measurements of benchmarking on the CRAG dataset.

Latency

Latency for each query response was measured as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and the timestamp when the entire response is received from the inference endpoint. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring rapid response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, this evaluation focused on measuring out-of-the-box latency for both models.

Accuracy

FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the evaluation script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications to the script enabled monitoring of input and output tokens and latency as described earlier.
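The modified evaluation prompt itself is part of the CRAG script and is not reproduced here. As a rough sketch of the judging step, invoking Mixtral 8x7B Instruct on Amazon Bedrock to categorize a response might look like the following; the prompt wording is hypothetical and the region is illustrative.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # illustrative region

JUDGE_PROMPT = (  # hypothetical judging prompt, not the one used in the CRAG script
    "<s>[INST] You are grading a RAG system's answer.\n"
    "Question: {question}\nGround truth: {ground_truth}\nModel answer: {prediction}\n"
    "Reply with exactly one word: correct, incorrect, or missing. [/INST]"
)

def judge_response(question: str, ground_truth: str, prediction: str) -> str:
    """Categorize a model response as correct, incorrect, or missing using Mixtral on Bedrock."""
    body = json.dumps({
        "prompt": JUDGE_PROMPT.format(question=question, ground_truth=ground_truth, prediction=prediction),
        "max_tokens": 10,
        "temperature": 0.0,
    })
    response = bedrock_runtime.invoke_model(
        modelId="mistral.mixtral-8x7b-instruct-v0:1",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["outputs"][0]["text"].strip().lower()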

Cost

Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o publish separate prices per million input tokens and per million output tokens. The methodology multiplies input tokens by the corresponding input rate, applies the same process to output tokens, and sums the two. The total cost for running 200 queries combines the input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the cost comparison focused solely on inference costs for the Amazon Nova Pro and GPT-4o LLMs.
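The arithmetic reduces to multiplying token counts by the published per-million-token rates. A minimal sketch is shown below; the rate values are placeholders to be replaced with the providers’ current price lists.

def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_million: float, output_price_per_million: float) -> float:
    """Total inference cost: token counts scaled to millions, times the published rates."""
    return (input_tokens / 1_000_000) * input_price_per_million \
        + (output_tokens / 1_000_000) * output_price_per_million

# Example usage with placeholder rates (replace INPUT_RATE and OUTPUT_RATE with published prices):
# total_cost = sum(
#     inference_cost(q["input_tokens"], q["output_tokens"], INPUT_RATE, OUTPUT_RATE)
#     for q in query_results
# )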

Results

The following table summarizes the results.

| Metric | Amazon Nova Pro | GPT-4o | Observation |
| --- | --- | --- | --- |
| Accuracy on subset of the CRAG dataset | 51.50% (103 correct responses out of 200) | 53.00% (106 correct responses out of 200) | GPT-4o outperforms Amazon Nova Pro by 1.5 percentage points on accuracy |
| Cost for running inference for 200 queries | $0.00030205 | $0.000869537 | Amazon Nova Pro saves 65.26% in costs compared to GPT-4o |
| Average latency (seconds) | 1.682539835 | 2.15615045 | Amazon Nova Pro is 21.97% faster than GPT-4o |
| Average of input and output tokens | 1946.621359 | 1782.707547 | Typical GPT-4o responses are shorter than Amazon Nova responses |

For simple queries, Amazon Nova Pro and GPT-4o have similar accuracy (55 and 56 correct responses, respectively), but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you are part of an organization running an AI assistant service that handles 1,000 questions per month from each of 10,000 users (10,000,000 queries per month). Amazon Nova Pro would save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.

Let’s look at similar results for Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.

| Metric | Amazon Nova Lite | Amazon Nova Micro | GPT-4o mini | Observation |
| --- | --- | --- | --- | --- |
| Accuracy on subset of the CRAG dataset | 52.00% (104 correct responses out of 200) | 54.00% (108 correct responses out of 200) | 50.00% (100 correct responses out of 200) | Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini by 2 and 4 percentage points, respectively |
| Cost for running inference for 200 queries | $0.00002247 (56.59% cheaper than GPT-4o mini) | $0.000013924 (73.10% cheaper than GPT-4o mini) | $0.000051768 | Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively |
| Average latency (seconds) | 1.553371465 (26.60% faster than GPT-4o mini) | 1.6828564 (20.48% faster than GPT-4o mini) | 2.116291895 | Amazon Nova models are at least 20% faster than GPT-4o mini |
| Average of input and output tokens | 1930.980769 | 1940.166667 | 1789.54 | GPT-4o mini returns shorter answers |

Amazon Nova Micro is significantly faster and less expensive than GPT-4o mini while providing more accurate answers. For a service handling about 10 million queries per month, choosing Amazon Nova Micro over GPT-4o mini would cut inference costs by roughly 73% while delivering more accurate results.

Conclusion

Based on these RAG tests, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency than the GPT-4o and GPT-4o mini models. FloTorch is continuing its experimentation with other relevant LLMs for comparison. Future research will include additional experiments with the remaining CRAG query types: comparison, aggregation, set, false_premise, post-processing, and multi-hop.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About FloTorch

FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch’s mission is to help enterprises make data-driven decisions across the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version that supports scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version runs in a customer’s AWS account, so you can experiment on your own AWS account with your proprietary data. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with a model registry to custom define new LLMs, and a recommendation engine to suggest new LLMs and agent workflows. For more information, contact us at info@flotorch.ai.


About the author

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant engages with customers to implement state-of-the-art generative AI solutions and agentic workflows for enterprises.
