How IDIADA optimized its intelligent chatbot with Amazon Bedrock


This post is co-written with Xavier Vizcaino, Diego Martín Montoro, and Jordi Sánchez Ferrer from Applus+ Idiada.

In 2021, Applus+ IDIADA, a global partner to the automotive industry with over 30 years of experience supporting customers in product development activities through design, engineering, testing, and homologation services, established the Digital Solutions department. This strategic move aimed to drive innovation by using digital tools and processes. Since then, we have optimized data strategies, developed customized solutions for customers, and prepared for the technological revolution reshaping the industry.

AI now plays a pivotal role in the development and evolution of the automotive sector, in which Applus+ IDIADA operates. Within this landscape, we developed an intelligent chatbot, AIDA (Applus Idiada Digital Assistant), an Amazon Bedrock-powered virtual assistant serving as a versatile companion to IDIADA’s workforce.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

With Amazon Bedrock, AIDA assists with a multitude of tasks, from addressing inquiries to tackling complex technical challenges spanning code, mathematics, and translation. Its capabilities are truly boundless.

With AIDA, we take another step towards our vision of providing global and integrated digital solutions that add value for our customers. Its internal deployment strengthens our leadership in developing data analysis, homologation, and vehicle engineering solutions. Additionally, in the medium term, IDIADA plans to offer AIDA as an integrable product for customers’ environments and develop “light” versions seamlessly integrable into existing systems.

In this post, we showcase the research process undertaken to develop a classifier for human interactions in this AI-based environment using Amazon Bedrock. The objective was to accurately identify the type of interaction received by the intelligent agent to route the request to the appropriate pipeline, providing a more specialized and efficient service.

The challenge: Optimize intelligent chatbot responses, allocate resources more effectively, and enhance the overall user experience

Built on a flexible and secure architecture, AIDA offers a versatile environment for integrating multiple data sources, including structured data from enterprise databases and unstructured data from internal sources like Amazon Simple Storage Service (Amazon S3). It boasts advanced capabilities like chat with data, advanced Retrieval Augmented Generation (RAG), and agents, enabling complex tasks such as reasoning, code execution, or API calls.

As AIDA’s interactions with humans proliferated, a pressing need emerged to establish a coherent system for categorizing these diverse exchanges.

Initially, users were making simple queries to AIDA, but over time, they started to request more specific and complex tasks. These included document translations, inquiries about IDIADA’s internal services, file uploads, and other specialized requests.

The main reason for this categorization was to develop distinct pipelines that could more effectively address various types of requests. By sorting interactions into categories, AIDA could be optimized to handle specific kinds of tasks more efficiently. This approach allows for tailored responses and processes for different types of user needs, whether it’s a simple question, a document translation, or a complex inquiry about IDIADA’s services.

The primary objective is to offer a more specialized service through the creation of dedicated pipelines for various contexts, such as conversation, document translation, and services, in order to provide more accurate, relevant, and efficient responses to users’ increasingly diverse and specialized requests.

Solution overview

By categorizing the interactions into three main groups—conversation, services, and document translation—the system can better understand the user’s intent and respond accordingly. The Conversation class encompasses general inquiries and exchanges, the Services class covers requests for specific functionalities or support, and the Document_Translation class handles text translation needs.
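To illustrate the intended routing, the following is a minimal sketch of how a classifier’s decision could dispatch each request to its dedicated pipeline. The handler functions are hypothetical placeholders, and classify_interaction stands in for any of the classifiers developed later in this post:

# Hypothetical pipeline handlers; the real pipelines perform RAG,
# translation, or service orchestration.
def handle_conversation(question):
    return f"[conversation pipeline] {question}"

def handle_services(question):
    return f"[services pipeline] {question}"

def handle_document_translation(question):
    return f"[document translation pipeline] {question}"

PIPELINES = {
    "Conversation": handle_conversation,
    "Services": handle_services,
    "Document_Translation": handle_document_translation,
}

def route_request(question):
    predicted = classify_interaction(question)  # any classifier from this post
    # Fall back to the most common class if the prediction is unexpected
    handler = PIPELINES.get(predicted, handle_conversation)
    return handler(question)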

The specialized pipelines, designed specifically for each use case, allow for a significant increase in efficiency and accuracy of AIDA’s responses. This is achieved in several ways:

  • Enhanced efficiency – By having dedicated pipelines for specific types of tasks, AIDA can process requests more quickly. Each pipeline is optimized for its particular use case, which reduces the computation time needed to generate an appropriate response.
  • Increased accuracy – The specialized pipelines are equipped with specific tools and knowledge for each type of task. This allows AIDA to provide more accurate and relevant responses, because it uses the most appropriate resources for each type of request.
  • Optimized resource allocation – By classifying interactions, AIDA can allocate computational resources more efficiently, directing the appropriate processing power to each type of task.
  • Improved response time – The combination of greater efficiency and optimized resource allocation results in faster response times for users.
  • Enhanced adaptability – This system allows AIDA to better adapt to different types of requests, from simple queries to complex tasks such as document translations or specialized inquiries about IDIADA services.

The research and development of this large language model (LLM)-based classifier is an important step in the continuous improvement of the intelligent agent’s capabilities within the Applus IDIADA environment.

For this study, we used a set of 1,668 pre-classified human interactions, divided into 666 examples for training and 1,002 for testing. This 40/60 split gives significant weight to the test set.
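For reference, such a split can be produced with scikit-learn. The source file name and the stratified sampling are assumptions for illustration, while the output file names match those loaded by the code later in this post:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical source file holding all 1,668 pre-classified interactions
df = pd.read_excel('coordinator_dataset/casos_coordinador.xlsx')

# 40/60 split: 666 training and 1,002 test samples, stratified by class
df_train, df_test = train_test_split(
    df, test_size=1002, stratify=df['agent'], random_state=42
)
df_train.to_excel('coordinator_dataset/casos_coordinador_train.xlsx', index=False)
df_test.to_excel('coordinator_dataset/casos_coordinador_test.xlsx', index=False)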

The following table shows some examples.

SAMPLE | CLASS
Can you make a summary of this text? “Legislation for the Australian Government’s …” | Conversation
No, only focus on this sentence: Braking technique to enable maximum brake application speed | Conversation
In a factory give me synonyms of a limiting resource of activities | Conversation
We need a translation of the file “Company_Bylaws.pdf” into English, could you handle it? | Document_Translation
Please translate the file “Product_Manual.xlsx” into English | Document_Translation
Could you convert the document “Data_Privacy_Policy.doc” into English, please? | Document_Translation
Register my username in IDIADA’s human resources database | Services
Send a mail to random_user@mail.com to schedule a meeting for the next weekend | Services
Book an electric car charger for me at IDIADA | Services

We present several classification approaches: two based on LLMs and others that apply classic machine learning (ML) algorithms (k-NN, SVM, and a small neural network) to text embeddings. The aim is to understand which approach is most suitable for addressing the presented challenge.

LLM-based classifier: Simple prompt

In this case, we developed an LLM-based classifier to categorize inputs into three classes: Conversation, Services, and Document_Translation. Instead of relying on predefined, rigid definitions, our approach follows the principle of understanding a set. This principle involves analyzing the common characteristics and patterns present in the examples or instances that belong to each class. By studying the shared traits of inputs within a class, we can derive an understanding of the class itself, without being constrained by preconceived notions.

It’s important to note that the learned definitions might differ from common expectations. For instance, the Conversation class encompasses not only typical conversational exchanges but also tasks like text summarization, which share similar linguistic and contextual traits with conversational inputs.

By following this data-driven approach, the classifier can accurately categorize new inputs based on their similarity to the learned characteristics of each class, capturing the nuances and diversity within each category.

The code consists of the following key components: libraries, a prompt, model invocation, and an output parser.

Libraries

The programming language used in this code is Python, complemented by the LangChain module, which is specifically designed to facilitate the integration and use of LLMs. This module provides a comprehensive set of tools and abstractions that streamline the process of incorporating and deploying these advanced AI models.

To take advantage of the power of these language models, we use Amazon Bedrock. The integration with Amazon Bedrock is achieved through the Boto3 Python module, which serves as an interface to AWS, enabling seamless interaction with Amazon Bedrock and the deployment of the classification model.

Prompt

The task is to assign one of three classes (Conversation, Services, or Document_Translation) to a given sentence, represented by question:

  • Conversation class – This class encompasses casual messages, summarization requests, general questions, affirmations, greetings, and similar types of text. It also includes requests for text translation, summarization, or explicit inquiries about the meaning of words or sentences in a specific language.
  • Services class – Texts belonging to this class consist of explicit requests for services such as room reservations, hotel bookings, dining services, cinema information, tourism-related inquiries, and similar service-oriented requests.
  • Document_Translation class – This class is characterized by requests for the translation of a document to a specific language. Unlike the Conversation class, these requests don’t involve summarization. Additionally, the name of the document to be translated and the target language are specified.

The prompt suggests a hierarchical approach to the classification process. First, the sentence should be evaluated to determine if it can be classified as a conversation. If the sentence doesn’t fit the Conversation class, one of the other two classes (Services or Document_Translation) should be assigned.

The priority for the Conversation class stems from the fact that 99% of the interactions are actually simple questions regarding various matters.

Model invocation

We use Anthropic’s Claude 3 Sonnet model for the natural language processing task. This LLM has a context window of 200,000 tokens, enabling it to handle different languages and return highly accurate answers. We use two key parameters:

  • max_tokens – This parameter limits the maximum number of tokens (words or subwords) that the language model can generate in its output. Because the classifier only needs to return a short class name, we keep it small: 20 in the simple prompt version and 50 in the example-augmented version.
  • temperature – This parameter controls the randomness of the language model’s output. A temperature of 0.0 means that the model will produce the most likely output according to its training, without introducing randomness.

Output parser

Another important component is the output parser, which allows us to gather the desired information in JSON format. To achieve this, we use LangChain’s StructuredOutputParser, available in its output_parsers module.

The following code illustrates a simple prompt approach:

import json

import boto3
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

# Amazon Bedrock runtime client used to invoke the model
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-west-2"
)

def classify_interaction(question):
    # Describe the expected JSON output: a single "class" field
    response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()

    prompt = f"""
    We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
    Conversation: texts consist of casual messages, summarization requests, general questions, affirmations, greetings,
    and similar. Requests for text translation, text summarization or explicit text translation requests,
    questions about the meaning of words or sentences in a concrete language.
    Services: the text consists of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
    Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested.
    The length of the document is specified.
    Assign a class to the following sentence.
    {question}
    Try to understand the sentence as a Conversation one, if you can't, then assign one of the other classes.
    {format_instructions}
    """
    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 20,
                "temperature": 0,
                "messages": [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": prompt}],
                    }
                ],
            }
        ),
    )
    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    try:
        # Normalize quotes so the JSON parser accepts the model output
        output_dict = output_parser.parse(texto.replace('\'', '"'))['class']
    except Exception:
        # Fall back to the most common class if parsing fails
        output_dict = "Conversation"
    return output_dict

LLM-based classifier: Example augmented inference

We use RAG techniques to enhance the model’s response capabilities. Instead of relying solely on compressed definitions, we provide the model with a quasi-definition by extension. Specifically, we present the model with a diverse set of examples for each class, allowing it to learn the inherent characteristics and patterns that define each category. For instance, in addition to a concise definition of the Conversation class, the model is exposed to various conversational inputs, enabling it to identify common traits such as informal language, open-ended questions, and back-and-forth exchanges. This example-driven approach complements the initial descriptions provided, allowing the model to capture the nuances and diversity within each class. By combining concise definitions with representative examples, the RAG technique helps the model develop a more comprehensive understanding of the classes, enhancing its ability to accurately categorize new inputs based on their inherent nature and characteristics.

The following code provides examples in JSON format for RAG:

{
    "Conversation": [
        "Could you give me examples of how to solve it?",
        "cool but anything short and sweet",
        "..."
    ],
    "Services": [
        "make a review of my investments in the eBull.com platform",
        "I need a room in IDIADA",
        "schedule a meeting with",
        "..."
    ],
    "Document_Translation": [
        "Translate the file into Catalan",
        "Could you translate the document I added earlier into Swedish?",
        "Translate the Guía_Rápida.doc file into Romanian",
        "..."
    ]
}

The total number of examples provided for each class is as follows (a sketch of how this example set can be assembled from the training data follows the list):

  • Conversation – 500 examples. This is the most common class, and only 500 samples are given to the model due to the vast amount of information, which could cause infrastructure overload (very high delays, throttling, connection shutdowns). This is a crucial point to note because it represents a significant bottleneck. Providing more examples to this approach could potentially improve performance, but the question remains: How many examples? Surely, a substantial amount would be required.
  • Services – 26 examples. This is the least common class, and in this case, all available training data has been used.
  • Document_Translation – 140 examples. Again, all available training data has been used for this class.
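The following is a minimal sketch of how the per-class example set could be assembled from the training file used throughout this post; the sampling call, the caps dictionary, and the serialization step are assumptions for illustration:

import json

import pandas as pd

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')

# Cap Conversation at 500 samples to keep the prompt manageable;
# the two smaller classes use all available training samples.
caps = {"Conversation": 500, "Services": 26, "Document_Translation": 140}

examples = {}
for klass, cap in caps.items():
    class_samples = df_train.loc[df_train['agent'] == klass, 'sample']
    examples[klass] = class_samples.sample(n=min(cap, len(class_samples)), random_state=42).tolist()

# Serialized dictionary passed to classify_interaction as agent_examples
agent_examples = json.dumps(examples, ensure_ascii=False, indent=2)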

One of the key challenges with this approach is scalability. Although the model’s performance improves with more training examples, the computational demands quickly become overwhelming for our current infrastructure. The sheer volume of data required can lead to quota issues with Amazon Bedrock and unacceptably long response times. Rapid response times are essential for providing a satisfactory user experience, and this approach falls short in that regard.

In this case, we need to modify the code to embed all the examples. The following code shows the changes applied to the first version of the classifier. The prompt is modified to include all the examples in JSON format under the “Here you have some examples” section.

def classify_interaction(question, agent_examples):
    # Describe the expected JSON output: a single "class" field
    response_schemas = [
        ResponseSchema(name="class", description="the assigned class")
    ]
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    format_instructions = output_parser.get_format_instructions()

    # Same prompt as before, extended with pre-classified examples
    prompt = f"""
    We have 3 classes Conversation (for example asking for assistance), Services and Document_Translation.
    Conversation: texts consist of casual messages, summarization requests, general questions, affirmations, greetings,
    and similar. Requests for text translation, text summarization or explicit text translation requests,
    questions about the meaning of words or sentences in a concrete language.
    Services: the text consists of explicit requests for rooms, hotels, eating services, cinema, tourism, and similar.
    Document_Translation: A translation of a document to a specific language is requested, and a summary is not requested.
    The length of the document is specified.

    Here you have some examples:
    {agent_examples}

    Assign a class to the following sentence.
    {question}

    Try to understand the sentence as a Conversation one, if you can't, then assign one of the other classes.
    {format_instructions}
    """

    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        body=json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 50,
                "messages": [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": prompt}],
                    }
                ],
            }
        ),
    )

    result_message = json.loads(response.get("body").read())
    texto = result_message['content'][0]['text']
    # Normalize quotes so the JSON parser accepts the model output
    output_dict = output_parser.parse(texto.replace('\'', '"'))['class']

    return output_dict

K-NN-based classifier: Amazon Titan Embeddings

In this case, we take a different approach by recognizing that despite the multitude of possible interactions, they often share similarities and repetitive patterns. Instead of treating each input as entirely unique, we can use a distance-based approach like k-nearest neighbors (k-NN) to assign a class based on the most similar examples surrounding the input. To make this work, we need to transform the textual interactions into a format that allows algebraic operations. This is where embeddings come into play. Embeddings are vector representations of text that capture semantic and contextual information. We can calculate the semantic similarity between different interactions by converting text into these vector representations, comparing the resulting vectors, and determining their proximity in the embedding space.

To accommodate this approach, we need to modify the code accordingly:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-west-2"
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)

# Pre-classified interactions: 666 training and 1,002 test samples
df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Embed every interaction with Amazon Titan (1,536-dimensional vectors)
X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()

# k-NN classifier in the embedding space
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We used the Amazon Titan Text Embeddings G1 model, which generates vectors of 1,536 dimensions. This model is trained to accept multiple languages while retaining the semantic meaning of the embedded phrases.
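As a quick sanity check on the dimensionality, a single phrase can be embedded with the same client; embed_query returns one vector:

# One sample interaction from the dataset; Titan G1 returns 1,536-dim vectors
vector = bedrock_embedding.embed_query("Book an electric car charger for me at IDIADA")
print(len(vector))  # 1536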

For the classifier, we employed a classic ML algorithm, k-NN, using the scikit-learn Python module. This method takes the number of neighbors (k) as its main parameter, which we set to 3.

The following figure illustrates the F1 scores for each class plotted against the number of neighbors (k) used in the k-NN algorithm. As the graph shows, the optimal value for k is 3, which yields the highest F1 score for the Document_Translation class. Although it’s not the absolute highest score for the Services class, Document_Translation is significantly more common than Services, making k=3 the best overall choice to maximize performance across all classes.
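The curves in that figure can be reproduced with a simple sweep over k, computing the per-class F1 score at each value; this sketch reuses the embeddings and labels computed above:

from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

classes = ['Conversation', 'Document_Translation', 'Services']
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_emb, y_train)
    # Per-class F1 at this value of k
    scores = f1_score(y_test, knn.predict(X_test_emb), labels=classes, average=None)
    print(k, dict(zip(classes, scores.round(2))))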

K-NN-based classifier: Cohere’s multilingual embeddings model

In the previous section, we used the popular Amazon Titan Text Embeddings G1 model to generate text embeddings. However, other models might offer different advantages. In this section, we explore the use of Cohere’s multilingual model on Amazon Bedrock for generating embeddings. We chose the Cohere model due to its excellent capability in handling multiple languages without compromising the vectorization of phrases. As we will demonstrate, this model generates vector spaces broadly comparable to those of other models while being trained specifically for multilingual use, making it well suited for a multilingual environment like AIDA.

To use the Cohere model, we need to change the model_id:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-west-2"
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Truncate each interaction to 1,500 characters (Cohere input size limit)
data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()

# Embed every interaction with Cohere's multilingual model (1,024-dimensional vectors)
X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

# k-NN classifier in the embedding space
neigh = KNeighborsClassifier(n_neighbors=11)
neigh.fit(X_train_emb, y_train)
y_pred = neigh.predict(X_test_emb)
print(classification_report(y_test, y_pred, target_names=['Conversation', 'Document_Translation', 'Services']))

We use Cohere’s multilingual embeddings model to generate vectors with 1,024 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.

For the classifier, we employ k-NN, using the scikit-learn Python module. This method takes the number of neighbors (k) as its main parameter, which we have set to 11.

The following figure illustrates the F1 scores for each class plotted against the number of neighbors used. As depicted, the optimal point is k=11, achieving the highest value for Document_Translation and the second-highest for Services. In this instance, the trade-off between Document_Translation and Services is favorable.


Amazon Titan Embeddings vs. Cohere’s multilingual embeddings model

In this section, we delve deeper into the embeddings generated by both models, aiming to understand their nature and consequently comprehend the results obtained. To achieve this, we have performed dimensionality reduction to visualize the vectors obtained in both cases in 2D.

Cohere’s multilingual embeddings model has a limitation on the size of the text it can vectorize, posing a significant constraint. Therefore, in the implementation showcased in the previous section, we applied a cutoff that keeps only the first 1,500 characters of each interaction.
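The post doesn’t name the reduction technique used, so the following sketch assumes PCA from scikit-learn to project the embeddings to 2D and color the points by class:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Project the embeddings (1,536-dim Titan or 1,024-dim Cohere) to 2D
coords = PCA(n_components=2).fit_transform(X_train_emb)
labels = np.array(y_train)

for klass in ['Conversation', 'Document_Translation', 'Services']:
    pts = coords[labels == klass]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, label=klass)
plt.legend()
plt.title('2D projection of the interaction embeddings')
plt.show()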

The following figure illustrates the vector spaces generated in each case.


As we can observe, the generated vector spaces are relatively similar, initially appearing to be analogous spaces with a rotation between one another. However, upon closer inspection, it becomes evident that the direction of maximum variance in the case of Cohere’s multilingual embeddings model is distinct (deducible from observing the relative position and shape of the different groups). This type of situation, where high class overlap is observed, presents an ideal case for applying algorithms such as k-NN.

As mentioned in the introduction, most human interactions with AI are very similar to each other within the same class. This would explain why k-NN-based models outperform LLM-based models.

SVM-based classifier: Amazon Titan Embeddings

In this scenario, it is likely that user interactions belonging to the three main categories (Conversation, Services, and Document_Translation) form distinct clusters or groups within the embedding space. Each category possesses particular linguistic and semantic characteristics that would be reflected in the geometric structure of the embedding vectors. The previous visualization of the embeddings space displayed only a 2D transformation of this space. This doesn’t imply that clusters couldn’t be highly separable in higher dimensions.

Classification algorithms like support vector machines (SVMs) are especially well-suited to use this implicit geometry of the data. SVMs seek to find the optimal hyperplane that separates the different groups or classes in the embedding space, maximizing the margin between them. This ability of SVMs to use the underlying geometric structure of the data makes them an intriguing option for this user interaction classification problem.

Furthermore, SVMs are a robust and efficient algorithm that can effectively handle high-dimensional datasets, such as text embeddings. This makes them particularly suitable for this scenario, where the embedding vectors of the user interactions are expected to have a high dimensionality.

The following code illustrates the implementation:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn import svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="eu-central-1"
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()
X_test = df_test['sample'].values.tolist()

# Embed every interaction with Amazon Titan (1,536-dimensional vectors)
X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())

# Weighted F1 as the model selection metric
f1 = make_scorer(f1_score, average="weighted")

# Hyperparameter grid explored with 10-fold cross-validation
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': [1, 2, 4, 6, 8, 10],
              'class_weight': [None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=10, n_jobs=-1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use Amazon Titan Text Embeddings G1. This model generates vectors of 1,536 dimensions, and is trained to accept several languages and to retain the semantic meaning of the phrases embedded.

To implement the classifier, we employed a classic ML algorithm, SVM, using the scikit-learn Python module. The SVM algorithm requires the tuning of several parameters to achieve optimal performance. To determine the best parameter values, we conducted a grid search with 10-fold cross-validation, using the F1 multi-class score as the evaluation metric. This systematic approach allowed us to identify the following set of parameters that yielded the highest performance for our classifier (they can be read back from the fitted search object, as shown in the snippet after this list):

  • C – We set this parameter to 1. This parameter controls the trade-off between allowing training errors and forcing rigid margins. It acts as a regularization parameter. A higher value of C (for example, 10) indicates a higher penalty for misclassification errors. This results in a more complex model that tries to fit the training data more closely. A higher C value can be beneficial when the classes in the data are well separated, because it allows the algorithm to create a more intricate decision boundary to accurately classify the samples. On the other hand, a C value of 1 indicates a reasonable balance between fitting the training set and the model’s generalization ability. This value might be appropriate when the data has a simple structure, and a more flexible model isn’t necessary to capture the underlying relationships. In our case, the selected C value of 1 suggests that the data has a relatively simple structure, and a balanced model with moderate complexity is sufficient for accurate classification.
  • class_weight – We set this parameter to None. This parameter adjusts the weights of each class during the training process. Setting class_weight to balanced automatically adjusts the weights inversely proportional to the class frequencies in the input data. This is particularly useful when dealing with imbalanced datasets, where one class is significantly more prevalent than the others. In our case, the value of None for the class_weight parameter suggests that the minor classes don’t have much relevance or impact on the overall classification task. This choice implies that the implicit geometry or decision boundaries learned by the model might not be optimized for separating the different classes effectively.
  • kernel – We set this parameter to linear. This parameter specifies the type of kernel function to be used by the SVC algorithm. The linear kernel is a simple and efficient choice because it assumes that the decision boundary between classes can be represented by a linear hyperplane in the feature space. This value suggests that, in a higher-dimensional vector space, the categories could be linearly separated by a hyperplane.
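The winning configuration and its cross-validated score can be read back from the fitted GridSearchCV object:

# Selected hyperparameters and their mean cross-validated weighted F1
print(clf.best_params_)  # {'C': 1, 'class_weight': None, 'kernel': 'linear'}
print(clf.best_score_)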

SVM-based classifier: Cohere’s multilingual embeddings model

The implementation details of the classifier are presented in the following code:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn import svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-west-2"
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Truncate each interaction to 1,500 characters (Cohere input size limit)
data_train = [s[:1500] for s in df_train['sample']]
data_test = [s[:1500] for s in df_test['sample']]

y_train = df_train['agent'].values.tolist()
y_test = df_test['agent'].values.tolist()

# Embed every interaction with Cohere's multilingual model (1,024-dimensional vectors)
X_train_emb = bedrock_embedding.embed_documents(data_train)
X_test_emb = bedrock_embedding.embed_documents(data_test)

# Weighted F1 as the model selection metric
f1 = make_scorer(f1_score, average="weighted")

# Hyperparameter grid explored with 10-fold cross-validation
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': [1, 2, 4, 6, 8, 10],
              'class_weight': [None, 'balanced']}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=10, n_jobs=-1, scoring=f1)
clf.fit(X_train_emb, y_train)

y_pred = clf.predict(X_test_emb)

We use Cohere’s multilingual embeddings model, which generates vectors of 1,024 dimensions. This model is trained to accept multiple languages and retain the semantic meaning of the embedded phrases.

For the classifier, we employ SVM, using the scikit-learn Python module. To obtain the optimal parameters, we performed a grid search with 10-fold cross-validation based on the multi-class F1 score, resulting in the following selected parameters (as detailed in the previous section):

  • C – We set this parameter to 1, which indicates a reasonable balance between fitting the training set and the model’s generalization ability. This setting suggests that the data has a simple structure and that a more flexible model might not be necessary to capture the underlying relationships.
  • class_weight – We set this parameter to None. A value of None suggests that the minor classes don’t have much relevance, which in turn implies that the implicit geometry might not be suitable for separating the different classes.
  • kernel – We set this parameter to linear. This value suggests that in a higher-dimensional vector space, the categories could be linearly separated by a hyperplane.

ANN-based classifier: Amazon Titan and Cohere’s multilingual embeddings model

Given the promising results obtained with SVMs, we decided to explore another geometry-based method by employing an Artificial Neural Network (ANN) approach.

In this case, we normalized the input vectors to take advantage of the benefits normalization brings to neural networks. Normalizing the input data is a crucial step when working with ANNs, because it can help improve the model’s convergence during training. We applied min/max scaling for normalization; the scaling step is included at the end of the preprocessing code that follows.

The use of an ANN-based approach provides the ability to capture complex non-linear relationships in the data, which might not be easily modeled using traditional linear methods like SVMs. The combination of the geometric insights and the normalization of inputs can potentially lead to improved predictive accuracy compared to the previous SVM results.

This approach consists of the following parameters:

  • Model definition – We define a sequential deep learning model using the Keras library from TensorFlow.
  • Model architecture – The model consists of three densely connected layers. The first layer has 16 neurons and uses the ReLU activation function. The second layer has 8 neurons and employs the ReLU activation function. The third layer has 3 neurons and uses the softmax activation function.
  • Model compilation – We compile the model using the categorical_crossentropy loss function, the Adam optimizer with a learning rate of 0.01, and the categorical_accuracy metric. We incorporate an EarlyStopping callback to stop the training if the categorical_accuracy metric doesn’t improve for 25 epochs.
  • Model training – We train the model for a maximum of 500 epochs using the training set and validate it on the test set. The batch size is set to 64. The performance metric used is the maximum classification accuracy (categorical_accuracy) obtained during the training.

We applied the same methodology, but using the embeddings generated by Cohere’s multilingual embeddings model after being normalized through min/max scaling. In both cases, we employed the same preprocessing steps:

import boto3
import pandas as pd
from langchain.embeddings import BedrockEmbeddings
from sklearn.preprocessing import MinMaxScaler

bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-west-2"
)

bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="cohere.embed-multilingual-v3",
)

df_train = pd.read_excel('coordinator_dataset/casos_coordinador_train.xlsx')
df_test = pd.read_excel('coordinator_dataset/casos_coordinador_test.xlsx')

# Truncate each interaction to 1,500 characters (Cohere input size limit)
df_train['sample'] = [s[:1500] for s in df_train['sample']]
df_test['sample'] = [s[:1500] for s in df_test['sample']]

X_train_emb = bedrock_embedding.embed_documents(df_train['sample'].values.tolist())
X_test_emb = bedrock_embedding.embed_documents(df_test['sample'].values.tolist())
y_train = df_train['agent'].values.tolist()

# One-hot encode the training labels for the softmax output layer
y_train_ohe = [[int(y == 'Conversation'), int(y == 'Document_Translation'), int(y == 'Services')] for y in y_train]

# Integer-encode the test labels to compare against argmax predictions
y_test = df_test['agent'].values.tolist()
y_test = [['Conversation', 'Document_Translation', 'Services'].index(y) for y in y_test]
X_test = df_test['sample'].values.tolist()

# Min/max scaling of the embeddings (fit on train, apply to both sets)
scaler = MinMaxScaler()
X_train_emb_norm = scaler.fit_transform(X_train_emb)
X_test_emb_norm = scaler.transform(X_test_emb)

To help avoid ordinal assumptions, we employed a one-hot encoding representation for the output of the network. One-hot encoding doesn’t make any assumptions about the inherent order or hierarchy among the categories. This is particularly useful when the categorical variable doesn’t have a clear ordinal relationship, because the model can learn the relationships without being biased by any assumed ordering.
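For reference, the manual list comprehension above is equivalent to integer-encoding the labels and applying Keras’s to_categorical utility:

import tensorflow as tf

classes = ['Conversation', 'Document_Translation', 'Services']
y_train_idx = [classes.index(y) for y in y_train]  # integer-encode the labels
y_train_ohe = tf.keras.utils.to_categorical(y_train_idx, num_classes=3)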

The following code illustrates the implementation:

import threading

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

def train_model(X, y, n_hebras=10, reps=30, train_size=0.7, tipo_optimizacion="low"):
    # Train `reps` models across `n_hebras` threads and keep the best one
    # (hebras = threads; tipo_optimizacion selects the min or max of the metric)
    reps_por_hebra = int(reps / n_hebras)  # repetitions per thread
    hebras = [0] * n_hebras
    results = [0] * reps
    models = [0] * reps

    for i in range(len(hebras)):
        hebras[i] = threading.Thread(target=eval_model_rep_times,
            args=(X, y, train_size, reps_por_hebra, i * reps_por_hebra, models, results))
        hebras[i].start()

    for i in range(len(hebras)):
        hebras[i].join()

    if tipo_optimizacion == "low":
        result = models[np.argmin(results)], min(results)
    else:
        result = models[np.argmax(results)], max(results)
    return result

def eval_model_rep_times(X, y, train_size, reps, index, models, results):
    # Each thread trains `reps` models on fresh random splits
    for rep in range(reps):
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size)
        model, metric = create_and_fit_model(X_train, y_train, X_test, y_test)
        models[index + rep] = model
        results[index + rep] = metric

def create_and_fit_model(X_train, y_train, X_test, y_test):
    # Model definition: 16 -> 8 -> 3 dense layers with a softmax output
    model = Sequential()
    model.add(Dense(16, input_shape=(len(X_train[0]),), activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.01), metrics=['categorical_accuracy'])
    # Stop training if categorical_accuracy doesn't improve for 25 epochs
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="categorical_accuracy", patience=25, mode="max")

    # Training: up to 500 epochs with a batch size of 64
    history = model.fit(X_train,
              y_train,
              epochs=500,
              validation_data=(X_test, y_test),
              batch_size=64,
              callbacks=[early_stopping],
              verbose=0)

    # Keep the best categorical accuracy reached during training
    metrica = max(history.history['categorical_accuracy'])

    return model, metrica

# Keep the model with the highest training accuracy over 20 runs in 5 threads
model, best_acc = train_model(np.array(X_train_emb_norm), np.array(y_train_ohe), 5, 20, tipo_optimizacion="high")
y_pred = [est.argmax() for est in model.predict(np.array(X_test_emb_norm))]

Results

We conducted a comparative analysis using the previously presented code and data. The models were assessed based on their F1 scores for the conversation, services, and document translation tasks, as well as their runtimes. The following table summarizes our results.

MODEL | CONVERSATION F1 | SERVICES F1 | DOCUMENT_TRANSLATION F1 | RUNTIME (seconds)
LLM | 0.81 | 0.22 | 0.46 | 1.2
LLM with examples | 0.86 | 0.13 | 0.68 | 18
k-NN – Amazon Titan Embeddings | 0.98 | 0.57 | 0.88 | 0.35
k-NN – Cohere Embeddings | 0.96 | 0.72 | 0.72 | 0.35
SVM – Amazon Titan Embeddings | 0.98 | 0.69 | 0.82 | 0.3
SVM – Cohere Embeddings | 0.99 | 0.80 | 0.93 | 0.3
ANN – Amazon Titan Embeddings | 0.98 | 0.60 | 0.87 | 0.15
ANN – Cohere Embeddings | 0.99 | 0.77 | 0.96 | 0.15

As illustrated in the table, the SVM and ANN models using Cohere’s multilingual embeddings model demonstrated the strongest overall performance. The SVM with Cohere’s multilingual embeddings model achieved the highest F1 scores in two out of three tasks, reaching 0.99 for Conversation, 0.80 for Services, and 0.93 for Document_Translation. Similarly, the ANN with Cohere’s multilingual embeddings model also performed exceptionally well, with F1 scores of 0.99, 0.77, and 0.96 for the respective tasks.

In contrast, the LLM exhibited relatively lower F1 scores, particularly for the services (0.22) and document translation (0.46) tasks. However, the performance of the LLM improved when provided with examples, with the F1 score for document translation increasing from 0.46 to 0.68.

Regarding runtime, the k-NN, SVM, and ANN models demonstrated significantly faster inference times compared to the LLM. The k-NN and SVM models with both Amazon Titan and Cohere’s multilingual embeddings model had runtimes of approximately 0.3–0.35 seconds. The ANN models were even faster, with runtimes of approximately 0.15 seconds. In contrast, the LLM required approximately 1.2 seconds for inference, and the LLM with examples took around 18 seconds.

These results suggest that the SVM and ANN models using Cohere’s multilingual embeddings model offer the best balance of performance and efficiency for the given tasks. The superior F1 scores of these models, coupled with their faster runtimes, make them promising candidates for application. The potential benefits of providing examples to the LLM model are also noteworthy, because this approach can help improve its performance on specific tasks.

Conclusion

The optimization of AIDA, Applus IDIADA’s intelligent chatbot powered by Amazon Bedrock, has been a resounding success. By developing dedicated pipelines to handle different types of user interactions—from general conversations to specialized service requests and document translations—AIDA has significantly improved its efficiency, accuracy, and overall user experience. The innovative use of LLMs, embeddings, and advanced classification algorithms has allowed AIDA to adapt to the evolving needs of IDIADA’s workforce, providing a versatile and reliable virtual assistant. AIDA now handles over 1,000 interactions per day, with a 95% accuracy rate in routing requests to the appropriate pipeline and driving a 20% increase in their team’s productivity.

Looking ahead, IDIADA plans to offer AIDA as an integrated product for customer environments, further expanding the reach and impact of this transformative technology.

Amazon Bedrock offers a comprehensive approach to security, compliance, and responsible AI development that empowers IDIADA and other customers to harness the full potential of generative AI without compromising on safety and trust. As this advanced technology continues to rapidly evolve, Amazon Bedrock provides the transparent framework needed to build innovative applications that inspire confidence.

Unlock new growth opportunities by creating custom, secure AI models tailored to your organization’s unique needs. Take the first step in your generative AI transformation—connect with an AWS expert today to begin your journey.


About the Authors

Xavier Vizcaino is the head of the DataLab, in the Digital Solutions department of Applus IDIADA. DataLab is the unit focused on the development of solutions for generating value from the exploitation of data through artificial intelligence.

Diego Martín Montoro is an AI Expert and Machine Learning Engineer at Applus+ Idiada Datalab. With a Computer Science degree and a Master’s in Data Science, Diego has built his career in the field of artificial intelligence and machine learning. His experience includes roles as a Machine Learning Engineer at companies like AppliedIT and Applus+ IDIADA, where he has worked on developing advanced AI systems and anomaly detection solutions.

Jordi Sánchez Ferrer is the current Product Owner of the Datalab at Applus+ Idiada. A Computer Engineer with a Master’s degree in Data Science, Jordi’s trajectory includes roles as a Business Intelligence developer, Machine Learning engineer, and lead developer in Datalab. In his current role, Jordi combines his technical expertise with product management skills, leading strategic initiatives that align data science and AI projects with business objectives at Applus+ Idiada.

Daniel Colls is a professional with more than 25 years of experience who has lived through the digital transformation and the transition from the on-premises model to the cloud from different perspectives in the IT sector. For the past 3 years, as a Solutions Architect at AWS, he has made this experience available to his customers, helping them successfully implement or move their workloads to the cloud.
