As AWS environments grow in complexity, troubleshooting issues with resources can become a daunting task. Manually investigating and resolving problems can be time-consuming and error-prone, especially when dealing with intricate systems. Fortunately, AWS provides a powerful tool called AWS Support Automation Workflows, which is a collection of curated AWS Systems Manager self-service automation runbooks. These runbooks are created by AWS Support Engineering with best practices learned from solving customer issues. They enable AWS customers to troubleshoot, diagnose, and remediate common issues with their AWS resources.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.
In this post, we explore how to use the power of Amazon Bedrock Agents and AWS Support Automation Workflows to create an intelligent agent capable of troubleshooting issues with AWS resources.
Solution overview
Although the solution is versatile and can be adapted to use a variety of AWS Support Automation Workflows, we focus on a specific example: troubleshooting an Amazon Elastic Kubernetes Service (Amazon EKS) worker node that failed to join a cluster. The following diagram provides a high-level overview of troubleshooting agents with Amazon Bedrock.
Our solution is built around the following key components that work together to provide a seamless and efficient troubleshooting experience:
- Amazon Bedrock Agents – Amazon Bedrock Agents acts as the intelligent interface between users and AWS Support Automation Workflows. It processes natural language queries to understand the issue context and manages conversation flow to gather required information. The agent uses Anthropic’s Claude 3.5 Sonnet model for advanced reasoning and response generation, enabling natural interactions throughout the troubleshooting process.
- Amazon Bedrock agent action groups – These action groups define the structured API operations that the Amazon Bedrock agent can invoke. Using OpenAPI specifications, they define the interface between the agent and AWS Lambda functions, specifying the available operations, required parameters, and expected responses. Each action group contains the API schema that tells the agent how to properly format requests and interpret responses when interacting with Lambda functions.
- Lambda function – The Lambda function acts as the integration layer between the Amazon Bedrock agent and AWS Support Automation Workflows (SAW). It validates input parameters from the agent and initiates the appropriate SAW runbook execution. It monitors the automation progress and processes the technical output into a structured format. When the workflow is complete, it returns the formatted results to the agent for presentation to the user.
- IAM role – The AWS Identity and Access Management (IAM) role provides the Lambda function with the necessary permissions to execute AWS Support Automation Workflows and interact with required AWS services. This role follows the principle of least privilege to maintain security best practices.
- AWS Support Automation Workflows – These pre-built diagnostic runbooks are developed by AWS Support Engineering. The workflows execute comprehensive system checks based on AWS best practices in a standardized, repeatable manner. They cover a wide range of AWS services and common issues, encapsulating AWS Support’s extensive troubleshooting expertise.
The following steps outline the workflow of our solution:
- Users start by describing their AWS resource issue in natural language through the Amazon Bedrock chat console. For example, “Why isn’t my EKS worker node joining the cluster?”
- The Amazon Bedrock agent analyzes the user’s question and matches it to the appropriate action defined in its OpenAPI schema. If essential information is missing, such as a cluster name or instance ID, the agent engages in a natural conversation to gather the required parameters. This makes sure that necessary data is collected before proceeding with the troubleshooting workflow.
- The Lambda function receives the validated request and triggers the corresponding AWS Support Automation Workflow. These SAW runbooks contain comprehensive diagnostic checks developed by AWS Support Engineering to identify common issues and their root causes. The checks run automatically without requiring user intervention.
- The SAW runbook systematically executes its diagnostic checks and compiles the findings. These results, including identified issues and configuration problems, are structured in JSON format and returned to the Lambda function.
- The Amazon Bedrock agent processes the diagnostic results using chain of thought (CoT) reasoning, based on the ReAct (synergizing reasoning and acting) technique. This enables the agent to analyze the technical findings, identify root causes, generate clear explanations, and provide step-by-step remediation guidance.
During the reasoning phase of the agent, the user is able to view the reasoning steps.
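To make the hand-off between the agent and the Lambda function more concrete, the following is a minimal sketch of a handler that reads the parameters from an agent event and wraps the findings in the response envelope that Amazon Bedrock Agents expects for action groups defined with an OpenAPI schema. The function names and the run_diagnostics callable are hypothetical and are not taken from the solution's repository.

```python
import json
from typing import Any, Callable, Dict


def extract_body_properties(event: Dict[str, Any]) -> Dict[str, str]:
    """Flatten the requestBody property list of an agent event into a dict."""
    content = event.get("requestBody", {}).get("content", {})
    properties = content.get("application/json", {}).get("properties", [])
    return {prop["name"]: prop["value"] for prop in properties}


def build_agent_response(
    event: Dict[str, Any], status_code: int, body: Dict[str, Any]
) -> Dict[str, Any]:
    """Wrap a result in the response envelope for API-schema action groups."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status_code,
            # The body is passed back to the agent as a JSON string.
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }


def handle_agent_event(
    event: Dict[str, Any],
    run_diagnostics: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Validate the input, run the diagnostics, and format the result.

    run_diagnostics is a hypothetical callable standing in for the code
    that starts the SAW runbook and collects its findings.
    """
    params = extract_body_properties(event)
    missing = [n for n in ("cluster_name", "worker_id") if not params.get(n)]
    if missing:
        error = {"error": "Missing parameters: " + ", ".join(missing)}
        return build_agent_response(event, 400, error)
    findings = run_diagnostics(params["cluster_name"], params["worker_id"])
    return build_agent_response(event, 200, findings)
```

Passing run_diagnostics in as an argument keeps the parsing and formatting logic testable without calling AWS.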
Troubleshooting examples
Let’s take a closer look at a common issue we mentioned earlier and how our agent can assist in troubleshooting it.
EKS worker node failed to join EKS cluster
When an EKS worker node fails to join an EKS cluster, our Amazon Bedrock agent can be invoked with the relevant information: cluster name and worker node ID. The agent will execute the corresponding AWS Support Automation Workflow, which will perform checks like verifying the worker node’s IAM role permissions and verifying the necessary network connectivity.
The automation workflow runs all the checks. The Amazon Bedrock agent then ingests the troubleshooting results, explains the root cause of the issue to the user, and suggests remediation steps based on the AWSSupport-TroubleshootEKSWorkerNode output, such as updating the worker node’s IAM role or resolving network configuration issues, so that the user can take the necessary actions to resolve the problem.
OpenAPI example
When you create an action group in Amazon Bedrock, you must define the parameters that the agent needs to elicit from the user. You can also define the API operations that the agent can invoke using these parameters. To define the API operations, we create an OpenAPI schema in JSON. The following excerpt shows the request body schema:
"Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post": {
    "properties": {
        "cluster_name": {
            "type": "string",
            "title": "Cluster Name",
            "description": "The name of the EKS cluster"
        },
        "worker_id": {
            "type": "string",
            "title": "Worker Id",
            "description": "The ID of the worker node"
        }
    },
    "type": "object",
    "required": [
        "cluster_name",
        "worker_id"
    ],
    "title": "Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post"
}
The schema consists of the following components:
- Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post – This is the name of the schema, which corresponds to the request body for the /troubleshoot-eks-worker-node POST endpoint.
- Properties – This section defines the properties (fields) of the schema:
  - “cluster_name” – This property represents the name of the EKS cluster. It is a string type and has a title and description.
  - “worker_id” – This property represents the ID of the worker node. It is also a string type and has a title and description.
- Type – This property specifies that the schema is an “object” type, meaning it is a collection of key-value pairs.
- Required – This property lists the required fields for the schema, which in this case are “cluster_name” and “worker_id”. These fields must be provided in the request body.
- Title – This property provides a human-readable title for the schema, which can be used for documentation purposes.
The OpenAPI schema defines the structure of the request body. To learn more, see Define OpenAPI schemas for your agent’s action groups in Amazon Bedrock and OpenAPI specification.
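The excerpt above is only the request body schema. As a rough illustration of how it might sit inside a complete OpenAPI document, the following sketch builds one as a Python dictionary. The OpenAPI version string, the info block, the operationId, and the response schema are assumptions made for this example; the schema generated by the solution may differ.

```python
import json

BODY_SCHEMA_NAME = (
    "Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post"
)


def build_openapi_document() -> dict:
    """Return a minimal OpenAPI document that wraps the request body schema."""
    return {
        "openapi": "3.0.0",  # assumed version
        "info": {"title": "Troubleshooting API", "version": "1.0.0"},  # assumed
        "paths": {
            "/troubleshoot-eks-worker-node": {
                "post": {
                    "operationId": "troubleshoot_eks_worker_node",  # assumed
                    "description": (
                        "Troubleshoot EKS worker node failed to join the cluster"
                    ),
                    "requestBody": {
                        "required": True,
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/"
                                    + BODY_SCHEMA_NAME
                                }
                            }
                        },
                    },
                    "responses": {
                        "200": {
                            "description": "The output of the Automation execution",
                            "content": {
                                "application/json": {"schema": {"type": "object"}}
                            },
                        }
                    },
                }
            }
        },
        "components": {
            "schemas": {
                BODY_SCHEMA_NAME: {
                    "title": BODY_SCHEMA_NAME,
                    "type": "object",
                    "required": ["cluster_name", "worker_id"],
                    "properties": {
                        "cluster_name": {
                            "type": "string",
                            "title": "Cluster Name",
                            "description": "The name of the EKS cluster",
                        },
                        "worker_id": {
                            "type": "string",
                            "title": "Worker Id",
                            "description": "The ID of the worker node",
                        },
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_openapi_document(), indent=2))
```

The description fields matter in practice, because the agent relies on them to decide when to call an operation and which values to ask the user for.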
Lambda function code
Now let’s explore the Lambda function code:
@app.post("/troubleshoot-eks-worker-node")
@tracer.capture_method
def troubleshoot_eks_worker_node(
    cluster_name: Annotated[str, Body(description="The name of the EKS cluster")],
    worker_id: Annotated[str, Body(description="The ID of the worker node")]
) -> dict:
    """
    Troubleshoot EKS worker node that failed to join the cluster.

    Args:
        cluster_name (str): The name of the EKS cluster.
        worker_id (str): The ID of the worker node.

    Returns:
        dict: The output of the Automation execution.
    """
    return execute_automation(
        automation_name="AWSSupport-TroubleshootEKSWorkerNode",
        parameters={
            'ClusterName': [cluster_name],
            'WorkerID': [worker_id]
        },
        execution_mode="TroubleshootWorkerNode"
    )
The code consists of the following components:
- @app.post("/troubleshoot-eks-worker-node") – This decorator sets up a route for a POST request to the /troubleshoot-eks-worker-node endpoint.
- @tracer.capture_method – This decorator is used for tracing. It captures information about the execution of the function, such as the duration and errors.
- cluster_name: Annotated[str, Body(description="The name of the EKS cluster")] – This parameter specifies that cluster_name is a string and is expected to be passed in the request body. Annotated attaches the Body marker to the type, which indicates that the value should be extracted from the request body. The description provides a brief explanation of what the parameter represents.
- worker_id: Annotated[str, Body(description="The ID of the worker node")] – This parameter specifies that worker_id is a string and is expected to be passed in the request body.
- -> dict – This is the return type of the function, a dictionary that contains the output of the Automation execution.
- execute_automation(...) – The function body delegates to a helper that runs the AWSSupport-TroubleshootEKSWorkerNode runbook, passing the cluster name and worker ID as the ClusterName and WorkerID runbook parameters.
To link a new SAW runbook in the Lambda function, you can follow the same template.
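The execute_automation helper that the route calls is not shown in this post. As a minimal sketch of what such a helper could look like, the following version starts a Systems Manager Automation execution and polls until it reaches a terminal state. The signature, the returned dictionary, the polling defaults, and the "PollingTimedOut" status are assumptions for illustration; the execution_mode argument from the excerpt above is omitted because its behavior is not described here.

```python
import time
from typing import Any, Dict, List, Optional

# Automation statuses treated as final in this sketch.
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def execute_automation(
    automation_name: str,
    parameters: Dict[str, List[str]],
    ssm_client: Optional[Any] = None,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> Dict[str, Any]:
    """Run an SSM Automation runbook and wait for it to finish.

    Hypothetical helper: the real one in the repository may differ.
    """
    if ssm_client is None:
        # Imported lazily so the sketch can be exercised with a stub client.
        import boto3

        ssm_client = boto3.client("ssm")

    started = ssm_client.start_automation_execution(
        DocumentName=automation_name,
        Parameters=parameters,
    )
    execution_id = started["AutomationExecutionId"]

    deadline = time.monotonic() + timeout_seconds
    while True:
        execution = ssm_client.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        status = execution["AutomationExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return {
                "execution_id": execution_id,
                "status": status,
                "outputs": execution.get("Outputs", {}),
            }
        if time.monotonic() >= deadline:
            # Stop waiting; the runbook itself may still be running.
            return {
                "execution_id": execution_id,
                "status": "PollingTimedOut",
                "outputs": {},
            }
        time.sleep(poll_seconds)
```

Because a Lambda invocation can run for at most 15 minutes, any polling timeout has to stay below the function's configured timeout.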
Prerequisites
Make sure you have the following prerequisites:
Deploy the solution
Complete the following steps to deploy the solution:
- Clone the GitHub repository and go to the root of your downloaded repository folder:
$ git clone https://github.com/aws-samples/sample-bedrock-agent-for-troubleshooting-aws-resources.git
$ cd sample-bedrock-agent-for-troubleshooting-aws-resources
- Install local dependencies:
$ npm install
- Sign in to your AWS account using the AWS CLI by configuring your credentials file (replace <PROFILE_NAME> with the profile name of your deployment AWS account):
$ export AWS_PROFILE=<PROFILE_NAME>
- Bootstrap the AWS CDK environment (this is a one-time activity and is not needed if your AWS account is already bootstrapped):
$ cdk bootstrap
- Deploy all the stacks of the AWS CDK application to your AWS account and AWS Region:
$ cdk deploy --all
Test the agent
Navigate to the Amazon Bedrock Agents console in your Region and find your deployed agent. You will find the agent ID in the cdk deploy command output.
You can now interact with the agent and test troubleshooting a worker node not joining an EKS cluster. The following are some example questions:
- I want to troubleshoot why my Amazon EKS worker node is not joining the cluster. Can you help me?
- Why is the instance <instance_ID> not able to join the EKS cluster <Cluster_Name>?
The following screenshot shows the console view of the agent.
The agent understood the question and mapped it to the right action group. It also noticed that the required parameters were missing from the user prompt, so it came back with a follow-up question to request the Amazon Elastic Compute Cloud (Amazon EC2) instance ID and the EKS cluster name.
We can see the agent’s thought process in trace step 1. The agent determines that it is ready to call the right Lambda function with the right API path.
With the results coming back from the runbook, the agent reviews the troubleshooting outcome. It goes through the information and then writes the solution, providing instructions for the user to follow.
In the answer provided, the agent was able to spot all the issues and turn them into solution steps. We can also see the agent mentioning the right details, such as the IAM policy and the required tag.
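The same test can also be run outside the console. The following is a minimal sketch that calls the agent through the InvokeAgent API of the Amazon Bedrock Agents runtime with tracing enabled, then separates the answer text from the trace events. The ask_agent function name is hypothetical, and the agent ID and alias ID are values you need to supply from your own deployment.

```python
import uuid
from typing import Any, Dict, List, Optional, Tuple


def ask_agent(
    agent_id: str,
    agent_alias_id: str,
    question: str,
    client: Optional[Any] = None,
    session_id: Optional[str] = None,
) -> Tuple[str, List[Dict[str, Any]]]:
    """Send one question to the agent and return (answer, trace events)."""
    if client is None:
        # Imported lazily so the sketch can be exercised with a stub client.
        import boto3

        client = boto3.client("bedrock-agent-runtime")

    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        # Reuse the same session ID across calls to continue a conversation.
        sessionId=session_id or str(uuid.uuid4()),
        inputText=question,
        enableTrace=True,
    )

    answer_parts: List[str] = []
    traces: List[Dict[str, Any]] = []
    # The completion is an event stream that mixes answer chunks and traces.
    for event in response["completion"]:
        if "chunk" in event:
            answer_parts.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            traces.append(event["trace"])
    return "".join(answer_parts), traces
```

When the agent replies with a follow-up question, for example to ask for the instance ID, call ask_agent again with the same session_id so that the agent keeps the conversation context.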
Clean up
When implementing Amazon Bedrock Agents, there are no additional charges for resource construction. However, costs are incurred for embedding model and text model invocations on Amazon Bedrock, with charges based on the pricing of each FM used. In this use case, you will also incur costs for Lambda invocations.
To avoid incurring future charges, delete the resources created by the AWS CDK. From the root of your repository folder, run the following command:
$ cdk destroy --all
Conclusion
Amazon Bedrock Agents and AWS Support Automation Workflows are powerful tools that, when combined, can revolutionize AWS resource troubleshooting. In this post, we explored a serverless application built with the AWS CDK that demonstrates how these technologies can be integrated to create an intelligent troubleshooting agent. By defining action groups within the Amazon Bedrock agent and associating them with specific scenarios and automation workflows, we’ve developed a highly efficient process for diagnosing and resolving issues such as Amazon EKS worker node failures.
Our solution showcases the potential for automating complex troubleshooting tasks, saving time and streamlining operations. Powered by Anthropic’s Claude 3.5 Sonnet, the agent can also understand and respond in languages other than English, such as French, Japanese, and Spanish, which makes it accessible to global teams while maintaining its technical accuracy and effectiveness. The intelligent agent quickly identifies root causes and provides actionable insights, while automatically executing relevant AWS Support Automation Workflows. This approach not only minimizes downtime, but also scales effectively to accommodate various AWS services and use cases, making it a versatile foundation for organizations looking to enhance their AWS infrastructure management.
Explore AWS Support Automation Workflows for additional use cases, and consider using this solution as a starting point for building more comprehensive troubleshooting agents tailored to your organization’s needs. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.
Happy coding!
Acknowledgements
The authors thank all the reviewers for their valuable feedback.
About the Authors
Wael Dimassi is a Technical Account Manager at AWS, building on his 7-year background as a Machine Learning specialist. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them.
Marwen Benzarti is a Senior Cloud Support Engineer at AWS Support where he specializes in Infrastructure as Code. With over 4 years at AWS and 2 years of previous experience as a DevOps engineer, Marwen works closely with customers to implement AWS best practices and troubleshoot complex technical challenges. Outside of work, he enjoys playing both competitive multiplayer and immersive story-driven video games.