Cisco achieves 50% latency improvement using Amazon SageMaker Inference faster autoscaling feature

This post is co-authored with Travis Mehlinger and Karthik Raghunathan from Cisco.

Webex by Cisco is a leading provider of cloud-based collaboration solutions which includes video meetings, calling, messaging, events, polling, asynchronous video and customer experience solutions like contact center and purpose-built collaboration devices. Webex’s focus on delivering inclusive collaboration experiences fuels our innovation, which leverages AI and Machine Learning, to remove the barriers of geography, language, personality, and familiarity with technology. Its solutions are underpinned with security and privacy by design. Webex works with the world’s leading business and productivity apps – including AWS.

Cisco’s Webex AI (WxAI) team plays a crucial role in enhancing these products with AI-driven features and functionalities, leveraging LLMs to improve user productivity and experiences. In the past year, the team has increasingly focused on building artificial intelligence (AI) capabilities powered by large language models (LLMs) to improve productivity and experience for users. Notably, the team’s work extends to Webex Contact Center, a cloud-based omni-channel contact center solution that empowers organizations to deliver exceptional customer experiences. By integrating LLMs, WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing, and sentiment analysis, allowing Webex Contact Center to provide more personalized and efficient customer support. However, as these LLM models grew to contain hundreds of gigabytes of data, WxAI team faced challenges in efficiently allocating resources and starting applications with the embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.

This blog post highlights how Cisco implemented faster autoscaling release reference. For more details on Cisco’s Use Cases, Solution & Benefits see How Cisco accelerated the use of generative AI with Amazon SageMaker Inference.

In this post, we will discuss the following:

Overview of Cisco’s use-case and architecture
Introduce new faster autoscaling feature
1. Single Model real-time endpoint
2. Deployment using Amazon SageMaker InferenceComponents
Share results on the performance improvements Cisco saw with faster autoscaling feature for GenAI inference
Next Steps

Cisco’s Use-case: Enhancing Contact Center Experiences

Webex is applying generative AI to its contact center solutions, enabling more natural, human-like conversations between customers and agents. The AI can generate contextual, empathetic responses to customer inquiries, as well as automatically draft personalized emails and chat messages. This helps contact center agents work more efficiently while maintaining a high level of customer service.

Architecture

Initially, WxAI embedded LLM models directly into the application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as the models grew larger and more complex, this approach faced significant scalability and resource utilization challenges. Operating the resource-intensive LLMs through the applications required provisioning substantial compute resources, which slowed down processes like allocating resources and starting applications. This inefficiency hampered WxAI’s ability to rapidly develop, test, and deploy new AI-powered features for the Webex portfolio.

To address these challenges, WxAI team turned to SageMaker Inference – a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By decoupling the LLM hosting from the Webex applications, WxAI could provision the necessary compute resources for the models without impacting the core collaboration and communication capabilities.

“The applications and the models work and scale fundamentally differently, with entirely different cost considerations, by separating them rather than lumping them together, it’s much simpler to solve issues independently.”

– Travis Mehlinger, Principal Engineer at Cisco.

This architectural shift has enabled Webex to harness the power of generative AI across its suite of collaboration and customer engagement solutions.

Today Sagemaker endpoint uses autoscaling with invocation per instance. However, it takes ~6 minutes to detect need for autoscaling.

Introducing new Predefined metric types for faster autoscaling

Cisco Webex AI team wanted to improve their inference auto scaling times, so they worked with Amazon SageMaker to improve inference.

Amazon SageMaker’s real-time inference endpoint offers a scalable, managed solution for hosting Generative AI models. This versatile resource can accommodate multiple instances, serving one or more deployed models for instant predictions. Customers have the flexibility to deploy either a single model or multiple models using SageMaker InferenceComponents on the same endpoint. This approach allows for efficient handling of diverse workloads and cost-effective scaling.

To optimize real-time inference workloads, SageMaker employs application automatic scaling (auto scaling). This feature dynamically adjusts both the number of instances in use and the quantity of model copies deployed (when using inference components), responding to real-time changes in demand. When traffic to the endpoint surpasses a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Conversely, as workloads decrease, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling ensures that resources are optimally utilized, balancing performance needs with cost considerations in real-time.

Working with Cisco, Amazon SageMaker releases new sub-minute high-resolution pre-defined metric type SageMakerVariantConcurrentRequestsPerModelHighResolution for faster autoscaling and reduced detection time. This newer high-resolution metric has shown to reduce scaling detection times by up to 6x (compared to existing SageMakerVariantInvocationsPerInstance metric) and thereby improving overall end-to-end inference latency by up to 50%, on endpoints hosting Generative AI models like Llama3-8B.

With this new release, SageMaker real-time endpoints also now emits new ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy CloudWatch metrics as well, which are more suited for monitoring and scaling Amazon SageMaker endpoints hosting LLMs and FMs.

Cisco’s Evaluation of faster autoscaling feature for GenAI inference

Cisco evaluated Amazon SageMaker’s new pre-defined metric types for faster autoscaling on their Generative AI workloads. They observed up to a 50% latency improvement in end-to-end inference latency by using the new SageMakerequestsPerModelHighResolution metric, compared to the existing SageMakerVariantInvocationsPerInstance metric.

The setup involved using their Generative AI models, on SageMaker’s real-time inference endpoints. SageMaker’s autoscaling feature dynamically adjusted both the number of instances and the quantity of model copies deployed to meet real-time changes in demand. The new high-resolution SageMakerVariantConcurrentRequestsPerModelHighResolution metric reduced scaling detection times by up to 6x, enabling faster autoscaling and lower latency.

In addition, SageMaker now emits new CloudWatch metrics, including ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are better suited for monitoring and scaling endpoints hosting large language models (LLMs) and foundation models (FMs). This enhanced autoscaling capability has been a game-changer for Cisco, helping to improve the performance and efficiency of their critical Generative AI applications.

“We are really pleased with the performance improvements we’ve seen from Amazon SageMaker’s new autoscaling metrics. The higher-resolution scaling metrics have significantly reduced latency during initial load and scale-out on our Gen AI workloads. We’re excited to do a broader rollout of this feature across our infrastructure”

– Travis Mehlinger, Principal Engineer at Cisco.

Cisco further plans to work with SageMaker inference to drive improvements in rest of the variables that impact autoscaling latencies. Like model download and load times.

Conclusion

Cisco’s Webex AI team is continuing to leverage Amazon SageMaker Inference to power generative AI experiences across its Webex portfolio. Evaluation with faster autoscaling from SageMaker has shown Cisco up to 50% latency improvements in its GenAI inference endpoints. As WxAI team continues to push the boundaries of AI-driven collaboration, its partnership with Amazon SageMaker will be crucial in informing upcoming improvements and advanced GenAI inference capabilities. With this new feature Cisco looks forward to further optimizing its AI Inference performance by rolling it broadly in multiple regions and delivering even more impactful generative AI features to its customers.

About the Authors

Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps teams develop and operate cloud-native AI and ML capabilities to support Webex AI features for customers around the world.In his spare time, Travis enjoys cooking barbecue, playing video games, and traveling around the US and UK to race go karts.

Karthik Raghunathan is the Senior Director for Speech, Language, and Video AI in the Webex Collaboration AI Group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven features for the Webex collaboration portfolio. Prior to Cisco, Karthik held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.

Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas to scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, multi-tenant models, cost optimizations, and making deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Ravi Thakur is a Sr Solutions Architect Supporting Strategic Industries at AWS, and is based out of Charlotte, NC. His career spans diverse industry verticals, including banking, automotive, telecommunications, insurance, and energy. Ravi’s expertise shines through his dedication to solving intricate business challenges on behalf of customers, utilizing distributed, cloud-native, and well-architected design patterns. His proficiency extends to microservices, containerization, AI/ML, Generative AI, and more. Today, Ravi empowers AWS Strategic Customers on personalized digital transformation journeys, leveraging his proven ability to deliver concrete, bottom-line benefits.

Cisco achieves 50% latency improvement using Amazon SageMaker Inference faster autoscaling feature

Cisco’s Use-case: Enhancing Contact Center Experiences

Architecture

Introducing new Predefined metric types for faster autoscaling

Cisco’s Evaluation of faster autoscaling feature for GenAI inference

Conclusion

About the Authors

Recent Articles

HPE adds ‘digital circuit breaker’ to protect GreenLake customers

Cast AI raises $108M to get the max out of AI, Kubernetes and other workloads

From FOMO to Opportunity: Analytical AI in the Era of LLM Agents

Responsible AI in action: How Data Reply red teaming supports generative AI safety on AWS

Meta Launches LlamaFirewall Framework to Stop AI Jailbreaks, Injections, and Insecure Code

Related Stories

Leave A Reply Cancel reply