To stay competitive, businesses across industries use foundation models (FMs) to transform their applications. Although FMs offer impressive out-of-the-box capabilities, achieving a true competitive edge often requires deep model customization through pre-training or fine-tuning. However, these approaches demand advanced AI expertise, high-performance compute, and fast storage access, and they can be prohibitively expensive for many organizations.
In this post, we explore how organizations can address these challenges and cost-effectively customize and adapt FMs using AWS managed services such as Amazon SageMaker training jobs and Amazon SageMaker HyperPod. We discuss how these powerful tools enable organizations to optimize compute resources and reduce the complexity of model training and fine-tuning. We explore how you can make an informed decision about which Amazon SageMaker service is most applicable to your business needs and requirements.
Business challenge
Businesses today face numerous challenges in effectively implementing and managing machine learning (ML) initiatives. These challenges include scaling operations to handle rapidly growing data and models, accelerating the development of ML solutions, and managing complex infrastructure without diverting focus from core business objectives. Additionally, organizations must navigate cost optimization, maintain data security and compliance, and make ML tools both easier to use and more widely accessible across teams.
Customers have built their own ML architectures on bare metal machines using open source solutions such as Kubernetes, Slurm, and others. Although this approach provides control over the infrastructure, the amount of effort needed to manage and maintain the underlying infrastructure (for example, hardware failures) over time can be substantial. Organizations often underestimate the complexity involved in integrating these various components, maintaining security and compliance, and keeping the system up-to-date and optimized for performance.
As a result, many companies struggle to use the full potential of ML while maintaining efficiency and innovation in a competitive landscape.
How Amazon SageMaker can help
Amazon SageMaker addresses these challenges by providing a fully managed service that streamlines and accelerates the entire ML lifecycle. You can use the comprehensive set of SageMaker tools for building and training your models at scale while offloading the management and maintenance of underlying infrastructure to SageMaker.
You can use SageMaker to scale your training cluster to thousands of accelerators with your own choice of compute, and optimize your workloads for performance with SageMaker distributed training libraries. For cluster resiliency, SageMaker offers self-healing capabilities that automatically detect and recover from faults, allowing for continuous FM training for months with little to no interruption and reducing training time by up to 40%. SageMaker also supports popular ML frameworks such as TensorFlow and PyTorch through managed pre-built containers. For those who need more customization, SageMaker also allows users to bring in their own libraries or containers.
To address various business and technical use cases, Amazon SageMaker offers two options for distributed pre-training and fine-tuning: SageMaker training jobs and SageMaker HyperPod.
SageMaker training jobs
SageMaker training jobs offer a managed user experience for large, distributed FM training, removing the undifferentiated heavy lifting around infrastructure management and cluster resiliency while offering a pay-as-you-go option. SageMaker training jobs automatically spin up a resilient distributed training cluster, provide managed orchestration, monitor the infrastructure, and automatically recover from faults for a smooth training experience. After the training is complete, SageMaker spins down the cluster and the customer is billed for the net training time in seconds. FM builders can further optimize this experience by using SageMaker Managed Warm Pools, which allow you to retain and reuse provisioned infrastructure after the completion of a training job for reduced latency and faster iteration time between different ML experiments.
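As a rough illustration of the flow described above, the following minimal sketch assembles a request for the SageMaker CreateTrainingJob API and opts in to a warm pool through the KeepAlivePeriodInSeconds setting. The helper name build_training_job_request, the job name, image URI, role ARN, bucket, and sizing values are all hypothetical placeholders, not values from this post. The sketch only builds the request; submitting it is left as a comment because it requires boto3 and AWS credentials.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.p4d.24xlarge",
                               instance_count=2,
                               max_runtime_seconds=86400,
                               warm_pool_seconds=0):
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API.

    All argument values are supplied by the caller; nothing is sent to AWS.
    """
    resource_config = {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "VolumeSizeInGB": 500,  # assumed size, adjust for your dataset
    }
    # A nonzero keep-alive period asks SageMaker to retain the provisioned
    # instances after the job ends (Managed Warm Pools).
    if warm_pool_seconds > 0:
        resource_config["KeepAlivePeriodInSeconds"] = warm_pool_seconds

    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": resource_config,
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


# Placeholder values for illustration only.
request = build_training_job_request(
    job_name="fm-finetune-example",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<training-image>:latest",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    output_s3_uri="s3://<your-bucket>/output/",
    warm_pool_seconds=1800,
)

# To submit the job (requires boto3 and AWS credentials):
# import boto3
# boto3.client("sagemaker").create_training_job(**request)
```

Keeping request construction separate from submission makes it easy to review or unit test the configuration before any billable resources are provisioned.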
With SageMaker training jobs, FM builders have the flexibility to choose the instance type that best fits an individual workload to further optimize their training budget. For example, you can pre-train a large language model (LLM) on a P5 cluster or fine-tune an open source LLM on P4d instances. This allows businesses to offer a consistent training user experience across ML teams with varying levels of technical expertise and different workload types.
Additionally, Amazon SageMaker training jobs integrate tools such as SageMaker Profiler for training job profiling, Amazon SageMaker with MLflow for managing ML experiments, Amazon CloudWatch for monitoring and alerts, and TensorBoard for debugging and analyzing training jobs. Together, these tools enhance model development by offering performance insights, tracking experiments, and facilitating proactive management of training processes.
AI21 Labs, Technology Innovation Institute, Upstage, and Bria AI chose SageMaker training jobs to train and fine-tune their FMs with a reduced total cost of ownership by offloading the workload orchestration and management of underlying compute to SageMaker. They delivered faster results by focusing their resources on model development and experimentation while SageMaker handled the provisioning, creation, and termination of their compute clusters.
The following demo provides a high-level, step-by-step guide to using Amazon SageMaker training jobs.
SageMaker HyperPod
SageMaker HyperPod offers persistent clusters with deep infrastructure control. Builders can connect through Secure Shell (SSH) to the underlying Amazon Elastic Compute Cloud (Amazon EC2) instances for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost to the customer), minimizing downtime for critical node replacements. Customers can use familiar orchestration tools such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS), and the libraries built on top of these tools, for flexible job scheduling and compute sharing. Additionally, orchestrating SageMaker HyperPod clusters with Slurm allows NVIDIA’s Enroot and Pyxis integration to quickly schedule containers as performant unprivileged sandboxes. The operating system and software stack are based on the Deep Learning AMI, which is preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. HyperPod also includes SageMaker distributed training libraries, which are optimized for AWS infrastructure so users can automatically split training workloads across thousands of accelerators for efficient parallel training.
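To make the idea of a persistent cluster more concrete, the following minimal sketch assembles a request for the SageMaker CreateCluster API with one controller group and one worker group, as a Slurm-orchestrated setup might use. The helper name build_hyperpod_cluster_request, the group names, instance types, counts, lifecycle script name, and ARNs are illustrative assumptions and do not come from this post. The lifecycle script referenced in OnCreate is assumed to already exist at the given Amazon S3 location.

```python
def build_hyperpod_cluster_request(cluster_name, execution_role_arn,
                                   lifecycle_s3_uri, worker_count=4):
    """Assemble keyword arguments for the SageMaker CreateCluster API.

    Nothing is sent to AWS; the caller decides when to submit the request.
    """
    def instance_group(name, instance_type, count):
        return {
            "InstanceGroupName": name,
            "InstanceType": instance_type,
            "InstanceCount": count,
            # Lifecycle scripts run when each instance is created, for
            # example to install and configure the orchestrator.
            "LifeCycleConfig": {
                "SourceS3Uri": lifecycle_s3_uri,
                "OnCreate": "on_create.sh",  # hypothetical script name
            },
            "ExecutionRole": execution_role_arn,
            "ThreadsPerCore": 1,  # assumed setting
        }

    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            instance_group("controller", "ml.m5.12xlarge", 1),
            instance_group("workers", "ml.p5.48xlarge", worker_count),
        ],
    }


# Placeholder values for illustration only.
cluster_request = build_hyperpod_cluster_request(
    cluster_name="fm-pretraining-cluster",
    execution_role_arn="arn:aws:iam::<account>:role/<hyperpod-execution-role>",
    lifecycle_s3_uri="s3://<your-bucket>/lifecycle-scripts/",
    worker_count=8,
)

# To create the cluster (requires boto3 and AWS credentials):
# import boto3
# boto3.client("sagemaker").create_cluster(**cluster_request)
```

Unlike a training job, a cluster created this way stays up until you delete it, which is what allows SSH access and reuse across many jobs.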
FM builders can use built-in ML tools in HyperPod to enhance model performance, such as using Amazon SageMaker with TensorBoard to visualize a model architecture and address convergence issues, while Amazon SageMaker Debugger captures real-time training metrics and profiles. Additionally, integrating with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana offers deeper insights into cluster performance, health, and utilization, saving valuable development time.
This self-healing, high-performance environment, trusted by customers like Articul8, IBM, Perplexity AI, Hugging Face, Luma, and Thomson Reuters, supports advanced ML workflows and internal optimizations.
The following demo provides a high-level, step-by-step guide to using Amazon SageMaker HyperPod.
Choosing the right option
For organizations that require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. HyperPod offers custom network configurations, flexible parallelism strategies, and support for custom orchestration techniques. It integrates seamlessly with tools such as Slurm, Amazon EKS, NVIDIA’s Enroot, and Pyxis, and provides SSH access for in-depth debugging and custom configurations.
SageMaker training jobs are tailored for organizations that want to focus on model development rather than infrastructure management and prefer ease of use with a managed experience. SageMaker training jobs feature a user-friendly interface, simplified setup and scaling, automatic handling of distributed training tasks, built-in synchronization, checkpointing, fault tolerance, and abstraction of infrastructure complexities.
When choosing between SageMaker HyperPod and training jobs, organizations should align their decision with their specific training needs, workflow preferences, and desired level of control over the training infrastructure. HyperPod is the preferred option for those seeking deep technical control and extensive customization, and training jobs are ideal for organizations that prefer a streamlined, fully managed solution.
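The criteria in this section can be summarized as a simple checklist. The following sketch is a deliberately simplified illustration, not an official decision tool: the TrainingNeeds fields and the recommend_option function are hypothetical names chosen here, and a real decision would also weigh cost, team skills, and workload duration.

```python
from dataclasses import dataclass


@dataclass
class TrainingNeeds:
    """Hypothetical checklist of infrastructure-control requirements."""
    needs_ssh_access: bool = False
    needs_custom_orchestration: bool = False  # for example, Slurm or Amazon EKS
    needs_custom_networking: bool = False
    needs_persistent_cluster: bool = False


def recommend_option(needs):
    """Return (option, reasons) based on the criteria discussed above.

    Any requirement for deep infrastructure control points to HyperPod;
    otherwise the fully managed training jobs experience is suggested.
    """
    reasons = [name for name, required in vars(needs).items() if required]
    if reasons:
        return "SageMaker HyperPod", reasons
    return "SageMaker training jobs", ["prefers a fully managed experience"]


# Example: a team that wants SSH access for debugging.
option, reasons = recommend_option(TrainingNeeds(needs_ssh_access=True))
print(option, reasons)
```

Returning the reasons alongside the recommendation keeps the outcome explainable when the checklist is shared across teams.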
Conclusion
Learn more about Amazon SageMaker and large-scale distributed training on AWS by visiting Getting Started on Amazon SageMaker, watching the Generative AI on Amazon SageMaker Deep Dive Series, and exploring the awsome-distributed-training and amazon-sagemaker-examples GitHub repositories.
About the authors
Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Kanwaljit Khurmi is a Principal Generative AI/ML Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Miron Perel is a Principal Machine Learning Business Development Manager with Amazon Web Services. Miron advises Generative AI companies building their next generation models.
Guillaume Mangeot is a Senior WW GenAI Specialist Solutions Architect at Amazon Web Services with over a decade of experience in High Performance Computing (HPC). With a multidisciplinary background in applied mathematics, he leads highly scalable architecture design in cutting-edge fields such as GenAI, ML, HPC, and storage, across various verticals including oil & gas, research, life sciences, and insurance.