Scaling MLflow for enterprise AI: What’s New in SageMaker AI with MLflow


Today we’re announcing a serverless capability in Amazon SageMaker AI with MLflow that dynamically manages infrastructure provisioning, scaling, and operations for artificial intelligence and machine learning (AI/ML) development tasks. It scales resources up during intensive experimentation and down to zero when not in use, reducing operational overhead. It also introduces enterprise-scale features, including access management with cross-account sharing, automated version upgrades, and integration with SageMaker AI capabilities like model customization and pipelines. With no administrator configuration needed and at no additional cost, data scientists can immediately begin tracking experiments, implementing observability, and evaluating model performance without infrastructure delays. This makes it straightforward to scale MLflow workloads across your organization while maintaining security and governance.

In this post, we explore how these new capabilities help you run large MLflow workloads—from generative AI agents to large language model (LLM) experimentation—with improved performance, automation, and security using SageMaker AI with MLflow.

Enterprise scale features in SageMaker AI with MLflow

The new MLflow serverless capability in SageMaker AI delivers enterprise-grade management with automatic scaling, default provisioning, seamless version upgrades, simplified AWS Identity and Access Management (IAM) authorization, resource sharing through AWS Resource Access Manager (AWS RAM), and integration with both Amazon SageMaker Pipelines and model customization. The term MLflow Apps replaces the previous MLflow tracking servers terminology, reflecting the simplified, application-focused approach. You can access the new MLflow Apps page in Amazon SageMaker Studio, as shown in the following screenshot.

A default MLflow App is automatically provisioned when you create a SageMaker Studio domain, streamlining the setup process. It’s enterprise-ready out of the box, requiring no additional provisioning or configuration. The MLflow App scales elastically with your usage, alleviating the need for manual capacity planning. Your training, tracking, and experimentation workloads can get the resources they need automatically, simplifying operations while maintaining performance.

Administrators can define a maintenance window when creating the MLflow App; in-place version upgrades of the MLflow App take place during that window. This keeps the MLflow App standardized, secure, and up to date while minimizing manual maintenance overhead. MLflow version 3.4 is supported with this launch and, as shown in the following screenshot, covers ML, generative AI application, and agent workloads.

Simplified identity management with MLflow Apps

We’ve simplified access control and IAM permissions for ML teams with the new MLflow App. A streamlined permission set, including actions such as sagemaker:CallMlflowAppApi, now covers common MLflow operations, from creating and searching experiments to updating trace information, making access control more straightforward to enforce.

By enabling simplified IAM permissions boundaries, users and platform administrators can standardize IAM roles across teams, personas, and projects, facilitating consistent and auditable access to MLflow experiments and metadata. For complete IAM permission and policy configurations, see Set up IAM permissions for MLflow Apps.
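To make the permission model concrete, the following is a minimal sketch of an identity-based policy built around the sagemaker:CallMlflowAppApi action named above. The policy layout follows the standard IAM JSON grammar. The resource ARN is a hypothetical placeholder, and whether this single action is sufficient for your use case is an assumption, so treat the linked IAM documentation as the authority.

```python
import json


def build_mlflow_app_policy(app_arn: str) -> dict:
    """Return an IAM policy document allowing MLflow API calls on one MLflow App."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowMlflowAppApiCalls",
                "Effect": "Allow",
                # Action name taken from this post; your setup may need more actions.
                "Action": ["sagemaker:CallMlflowAppApi"],
                "Resource": [app_arn],
            }
        ],
    }


# Hypothetical placeholder ARN: copy the real ARN from your own MLflow App.
EXAMPLE_APP_ARN = "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/example-app"

policy = build_mlflow_app_policy(EXAMPLE_APP_ARN)
print(json.dumps(policy, indent=2))
```

Scoping the Resource element to a single MLflow App ARN, rather than a wildcard, is one way to keep roles consistent and auditable across teams.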

Cross-account sharing of MLflow Apps using AWS RAM

Administrators often want to manage their MLflow infrastructure centrally while granting access to users in other AWS accounts. MLflow Apps support cross-account sharing for collaborative enterprise AI development. Using AWS RAM, AI platform administrators can share an MLflow App with data scientists working in consumer AWS accounts, as illustrated in the following diagram.

Diagram

Platform administrators can maintain a centralized, governed SageMaker domain that provisions and manages the MLflow App, and data scientists in separate consuming accounts can launch and interact with the MLflow App securely. Combined with the new simplified IAM permissions, enterprises can launch and manage an MLflow App from a centralized administrative AWS account. Using the shared MLflow App, a downstream data scientist consumer can log their MLflow experimentation and generative AI workloads while maintaining governance, auditability, and compliance from a single platform administrator control plane. To learn more about cross-account sharing, see Getting Started with AWS RAM.
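As a rough illustration of the sharing step, the sketch below wraps the AWS RAM create_resource_share call from the AWS SDK for Python (Boto3). The helper names, the share name, the account IDs, and the MLflow App ARN are all hypothetical. The RAM client is passed in as an argument so the function can be exercised without AWS credentials.

```python
from __future__ import annotations

from typing import Any


def build_share_request(
    share_name: str, app_arn: str, consumer_account_ids: list[str]
) -> dict[str, Any]:
    """Build the keyword arguments for the RAM create_resource_share call."""
    return {
        "name": share_name,
        "resourceArns": [app_arn],
        # Principals can be AWS account IDs of the consuming accounts.
        "principals": list(consumer_account_ids),
    }


def share_mlflow_app(
    ram_client: Any, share_name: str, app_arn: str, consumer_account_ids: list[str]
) -> str:
    """Create the resource share and return its ARN."""
    request = build_share_request(share_name, app_arn, consumer_account_ids)
    response = ram_client.create_resource_share(**request)
    return response["resourceShare"]["resourceShareArn"]


# Example call from the administrative account (all values are placeholders):
# import boto3
# share_arn = share_mlflow_app(
#     boto3.client("ram"),
#     "mlflow-app-share",
#     "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/example-app",
#     ["444455556666"],
# )
```

Depending on how your accounts are organized, the consuming account may still need to accept the resource share invitation before data scientists can use the shared MLflow App.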

SageMaker Pipelines and MLflow integration

SageMaker Pipelines, a serverless workflow orchestration service purpose-built for MLOps and LLMOps automation, is integrated with MLflow. You can build, execute, and monitor repeatable end-to-end ML workflows with a drag-and-drop UI or the Python SDK. When a pipeline runs, a default MLflow App is created if one doesn’t already exist. You can define an MLflow experiment name, and metrics, parameters, and artifacts are logged to the MLflow App as defined in your SageMaker pipeline code. The following screenshot shows an example ML pipeline using MLflow.
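The logging that a pipeline step performs can be sketched with the standard MLflow tracking calls (set_tracking_uri, set_experiment, start_run, log_params, log_metrics). The helper below is hypothetical and not taken from the SageMaker samples. It accepts the mlflow module as an argument so it can be tried with a stand-in object. Using the MLflow App ARN as the tracking URI is an assumption carried over from earlier SageMaker managed MLflow tracking servers, which also relied on the sagemaker-mlflow plugin.

```python
from typing import Any, Dict


def log_training_run(
    tracker: Any,
    tracking_uri: str,
    experiment_name: str,
    params: Dict[str, Any],
    metrics: Dict[str, float],
) -> str:
    """Log one run's parameters and metrics, then return the run ID.

    `tracker` is expected to be the `mlflow` module, or any object that
    exposes the same five functions used below.
    """
    tracker.set_tracking_uri(tracking_uri)
    tracker.set_experiment(experiment_name)
    with tracker.start_run() as run:
        tracker.log_params(params)
        tracker.log_metrics(metrics)
        return run.info.run_id


# Inside a pipeline step, assuming mlflow is installed (values are placeholders):
# import mlflow
# run_id = log_training_run(
#     mlflow,
#     tracking_uri="arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/example-app",
#     experiment_name="example-pipeline-experiment",
#     params={"learning_rate": 0.01},
#     metrics={"validation_accuracy": 0.93},
# )
```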

SageMaker model customization and MLflow integration

By default, SageMaker model customization integrates with MLflow, providing automatic linking between model customization jobs and MLflow experiments. When you run model customization fine-tuning jobs, the default MLflow App is used, an experiment is selected, and metrics, parameters, and artifacts are logged for you automatically. On the SageMaker model customization job page, you can view metrics sourced from MLflow and drill into additional metrics within the MLflow UI, as shown in the following screenshot.

View full metrics in MLflow

Conclusion

These features make the new MLflow Apps in SageMaker AI ready for enterprise-scale ML and generative AI workloads with minimal administrative burden. You can get started with the examples provided in the GitHub samples repository and AWS workshop.

MLflow Apps are generally available in the AWS Regions where SageMaker Studio is available, except China and US GovCloud Regions. We invite you to explore the new capability and experience the enhanced efficiency and control it brings to your ML projects. Get started now by visiting the SageMaker AI with MLflow product detail page and Accelerate generative AI development using managed MLflow on Amazon SageMaker AI, and send your feedback to AWS re:Post for SageMaker or through your usual AWS support contacts.


About the authors

Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, generative AI applications like agents, and scaling generative AI use cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can connect with Sandeep on LinkedIn to learn about generative AI solutions.

Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the Amazon SageMaker AIOps team. With over 20 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.

Jessica Liao is a Senior UX Designer at AWS who leads design for MLflow, model governance, and inference within Amazon SageMaker AI, shaping how data scientists evaluate, govern, and deploy models. She brings expertise in handling complex problems and driving human-centered innovation from her experience designing DNA life science systems, which she now applies to make machine learning tools more accessible and intuitive through cross-functional collaboration.


