AI companies have struggled with Big Data environments and with analytical and machine learning pipelines for years. Organizations expect to start driving value from AI and machine learning within a few months, but on average it takes four months to a year just to launch an AI MVP.
Why does it take so long?
Historically, there have been at least three major factors that have affected time to market for AI projects:
- Tool maturity. No system on the market is user-friendly enough to run and forget. Data scientists and ML engineers have to invest time and resources in open source tools to fill the gaps.
- Lack of expertise and cross-functional skills. AI engineers should write code with a use case in mind. They should focus on what is possible and reasonable, not on abstractions that are theoretically viable but unrealistic.
- Inadequate operations. IT teams needed a radically new way of working to reduce handoffs and enable self-service. Originally, DevOps was considered an appropriate solution.
More recently, both data scientists and ML engineers have gained access to more and better tools. They have honed their skills on the specific use cases that businesses need most. And MLOps is now available to help them become as productive as possible.
Machine learning operations (MLOps) is a practice for collaboration between AI/ML professionals and operations, to manage the ML lifecycle. The goal of MLOps is to help businesses generate value faster by building, testing, and releasing AI solutions more quickly and frequently, in a reliable environment.
The MLOps ecosystem fundamentally consists of three components:
- ML — Responsible for data acquisition, understanding of business use cases, and ML modeling
- Dev — ML modeling coupled with continuous integration and continuous deployment
- Ops — Continuous delivery coupled with data tracking and monitoring to create feedback loops
Amazon Web Services (AWS) has created an ecosystem of services to cover each of these stages.
At the core of the AWS AI stack are Amazon SageMaker, Amazon SageMaker Ground Truth, Amazon Augmented AI (Amazon A2I), Amazon SageMaker Neo, and Amazon SageMaker Studio.
Amazon SageMaker is a fully managed service that allows engineers to develop ML models quickly while removing unnecessary heavy lifting from the ML lifecycle. Amazon SageMaker Studio is an integrated ML environment for building, training, deploying, and analyzing ML models. Complementing each other, they allow developers to:
- Collect and prepare training data
- Select or build ML algorithms
- Set up and manage environments for training
- Train, debug, and tune ML models
- Manage training runs
- Deploy models in production
- Monitor models
- Validate predictions
- Scale and manage production environments
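To make the "train" step in the list above more concrete, here is a minimal sketch of how a training run could be described as a request to the SageMaker CreateTrainingJob API through boto3. The helper names `build_training_request` and `submit` are hypothetical, and the instance type, hyperparameter, and stopping condition are placeholder assumptions rather than recommendations. The request is built as a plain dictionary so it can be inspected without AWS credentials.

```python
def build_training_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Build the keyword arguments for SageMaker's CreateTrainingJob call.

    Key names follow the CreateTrainingJob API; every value is a placeholder.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # container image with the training code
            "TrainingInputMode": "File",     # copy the data set to the instance first
        },
        "RoleArn": role_arn,                 # IAM role SageMaker assumes for the job
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Hyperparameter values must be strings; "max_depth" is only an example.
        "HyperParameters": {"max_depth": "5"},
    }


def submit(request, region="us-east-1"):
    """Send the request to SageMaker. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request can be built without boto3

    client = boto3.client("sagemaker", region_name=region)
    return client.create_training_job(**request)
```

Calling `submit(build_training_request(...))` would start a managed training job; the instances exist only while the job runs.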
AWS strives to provide developers and data scientists with all the services required to build an end-to-end machine learning infrastructure. It not only makes it easier to take advantage of open source tools like Kubernetes and Kubeflow, but also develops dedicated services for them, such as Amazon SageMaker Operators for Kubernetes and Amazon SageMaker Components for Kubeflow Pipelines. Let's look at these two in more detail.
The challenge with Kubernetes is that you have to build and manage services within your Kubernetes cluster for ML. Infrastructure teams and data science/ML engineering teams should be experienced enough to manage and monitor cluster resources and containers. You have to invest in automation and monitoring, as well as train data scientists, to ensure that GPU resources have high utilization. You have to integrate disparate open-source libraries and frameworks for ML in a secure and scalable way. 
Kubernetes customers want to use managed services and features for ML while retaining the flexibility and control that Kubernetes offers. With SageMaker Operators for Kubernetes and SageMaker Components for Kubeflow Pipelines, you can use managed ML services for training, model tuning, and inference without leaving your Kubernetes environments and pipelines, and without having to learn the SageMaker APIs.
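As an illustration of what "without leaving Kubernetes" can look like, the sketch below builds a Kubernetes custom resource for a SageMaker training job as a Python dictionary and prints it as JSON. The `apiVersion`, `kind`, and `spec` field names reflect my recollection of the v1 custom resource used by SageMaker Operators for Kubernetes and should be treated as assumptions to verify against the CRD installed in your cluster. The function name `training_job_manifest` and all values are hypothetical placeholders.

```python
import json


def training_job_manifest(name, image_uri, role_arn, region, train_s3_uri, output_s3_uri):
    """Return a TrainingJob custom resource as a dict.

    Field names are assumed from the operator's v1 CRD; check your installed version.
    """
    return {
        "apiVersion": "sagemaker.aws.amazon.com/v1",  # assumed API group and version
        "kind": "TrainingJob",
        "metadata": {"name": name},
        "spec": {
            "roleArn": role_arn,
            "region": region,
            "algorithmSpecification": {
                "trainingImage": image_uri,
                "trainingInputMode": "File",
            },
            "inputDataConfig": [
                {
                    "channelName": "train",
                    "dataSource": {
                        "s3DataSource": {
                            "s3DataType": "S3Prefix",
                            "s3Uri": train_s3_uri,
                            "s3DataDistributionType": "FullyReplicated",
                        }
                    },
                }
            ],
            "outputDataConfig": {"s3OutputPath": output_s3_uri},
            "resourceConfig": {
                "instanceType": "ml.m5.xlarge",  # placeholder instance type
                "instanceCount": 1,
                "volumeSizeInGB": 10,
            },
            # Assumed shape: a list of name/value pairs rather than a mapping.
            "hyperParameters": [{"name": "max_depth", "value": "5"}],
            "stoppingCondition": {"maxRuntimeInSeconds": 3600},
        },
    }


if __name__ == "__main__":
    manifest = training_job_manifest(
        "demo-training",
        "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo:1",
        "arn:aws:iam::123456789012:role/Demo",
        "us-east-1",
        "s3://demo-bucket/train/",
        "s3://demo-bucket/out/",
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -`, at which point the operator, not the data scientist, would be responsible for talking to SageMaker.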
Customers reduce training costs by not paying for idle GPU instances. With SageMaker operators and pipeline components for training and model tuning, GPU resources are fully managed by SageMaker and used only for the duration of a job. Customers whose self-managed training GPUs sit idle 30% of the time or more should see a reduction in total cost by using SageMaker operators and pipeline components.
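The effect of idle time on cost can be illustrated with some simple arithmetic. The sketch below is a simplified model, not a reproduction of how the 30% figure was derived: it assumes cost scales only with GPU hours, ignores staffing, storage, and data transfer, and uses a made-up hourly price. The function names are hypothetical.

```python
def cost_per_useful_gpu_hour(hourly_price, idle_fraction):
    """Effective price of one GPU hour that actually runs a job.

    An always-on GPU is billed for idle time too, so the price of each
    productive hour is inflated by 1 / (1 - idle_fraction).
    """
    if not 0 <= idle_fraction < 1:
        raise ValueError("idle_fraction must be in [0, 1)")
    return hourly_price / (1 - idle_fraction)


def break_even_price_ratio(idle_fraction):
    """Highest pay-per-job price, relative to the always-on price, that still breaks even."""
    return cost_per_useful_gpu_hour(1.0, idle_fraction)


if __name__ == "__main__":
    price = 3.00  # hypothetical always-on price per GPU hour
    idle = 0.30   # GPUs sit idle 30% of the time
    print(round(cost_per_useful_gpu_hour(price, idle), 2))  # 4.29
    print(round(break_even_price_ratio(idle), 2))           # 1.43
```

Under these assumptions, a cluster that is idle 30% of the time pays about 1.43 times the list price for every productive GPU hour, so a pay-per-job service comes out ahead as long as its hourly price is less than roughly 1.43 times the always-on price.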
Customers can create hybrid pipelines with SageMaker Components for Kubeflow Pipelines that seamlessly execute jobs on AWS, on on-premises resources, and on other cloud providers.
And this is only one specific example of how AWS facilitates AI and machine learning projects.
Overall, AWS creates the environment and provides the foundation for data labeling, as well as for building, training, tuning, deploying, and managing models. Even so, a complete MLOps setup on AWS still relies on third-party tools such as Argo and Kubeflow.
If you found this overview of ML infrastructure and MLOps on AWS useful and want to learn more, listen to the on-demand webinar "MLOps and Reproducible ML on AWS with Kubeflow and SageMaker" by AWS and Provectus. Thank you!
