MLOps, or machine learning operations, is a set of best practices for machine learning development, and an MLOps platform is a software solution that codifies them. Using an MLOps platform allows you to manage every aspect of machine learning in production, so that each new update doesn't feel like an entirely new project and dovetails easily with the last.
According to Statista, organizations worldwide invested 28.5 billion US dollars in machine learning software development in 2019 alone. Yet only 35 percent of organizations deployed analytical models in production. This gap suggests that deploying and running models seamlessly can be daunting and poses a real barrier to putting these investments to work.
Challenges That MLOps Can Address
MLOps practices are very similar to DevOps, but they specifically address the best practices for machine learning. They enable shorter, regular releases that let you improve and enhance each step of the ML development lifecycle. Below we've outlined five signs that you might need an MLOps platform.
Sign 1: Only you know how your production models are trained
Machine learning gives your system the capability to learn and improve from data without you having to program everything explicitly. When working with machine learning, how you train and test your models matters a lot, and even small variations in training can cause inconsistencies in predictions. Therefore, it's paramount to document how you collect the data, how you preprocess it, what parameters you use, and so forth. The good thing about MLOps platforms is that they neatly package these steps into programmatic machine learning pipelines, which are self-documenting.
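To make the idea concrete, here's a minimal sketch using scikit-learn's Pipeline. A real MLOps platform extends the same principle across data collection, training and deployment, but the point is the same: every step and parameter lives in code rather than in one person's head.

```python
# A minimal sketch of a self-documenting pipeline using scikit-learn.
# Every preprocessing step and parameter is explicit in code, not in
# someone's head or a one-off notebook cell.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scale", StandardScaler()),                         # preprocessing is explicit
    ("model", LogisticRegression(C=0.1, max_iter=500)),  # parameters are recorded
])

# Printing the pipeline documents exactly how the model is produced.
print(pipeline)
```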
The more complicated your model training process is, the harder it is to onboard new team members to handle it -- if it's all done manually. MLOps platforms significantly reduce how much time it takes to train new people to keep your production models up to date.
Not only does this allow better communication within your team, but it also helps other teams, as they can independently review the pipelines you've built -- rather than observing manual workflows. MLOps clears productivity bottlenecks, reducing the friction between the different groups working on your production models. Again, this is especially helpful when you are working on complicated machine learning systems.
Sign 2: Work gets lost when people leave
Manual, undocumented workflows create a significant risk that knowledge is lost when a coworker leaves the team. MLOps protects your organization's interests, both externally and internally, against turnover-related losses.
If you work within the best practices of MLOps, your code and data are always versioned, and every production model can be traced back to how it was produced (i.e., its metadata). Ideally, your whole model training process is componentized and built into a pipeline with shared ownership within your team. Compare this to the horror story we often hear, where everything lives in a single notebook on a local machine.
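As an illustration, here's a minimal sketch of the kind of lineage metadata an MLOps platform records automatically. The file paths and parameters are assumptions for illustration, but the idea is that any production model can be traced back to its exact code, data and parameters.

```python
# A minimal sketch of recording model lineage metadata, assuming the
# training data lives at "data/train.csv" and the code is in a git repo.
# MLOps platforms capture this automatically; the point is traceability.
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def dataset_fingerprint(path: str) -> str:
    """Hash the training data so the exact dataset can be identified later."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

metadata = {
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip(),
    "data_sha256": dataset_fingerprint("data/train.csv"),
    "params": {"C": 0.1, "max_iter": 500},  # illustrative hyperparameters
}

# Stored alongside the model artifact, this answers "how was this
# production model produced?" long after the author has moved on.
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```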
Working around a machine learning pipeline also helps you formalize testing and set benchmarks, ensuring that the quality of production models doesn't degrade as you make changes. The same goes for setting up monitoring and alerting for production models.
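For example, a quality benchmark can be codified as a gate in the pipeline itself. The sketch below assumes a scikit-learn-style model and an illustrative baseline score; the names and threshold are placeholders, not a specific platform's API.

```python
# A minimal sketch of a quality gate inside a training pipeline: the newly
# trained candidate model must match or beat the production baseline before
# it is deployed. The 0.92 baseline is an assumed value for illustration.
import numpy as np

BASELINE_ACCURACY = 0.92  # the current production model's held-out score

def quality_gate(candidate, X_test, y_test) -> None:
    """Fail the pipeline if the candidate model degrades quality."""
    accuracy = float(np.mean(candidate.predict(X_test) == y_test))
    if accuracy < BASELINE_ACCURACY:
        raise RuntimeError(
            f"Candidate accuracy {accuracy:.3f} is below the production "
            f"baseline {BASELINE_ACCURACY:.3f}; blocking deployment."
        )
```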
Sign 3: Updating a model turns into a firefighting exercise
There are many reasons to update a production model. The most common is that the predictions your production model serves are no longer accurate because the underlying data has changed. For example, the news cycle has moved on, and your news recommendation engine is no longer serving valuable recommendations. Another reason might be that you've developed your code further and want to release new capabilities.
Either way, updating models can be a lengthy and cumbersome task, as models can take anywhere from minutes to weeks to train. We often see that retraining a model becomes a firefighting exercise for teams who work with machine learning in an ad hoc way.
Initially, a model can be built on the data scientist's local machine, but as models mature and data accumulates, training has to shift to cloud machines. This, in turn, pulls more people into the mix to spin up cloud machines (and hopefully shut them down once the model has been trained). Additionally, the cloud environment differs from a local machine, and DevOps knowledge is required to package the right dependencies into Docker containers for the training to run.
Without an MLOps platform, this hassle repeats every time a model is retrained. A managed MLOps platform can reduce -- if not eliminate -- the firefighting that comes with machine orchestration, dependency management and scaling, meaning that your team can update old models and develop new ones without requiring more DevOps resources.
Additionally, MLOps platforms generally have APIs. Your model training pipelines can be integrated with existing systems so that you don't have to manually start training or hand over files. For example, new data in your data lake can trigger retraining, and the trained model can be pushed to your end-user application's CI/CD without a single click.
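As a sketch of what that integration can look like: the endpoint, token and response fields below are hypothetical, since every platform has its own API, but the general shape of an API-triggered retraining call is usually similar.

```python
# A minimal sketch of triggering retraining through an MLOps platform's
# API when new data lands in the data lake. The URL, token, payload and
# response field are hypothetical placeholders.
import requests

def trigger_retraining(dataset_uri: str) -> str:
    response = requests.post(
        "https://mlops.example.com/api/pipelines/news-recommender/runs",  # hypothetical endpoint
        headers={"Authorization": "Bearer YOUR_API_TOKEN"},
        json={"inputs": {"training_data": dataset_uri}},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["run_id"]  # hypothetical response field

# Called by, e.g., a data-lake event hook when a new partition arrives:
# run_id = trigger_retraining("s3://datalake/news/2024-06-01/")
```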
Sign 4: No idea how the model in production is performing
If you're not using any MLOps system, you have no visibility into how your production model is performing.
Standard monitoring of request counts, latency, and uptime is quite common for production models, but we see a new trend emerging. Machine learning observability is all the rage right now, and many new entrants focused on it, like Arize AI and Fiddler, are coming to market. It's not only about measuring data drift in the aggregate but also about catching outliers in the predictions you serve and figuring out whether your models are biased against a particular user group.
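As a taste of what aggregate drift detection involves, here's a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy. Dedicated observability tools go much further, but the core comparison is between the distribution a model was trained on and the one it sees in production.

```python
# A minimal sketch of aggregate data-drift detection: compare the
# distribution of a feature at training time against what the model
# sees in production, using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_feature: np.ndarray,
                 live_feature: np.ndarray,
                 alpha: float = 0.01) -> bool:
    """Return True if the live distribution differs significantly."""
    statistic, p_value = ks_2samp(train_feature, live_feature)
    return p_value < alpha

# Synthetic example: the live feature has shifted since training.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.4, scale=1.0, size=5_000)
print(detect_drift(train, live))  # True: drift detected, time to retrain
```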
An MLOps platform might offer some monitoring out of the box, but more importantly, it should integrate with the more advanced monitoring and observability tools that suit your needs. These tools should be able to alert you automatically and trigger retraining and redeployment in your MLOps platform, ensuring you are always serving the best possible predictions.
Sign 5: Governance audits keep you up at night
Machine learning algorithms come under heavy regulatory and ethical scrutiny because their behavior is not explicitly defined by a person. As a data scientist, you might therefore be facing frequent governance audits, a prospect that makes many people nervous.
Working within an MLOps platform makes reproducibility a priority and introduces version control for every step of the machine learning pipeline. Think of this as bookkeeping: when your books are in order, there is nothing to worry about.
Additionally, when you build your whole training process into a pipeline, you codify checkpoints and safeguards for what the training data should look like and what predictions to expect. You won't lie awake before a governance audit, because you know your ML system and production models play by stable, documented rules. All of your models can be reproduced and shown to comply with your standards, no matter how stringent machine learning guidelines become.
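For instance, a data checkpoint at the start of the pipeline can codify those expectations in a form you can show an auditor. The column names and rules in this sketch are assumptions for illustration.

```python
# A minimal sketch of a data checkpoint at the start of a training
# pipeline: codified expectations about the training data. The columns
# and rules here are illustrative assumptions.
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "article_id", "clicked"}

def validate_training_data(df: pd.DataFrame) -> None:
    """Fail fast, before training, if the data breaks expectations."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Training data is missing columns: {missing}")
    if df["clicked"].isna().any():
        raise ValueError("Label column 'clicked' contains missing values")
    if not df["clicked"].isin([0, 1]).all():
        raise ValueError("Label column 'clicked' must be binary (0/1)")
```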
Conclusion
Confidence and efficiency are the key outcomes of adopting MLOps practices and an MLOps platform: confidence that everything is working as expected, and efficiency because you won't need to repeat work manually.
For more information, read about how to get started with MLOps, whether you should build or buy an MLOps platform, and how the popular MLOps platforms compare to each other.