We’re excited to announce a new experimental feature on the Valohai MLOps platform: Smart Instance Selection. This feature is designed to optimize the execution of machine learning jobs by intelligently selecting compute instances based on where data has been previously processed. By leveraging historical data locality, Smart Instance Selection can significantly reduce job execution times.
In this post, we’ll look at how Smart Instance Selection works, discuss its potential impact on ML workflows, and explain how you can take part in this experimental phase and help refine this capability.
The challenge of data transfer overhead
Many machine learning teams face challenges beyond building more effective models. One of the most pressing is using compute resources efficiently enough to scale their operations further.
Whether you're working with cloud-based services or on-premises hardware, the overhead associated with transferring large datasets across storage services and compute instances can introduce substantial delays and increased costs. Smart Instance Selection addresses this by automatically detecting and prioritizing instances that already have the necessary data cached from previous jobs.
This feature could be particularly beneficial for those handling repetitive tasks or continuous integration and deployment pipelines, where the same data is frequently reprocessed or where processing tasks have common datasets. By reducing data transfer times, teams can accelerate iteration cycles, improve productivity, and focus more on model optimization and less on operational work.
How Smart Instance Selection Works
Traditionally, when a new job is submitted to the Valohai MLOps platform, the system assigns it to the next available compute instance matching the requested specifications, based on a first-in, first-out (FIFO) queue.
With Smart Instance Selection enabled, the platform takes a more proactive approach by analyzing historical job data to identify instances with the highest cache hit rates. When a new job is submitted, the system will prioritize assigning it to an instance that already has the necessary data cached, reducing the need for data transfer and accelerating job execution.
However, to ensure jobs are processed in a timely manner, the system will also consider the age of the job in the queue. If a job has been waiting for a while, it will receive priority even with a low cache hit rate. This ensures that jobs are not unnecessarily delayed due to cache considerations. As a fallback, if no instances with cached data are available, the system will revert to the default FIFO queueing behavior.
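The scheduling logic described above can be sketched in a few lines of Python. This is an illustrative model only, not Valohai's actual implementation: the class names, the `MAX_WAIT_SECONDS` cutoff, and the way cache hit rate is computed are all assumptions made for the sake of the example.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    job_id: str
    required_blobs: frozenset  # dataset identifiers the job needs
    submitted_at: float        # epoch seconds when the job was queued

@dataclass
class Instance:
    instance_id: str
    cached_blobs: set          # dataset identifiers already on local disk

# Assumed cutoff after which queue age outweighs cache considerations.
MAX_WAIT_SECONDS = 300

def cache_hit_rate(job: Job, instance: Instance) -> float:
    """Fraction of the job's required data already cached on the instance."""
    if not job.required_blobs:
        return 0.0
    return len(job.required_blobs & instance.cached_blobs) / len(job.required_blobs)

def pick_job(queue: list, instance: Instance, now: Optional[float] = None) -> Optional[Job]:
    """Choose the next job for a freed instance.

    A job that has waited past MAX_WAIT_SECONDS is taken in FIFO order
    regardless of cache state, so cache-aware scheduling never starves it.
    Otherwise the job with the best cache hit rate wins, with FIFO order
    as the tie-breaker -- which also covers the fallback case where no
    job has any data cached (plain FIFO behavior).
    """
    if not queue:
        return None
    now = time.time() if now is None else now
    oldest = min(queue, key=lambda j: j.submitted_at)
    if now - oldest.submitted_at > MAX_WAIT_SECONDS:
        return oldest
    return max(queue, key=lambda j: (cache_hit_rate(j, instance), -j.submitted_at))
```

For example, a job whose entire dataset is already cached on the instance would be picked ahead of an older job with no cached data, until that older job's wait time crosses the cutoff.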
How to Enable Smart Instance Selection
Note that Smart Instance Selection is an experimental feature that may have unforeseen impacts on your workflows. We recommend enabling this feature in a test environment first to evaluate its performance and compatibility with your existing processes.
Organization administrators can enable Smart Instance Selection by navigating to the Environment settings in the Valohai MLOps platform and toggling the Smart Instance Selection experimental feature. Once enabled, the system will take 1-2 hours to analyze historical job data before Smart Instance Selection becomes fully operational.
Next steps
We look forward to hearing your feedback on Smart Instance Selection and how it impacts your machine learning workflows. Your insights will help us refine this feature and ensure it meets the needs of our users.
As we continue to develop and enhance the Valohai MLOps platform, we are committed to delivering innovative solutions that streamline ML operations and empower data science teams to achieve their goals more efficiently.
If you’re not a Valohai user yet, you can get started by booking a meeting with our Customer Team. You’ll get a platform demo tailored to your needs, with a custom environment set up for you to explore the platform further.
Alternatively, you can get a preview of Valohai’s capabilities using our self-service trial.