The cloud is just somebody else’s computer
It’s a running joke among developers that the cloud is just a word for somebody else’s computer. But the fact remains that by leveraging the cloud you can reap benefits that you couldn’t achieve with an on-premises server farm.
The hyperscale cloud providers (AWS, Azure, GCP) are able to offer a lower total cost of ownership while delivering superior features, from scalability to security. It doesn’t make financial sense to build everything in-house when you can get it off the shelf for only the time you need it. The cloud vendors are constantly innovating with solutions such as serverless computing, where you pay only for the time your code runs rather than for the time servers sit idle waiting for requests. They are also able to attract talent specialized in areas such as scalability and security in ways that would be impossible for most individual companies.
Cloud and Machine Learning
Most machine learning experimentation starts with understanding your data on your laptop and doesn’t require that much computation power. But very quickly you will need more compute than your local CPU can provide. The cloud is by far the most scalable place to do machine learning. You’ll get access to the latest GPUs, or even TPUs, that you wouldn’t be able to afford and maintain on your own.
However, on-premises server farms remain a valid alternative when you meet some of the following criteria:
- You need calculation capacity 24/7 (e.g. due to large teams or large models)
- You have sensitive data that cannot leave your data centers for compliance or other reasons
In many cases a hybrid solution may also be viable: preliminary model testing is done on-premises, while production-ready models are developed on more powerful cloud machines, for example by trying out hundreds of model variations before choosing one for inference. In most cases, and with the right tools, the cloud is the fastest way to do both prototyping and production-ready machine learning.
The three main reasons for doing machine learning in the cloud:
- Flexible resource usage to combat spiky hardware resource needs
- Access to the newest hardware at the click of a button
- A decoupled architecture that isn’t bound to specific hardware
Best practices for ML cloud implementations
Having worked with hundreds of companies doing machine learning, we’ve seen the benefits of both on-premises hardware as well as cloud computation. The key takeaways are the following:
- There are differences between cloud providers, especially for enterprises looking for data privacy – find the best one for you
- Ensure your data is in the same cloud data center as the computation you’re going to use. Training models on large data sets becomes slow if your data needs to be transferred from, say, on-premises storage on the other side of the world for each run. For example, if you train on AWS EC2, store your data in S3 in the same region to minimize latency.
- For structured data (tabular data in a database or Spark cluster), use a pre-processing step that first dumps the data into flat files. Files offer a persistence layer for database queries and thus improve version control and reusability when you’re training models. Follow the golden principle of “extract once, train indefinitely”.
- Hybrid solutions are a best practice: If you already have on-premises hardware, whether it’s a single Tesla V100 or a server farm of Titan RTXs, use it. Transitioning to the cloud doesn’t happen overnight, and that is fine.
- Get a machine learning platform that abstracts the hardware away from your data scientists.
- Let your data scientists concentrate on data science – not DevOps or MLOps, regardless of which cloud or on-premises hardware they use.
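The data co-location advice above can be turned into a quick sanity check before kicking off a training run. This is a minimal sketch, assuming AWS: the bucket name and the commented boto3 calls are hypothetical placeholders, and the only real S3 quirk encoded here is that `get_bucket_location` reports `us-east-1` buckets as `None`.

```python
def same_region(bucket_region, compute_region):
    """Return True if an S3 bucket and a compute instance share a region.

    S3's get_bucket_location API returns None (no LocationConstraint)
    for buckets in us-east-1, so we normalize that case.
    """
    return (bucket_region or "us-east-1") == compute_region

# Hypothetical usage against a real AWS account (requires credentials):
#   import boto3
#   loc = boto3.client("s3").get_bucket_location(
#       Bucket="my-training-data")["LocationConstraint"]
#   print(same_region(loc, boto3.session.Session().region_name))

print(same_region(None, "us-east-1"))         # legacy us-east-1 bucket: True
print(same_region("eu-west-1", "us-east-1"))  # cross-region: False
```

A check like this is cheap to run at the start of a pipeline and catches the silent cross-region transfers that make large training jobs slow and expensive.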
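The “extract once, train indefinitely” principle above can be sketched in a few lines. This is an illustrative toy, not a production pipeline: the in-memory SQLite table stands in for your real database, and the file name is made up.

```python
import sqlite3

import pandas as pd

# Stand-in for a production database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("eu", 10.0), ("us", 20.0), ("eu", 5.0)])
conn.commit()

# Extract once: run the query a single time and persist the result
# as a flat file that can be versioned alongside the experiment.
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
df.to_csv("sales_snapshot.csv", index=False)

# Train indefinitely: every later training run reads the frozen
# snapshot, not the live database, so results stay reproducible
# even after the source table changes.
snapshot = pd.read_csv("sales_snapshot.csv")
print(len(snapshot))  # 3 rows, identical on every run
```

Because the snapshot is a plain file, it can be copied to the same cloud storage as your compute, checksummed, and re-used across hundreds of training runs without re-querying the database.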
Valohai machine learning management platform
Valohai is built for managing your hardware, both in the cloud and on-premises. It is the fastest way to test out different cloud providers and compare them to each other and to your own hardware. Sign up for a discussion with our sales engineers today to talk about your needs!