
As machine learning projects grow in complexity, data science teams often face the challenge of working with tens of thousands to millions of files. Whether for processing massive image datasets, training models on huge collections of text snippets, or managing structured data spread across countless files, handling massive datasets can be both time-consuming and error-prone.
To address these challenges, we are introducing several enhancements in Valohai designed to streamline ML workflows involving a massive number of files, from dataset creation and preprocessing to model training and data lineage tracking.
The challenges of large numbers of files
When working with extremely large datasets, ML practitioners often run into three issues:
1. Performance and scalability
Transferring, reading, and writing hundreds of thousands to millions of files individually is slow and resource-intensive. Startup times for ML jobs drag on because every file must be downloaded before computation begins, making experimentation less agile.
2. Complexity in versioning and metadata management
Another significant challenge is versioning, tagging, and organizing large numbers of files. Ideally, files should be versioned into datasets, tagged with metadata, and assigned properties to ensure traceability and consistency. However, managing this process when dealing with millions of files can quickly become unwieldy.
3. Limited flexibility in data access
A rigid data access model in which all data must be pre-downloaded before an ML job can start hinders prototyping, debugging, and incremental processing. Without flexible, on-demand access patterns, teams waste time and resources waiting for data that may not even be needed for the entire run.
These challenges affect every aspect of an ML workflow, from preprocessing steps (like image resizing or feature extraction) to training large-scale models. They also make it harder to reliably trace which files influenced which results or models.
While bundling data into archives or formats like Parquet or TFRecord can speed up transfers, these methods come with trade-offs, such as reduced individual file accessibility and increased complexity in mixing or selectively removing files. In practice, dealing with large numbers of files remains a necessity for many ML projects, making their efficient management and processing critical for maintaining productivity and scalability.
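The archive trade-off described above can be made concrete with a minimal stdlib sketch (the file names and counts are purely illustrative): bundling many small files into one compressed tarball makes the transfer a single operation, but touching an individual file afterwards means going through the archive, and removing one file would require rewriting the whole archive.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create a handful of small files standing in for a large dataset.
    for i in range(5):
        (root / f"sample_{i}.txt").write_text(f"record {i}")

    # Bundling: one archive is a single upload/download instead of many.
    archive = root / "dataset.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for f in sorted(root.glob("sample_*.txt")):
            tar.add(f, arcname=f.name)

    # The trade-off: individual files are no longer directly addressable;
    # listing or extracting one member means opening the whole archive.
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()

print(names)
```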
New enhancements for handling large numbers of files in Valohai
Valohai’s latest updates tackle these challenges head-on, offering tools and conventions that simplify working with large numbers of files and reduce overhead at every stage.
1. Dataset packaging
Instead of handling tens of thousands or millions of individual files, you can now opt to package these files into a single archive at the time of dataset creation. Valohai will automatically create a package from your files and store both the package and the individual files in your object storage.
This results in faster job starts. When the whole dataset is used as input, Valohai will download the package and extract it to the job’s working directory before the job starts. By downloading a single large package instead of a massive number of tiny files, you can significantly cut startup overhead.
For large training jobs or massive preprocessing steps, dataset packaging slashes startup overhead without sacrificing fine-grained data access.
2. On-demand inputs
By default, Valohai downloads all input files before starting a job. However, this approach can be inefficient when working with a large number of files. To address this, we’ve introduced on-demand inputs: instead of downloading every file upfront, your jobs can start processing immediately, fetching only the files they need as they go.
- Immediate execution starts: Begin training or processing without the long wait for all data to arrive.
- Reduced overhead: Pulling in files as needed lowers memory usage and avoids unnecessary data transfer.
On-demand inputs generate a simple text file in the job’s working directory with a list of the job’s input files. In addition, you can get the object storage URLs for the input files via Valohai's API. Your code can then use this list to decide which files to download and when. Valohai also provides a helper library to simplify the process of downloading files on demand, so you won’t have to write the file-downloading logic yourself.
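As a rough illustration of the pattern, the sketch below parses a plain text listing of input files and selects only the ones a job actually needs; the listing file name, its format, and the file paths here are all assumptions for demonstration, and in a real job you would use the helper library or the object storage URLs from Valohai's API to fetch the selected files.

```python
import tempfile
from pathlib import Path

def select_inputs(listing_path, wanted_suffix):
    """Read the job's input listing and keep only the files we need.

    Assumes a simple one-path-per-line text format; check the Valohai
    documentation for the actual listing format your job receives.
    """
    lines = Path(listing_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip().endswith(wanted_suffix)]

# Simulate the kind of listing an on-demand job might see (names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    listing = Path(tmp) / "inputs.txt"
    listing.write_text(
        "images/cat_0001.jpg\n"
        "images/cat_0002.jpg\n"
        "labels/annotations.csv\n"
    )
    jpgs = select_inputs(listing, ".jpg")

print(jpgs)
```

Only the selected files would then be downloaded, so a debugging run over a handful of images never pays for the full dataset transfer.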
3. A new convention for dataset creation, tagging, and properties
Previously, creating a new dataset version or adding tags or properties to an ML job's outputs in Valohai required sidecar files (one per output file). This approach doesn't scale well to thousands or millions of files: the sidecar files, though individually small, add up to significant overhead. Sidecar files are still supported, but we've introduced a more scalable way to define datasets and manage metadata:
- valohai.metadata.jsonl: A single properties file within your output directories defines dataset versions, assigns tags, and sets properties for entire sets of files.
- Streamlined management: Instead of a per-file sidecar, you now have a structured, scalable approach that can handle millions of files efficiently.
- Helper functions: The valohai-utils library includes helper functions that handle the properties file for you.
This new convention simplifies many ML tasks, from large-scale preprocessing pipelines to creating reproducible dataset versions with all necessary metadata intact.
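To make the convention tangible, here is a minimal sketch that writes a single valohai.metadata.jsonl into an output directory. The per-line schema ({"file": ..., "metadata": ...}) and the property keys ("valohai.tags", "valohai.dataset-versions") are assumptions modeled on Valohai's sidecar conventions; verify them against the documentation, or simply use the valohai-utils helpers instead of writing the file by hand.

```python
import json
import tempfile
from pathlib import Path

def write_properties_file(output_dir, entries):
    """Write one valohai.metadata.jsonl covering every output file.

    `entries` maps output filenames to metadata dicts. The line schema
    used here is an assumption -- check Valohai's documentation for the
    authoritative format.
    """
    path = Path(output_dir) / "valohai.metadata.jsonl"
    with path.open("w") as f:
        for filename, metadata in entries.items():
            f.write(json.dumps({"file": filename, "metadata": metadata}) + "\n")
    return path

# Illustrative metadata: tag two output files in a single properties file
# instead of writing one sidecar file per output.
entries = {
    "images/cat_0001.jpg": {"valohai.tags": ["train"]},
    "images/cat_0002.jpg": {"valohai.tags": ["validation"]},
}
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = write_properties_file(tmp, entries)
    lines = jsonl_path.read_text().splitlines()
```

One file per output directory replaces millions of sidecars, which is what makes the convention scale.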
4. Improved lineage tracking
Valohai’s trace view provides lineage tracking across datasets, executions, and models. This view has been overhauled to better handle large numbers of files, ensuring that you can easily trace the lineage of large datasets without getting lost in the details:
- Summarized views: Instead of enumerating every single file, large datasets appear as single entities, with a sample of representative files visible at a glance.
- Deep drill-down: Click on a file or dataset to explore its contents in detail and take quick actions, such as downloading the file(s).
- Faster navigation: Even with millions of files, you retain the ability to trace file origins, transformations, and usage in models efficiently.
This refined experience ensures that you can audit and understand the provenance of large numbers of files easily.
Embrace scalable data operations with Valohai
As ML datasets continue to grow, old methods for handling data become bottlenecks. Valohai’s new features address these pain points directly, enabling you to:
- Start large-scale jobs faster.
- Access and manage data more flexibly.
- Maintain clear, scalable metadata practices.
- Effortlessly track lineage, even with millions of files.
By adopting these enhancements, your team can focus on building robust models and driving innovation instead of wrestling with massive, unwieldy datasets. With Valohai’s streamlined approach to large numbers of files, you can confidently scale your ML workflows to new heights.
If you’re not a Valohai user yet, you can get a preview of dataset management in Valohai using our self-service trial or by chatting with our Customer Team.