
How Arquimea Research Center Doubled GPU Utilization and Avoided €180K-270K in Hardware Costs

ARC doubled GPU utilization and eliminated 80-90% of knowledge transfer overhead—turning the same hardware into a research factory


Key Results

75-85%

GPU utilization across all machines, up from ~50%, with zero new hardware

2-10 hrs → minutes

Knowledge transfer reduced to "here's the Valohai link, copy and run"

30-50% more

Compute capacity absorbed without expansion—one project's YoY growth handled internally

Zero

Infrastructure complaints—down from weekly GPU battles

Company

Arquimea Research Center (ARC) is the innovation and research hub of Arquimea Group, a global technology company. ARC focuses on 3D reconstruction and computer vision, with workflows that start with hundreds of images + metadata and produce 3D models and derived outputs.

They run tens of thousands of experiments per year across ~20 researchers in four teams, all on fully on-premise GPU infrastructure.

The AI research group of Arquimea Research Center has a dedicated fleet of:

22 GPUs across 7 machines—five servers with dual 3090s, one with four 3090s, and one NVIDIA DGX with 8× A100 80GB GPUs (two split via MIG for flexible workloads).

Before Valohai: The "Convenient Machine" Monopoly

ARC had multiple GPU servers across different networks and VPNs. On paper, plenty of compute. In practice, everyone fought for one "big" machine with 80GB GPUs.

Why? That machine had the easiest data access. Moving data to other servers meant manual sync, path reconfiguration, and ad-hoc setup differences. Result: the big box was perpetually contested while other GPUs sat 50% idle.

The root cause wasn't scheduling—it was that infrastructure friction created a compute monopoly.

Traditional HPC schedulers wouldn't fix this. The team needed something that eliminated the friction entirely: unified data access, zero-setup job submission, and built-in experiment tracking. Not another tool to learn, but infrastructure that gets out of the way.

Meanwhile, Mario Alfonso Arsuaga, ARC's Principal Researcher for AI, spent 1-4 hours/week negotiating GPU access manually. During deadline crunches, coordination overhead exploded into endless Slack threads and priority battles.


We had to spend so much time just figuring out who could run where, and when. It became a bottleneck every time we hit shared deadlines.

Mario Alfonso Arsuaga, Principal Researcher, AI, Arquimea Research Center

Knowledge Transfer Tax

When someone built a good experiment, sharing it triggered a 2-10 hour process:

  1. Author documents environment, paths, Docker details, common errors
  2. Live tutorial session (1-2 hours) walking teammates through execution
  3. Follow-up debugging for "doesn't work on my setup" issues

At 4-8 sessions per week, the organization burned 8-80 hours weekly on what's now a copy-paste operation. Senior researchers spent more time teaching Docker mounts than doing research.


People don't need to learn all the details, they just wanted to get the result. But to get the result, they had to understand Docker images, mounts, inputs, outputs… It was a very disgusting process for everyone involved.

Mario Alfonso Arsuaga, Principal Researcher, AI, Arquimea Research Center

Why Valohai: One Control Plane, Zero Infrastructure Friction

ARC deployed Valohai fully on-premise, connecting all GPU servers into one unified system:

  • Researchers see one environment, not scattered machines across networks
  • Jobs land wherever capacity exists—inputs/outputs handled through data abstraction, no manual transfers
  • Provenance is automatic—every run captures code, Docker image, data versions, parameters, hardware requirements
  • Web UI + CLI—researchers choose their comfort level, from browser-based submission to full programmatic control

This positioned them for future cloud bursting: same workflows, same Valohai interface, just more compute plugged in during peak periods.
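
Under the hood, the unit of work is a step declared in the project's valohai.yaml. Here is a minimal sketch of what such a declaration could look like; the step name, Docker image, and data URL are illustrative, not ARC's actual configuration:

    - step:
        name: reconstruct
        image: nvcr.io/nvidia/pytorch:24.01-py3           # illustrative image
        command:
          - python reconstruct.py {parameters}
        inputs:
          - name: images                                   # fetched to /valohai/inputs/images/ on whichever machine runs the job
            default: s3://arc-datasets/scene-001/*.jpg     # illustrative URL; any connected data store works
        parameters:
          - name: iterations
            type: integer
            default: 30000

Because inputs are declared rather than hard-coded as local paths, the same step can run on any connected machine, and every execution records the exact image, data versions, and parameters it used.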

Seamless Cloud Integration When Needed

Connecting AWS required minimal effort and feels seamless to researchers—they just select a different environment from a dropdown or change a value in their YAML. Everything runs inside ARC's own cloud environment, with all code and data staying under their control.
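
As a sketch of that YAML change, assuming a hypothetical environment slug (the real slugs are listed per Valohai installation), the switch can be a single key on the step:

    - step:
        name: reconstruct
        # ...same image, command, and inputs as the on-prem sketch above...
        environment: aws-eu-west-1-g5-2xlarge    # hypothetical cloud environment slug; omit it to use the default on-prem queue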

This gives ARC the ability to burst to both CPU and GPU machines in the cloud when needed. It functions as a natural extension of their on-prem infrastructure, making it easy for researchers to scale up without changing how they work.

Impact: Same Hardware, 2× the Research Output

All GPUs Are Equal Now (And They're Actually Busy)

With fair queuing and automatic data handling:

  • The "big box" now runs only workloads that truly need 80GB—at high utilization
  • Smaller machines actively used instead of forgotten
  • Idle time collapsed from ~50%; most GPUs are now busy or have jobs queued

Behavior changed: Researchers started queuing overnight and weekend jobs without coordination. And because data transfer friction disappeared, teams experimented with configurations they'd never tried before.

One project alone demanded 30-50% more compute than last year. Without Valohai, that growth would've triggered a purchase of 2-3 new servers. Instead, they absorbed it by unlocking existing idle capacity.


Last year, someone complained about GPUs almost every week. This year, with more demanding projects and more runs, nobody complains. Everyone feels like they have more GPUs but it's the same hardware.

Mario Alfonso Arsuaga, Principal Researcher, AI, Arquimea Research Center

Knowledge Transfer Collapsed to "Here's the Link"

Valohai's "copy execution" feature eliminated the tutorial process:

  • Share a Valohai run link (contains everything: code, image, data, params)
  • Recipient clicks "copy," tweaks parameters, hits run
  • Maybe a few chat questions, no sessions or debugging marathons

~30% of all organizational runs are now copied executions launched from the web UI. For some teams, copied executions outnumber fresh VS Code launches 5-10×.

Mario's personal workflow: "Most days I don't even open VS Code. I go to Valohai, find a run, copy it, tweak params, launch. About half my work starts this way."

The onboarding impact is dramatic: before Valohai, a new researcher needed 1-3 weeks to learn the Docker/VS Code/SSH stack before running anything real. Now, new team members contribute from day one—senior researchers prepare a Valohai step with parameters defined, and newcomers start by copying executions with different inputs.

Organic Code Promotion from Usage Patterns

Before: Fast experiments lived in long-running Docker containers. 3-4 times/year, code was lost to bad git hygiene or crashes, requiring days to recreate.

Now: Researchers run quick tests as ad-hoc Valohai executions. If results are interesting, teammates discover and copy that execution. Popular runs become templates. Monthly, engineering-focused members review most-copied executions and merge the best patterns into official pipelines.

The safety net proved its value multiple times this year. In one case, a researcher developed a script for splitting 3DGS (3D Gaussian splatting) models using a segmentation algorithm, iterating through Valohai to test results. When they accidentally deleted the local folder with no git commit, the code was gone. Recovering it from the saved ad-hoc execution in Valohai took less than an hour of searching.

Pipelines and Reuse: Designing for the Research Factory

ARC teams lean on Valohai Pipelines, especially Mario's team:

  • Most of their workloads are multi-step 3D reconstruction workflows
  • They use Valohai's pipeline reuse features to avoid rerunning expensive steps

Over time, this changed how they design projects:

  • Reusable steps (e.g., heavy preprocessing) live in well-defined places
  • Later steps can reuse outputs from previous runs
  • Change the last step → reuse all earlier steps from existing pipeline runs (see the sketch below)
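
A sketch of how such a pipeline could be declared in valohai.yaml, with hypothetical step and input names; Valohai's node reuse then lets a new pipeline run pull the outputs of unchanged early nodes from a previous run instead of recomputing them:

    - pipeline:
        name: reconstruction-pipeline
        nodes:
          - name: preprocess            # expensive; run once, then reused across later pipeline runs
            type: execution
            step: preprocess
          - name: reconstruct
            type: execution
            step: reconstruct
          - name: evaluate              # the node that changes most often
            type: execution
            step: evaluate
        edges:
          - [preprocess.output.*, reconstruct.input.images]
          - [reconstruct.output.*, evaluate.input.model]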

Tens of Thousands of Experiments, Minimal Friction

Mario's team alone ran tens of thousands of experiments this year—many small inference jobs launched directly from the UI.

"I'm pretty sure we wouldn't have run this many experiments without Valohai. For my own work, I simply wouldn't have done it—I do most of it from the UI."

The organization ran 20,000-50,000 experiments this year through Valohai—and that's across just 20 researchers on 22 GPUs.


When I look at our team chat, every day there are Valohai links flying around. That's how we share work now: not documents, not tutorials, just runs you can copy and reuse.

Mario Alfonso Arsuaga, Principal Researcher, AI, Arquimea Research Center

What Changed (And What Didn't)

ARC didn't set out to "industrialize" ML research. They just wanted researchers to stop fighting over GPUs and wasting time on tutorials.

What emerged, almost by accident, was a research factory:

  • One control plane for scattered on-prem GPUs (soon extended to cloud bursts)
  • Fair queuing replaced weekly negotiation meetings
  • Copy-and-run became the default way to share and reproduce work
  • Ad-hoc executions served triple duty: collaboration mechanism, audit trail, and safety net for lost code

The result: researchers stay focused on science while infrastructure, GPU allocation, and environment management "just work" in the background.

What didn't change matters too: No codebase rewrites. No new DevOps headcount. No GPU purchases. They made the existing infrastructure work harder, not bigger.

Impact Summary

Annual ROI

2-3×

return on investment

Metric | Before Valohai | After Valohai | Business Impact
GPU Utilization | ~50% idle | 75-85% utilized | €180K-270K in avoided hardware cost
Knowledge Transfer | 2-10 hrs/session, 4-8×/week | Link → copy → run | €40K-150K annual savings
Onboarding | 1-3 weeks to first experiment | Day 1 productivity | New researchers contribute immediately
Reproducibility | Manual, often failed | 100% rerunnable | Zero "can't reproduce" failures
Sysadmin Overhead | 1-4 hrs/week negotiating | ~1 hr/month | 150 hrs/year freed
Quota Escalations | Weekly battles | 2-3 times/year | Coordination bottleneck eliminated
Experiment Volume | Baseline | 20K-50K runs/year | Research factory output
Code Recovery | 5-30 hrs to recreate lost work | <1 hr to recover | €2K-5K rework avoided

Valohai is one of the best improvements we did this year. People focus on the research they care about, and infrastructure just happens behind the scenes.

Mario Alfonso Arsuaga, Principal Researcher, AI, Arquimea Research Center

Future-proof your ML operations

Trace and reproduce all ML runs across multi/hybrid cloud

Book a demo