Skip to content

Making Meijer's $4M Databricks bill legible

Meijer's data team knew their Databricks bill was too high but couldn't see why. We made the spend legible down to the individual notebook, found the waste, and left them the tools to keep finding it.

$362K-$508K

a year in Databricks savings, identified against a multi-million-dollar spend

Meijer

A hometown introduction

The introduction came through Meijer’s CEO, who had made a point of steering work toward local Michigan companies. We qualify on that count. We’re based in Ann Arbor, and for anyone who grew up around here, Meijer is just part of the furniture. The fit went past geography, though. Meijer’s enterprise data team runs a large Azure Databricks operation, and they had reached the stage most fast-growing data platforms eventually hit, where the monthly bill keeps climbing and nobody can quite account for the increase.

There was no crisis behind the call. They wanted to understand their own spending well enough to manage it on purpose instead of reacting to it every month. That is a good reason to bring someone in, and they knew their environment cold, which is the kind of engagement we do our best work on.

Nobody could see where the money went

The bill was large and getting larger, but the harder problem was that no one could see inside it. Spending was scattered across workspaces, clusters, jobs, and notebooks, with no reliable way to attribute a given dollar to the thing that had actually spent it. About a fifth of the cost couldn’t be tied to a team at all and simply showed up in the books as “unknown.” It is difficult to have a useful conversation about cutting costs when you can’t say where they’re coming from to begin with.

That is the normal condition of a big Databricks environment that grew up fast, not a knock on Meijer’s team. There was one extra wrinkle. Overwatch, the Microsoft Labs tool that was supposed to provide this visibility, broke about ten days into the engagement. The job that populated its data stopped running and stayed down for roughly two months until Microsoft sorted it out. We worked around the outage and finished on schedule anyway.

What we actually did

The first stretch of the work was the unglamorous part: making the cost data readable. We cleaned up and extended the Overwatch dataset and built what we called the Enhanced Cost Monitoring Dashboard, which broke spending down by track, workspace, cluster, job, and notebook so each team could see what it was actually responsible for. That step alone let us put a label on roughly $13,000 a month of pipeline costs that had previously been filed under “unknown.”

With the spending finally visible, the waste turned up quickly. A few of the things a closer look found:

  • Several of the priciest jobs were configured as continuous streaming jobs but only did meaningful work once or twice a day. Reconfiguring them to a scheduled, run-until-caught-up model cut their cost by an estimated 40 to 60 percent each.
  • A number of clusters were spending long stretches, sometimes five to twenty minutes at a time, sitting in a “resizing” state, autoscaling and running up cost without actually doing any work. That generally meant a job had been placed on a cluster the wrong size for it.
  • One small job ran every five minutes and finished in about a minute, which was just enough to keep an expensive cluster from ever shutting itself down.

Alongside the analysis we built two things meant to keep working after we left. One was automated alerting, so cost spikes and long-running jobs get caught in real time rather than discovered on the next invoice. The other was a Recommendations Dashboard that lets an engineer check whether a job is cost-effective before it ships to production.

We did all of this inside Meijer’s stand-ups and tooling, working next to their data engineers rather than off on our own. The goal was for their team to be able to find the next $13,000 without us, and most of what we were there to do was teach them how.

What changed

By the end of the engagement we had identified an estimated $362,000 to $508,000 in annual savings. The more durable result was that Meijer could now see their Databricks spending down to an individual cluster, job, or notebook, including the costs that had been invisible before. Cost questions started getting asked earlier, while a job was still being written instead of after the bill arrived.

The dashboards and alerts stayed in place after we finished, and Meijer’s teams have gone on using them to find more savings on their own.

Stack

Azure Databricks · Databricks Overwatch · Azure · Power BI

← All case studies

Talk to an engineer