Scaling ETL: From 5 Hours to Under 1 Hour

In data-heavy systems, ETL pipelines are often the backbone of the application. They power analytics, reporting, and downstream workflows. When they are slow or inefficient, everything built on top of them suffers.

At Precanto, our ETL pipelines initially took over 5 hours to complete. This made iteration slow, delayed data availability, and limited the system’s ability to scale.

The goal was not just to make the pipeline faster, but to make it predictable, scalable, and easier to evolve.

Understanding the Bottlenecks

The initial implementation had several characteristics that contributed to the high runtime: heavy transformation logic pushed into the database, in-place updates to live tables, and a largely monolithic process.

As data volume grew (tens of millions of rows per tenant), these issues compounded, and runtime grew sharply rather than linearly.

Step 1: Moving Processing to Golang

The first major improvement came from shifting transformation logic out of the database and into a Golang-based processing layer.

This allowed transformations to run as explicit, composable steps in application code rather than as complex SQL operations, so each step could be profiled and optimized individually.
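As a rough sketch of this idea (the row shape, field names, and steps below are illustrative, not the actual Precanto code), a transformation can be expressed as a chain of small functions that are easy to test and time in isolation:

```go
package main

import (
	"fmt"
	"strings"
)

// Row is a simplified record flowing through the pipeline.
type Row map[string]string

// Step is one explicit transformation; steps compose into a pipeline.
type Step func(Row) Row

// trimFields removes surrounding whitespace from every value.
func trimFields(r Row) Row {
	for k, v := range r {
		r[k] = strings.TrimSpace(v)
	}
	return r
}

// normalizeCurrency lowercases the currency code so joins match.
func normalizeCurrency(r Row) Row {
	r["currency"] = strings.ToLower(r["currency"])
	return r
}

// apply runs each step in order; each step can be profiled on its own.
func apply(r Row, steps ...Step) Row {
	for _, s := range steps {
		r = s(r)
	}
	return r
}

func main() {
	row := Row{"amount": " 42.00 ", "currency": " USD "}
	out := apply(row, trimFields, normalizeCurrency)
	fmt.Println(out["amount"], out["currency"]) // 42.00 usd
}
```

Because each step is a plain function, a slow step can be rewritten or parallelized without touching the rest of the chain.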

Step 2: Table Swap Strategy

A major bottleneck was how data updates were handled.

The earlier approach involved modifying existing tables in place, which caused lock contention and unpredictable runtimes.

This was replaced with a table swap strategy: each run builds a fresh copy of the target table off to the side, then atomically renames it into place and drops the old one, so readers never see a partially updated table.

This significantly reduced contention and made the pipeline more predictable and resilient.
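A minimal sketch of the swap, assuming Postgres-style renames (the `events` table name is illustrative), generated as plain SQL so the shape of the technique is visible:

```go
package main

import "fmt"

// swapStatements returns the SQL for an atomic table swap: build the
// replacement off to the side, rename inside one transaction, then drop
// the old copy. The renames are near-instant because no rows move.
func swapStatements(table string) []string {
	return []string{
		fmt.Sprintf("CREATE TABLE %s_new (LIKE %s INCLUDING ALL)", table, table),
		// ...bulk-load the transformed rows into <table>_new here...
		"BEGIN",
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s_old", table, table),
		fmt.Sprintf("ALTER TABLE %s_new RENAME TO %s", table, table),
		"COMMIT",
		fmt.Sprintf("DROP TABLE %s_old", table),
	}
}

func main() {
	for _, stmt := range swapStatements("events") {
		fmt.Println(stmt + ";")
	}
}
```

Since queries only ever see the old table or the new one, long-running loads no longer block readers mid-run.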

Step 3: Structuring the Pipeline

Instead of treating the pipeline as a monolithic process, it was broken into well-defined stages.

Each stage had clear inputs and outputs, making it easier to reason about performance and optimize specific parts without affecting the entire pipeline.
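One way to sketch this structure (stage names and the per-stage timing report are illustrative assumptions, not the production code) is a list of named stages, each a function with an explicit input and output:

```go
package main

import (
	"fmt"
	"time"
)

// Stage is one well-defined pipeline step: a name plus a function with
// an explicit input and output, so each stage can be timed in isolation.
type Stage struct {
	Name string
	Run  func(rows []string) []string
}

// runPipeline executes the stages in order and reports per-stage timing,
// making it obvious where to optimize without touching other stages.
func runPipeline(rows []string, stages []Stage) []string {
	for _, s := range stages {
		start := time.Now()
		rows = s.Run(rows)
		fmt.Printf("stage %-10s rows=%d took=%s\n", s.Name, len(rows), time.Since(start))
	}
	return rows
}

func main() {
	stages := []Stage{
		{"extract", func(r []string) []string { return append(r, "a", "b", "c") }},
		{"transform", func(r []string) []string {
			out := make([]string, 0, len(r))
			for _, v := range r {
				out = append(out, "tx:"+v)
			}
			return out
		}},
		{"load", func(r []string) []string { return r }},
	}
	final := runPipeline(nil, stages)
	fmt.Println(len(final)) // 3
}
```

The per-stage timing is what makes the structure pay off: a regression shows up against a specific stage rather than as a vague increase in total runtime.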

Current Direction: Change Detection

Even with these improvements, the pipeline still processes large volumes of data on every run.

The next step is introducing change detection: tracking which records have changed since the previous run and reprocessing only those.

This shifts the pipeline from batch-heavy processing to a more incremental model, improving both performance and efficiency.
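One common way to implement this (a sketch under the assumption of content-hash fingerprints; the row format is made up) is to store a fingerprint per record from the previous run and skip any record whose fingerprint is unchanged:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint returns a stable hash of a row's content; if the hash is
// unchanged since the last run, the row can be skipped entirely.
func fingerprint(row string) string {
	sum := sha256.Sum256([]byte(row))
	return hex.EncodeToString(sum[:])
}

// changedRows compares current rows against fingerprints stored from the
// previous run and returns only the IDs that need reprocessing.
func changedRows(rows map[string]string, seen map[string]string) []string {
	var changed []string
	for id, row := range rows {
		if seen[id] != fingerprint(row) {
			changed = append(changed, id)
		}
	}
	sort.Strings(changed) // deterministic order for downstream stages
	return changed
}

func main() {
	prev := map[string]string{"1": fingerprint("alice,100"), "2": fingerprint("bob,200")}
	curr := map[string]string{"1": "alice,100", "2": "bob,250", "3": "carol,300"}
	fmt.Println(changedRows(curr, prev)) // [2 3]
}
```

On a run where only a small fraction of rows changed, the expensive transform and load stages then touch a small fraction of the data instead of all of it.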

Results

With these changes combined, end-to-end runtime dropped from over 5 hours to under 1 hour, and data became available to downstream analytics and reporting much earlier.

Key Takeaways

Optimizing ETL pipelines is not just about making queries faster. It requires rethinking how data flows through the system.

The biggest gains often come not from micro-optimizations, but from changing how the system is structured.

This design also ties closely to how the system is structured at the tenant level. In our case, we used a database-per-tenant architecture to isolate workloads and improve predictability.
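As a minimal sketch of that isolation (the host, credentials, and naming convention here are invented for illustration), each tenant resolves to its own database, so one tenant's heavy ETL run cannot contend with another's:

```go
package main

import "fmt"

// dsnFor maps a tenant to its own database. Because every tenant has a
// separate database, table swaps and bulk loads for one tenant never
// lock or slow tables belonging to another.
func dsnFor(tenant string) string {
	return fmt.Sprintf("postgres://etl@db.internal:5432/tenant_%s", tenant)
}

func main() {
	fmt.Println(dsnFor("acme")) // postgres://etl@db.internal:5432/tenant_acme
}
```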