Scaling ETL: From 5 Hours to Under 1 Hour
In data-heavy systems, ETL pipelines are often the backbone of the application. They power analytics, reporting, and downstream workflows. When they are slow or inefficient, everything built on top of them suffers.
At Precanto, our ETL pipelines initially took over 5 hours to complete. This made iteration slow, delayed data availability, and limited the system’s ability to scale.
The goal was not just to make the pipeline faster, but to make it predictable, scalable, and easier to evolve.
Understanding the Bottlenecks
The initial implementation had several characteristics that contributed to the high runtime:
- Heavy processing inside the database
- Repeated full-table operations
- Inefficient data movement between stages
- Lack of clear separation between transformation steps
As data volume grew (tens of millions of rows per tenant), these issues compounded, causing runtime to grow far faster than the data itself.
Step 1: Moving Processing to Golang
The first major improvement came from shifting transformation logic out of the database and into a Golang-based processing layer.
This allowed:
- Better control over execution flow
- Efficient in-memory processing
- Parallelism across independent workloads
Instead of relying on complex SQL operations, transformations were broken down into explicit steps that could be optimized individually.
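The fan-out across independent workloads can be sketched with a small worker pool. This is a simplified illustration, not our production code: `Row` and `transform` are hypothetical stand-ins for a real record type and one explicit transformation step.

```go
package main

import (
	"fmt"
	"sync"
)

// Row is a simplified stand-in for a record pulled from the database.
type Row struct {
	ID    int
	Value float64
}

// transform applies one explicit transformation step to a single row.
// (Hypothetical logic: a placeholder enrichment of the value.)
func transform(r Row) Row {
	r.Value = r.Value * 1.1
	return r
}

// processParallel fans rows out across a fixed pool of workers, so
// independent workloads run concurrently in memory instead of inside SQL.
func processParallel(rows []Row, workers int) []Row {
	in := make(chan Row)
	out := make(chan Row)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range in {
				out <- transform(r)
			}
		}()
	}

	// Feed the input channel, then close it so workers can finish.
	go func() {
		for _, r := range rows {
			in <- r
		}
		close(in)
	}()

	// Close the output channel once every worker has drained its input.
	go func() {
		wg.Wait()
		close(out)
	}()

	results := make([]Row, 0, len(rows))
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	rows := []Row{{1, 10}, {2, 20}, {3, 30}}
	fmt.Println(len(processParallel(rows, 2)))
}
```

Because each step is an ordinary function, it can be profiled and optimized on its own, which is much harder when the same logic is buried in a large SQL statement.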
Step 2: Table Swap Strategy
A major bottleneck was how data updates were handled.
The earlier approach involved modifying existing tables in place, which caused:
- High write amplification
- Lock contention
- Temporary data duplication
This was replaced with a table swap strategy:
- Build new data in a separate table
- Once ready, atomically swap it with the existing table
This significantly reduced contention and made the pipeline more predictable and resilient.
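The swap itself boils down to a short sequence of DDL statements executed in one transaction. The sketch below (assuming Postgres-style `ALTER TABLE ... RENAME` semantics, with a hypothetical `_staging` naming convention) generates that sequence; in practice these statements would be executed against the database rather than printed.

```go
package main

import "fmt"

// swapStatements returns the SQL that implements the table swap:
// new data is built in a staging table, which is then atomically
// renamed over the live table inside a single transaction.
func swapStatements(table string) []string {
	staging := table + "_staging" // where the new data was built
	old := table + "_old"         // temporary name for the outgoing table
	return []string{
		"BEGIN",
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s", table, old),
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s", staging, table),
		fmt.Sprintf("DROP TABLE %s", old),
		"COMMIT",
	}
}

func main() {
	for _, s := range swapStatements("facts") {
		fmt.Println(s)
	}
}
```

Readers only ever see either the old table or the fully built new one, so there is no window of partially updated data, and the heavy writes never contend with live reads.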
Step 3: Structuring the Pipeline
Instead of treating the pipeline as a monolithic process, it was broken into well-defined stages:
- Data ingestion and normalization
- Transformation and enrichment
- Aggregation and summarization
Each stage had clear inputs and outputs, making it easier to reason about performance and optimize specific parts without affecting the entire pipeline.
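One way to sketch this structure is a pipeline of stages with a shared signature. The stage bodies here are hypothetical stand-ins for the real ingestion, enrichment, and aggregation logic; the point is the shape, not the contents.

```go
package main

import (
	"fmt"
	"strings"
)

// Stage is one pipeline step with an explicit input and output,
// so each step can be timed and optimized in isolation.
type Stage func(rows []string) []string

// normalize trims and lowercases raw input
// (stand-in for ingestion and normalization).
func normalize(rows []string) []string {
	out := make([]string, 0, len(rows))
	for _, r := range rows {
		out = append(out, strings.ToLower(strings.TrimSpace(r)))
	}
	return out
}

// enrich tags each row (stand-in for transformation and enrichment).
func enrich(rows []string) []string {
	out := make([]string, 0, len(rows))
	for _, r := range rows {
		out = append(out, r+":enriched")
	}
	return out
}

// aggregate reduces rows to a summary
// (stand-in for aggregation and summarization).
func aggregate(rows []string) []string {
	return []string{fmt.Sprintf("count=%d", len(rows))}
}

// runPipeline threads data through the stages in order.
func runPipeline(rows []string, stages ...Stage) []string {
	for _, s := range stages {
		rows = s(rows)
	}
	return rows
}

func main() {
	fmt.Println(runPipeline([]string{" A ", "B"}, normalize, enrich, aggregate))
}
```

Swapping or profiling a single stage means touching one function, not the whole pipeline, which is what makes targeted optimization practical.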
Current Direction: Change Detection
Even with these improvements, the pipeline still processes large volumes of data on every run.
The next step is introducing change detection:
- Identify what data has changed since the last run
- Process only the affected subsets
- Reduce unnecessary recomputation
This shifts the pipeline from batch-heavy processing to a more incremental model, improving both runtime and resource usage.
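One common way to implement change detection is content hashing: fingerprint each row, store the fingerprints from the last run, and reprocess only rows whose fingerprint differs. This is a minimal sketch of that idea, not the implementation described above; the row data and keying scheme are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rowHash fingerprints a row's content. If the hash matches the one
// stored from the previous run, the row is unchanged and can be skipped.
func rowHash(fields ...string) string {
	h := sha256.New()
	for _, f := range fields {
		h.Write([]byte(f))
		h.Write([]byte{0}) // separator avoids ambiguous concatenation
	}
	return hex.EncodeToString(h.Sum(nil))
}

// changedRows compares current hashes against the previous run's
// snapshot and returns only the IDs that are new or changed.
func changedRows(prev, curr map[int]string) []int {
	var changed []int
	for id, h := range curr {
		if prev[id] != h {
			changed = append(changed, id)
		}
	}
	return changed
}

func main() {
	prev := map[int]string{1: rowHash("alice", "100"), 2: rowHash("bob", "200")}
	curr := map[int]string{1: rowHash("alice", "100"), 2: rowHash("bob", "250")}
	fmt.Println(changedRows(prev, curr)) // only row 2 changed
}
```

The same idea scales up from rows to partitions or tables: hash at whatever granularity lets you skip the most work per comparison.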
Results
- Reduced runtime from 5+ hours to under 1 hour
- Improved pipeline predictability and stability
- Enabled faster iteration on data models and transformations
- Laid the foundation for further optimization via incremental processing
Key Takeaways
Optimizing ETL pipelines is not just about making queries faster. It requires rethinking how data flows through the system.
- Move heavy logic out of constrained environments when needed
- Reduce contention by avoiding in-place updates
- Design pipelines as composable stages
- Minimize work through change detection and incremental processing
The biggest gains often come not from micro-optimizations, but from changing how the system is structured.
This design also ties in closely with how the system is structured at the tenant level. In our case, we used a database-per-tenant architecture to isolate workloads and improve predictability.