Scaling ETL: From 5 Hours to Under 1 Hour
In data-heavy systems, ETL pipelines are often the backbone of the application. They power analytics, reporting, and downstream workflows. When they are slow or inefficient, everything built on top of them suffers.
At Precanto, our ETL pipelines initially took over 5 hours to complete. This made iteration slow, delayed data availability, and limited the system’s ability to scale.
The goal was not just to make the pipeline faster, but to make it predictable, scalable, and easier to evolve.
Understanding the Bottlenecks
The initial implementation had several characteristics that contributed to the high runtime:
- Heavy processing inside the database
- Repeated full-table operations
- Inefficient data movement between stages
- Lack of clear separation between transformation steps
As data volume grew (tens of millions of rows per tenant), these issues compounded, causing runtime to grow far faster than the data itself.
Step 1: Moving Processing to Golang
The first major improvement came from shifting transformation logic out of the database and into a Golang-based processing layer.
This allowed:
- Better control over execution flow
- Efficient in-memory processing
- Parallelism across independent workloads
Instead of relying on complex SQL operations, transformations were broken down into explicit steps that could be optimized individually.
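The fan-out across independent workloads can be sketched with a small worker pool. This is a simplified illustration, not our production code: `Row` and `transform` are hypothetical stand-ins for a real record type and one explicit transformation step.

```go
package main

import (
	"fmt"
	"sync"
)

// Row is a simplified stand-in for a record pulled from the database.
type Row struct {
	ID    int
	Value float64
}

// transform applies one explicit transformation step to a single row.
// (Hypothetical logic: a placeholder enrichment of the value.)
func transform(r Row) Row {
	r.Value = r.Value * 1.1
	return r
}

// processParallel fans rows out across a fixed pool of workers, so
// independent workloads run concurrently in memory instead of inside SQL.
func processParallel(rows []Row, workers int) []Row {
	in := make(chan Row)
	out := make(chan Row)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range in {
				out <- transform(r)
			}
		}()
	}

	// Feed the input channel, then close it so workers can finish.
	go func() {
		for _, r := range rows {
			in <- r
		}
		close(in)
	}()

	// Close the output channel once every worker has drained its input.
	go func() {
		wg.Wait()
		close(out)
	}()

	results := make([]Row, 0, len(rows))
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	rows := []Row{{1, 10}, {2, 20}, {3, 30}}
	fmt.Println(len(processParallel(rows, 2)))
}
```

Because each step is an ordinary function, it can be profiled and optimized on its own, which is much harder when the same logic is buried in a large SQL statement.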
Step 2: Table Swap Strategy
A major bottleneck was how data updates were handled.
The earlier approach involved modifying existing tables in place, which caused:
- High write amplification
- Lock contention
- Temporary data duplication
This was replaced with a table swap strategy:
- Build new data in a separate table
- Once ready, atomically swap it with the existing table
This significantly reduced contention and made the pipeline more predictable and resilient.
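The swap itself boils down to a short sequence of DDL statements executed in one transaction. The sketch below (assuming Postgres-style `ALTER TABLE ... RENAME` semantics, with a hypothetical `_staging` naming convention) generates that sequence; in practice these statements would be executed against the database rather than printed.

```go
package main

import "fmt"

// swapStatements returns the SQL that implements the table swap:
// new data is built in a staging table, which is then atomically
// renamed over the live table inside a single transaction.
func swapStatements(table string) []string {
	staging := table + "_staging" // where the new data was built
	old := table + "_old"         // temporary name for the outgoing table
	return []string{
		"BEGIN",
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s", table, old),
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s", staging, table),
		fmt.Sprintf("DROP TABLE %s", old),
		"COMMIT",
	}
}

func main() {
	for _, s := range swapStatements("facts") {
		fmt.Println(s)
	}
}
```

Readers only ever see either the old table or the fully built new one, so there is no window of partially updated data, and the heavy writes never contend with live reads.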
Step 3: Structuring the Pipeline
Instead of treating the pipeline as a monolithic process, it was broken into well-defined stages:
- Data ingestion and normalization
- Transformation and enrichment
- Aggregation and summarization
Each stage had clear inputs and outputs, making it easier to reason about performance and optimize specific parts without affecting the entire pipeline.
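One way to sketch this structure is a pipeline of stages with a shared signature. The stage bodies here are hypothetical stand-ins for the real ingestion, enrichment, and aggregation logic; the point is the shape, not the contents.

```go
package main

import (
	"fmt"
	"strings"
)

// Stage is one pipeline step with an explicit input and output,
// so each step can be timed and optimized in isolation.
type Stage func(rows []string) []string

// normalize trims and lowercases raw input
// (stand-in for ingestion and normalization).
func normalize(rows []string) []string {
	out := make([]string, 0, len(rows))
	for _, r := range rows {
		out = append(out, strings.ToLower(strings.TrimSpace(r)))
	}
	return out
}

// enrich tags each row (stand-in for transformation and enrichment).
func enrich(rows []string) []string {
	out := make([]string, 0, len(rows))
	for _, r := range rows {
		out = append(out, r+":enriched")
	}
	return out
}

// aggregate reduces rows to a summary
// (stand-in for aggregation and summarization).
func aggregate(rows []string) []string {
	return []string{fmt.Sprintf("count=%d", len(rows))}
}

// runPipeline threads data through the stages in order.
func runPipeline(rows []string, stages ...Stage) []string {
	for _, s := range stages {
		rows = s(rows)
	}
	return rows
}

func main() {
	fmt.Println(runPipeline([]string{" A ", "B"}, normalize, enrich, aggregate))
}
```

Swapping or profiling a single stage means touching one function, not the whole pipeline, which is what makes targeted optimization practical.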
Current Direction: Change Detection
Even with these improvements, the pipeline still processes large volumes of data on every run.
The next step is introducing change detection:
- Identify what data has changed since the last run
- Process only the affected subsets
- Reduce unnecessary recomputation
This shifts the pipeline from batch-heavy processing to a more incremental model, improving both runtime and resource usage.
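One common way to implement change detection is content hashing: fingerprint each row, store the fingerprints from the last run, and reprocess only rows whose fingerprint differs. This is a minimal sketch of that idea, not the implementation described above; the row data and keying scheme are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rowHash fingerprints a row's content. If the hash matches the one
// stored from the previous run, the row is unchanged and can be skipped.
func rowHash(fields ...string) string {
	h := sha256.New()
	for _, f := range fields {
		h.Write([]byte(f))
		h.Write([]byte{0}) // separator avoids ambiguous concatenation
	}
	return hex.EncodeToString(h.Sum(nil))
}

// changedRows compares current hashes against the previous run's
// snapshot and returns only the IDs that are new or changed.
func changedRows(prev, curr map[int]string) []int {
	var changed []int
	for id, h := range curr {
		if prev[id] != h {
			changed = append(changed, id)
		}
	}
	return changed
}

func main() {
	prev := map[int]string{1: rowHash("alice", "100"), 2: rowHash("bob", "200")}
	curr := map[int]string{1: rowHash("alice", "100"), 2: rowHash("bob", "250")}
	fmt.Println(changedRows(prev, curr)) // only row 2 changed
}
```

The same idea scales up from rows to partitions or tables: hash at whatever granularity lets you skip the most work per comparison.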
Results
- Reduced runtime from 5+ hours to under 1 hour
- Improved pipeline predictability and stability
- Enabled faster iteration on data models and transformations
- Laid the foundation for further optimization via incremental processing
Key Takeaways
Optimizing ETL pipelines is not just about making queries faster. It requires rethinking how data flows through the system.
- Move heavy logic out of constrained environments when needed
- Reduce contention by avoiding in-place updates
- Design pipelines as composable stages
- Minimize work through change detection and incremental processing
The biggest gains often come not from micro-optimizations, but from changing how the system is structured.
This design also ties in closely with how the system is structured at the tenant level. In our case, we used a database-per-tenant architecture to isolate workloads and improve predictability.