Fabric Data Factory Pipeline Patterns
10 December 2024
Real-world use cases and design patterns for building scalable Fabric pipelines in enterprise environments.
Fabric Data Factory Pipelines: Which Should You Actually Use?
In Microsoft Fabric Data Factory, "pipeline" doesn't mean just one thing. There are a few different ways to move and transform data, and the names sound similar enough that it's easy to pick the wrong one and regret it later.
Here's the short tour.
The four things you're choosing between
- Data pipelines — the orchestrator. Schedules things, calls things, handles dependencies.
- Dataflows Gen2 — Power Query in the cloud. Low-code transformations.
- Copy job — a streamlined way to move data from A to B without building a full pipeline.
- Notebook or Spark job activity — code-first transformation, called from a pipeline.
These aren't competing products. They're meant to be combined.
Data pipelines: the conductor
The descendant of Azure Data Factory pipelines. It doesn't transform data itself; it tells other things to run, in the right order, with the right error handling.
Reach for it when you need to:
- Run several activities in sequence or in parallel.
- Add conditional logic, loops, or retries.
- Schedule a job or trigger it from an event.
- Pass parameters between activities.
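To make the conductor idea concrete, here is a minimal plain-Python sketch of what an orchestrator does: run activities in dependency order, retry the flaky ones, and pass parameters and outputs along. This is not Fabric's API. The `Activity` class, `run_pipeline` function, and the example activities are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Activity:
    """Hypothetical stand-in for one pipeline activity (not a Fabric class)."""
    name: str
    run: Callable[[Dict[str, Any], Dict[str, Any]], Any]
    depends_on: Tuple[str, ...] = ()
    retries: int = 0


def run_pipeline(activities: List[Activity], params: Dict[str, Any]) -> Dict[str, Any]:
    """Run activities in dependency order, retrying failures, and collect outputs."""
    outputs: Dict[str, Any] = {}
    pending = {a.name: a for a in activities}
    while pending:
        # An activity is ready once every activity it depends on has produced output.
        ready = [a for a in pending.values()
                 if all(dep in outputs for dep in a.depends_on)]
        if not ready:
            raise RuntimeError("circular or missing dependency")
        for activity in ready:
            for attempt in range(activity.retries + 1):
                try:
                    outputs[activity.name] = activity.run(params, outputs)
                    break
                except Exception:
                    if attempt == activity.retries:
                        raise  # retries exhausted: fail the whole run
            del pending[activity.name]
    return outputs


# Example: a copy step that fails once, then a transform that uses its output.
attempts = {"count": 0}


def flaky_copy(params, outputs):
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise ConnectionError("transient failure")
    return f"raw/{params['table']}"


def transform(params, outputs):
    return outputs["copy"].replace("raw/", "clean/")


pipeline = [
    Activity("transform", transform, depends_on=("copy",)),
    Activity("copy", flaky_copy, retries=2),
]
result = run_pipeline(pipeline, {"table": "orders"})
```

Notice that neither function here does any orchestration itself. The ordering, the retry, and the parameter passing all live in the runner, which is the division of labour a data pipeline gives you.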
Dataflows Gen2: low-code transformation
Power Query lifted out of Power BI and given a proper engine. Familiar visual editor, same M language, and the output lands in a Lakehouse, Warehouse, KQL database, or SQL database.
Right tool when:
- The transformation is about shape and cleanup, not heavy compute.
- An analyst or BI developer owns the logic and prefers a visual interface.
- Volumes are modest to moderate — millions of rows, not billions.
It stops being the right answer at very large volumes, very complex logic, or anywhere you need fine-grained control over partitioning or Spark behaviour.
Copy job: the new lightweight option
A streamlined experience for one specific job: moving data from a source to a destination, without making you build a full pipeline around it.
Use it when:
- You just need to land raw data into a Lakehouse or Warehouse.
- You want incremental copy with change data capture without wiring it up by hand.
- Transformation will happen later, downstream.
Mental model: copy job for getting data in, the other tools for doing things with it.
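If "incremental copy" sounds abstract, this is roughly the bookkeeping you would otherwise wire up by hand. The sketch below uses a simple high-watermark approach in plain Python. That is a simpler stand-in for change data capture, not the same thing, and it is not how Copy job is implemented. The `incremental_copy` function and the `modified_at` column are assumptions for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def incremental_copy(source: List[Row], destination: List[Row],
                     watermark: str, key: str = "modified_at") -> str:
    """Append rows newer than the watermark and return the updated watermark."""
    new_rows = [row for row in source if row[key] > watermark]
    destination.extend(new_rows)
    # If nothing new arrived, keep the old watermark unchanged.
    return max((row[key] for row in new_rows), default=watermark)


# ISO-8601 date strings sort correctly as plain strings.
source = [
    {"id": 1, "modified_at": "2024-12-01"},
    {"id": 2, "modified_at": "2024-12-05"},
]
landed: List[Row] = []

watermark = incremental_copy(source, landed, watermark="")  # first run: full load
source.append({"id": 3, "modified_at": "2024-12-09"})
watermark = incremental_copy(source, landed, watermark)     # second run: only id 3
```

The point of a copy job is that you don't write or maintain this state tracking yourself.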
Notebooks and Spark jobs: the heavy lifting
When you need real code — PySpark, SQL at scale, custom logic, ML — you reach for a notebook or a Spark job definition, typically called from a data pipeline.
Pick this when:
- Volumes are large enough that Dataflows Gen2 starts straining.
- Logic is complex, branching, or needs to be testable as code.
- You want full control over Delta writes, partitioning, or merge logic.
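Here is a small illustration of the "testable as code" point. In a real notebook the upsert would typically be a Delta merge running on Spark. The sketch below is a pure-Python stand-in that shows the same kind of logic (update rows that match on a key, insert the ones that don't) in a form you can unit test without a cluster. The `merge_upsert` function and the sample rows are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_upsert(target: List[Row], updates: List[Row], key: str) -> List[Row]:
    """Return target with updates applied: matched keys are updated, new keys inserted."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        # Matched: overlay the new column values. Not matched: insert the row as-is.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return sorted(merged.values(), key=lambda row: row[key])


target = [
    {"id": 1, "status": "open", "amount": 10},
    {"id": 2, "status": "open", "amount": 20},
]
updates = [
    {"id": 2, "status": "closed"},               # existing key: update
    {"id": 3, "status": "open", "amount": 30},   # new key: insert
]
result = merge_upsert(target, updates, key="id")
```

Because the function is pure (it returns a new list and leaves `target` alone), a test is just a couple of assertions on the result.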
A simple decision shortcut
- Just moving data in? Copy job.
- Light-to-medium transformation, low-code preferred? Dataflow Gen2.
- Heavy transformation or code-first? Notebook or Spark job.
- Stitching any of these together on a schedule? Data pipeline.
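The same shortcut, written out as a tiny Python function for anyone who prefers reading rules as code. The `pick_tool` name, its arguments, and the task labels are made up for this sketch. It encodes the four bullets above and nothing more.

```python
def pick_tool(task: str, heavy: bool = False, code_first: bool = False) -> str:
    """Map a task to a Fabric Data Factory tool, following the shortcut above."""
    if task == "orchestrate":
        return "Data pipeline"
    if task == "ingest":
        return "Copy job"
    if task == "transform":
        # Heavy volumes or a code-first team push the work to Spark.
        if heavy or code_first:
            return "Notebook or Spark job"
        return "Dataflow Gen2"
    raise ValueError(f"unknown task: {task!r}")


choices = [
    pick_tool("ingest"),
    pick_tool("transform"),
    pick_tool("transform", heavy=True),
    pick_tool("orchestrate"),
]
```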
Most real Fabric workloads end up using three of these together: a copy job lands raw data, a notebook transforms it, and a data pipeline orchestrates the whole thing. Dataflows Gen2 sits alongside for the work owned by BI teams who prefer Power Query.
Use the tool that fits the person doing the work, and orchestrate the lot with a data pipeline. That's usually the right answer.
Written by Zahid Shaikh — Data Engineer and Power BI Developer working with Microsoft Fabric and Power BI.