Core Concepts of GCP Dataflow

This archive covers all its examples with Java

The main goal of this archive is to provide a comprehensive, holistic overview of the GCP Dataflow service.

Introduction

Dataflow is a GCP service that use the unified Apache Beam Model in an underlying manner.

This service allows collecting data in streaming or batch mode, transforming it, and loading it into a sink. The most common pattern implemented with Dataflow is ETL.

In this archive, we will cover key concepts such as Dataflow architecture, the lifecycle of a data pipeline, the key features of the API, as well as common anti-patterns and best practices.

The Architecture of Dataflow

A dataflow architecture is a computing model where the execution of operations is determined by the availability of data, rather than by the sequential order of instructions as in traditional von Neumann architectures.

Processing pipeline

A dataflow pipeline is partially user-defined and describes the complete data processing workflow. It specifies data sources, transformations and sinks.

In addition, pipeline options configure execution aspects such as the scaling strategy (auto-scaling versus fixed workers), the types and regions of workers, and whether the processing runs in streaming or batch mode.

This abstraction allows developers to focus on what to process, while Dataflow decides how to execute it efficiently.

Serverless Dataflow Service (Control Plane)

The control plane makes decisions on how data should be managed, routed, and processed

Once submitted, the processing pipeline is validated, optimized, and converted into an execution graph. Each transform in the code corresponds to a step.

Dataflow may fuse multiple steps into stages to reduce overhead and optimize performance. The actual execution order might differ from the user-defined order due to these optimizations.

Crucially, users do not need to provision or configure virtual machines or clusters. The control plane is responsible for orchestration: job scheduling, scaling, and failure recovery.

Workers (Data Plane)

The data plane executes these decisions by forwarding, filtering, and processing actual data according to the rules defined by the control plane.

The data plane consists of workers, which are managed VMs automatically provisioned by Dataflow. Each worker runs the embedded Apache Beam SDK harness. Workers execute pipeline stages, fetch input data, apply transformations, and write outputs. To exchange intermediate results, workers rely on the shuffle service.

Inter-worker communication

Intermediate results are serialized by the Beam SDK inside the worker. These serialized records are sent to the shuffle service, which: partitions, sorts, balances the data across workers. Receiving workers then deserialize the data to continue processing.

Shuffle & State Management

The shuffle service is comparable to a managed MapReduce-like layer. It handles grouping, sorting, and joining intermediate data. It acts as a fault-tolerant intermediary between workers. By offloading shuffle to Google’s infrastructure, pipelines achieve better scalability and reliability than VM-based shuffling.

MapReduce is a programming model popularized by Google. It is primarily used for manipulating and processing large amounts of data within a cluster of nodes.

The Map function breaks the problem down into subproblems. It takes input data and generates an intermediate set of new (key, value) pairs. Each node in the cluster processes a portion of the data independently.
The Reduce function collects the intermediate results produced by the different instances of the Map function. It groups all the values associated with the same key and performs a reduction operation to produce a final output reduced to a single (key, value) pair.

A Reduce operation in the context of MapReduce consists of aggregating or combining the values that share the same key after the Map stage.

For streaming pipelines, Dataflow also supports: Stateful processing (keeping per-key state across events), Watermarks (tracking event-time progress and late data handling).

Monitoring & Observability

The Dataflow monitoring interface provides deep insights into pipeline execution. Jobs are visualized as stages (units of fused work). Metrics per stage and per worker help identify bottlenecks. Developers can track throughput, latency, backlog, and error logs.

This makes it possible to analyze not only the global performance of the pipeline, but also fine-grained worker-level details, crucial for troubleshooting slowdowns.

The Lifecycle of a Dataflow pipeline

When you run your dataflow pipeline, the service creates an execution graph from the code that construct your pipeline object, including all of the transforms and their associated processing functions, such as DoFn objects.

When you call pipeline.run() in Java, Dataflow converts your logical graph (PCollections + transformations) into a pipeline spec. This spec is a JSON (or proto) file describing all steps and connections, which is then sent to the Dataflow workers like an architectural blueprint they follow to execute your code.

As mentioned previously the execution graph may change due to optimizations, merging, and validation. In the next sections, we will demonstrate how to voluntarily prevent some of these changes as a last resort.

Once everything is initialized and our resources are ready to operate, we have a job in run state associated with a job id.

Error and exception handling

The pipeline might throw exceptions while processing data. Some of these errors are transient, others are permanent, generally caused by corrupted or null data.

Dataflow processes elements in arbitrary bundles and retries the complete bundle when an error is thrown for any element in that bundle.

When running in Batch mode, bundles that include a failing item are retried four times. The pipeline fails completely when a single bundle has failed four times. In Streaming mode, a bundle that includes a failing item is retried indefinitely, which might cause your pipeline to permanently stall.

Parallelization and Distribution

The Dataflow service automatically parallelizes and distributes the processing logic in your pipeline to the workers you assign to perform your job.

For example, the ParDo transforms in a pipeline induce parallelism by automatically distributing processing logic represented by DoFn objects to multiple workers to be run in parallel.

There are 2 types of parallelization:

Horizontal Parallelism Horizontal parallelism occurs when pipeline data is split and processed on multiple workers at the same time. The DF runtime is parallelized. (copy term)
Vertical Parallelism Vertical parallelism occurs when pipeline data is split and processed by multiple CPU cores on the same worker. Each worker is powered by a Compute Engine VM. A VM can launch multiple processes to saturate all of its CPU cores. (copy term)

By default, Dataflow automatically manages job parallelization. Dataflow monitors the runtime statistics for the job (such as CPU, memory usage) to determine how to scale the job.

https://cloud.google.com/dataflow/docs/pipeline-lifecycle

Key features of the API

The first feature that is essential is the Pipeline object that allows us to access the flow Pipeline p = Pipeline.create(options);.

PCollection

A PCollection<T> is an immutable collection of values of type T. PCollection stands for Parallel Collection. It represents a potentially distributed, multi-element dataset that acts as the pipeline's data.

PCollection<String> lines = p.apply(TextIO.read().from("gs://bucket/file.txt"));

PTransform

A PTransform<InputT, OutputT> is an operation that takes an InputT (some subtype of PInput) and produces an OutputT (some subtype of POutput).

There are the following types :

ParDo → element-by-element transformation (map, filter, flatMap).
GroupByKey → group by key.
Combine → aggregation (sum, mean, custom).
Window → slice data into time windows.

PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));

I/O Connectors (IO APIs)

Integrated into Beam/Dataflow:

TextIO (Cloud Storage, files),
PubsubIO (Pub/Sub),
BigQueryIO (BigQuery),
JdbcIO (relational databases),

Windowing & Triggers

There are the following types :

Tumbling windows (called fixed windows in Apache Beam)
Hopping windows (called sliding windows in Apache Beam)
Session windows

Windowing functions divide unbounded collections into logical components, or windows. Windowing functions group unbounded collections by the timestamps of the individual elements. Each window contains a finite number of elements.

words.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));

Concretely, this means that the elements in the words stream will be grouped into consecutive time segments, each lasting 1 minute. For example, all data arriving between 12:00:00 and 12:00:59 will go into the first window, those between 12:01:00 and 12:01:59 will go into the next one, and so on. This then allows for computations or aggregations per window, such as counting the words in each minute.

GroupByKey

GroupByKey in Dataflow groups values by key, then redistributes the data across multiple workers to enable parallel processing.

// 1. ("DXC", "John"), ("Perficient", "Anna"), ("DXC", "Johnny")
// 2. ("DXC", ["DXC", "Johnny"]), ("Perficient", ["Anna"])

PCollection<KV<String, String>> contractorPerFirm = /* ... */;
PCollection<KV<String, Iterable<String>>> contractorsPerFirm = contractorPerFirm.apply(GroupByKey.create());

Combine (globally or by key)

Global Combine: Aggregates all data into a single unit.

PCollection<Integer> nums = /* ... */;
PCollection<Integer> total = nums.apply(Combine.globally(Sum.ofIntegers()));

Combine by Key: Aggregates data separately for each defined key.

PCollection<KV<String, Integer>> scores = /* ... */;
PCollection<KV<String, Integer>> totals = scores.apply(Combine.perKey(Sum.ofIntegers()));

CoGroupByKey

Natural join between multiple PCollections by key. It's an equivalent to a JOIN in SQL. Behind that, it's a shuffle (like GroupByKey but multi-source).

final TupleTag<String> namesTag = new TupleTag<>();
final TupleTag<Integer> scoresTag = new TupleTag<>();

PCollection<KV<Integer, String>> names = /* ... */;
PCollection<KV<Integer, Integer>> scores = /* ... */;

PCollection<KV<Integer, CoGbkResult>> result =
    KeyedPCollectionTuple.of(namesTag, names)
        .and(scoresTag, scores)
        .apply(CoGroupByKey.create());

Partition

Split a collection into N subcollections according to a function.

enum Tier { FREE, PRO, ENTERPRISE }

PCollection<User> users = /* ... */;

PCollectionList<User> split =
    users.apply(Partition.of(Tier.values().length, (user, n) -> {
      return user.tier().ordinal(); // 0=FREE, 1=PRO, 2=ENTERPRISE
    }));

PCollection<User> free       = split.get(Tier.FREE.ordinal());
PCollection<User> pro        = split.get(Tier.PRO.ordinal());
PCollection<User> enterprise = split.get(Tier.ENTERPRISE.ordinal());

Reshuffle

Forces a data redistribution for greater parallelization. Very useful after an operation that "merges" the data.

Note that Reshuffle is supported by Dataflow, even though it is marked deprecated in the Apache Beam documentation.

pcol.apply(Reshuffle.viaRandomKey());

A workaround for those who do not implement the deprecated is to add an artificial GroupByKey with a random key → Dataflow redistributes the data across the workers.

pcol
  .apply(WithKeys.of(e -> 0)) // Artifical Key
  .apply(GroupByKey.create())
  .apply(Values.create());

Flatten

Merges multiple PCollections (same type) into one.

PCollection<String> merged = PCollectionList.of(col1).and(col2).apply(Flatten.pCollections());

Flatten.pCollections() merges the list into a single PCollection (merged).

Side inputs

Way to avoid an expensive join: instead of doing a GroupByKey, you inject a “small” dataset as side data into a ParDo transformation. The maximum size of a side input in Google Cloud Dataflow is 80 MB.

final PCollectionView<Map<String, String>> lookup =
    dict.apply(View.asMap());

input.apply(ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void process(ProcessContext c) {
        Map<String, String> map = c.sideInput(lookup);
        ...
    }
}).withSideInputs(lookup));

Frequent Anti-patterns

Unbounded per-element I/O (e.g., one DB/API call per element) causing fan-out latency and throttling. Prefer batching, side inputs, caches, or sinks with bulk commits.
Excessive shuffles (big, unnecessary GroupByKey/CoGroupByKey) that explode network and disk I/O. Combine/aggregate early and limit key cardinality.
Hot keys (highly skewed keys) creating stragglers and under-utilized workers. Use key-salting or load-balancing patterns.
Gigantic elements (very large records, huge side inputs) that blow worker memory. Use chunking, windowing, or external storage.
Global window + default trigger in streaming when you actually need bounded lateness/periodic output—leads to unflushed state and unbounded growth.
State/timer misuse (growing state without TTL/cleanup) causing memory leaks and backlogs. Configure allowed lateness and state cleanup.
Too much logging at element level in production pipelines—causes I/O bottlenecks and cost spikes. Prefer sampled/aggregated metrics.
Monolithic transforms (doing everything in one giant DoFn) that are hard to test, scale, and optimize. Compose smaller transforms.
Ignoring autoscaling/back-pressure signals (starving or overprovisioning workers) instead of addressing pipeline bottlenecks.
No dead-letter handling—bad records crash or stall the main flow. Use a DLQ pattern.
Relying on per-worker mutable singletons (hidden state) that break parallelism and determinism. Prefer Beam state, side inputs, or external stores.
Skipping validations & integration tests for transforms that are hard to debug later. Beam pipelines demand strong testing.

Recommended Best practices

Design for minimal shuffle: pre-aggregate with Combine/ApproximateQuantiles, filter early, and reduce key cardinality.
Handle skew: detect hot keys and use key-salting, combiner lifting, or custom partitioning strategies.
Batch and buffer external I/O: use connectors that support batching, implement retries/backoff, and prefer bulk commits.
Use proper windowing & triggers for streaming outputs; set allowed lateness and state TTL to bound resource use.
Adopt dead-letter queues to quarantine and inspect bad data without stopping the pipeline.
Compose small, reusable transforms with clear interfaces; keep business logic separate from I/O.
Leverage metrics & observability: Beam Metrics, Dataflow job metrics, and worker logs for SLOs and triage.
Right-size resources & watch autoscaling: fix true bottlenecks (hot stages, stragglers) rather than just adding workers.
Prefer built-in I/O connectors and proven patterns (file patterns, side inputs, sessionization, etc.).
Test thoroughly: unit test DoFn/PTransform, use integration tests with TestStream or small fixtures. Follow Beam style/testing guidance.
Plan for large jobs & failure recovery: checkpoint, shard outputs, and design idempotent sinks to avoid expensive restarts.
Consult troubleshooting guides for slow/stuck jobs and automated rejections; address root causes (OOM, bad keys, excessive shuffle).

References

PreviousGitOps NextRust

Last updated 1 month ago

hashtagIntroduction

hashtagThe Architecture of Dataflow

hashtagProcessing pipeline

hashtagServerless Dataflow Service (Control Plane)

hashtagWorkers (Data Plane)

hashtagShuffle & State Management

hashtagMonitoring & Observability

hashtagThe Lifecycle of a Dataflow pipeline

hashtagError and exception handling

hashtagParallelization and Distribution

hashtagThere are 2 types of parallelization:

hashtagKey features of the API

hashtagPCollection

hashtagPTransform

hashtagI/O Connectors (IO APIs)

hashtagWindowing & Triggers

hashtagGroupByKey

hashtagCombine (globally or by key)

hashtagCoGroupByKey

hashtagPartition

hashtagReshuffle

hashtagFlatten

hashtagSide inputs

hashtagFrequent Anti-patterns

hashtagRecommended Best practices

hashtagReferences