Apache Beam Core Concepts
The Beam Model
Evolved from Google's MapReduce, FlumeJava, and Millwheel projects. Originally called the "Dataflow Model."
Key Abstractions
Pipeline
A Pipeline encapsulates the entire data processing task, including reading, transforming, and writing data.
// Java Pipeline p = Pipeline.create(options); p.apply(...) .apply(...) .apply(...); p.run().waitUntilFinish();
# Python
with beam.Pipeline(options=options) as p:
(p | 'Read' >> beam.io.ReadFromText('input.txt')
| 'Transform' >> beam.Map(process)
| 'Write' >> beam.io.WriteToText('output'))
PCollection
A distributed dataset that can be bounded (batch) or unbounded (streaming).
Properties
- •Immutable - Once created, cannot be modified
- •Distributed - Elements processed in parallel
- •May be bounded or unbounded
- •Timestamped - Each element has an event timestamp
- •Windowed - Elements assigned to windows
PTransform
A data processing operation that transforms PCollections.
// Java PCollection<String> output = input.apply(MyTransform.create());
# Python output = input | 'Name' >> beam.ParDo(MyDoFn())
Core Transforms
ParDo
General-purpose parallel processing.
// Java
input.apply(ParDo.of(new DoFn<String, Integer>() {
@ProcessElement
public void processElement(@Element String element, OutputReceiver<Integer> out) {
out.output(element.length());
}
}));
# Python
class LengthFn(beam.DoFn):
def process(self, element):
yield len(element)
input | beam.ParDo(LengthFn())
# Or simpler:
input | beam.Map(len)
GroupByKey
Groups elements by key.
PCollection<KV<String, Integer>> input = ...; PCollection<KV<String, Iterable<Integer>>> grouped = input.apply(GroupByKey.create());
CoGroupByKey
Joins multiple PCollections by key.
Combine
Combines elements (sum, mean, etc.).
// Global combine input.apply(Combine.globally(Sum.ofIntegers())); // Per-key combine input.apply(Combine.perKey(Sum.ofIntegers()));
Flatten
Merges multiple PCollections.
PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3); PCollection<String> merged = collections.apply(Flatten.pCollections());
Partition
Splits a PCollection into multiple PCollections.
Windowing
Types
- •Fixed Windows - Regular, non-overlapping intervals
- •Sliding Windows - Overlapping intervals
- •Session Windows - Gaps of inactivity define boundaries
- •Global Window - All elements in one window (default)
input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))));
input | beam.WindowInto(beam.window.FixedWindows(300))
Triggers
Control when results are emitted.
input.apply(Window.<T>into(FixedWindows.of(Duration.standardMinutes(5)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1))))
.withAllowedLateness(Duration.standardHours(1))
.accumulatingFiredPanes());
Side Inputs
Additional inputs to ParDo.
PCollectionView<Map<String, String>> sideInput =
lookupTable.apply(View.asMap());
mainInput.apply(ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
Map<String, String> lookup = c.sideInput(sideInput);
// Use lookup...
}
}).withSideInputs(sideInput));
Pipeline Options
Configure pipeline execution.
public interface MyOptions extends PipelineOptions {
@Description("Input file")
@Required
String getInput();
void setInput(String value);
}
MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class);
Schema
Strongly-typed access to structured data.
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class User {
public abstract String getName();
public abstract int getAge();
}
PCollection<User> users = ...;
PCollection<Row> rows = users.apply(Convert.toRows());
Error Handling
Dead Letter Queue Pattern
TupleTag<String> successTag = new TupleTag<>() {};
TupleTag<String> failureTag = new TupleTag<>() {};
PCollectionTuple results = input.apply(ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
try {
c.output(process(c.element()));
} catch (Exception e) {
c.output(failureTag, c.element());
}
}
}).withOutputTags(successTag, TupleTagList.of(failureTag)));
results.get(successTag).apply(WriteToSuccess());
results.get(failureTag).apply(WriteToDeadLetter());
Cross-Language Pipelines
Use transforms from other SDKs.
# Use Java Kafka connector from Python
from apache_beam.io.kafka import ReadFromKafka
result = pipeline | ReadFromKafka(
consumer_config={'bootstrap.servers': 'localhost:9092'},
topics=['my-topic']
)
Best Practices
- •Prefer built-in transforms over custom DoFns
- •Use schemas for type-safe operations
- •Minimize side inputs for performance
- •Handle late data explicitly
- •Test with DirectRunner before deploying
- •Use TestPipeline for unit tests