warming up your workspace

Data Engineering

Learn the systems and algorithms behind data engineering by building them from scratch in Python: parsers and joins, the probabilistic sketches inside every warehouse, columnar file formats, MapReduce, streaming windows and watermarks, a DAG-based pipeline orchestrator, and a small query engine. The engine, not the buttons.

10 projects, 250 hands-on levels, run in your browser.

Syllabus

  • Ingestion & Parsing: Every pipeline starts by reading messy bytes off disk and turning them into clean records. Parse delimited text and JSON by hand, shape rows into dictionaries, infer and coerce types, and survive malformed input without crashing the job. These are the unglamorous primitives every loader is built from.
  • Cleaning & Validation: Raw records are full of duplicates, nulls, and values that violate the rules your warehouse assumes. Deduplicate, handle missing data, enforce constraints, and emit a data-quality report, the gate every batch passes through before it is trusted downstream.
  • Transformations & Joins: The heart of every pipeline: reshape records and combine datasets. Group and aggregate, pivot, then build the three join algorithms every query engine relies on, nested-loop, hash, and sort-merge, plus the set operations and window functions that SQL gives you for free.
  • Big-Data Algorithms: When the data does not fit in memory, you reach for clever algorithms instead of more RAM. Sort data larger than memory with external merge sort, test membership with a Bloom filter, estimate distinct counts the way a warehouse does, and track frequencies with a count-min sketch. These are the structures behind every system that claims to scale.
  • File Formats & Encoding: Why is Parquet so much smaller and faster than CSV? Because it stores data by column and squeezes each column with the right encoding. Build the columnar layout, run-length and dictionary and delta encodings, variable-length integers, checksums, and the zone maps that let a query skip whole blocks of data unread.
  • MapReduce & Partitioning: The model that made big data parallel: map each record to key-value pairs, shuffle pairs to the machine that owns their key, and reduce each group. Build the map, shuffle, and reduce stages, the classic word count and inverted index, and the hash and range partitioning that spreads work across workers, plus how to spot the skew that ruins it.
  • Stream Processing: Unbounded data never stops arriving, so you compute over windows of time instead of whole datasets, and you cope with events that show up late and out of order. Build event-time handling, tumbling, sliding, and session windows, the watermarks that decide when a window is done, and the incremental aggregation that keeps fixed state as the stream flows.
  • Pipelines & Orchestration: A pipeline is a directed acyclic graph of tasks, and an orchestrator decides what runs when. Build the DAG, order tasks by topological sort so dependencies run first, group independent tasks into parallel stages, retry flaky tasks with exponential backoff, make runs idempotent so reprocessing is safe, and schedule and backfill. This is a small Airflow.
  • A Mini Query Engine: Every warehouse turns a query into a plan of operators and runs them. Build the relational operators (select, project, sort, group, aggregate), compose them into a query executor, add an optimizer that pushes filters down and prunes columns, then handle change data capture and slowly-changing dimensions, the patterns that keep a warehouse in sync with its sources.
  • Capstone: A Mini Warehouse ETL: Assemble everything into one pipeline: ingest raw CSV into typed records, drop the bad rows, enrich each fact with a dimension lookup and derive new fields, aggregate into the metrics a business actually asks for, load them into a warehouse table with idempotent upserts and change capture, and orchestrate the stages as a DAG. This is a small data warehouse, end to end.

Key concepts

  • Backfill: Running a pipeline for past time periods that were missed or need reprocessing, one scheduled interval at a time. Idempotent tasks make backfills safe to rerun.
  • Bloom filter: A compact probabilistic structure that tests set membership with no false negatives and a tunable false-positive rate. Used to skip data blocks or files that d…
  • Broadcast join: A distributed join where a small table is copied to every worker so each can join its partition of the large table locally, avoiding a shuffle. Used when one s…
  • Cardinality: The number of distinct values in a column. Exact counting costs memory proportional to the cardinality, so engines estimate it with sketches like HyperLogLog f…
  • Change data capture (CDC): Keeping a warehouse current by capturing only what changed at the source, classifying changes into inserts, updates, and deletes rather than reloading everythi…
  • Columnar storage: Storing each column's values together rather than each row's, so a query reads only the columns it needs and each column compresses well. The basis of…
  • Combiner: A map-side pre-aggregation that reduces key-value pairs locally before the shuffle, cutting the volume of data moved over the network. It is a local mini-reduc…
  • Completeness: The fraction of records that have a non-null value in a given column, a core data-quality metric. Low completeness signals upstream extraction problems.
  • Count-min sketch: A probabilistic structure that estimates the frequency of every item in a stream using fixed memory (d rows of w counters). It never underestimates, and is use…
  • DAG: A directed acyclic graph of tasks where edges encode dependencies. Pipelines are DAGs, and an orchestrator runs each task only after its upstream tasks finish;…
  • Data lake: A store that holds raw data of any shape (files, logs, JSON) cheaply, schema-applied on read rather than on write. Lakes trade query performance for flexibilit…
  • Data pipeline: A sequence of processing stages that moves data from sources to destinations, transforming it along the way. Pipelines are usually modeled as DAGs and run on a…
  • Data quality: The measure of how complete, valid, unique, and consistent a dataset is. Pipelines compute quality metrics (null counts, duplicate counts, constraint violation…
  • Data skew: An uneven spread of work where one partition (often a single hot key) holds far more data than the others, leaving one worker overloaded while the rest idle. S…
  • Data warehouse: A central store optimized for analytical queries over structured, cleaned data, organized into fact and dimension tables. It is the destination most pipelines…
  • Deduplication: Removing duplicate records, which arrive from retries, replays, and upstream bugs. Dedup is done by exact match or by a natural key, keeping the first or the l…
  • Delta encoding: Compressing a sorted numeric column (ids, timestamps) by storing the first value and then the differences between consecutive values, which are small and compr…
  • Dictionary encoding: Compressing a low-cardinality column by storing each distinct value once and replacing every occurrence with a small integer code. It also speeds up equality f…
  • Dimension table: A table of descriptive attributes (product, customer, date) that facts reference. Dimensions give context to the measures in a fact table and are often slowly…
  • ELT: Extract, Load, Transform: load raw data into the warehouse first, then transform it there with SQL. Modern cloud warehouses make this cheap and flexible, since…
  • ETL: Extract, Transform, Load: the classic pipeline shape where data is pulled from sources, cleaned and reshaped, then written to a warehouse. Transforming before…
  • Event time vs processing time: Event time is when something actually happened; processing time is when the pipeline handled it. Correct streaming aggregates use event time, because events ar…
  • Exactly-once: A delivery guarantee where each event affects the result exactly once, despite retries and failures. Achieved by deduplicating on an event id or making writes…
  • External merge sort: Sorting data larger than memory by splitting it into sorted runs that fit, then merging the runs with a heap. The algorithm behind big sorts and sort-merge joi…
  • Fact table: The central table in a star schema holding measurable events (sales, clicks), with foreign keys to dimensions and numeric measures to aggregate.
  • Hash join: A join that builds a hash table on the smaller input keyed by the join key, then probes it with each row of the larger input. It runs in linear time and is the…
  • Hash partitioning: Assigning each key to a partition by its hash modulo the partition count, so the same key always lands together and load spreads evenly for well-distributed ke…
  • Idempotency: A property where running an operation twice has the same effect as running it once. Idempotent loads (via upserts and run keys) make retries and reprocessing s…
  • Incremental load: Loading only the new or changed data since the last run, instead of the full dataset. It makes pipelines cheap to run frequently, and relies on change capture…
  • Join: Combining two datasets on a shared key, the most important operation in data engineering. The main algorithms are nested-loop, hash, and sort-merge, chosen by…
  • Mapper: The map-stage function that transforms each input record independently into zero or more key-value pairs. Because records are processed independently, mapping…
  • MapReduce: A programming model for parallel data processing: a map stage transforms each record into key-value pairs, a shuffle groups pairs by key, and a reduce stage fo…
  • Orchestration: Scheduling and coordinating the tasks of a pipeline: deciding what runs when, retrying failures, enforcing dependencies, and backfilling missed runs. Tools lik…
  • Parquet: A columnar file format that stores data in row groups, compresses each column with the right encoding, and keeps min/max statistics so queries can skip blocks.…
  • Partitioning: Splitting data across workers or files so it can be processed in parallel, by hashing a key or by ranges. Good partitioning spreads work evenly; bad partitioni…
  • Predicate pushdown: An optimization that moves filter conditions as close to the data source as possible, so a scan reads and emits fewer rows before joins and projections. Less d…
  • Query planner: The component that turns a declarative query into an executable plan of operators, choosing join algorithms, pushing predicates down, and pruning columns to ma…
  • Range partitioning: Assigning records to partitions by which value range they fall in, given sorted boundaries. It keeps related keys together and enables range scans, but risks s…
  • Reducer: The reduce-stage function that folds all the values sharing a key into a single result (a sum, count, or list). Reducers run after the shuffle has grouped pair…
  • Run-length encoding (RLE): Compressing a column by storing each run of equal values as a (value, count) pair instead of repeating it. Extremely effective on sorted or low-variety columns.
  • Schema: The declared structure of data: the columns, their types, and constraints. A pipeline validates incoming data against a schema and rejects or quarantines rows…
  • Schema evolution: How a schema changes over time as new columns appear or types change, without breaking existing data or consumers. Robust pipelines add columns with defaults a…
  • Schema on read: Applying structure to data when you query it, not when you store it. Raw files are kept as-is and parsed into typed records at read time, the model of a data l…
  • Session window: A dynamically sized window that grows around a burst of activity and closes after a gap of silence. It models user sessions, where the window length depends on…
  • Shuffle: The stage between map and reduce that routes every key-value pair to the partition responsible for its key, grouping values by key. It is usually the most expe…
  • Sliding window: Overlapping windows of a fixed length that advance by a smaller step, so each event lands in several windows. Used for moving aggregates like a rolling five-mi…
  • Slowly changing dimension (SCD): How a dimension table tracks attribute changes over time. Type 2 keeps full history by closing the current version and adding a new active one, so queries can…
  • Sort-merge join: A join that sorts both inputs by the join key, then merges them with a single linear pass. It is efficient when the inputs are already sorted or too large to h…
  • Star schema: A warehouse design with one central fact table joined to surrounding dimension tables, shaped like a star. It is simple, fast to query, and the default for ana…
  • Stream processing: Computing over unbounded data that never stops arriving, using windows of time instead of whole datasets and coping with late, out-of-order events. The counter…
  • Topological sort: An ordering of a DAG's nodes where every task comes after its dependencies. It gives the orchestrator a valid run order and detects cycles, which make orde…
  • Tumbling window: Fixed-size, non-overlapping windows of time; each event falls in exactly one. The simplest windowing for streaming aggregates like per-minute counts.
  • Upsert (merge): An insert-or-update by key: matched rows are replaced, new rows inserted. Upserts make loads idempotent, since applying the same batch twice yields the same ta…
  • Varint: A variable-length integer encoding (LEB128) that spends fewer bytes on small numbers by storing seven bits per byte with a continuation flag. Pairs with delta…
  • Watermark: A stream's moving promise that no more events older than a certain event time will arrive. It decides when a window is complete and can fire, and which lat…
  • Zone map: Per-block min/max statistics stored alongside columnar data. For a filter on a column, the engine can skip any block whose range cannot contain a matching valu…