Understanding Delta Lake: Underlying Data Format, ACID Transaction Support and Scaling Metadata

Delta Lake is an open-source project that enables building a data lakehouse on top of existing cloud storage.

| It is | It is not |
| --- | --- |
| Open source | Proprietary technology |
| Built upon standard data formats | A storage format |
| Optimized for cloud object storage | A storage medium |
| Built for scalable metadata handling | A database service or data warehouse |

Delta Lake uses versioned Parquet files to store data. Parquet is a column-oriented format, which 1) supports flexible compression options and extensible encoding schemes per data type (column), 2) allows fast queries that read only the specific columns they need, and 3) speeds up queries further by exposing per-column metadata (min/max values, etc.) for data skipping.
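To make this concrete, here is a minimal sketch using pyarrow (the file path is illustrative, not from the original example) that inspects the columnar layout and the per-column min/max statistics a query engine can use to skip data:

```python
# A minimal sketch using pyarrow; the file path is illustrative.
import pyarrow.parquet as pq

pf = pq.ParquetFile("patients/part-00000.parquet")

# Each column has its own type, encoding, and compression settings.
print(pf.schema_arrow)

# Per-column-chunk statistics (min/max, null count) enable data skipping.
print(pf.metadata.row_group(0).column(0).statistics)

# Reading only selected columns touches just those column chunks on disk.
table = pq.read_table("patients/", columns=["patient_id"])
print(table.num_rows)
```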

When you write data using .format("delta") instead of .format("parquet"), a _delta_log directory is added alongside the traditional Parquet output. It contains a transaction log that records every single operation performed on your data, and it is the foundation of ACID transaction support, scalable metadata handling, and time travel.
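As a minimal sketch (assuming Spark with the delta-spark package configured; the path, table, and column names are made up for illustration), writing the same DataFrame as plain Parquet and as Delta shows the extra _delta_log directory:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("P1", "Alice"), ("P2", "Bob"), ("P3", "Carol")],
    ["patient_id", "name"],
)

# Plain Parquet output: only part-*.parquet files (plus _SUCCESS).
df.write.format("parquet").mode("overwrite").save("/tmp/patients_parquet")

# Delta output: the same part-*.parquet files plus a _delta_log/ directory
# containing JSON commit files such as 00000000000000000000.json.
df.write.format("delta").mode("overwrite").save("/tmp/patients_delta")
```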

ACID Transaction Support

When the system writes new data, it first writes the data files and then updates the _delta_log directory. Each operation creates a new 0...0X.json commit file inside _delta_log that records the operation as well as the files it affected.
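A small sketch (continuing from the illustrative table above; paths are assumptions, not from the original post) that prints the actions recorded in each commit file:

```python
import json
from pathlib import Path

log_dir = Path("/tmp/patients_delta/_delta_log")

# Each commit file is newline-delimited JSON; every line holds one action,
# e.g. commitInfo, metaData, protocol, add, or remove.
for commit in sorted(log_dir.glob("*.json")):
    print(f"--- {commit.name} ---")
    for line in commit.read_text().splitlines():
        action = json.loads(line)
        if "add" in action:
            print("add   ", action["add"]["path"])
        elif "remove" in action:
            print("remove", action["remove"]["path"])
        else:
            print(list(action.keys())[0])
```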

Below is an example of the steps taken to first write and then update the records. Note that after the update changes the patient ID from P1 to P11, the final read scans only part-1, part-2, and part-3.parquet rather than all files, because the latest log entry (00002.json) marks part-0.parquet as removed and tells the reader to skip it.
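As a sketch of that update step (using the DeltaTable API and the illustrative table from the earlier snippets, not the exact commands from the original example):

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/tmp/patients_delta")

# Change the patient ID from P1 to P11. Delta rewrites the affected data file
# and records an add/remove pair in the next commit (e.g. 00002.json), so the
# old part file is skipped by subsequent reads.
table.update(
    condition="patient_id = 'P1'",
    set={"patient_id": "'P11'"},
)

# The read below resolves the log and scans only the files still marked live.
spark.read.format("delta").load("/tmp/patients_delta").show()
```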

Scaling Massive Metadata

The Delta Lake writer generates a checkpoint file every 10 commits. The checkpoint file saves the state of the table at that point in time, storing the full file list in native Parquet format. This gives the Spark reader a shortcut, so it can avoid the inefficiency of reprocessing thousands of small JSON files.

If Delta Lake needs to re-create the state of the table from the above operations, it simply reads the checkpoint file (000010.checkpoint.parquet) and reapplies the two additional log entries (00011.json and 00012.json).
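For illustration (the listing continues the hypothetical table from the earlier snippets; file names follow Delta's zero-padded convention and the exact contents depend on the table's history), the log directory after enough commits looks roughly like this, and the _last_checkpoint file points readers to the latest checkpoint:

```python
import json
from pathlib import Path

log_dir = Path("/tmp/patients_delta/_delta_log")
print(sorted(p.name for p in log_dir.iterdir()))
# Roughly: ['00000000000000000000.json', ...,
#           '00000000000000000010.checkpoint.parquet',
#           '00000000000000000011.json', '00000000000000000012.json',
#           '_last_checkpoint']

# _last_checkpoint tells readers which checkpoint version to start from
# before replaying only the newer JSON commits.
last = json.loads((log_dir / "_last_checkpoint").read_text())
print(last["version"])
```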


In conclusion, Delta Lake's adept handling of ACID transactions and intelligent metadata scaling embodies a paradigm shift in data lakehouse architecture. Its capabilities lay the foundation for robust, efficient, and scalable data management.

Reference

  1. Haelen, B., & Davis, D. (2023). Delta Lake: Up and Running: Modern Data Lakehouse Architectures with Delta Lake. O’Reilly Media.

By Hang
