This post provides insights into Delta Live Tables, distinguishing between Streaming Tables and Materialized Views, highlighting their features, differences, and best practices for Delta Live Pipeline Triggering Modes.
What is a streaming table (Streaming Live Tables)?
A delta table that has structured streaming writing to it.
- Streaming table reads from append-only data sources.
- Streaming table allow you to reduce cost and latency by reading each input record only once.
- Streaming table supports DML (UPDATE, DELETE, MERGE) for ad-hoc data manipulation
- Best used for data ingestion with
clould_files
- Stateful: each input row is processed only once, even if query changes. A change to a streaming table’s definition does not reread already processed data.
- You can do Full Refresh to recompute the data, this only works when your streaming source has full retentions.
- When stream-stream job, time bound and watermark is a must-have. Otherwise, the join will buffer all data forever.
What is a Materialized View (Live tables)?
It is a result of a query, precomputed and stored in Delta
- It always return the result of the defining query, at the moment it was last updated.
- You cannot modify data in the materialized view, you can change its query instead.
Differences between Streaming Table vs. Materialized View
- Streaming table handles update and deletes to their inputs
- Support any aggregation / windows functions
- Joins work the same as in ad-hoc queries
- Automatically handle changes to the query definitions
Unlike streaming, MVs are allowed to read the input more than once, and recompute from scratch when necessary
Enzyme optimize the MV using incremental refresh, to recompute data only when necessary:
- Appending new data as it arrives [only works for select/project/inner join]
- Partition Re-computation: only recompute affected partitions. [help aggregation & updates, but only works when input & output have the same partition]
- Merge updates to specific rows [handle complicated queries updates/inserts/deletes]
Delta Live Pipeline Triggering Mode
Delta Live Pipeline allows users to choose how often to update the results.
- Triggered Manually (Lowest on Cost, Highest on Latency)
- Triggered on a schedule using Databricks Jobs
- Continuously (Highest on Cost, Lowest on Latency)
Best Practice
- No data manipulation on the bronze layer to keep source intact for re-calculation
- Updates to a streaming table will break downstream computation; Updates includes 1. Non-insert DML performed on a streaming table. 2. Streaming tables that compute aggregations without a watermark. 3. Streaming table used with
APPLY CHANGES INTO
skipChangeCommit
helps to skip changes rather than fail, but DML must manually used to propagate changes downstream
In summary, understanding the characteristics of Streaming Tables and Materialized Views, along with optimal Delta Live Pipeline Triggering Modes, is crucial for effective data processing, manipulation, and maintaining the integrity of downstream computations in Delta Live Tables.
Reference
https://www.youtube.com/watch?v=PIFL7W3DmaY&t=4249s