Ensuring the quality and integrity of data is crucial in any data processing pipeline. This guide outlines SQL and Python statements to validate and cleanse datasets effectively in Databricks notebooks.
- **Inspect Missing Data**
  `count(<column>)` skips `NULL` values, while `count(*)` includes `NULL` values.
```sql
SELECT count_if(<column> IS NULL) FROM <table>;
SELECT count(*) FROM <table> WHERE <column> IS NULL;
```
```python
from pyspark.sql.functions import col

display(spark.read.table("users").selectExpr("count_if(email IS NULL)"))
spark.read.table("users").where(col("email").isNull()).count()
```
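  Once missing values are surfaced, a common next step is to drop or fill them. Below is a minimal PySpark sketch; the `users` table and `email` column are carried over from the examples above, and the placeholder fill value is an arbitrary choice, not part of the original examples.

```python
usersDF = spark.read.table("users")

# Drop rows where every column is NULL, keeping partially populated records
users_non_empty = usersDF.na.drop(how="all")

# Alternatively, fill missing emails with a sentinel value (arbitrary placeholder)
users_filled = usersDF.na.fill({"email": "unknown@example.com"})
```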
- **Deduplicate Rows**
```sql
SELECT DISTINCT * FROM <table>;
```
  In Python:

```python
usersDF.distinct().display()
```
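  `dropDuplicates()` behaves the same as `distinct()` when called with no arguments, and has the advantage of accepting a column subset (used in the next section). A quick sketch reusing the same `usersDF`:

```python
# Equivalent to usersDF.distinct(): full-row deduplication
usersDF.dropDuplicates().display()
```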
- **Deduplicate Rows Based on Specific Columns**
```sql
CREATE OR REPLACE TEMP VIEW deduped_users AS
SELECT user_id, user_first_touch_timestamp, max(email) AS email, max(updated) AS updated
FROM users_dirty
WHERE user_id IS NOT NULL
GROUP BY user_id, user_first_touch_timestamp;
```
```python
from pyspark.sql.functions import col, max

dedupedDF = (usersDF
    .where(col("user_id").isNotNull())
    .groupBy("user_id", "user_first_touch_timestamp")
    .agg(max("email").alias("email"), max("updated").alias("updated"))
)
```
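  When any surviving row per key is acceptable, `dropDuplicates` with a column subset is a lighter-weight alternative to the aggregation above. Note that, unlike the `max()` aggregation, it keeps an arbitrary row per key; a sketch reusing `usersDF`:

```python
from pyspark.sql.functions import col

# Keep one arbitrary row per (user_id, user_first_touch_timestamp) pair
dedupedAltDF = (usersDF
    .where(col("user_id").isNotNull())
    .dropDuplicates(["user_id", "user_first_touch_timestamp"])
)
```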
- **Validate That Datasets Have No Duplicate Rows**
```sql
SELECT max(row_count) <= 1 AS no_duplicate_ids FROM (
  SELECT user_id, count(*) AS row_count
  FROM deduped_users
  GROUP BY user_id
);
```
```python
from pyspark.sql.functions import count, max

display(dedupedDF
    .groupBy("user_id")
    .agg(count("*").alias("row_count"))
    .select((max("row_count") <= 1).alias("no_duplicate_ids"))
)
```
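  To make this check fail a pipeline rather than just display a boolean, you can collect the result and assert on it. A minimal sketch reusing `dedupedDF` from above; the variable name and error message are illustrative:

```python
from pyspark.sql.functions import count, max

# Collect the single boolean produced by the global aggregation
no_duplicates = (dedupedDF
    .groupBy("user_id")
    .agg(count("*").alias("row_count"))
    .select((max("row_count") <= 1).alias("no_duplicate_ids"))
    .first()["no_duplicate_ids"])

assert no_duplicates, "Found user_ids associated with more than one row"
```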
- Confirm that each email is associated with at most one `user_id`.
```sql
SELECT max(user_id_count) <= 1 AS at_most_one_id FROM (
  SELECT email, count(user_id) AS user_id_count
  FROM deduped_users
  WHERE email IS NOT NULL
  GROUP BY email
);
```
```python
from pyspark.sql.functions import col, count, max

display(dedupedDF
    .where(col("email").isNotNull())
    .groupBy("email")
    .agg(count("user_id").alias("user_id_count"))
    .select((max("user_id_count") <= 1).alias("at_most_one_id"))
)
```
- **Date Format and Regex**
```sql
SELECT *,
  date_format(first_touch, "MMM d, yyyy") AS first_touch_date,
  date_format(first_touch, "HH:mm:ss") AS first_touch_time,
  regexp_extract(email, "(?<=@).+", 0) AS email_domain
FROM (
  SELECT *, CAST(user_first_touch_timestamp / 1e6 AS timestamp) AS first_touch
  FROM deduped_users
);
```
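  Only the SQL form is shown above; a PySpark equivalent might look like the following sketch, assuming `dedupedDF` holds the deduplicated rows and `user_first_touch_timestamp` is stored in microseconds, as the division by `1e6` implies:

```python
from pyspark.sql.functions import col, date_format, regexp_extract

datetimeDF = (dedupedDF
    # Convert microseconds since epoch to a proper timestamp
    .withColumn("first_touch", (col("user_first_touch_timestamp") / 1e6).cast("timestamp"))
    .withColumn("first_touch_date", date_format("first_touch", "MMM d, yyyy"))
    .withColumn("first_touch_time", date_format("first_touch", "HH:mm:ss"))
    # Extract the domain portion of the email address
    .withColumn("email_domain", regexp_extract("email", "(?<=@).+", 0))
)
```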
These SQL and Python statements offer powerful tools for validating, deduplicating, and transforming data, ensuring its accuracy and reliability in your Databricks environment.
Feel free to adapt these statements to your specific data cleansing and validation requirements.