Ensuring the quality and integrity of data is crucial in any data processing pipeline. This guide outlines SQL and Python statements to validate and cleanse datasets effectively in Databricks notebooks.
- **Inspect Missing Data**
  `count(<column>)` skips `NULL` values, while `count(*)` includes `NULL` values.
```sql
SELECT count_if(<column> IS NULL) FROM <table>;
SELECT count(*) FROM <table> WHERE <column> IS NULL;
```
```python
from pyspark.sql.functions import col

display(spark.read.table("users").selectExpr("count_if(email IS NULL)"))
spark.read.table("users").where(col("email").isNull()).count()
```
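  Once missing values are surfaced, a common next step is to drop or fill them. Below is a minimal PySpark sketch; the `users` table and `email` column are carried over from the examples above, and the placeholder fill value is an arbitrary choice, not part of the original examples.

```python
usersDF = spark.read.table("users")

# Drop rows where every column is NULL, keeping partially populated records
users_non_empty = usersDF.na.drop(how="all")

# Alternatively, fill missing emails with a sentinel value (arbitrary placeholder)
users_filled = usersDF.na.fill({"email": "unknown@example.com"})
```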
- **Deduplicate Rows**
```sql
SELECT DISTINCT * FROM <table>;
```
  In Python:

```python
usersDF.distinct().display()
```
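  `dropDuplicates()` behaves the same as `distinct()` when called with no arguments, and has the advantage of accepting a column subset (used in the next section). A quick sketch reusing the same `usersDF`:

```python
# Equivalent to usersDF.distinct(): full-row deduplication
usersDF.dropDuplicates().display()
```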
- **Deduplicate Rows Based on Specific Columns**
```sql
CREATE OR REPLACE TEMP VIEW deduped_users AS
SELECT user_id, user_first_touch_timestamp, max(email) AS email, max(updated) AS updated
FROM users_dirty
WHERE user_id IS NOT NULL
GROUP BY user_id, user_first_touch_timestamp;
```
```python
from pyspark.sql.functions import col, max

dedupedDF = (usersDF
    .where(col("user_id").isNotNull())
    .groupBy("user_id", "user_first_touch_timestamp")
    .agg(max("email").alias("email"), max("updated").alias("updated"))
)
```
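  When any surviving row per key is acceptable, `dropDuplicates` with a column subset is a lighter-weight alternative to the aggregation above. Note that, unlike the `max()` aggregation, it keeps an arbitrary row per key; a sketch reusing `usersDF`:

```python
from pyspark.sql.functions import col

# Keep one arbitrary row per (user_id, user_first_touch_timestamp) pair
dedupedAltDF = (usersDF
    .where(col("user_id").isNotNull())
    .dropDuplicates(["user_id", "user_first_touch_timestamp"])
)
```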
- **Validate That Datasets Have No Duplicate Rows**
```sql
SELECT max(row_count) <= 1 AS no_duplicate_ids FROM (
  SELECT user_id, count(*) AS row_count
  FROM deduped_users
  GROUP BY user_id
);
```
```python
from pyspark.sql.functions import count, max

display(dedupedDF
    .groupBy("user_id")
    .agg(count("*").alias("row_count"))
    .select((max("row_count") <= 1).alias("no_duplicate_ids"))
)
```
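  To make this check fail a pipeline rather than just display a boolean, you can collect the result and assert on it. A minimal sketch reusing `dedupedDF` from above; the variable name and error message are illustrative:

```python
from pyspark.sql.functions import count, max

# Collect the single boolean produced by the global aggregation
no_duplicates = (dedupedDF
    .groupBy("user_id")
    .agg(count("*").alias("row_count"))
    .select((max("row_count") <= 1).alias("no_duplicate_ids"))
    .first()["no_duplicate_ids"])

assert no_duplicates, "Found user_ids associated with more than one row"
```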
- Confirm that each email is associated with at most one `user_id`.
```sql
SELECT max(user_id_count) <= 1 AS at_most_one_id FROM (
  SELECT email, count(user_id) AS user_id_count
  FROM deduped_users
  WHERE email IS NOT NULL
  GROUP BY email
);
```
```python
from pyspark.sql.functions import col, count, max

display(dedupedDF
    .where(col("email").isNotNull())
    .groupBy("email")
    .agg(count("user_id").alias("user_id_count"))
    .select((max("user_id_count") <= 1).alias("at_most_one_id"))
)
```
- **Date Format and Regex**
```sql
SELECT *,
  date_format(first_touch, "MMM d, yyyy") AS first_touch_date,
  date_format(first_touch, "HH:mm:ss") AS first_touch_time,
  regexp_extract(email, "(?<=@).+", 0) AS email_domain
FROM (
  SELECT *, CAST(user_first_touch_timestamp / 1e6 AS timestamp) AS first_touch
  FROM deduped_users
);
```
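  Only the SQL form is shown above; a PySpark equivalent might look like the following sketch, assuming `dedupedDF` holds the deduplicated rows and `user_first_touch_timestamp` is stored in microseconds, as the division by `1e6` implies:

```python
from pyspark.sql.functions import col, date_format, regexp_extract

datetimeDF = (dedupedDF
    # Convert microseconds since epoch to a proper timestamp
    .withColumn("first_touch", (col("user_first_touch_timestamp") / 1e6).cast("timestamp"))
    .withColumn("first_touch_date", date_format("first_touch", "MMM d, yyyy"))
    .withColumn("first_touch_time", date_format("first_touch", "HH:mm:ss"))
    # Extract the domain portion of the email address
    .withColumn("email_domain", regexp_extract("email", "(?<=@).+", 0))
)
```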
These SQL and Python statements offer powerful tools for validating, deduplicating, and transforming data, ensuring its accuracy and reliability in your Databricks environment.
Feel free to adapt these statements to your specific data cleansing and validation requirements.