Validating and Cleansing Datasets

Ensuring the quality and integrity of data is crucial in any data processing pipeline. This guide outlines SQL and Python statements to validate and cleanse datasets effectively in Databricks notebooks.

  • Inspect Missing Data

    • count(<column>) skips NULL values, while count(*) counts every row, including rows with NULLs:
    SELECT count_if(<column> IS NULL) FROM <table>;
    SELECT count(*) FROM <table> WHERE <col> IS NULL;
    
    from pyspark.sql.functions import col

    display(spark.read.table('users').selectExpr("count_if(email IS NULL)"))
    spark.read.table('users').where(col("email").isNull()).count()
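
    • If NULL values need cleansing rather than just counting, the DataFrame na methods can drop or backfill them. A minimal sketch; the usersDF assignment and the "unknown" placeholder are assumptions for illustration:
    # Assumed for illustration; adjust to your source table
    usersDF = spark.read.table('users')

    # Drop rows whose email is NULL
    users_with_email = usersDF.na.drop(subset=["email"])

    # Or replace NULL emails with a placeholder value
    users_filled = usersDF.na.fill({"email": "unknown"})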
    
  • Deduplicate Rows

    • SELECT DISTINCT * FROM <table>;
    • In Python: usersDF.distinct().display()
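    • Both of these compare entire rows when deciding what counts as a duplicate. PySpark's dropDuplicates() is equivalent when called with no arguments:
    # Same result as usersDF.distinct()
    dedupedDF = usersDF.dropDuplicates()
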
  • Deduplicate Rows based on Specific Columns

    CREATE OR REPLACE TEMP VIEW deduped_users AS 
    SELECT user_id, user_first_touch_timestamp, max(email) AS email, max(updated) AS updated
    FROM users_dirty
    WHERE user_id IS NOT NULL
    GROUP BY user_id, user_first_touch_timestamp;
    
    from pyspark.sql.functions import max, col
    dedupedDF = (usersDF
        .where(col("user_id").isNotNull())
        .groupBy("user_id", "user_first_touch_timestamp")
        .agg(max("email").alias("email"), 
             max("updated").alias("updated"))
        )
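
    • If keeping one arbitrary surviving row per key is acceptable, rather than aggregating email and updated with max() to retain their non-NULL values, dropDuplicates with a column subset is a shorter alternative. A minimal sketch, assuming the same usersDF:
    dedupedDF = (usersDF
        .where(col("user_id").isNotNull())
        .dropDuplicates(["user_id", "user_first_touch_timestamp"]))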
    
  • Validate That the Dataset Has No Duplicate Rows

    SELECT max(row_count) <= 1 AS no_duplicate_ids FROM (
      SELECT user_id, count(*) AS row_count
      FROM deduped_users
      GROUP BY user_id)
    
    from pyspark.sql.functions import count, max
    
    display(dedupedDF
        .groupBy("user_id")
        .agg(count("*").alias("row_count"))
        .select((max("row_count") <= 1).alias("no_duplicate_ids")))
    
    • Confirm that each email is associated with at most one user_id.
    SELECT max(user_id_count) <= 1 AS at_most_one_id FROM (
      SELECT email, count(user_id) AS user_id_count
      FROM deduped_users
      WHERE email IS NOT NULL
      GROUP BY email)
    
    display(dedupedDF
        .where(col("email").isNotNull())
        .groupBy("email")
        .agg(count("user_id").alias("user_id_count"))
        .select((max("user_id_count") <= 1).alias("at_most_one_id")))
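
    • In a scheduled job it can be preferable to fail loudly instead of inspecting display() output. A minimal sketch, assuming the dedupedDF defined above, that turns the first check into a hard assertion:
    from pyspark.sql.functions import count, max

    # Compute the largest number of rows sharing a single user_id
    max_rows_per_id = (dedupedDF
        .groupBy("user_id")
        .agg(count("*").alias("row_count"))
        .agg(max("row_count").alias("max_count"))
        .first()["max_count"])
    assert max_rows_per_id <= 1, "Found duplicate user_id values"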
    
  • Date Format and Regex

    SELECT *,
      date_format(first_touch, "MMM d, yyyy") AS first_touch_date,
      date_format(first_touch, "HH:mm:ss") AS first_touch_time,
      regexp_extract(email, "(?<=@).+", 0) AS email_domain
    FROM (
      SELECT *,
      CAST(user_first_touch_timestamp / 1e6 AS timestamp) AS first_touch
      FROM deduped_users
    )
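
    • A Python rendering of the same query, assuming the dedupedDF from above (in regexp_extract, group index 0 returns the entire match):
    from pyspark.sql.functions import date_format, regexp_extract, col

    datedDF = (dedupedDF
        # Timestamps are stored in microseconds; convert to seconds, then cast
        .withColumn("first_touch",
                    (col("user_first_touch_timestamp") / 1e6).cast("timestamp"))
        .withColumn("first_touch_date", date_format("first_touch", "MMM d, yyyy"))
        .withColumn("first_touch_time", date_format("first_touch", "HH:mm:ss"))
        .withColumn("email_domain", regexp_extract("email", "(?<=@).+", 0)))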
    

These SQL and Python statements offer powerful tools for validating, deduplicating, and transforming data, ensuring its accuracy and reliability in your Databricks environment.

Feel free to adapt these statements to your specific data cleansing and validation requirements.

By Hang
