Importing Data into Databricks: A Comprehensive Guide

As a Data Engineer or Analyst working with Databricks, the ability to efficiently import data from external sources is fundamental. This guide covers various SQL statements and commands within Databricks notebooks to seamlessly import data from files and databases.

  • Single File Query

    • SELECT * FROM file_format.`/path/to/file`
    • SELECT * FROM json.`/path/to/file/001.json`
  • Directory File Query

    • SELECT * FROM json.`/path/to/file/`
    • Querying a directory reads all of its files as a single dataset; the files should share the same format and schema.
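
    • To confirm which file each row came from, the built-in input_file_name() function can be added to the query (a minimal sketch; the path is illustrative):

    SELECT *, input_file_name() AS source_file
    FROM json.`/path/to/file/`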
  • Create Reference (VIEW, TEMP VIEW, CTEs)

    -- VIEW (Exists Permanently)
    CREATE OR REPLACE VIEW event_view
    AS SELECT * FROM json.`${DA.paths.kafka_events}`
    
    -- TEMP VIEW (Exists for the current SparkSession/Notebook)
    CREATE OR REPLACE TEMP VIEW events_temp_view
    AS SELECT * FROM json.`${DA.paths.kafka_events}`
    
    -- Common Table Expression (CTE) (Exists for the current cell)
    WITH cte_json
    AS (SELECT * FROM json.`${DA.paths.kafka_events}`)
    SELECT * FROM cte_json
    
  • Loading as Raw Strings

    • SELECT * FROM text.`/path/to/file/001.json`
    • The result dataset has a single value column, holding each line of the file as a raw string.
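
    • Raw-string loading is handy when files are malformed and a direct json query would fail; fields can still be pulled out of the value column manually. A minimal sketch using the built-in get_json_object() function (the path and field name are illustrative):

    SELECT get_json_object(value, '$.timestamp') AS timestamp
    FROM text.`/path/to/file/001.json`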
  • Loading as Binary Files

    • SELECT * FROM binaryFile.`/path/to/file/001.json`
    • The result dataset has path, modificationTime, length, and content columns.
    • The content column holds the binary representation of the file contents.
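
    • Since path and length are ordinary columns, file metadata can be inspected without reading the potentially large content column, e.g. to list the largest files in a directory (sketch; path illustrative):

    SELECT path, length
    FROM binaryFile.`/path/to/file/`
    ORDER BY length DESC
    LIMIT 5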
  • Registering Tables on External Data with Read Options

    • CREATE TABLE table_identifier (col_name1 col_type1, ...) USING data_source OPTIONS (key1 = val1, key2 = val2, ...) LOCATION path

      CREATE TABLE IF NOT EXISTS sales_csv
        (order_id LONG, email STRING, transactions_timestamp LONG, 
         total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, 
         unique_items INTEGER, items STRING)
      USING CSV
      OPTIONS (
        header = "true",
        delimiter = "|"
      )
      LOCATION "/path/to/file/sales.csv"
      
      spark.sql(f"""
      CREATE TABLE IF NOT EXISTS sales_csv
        (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
      USING CSV
      OPTIONS (
        header = "true",
        delimiter = "|"
      )
      LOCATION "{DA.paths.sales_csv}"
      """)
      
  • Describe Tables

    • DESCRIBE EXTENDED table_identifier
    • DESCRIBE EXTENDED sales_csv
    • The result dataset has col_name, data_type, and comment columns; the extended section adds detailed table metadata such as the location and provider.
  • Refresh Tables

    • REFRESH TABLE table_identifier
    • REFRESH TABLE sales_csv
    • Useful when the underlying data has been updated. Spark may keep serving stale results from its cache, so use this statement to invalidate the cache and force Spark to re-read the data, as sketched below.
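
    A minimal sketch of the interaction with explicit caching, assuming the sales_csv table from above:

    CACHE TABLE sales_csv;    -- materialize the current file contents in memory
    REFRESH TABLE sales_csv;  -- invalidate the cache; the next query rescans the files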
  • Extracting Data from a SQL Database

    CREATE TABLE table_identifier
    USING JDBC
    OPTIONS (
        url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
        dbtable = "{jdbcDatabase}.table",
        user = "{jdbcUsername}",
        password = "{jdbcPassword}"
    )
    
    CREATE TABLE users_jdbc
    USING JDBC
    OPTIONS (
      url = "jdbc:sqlite:${DA.paths.ecommerce_db}",
      dbtable = "users"
    )
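
    Once declared, the JDBC-backed table can be queried like any Spark table. Simple filters may be pushed down to the source database so that only matching rows are transferred over the connection (the column name below is illustrative):

    SELECT *
    FROM users_jdbc
    WHERE email IS NOT NULL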
    

By Hang
