Importing Data into Databricks: A Comprehensive Guide

As a Data Engineer or Analyst working with Databricks, the ability to efficiently import data from external sources is fundamental. This guide covers various SQL statements and commands within Databricks notebooks to seamlessly import data from files and databases.

  • Single File Query

    • SELECT * FROM file_format.`/path/to/file`
    • SELECT * FROM json.`/path/to/file/001.json`
  • Directory File Query

    • SELECT * FROM json.`/path/to/file/`
    • Querying a directory reads all of its files as a single dataset; the files should share the same format and schema.
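
    • To confirm which file each row came from, the built-in input_file_name() function can be added to the query (a minimal sketch; the path is illustrative):

    SELECT *, input_file_name() AS source_file
    FROM json.`/path/to/file/`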
  • Create Reference (VIEW, TEMP VIEW, CTEs)

    -- VIEW (Exists Permanently)
    CREATE OR REPLACE VIEW event_view
    AS SELECT * FROM json.`${DA.paths.kafka_events}`
    
    -- TEMP VIEW (Exists for the current SparkSession/Notebook)
    CREATE OR REPLACE TEMP VIEW events_temp_view
    AS SELECT * FROM json.`${DA.paths.kafka_events}`
    
    -- Common Table Expression (CTE) (Exists for the current cell)
    WITH cte_json
    AS (SELECT * FROM json.`${DA.paths.kafka_events}`)
    SELECT * FROM cte_json
    
  • Loading as Raw Strings

    • SELECT * FROM text.`/path/to/file/001.json`
    • The result dataset has a single value column, holding each line of the file as a raw string.
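
    • Raw-string loading is handy when files are malformed and a direct json query would fail; fields can still be pulled out of the value column manually. A minimal sketch using the built-in get_json_object() function (the path and field name are illustrative):

    SELECT get_json_object(value, '$.timestamp') AS timestamp
    FROM text.`/path/to/file/001.json`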
  • Loading as Binary Files

    • SELECT * FROM binaryFile.`/path/to/file/001.json`
    • The result dataset has path, modificationTime, length, and content columns.
    • The content column holds the binary representation of the file contents.
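
    • Since path and length are ordinary columns, file metadata can be inspected without reading the potentially large content column, e.g. to list the largest files in a directory (sketch; path illustrative):

    SELECT path, length
    FROM binaryFile.`/path/to/file/`
    ORDER BY length DESC
    LIMIT 5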
  • Registering Tables on External Data with Read Options

    • CREATE TABLE table_identifier (col_name1 col_type1, ...) USING data_source OPTIONS (key1 = val1, key2 = val2, ...) LOCATION path

      CREATE TABLE IF NOT EXISTS sales_csv
        (order_id LONG, email STRING, transactions_timestamp LONG, 
         total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, 
         unique_items INTEGER, items STRING)
      USING CSV
      OPTIONS (
        header = "true",
        delimiter = "|"
      )
      LOCATION "/path/to/file/sales.csv"
      
      spark.sql(f"""
      CREATE TABLE IF NOT EXISTS sales_csv
        (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
      USING CSV
      OPTIONS (
        header = "true",
        delimiter = "|"
      )
      LOCATION "{DA.paths.sales_csv}"
      """)
      
  • Describe Tables

    • DESCRIBE EXTENDED table_identifier
    • DESCRIBE EXTENDED sales_csv
    • The result dataset has col_name, data_type, and comment columns; the extended section adds detailed table metadata such as the location and provider.
  • Refresh Tables

    • REFRESH TABLE table_identifier
    • REFRESH TABLE sales_csv
    • Useful when the underlying data has been updated. Spark may keep serving stale results from its cache, so use this statement to invalidate the cache and force Spark to re-read the data, as sketched below.
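
    A minimal sketch of the interaction with explicit caching, assuming the sales_csv table from above:

    CACHE TABLE sales_csv;    -- materialize the current file contents in memory
    REFRESH TABLE sales_csv;  -- invalidate the cache; the next query rescans the files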
  • Extracting Data from a SQL Database

    CREATE TABLE table_identifier
    USING JDBC
    OPTIONS (
        url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
        dbtable = "{jdbcDatabase}.table",
        user = "{jdbcUsername}",
        password = "{jdbcPassword}"
    )
    
    CREATE TABLE users_jdbc
    USING JDBC
    OPTIONS (
      url = "jdbc:sqlite:${DA.paths.ecommerce_db}",
      dbtable = "users"
    )
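
    Once declared, the JDBC-backed table can be queried like any Spark table. Simple filters may be pushed down to the source database so that only matching rows are transferred over the connection (the column name below is illustrative):

    SELECT *
    FROM users_jdbc
    WHERE email IS NOT NULL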
    

By Hang
