As a Data Engineer or Analyst working with Databricks, the ability to efficiently import data from external sources is fundamental. This guide covers various SQL statements and commands within Databricks notebooks to seamlessly import data from files and databases.
-
Single File Query
SELECT * FROM file_format.`/path/to/file`
SELECT * FROM json.`/path/to/file/001.json`
-
Directory File Query
SELECT * FROM json.`/path/to/file/001.json`
-
Create Reference (VIEW, TEMP VIEW, CTEs)
-- VIEW (Exists Permenantely) CREATE OR REPLACE VIEW event_view AS SELECT * FROM json.`${DA.paths.kafka_events}` -- TEMP VIEW (Exists for the current SparkSession/Notebook) CREATE OR REPLACE TEMP VIEW events_temp_view AS SELECT * FROM json.`${DA.paths.kafka_events}` -- Common Table Expression (CTEs) (Exists for the current cell) WITH cte_json AS (SELECT * FROM json.`${DA.paths.kafka_events}`) SELECT * FROM cte_json
-
Loading as Raw Strings
SELECT * FROM text.
/path/to/file/001.json“- Result datasets has only
value
column with raw strings
-
Loading as Binary Files
SELECT * FROM binaryFile.
/path/to/file/001.json“- Result datasets has
path
,modificationTime
,length
, andcontent
****columns. content
column shows the binary representation of the file contents.
-
Registering Tables on External Data with Read Options
-
CREATE TABLE table_identifier (col_name1 col_type1, ...) USING data_source OPTIONS (key1 = val1, key2 = val2, ...) LOCATION = path
CREATE TABLE IF NOT EXISTS sales_csv (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING) USING CSV OPTIONS ( header = "true", delimiter = "|" ) LOCATION "/path/to/file/sales.csv"
spark.sql(f""" CREATE TABLE IF NOT EXISTS sales_csv (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING) USING CSV OPTIONS ( header = "true", delimiter = "|" ) LOCATION "{DA.paths.sales_csv}" """)
-
-
Describe Tables
DESCRIBE EXTENDED table_identifier
DESCRIBE EXTENDED sales_csv
- Result dataset gives
col_name
,data_type
,comment
columns.
-
Refresh Tables
REFRESH TABLE table_identifier
REFRESH TABLE sales_csv
- Useful when the table has an update. Spark may not show the updates due to the cache mechanism. Use this statement to inform spark to refresh the data.
-
Extracting Data from SQL Database
CREATE TABLE table_identifer USING JDBC OPTIONS ( url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}", dbtable = "{jdbcDatabase}.table", user = "{jdbcUsername}", password = "{jdbcPassword}" )
CREATE TABLE users_jdbc USING JDBC OPTIONS ( url = "jdbc:sqlite:${DA.paths.ecommerce_db}", dbtable = "users" )