Exploring Spark SQL’s Higher-Order Functions

In the realm of Spark SQL, higher-order functions play an important role in data manipulation and transformation. These functions take a lambda expression as an argument and apply it to the elements of an array, enabling efficient handling of complex, nested data structures. Let’s delve into four fundamental higher-order functions: FILTER, TRANSFORM, EXISTS, and REDUCE.

FILTER – Refining Arrays Based on Conditions

The FILTER function creates a new array column by keeping only the elements of an array that satisfy a provided condition. Take a look at this example:

SELECT order_id, king_items
FROM (
  SELECT order_id,
    FILTER(items, i -> i.item_id LIKE "%K") AS king_items
  FROM sales)
WHERE size(king_items) > 0;

In this snippet, the FILTER function sifts through the ‘items’ array, keeping only those items whose ‘item_id’ ends with “K.” Because a column alias cannot be referenced in the WHERE clause of the same SELECT, the filtering is done in a subquery; the outer WHERE clause then keeps only the rows where the resulting ‘king_items’ array is non-empty (size greater than zero).
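
To see the behavior in isolation, here is a minimal sketch of FILTER applied to a literal array (the values are purely illustrative):

SELECT filter(array(1, 2, 3, 4, 5), x -> x % 2 == 0);
-- [2, 4]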

TRANSFORM – Element-wise Transformation of Arrays

Similar to Python’s map function, TRANSFORM in Spark SQL applies a function to each element of an array and returns a new array of the results:

SELECT *,
TRANSFORM(items, i -> CAST(i.item_revenue_in_usd * 100 AS INT)) AS item_revenues
FROM sales;

Here, the TRANSFORM function multiplies each ‘item_revenue_in_usd’ by 100, casts the result to an integer, and collects the results into the ‘item_revenues’ array.
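
As a minimal standalone illustration with a literal array:

SELECT transform(array(1, 2, 3), x -> x * 10);
-- [10, 20, 30]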

EXISTS – Verifying Element Existence

The EXISTS function determines if any elements within an array satisfy a given condition:

SELECT exists(array(1, 2, 3), x -> x % 2 == 0);
-- true

SELECT exists(array(1, 2, 3), x -> x % 2 == 10);
-- false

This function mimics Python’s any by returning true if at least one element in the array meets the condition; otherwise, it returns false.
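
EXISTS is also handy as a row-level predicate. Assuming the same ‘sales’ table and schema used in the FILTER example, a sketch along these lines would keep only the orders containing at least one item whose ‘item_id’ ends with “K”:

SELECT order_id
FROM sales
WHERE exists(items, i -> i.item_id LIKE "%K");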

REDUCE – Aggregating Array Elements

REDUCE aggregates the elements of an array into a single value. It takes a starting value, a merge function applied to each element, and an optional finish function applied to the final accumulator:

SELECT reduce(array(1, 2, 3), 0, (acc, x) -> acc + x);
-- 6

SELECT reduce(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10);
-- 60

SELECT reduce(array(1, 2, 3, 4),
              named_struct('sum', 0, 'cnt', 0),
              (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
              acc -> acc.sum / acc.cnt) AS avg;
-- 2.5

The REDUCE function consolidates array elements using the provided merge and finish functions. It can compute sums, products, averages, or any other user-defined aggregation over the array elements.
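
Applied to the running ‘sales’ example, a sketch like the following (assuming ‘item_revenue_in_usd’ is a DOUBLE, as in the TRANSFORM example) would total the revenue of each order; the starting value is cast to DOUBLE so that the accumulator matches the type returned by the merge function:

SELECT order_id,
       reduce(items, CAST(0 AS DOUBLE), (acc, i) -> acc + i.item_revenue_in_usd) AS order_revenue
FROM sales;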


Mastering these higher-order functions enhances your ability to efficiently manipulate data within Spark SQL, enabling streamlined data transformations and analysis.

By Hang
