Standard Functions — functions Object
org.apache.spark.sql.functions object defines many built-in functions to work with Columns in Datasets.
You can access the functions using the following import statement:
import org.apache.spark.sql.functions._
There are over 200 functions in the functions object.
scala> spark.catalog.listFunctions.count
res1: Long = 251
Name | Description
---|---
grouping | Indicates whether a specified column is aggregated or not
grouping_id | Computes the level of grouping
Collection functions |
explode_outer | (new in 2.2.0) Creates a new row for each element in the given array or map column. If the array/map is null or empty, null is produced.
from_json | Parses a column with a JSON string into a StructType or ArrayType of StructType elements with the specified schema
Date and time functions |
unix_timestamp | Converts current or specified time to Unix timestamp (in seconds)
window | Generates tumbling time windows
Math functions |
bin | Converts the value of a long column to binary format
Regular functions |
col, column | Creating Columns
String functions |
UDF functions |
udf | Creating UDFs
Window functions |
rank, dense_rank, percent_rank | Ranking records per window partition
ntile | Gives the ntile group id (from 1 to n inclusive) in an ordered window partition
row_number | Sequential numbering per window partition
cume_dist | Cumulative distribution of records across window partitions
Tip
|
The page gives only a brief overview of the many functions available in the functions object, so you should read the official documentation of the functions object.
|
count Function
Caution
|
FIXME |
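Until the section is filled in, here is a minimal sketch of count used as an aggregate function. It assumes the import org.apache.spark.sql.functions._ and spark.implicits._ imports used elsewhere on this page; the people DataFrame is made up for illustration.
// a minimal sketch of count as an aggregate function
val people = Seq((0, "Warsaw"), (1, "Warsaw"), (2, "Toronto")).toDF("id", "city")
// count the non-null ids per city (row order of the result may differ)
scala> people.groupBy("city").agg(count("id") as "count").show
+-------+-----+
|   city|count|
+-------+-----+
| Warsaw|    2|
|Toronto|    1|
+-------+-----+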
explode_outer Function
explode_outer(e: Column): Column
explode_outer generates a new row for each element in the given array or map column e.
Note
|
Unlike explode, explode_outer generates null when the array or map is null or empty.
|
val arrays = Seq((1,Seq.empty[String])).toDF("id", "array")
scala> arrays.printSchema
root
|-- id: integer (nullable = false)
|-- array: array (nullable = true)
| |-- element: string (containsNull = true)
scala> arrays.select(explode_outer($"array")).show
+----+
| col|
+----+
|null|
+----+
Internally, explode_outer
creates a Column with GeneratorOuter and Explode Catalyst expressions.
val explodeOuter = explode_outer($"array").expr
scala> println(explodeOuter.numberedTreeString)
00 generatorouter(explode('array))
01 +- explode('array)
02 +- 'array
explode Function
Caution
|
FIXME |
scala> Seq(Array(0,1,2)).toDF("array").withColumn("num", explode('array)).show
+---------+---+
| array|num|
+---------+---+
|[0, 1, 2]| 0|
|[0, 1, 2]| 1|
|[0, 1, 2]| 2|
+---------+---+
Note
|
explode function is an equivalent of the flatMap operator for Dataset.
|
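To illustrate the note, here is a hedged sketch of the same flattening written with the flatMap operator (assuming the spark.implicits._ encoders used elsewhere on this page):
// the explode example above expressed with Dataset.flatMap (a sketch)
scala> Seq(Array(0, 1, 2)).toDS.flatMap(arr => arr).show
+-----+
|value|
+-----+
|    0|
|    1|
|    2|
+-----+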
Ranking Records per Window Partition — rank Function
rank(): Column
dense_rank(): Column
percent_rank(): Column
rank, dense_rank and percent_rank functions assign the sequential rank of each distinct value per window partition. They are equivalent to the RANK, DENSE_RANK and PERCENT_RANK functions in good ol' SQL.
val dataset = spark.range(9).withColumn("bucket", 'id % 3)
import org.apache.spark.sql.expressions.Window
val byBucket = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("rank", rank over byBucket).show
+---+------+----+
| id|bucket|rank|
+---+------+----+
| 0| 0| 1|
| 3| 0| 2|
| 6| 0| 3|
| 1| 1| 1|
| 4| 1| 2|
| 7| 1| 3|
| 2| 2| 1|
| 5| 2| 2|
| 8| 2| 3|
+---+------+----+
scala> dataset.withColumn("percent_rank", percent_rank over byBucket).show
+---+------+------------+
| id|bucket|percent_rank|
+---+------+------------+
| 0| 0| 0.0|
| 3| 0| 0.5|
| 6| 0| 1.0|
| 1| 1| 0.0|
| 4| 1| 0.5|
| 7| 1| 1.0|
| 2| 2| 0.0|
| 5| 2| 0.5|
| 8| 2| 1.0|
+---+------+------------+
rank
function assigns the same rank for duplicate rows with a gap in the sequence (similarly to Olympic medal places). dense_rank
is like rank
for duplicate rows but compacts the ranks and removes the gaps.
// rank function with duplicates
// Note the missing/sparse ranks, i.e. 2 and 4
scala> dataset.union(dataset).withColumn("rank", rank over byBucket).show
+---+------+----+
| id|bucket|rank|
+---+------+----+
| 0| 0| 1|
| 0| 0| 1|
| 3| 0| 3|
| 3| 0| 3|
| 6| 0| 5|
| 6| 0| 5|
| 1| 1| 1|
| 1| 1| 1|
| 4| 1| 3|
| 4| 1| 3|
| 7| 1| 5|
| 7| 1| 5|
| 2| 2| 1|
| 2| 2| 1|
| 5| 2| 3|
| 5| 2| 3|
| 8| 2| 5|
| 8| 2| 5|
+---+------+----+
// dense_rank function with duplicates
// Note that the missing ranks are now filled in
scala> dataset.union(dataset).withColumn("dense_rank", dense_rank over byBucket).show
+---+------+----------+
| id|bucket|dense_rank|
+---+------+----------+
| 0| 0| 1|
| 0| 0| 1|
| 3| 0| 2|
| 3| 0| 2|
| 6| 0| 3|
| 6| 0| 3|
| 1| 1| 1|
| 1| 1| 1|
| 4| 1| 2|
| 4| 1| 2|
| 7| 1| 3|
| 7| 1| 3|
| 2| 2| 1|
| 2| 2| 1|
| 5| 2| 2|
| 5| 2| 2|
| 8| 2| 3|
| 8| 2| 3|
+---+------+----------+
// percent_rank function with duplicates
scala> dataset.union(dataset).withColumn("percent_rank", percent_rank over byBucket).show
+---+------+------------+
| id|bucket|percent_rank|
+---+------+------------+
| 0| 0| 0.0|
| 0| 0| 0.0|
| 3| 0| 0.4|
| 3| 0| 0.4|
| 6| 0| 0.8|
| 6| 0| 0.8|
| 1| 1| 0.0|
| 1| 1| 0.0|
| 4| 1| 0.4|
| 4| 1| 0.4|
| 7| 1| 0.8|
| 7| 1| 0.8|
| 2| 2| 0.0|
| 2| 2| 0.0|
| 5| 2| 0.4|
| 5| 2| 0.4|
| 8| 2| 0.8|
| 8| 2| 0.8|
+---+------+------------+
Cumulative Distribution of Records Across Window Partitions — cume_dist Function
cume_dist(): Column
cume_dist
computes the cumulative distribution of the records in window partitions. This is equivalent to SQL’s CUME_DIST
function.
val buckets = spark.range(9).withColumn("bucket", 'id % 3)
// Make duplicates
val dataset = buckets.union(buckets)
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("cume_dist", cume_dist over windowSpec).show
+---+------+------------------+
| id|bucket| cume_dist|
+---+------+------------------+
| 0| 0|0.3333333333333333|
| 3| 0|0.6666666666666666|
| 6| 0| 1.0|
| 1| 1|0.3333333333333333|
| 4| 1|0.6666666666666666|
| 7| 1| 1.0|
| 2| 2|0.3333333333333333|
| 5| 2|0.6666666666666666|
| 8| 2| 1.0|
+---+------+------------------+
lag Function
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column
lag(e: Column, offset: Int, defaultValue: Any): Column
lag returns the value in the e / columnName column that is offset records before the current record. lag returns null (or defaultValue, if specified) when there are fewer than offset records before the current record in the window partition.
val buckets = spark.range(9).withColumn("bucket", 'id % 3)
// Make duplicates
val dataset = buckets.union(buckets)
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("lag", lag('id, 1) over windowSpec).show
+---+------+----+
| id|bucket| lag|
+---+------+----+
| 0| 0|null|
| 3| 0| 0|
| 6| 0| 3|
| 1| 1|null|
| 4| 1| 1|
| 7| 1| 4|
| 2| 2|null|
| 5| 2| 2|
| 8| 2| 5|
+---+------+----+
scala> dataset.withColumn("lag", lag('id, 2, "<default_value>") over windowSpec).show
+---+------+----+
| id|bucket| lag|
+---+------+----+
| 0| 0|null|
| 3| 0|null|
| 6| 0| 0|
| 1| 1|null|
| 4| 1|null|
| 7| 1| 1|
| 2| 2|null|
| 5| 2|null|
| 8| 2| 2|
+---+------+----+
Caution
|
FIXME It looks like lag with a default value has a bug — the default value’s not used at all.
|
lead Function
lead(columnName: String, offset: Int): Column
lead(e: Column, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column
lead(e: Column, offset: Int, defaultValue: Any): Column
lead returns the value that is offset records after the current record, and defaultValue if there are fewer than offset records after the current record. lead returns null when there are fewer than offset records after the current record and no defaultValue is given.
val buckets = spark.range(9).withColumn("bucket", 'id % 3)
// Make duplicates
val dataset = buckets.union(buckets)
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("lead", lead('id, 1) over windowSpec).show
+---+------+----+
| id|bucket|lead|
+---+------+----+
| 0| 0| 0|
| 0| 0| 3|
| 3| 0| 3|
| 3| 0| 6|
| 6| 0| 6|
| 6| 0|null|
| 1| 1| 1|
| 1| 1| 4|
| 4| 1| 4|
| 4| 1| 7|
| 7| 1| 7|
| 7| 1|null|
| 2| 2| 2|
| 2| 2| 5|
| 5| 2| 5|
| 5| 2| 8|
| 8| 2| 8|
| 8| 2|null|
+---+------+----+
scala> dataset.withColumn("lead", lead('id, 2, "<default_value>") over windowSpec).show
+---+------+----+
| id|bucket|lead|
+---+------+----+
| 0| 0| 3|
| 0| 0| 3|
| 3| 0| 6|
| 3| 0| 6|
| 6| 0|null|
| 6| 0|null|
| 1| 1| 4|
| 1| 1| 4|
| 4| 1| 7|
| 4| 1| 7|
| 7| 1|null|
| 7| 1|null|
| 2| 2| 5|
| 2| 2| 5|
| 5| 2| 8|
| 5| 2| 8|
| 8| 2|null|
| 8| 2|null|
+---+------+----+
Caution
|
FIXME It looks like lead with a default value has a bug — the default value’s not used at all.
|
Sequential numbering per window partition — row_number Function
row_number(): Column
row_number
returns a sequential number starting at 1
within a window partition.
val buckets = spark.range(9).withColumn("bucket", 'id % 3)
// Make duplicates
val dataset = buckets.union(buckets)
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('bucket).orderBy('id)
scala> dataset.withColumn("row_number", row_number() over windowSpec).show
+---+------+----------+
| id|bucket|row_number|
+---+------+----------+
| 0| 0| 1|
| 0| 0| 2|
| 3| 0| 3|
| 3| 0| 4|
| 6| 0| 5|
| 6| 0| 6|
| 1| 1| 1|
| 1| 1| 2|
| 4| 1| 3|
| 4| 1| 4|
| 7| 1| 5|
| 7| 1| 6|
| 2| 2| 1|
| 2| 2| 2|
| 5| 2| 3|
| 5| 2| 4|
| 8| 2| 5|
| 8| 2| 6|
+---+------+----------+
ntile Function
ntile(n: Int): Column
ntile
computes the ntile group id (from 1
to n
inclusive) in an ordered window partition.
val dataset = spark.range(7).select('*, 'id % 3 as "bucket")
import org.apache.spark.sql.expressions.Window
val byBuckets = Window.partitionBy('bucket).orderBy('id)
scala> dataset.select('*, ntile(3) over byBuckets as "ntile").show
+---+------+-----+
| id|bucket|ntile|
+---+------+-----+
| 0| 0| 1|
| 3| 0| 2|
| 6| 0| 3|
| 1| 1| 1|
| 4| 1| 2|
| 2| 2| 1|
| 5| 2| 2|
+---+------+-----+
Caution
|
FIXME How is ntile different from rank ? What about performance?
|
Creating Columns — col and column Functions
col(colName: String): Column
column(colName: String): Column
col
and column
methods create a Column that you can later use to reference a column in a dataset.
import org.apache.spark.sql.functions._
scala> val nameCol = col("name")
nameCol: org.apache.spark.sql.Column = name
scala> val cityCol = column("city")
cityCol: org.apache.spark.sql.Column = city
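A short, hedged usage sketch follows; the people DataFrame is made up for illustration and the Column values created above are simply passed to select.
// referencing columns of a DataFrame with the Column values created above
val people = Seq(("Jacek", "Warsaw")).toDF("name", "city")
scala> people.select(nameCol, cityCol).show
+-----+------+
| name|  city|
+-----+------+
|Jacek|Warsaw|
+-----+------+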
Defining UDFs — udf Function
udf(f: FunctionN[...]): UserDefinedFunction
The udf family of functions allows you to create user-defined functions (UDFs) based on a user-defined function in Scala. It accepts a function f of 0 to 10 arguments, with the input and output types automatically inferred from the signature of f.
import org.apache.spark.sql.functions._
val _length: String => Int = _.length
val _lengthUDF = udf(_length)
// define a dataframe
val df = sc.parallelize(0 to 3).toDF("num")
// apply the user-defined function to "num" column
scala> df.withColumn("len", _lengthUDF($"num")).show
+---+---+
|num|len|
+---+---+
| 0| 1|
| 1| 1|
| 2| 1|
| 3| 1|
+---+---+
Since Spark 2.0.0, there is another variant of udf
function:
udf(f: AnyRef, dataType: DataType): UserDefinedFunction
udf(f: AnyRef, dataType: DataType) allows you to use a Scala closure for the function argument (as f) and to explicitly declare the output data type (as dataType).
// given the dataframe above
import org.apache.spark.sql.types.IntegerType
val byTwo = udf((n: Int) => n * 2, IntegerType)
scala> df.withColumn("len", byTwo($"num")).show
+---+---+
|num|len|
+---+---+
| 0| 0|
| 1| 2|
| 2| 4|
| 3| 6|
+---+---+
split Function
split(str: Column, pattern: String): Column
split
function splits str
column using pattern
. It returns a new Column
.
Note
|
split function uses the java.lang.String.split(String regex, int limit) method.
|
val df = Seq((0, "hello|world"), (1, "witaj|swiecie")).toDF("num", "input")
val withSplit = df.withColumn("split", split($"input", "[|]"))
scala> withSplit.show
+---+-------------+----------------+
|num| input| split|
+---+-------------+----------------+
| 0| hello|world| [hello, world]|
| 1|witaj|swiecie|[witaj, swiecie]|
+---+-------------+----------------+
Note
|
.$|()[{^?*+\ are RegEx’s meta characters and are considered special.
|
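Because the pattern is a regular expression, metacharacters have to be escaped to be matched literally. A hedged sketch (the ips DataFrame is made up for illustration):
// splitting on a literal dot requires escaping the regex metacharacter
val ips = Seq((0, "192.168.0.1")).toDF("id", "ip")
scala> ips.withColumn("octets", split($"ip", "\\.")).show
+---+-----------+----------------+
| id|         ip|          octets|
+---+-----------+----------------+
|  0|192.168.0.1|[192, 168, 0, 1]|
+---+-----------+----------------+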
upper Function
upper(e: Column): Column
upper function converts a string column to upper case. It returns a new Column.
Note
|
The following example uses two functions that accept a Column and return another to showcase how to chain them.
|
val df = Seq((0,1,"hello"), (2,3,"world"), (2,4, "ala")).toDF("id", "val", "name")
val withUpperReversed = df.withColumn("upper", reverse(upper($"name")))
scala> withUpperReversed.show
+---+---+-----+-----+
| id|val| name|upper|
+---+---+-----+-----+
| 0| 1|hello|OLLEH|
| 2| 3|world|DLROW|
| 2| 4| ala| ALA|
+---+---+-----+-----+
struct Functions
struct(cols: Column*): Column
struct(colName: String, colNames: String*): Column
struct
family of functions allows you to create a new struct column based on a collection of Column
or their names.
Note
|
The difference between struct and the similar array function is that the types of the columns can be different (in struct).
|
scala> df.withColumn("struct", struct($"name", $"val")).show
+---+---+-----+---------+
| id|val| name| struct|
+---+---+-----+---------+
| 0| 1|hello|[hello,1]|
| 2| 3|world|[world,3]|
| 2| 4| ala| [ala,4]|
+---+---+-----+---------+
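As a hedged follow-up, individual fields of the new struct column can be read back with Column.getField (reusing df and the struct column from the example above):
// accessing a field of the struct column
scala> df.withColumn("struct", struct($"name", $"val")).select($"struct".getField("name") as "n").show
+-----+
|    n|
+-----+
|hello|
|world|
|  ala|
+-----+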
broadcast Function
broadcast[T](df: Dataset[T]): Dataset[T]
broadcast function marks the input Dataset as small enough to be used in a broadcast join.
Tip
|
Read up on Broadcast Joins (aka Map-Side Joins). |
val left = Seq((0, "aa"), (0, "bb")).toDF("id", "token").as[(Int, String)]
val right = Seq(("aa", 0.99), ("bb", 0.57)).toDF("token", "prob").as[(String, Double)]
scala> left.join(broadcast(right), "token").explain(extended = true)
== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(token))
:- Project [_1#123 AS id#126, _2#124 AS token#127]
: +- LocalRelation [_1#123, _2#124]
+- BroadcastHint
+- Project [_1#136 AS token#139, _2#137 AS prob#140]
+- LocalRelation [_1#136, _2#137]
== Analyzed Logical Plan ==
token: string, id: int, prob: double
Project [token#127, id#126, prob#140]
+- Join Inner, (token#127 = token#139)
:- Project [_1#123 AS id#126, _2#124 AS token#127]
: +- LocalRelation [_1#123, _2#124]
+- BroadcastHint
+- Project [_1#136 AS token#139, _2#137 AS prob#140]
+- LocalRelation [_1#136, _2#137]
== Optimized Logical Plan ==
Project [token#127, id#126, prob#140]
+- Join Inner, (token#127 = token#139)
:- Project [_1#123 AS id#126, _2#124 AS token#127]
: +- Filter isnotnull(_2#124)
: +- LocalRelation [_1#123, _2#124]
+- BroadcastHint
+- Project [_1#136 AS token#139, _2#137 AS prob#140]
+- Filter isnotnull(_1#136)
+- LocalRelation [_1#136, _2#137]
== Physical Plan ==
*Project [token#127, id#126, prob#140]
+- *BroadcastHashJoin [token#127], [token#139], Inner, BuildRight
:- *Project [_1#123 AS id#126, _2#124 AS token#127]
: +- *Filter isnotnull(_2#124)
: +- LocalTableScan [_1#123, _2#124]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- *Project [_1#136 AS token#139, _2#137 AS prob#140]
+- *Filter isnotnull(_1#136)
+- LocalTableScan [_1#136, _2#137]
expr Function
expr(expr: String): Column
expr function parses the input expr SQL string into the Column it represents.
val ds = Seq((0, "hello"), (1, "world"))
.toDF("id", "token")
.as[(Long, String)]
scala> ds.show
+---+-----+
| id|token|
+---+-----+
| 0|hello|
| 1|world|
+---+-----+
val filterExpr = expr("token = 'hello'")
scala> ds.filter(filterExpr).show
+---+-----+
| id|token|
+---+-----+
| 0|hello|
+---+-----+
Internally, expr
uses the active session’s sqlParser or creates a new SparkSqlParser to call parseExpression method.
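Another hedged usage sketch (reusing the ds Dataset defined above): the SQL string can be any expression, not only a predicate.
// expr with a non-predicate SQL expression
scala> ds.select($"id", expr("upper(token)") as "TOKEN").show
+---+-----+
| id|TOKEN|
+---+-----+
|  0|HELLO|
|  1|WORLD|
+---+-----+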
grouping Aggregate Function
grouping(e: Column): Column
grouping(columnName: String): Column (1)
1. Calls the first grouping with columnName as a Column
grouping is an aggregate function that indicates whether a specified column is aggregated or not and:
- returns 1 if the column is in a subtotal and is NULL
- returns 0 if the underlying value is NULL or any other value
Note
|
grouping can only be used with cube, rollup or GROUPING SETS multi-dimensional aggregate operators (and is verified when Analyzer does check analysis).
|
From Hive’s documentation about Grouping__ID function (that can somehow help to understand grouping):
When aggregates are displayed for a column its value is null. This may conflict in case the column itself has some null values. There needs to be some way to identify NULL in column, which means aggregate and NULL in column, which means value. GROUPING__ID function is the solution to that.
val tmpWorkshops = Seq(
("Warsaw", 2016, 2),
("Toronto", 2016, 4),
("Toronto", 2017, 1)).toDF("city", "year", "count")
// there seems to be a bug with nulls
// and so the need for the following union
val cityNull = Seq(
(null.asInstanceOf[String], 2016, 2)).toDF("city", "year", "count")
val workshops = tmpWorkshops union cityNull
scala> workshops.show
+-------+----+-----+
| city|year|count|
+-------+----+-----+
| Warsaw|2016| 2|
|Toronto|2016| 4|
|Toronto|2017| 1|
| null|2016| 2|
+-------+----+-----+
val q = workshops
.cube("city", "year")
.agg(grouping("city"), grouping("year")) // <-- grouping here
.sort($"city".desc_nulls_last, $"year".desc_nulls_last)
scala> q.show
+-------+----+--------------+--------------+
| city|year|grouping(city)|grouping(year)|
+-------+----+--------------+--------------+
| Warsaw|2016| 0| 0|
| Warsaw|null| 0| 1|
|Toronto|2017| 0| 0|
|Toronto|2016| 0| 0|
|Toronto|null| 0| 1|
| null|2017| 1| 0|
| null|2016| 1| 0|
| null|2016| 0| 0| <-- null is city
| null|null| 0| 1| <-- null is city
| null|null| 1| 1|
+-------+----+--------------+--------------+
Internally, grouping
creates a Column with Grouping
expression.
val q = workshops.cube("city", "year").agg(grouping("city"))
scala> println(q.queryExecution.logical)
'Aggregate [cube(city#182, year#183)], [city#182, year#183, grouping('city) AS grouping(city)#705]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]
scala> println(q.queryExecution.analyzed)
Aggregate [city#724, year#725, spark_grouping_id#721], [city#724, year#725, cast((shiftright(spark_grouping_id#721, 1) & 1) as tinyint) AS grouping(city)#720]
+- Expand [List(city#182, year#183, count#184, city#722, year#723, 0), List(city#182, year#183, count#184, city#722, null, 1), List(city#182, year#183, count#184, null, year#723, 2), List(city#182, year#183, count#184, null, null, 3)], [city#182, year#183, count#184, city#724, year#725, spark_grouping_id#721]
+- Project [city#182, year#183, count#184, city#182 AS city#722, year#183 AS year#723]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]
Note
|
grouping was added to Spark SQL in [SPARK-12706] support grouping/grouping_id function together group set.
|
grouping_id Aggregate Function
grouping_id(cols: Column*): Column
grouping_id(colName: String, colNames: String*): Column (1)
1. Calls the first grouping_id with colName and colNames as objects of type Column
grouping_id is an aggregate function that computes the level of grouping:
- 0 for combinations of each column
- 1 for subtotals of column 1
- 2 for subtotals of column 2
- And so on…
val tmpWorkshops = Seq(
("Warsaw", 2016, 2),
("Toronto", 2016, 4),
("Toronto", 2017, 1)).toDF("city", "year", "count")
// there seems to be a bug with nulls
// and so the need for the following union
val cityNull = Seq(
(null.asInstanceOf[String], 2016, 2)).toDF("city", "year", "count")
val workshops = tmpWorkshops union cityNull
scala> workshops.show
+-------+----+-----+
| city|year|count|
+-------+----+-----+
| Warsaw|2016| 2|
|Toronto|2016| 4|
|Toronto|2017| 1|
| null|2016| 2|
+-------+----+-----+
val query = workshops
.cube("city", "year")
.agg(grouping_id()) // <-- all grouping columns used
.sort($"city".desc_nulls_last, $"year".desc_nulls_last)
scala> query.show
+-------+----+-------------+
| city|year|grouping_id()|
+-------+----+-------------+
| Warsaw|2016| 0|
| Warsaw|null| 1|
|Toronto|2017| 0|
|Toronto|2016| 0|
|Toronto|null| 1|
| null|2017| 2|
| null|2016| 2|
| null|2016| 0|
| null|null| 1|
| null|null| 3|
+-------+----+-------------+
scala> spark.catalog.listFunctions.filter(_.name.contains("grouping_id")).show(false)
+-----------+--------+-----------+----------------------------------------------------+-----------+
|name |database|description|className |isTemporary|
+-----------+--------+-----------+----------------------------------------------------+-----------+
|grouping_id|null |null |org.apache.spark.sql.catalyst.expressions.GroupingID|true |
+-----------+--------+-----------+----------------------------------------------------+-----------+
// bin function gives the string representation of the binary value of the given long column
scala> query.withColumn("bitmask", bin($"grouping_id()")).show
+-------+----+-------------+-------+
| city|year|grouping_id()|bitmask|
+-------+----+-------------+-------+
| Warsaw|2016| 0| 0|
| Warsaw|null| 1| 1|
|Toronto|2017| 0| 0|
|Toronto|2016| 0| 0|
|Toronto|null| 1| 1|
| null|2017| 2| 10|
| null|2016| 2| 10|
| null|2016| 0| 0| <-- null is city
| null|null| 3| 11|
| null|null| 1| 1|
+-------+----+-------------+-------+
The list of columns of grouping_id should match the grouping columns (in cube or rollup) exactly, or be empty, which means all the grouping columns (which is exactly what the function expects).
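A hedged sketch of the variant with the grouping columns listed explicitly (reusing the workshops Dataset above); it is expected to produce the same levels as grouping_id(), with the column named gid.
// grouping_id with the grouping columns listed explicitly (a sketch)
val qExplicit = workshops
  .cube("city", "year")
  .agg(grouping_id("city", "year") as "gid")
  .sort($"city".desc_nulls_last, $"year".desc_nulls_last)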
Note
|
grouping_id can only be used with cube, rollup or GROUPING SETS multi-dimensional aggregate operators (and is verified when Analyzer does check analysis).
|
Note
|
Spark SQL’s grouping_id function is known as grouping__id in Hive.
|
When aggregates are displayed for a column its value is null. This may conflict in case the column itself has some null values. There needs to be some way to identify NULL in column, which means aggregate and NULL in column, which means value. GROUPING__ID function is the solution to that.
Internally, grouping_id()
creates a Column with GroupingID
unevaluable expression.
Note
|
Unevaluable expressions are expressions replaced by some other expressions during analysis or optimization. |
// workshops dataset was defined earlier
val q = workshops
.cube("city", "year")
.agg(grouping_id())
// grouping_id function is spark_grouping_id virtual column internally
// that is resolved during analysis - see Analyzed Logical Plan
scala> q.explain(true)
== Parsed Logical Plan ==
'Aggregate [cube(city#182, year#183)], [city#182, year#183, grouping_id() AS grouping_id()#742]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]
== Analyzed Logical Plan ==
city: string, year: int, grouping_id(): int
Aggregate [city#757, year#758, spark_grouping_id#754], [city#757, year#758, spark_grouping_id#754 AS grouping_id()#742]
+- Expand [List(city#182, year#183, count#184, city#755, year#756, 0), List(city#182, year#183, count#184, city#755, null, 1), List(city#182, year#183, count#184, null, year#756, 2), List(city#182, year#183, count#184, null, null, 3)], [city#182, year#183, count#184, city#757, year#758, spark_grouping_id#754]
+- Project [city#182, year#183, count#184, city#182 AS city#755, year#183 AS year#756]
+- Union
:- Project [_1#178 AS city#182, _2#179 AS year#183, _3#180 AS count#184]
: +- LocalRelation [_1#178, _2#179, _3#180]
+- Project [_1#192 AS city#196, _2#193 AS year#197, _3#194 AS count#198]
+- LocalRelation [_1#192, _2#193, _3#194]
== Optimized Logical Plan ==
Aggregate [city#757, year#758, spark_grouping_id#754], [city#757, year#758, spark_grouping_id#754 AS grouping_id()#742]
+- Expand [List(city#755, year#756, 0), List(city#755, null, 1), List(null, year#756, 2), List(null, null, 3)], [city#757, year#758, spark_grouping_id#754]
+- Union
:- LocalRelation [city#755, year#756]
+- LocalRelation [city#755, year#756]
== Physical Plan ==
*HashAggregate(keys=[city#757, year#758, spark_grouping_id#754], functions=[], output=[city#757, year#758, grouping_id()#742])
+- Exchange hashpartitioning(city#757, year#758, spark_grouping_id#754, 200)
+- *HashAggregate(keys=[city#757, year#758, spark_grouping_id#754], functions=[], output=[city#757, year#758, spark_grouping_id#754])
+- *Expand [List(city#755, year#756, 0), List(city#755, null, 1), List(null, year#756, 2), List(null, null, 3)], [city#757, year#758, spark_grouping_id#754]
+- Union
:- LocalTableScan [city#755, year#756]
+- LocalTableScan [city#755, year#756]
Note
|
grouping_id was added to Spark SQL in [SPARK-12706] support grouping/grouping_id function together group set.
|
Parsing Column With JSON-Encoded Records — from_json Functions
from_json(e: Column, schema: DataType): Column (1)
from_json(e: Column, schema: DataType, options: Map[String, String]): Column
1. Relays to the other from_json with empty options
Parses a column with a JSON string into a StructType or ArrayType of StructType elements with the specified schema.
Note
|
options controls how a JSON is parsed and contains the same options as the json data source.
|
Internally, from_json
creates a Column with JsonToStructs unary expression.
val jsons = Seq("""{ "id": 0 }""").toDF("json")
import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", IntegerType, nullable = false) :: Nil)
scala> jsons.select(from_json($"json", schema) as "ids").show
+---+
|ids|
+---+
|[0]|
+---+
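A hedged sketch of the variant that takes options (reusing the schema above); allowUnquotedFieldNames is one of the json data source options.
// from_json with options of the json data source
val lax = Seq("""{ id: 0 }""").toDF("json")
scala> lax.select(from_json($"json", schema, Map("allowUnquotedFieldNames" -> "true")) as "ids").show
+---+
|ids|
+---+
|[0]|
+---+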
Note
|
from_json corresponds to SQL’s from_json .
|
Converting Long to Binary Format (in String Representation) — bin Function
bin(e: Column): Column
bin(columnName: String): Column (1)
1. Calls the first bin with columnName as a Column
bin converts the long value in a column to its binary format (i.e. as an unsigned integer in base 2) with no extra leading 0s.
scala> spark.range(5).withColumn("binary", bin('id)).show
+---+------+
| id|binary|
+---+------+
| 0| 0|
| 1| 1|
| 2| 10|
| 3| 11|
| 4| 100|
+---+------+
val withBin = spark.range(5).withColumn("binary", bin('id))
scala> withBin.printSchema
root
|-- id: long (nullable = false)
|-- binary: string (nullable = false)
Internally, bin
creates a Column with Bin
unary expression.
scala> withBin.queryExecution.logical
res2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [*, bin('id) AS binary#14]
+- Range (0, 5, step=1, splits=Some(8))
Note
|
Bin unary expression uses java.lang.Long.toBinaryString for the conversion.
|