val q = sql("""
SELECT customer, year, SUM(sales)
FROM VALUES ("abc", 2017, 30) AS t1 (customer, year, sales)
GROUP BY customer, year
GROUPING SETS ((customer), (year))
""")
scala> println(q.queryExecution.logical.numberedTreeString)
00 'GroupingSets [ArrayBuffer('customer), ArrayBuffer('year)], ['customer, 'year], ['customer, 'year, unresolvedalias('SUM('sales), None)]
01 +- 'SubqueryAlias t1
02 +- 'UnresolvedInlineTable [customer, year, sales], [List(abc, 2017, 30)]
GroupingSets Unary Logical Operator
GroupingSets is a unary logical operator that represents SQL’s GROUPING SETS variant of GROUP BY clause.
GroupingSets operator is resolved to an Aggregate logical operator at analysis phase.
scala> println(q.queryExecution.analyzed.numberedTreeString)
00 Aggregate [customer#8, year#9, spark_grouping_id#5], [customer#8, year#9, sum(cast(sales#2 as bigint)) AS sum(sales)#4L]
01 +- Expand [List(customer#0, year#1, sales#2, customer#6, null, 1), List(customer#0, year#1, sales#2, null, year#7, 2)], [customer#0, year#1, sales#2, customer#8, year#9, spark_grouping_id#5]
02 +- Project [customer#0, year#1, sales#2, customer#0 AS customer#6, year#1 AS year#7]
03 +- SubqueryAlias t1
04 +- LocalRelation [customer#0, year#1, sales#2]
|
Note
|
GroupingSets can only be created using SQL.
|
|
Note
|
GroupingSets is not supported on Structured Streaming’s streaming Datasets.
|
GroupingSets is never resolved (as it can only be converted to an Aggregate logical operator).
The output schema of GroupingSets are exactly the attributes of aggregate named expressions.
Analysis Phase
GroupingSets operator is resolved at analysis phase in the following logical evaluation rules:
-
ResolveAliases for unresolved aliases in aggregate named expressions
val spark: SparkSession = ...
// using q from the example above
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'GroupingSets [ArrayBuffer('customer), ArrayBuffer('year)], ['customer, 'year], ['customer, 'year, unresolvedalias('SUM('sales), None)]
01 +- 'SubqueryAlias t1
02 +- 'UnresolvedInlineTable [customer, year, sales], [List(abc, 2017, 30)]
// Note unresolvedalias for SUM expression
// Note UnresolvedInlineTable and SubqueryAlias
// FIXME Show the evaluation rules to get rid of the unresolvable parts
Creating GroupingSets Instance
GroupingSets takes the following when created:
-
Expressions from
GROUPING SETSclause -
Grouping expressions from
GROUP BYclause -
Child logical plan
-
Aggregate named expressions