val q = sql("""
SELECT customer, year, SUM(sales)
FROM VALUES ("abc", 2017, 30) AS t1 (customer, year, sales)
GROUP BY customer, year
GROUPING SETS ((customer), (year))
""")
scala> println(q.queryExecution.logical.numberedTreeString)
00 'GroupingSets [ArrayBuffer('customer), ArrayBuffer('year)], ['customer, 'year], ['customer, 'year, unresolvedalias('SUM('sales), None)]
01 +- 'SubqueryAlias t1
02 +- 'UnresolvedInlineTable [customer, year, sales], [List(abc, 2017, 30)]
GroupingSets Unary Logical Operator
GroupingSets
is a unary logical operator that represents SQL’s GROUPING SETS variant of GROUP BY
clause.
GroupingSets
operator is resolved to an Aggregate
logical operator at analysis phase.
scala> println(q.queryExecution.analyzed.numberedTreeString)
00 Aggregate [customer#8, year#9, spark_grouping_id#5], [customer#8, year#9, sum(cast(sales#2 as bigint)) AS sum(sales)#4L]
01 +- Expand [List(customer#0, year#1, sales#2, customer#6, null, 1), List(customer#0, year#1, sales#2, null, year#7, 2)], [customer#0, year#1, sales#2, customer#8, year#9, spark_grouping_id#5]
02 +- Project [customer#0, year#1, sales#2, customer#0 AS customer#6, year#1 AS year#7]
03 +- SubqueryAlias t1
04 +- LocalRelation [customer#0, year#1, sales#2]
Note
|
GroupingSets can only be created using SQL.
|
Note
|
GroupingSets is not supported on Structured Streaming’s streaming Datasets.
|
GroupingSets
is never resolved (as it can only be converted to an Aggregate
logical operator).
The output schema of GroupingSets
are exactly the attributes of aggregate named expressions.
Analysis Phase
GroupingSets
operator is resolved at analysis phase in the following logical evaluation rules:
-
ResolveAliases for unresolved aliases in aggregate named expressions
val spark: SparkSession = ...
// using q from the example above
val plan = q.queryExecution.logical
scala> println(plan.numberedTreeString)
00 'GroupingSets [ArrayBuffer('customer), ArrayBuffer('year)], ['customer, 'year], ['customer, 'year, unresolvedalias('SUM('sales), None)]
01 +- 'SubqueryAlias t1
02 +- 'UnresolvedInlineTable [customer, year, sales], [List(abc, 2017, 30)]
// Note unresolvedalias for SUM expression
// Note UnresolvedInlineTable and SubqueryAlias
// FIXME Show the evaluation rules to get rid of the unresolvable parts
Creating GroupingSets Instance
GroupingSets
takes the following when created:
-
Expressions from
GROUPING SETS
clause -
Grouping expressions from
GROUP BY
clause -
Child logical plan
-
Aggregate named expressions