Analyzer — Logical Query Plan Analyzer

Analyzer is a logical query plan analyzer in Spark SQL that semantically validates and transforms an unresolved logical plan into an analyzed logical plan (with proper relational entities) using logical evaluation rules.

Analyzer: Unresolved Logical Plan ==> Analyzed Logical Plan

You can access a session-specific Analyzer through SessionState.

val spark: SparkSession = ...
spark.sessionState.analyzer
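
A quick sanity check of its type in spark-shell (output from a Spark 2.x shell):

scala> :type spark.sessionState.analyzer
org.apache.spark.sql.catalyst.analysis.Analyzer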

You can access the analyzed logical plan of a Dataset using the explain operator (with the extended flag enabled) or SQL’s EXPLAIN EXTENDED command.

// sample Dataset
val inventory = spark.range(5)
  .withColumn("new_column", 'id + 5 as "plus5")

// Using explain operator (with extended flag enabled)
scala> inventory.explain(extended = true)
== Parsed Logical Plan ==
'Project [*, ('id + 5) AS plus5#81 AS new_column#82]
+- Range (0, 5, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new_column: bigint
Project [id#78L, (id#78L + cast(5 as bigint)) AS new_column#82L]
+- Range (0, 5, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#78L, (id#78L + 5) AS new_column#82L]
+- Range (0, 5, step=1, splits=Some(8))

== Physical Plan ==
*Project [id#78L, (id#78L + 5) AS new_column#82L]
+- *Range (0, 5, step=1, splits=8)

Alternatively, you can access the analyzed logical plan through QueryExecution's analyzed attribute (which, together with the numberedTreeString method, is a very good "debugging" tool).

// Here with numberedTreeString to...please your eyes :)
scala> println(inventory.queryExecution.analyzed.numberedTreeString)
00 Project [id#78L, (id#78L + cast(5 as bigint)) AS new_column#82L]
01 +- Range (0, 5, step=1, splits=Some(8))

Analyzer defines the extendedResolutionRules extension point for additional logical evaluation rules that a custom Analyzer can use to extend the Resolution batch. The rules are added at the end of that batch.
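
One way to plug a rule into that extension point is the SparkSessionExtensions API (available since Spark 2.2). The sketch below uses a made-up, do-nothing rule (MyResolutionRule) only to show the wiring:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A hypothetical no-op rule, used only to illustrate the extension point
case class MyResolutionRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan  // leaves the plan untouched
}

val customizedSpark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectResolutionRule(session => MyResolutionRule(session)))
  .getOrCreate()

The injected rule should then show up among the session Analyzer's extendedResolutionRules.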

Note
SessionState uses its own Analyzer with custom extendedResolutionRules, postHocResolutionRules, and extendedCheckRules extension points.

Analyzer is created while its owning SessionState is.

Table 1. Analyzer’s Internal Registries and Counters (in alphabetical order)

extendedResolutionRules

Additional rules for the Resolution batch. Empty by default.

fixedPoint

FixedPoint with maxIterations for the Hints, Substitution, Resolution and Cleanup batches.

Set when Analyzer is created (and can be defined explicitly or through the optimizerMaxIterations configuration setting).

postHocResolutionRules

The only rules of the Post-Hoc Resolution batch, if defined (executed in one pass, i.e. with the Once strategy). Empty by default.

Analyzer is used by QueryExecution to resolve the managed LogicalPlan (and, as a follow-up, to assert that a structured query has been properly analyzed, i.e. that no unresolved or otherwise broken logical plan operators or expressions remain).
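
To see what that means in practice, here is a minimal sketch that mimics what QueryExecution does: parse a SQL text into an unresolved plan and run the session Analyzer on it (execute comes from RuleExecutor; the checkAnalysis step is not shown here):

// Parse a SQL text into an unresolved logical plan...
val unresolvedPlan = spark.sessionState.sqlParser.parsePlan("SELECT id FROM range(3)")
println(unresolvedPlan.resolved)  // false -- not analyzed yet

// ...and resolve it with the session-specific Analyzer
val analyzedPlan = spark.sessionState.analyzer.execute(unresolvedPlan)
println(analyzedPlan.resolved)    // true
println(analyzedPlan.numberedTreeString)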

Tip

Enable TRACE or DEBUG logging levels for the respective session-specific loggers to see what happens inside Analyzer.

  • org.apache.spark.sql.internal.SessionState$$anon$1

  • org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1 when Hive support is enabled

Add one of the following lines to conf/log4j.properties:

# with no Hive support
log4j.logger.org.apache.spark.sql.internal.SessionState$$anon$1=TRACE

# with Hive support enabled
log4j.logger.org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1=DEBUG

Refer to Logging.


The reason for such unusual-looking logger names is that the analyzer attribute is created as an anonymous subclass of the Analyzer class in the respective SessionStates.

Executing Logical Evaluation Rules — execute Method

Analyzer is a RuleExecutor that defines batches of logical evaluation rules that resolve, remove, and in general modify a logical plan, as summarized below.

Table 2. Analyzer’s Batches and Logical Evaluation Rules (in the order of execution)

Batch: Hints (FixedPoint strategy)

ResolveBroadcastHints

Adds a BroadcastHint unary operator to a logical plan for BROADCAST, BROADCASTJOIN and MAPJOIN broadcast hints (over UnresolvedRelation and SubqueryAlias logical plans).

RemoveAllHints

Removes all the hints (valid or not).

Batch: Simple Sanity Check (Once strategy)

LookupFunctions

Checks whether a function identifier (referenced by an UnresolvedFunction) exists in the function registry. Throws a NoSuchFunctionException if not.

Batch: Substitution (FixedPoint strategy)

CTESubstitution

Resolves With operators (substituting named common table expressions, i.e. CTEs); see the CTE sketch right after this table.

WindowsSubstitution

Substitutes UnresolvedWindowExpression with WindowExpression for WithWindowDefinition logical operators.

EliminateUnions

Eliminates a Union with a single child (replacing the Union with that child)

SubstituteUnresolvedOrdinals

Replaces ordinals in Sort and Aggregate operators with UnresolvedOrdinal

Batch: Resolution (FixedPoint strategy)

ResolveTableValuedFunctions

Replaces an UnresolvedTableValuedFunction with a table-valued function.

ResolveRelations

Resolves InsertIntoTable and UnresolvedRelation operators

ResolveReferences

ResolveCreateNamedStruct

ResolveDeserializer

ResolveNewInstance

ResolveUpCast

ResolveGroupingAnalytics

Resolves grouping expressions up in a logical plan tree:

  • Cube, Rollup and GroupingSets expressions

  • Filter with Grouping or GroupingID expressions

  • Sort with Grouping or GroupingID expressions

Expects that all children of a logical operator are already resolved (and since the rule belongs to a fixed-point batch, that will eventually be the case in some iteration).

Fails analysis when the grouping__id Hive function is used:

scala> sql("select grouping__id").queryExecution.analyzed
org.apache.spark.sql.AnalysisException: grouping__id is deprecated; use grouping_id() instead;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$apply$6.applyOrElse(Analyzer.scala:451)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$apply$6.applyOrElse(Analyzer.scala:448)
Note
ResolveGroupingAnalytics is only for grouping functions resolution while ResolveAggregateFunctions is responsible for resolving the other aggregates.

ResolvePivot

Resolves a Pivot logical operator to a Project with an Aggregate unary logical operator (for supported data types in aggregates) or just a single Aggregate.

ResolveOrdinalInOrderByAndGroupBy

ResolveMissingReferences

ExtractGenerator

ResolveGenerate

ResolveFunctions

Resolves functions using SessionCatalog.

If a function is expected to be a Generator but is not, ResolveFunctions reports the following error:

[name] is expected to be a generator. However, its class is [className], which is not a generator.

ResolveAliases

Replaces UnresolvedAlias expressions with concrete aliases:

  • NamedExpressions

  • MultiAlias (for GeneratorOuter and Generator)

  • Alias (for Cast and ExtractValue)

ResolveSubquery

ResolveWindowOrder

ResolveWindowFrame

Resolves WindowExpression expressions

ResolveNaturalAndUsingJoin

ExtractWindowExpressions

GlobalAggregates

Resolves (i.e. replaces) Project operators whose expressions contain AggregateExpressions (that are not part of WindowExpressions) with Aggregate unary logical operators.

ResolveAggregateFunctions

Resolves aggregate functions in Filter and Sort operators

Note
ResolveAggregateFunctions skips (i.e. does not resolve) grouping functions that are resolved by ResolveGroupingAnalytics rule.

TimeWindowing

Resolves TimeWindow expressions to Filter with Expand logical operators; see the time-window sketch after this table.

Note
Multiple distinct time window expressions are not currently supported (and yield an AnalysisException)

ResolveInlineTables

Resolves an UnresolvedInlineTable to a LocalRelation

TypeCoercion.typeCoercionRules

extendedResolutionRules

Batch: Post-Hoc Resolution (Once strategy)

postHocResolutionRules

Batch: View (Once strategy)

AliasViewChild

Batch: Nondeterministic (Once strategy)

PullOutNondeterministic

Batch: UDF (Once strategy)

HandleNullInputsForUDF

Batch: FixNullability (Once strategy)

FixNullability

Batch: ResolveTimeZone (Once strategy)

ResolveTimeZone

Replaces a TimeZoneAwareExpression that has no time zone with one that uses the session-local time zone.

Batch: Cleanup (FixedPoint strategy)

CleanupAliases
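
To watch some of these rules at work (e.g. CTESubstitution from the Substitution batch and ResolveReferences from the Resolution batch), analyze a query with a named common table expression and inspect the analyzed plan. A minimal sketch (nums and next_id are made-up names):

// CTESubstitution inlines the CTE; the remaining rules resolve relations and references
val q = spark.sql("WITH nums AS (SELECT id FROM range(3)) SELECT id + 1 AS next_id FROM nums")
println(q.queryExecution.analyzed.numberedTreeString)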

Tip
Consult the sources of Analyzer for the up-to-date list of the evaluation rules.
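
Similarly, the TimeWindowing rule (see the Resolution batch above) can be observed by grouping over a time window; the time column below is made up for the sketch:

import org.apache.spark.sql.functions.{current_timestamp, window}
import spark.implicits._

// The window expression is rewritten by TimeWindowing during analysis
val events = spark.range(10).withColumn("time", current_timestamp())
val windowedCounts = events.groupBy(window($"time", "5 minutes")).count()
println(windowedCounts.queryExecution.analyzed.numberedTreeString)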

Creating Analyzer Instance

Analyzer takes the following when created:

  • a SessionCatalog

  • a CatalystConf

  • the number of iterations of the FixedPoint rule batches (maxIterations)

Analyzer initializes the internal registries and counters.

Note
Analyzer can also be created without specifying maxIterations, which is then configured using the optimizerMaxIterations configuration setting.
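
The setting behind optimizerMaxIterations is spark.sql.optimizer.maxIterations (shared with the Optimizer). A quick way to inspect the current value:

// The maximum number of iterations of the fixed-point batches (100 by default)
println(spark.conf.get("spark.sql.optimizer.maxIterations"))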

resolver Method

resolver: Resolver

resolver requests CatalystConf for a Resolver.

Note
Resolver is a mere function of two String parameters that returns true if both refer to the same entity (i.e. case-insensitive or case-sensitive string equality, depending on the configuration).
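
For illustration, the two predefined resolvers live in the org.apache.spark.sql.catalyst.analysis package object and are plain (String, String) => Boolean functions:

import org.apache.spark.sql.catalyst.analysis.{caseInsensitiveResolution, caseSensitiveResolution}

caseInsensitiveResolution("Id", "id")  // true
caseSensitiveResolution("Id", "id")    // false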
