Expression — Executable Node in Catalyst Tree

Expression is a executable node (in a Catalyst tree) that can be evaluated to a value given input values, i.e. can produce a JVM object per InternalRow.

Note
Expression is often called a Catalyst expression even though it is merely built using (not be part of) the Catalyst — Tree Manipulation Framework.
// evaluating an expression
// Use Literal expression to create an expression from a Scala object
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.Literal
val e: Expression = Literal("hello")

import org.apache.spark.sql.catalyst.expressions.EmptyRow
val v: Any = e.eval(EmptyRow)

// Convert to Scala's String
import org.apache.spark.unsafe.types.UTF8String
scala> val s = v.asInstanceOf[UTF8String].toString
s: String = hello

Expression can generate a Java source code that is then used in evaluation.

Table 1. Specialized Expressions
Name Scala Kind Behaviour Examples

BinaryExpression

abstract class

CodegenFallback

trait

Does not support code generation and falls back to interpreted mode

ExpectsInputTypes

trait

LeafExpression

abstract class

Has no child expressions (and hence "terminates" the expression tree).

NamedExpression

Can later be referenced in a dataflow graph.

Nondeterministic

trait

NonSQLExpression

trait

Expression with no SQL representation

Gives the only custom sql method that is non-overridable (i.e. final).

When requested SQL representation, NonSQLExpression transforms Attributes to be PrettyAttributes to build text representation.

TernaryExpression

abstract class

TimeZoneAwareExpression

trait

Timezone-aware expressions

UnaryExpression

abstract class

Unevaluable

trait

Cannot be evaluated, i.e. eval and doGenCode are not supported and report an UnsupportedOperationException.

Unevaluable expressions are supposed to be replaced by some other expressions during analysis or optimization.

Expression Contract

package org.apache.spark.sql.catalyst.expressions

abstract class Expression extends TreeNode[Expression] {
  // only required methods that have no implementation
  def dataType: DataType
  def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
  def eval(input: InternalRow = EmptyRow): Any
  def nullable: Boolean
}
Table 2. (Subset of) Expression Contract (in alphabetical order)
Method Description

canonicalized

checkInputDataTypes

childrenResolved

dataType

deterministic

doGenCode

Code-generated evaluation that generates a Java source code (that is used to evaluate the expression in a more optimized way not directly using eval).

Used as part of genCode.

eval

No-code-generated evaluation that evaluates the expression to a JVM object for a given InternalRow (without generating a corresponding Java code.)

Note
By default accepts EmptyRow, i.e. null.

foldable

genCode

Code-generated evaluation that generates a Java source code (that is used to evaluate the expression in a more optimized way not directly using eval).

Similar to doGenCode but supports expression reuse (aka subexpression elimination).

nullable

prettyName

references

resolved

semanticEquals

semanticHash

sql

SQL representation

prettyName followed by sql of children in the round brackets and concatenated using the comma (,), e.g.

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Sentences
val sentences = Sentences("Hi there! Good morning.", "en", "US")

import org.apache.spark.sql.catalyst.expressions.Expression
val expr: Expression = count("*") === 5 && count(sentences) === 5
scala> expr.sql
res0: String = ((count('*') = 5) AND (count(sentences('Hi there! Good morning.', 'en', 'US')) = 5))

Nondeterministic Expression

Nondeterministic expressions are non-deterministic and non-foldable, i.e. deterministic and foldable properties are disabled (i.e. false). They require explicit initialization before evaluation.

Nondeterministic expressions have two additional methods:

  1. initInternal for internal initialization (called before eval)

  2. evalInternal to evaluate a InternalRow into a JVM object.

Note
Nondeterministic is a Scala trait.

Nondeterministic expressions have the additional initialized flag that is enabled (i.e. true) after the other additional initInternal method has been called.

Examples of Nondeterministic expressions are InputFileName, MonotonicallyIncreasingID, SparkPartitionID functions and the abstract RDG (that is the base for Rand and Randn functions).

Note
Nondeterministic expressions are the target of PullOutNondeterministic logical plan rule.

results matching ""

    No results matching ""