ShuffledHashJoinExec Binary Physical Operator

ShuffledHashJoinExec is a binary physical operator for hash-based joins.

ShuffledHashJoinExec is created for joins with joining keys and one of the following holds:

***********
Start spark-shell with ShuffledHashJoinExec's selection requirements

./bin/spark-shell \
    -c spark.sql.join.preferSortMergeJoin=false \
    -c spark.sql.autoBroadcastJoinThreshold=1
***********

scala> spark.conf.get("spark.sql.join.preferSortMergeJoin")
res0: String = false

scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res1: String = 1

scala> spark.conf.get("spark.sql.shuffle.partitions")
res2: String = 200

val dataset = Seq(
  (0, "playing"),
  (1, "with"),
  (2, "ShuffledHashJoinExec")
).toDF("id", "token")
val query = dataset.join(dataset, Seq("id"), "leftsemi")

scala> query.queryExecution.optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
res3: BigInt = 72

scala> query.explain
== Physical Plan ==
ShuffledHashJoin [id#15], [id#20], LeftSemi, BuildRight
:- Exchange hashpartitioning(id#15, 200)
:  +- LocalTableScan [id#15, token#16]
+- Exchange hashpartitioning(id#20, 200)
   +- LocalTableScan [id#20]
Note
ShuffledHashJoinExec operator is chosen in JoinSelection execution planning strategy.
Table 1. ShuffledHashJoinExec’s SQLMetrics
Name Description

buildDataSize

data size of build side

buildTime

time to build hash map

numOutputRows

number of output rows

spark sql ShuffledHashJoinExec webui query details.png
Figure 1. ShuffledHashJoinExec in web UI (Details for Query)
Table 2. ShuffledHashJoinExec’s Required Child Output Distributions
Left Child Right Child

ClusteredDistribution (per left join key expressions)

ClusteredDistribution (per right join key expressions)

Executing ShuffledHashJoinExec — doExecute Method

doExecute(): RDD[InternalRow]
Caution
FIXME
Note
doExecute is a part of SparkPlan Contract to produce the result of a structured query as an RDD of internal binary rows.

buildHashedRelation Internal Method

Caution
FIXME

Creating ShuffledHashJoinExec Instance

ShuffledHashJoinExec takes the following when created:

results matching ""

    No results matching ""