***********
Start spark-shell with ShuffledHashJoinExec's selection requirements
./bin/spark-shell \
-c spark.sql.join.preferSortMergeJoin=false \
-c spark.sql.autoBroadcastJoinThreshold=1
***********
scala> spark.conf.get("spark.sql.join.preferSortMergeJoin")
res0: String = false
scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res1: String = 1
scala> spark.conf.get("spark.sql.shuffle.partitions")
res2: String = 200
val dataset = Seq(
(0, "playing"),
(1, "with"),
(2, "ShuffledHashJoinExec")
).toDF("id", "token")
val query = dataset.join(dataset, Seq("id"), "leftsemi")
scala> query.queryExecution.optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
res3: BigInt = 72
scala> query.explain
== Physical Plan ==
ShuffledHashJoin [id#15], [id#20], LeftSemi, BuildRight
:- Exchange hashpartitioning(id#15, 200)
: +- LocalTableScan [id#15, token#16]
+- Exchange hashpartitioning(id#20, 200)
+- LocalTableScan [id#20]
ShuffledHashJoinExec Binary Physical Operator
ShuffledHashJoinExec
is a binary physical operator for hash-based joins.
ShuffledHashJoinExec
is created for joins with joining keys and one of the following holds:
-
spark.sql.join.preferSortMergeJoin is disabled, canBuildRight, canBuildLocalHashMap for right join side and finally right join side is much smaller than left side
-
spark.sql.join.preferSortMergeJoin is disabled, canBuildLeft, canBuildLocalHashMap for left join side and finally left join side is much smaller than right
-
Left join keys are not orderable
Note
|
ShuffledHashJoinExec operator is chosen in JoinSelection execution planning strategy.
|
Name | Description |
---|---|
data size of build side |
|
time to build hash map |
|
number of output rows |
Left Child | Right Child |
---|---|
|
|
Executing ShuffledHashJoinExec — doExecute
Method
doExecute(): RDD[InternalRow]
Caution
|
FIXME |
Note
|
doExecute is a part of SparkPlan Contract to produce the result of a structured query as an RDD of internal binary rows.
|
buildHashedRelation
Internal Method
Caution
|
FIXME |
Creating ShuffledHashJoinExec Instance
ShuffledHashJoinExec
takes the following when created:
-
Left join key expressions
-
Right join key expressions
-
Optional join condition expression
-
Left physical operator
-
Right physical operator