ShuffledRowRDD

ShuffledRowRDD is a specialized RDD of InternalRows.

Note
ShuffledRowRDD looks like ShuffledRDD, and the difference is in the type of the values to process, i.e. InternalRow and (K, C) key-value pairs, respectively.

ShuffledRowRDD takes a ShuffleDependency (of integer keys and InternalRow values).

Note
The dependency property is mutable and is of type ShuffleDependency[Int, InternalRow, InternalRow].

ShuffledRowRDD takes an optional specifiedPartitionStartIndices collection of integers that is the number of post-shuffle partitions. When not specified, the number of post-shuffle partitions is managed by the Partitioner of the input ShuffleDependency.

Note
Post-shuffle partition is…​FIXME
Table 1. ShuffledRowRDD and RDD Contract
Name Description

getDependencies

A single-element collection with ShuffleDependency[Int, InternalRow, InternalRow].

partitioner

CoalescedPartitioner (with the Partitioner of the dependency)

getPreferredLocations

compute

numPreShufflePartitions Property

Caution
FIXME

Computing Partition (in TaskContext) — compute Method

compute(split: Partition, context: TaskContext): Iterator[InternalRow]
Note
compute is a part of RDD contract to compute a given partition in a TaskContext.

Internally, compute makes sure that the input split is a ShuffledRowRDDPartition. It then requests ShuffleManager for a ShuffleReader to read InternalRows for the split.

Note
compute uses ShuffleHandle (of ShuffleDependency dependency) and the pre-shuffle start and end partition offsets.

Getting Placement Preferences of Partition — getPreferredLocations Method

getPreferredLocations(partition: Partition): Seq[String]
Note
getPreferredLocations is a part of RDD contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.

Internally, getPreferredLocations requests MapOutputTrackerMaster for the preferred locations of the input partition (for the single ShuffleDependency).

Note
getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster (which runs on the driver).

CoalescedPartitioner

Caution
FIXME

ShuffledRowRDDPartition

Caution
FIXME

results matching ""

    No results matching ""