preferredLocations: Seq[TaskLocation]
ResultTask
ResultTask is a Task that executes a function on the records in a RDD partition.
ResultTask is created exclusively when DAGScheduler submits missing tasks for a ResultStage.
ResultTask is created with a broadcast variable with the RDD and the function to execute it on and the partition.
| Name | Description |
|---|---|
Collection of TaskLocations. Corresponds directly to unique entries in locs with the only rule that when Initialized when Used exclusively when |
Creating ResultTask Instance
ResultTask takes the following when created:
-
stageId— the stage the task is executed for -
stageAttemptId— the stage attempt id -
Broadcast variable with the serialized task (as
Array[Byte]). The broadcast contains of a serialized pair ofRDDand the function to execute. -
Partition to compute
-
Collection of TaskLocations, i.e. preferred locations (executors) to execute the task on
-
The stage’s serialized TaskMetrics (as
Array[Byte]) -
(optional) Job id
ResultTask initializes the internal registries and counters.
preferredLocations Method
|
Note
|
preferredLocations is a part of Task contract.
|
preferredLocations simply returns preferredLocs internal property.
Deserialize RDD and Function (From Broadcast) and Execute Function (on RDD Partition) — runTask Method
runTask(context: TaskContext): U
|
Note
|
U is the type of a result as defined when ResultTask is created.
|
runTask deserializes a RDD and a function from the broadcast and then executes the function (on the records from the RDD partition).
|
Note
|
runTask is a part of Task contract to run a task.
|
Internally, runTask starts by tracking the time required to deserialize a RDD and a function to execute.
runTask creates a new closure Serializer.
|
Note
|
runTask uses SparkEnv to access the current closure Serializer.
|
runTask requests the closure Serializer to deserialize an RDD and the function to execute (from taskBinary broadcast).
|
Note
|
taskBinary broadcast is defined when ResultTask is created.
|
runTask records _executorDeserializeTime and _executorDeserializeCpuTime properties.
In the end, runTask executes the function (passing in the input context and the records from partition of the RDD).
|
Note
|
partition to use to access the records in a deserialized RDD is defined when ResultTask was created.
|