Learning Spark internals using groupBy (to cause shuffle)

Execute the following operation in spark-shell and, using the web UI, explain the transformations, actions, jobs, stages, tasks, and partitions involved.

```scala
sc.parallelize(0 to 999, 50).zipWithIndex.groupBy(_._1 / 10).collect
```
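
As a sketch of what the explanation can cover, here is the same pipeline broken into named steps so each piece can be inspected from spark-shell (running with local defaults is an assumption; the job and stage IDs in your web UI will differ):

```scala
val nums    = sc.parallelize(0 to 999, 50)  // transformation: an RDD with 50 partitions
val indexed = nums.zipWithIndex             // transformation, but note: zipWithIndex
                                            // runs a small job of its own to compute
                                            // per-partition start indices
val grouped = indexed.groupBy(_._1 / 10)    // transformation: introduces a shuffle

// The lineage shows a ShuffledRDD, i.e. the stage boundary that groupBy adds.
println(grouped.toDebugString)

// The post-shuffle partition count comes from the default partitioner:
// spark.default.parallelism when it is set, otherwise the parent's 50 partitions.
println(grouped.getNumPartitions)

// collect is the action: it submits one job with two stages (one on each
// side of the shuffle), visible in the web UI, by default at port 4040.
val result = grouped.collect
```

Only `collect` (plus the internal job of `zipWithIndex`) should show up under the Jobs tab; every other call merely extends the lineage.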

You may also make the exercise a bit more involved by explaining how the data is distributed across the cluster, going over the concepts of drivers, masters, workers, and executors.
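
For that heavier variant, a minimal sketch of what can be observed from the driver, assuming spark-shell is connected to a cluster with more than one executor (in local mode you will only see the driver itself):

```scala
// spark-shell is the driver; the master hands it executors hosted on
// workers, and the tasks for the 50 partitions are scheduled on them.

// One entry per block manager, i.e. the driver plus each executor.
sc.getExecutorMemoryStatus.keys.foreach(println)

// Caching the grouped RDD and re-running the action makes the physical
// distribution visible: the Storage tab in the web UI shows how the
// cached partitions are spread over the executors.
val grouped = sc.parallelize(0 to 999, 50).zipWithIndex.groupBy(_._1 / 10)
grouped.cache()
grouped.collect
```

The Executors tab gives the same picture from the resource side: which workers host which executors and how many tasks each executor ran.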
