Performance Tuning

Goal: Improve Spark’s performance where feasible.

measure performance bottlenecks using new metrics, including block-time analysis
a live demo of a new performance analysis tool
CPU — not I/O (network) — is often a critical bottleneck
community dogma = network and disk I/O are major bottlenecks
a TPC-DS workload, of two sizes: a 20 machine cluster with 850GB of data, and a 60 machine cluster with 2.5TB of data.
- network is almost irrelevant for performance of these workloads
- network optimization could only reduce job completion time by, at most, 2%
- 10Gbps networking hardware is likely not necessary
serialized compressed data

reduceByKey is better
mind serialization time
- impacts CPU - time to serialize and network - time to send the data over the wire
Tungsten - recent initiative from Databrics - aims at reducing CPU time
- jobs become more bottlenecked by IO

results matching ""