Spark and Apache Cassandra
Rules for Effective Spark-Cassandra Setup
- Use Cassandra nodes to host Spark executors for data locality. In this setup a Spark executor talks to a local Cassandra node and only queries for local data. This makes queries faster by reducing network traffic between Spark executors (which process data) and Cassandra nodes (where the data lives).
- Set up a dedicated Cassandra cluster for Spark batch analytics workloads (an Analytics Data Center). Since batch processing involves lots of table scans, it would otherwise interfere with the caches that serve real-time reads and writes.
- Spark jobs write their results back to Cassandra (all three rules are illustrated in the sketch after this list).
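A minimal sketch of the three rules, assuming the DataStax Spark Cassandra Connector; the host, data-center, keyspace, and table names are hypothetical, and the exact name of the local-DC property varies across connector versions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable and saveToCassandra

object AnalyticsJob extends App {
  val conf = new SparkConf()
    .setAppName("analytics-job")
    // Contact point in the dedicated analytics data center (hypothetical host).
    .set("spark.cassandra.connection.host", "analytics-node-1")
    // Pin the connector to the analytics DC so batch scans never touch the
    // real-time DC (hypothetical DC name; check the property name for your
    // connector version).
    .set("spark.cassandra.connection.local_dc", "Analytics")

  val sc = new SparkContext(conf)

  // The connector splits the scan by token range and reports the Cassandra
  // replicas of each range as preferred locations, so Spark schedules the
  // tasks on executors co-located with the data.
  val readings = sc.cassandraTable("demo", "readings")

  // A toy aggregation: count readings per sensor.
  val countsBySensor = readings
    .map(row => (row.getString("sensor_id"), 1L))
    .reduceByKey(_ + _)

  // Write the results back to Cassandra; demo.sensor_counts is a hypothetical
  // table with columns (sensor_id text, readings bigint).
  countsBySensor.saveToCassandra("demo", "sensor_counts",
    SomeColumns("sensor_id", "readings"))

  sc.stop()
}
```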
Core Concepts of Cassandra
- A keyspace is a space for tables and resembles a schema in a relational database.
- A table stores data (and corresponds to a table in a relational database).
- A table uses partitions to group data.
- Partitions are groups of rows.
- Partition keys determine the location of partitions, i.e. which nodes store them. They are used for grouping.
- Clustering keys determine the ordering of rows within a partition. They are used for sorting.
- CQL (the Cassandra Query Language) is used to create tables and query data (see the sketch after this list).
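A sketch that ties these concepts together, using the DataStax Java driver from Scala; the demo keyspace and readings table are the hypothetical names used in the sketch above:

```scala
import com.datastax.oss.driver.api.core.CqlSession

object CqlConcepts extends App {
  // Connects to 127.0.0.1:9042 by default.
  val session = CqlSession.builder().build()

  // A keyspace is a space for tables, like a schema in a relational database.
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS demo
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

  // sensor_id is the partition key: it groups rows into partitions and
  // determines which nodes store them. reading_time is the clustering key:
  // it sorts the rows within each partition.
  session.execute(
    """CREATE TABLE IF NOT EXISTS demo.readings (
      |  sensor_id    text,
      |  reading_time timestamp,
      |  value        double,
      |  PRIMARY KEY ((sensor_id), reading_time)
      |) WITH CLUSTERING ORDER BY (reading_time DESC)""".stripMargin)

  // Queries address a partition via its partition key.
  session.execute("SELECT * FROM demo.readings WHERE sensor_id = 's1'")

  session.close()
}
```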
Further reading or watching
- Excellent write-up from DataStax about how to run Cassandra inside Docker. Read it as early as possible!