cache(): this.type = persist()
RDD Caching and Persistence
Caching or persistence are optimisation techniques for (iterative and interactive) Spark computations. They help saving interim partial results so they can be reused in subsequent stages. These interim results as RDDs are thus kept in memory (default) or more solid storages like disk and/or replicated.
The difference between cache
and persist
operations is purely syntactic. cache
is a synonym of persist
or persist(MEMORY_ONLY)
, i.e. cache
is merely persist
with the default storage level MEMORY_ONLY
.
Note
|
Due to the very small and purely syntactic difference between caching and persistence of RDDs the two terms are often used interchangeably and I will follow the "pattern" here. |
RDDs can also be unpersisted to remove RDD from a permanent storage like memory and/or disk.
Caching RDD — cache
Method
cache
is a synonym of persist with MEMORY_ONLY
storage level.
Persisting RDD — persist
Methods
persist(): this.type
persist(newLevel: StorageLevel): this.type
persist
marks a RDD for persistence using newLevel
storage level.
You can only change the storage level once or a UnsupportedOperationException
is thrown:
Cannot change storage level of an RDD after it was already assigned a level
Note
|
You can pretend to change the storage level of an RDD with already-assigned storage level only if the storage level is the same as it is currently assigned. |
If the RDD is marked as persistent the first time, the RDD is registered to ContextCleaner
(if available) and SparkContext
.
The internal storageLevel
attribute is set to the input newLevel
storage level.
Unpersisting RDDs (Clearing Blocks) — unpersist
Method
unpersist(blocking: Boolean = true): this.type
When called, unpersist
prints the following INFO message to the logs:
INFO [RddName]: Removing RDD [id] from persistence list
It then calls SparkContext.unpersistRDD(id, blocking) and sets NONE
storage level as the current storage level.