BlockManager — Key-Value Store for Blocks

BlockManager is a key-value store for blocks of data (simply blocks) in Spark. It acts as a local cache that runs on every "node" of a Spark application, i.e. the driver and the executors, and is created when SparkEnv is created.

BlockManager provides an interface for uploading and fetching blocks, both locally and remotely, using various stores, i.e. memory, disk, and off-heap.

When BlockManager is created, it creates its own private instances of DiskBlockManager, BlockInfoManager, MemoryStore and DiskStore, and immediately wires them together, i.e. BlockInfoManager with MemoryStore and DiskStore with DiskBlockManager.

The common idiom for accessing a BlockManager in Spark, regardless of location (the driver or an executor), is through SparkEnv:

SparkEnv.get.blockManager

BlockManager is a BlockDataManager, i.e. it manages the storage for blocks that can represent cached RDD partitions, intermediate shuffle outputs, broadcasts, etc. It is also a BlockEvictionHandler that drops a block from memory and stores it on disk if applicable.
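The eviction behaviour described above can be pictured with a toy model. This is only a sketch under stated assumptions: `ToyBlockStore`, its `useDisk` flag and all of its members are hypothetical names, plain maps stand in for the memory and disk stores, and none of it is Spark's actual implementation.

```scala
import scala.collection.mutable

// Toy model only: ToyBlockStore and its members are hypothetical names,
// and plain maps stand in for the memory and disk stores.
class ToyBlockStore(useDisk: Boolean) {
  private val memory = mutable.Map.empty[String, Array[Byte]]
  private val disk   = mutable.Map.empty[String, Array[Byte]]

  // New blocks always land in memory first in this toy model.
  def put(blockId: String, bytes: Array[Byte]): Unit = {
    memory(blockId) = bytes
  }

  def inMemory(blockId: String): Boolean = memory.contains(blockId)
  def onDisk(blockId: String): Boolean   = disk.contains(blockId)

  // Look in memory first, then fall back to disk.
  def get(blockId: String): Option[Array[Byte]] =
    memory.get(blockId).orElse(disk.get(blockId))

  // Evict a block from memory, keeping a copy on disk when disk use is allowed.
  // Returns false if the block was not in memory.
  def dropFromMemory(blockId: String): Boolean =
    memory.remove(blockId) match {
      case Some(bytes) =>
        if (useDisk) disk(blockId) = bytes
        true
      case None => false
    }
}
```

With `useDisk = true` an evicted block stays readable through `get`; with `useDisk = false` it is gone after `dropFromMemory`.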

Cached blocks are blocks with a non-zero sum of memory and disk sizes.
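The rule for cached blocks can be sketched in a few lines of Scala. This is a simplified model rather than Spark's own class: `SimpleBlockStatus` and its fields `memSize` and `diskSize` are hypothetical names chosen for illustration.

```scala
// Simplified, hypothetical model of a block's status (not Spark's own class).
final case class SimpleBlockStatus(memSize: Long, diskSize: Long) {
  // A block counts as cached when it occupies any memory or disk space.
  def isCached: Boolean = memSize + diskSize > 0
}

object CachedBlocksExample {
  def main(args: Array[String]): Unit = {
    val inMemory = SimpleBlockStatus(memSize = 1024L, diskSize = 0L)
    val absent   = SimpleBlockStatus(memSize = 0L, diskSize = 0L)
    println(inMemory.isCached) // true
    println(absent.isCached)   // false
  }
}
```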

Tip	Use the web UI, esp. the Storage and Executors tabs, to monitor the memory used.

Tip	Use spark-submit's command-line options, i.e. --driver-memory for the driver and --executor-memory for executors, or their equivalent Spark properties, i.e. spark.driver.memory and spark.executor.memory, to control the memory available for storage.

A BlockManager is created when a Spark application starts and must be initialized before it is fully operable.

When External Shuffle Service is enabled, BlockManager uses ExternalShuffleClient to read other executors' shuffle files.

BlockManager uses BlockManagerSource to report metrics under the name BlockManager.

Table 1. BlockManager’s Internal Properties
Name	Initial Value	Description
`diskBlockManager`	FIXME	DiskBlockManager for…FIXME
`maxMemory`	Total available on-heap and off-heap memory for storage (in bytes)	Maximum amount of memory that `BlockManager` can ever use (depends on MemoryManager and may vary over time).

Tip	Enable the `INFO`, `DEBUG` or `TRACE` logging level for the `org.apache.spark.storage.BlockManager` logger to see what happens inside. Add the following line to `conf/log4j.properties`: `log4j.logger.org.apache.spark.storage.BlockManager=TRACE`. Refer to Logging.

Tip	To cut the noise, you can shut off the WARN messages printed about the current state of blocks with the following line: `log4j.logger.org.apache.spark.storage.BlockManager=OFF`

Spark Property	Default Value	Description
`spark.blockManager.port`	`0`	Port to use for the block manager when a more specific setting for the driver or executors is not provided.
`spark.shuffle.sync`	`false`	Controls whether `DiskBlockObjectWriter` should force outstanding writes to disk when committing a single atomic block, i.e. all operating system buffers should synchronize with the disk to ensure that all changes to a file are in fact recorded in the storage.
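As a minimal sketch, the two properties above could be set programmatically through `SparkConf`. The port `7079` is an arbitrary value picked for illustration, and the object name `BlockManagerSettings` is hypothetical; the example assumes Spark is on the classpath.

```scala
import org.apache.spark.SparkConf

object BlockManagerSettings {
  // loadDefaults = false: start from an empty configuration so that only
  // the values set here are visible.
  val conf: SparkConf = new SparkConf(false)
    .set("spark.blockManager.port", "7079") // arbitrary example port (default: 0)
    .set("spark.shuffle.sync", "true")      // force writes to disk on commit (default: false)

  def main(args: Array[String]): Unit = {
    // Read the values back, falling back to the documented defaults.
    println(conf.getInt("spark.blockManager.port", 0))
    println(conf.getBoolean("spark.shuffle.sync", false))
  }
}
```

The same properties can also be passed on the command line with spark-submit's `--conf key=value` option.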

`getLocations` Method

`blockIdsToHosts` Method

`getLocationBlockIds` Method

`getPeers` Method

`releaseAllLocksForTask` Method

`memoryStore` Property

`stop` Method

`putSingle` Method

Getting Ids of Existing Blocks (For a Given Filter) — `getMatchingBlockIds` Method

`getLocalValues` Method

`getRemoteValues` Internal Method

Retrieving Block from Local or Remote Block Managers — `get` Method

`getSingle` Method

`getOrElseUpdate` Method

Getting Local Block Data As Bytes — `getLocalBytes` Method

`getRemoteBytes` Method

Finding Shuffle Block Data — `getBlockData` Method

`removeBlockInternal` Method

Is External Shuffle Service Enabled? — `externalShuffleServiceEnabled` Flag

Storing Block Data Locally — `putBlockData` Method

Storing Block Bytes Locally — `putBytes` Method

`doPutBytes` Internal Method

`replicate` Internal Method

`maybeCacheDiskValuesInMemory` Method

`doPutIterator` Method

`doPut` Internal Method

Removing Block From Memory and Disk — `removeBlock` Method

Removing RDD Blocks — `removeRdd` Method

Removing Broadcast Blocks — `removeBroadcast` Method

Getting Block Status — `getStatus` Method

`shuffleClient`

`shuffleServerId`

Initializing BlockManager — `initialize` Method

Registering Executor’s BlockManager with External Shuffle Server — `registerWithExternalShuffleServer` Method

Re-registering BlockManager with Driver and Reporting Blocks — `reregister` Method

Calculate Current Block Status — `getCurrentBlockStatus` Method

Removing Blocks From Memory Only — `dropFromMemory` Method

`reportAllBlocks` Method

Reporting Current Storage Status of Block to Driver — `reportBlockStatus` Method

Reporting Block Status Update to Driver — `tryToReportBlockStatus` Internal Method

Registering Task with BlockInfoManager — `registerTask` Method

Offering DiskBlockObjectWriter To Write Blocks To Disk (For Current BlockManager) — `getDiskWriter` Method

Recording Updated BlockStatus In Current Task’s TaskMetrics — `addUpdatedBlockStatusToTaskMetrics` Internal Method