SQLConf

SQLConf is an internal key-value configuration store for parameters and hints used in Spark SQL.

Note

SQLConf is not meant to be used directly. Use the user-facing RuntimeConfig interface instead, which you can access through a SparkSession (as spark.conf).

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

scala> spark.conf
res0: org.apache.spark.sql.RuntimeConfig = org.apache.spark.sql.RuntimeConfig@6b58a0f9
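With RuntimeConfig you get and set configuration properties by their string keys. A minimal sketch (in spark-shell; the res numbers and output depend on your session):

scala> spark.conf.get("spark.sql.shuffle.partitions")
res1: String = 200

scala> spark.conf.set("spark.sql.shuffle.partitions", 8)

scala> spark.conf.get("spark.sql.shuffle.partitions")
res3: String = 8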

SQLConf offers methods to get, set, unset and clear values of configuration parameters, as well as accessor methods to read the current value of a given parameter or hint.

You can access a session-specific SQLConf using SessionState:

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import spark.sessionState.conf

// accessing properties through accessor methods
scala> conf.numShufflePartitions
res0: Int = 200

// setting properties using aliases
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
conf.setConf(SHUFFLE_PARTITIONS, 2)
scala> conf.numShufflePartitions
res2: Int = 2

// unset aka reset properties to the default value
conf.unsetConf(SHUFFLE_PARTITIONS)
scala> conf.numShufflePartitions
res4: Int = 200
Table 1. SQLConf’s Accessor Methods (in alphabetical order)
Name Parameter / Hint Description

adaptiveExecutionEnabled

spark.sql.adaptive.enabled

Used exclusively in EnsureRequirements to add an ExchangeCoordinator (when adaptive query execution is enabled)

autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold

Used exclusively in JoinSelection execution planning strategy

broadcastTimeout

spark.sql.broadcastTimeout

Used exclusively in BroadcastExchangeExec (for broadcasting a table to executors).

columnBatchSize

spark.sql.inMemoryColumnarStorage.batchSize

Used…​FIXME

dataFramePivotMaxValues

spark.sql.pivotMaxValues

Used exclusively in pivot operator.

dataFrameRetainGroupColumns

spark.sql.retainGroupColumns

Used exclusively in RelationalGroupedDataset when creating the result Dataset (after agg, count, mean, max, avg, min, and sum operators).

defaultSizeInBytes

spark.sql.defaultSizeInBytes

Used…​FIXME

joinReorderEnabled

spark.sql.cbo.joinReorder.enabled

Used exclusively in CostBasedJoinReorder logical plan optimization

limitScaleUpFactor

spark.sql.limit.scaleUpFactor

Used exclusively when a physical operator is requested to take the first n rows as an array.

numShufflePartitions

spark.sql.shuffle.partitions

Used…FIXME

preferSortMergeJoin

spark.sql.join.preferSortMergeJoin

Used exclusively in JoinSelection execution planning strategy to prefer sort merge join over shuffle hash join.

starSchemaDetection

spark.sql.cbo.starSchemaDetection

Used exclusively in ReorderJoin logical plan optimization (and indirectly in StarSchemaDetection)

useCompression

spark.sql.inMemoryColumnarStorage.compressed

Used…FIXME

useObjectHashAggregation

spark.sql.execution.useObjectHashAggregateExec

Used exclusively in Aggregation execution planning strategy when selecting a physical plan.

wholeStageEnabled

spark.sql.codegen.wholeStage

Used in CollapseCodegenStages physical preparation rule (to control whole-stage code generation)

wholeStageFallback

spark.sql.codegen.fallback

Used exclusively when WholeStageCodegenExec is executed.

wholeStageMaxNumFields

spark.sql.codegen.maxFields

Used in CollapseCodegenStages physical preparation rule (to decide whether a plan has too many output fields to support whole-stage code generation)

windowExecBufferSpillThreshold

spark.sql.windowExec.buffer.spill.threshold

Used exclusively when WindowExec unary physical operator is executed.
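The accessor methods are typed wrappers over the underlying string properties, so reading a property through getConfString and through its accessor gives consistent results. A quick sketch (assuming the default configuration):

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import spark.sessionState.conf

scala> conf.getConfString("spark.sql.shuffle.partitions")
res0: String = 200

scala> conf.numShufflePartitions
res1: Int = 200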

Table 2. Parameters and Hints (in alphabetical order)
Name Default Value Description

spark.sql.adaptive.enabled

false

Enables adaptive query execution

Note
Adaptive query execution is not supported for streaming Datasets and is disabled at execution time.

Use adaptiveExecutionEnabled method to access the current value.

spark.sql.autoBroadcastJoinThreshold

10L * 1024 * 1024

Maximum size (in bytes) for a table that will be broadcast to all worker nodes when performing a join.

If the estimated size of a table (from the statistics of its logical plan) is at most this setting, the table is broadcast for the join.

Negative values or 0 disable broadcasting.

Use autoBroadcastJoinThreshold method to access the current value.

spark.sql.broadcastTimeout

5 * 60

Timeout in seconds for the broadcast wait time in broadcast joins.

When negative, it is assumed infinite (i.e. Duration.Inf)

Use broadcastTimeout method to access the current value.

spark.sql.cbo.enabled

false

Enables cost-based optimization (CBO) for estimation of plan statistics.

Used (through SQLConf.cboEnabled method) in CostBasedJoinReorder logical plan optimization (among others)

spark.sql.cbo.joinReorder.enabled

false

Enables join reorder for cost-based optimization (CBO).

Use joinReorderEnabled method to access the current value.

spark.sql.cbo.starSchemaDetection

false

Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization.

Use starSchemaDetection method to access the current value.

spark.sql.codegen.fallback

true

(internal) Whether whole-stage codegen can be temporarily disabled for the part of a query that has failed to compile generated code (true) or not (false).

Use wholeStageFallback method to access the current value.

spark.sql.codegen.maxFields

100

(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen.

Use wholeStageMaxNumFields method to access the current value.

spark.sql.codegen.wholeStage

true

(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method (true) or not (false).

Use wholeStageEnabled method to access the current value.

spark.sql.defaultSizeInBytes

Java’s Long.MaxValue

(internal) Table size used in query planning.

It defaults to Java’s Long.MaxValue, which is larger than spark.sql.autoBroadcastJoinThreshold, to be conservative: by default the optimizer will not choose to broadcast a table unless it knows for sure that the table is small enough.

Use defaultSizeInBytes method to access the current value.

spark.sql.execution.useObjectHashAggregateExec

true

Flag to enable ObjectHashAggregateExec in Aggregation execution planning strategy.

Use useObjectHashAggregation method to access the current value.

spark.sql.inMemoryColumnarStorage.batchSize

10000

(internal) Controls…​FIXME

Use columnBatchSize method to access the current value.

spark.sql.inMemoryColumnarStorage.compressed

true

(internal) Controls…​FIXME

Use useCompression method to access the current value.

spark.sql.join.preferSortMergeJoin

true

(internal) Controls JoinSelection execution planning strategy to prefer sort merge join over shuffle hash join.

Use preferSortMergeJoin method to access the current value.

spark.sql.limit.scaleUpFactor

4

(internal) Minimal increase rate in the number of partitions between attempts when executing take operator on a structured query. Higher values lead to more partitions read. Lower values might lead to longer execution times as more jobs will be run.

Use limitScaleUpFactor method to access the current value.

spark.sql.optimizer.maxIterations

100

Maximum number of iterations for Analyzer and Optimizer.

spark.sql.pivotMaxValues

10000

Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column)

Use dataFramePivotMaxValues method to access the current value.

spark.sql.retainGroupColumns

true

Controls whether to retain columns used for aggregation or not (in RelationalGroupedDataset operators).

Use dataFrameRetainGroupColumns method to access the current value.

spark.sql.selfJoinAutoResolveAmbiguity

true

Controls whether to resolve ambiguity in join conditions for self-joins automatically.

spark.sql.shuffle.partitions

200

Default number of partitions to use when shuffling data for joins or aggregations.

Corresponds to Apache Hive’s mapred.reduce.tasks property, which Spark considers deprecated.

Use numShufflePartitions method to access the current value.

spark.sql.streaming.fileSink.log.cleanupDelay

FIXME

FIXME

spark.sql.streaming.fileSink.log.compactInterval

FIXME

FIXME

spark.sql.streaming.fileSink.log.deletion

true

Controls whether to delete the expired log files in file stream sink.

spark.sql.streaming.schemaInference

FIXME

FIXME

spark.sql.windowExec.buffer.spill.threshold

4096

(internal) Threshold for the number of rows buffered in the window operator

Use windowExecBufferSpillThreshold method to access the current value.
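Several of the parameters above are commonly tuned together, e.g. when troubleshooting join performance. The following sketch shows one possible combination (not a recommendation):

// disable broadcast joins entirely (see spark.sql.autoBroadcastJoinThreshold above)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// use fewer shuffle partitions for small datasets
spark.conf.set("spark.sql.shuffle.partitions", 8)

// fall back to non-codegen execution, e.g. while debugging generated code
spark.conf.set("spark.sql.codegen.wholeStage", false)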

Note
SQLConf is a private[sql] serializable class in org.apache.spark.sql.internal package.

Getting Parameters and Hints

You can get the current parameters and hints using the following family of get methods.

getConfString(key: String): String
getConfString(key: String, defaultValue: String): String
getConf[T](entry: ConfigEntry[T]): T
getConf[T](entry: ConfigEntry[T], defaultValue: T): T
getConf[T](entry: OptionalConfigEntry[T]): Option[T]
getAllConfs: immutable.Map[String, String]
getAllDefinedConfs: Seq[(String, String, String)]
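For example (a sketch using the session-specific SQLConf; output depends on your configuration):

import org.apache.spark.sql.internal.SQLConf
import spark.sessionState.conf

// type-safe access through a ConfigEntry
scala> conf.getConf(SQLConf.SHUFFLE_PARTITIONS)
res0: Int = 200

// string-based access with a fallback value for keys that are not set
scala> conf.getConfString("spark.sql.sources.default", "parquet")
res1: String = parquet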

Setting Parameters and Hints

You can set parameters and hints using the following family of set methods.

setConf(props: Properties): Unit
setConfString(key: String, value: String): Unit
setConf[T](entry: ConfigEntry[T], value: T): Unit
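For example (a sketch; all three flavors end up in the same underlying configuration map):

import java.util.Properties
import org.apache.spark.sql.internal.SQLConf
import spark.sessionState.conf

// set from a Properties object
val props = new Properties()
props.setProperty("spark.sql.shuffle.partitions", "8")
conf.setConf(props)

// set by string key and string value
conf.setConfString("spark.sql.cbo.enabled", "true")

// type-safe setting through a ConfigEntry
conf.setConf(SQLConf.CBO_ENABLED, true)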

Unsetting Parameters and Hints

You can unset parameters and hints using the following family of unset methods.

unsetConf(key: String): Unit
unsetConf(entry: ConfigEntry[_]): Unit
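Continuing the sketch above, unsetting removes the explicit value so the default applies again:

conf.unsetConf("spark.sql.cbo.enabled")
conf.unsetConf(SQLConf.SHUFFLE_PARTITIONS)

scala> conf.numShufflePartitions
res0: Int = 200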

Clearing All Parameters and Hints

You can use clear to remove all the parameters and hints in SQLConf.

clear(): Unit
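For example, clearing all explicitly-set values restores the defaults:

scala> conf.clear()

scala> conf.numShufflePartitions
res0: Int = 200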
