spark 2 translation
Introduction
Overview of Apache Spark
Spark SQL
Spark SQL — Queries Over Structured Data on Massive Scale
SparkSession — The Entry Point to Spark SQL
Builder — Building SparkSession using Fluent API
SharedState — Shared State Across SparkSessions
Dataset — Strongly-Typed Structured Query with Encoder
Encoders — Internal Row Converters
ExpressionEncoder — Expression-Based Encoder
LocalDateTimeEncoder — Custom ExpressionEncoder for java.time.LocalDateTime
DataFrame — Dataset of Rows
Row
RowEncoder — Encoder for DataFrames
Schema — Structure of Data
StructType
StructField
Data Types
Dataset Operators
Column Operators
Standard Functions — functions Object
Standard Functions for Date and Time
Window Aggregate Functions
User-Defined Functions (UDFs)
Basic Aggregation — Typed and Untyped Grouping Operators
RelationalGroupedDataset — Untyped Row-based Grouping
KeyValueGroupedDataset — Typed Grouping
Joins
Broadcast Joins (aka Map-Side Joins)
Multi-Dimensional Aggregation
UserDefinedAggregateFunction — Contract for User-Defined Aggregate Functions (UDAFs)
Dataset Caching and Persistence
User-Friendly Names Of Cached Queries in web UI’s Storage Tab
DataSource API — Loading and Saving Datasets
DataFrameReader — Reading Datasets from External Data Sources
DataFrameWriter
DataSource — Pluggable Data Provider Framework
CreatableRelationProvider — Data Sources That Save Rows Per Save Mode
RelationProvider — Data Sources With Schema Inference
SchemaRelationProvider — Data Sources With Mandatory User-Defined Schema
DataSourceRegister
CSVFileFormat
JdbcRelationProvider
JsonFileFormat
JsonDataSource
ParquetFileFormat
Custom Formats
CacheManager — In-Memory Cache for Tables and Views
BaseRelation — Collection of Tuples with Schema
HadoopFsRelation
JDBCRelation
QueryExecution — Query Execution of Dataset
Spark SQL’s Performance Tuning Tips and Tricks (aka Case Studies)
Number of Partitions for groupBy Aggregation
Expression — Executable Node in Catalyst Tree
AggregateExpression — Expression Container for AggregateFunction
AggregateFunction
DeclarativeAggregate
ImperativeAggregate — Contract for Aggregate Function Expressions with Imperative Methods
TypedImperativeAggregate — Contract for Imperative Aggregate Functions with Custom Aggregation Buffer
Attribute Leaf Expression
BoundReference Leaf Expression — Reference to Value in InternalRow
CallMethodViaReflection Expression
Generator — Catalyst Expressions that Generate Zero Or More Rows
JsonToStructs Unary Expression
Literal Leaf Expression
ScalaUDAF — Catalyst Expression Adapter for UserDefinedAggregateFunction
StaticInvoke Non-SQL Expression
TimeWindow Unevaluable Unary Expression
UnixTimestamp TimeZoneAware Binary Expression
WindowExpression Unevaluable Expression
WindowSpecDefinition Unevaluable Expression
WindowFunction
AggregateWindowFunction
OffsetWindowFunction
SizeBasedWindowFunction
LogicalPlan — Logical Query Plan / Logical Operator
Aggregate Unary Logical Operator
BroadcastHint Unary Logical Operator
DeserializeToObject Logical Operator
Expand Unary Logical Operator
GroupingSets Unary Logical Operator
Hint Logical Operator
InMemoryRelation Leaf Logical Operator For Cached Query Plans
Join Logical Operator
LocalRelation Logical Operator
LogicalRelation Logical Operator — Adapter for BaseRelation
Pivot Unary Logical Operator
Repartition Logical Operators — Repartition and RepartitionByExpression
RunnableCommand — Generic Logical Command with Side Effects
AlterViewAsCommand Logical Command
ClearCacheCommand Logical Command
CreateDataSourceTableCommand Logical Command
CreateViewCommand Logical Command
ExplainCommand Logical Command
SubqueryAlias Logical Operator
UnresolvedFunction Logical Operator
UnresolvedRelation Logical Operator
Window Unary Logical Operator
WithWindowDefinition Unary Logical Operator
Analyzer — Logical Query Plan Analyzer
CheckAnalysis — Analysis Validation
ResolveWindowFrame Logical Evaluation Rule
WindowsSubstitution Logical Evaluation Rule
SparkOptimizer — Logical Query Optimizer
Optimizer — Base for Logical Query Plan Optimizers
ColumnPruning
CombineTypedFilters
ConstantFolding
CostBasedJoinReorder
DecimalAggregates
EliminateSerialization
GetCurrentDatabase / ComputeCurrentTime
LimitPushDown
NullPropagation — Nullability (NULL Value) Propagation
PropagateEmptyRelation
PushDownPredicate — Predicate Pushdown / Filter Pushdown Logical Plan Optimization
ReorderJoin
SimplifyCasts
SparkPlan — Physical Query Plan / Physical Operator
BroadcastExchangeExec Unary Operator for Broadcasting Joins
BroadcastHashJoinExec Binary Physical Operator
BroadcastNestedLoopJoinExec Binary Physical Operator
CoalesceExec Unary Physical Operator
DataSourceScanExec — Contract for Leaf Physical Operators with Code Generation
FileSourceScanExec Physical Operator
RowDataSourceScanExec Physical Operator
ExecutedCommandExec Physical Operator
HashAggregateExec Aggregate Physical Operator for Hash-Based Aggregation
InMemoryTableScanExec Physical Operator
LocalTableScanExec Physical Operator
ObjectHashAggregateExec Aggregate Physical Operator
ShuffleExchange Unary Physical Operator
ShuffledHashJoinExec Binary Physical Operator
SortAggregateExec Aggregate Physical Operator for Sort-Based Aggregation
SortMergeJoinExec Binary Physical Operator
InputAdapter Unary Physical Operator
WindowExec Unary Physical Operator
AggregateProcessor
WindowFunctionFrame
WholeStageCodegenExec Unary Operator with Java Code Generation
Partitioning — Specification of Physical Operator’s Output Partitions
SparkPlanner — Query Planner with no Hive Support
SparkStrategy — Base for Execution Planning Strategies
SparkStrategies — Container of Execution Planning Strategies
Aggregation Execution Planning Strategy for Aggregate Physical Operators
BasicOperators Execution Planning Strategy
DataSourceStrategy Execution Planning Strategy
FileSourceStrategy Execution Planning Strategy
InMemoryScans Execution Planning Strategy
JoinSelection Execution Planning Strategy
Physical Plan Preparations Rules
CollapseCodegenStages Physical Preparation Rule — Collapsing Physical Operators for Whole-Stage CodeGen
EnsureRequirements Physical Preparation Rule
SQL Parsing Framework
SparkSqlParser — Default SQL Parser
SparkSqlAstBuilder
CatalystSqlParser — DataTypes and StructTypes Parser
AstBuilder — ANTLR-based SQL Parser
AbstractSqlParser — Base SQL Parsing Infrastructure
ParserInterface — SQL Parser Contract
SQLMetric — Physical Operator Metric
Catalyst — Tree Manipulation Framework
TreeNode — Node in Catalyst Tree
QueryPlan — Structured Query Plan
RuleExecutor — Tree Transformation Rule Executor
GenericStrategy
QueryPlanner — Converting Logical Plan to Physical Trees
Catalyst DSL — Implicit Conversions for Catalyst Data Structures
ExchangeCoordinator and Adaptive Query Execution
ShuffledRowRDD
Debugging Query Execution
Datasets vs DataFrames vs RDDs
SQLConf
CatalystConf
Catalog
CatalogImpl
ExternalCatalog — System Catalog of Permanent Entities
SessionState
BaseSessionStateBuilder — Base for Builders of SessionState
SessionCatalog — Metastore of Session-Specific Relational Entities
UDFRegistration
FunctionRegistry
ExperimentalMethods
SQLExecution Helper Object
CatalystSerde
Tungsten Execution Backend (aka Project Tungsten)
Whole-Stage Code Generation (CodeGen)
CodegenSupport — Physical Operators with Optional Java Code Generation
InternalRow — Abstract Binary Row Format
UnsafeRow — Mutable Raw-Memory Unsafe Binary Row Format
CodeGenerator
UnsafeProjection — Generic Function to Project InternalRows to UnsafeRows
GenerateUnsafeProjection
ExternalAppendOnlyUnsafeRowArray — Append-Only Array for UnsafeRows (with Disk Spill Threshold)
AggregationIterator — Generic Iterator of UnsafeRows for Aggregate Physical Operators
TungstenAggregationIterator — Iterator of UnsafeRows for HashAggregateExec Physical Operator
JdbcDialect
HadoopFileLinesReader
KafkaWriter — Writing Dataset to Kafka
KafkaSourceProvider
KafkaWriteTask
Hive Integration
Spark SQL CLI — spark-sql
DataSinks Strategy
Thrift JDBC/ODBC Server — Spark Thrift Server (STS)
SparkSQLEnv
(obsolete) SQLContext
Settings
Spark MLlib
Spark MLlib — Machine Learning in Spark
ML Pipelines and PipelineStages (spark.ml)
ML Pipeline Components — Transformers
Tokenizer
ML Pipeline Components — Estimators
ML Pipeline Models
Evaluators
CrossValidator
Params and ParamMaps
ML Persistence — Saving and Loading Models and Pipelines
Example — Text Classification
Example — Linear Regression
Latent Dirichlet Allocation (LDA)
Vector
LabeledPoint
Streaming MLlib
GeneralizedLinearRegression
Structured Streaming
Spark Structured Streaming — Streaming Datasets
Spark Core / Tools
Spark Shell — spark-shell shell script
Web UI — Spark Application’s Web Console
Jobs Tab
Stages Tab — Stages for All Jobs
Stages for All Jobs
Stage Details
Pool Details
Storage Tab
BlockStatusListener Spark Listener
Environment Tab
EnvironmentListener Spark Listener
Executors Tab
ExecutorsListener Spark Listener
SQL Tab
SQLListener Spark Listener
JobProgressListener Spark Listener
StorageStatusListener Spark Listener
StorageListener — Spark Listener for Tracking Persistence Status of RDD Blocks
RDDOperationGraphListener Spark Listener
SparkUI
Spark Submit — spark-submit shell script
SparkSubmitArguments
SparkSubmitOptionParser — spark-submit’s Command-Line Parser
SparkSubmitCommandBuilder Command Builder
spark-class shell script
AbstractCommandBuilder
SparkLauncher — Launching Spark Applications Programmatically
Spark Core / Architecture
Spark Architecture
Driver
Executor
TaskRunner
ExecutorSource
Master
Workers
Spark Core / RDD
Anatomy of Spark Application
SparkConf — Programmable Configuration for Spark Applications
Spark Properties and spark-defaults.conf Properties File
Deploy Mode
SparkContext
HeartbeatReceiver RPC Endpoint
Inside Creating SparkContext
ConsoleProgressBar
SparkStatusTracker
Local Properties — Creating Logical Job Groups
RDD — Resilient Distributed Dataset
RDD Lineage — Logical Execution Plan
TaskLocation
ParallelCollectionRDD
MapPartitionsRDD
OrderedRDDFunctions
CoGroupedRDD
SubtractedRDD
HadoopRDD
NewHadoopRDD
ShuffledRDD
BlockRDD
Operators
Transformations
PairRDDFunctions
Actions
Caching and Persistence
StorageLevel
Partitions and Partitioning
Partition
Partitioner
HashPartitioner
Shuffling
Checkpointing
CheckpointRDD
RDD Dependencies
NarrowDependency — Narrow Dependencies
ShuffleDependency — Shuffle Dependencies
Map/Reduce-side Aggregator
Spark Core / Optimizations
Broadcast variables
Accumulators
AccumulatorContext
Spark Core / Services
SerializerManager
MemoryManager — Memory Management
UnifiedMemoryManager
SparkEnv — Spark Runtime Environment
DAGScheduler — Stage-Oriented Scheduler
Jobs
Stage — Physical Unit Of Execution
ShuffleMapStage — Intermediate Stage in Execution DAG
ResultStage — Final Stage in Job
StageInfo
DAGScheduler Event Bus
JobListener
JobWaiter
TaskScheduler — Spark Scheduler
Tasks
ShuffleMapTask — Task for ShuffleMapStage
ResultTask
TaskDescription
FetchFailedException
MapStatus — Shuffle Map Output Status
TaskSet — Set of Tasks for Stage
TaskSetManager
Schedulable
Schedulable Pool
Schedulable Builders
FIFOSchedulableBuilder
FairSchedulableBuilder
Scheduling Mode — spark.scheduler.mode Spark Property
TaskInfo
TaskSchedulerImpl — Default TaskScheduler
Speculative Execution of Tasks
TaskResultGetter
TaskContext
TaskContextImpl
TaskResults — DirectTaskResult and IndirectTaskResult
TaskMemoryManager
MemoryConsumer
TaskMetrics
ShuffleWriteMetrics
TaskSetBlacklist — Blacklisting Executors and Nodes For TaskSet
SchedulerBackend — Pluggable Scheduler Backends
CoarseGrainedSchedulerBackend
DriverEndpoint — CoarseGrainedSchedulerBackend RPC Endpoint
ExecutorBackend — Pluggable Executor Backends
CoarseGrainedExecutorBackend
MesosExecutorBackend
BlockManager — Key-Value Store for Blocks
MemoryStore
DiskStore
BlockDataManager
ShuffleClient
BlockTransferService — Pluggable Block Transfers
NettyBlockTransferService — Netty-Based BlockTransferService
NettyBlockRpcServer
BlockManagerMaster — BlockManager for Driver
BlockManagerMasterEndpoint — BlockManagerMaster RPC Endpoint
DiskBlockManager
BlockInfoManager
BlockInfo
BlockManagerSlaveEndpoint
DiskBlockObjectWriter
BlockManagerSource — Metrics Source for BlockManager
StorageStatus
MapOutputTracker — Shuffle Map Output Registry
MapOutputTrackerMaster — MapOutputTracker For Driver
MapOutputTrackerMasterEndpoint
MapOutputTrackerWorker — MapOutputTracker for Executors
ShuffleManager — Pluggable Shuffle Systems
SortShuffleManager — The Default Shuffle System
ExternalShuffleService
OneForOneStreamManager
ShuffleBlockResolver
IndexShuffleBlockResolver
ShuffleWriter
BypassMergeSortShuffleWriter
SortShuffleWriter
UnsafeShuffleWriter — ShuffleWriter for SerializedShuffleHandle
BaseShuffleHandle — Fallback Shuffle Handle
BypassMergeSortShuffleHandle — Marker Interface for Bypass Merge Sort Shuffle Handles
SerializedShuffleHandle — Marker Interface for Serialized Shuffle Handles
ShuffleReader
BlockStoreShuffleReader
ShuffleBlockFetcherIterator
ShuffleExternalSorter — Cache-Efficient Sorter
ExternalSorter
Serialization
Serializer — Task SerDe
SerializerInstance
SerializationStream
DeserializationStream
ExternalClusterManager — Pluggable Cluster Managers
BroadcastManager
BroadcastFactory — Pluggable Broadcast Variable Factories
TorrentBroadcastFactory
TorrentBroadcast
CompressionCodec
ContextCleaner — Spark Application Garbage Collector
CleanerListener
Dynamic Allocation (of Executors)
ExecutorAllocationManager — Allocation Manager for Spark Core
ExecutorAllocationClient
ExecutorAllocationListener
ExecutorAllocationManagerSource
HTTP File Server
Data Locality
Cache Manager
OutputCommitCoordinator
RpcEnv — RPC Environment
RpcEndpoint
RpcEndpointRef
RpcEnvFactory
Netty-based RpcEnv
TransportConf — Transport Configuration
(obsolete) Spark Streaming
Spark Streaming — Streaming RDDs
Spark Deployment Environments
Deployment Environments — Run Modes
Spark local (pseudo-cluster)
LocalSchedulerBackend
LocalEndpoint
Spark on cluster
Spark on YARN
Spark on YARN
YarnShuffleService — ExternalShuffleService on YARN
ExecutorRunnable
Client
YarnRMClient
ApplicationMaster
AMEndpoint — ApplicationMaster RPC Endpoint
YarnClusterManager — ExternalClusterManager for YARN
TaskSchedulers for YARN
YarnScheduler
YarnClusterScheduler
SchedulerBackends for YARN
YarnSchedulerBackend
YarnClientSchedulerBackend
YarnClusterSchedulerBackend
YarnSchedulerEndpoint RPC Endpoint
YarnAllocator
Introduction to Hadoop YARN
Setting up YARN Cluster
Kerberos
ConfigurableCredentialManager
ClientDistributedCacheManager
YarnSparkHadoopUtil
Settings
Spark Standalone
Spark Standalone
Standalone Master
Standalone Worker
web UI
Submission Gateways
Management Scripts for Standalone Master
Management Scripts for Standalone Workers
Checking Status
Example 2-workers-on-1-node Standalone Cluster (one executor per worker)
StandaloneSchedulerBackend
Spark on Mesos
Spark on Mesos
MesosCoarseGrainedSchedulerBackend
About Mesos
Execution Model
Execution Model
Security
Spark Security
Securing Web UI
Spark Core / Data Sources
Data Sources in Spark
Using Input and Output (I/O)
Parquet
Spark and Cassandra
Spark and Kafka
Couchbase Spark Connector
(obsolete) Spark GraphX
Spark GraphX — Distributed Graph Computations
Graph Algorithms
Monitoring, Tuning and Debugging
Unified Memory Management
Spark History Server
HistoryServer
SQLHistoryListener
FsHistoryProvider
HistoryServerArguments
Logging
Performance Tuning
MetricsSystem
MetricsConfig — Metrics System Configuration
Metrics Source
Metrics Sink
SparkListener — Intercepting Events from Spark Scheduler
LiveListenerBus
ReplayListenerBus
SparkListenerBus — Internal Contract for Spark Event Buses
EventLoggingListener — Spark Listener for Persisting Events
StatsReportListener — Logging Summary Statistics
JsonProtocol
Debugging Spark using sbt
Varia
Building Apache Spark from Sources
Spark and Hadoop
SparkHadoopUtil
Spark and software in-memory file systems
Spark and The Others
Distributed Deep Learning on Spark
Spark Packages
Interactive Notebooks
Interactive Notebooks
Apache Zeppelin
Spark Notebook
Spark Tips and Tricks
Spark Tips and Tricks
Access private members in Scala in Spark shell
SparkException: Task not serializable
Running Spark Applications on Windows
Exercises
One-liners using PairRDDFunctions
Learning Jobs and Partitions Using take Action
Spark Standalone - Using ZooKeeper for High-Availability of Master
Spark’s Hello World using Spark shell and Scala
WordCount using Spark shell
Your first complete Spark application (using Scala and sbt)
Spark (notable) use cases
Using Spark SQL to update data in Hive using ORC files
Developing Custom SparkListener to monitor DAGScheduler in Scala
Developing RPC Environment
Developing Custom RDD
Working with Datasets from JDBC Data Sources (and PostgreSQL)
Causing Stage to Fail
Further Learning
Courses
Books
Spark Distributions
DataStax Enterprise
MapR Sandbox for Hadoop (Spark 1.5.2 only)
Spark Workshop
Spark Advanced Workshop
Requirements
Day 1
Day 2
Spark Talk Ideas
Spark Talks Ideas (STI)
10 Lesser-Known Tidbits about Spark Standalone
Learning Spark internals using groupBy (to cause shuffle)
Environment Tab
Figure 1. Environment tab in Web UI