Environment Tab

Figure 1. Environment tab in Web UI
