Books

  • O’Reilly

    • Learning Spark (my review at Amazon.com)

    • Advanced Analytics with Spark

    • Data Algorithms: Recipes for Scaling Up with Hadoop and Spark

    • Spark Operations: Operationalizing Apache Spark at Scale (in the works)

  • Manning

    • Spark in Action (MEAP)

    • Streaming Data (MEAP)

    • Spark GraphX in Action (MEAP)

  • Packt

    • Mastering Apache Spark

    • Spark Cookbook

    • Learning Real-time Processing with Spark Streaming

    • Machine Learning with Spark

    • Fast Data Processing with Spark, 2nd Edition

      • Fast Data Processing with Spark

    • Apache Spark Graph Processing

  • Apress

    • Big Data Analytics with Spark

    • Guide to High Performance Distributed Computing (Case Studies with Hadoop, Scalding and Spark)
