Summary

Spark SQL

  1. Spark SQL — Queries Over Structured Data on Massive Scale

  2. SparkSession — The Entry Point to Spark SQL

  3. Dataset — Strongly-Typed Structured Query with Encoder

  4. Schema — Structure of Data

  5. Dataset Operators

  6. DataSource API — Loading and Saving Datasets

  7. CacheManager — In-Memory Cache for Tables and Views

  8. BaseRelation — Collection of Tuples with Schema

  9. QueryExecution — Query Execution of Dataset

  10. Spark SQL’s Performance Tuning Tips and Tricks (aka Case Studies)

  11. Expression — Executable Node in Catalyst Tree

    1. AggregateExpression — Expression Container for AggregateFunction

    2. AggregateFunction

    3. Attribute Leaf Expression

    4. BoundReference Leaf Expression — Reference to Value in InternalRow

    5. CallMethodViaReflection Expression

    6. Generator — Catalyst Expressions that Generate Zero Or More Rows

    7. JsonToStructs Unary Expression

    8. Literal Leaf Expression

    9. ScalaUDAF — Catalyst Expression Adapter for UserDefinedAggregateFunction

    10. StaticInvoke Non-SQL Expression

    11. TimeWindow Unevaluable Unary Expression

    12. UnixTimestamp TimeZoneAware Binary Expression

    13. WindowExpression Unevaluable Expression

    14. WindowFunction

  12. LogicalPlan — Logical Query Plan / Logical Operator

    1. Aggregate Unary Logical Operator

    2. BroadcastHint Unary Logical Operator

    3. DeserializeToObject Logical Operator

    4. Expand Unary Logical Operator

    5. GroupingSets Unary Logical Operator

    6. Hint Logical Operator

    7. InMemoryRelation Leaf Logical Operator For Cached Query Plans

    8. Join Logical Operator

    9. LocalRelation Logical Operator

    10. LogicalRelation Logical Operator — Adapter for BaseRelation

    11. Pivot Unary Logical Operator

    12. Repartition Logical Operators — Repartition and RepartitionByExpression

    13. RunnableCommand — Generic Logical Command with Side Effects

    14. SubqueryAlias Logical Operator

    15. UnresolvedFunction Logical Operator

    16. UnresolvedRelation Logical Operator

    17. Window Unary Logical Operator

    18. WithWindowDefinition Unary Logical Operator

  13. Analyzer — Logical Query Plan Analyzer

  14. SparkOptimizer — Logical Query Optimizer

  15. SparkPlan — Physical Query Plan / Physical Operator

    1. BroadcastExchangeExec Unary Operator for Broadcasting Joins

    2. BroadcastHashJoinExec Binary Physical Operator

    3. BroadcastNestedLoopJoinExec Binary Physical Operator

    4. CoalesceExec Unary Physical Operator

    5. DataSourceScanExec — Contract for Leaf Physical Operators with Code Generation

    6. ExecutedCommandExec Physical Operator

    7. HashAggregateExec Aggregate Physical Operator for Hash-Based Aggregation

    8. InMemoryTableScanExec Physical Operator

    9. LocalTableScanExec Physical Operator

    10. ObjectHashAggregateExec Aggregate Physical Operator

    11. ShuffleExchange Unary Physical Operator

    12. ShuffledHashJoinExec Binary Physical Operator

    13. SortAggregateExec Aggregate Physical Operator for Sort-Based Aggregation

    14. SortMergeJoinExec Binary Physical Operator

    15. InputAdapter Unary Physical Operator

    16. WindowExec Unary Physical Operator

    17. WholeStageCodegenExec Unary Operator with Java Code Generation

  16. Partitioning — Specification of Physical Operator’s Output Partitions

  17. SparkPlanner — Query Planner with no Hive Support

  18. Physical Plan Preparations Rules

  19. SQL Parsing Framework

  20. SQLMetric — Physical Operator Metric

  21. Catalyst — Tree Manipulation Framework

  22. ExchangeCoordinator and Adaptive Query Execution

  23. ShuffledRowRDD

  24. Debugging Query Execution

  25. Datasets vs DataFrames vs RDDs

  26. SQLConf

  27. Catalog

  28. ExternalCatalog — System Catalog of Permanent Entities

  29. SessionState

  30. SessionCatalog — Metastore of Session-Specific Relational Entities

  31. UDFRegistration

  32. FunctionRegistry

  33. ExperimentalMethods

  34. SQLExecution Helper Object

  35. CatalystSerde

  36. Tungsten Execution Backend (aka Project Tungsten)

  37. UnsafeHashedRelation

  38. ExternalAppendOnlyUnsafeRowArray — Append-Only Array for UnsafeRows (with Disk Spill Threshold)

  39. AggregationIterator — Generic Iterator of UnsafeRows for Aggregate Physical Operators

  40. JdbcDialect

  41. HadoopFileLinesReader

  42. KafkaWriter — Writing Dataset to Kafka

  43. Hive Integration

  44. Thrift JDBC/ODBC Server — Spark Thrift Server (STS)

  45. (obsolete) SQLContext

  46. Settings

Spark Core / Tools

  1. Spark Shell — spark-shell shell script

  2. Web UI — Spark Application’s Web Console

  3. Spark Submit — spark-submit shell script

  4. spark-class shell script

  5. SparkLauncher — Launching Spark Applications Programmatically

Spark Core / RDD

  1. Anatomy of Spark Application

  2. SparkConf — Programmable Configuration for Spark Applications

  3. SparkContext

  4. RDD — Resilient Distributed Dataset

  5. Operators

  6. Caching and Persistence

  7. Partitions and Partitioning

  8. Shuffling

  9. Checkpointing

  10. RDD Dependencies

  11. Map/Reduce-side Aggregator

Spark Core / Services

  1. SerializerManager

  2. MemoryManager — Memory Management

  3. SparkEnv — Spark Runtime Environment

  4. DAGScheduler — Stage-Oriented Scheduler

  5. TaskScheduler — Spark Scheduler

  6. SchedulerBackend — Pluggable Scheduler Backends

  7. ExecutorBackend — Pluggable Executor Backends

  8. BlockManager — Key-Value Store for Blocks

  9. MapOutputTracker — Shuffle Map Output Registry

  10. ShuffleManager — Pluggable Shuffle Systems

  11. Serialization

  12. ExternalClusterManager — Pluggable Cluster Managers

  13. BroadcastManager

  14. ContextCleaner — Spark Application Garbage Collector

  15. Dynamic Allocation (of Executors)

  16. HTTP File Server

  17. Data Locality

  18. Cache Manager

  19. OutputCommitCoordinator

  20. RpcEnv — RPC Environment

  21. TransportConf — Transport Configuration

(obsolete) Spark Streaming

Execution Model

Further Learning