log4j.logger.org.apache.spark.sql.hive.HiveContext=DEBUG
Hive Integration
Spark SQL supports Apache Hive using HiveContext
. It uses the Spark SQL execution engine to work with data stored in Hive.
Note
|
|
HiveContext
is a specialized SQLContext
to work with Hive.
Tip
|
Import org.apache.spark.sql.hive package to use HiveContext .
|
Tip
|
Enable Add the following line to Refer to Logging. |
Hive Functions
SQLContext.sql (or simply sql
) allows you to interact with Hive.
You can use show functions
to learn about the Hive functions supported through the Hive integration.
scala> sql("show functions").show(false)
16/04/10 15:22:08 INFO HiveSqlParser: Parsing command: show functions
+---------------------+
|function |
+---------------------+
|! |
|% |
|& |
|* |
|+ |
|- |
|/ |
|< |
|<= |
|<=> |
|= |
|== |
|> |
|>= |
|^ |
|abs |
|acos |
|add_months |
|and |
|approx_count_distinct|
+---------------------+
only showing top 20 rows
Hive Configuration - hive-site.xml
The configuration for Hive is in hive-site.xml
on the classpath.
The default configuration uses Hive 1.2.1 with the default warehouse in /user/hive/warehouse
.
16/04/09 13:37:54 INFO HiveContext: Initializing execution hive, version 1.2.1
16/04/09 13:37:58 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/09 13:37:58 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/09 13:37:58 INFO HiveContext: default warehouse location is /user/hive/warehouse
16/04/09 13:37:58 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
16/04/09 13:38:01 DEBUG HiveContext: create HiveContext
current_database function
current_database
function returns the current database of Hive metadata.
scala> sql("select current_database()").show(false)
16/04/09 13:52:13 INFO HiveSqlParser: Parsing command: select current_database()
+-----------------+
|currentdatabase()|
+-----------------+
|default |
+-----------------+
current_database
function is registered when HiveContext
is initialized.
Internally, it uses private CurrentDatabase
class that uses HiveContext.sessionState.catalog.getCurrentDatabase
.
Analyzing Tables
analyze(tableName: String)
analyze
analyzes tableName
table for query optimizations. It currently supports only Hive tables.
scala> sql("show tables").show(false)
16/04/09 14:04:10 INFO HiveSqlParser: Parsing command: show tables
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|dafa |false |
+---------+-----------+
scala> spark.asInstanceOf[HiveContext].analyze("dafa")
16/04/09 14:02:56 INFO HiveSqlParser: Parsing command: dafa
java.lang.UnsupportedOperationException: Analyze only works for Hive tables, but dafa is a LogicalRelation
at org.apache.spark.sql.hive.HiveContext.analyze(HiveContext.scala:304)
... 50 elided
Experimental: Metastore Tables with non-Hive SerDe
Caution
|
FIXME Review the uses of convertMetastoreParquet , convertMetastoreParquetWithSchemaMerging , convertMetastoreOrc , convertCTAS .
|
Settings
-
spark.sql.hive.metastore.version
(default:1.2.1
) - the version of the Hive metastore. Supported versions from0.12.0
up to and including1.2.1
. -
spark.sql.hive.version
(default:1.2.1
) - the version of Hive used by Spark SQL.
Caution
|
FIXME Review HiveContext object.
|