Using Spark SQL to update data in Hive using ORC files
This example showed up on Spark's users mailing list.
The solution was to use Hive in ORC format with partitions (a sketch of the flow follows the list):
- A table in Hive stored as an ORC file (using partitioning)
- Using `SQLContext.sql` to insert data into the table
- Using `SQLContext.sql` to periodically run `ALTER TABLE … CONCATENATE` to merge your many small files into larger files optimized for your HDFS block size
  - Since the `CONCATENATE` command operates on files in place, it is transparent to any downstream processing
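A minimal sketch of this flow, assuming Spark 1.x with a Hive-backed `SQLContext` (`HiveContext`); the table name `events`, the partition column `dt`, and the source table `incoming_events` are hypothetical. Whether `ALTER TABLE … CONCATENATE` can be issued through Spark's Hive integration or has to be run in Hive itself depends on the Spark and Hive versions in use.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OrcConcatenate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orc-concatenate"))
    val sqlContext = new HiveContext(sc) // Hive support is needed for ORC tables

    // A table in Hive stored as ORC, using partitioning (hypothetical schema)
    sqlContext.sql(
      """CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
        |PARTITIONED BY (dt STRING)
        |STORED AS ORC""".stripMargin)

    // Insert data into the table; every insert adds new (small) files
    sqlContext.sql(
      """INSERT INTO TABLE events PARTITION (dt = '2016-01-01')
        |SELECT id, payload FROM incoming_events""".stripMargin)

    // Periodically merge the small files into larger ones; CONCATENATE
    // rewrites the files in place, so downstream jobs are unaffected
    sqlContext.sql("ALTER TABLE events PARTITION (dt = '2016-01-01') CONCATENATE")
  }
}
```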
- The Hive solution is just to concatenate the files
  - it does not alter or change records
- It is possible to update data in Hive using the ORC format
- With transactional tables in Hive, together with INSERT, UPDATE and DELETE, Hive does the "concatenate" for you automatically at regular intervals; currently this works only with tables stored as ORC (`STORED AS ORC`). See the sketch after this list
- Alternatively, use HBase with Phoenix as the SQL layer on top
- Hive was originally not designed for updates, because it was purely warehouse-focused; the most recent releases can do updates, deletes, etc. in a transactional way
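A sketch of the transactional-table variant, assuming Hive 0.14 or later with ACID enabled on the server side (`hive.txn.manager` set to `DbTxnManager`, compaction enabled). Since Spark's `SQLContext` could not run Hive `UPDATE`/`DELETE` statements, the sketch goes through the Hive JDBC driver instead; the connection URL and table name are hypothetical.

```scala
import java.sql.DriverManager

object HiveTransactionalTable {
  def main(args: Array[String]): Unit = {
    // Issued via HiveServer2, not Spark: Spark SQL had no support for
    // Hive ACID UPDATE/DELETE (hypothetical connection URL)
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    try {
      // Transactional tables must be bucketed and stored as ORC
      stmt.execute(
        """CREATE TABLE IF NOT EXISTS events_txn (id BIGINT, payload STRING)
          |CLUSTERED BY (id) INTO 8 BUCKETS
          |STORED AS ORC
          |TBLPROPERTIES ('transactional' = 'true')""".stripMargin)

      // Hive writes deltas for these and compacts ("concatenates") them
      // automatically at regular intervals, so no manual CONCATENATE
      stmt.execute("UPDATE events_txn SET payload = 'fixed' WHERE id = 42")
      stmt.execute("DELETE FROM events_txn WHERE id = 13")
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
```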
 
Criteria:
- Spark Streaming jobs are receiving a lot of small events (10 KB on average)
- Events are stored to HDFS, e.g. for Pig jobs
- There are a lot of small files in HDFS (several million)
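For context, a minimal sketch of the kind of job that produces these criteria (the socket source and output prefix are hypothetical): every micro-batch writes its own directory of part files, so small, frequent events turn into millions of small HDFS files.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SmallEventsToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("small-events"), Seconds(10))

    // Hypothetical source of small (~10 KB) events
    val events = ssc.socketTextStream("localhost", 9999)

    // Each 10-second batch writes a new directory of part files to HDFS
    // (e.g. for downstream Pig jobs), which is how millions of small
    // files accumulate over time
    events.saveAsTextFiles("hdfs:///events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```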