Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Spark 1.5 uses Parquet 1.7.
- Excellent for local file storage on HDFS (instead of external databases).
- Well suited to writing very large datasets to disk.
- Supports schema and schema evolution.
- Faster than JSON/gzip.
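
To make "columnar" and "schema evolution" more concrete, here is a toy sketch in plain Python. It is not how Parquet is implemented (the real format adds row groups, encodings, and compression), and the helper names `to_columns` and `merge_schemas` as well as the sample records are hypothetical, chosen only for illustration.

```python
# Toy illustration only: real Parquet adds row groups, encodings, and compression.

def to_columns(rows, schema):
    """Pivot row-oriented records into one list per column (columnar layout)."""
    # Rows that lack a column get None, much like a null in an evolved schema.
    return {name: [row.get(name) for row in rows] for name in schema}


def merge_schemas(old_schema, new_schema):
    """Append columns that appear only in the newer schema, keeping order."""
    return old_schema + [name for name in new_schema if name not in old_schema]


# Hypothetical sample data; the second batch adds a "country" column.
batch_v1 = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
batch_v2 = [{"id": 3, "name": "Cid", "country": "PL"}]

schema = merge_schemas(["id", "name"], ["id", "name", "country"])
columns = to_columns(batch_v1 + batch_v2, schema)

# A query that needs only "id" reads one list and never touches the others.
ids = columns["id"]
```

Because each column is stored contiguously, a reader that needs only a few columns can skip the rest, and values of the same type tend to compress well together. Older records simply hold nulls for columns added later.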