In this blog we will look at using Parquet files as a data source with the Spark framework, to understand what makes Parquet an attractive choice compared to other file-based storage formats.
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk, and is designed to be a common interchange format for both batch and interactive workloads.
Parquet was built with complex nested data structures in mind and uses the record shredding and assembly algorithm. Compression schemes can be specified at a per-column level, and the format is future-proofed to allow new encodings to be added as they are invented and implemented.
It is a free and open-source file format released under the Apache License. It is language agnostic, which makes it usable in cross-platform and cross-language scenarios.
Unlike file formats that store data in rows, Parquet organizes data by columns. This makes it effective in OLAP use cases, where large volumes of data must be scanned.
Its efficient data compression and decompression techniques mean it occupies less space than other file formats storing the same amount of data. It can also store data with complex, nested structures.
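As a quick illustration of these two points, the following is a minimal Spark sketch (a local session; the input path, output path, and column name are made up for the example). It writes a DataFrame to Parquet with Snappy compression, then reads back only one column, which the columnar layout lets Spark satisfy by scanning just that column's chunks:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-demo")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input: any DataFrame works here
        Dataset<Row> df = spark.read().json("input.json");

        // Write as Parquet; the compression codec applies per column chunk
        df.write()
          .option("compression", "snappy")
          .parquet("/tmp/demo.parquet");

        // Selecting a subset of columns only reads those column chunks
        Dataset<Row> names = spark.read()
                .parquet("/tmp/demo.parquet")
                .select("name");
        names.show();

        spark.stop();
    }
}
```

Running this requires the Spark SQL dependency on the classpath; it is a sketch of the idea, not the program built later in this post.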
We use a simple Java Maven project with JDK 17 and Spark 3.3.2. The program is segregated into 2 functions –
To avoid issues when running Spark 3.3 code on JDK 17, JVM options that allow access to JDK internals must be provided. Please find below the JVM entries:
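For reference, a commonly used set of such options (Spark itself passes options of this shape to the JVM on Java 17; your exact list may differ) looks like:

```
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED
--add-opens=java.base/sun.security.action=ALL-UNNAMED
```

These can be supplied as VM arguments in the IDE run configuration or via `MAVEN_OPTS` when launching from Maven.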