Introduction
In this blog we will see the usage of parquet file as a datasource using Spark framework to understand what makes it an attractive solution as a file based datasource when compared to other file storage types.
Apache Parquet
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides competent data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads.
Parquet was built with complex nested data structures in mind and uses the record shredding and assembly algorithm. Compression schemes need to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented.
Characteristics
It is a free and open-source file format, under the Apache license group. It is a language agnostic which helps in using this format in cross platform and cross language scenarios.
Unlike other file storage formats which store in rows, parquet data is organized by columns. It can be used effectively in OLAP use cases, where huge data should be searched through.
Efficient data compression and decompression techniques are followed which occupies less space when compared to other file formats to store same amount of data. It can also be used to store data with complex data structure.
Benefits
Implementation
We use a simple ‘Java – Maven’ project with jdk 17 and Spark (3.3.2). The program is segregated into 2 functions –
Pom Dependencies:
Generate Parquet:
Query Parquet:
To avoid any issues while running spark 3.3 code with jdk 17, jvm entries to allow access to jdk internal should be provided. Please find below jvm entries:
–add-opens=java.base/java.lang=ALL-UNNAMED
–add-opens=java.base/java.lang.invoke=ALL-UNNAMED
–add-opens=java.base/java.lang.reflect=ALL-UNNAMED
–add-opens=java.base/java.io=ALL-UNNAMED
–add-opens=java.base/java.net=ALL-UNNAMED
–add-opens=java.base/java.nio=ALL-UNNAMED
–add-opens=java.base/java.util=ALL-UNNAMED
–add-opens=java.base/java.util.concurrent=ALL-UNNAMED
–add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
–add-opens=java.base/sun.nio.ch=ALL-UNNAMED
–add-opens=java.base/sun.nio.cs=ALL-UNNAMED
–add-opens=java.base/sun.security.action=ALL-UNNAMED
–add-opens=java.base/sun.util.calendar=ALL-UNNAMED
–add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
Cookie | Duration | Description |
---|---|---|
cookielawinfo-checkbox-analytics | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics". |
cookielawinfo-checkbox-functional | 11 months | The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional". |
cookielawinfo-checkbox-necessary | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary". |
cookielawinfo-checkbox-others | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other. |
cookielawinfo-checkbox-performance | 11 months | This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance". |
viewed_cookie_policy | 11 months | The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data. |