Parquet, as you may already know, is an efficient columnar storage format available to any project in the Hadoop ecosystem, and of late it has been gaining a lot of traction as the de facto storage standard for Big Data & Analytics.
Parquet stores data in a compressed, binary-encoded format and is therefore not human-readable, so the community provides a set of tools for querying, reading, and writing Parquet files. As it turns out, installing these tools locally is not entirely straightforward, which is the motivation behind this small write-up.
So without further ado, let’s outline the steps needed to install the parquet-tools on your machine:
- Clone the repo: Parquet/parquet-mr
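Assuming the repository is cloned straight from GitHub (it has since moved under the Apache organization, so the URL may redirect):

```
git clone https://github.com/Parquet/parquet-mr.git
cd parquet-mr
```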
- Navigate to the tools subdirectory and build it
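A minimal sketch of this step, assuming the tools live in the parquet-tools module (the directory name may differ slightly depending on the version you check out):

```
cd parquet-tools
mvn clean package
```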
- If the build fails because a specific version of the com.twitter.parquet artifact cannot be fetched (as happened in my case), change the version from 1.6.0rc3-SNAPSHOT to 1.6.1-SNAPSHOT
- After it has been built successfully, we copy the dependencies (jar files) to target/dependency using Maven's copy-dependencies goal:
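The corresponding invocation, run from the same module directory, is:

```
mvn dependency:copy-dependencies
```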
This copies all the dependencies required to run the tools into the target/dependency folder of the current directory.
- The next step is to add the freshly built tools jar to the dependency folder created in the last step, so that all the necessary files are in one place.
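A plain copy is enough; the exact jar name depends on the version that was built, so a wildcard is used here:

```
cp target/parquet-tools-*.jar target/dependency/
```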
- Now that the target subdirectory holds all the code (jar files) in one place, we simply create a directory on our local system to keep the jar files and scripts together (in our case the scripts live at ~/usr/local/parquet and the jar files in a subdirectory at ~/usr/local/parquet/lib), as depicted below:
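Roughly, the layout can be created as follows; the wrapper script name (parquet-tools) is an assumption, and so is the main class, which was parquet.tools.Main in the 1.6.x line and org.apache.parquet.tools.Main in later releases:

```
mkdir -p ~/usr/local/parquet/lib
cp target/dependency/*.jar ~/usr/local/parquet/lib/

# wrapper script that puts every jar in lib/ on the classpath
cat > ~/usr/local/parquet/parquet-tools <<'EOF'
#!/bin/sh
# parquet.tools.Main is the 1.6.x main class; adjust for your version if needed
java -cp "$HOME/usr/local/parquet/lib/*" parquet.tools.Main "$@"
EOF
chmod +x ~/usr/local/parquet/parquet-tools
```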
This completes the installation, and you can now use the tools to read, dump, or display a Parquet file and/or its schema, as follows:
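Assuming the wrapper script sketched above is on your PATH (or is invoked with its full path), the typical subcommands look like this:

```
# print the schema of a parquet file
parquet-tools schema /path/to/file.parquet

# print the records as text
parquet-tools cat /path/to/file.parquet

# print only the first few records
parquet-tools head -n 5 /path/to/file.parquet

# low-level dump of row groups, columns and pages
parquet-tools dump /path/to/file.parquet
```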