3.2.2.4. Parquet#
Apache Parquet is a columnar storage format commonly used in big data processing. Unlike row-based formats like CSV, Parquet stores data column-wise, which makes it highly efficient for analytical queries and compression.
Parquet files are:

- **Binary** (not human-readable)
- **Efficient** in storage and I/O, especially with large datasets
- **Schema-aware**, preserving data types and structure
- **Natively supported** by big data tools like Spark, Hive, and Dask
Due to its performance benefits, Parquet is widely used in data pipelines and cloud data lakes.
Reading a Parquet File#
To work with Parquet files in pandas, you need a Parquet engine such as pyarrow or fastparquet.
Install one with:
```bash
pip install pyarrow
```
Then you can read a Parquet file like this:
```python
import pandas as pd

# Read from a Parquet file
df = pd.read_parquet("example.parquet")
df.head()
```
|   | Name | Age | Department | Salary |
|---|---|---|---|---|
| 0 | Alice | 30 | Engineering | 85000 |
| 1 | Bob | 25 | Marketing | 62000 |
| 2 | Charlie | 28 | Sales | 70000 |
| 3 | Diana | 35 | Engineering | 92000 |
| 4 | Ethan | 40 | HR | 78000 |
Writing a DataFrame to Parquet#
You can also write a pandas DataFrame to a Parquet file:
```python
# Save the DataFrame to a Parquet file
df.to_parquet("output.parquet", index=False)

df2 = pd.read_parquet("output.parquet")
df2.head()
```
|   | Name | Age | Department | Salary |
|---|---|---|---|---|
| 0 | Alice | 30 | Engineering | 85000 |
| 1 | Bob | 25 | Marketing | 62000 |
| 2 | Charlie | 28 | Sales | 70000 |
| 3 | Diana | 35 | Engineering | 92000 |
| 4 | Ethan | 40 | HR | 78000 |
Parquet is an excellent choice when working with large-scale structured data and is particularly well-suited for cloud storage and analytics environments.