3.2.2.4. Parquet#

Apache Parquet is a columnar storage format commonly used in big data processing. Unlike row-based formats such as CSV, Parquet stores data column by column, which makes analytical queries faster and compression more effective.

Parquet files are:

  • Binary files (not human-readable)

  • Efficient in storage and I/O, especially with large datasets

  • Schema-aware, preserving data types and structure

  • Natively supported by big data tools like Spark, Hive, and Dask

Due to its performance benefits, Parquet is widely used in data pipelines and cloud data lakes.

Reading a Parquet File#

To work with Parquet files in pandas, you need a Parquet engine such as pyarrow or fastparquet.

Install one with:

pip install pyarrow

Then you can read a Parquet file like this:

import pandas as pd

# Read from a Parquet file
df = pd.read_parquet("example.parquet")
df.head()
      Name  Age   Department  Salary
0    Alice   30  Engineering   85000
1      Bob   25    Marketing   62000
2  Charlie   28        Sales   70000
3    Diana   35  Engineering   92000
4    Ethan   40           HR   78000

Writing a DataFrame to Parquet#

You can also write a pandas DataFrame to a Parquet file:

# Save the DataFrame to a Parquet file
df.to_parquet("output.parquet", index=False)

df2 = pd.read_parquet("output.parquet")
df2.head()
      Name  Age   Department  Salary
0    Alice   30  Engineering   85000
1      Bob   25    Marketing   62000
2  Charlie   28        Sales   70000
3    Diana   35  Engineering   92000
4    Ethan   40           HR   78000

Parquet is an excellent choice when working with large-scale structured data and is particularly well-suited for cloud storage and analytics environments.