3.2.2.2. DataFrame#

In the data science world, a DataFrame is a 2-dimensional data structure used to store and manipulate tabular data. The pandas library in Python provides powerful tools to work with DataFrames, allowing you to:

  • Read and write data from various file formats

  • Perform data transformations

  • Handle missing values

  • Filter, sort, and aggregate data

We’ll explore how to read data from different file formats into a Pandas DataFrame and how to save a DataFrame back into one of those formats.

Sample DataFrame

Name Age Department Salary
0 Alice 30 Engineering 85000
1 Bob 25 Marketing 62000
2 Charlie 28 Sales 70000
3 Diana 35 Engineering 92000
4 Ethan 40 HR 78000

You’ll be working with DataFrames extensively in the upcoming sections, so it’s important to get comfortable with them early on.

Common Operations#

df.head()#

Shows the first few rows of the DataFrame.

df.head()
Name Age Department Salary
0 Alice 30 Engineering 85000
1 Bob 25 Marketing 62000
2 Charlie 28 Sales 70000
3 Diana 35 Engineering 92000
4 Ethan 40 HR 78000

df.describe()#

Gives statistical summary of numerical columns.

df.describe()
Age Salary
count 5.00000 5.000000
mean 31.60000 77400.000000
std 5.94138 11865.917579
min 25.00000 62000.000000
25% 28.00000 70000.000000
50% 30.00000 78000.000000
75% 35.00000 85000.000000
max 40.00000 92000.000000

df.info()#

Provides a concise summary of the DataFrame: number of rows, columns, data types, and memory usage.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Department  5 non-null      object
 3   Salary      5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes

Filtering rows#

You can filter rows using conditions.

# Show employees older than 30
df[df["Age"] > 30]
Name Age Department Salary
3 Diana 35 Engineering 92000
4 Ethan 40 HR 78000

df.sort_values()#

Sort the DataFrame based on one or more columns.

df.sort_values(by="Salary", ascending=False)
Name Age Department Salary
3 Diana 35 Engineering 92000
0 Alice 30 Engineering 85000
4 Ethan 40 HR 78000
2 Charlie 28 Sales 70000
1 Bob 25 Marketing 62000

pd.concat()#

Concatenates multiple DataFrames vertically or horizontally.

df1 = df.iloc[:3]
df2 = df.iloc[3:]

pd.concat([df1, df2], axis=0)
Name Age Department Salary
0 Alice 30 Engineering 85000
1 Bob 25 Marketing 62000
2 Charlie 28 Sales 70000
3 Diana 35 Engineering 92000
4 Ethan 40 HR 78000

pd.merge()#

Merge two DataFrames based on a common column.

dept_info = pd.DataFrame({
    "Department": ["Engineering", "Marketing", "Sales", "HR"],
    "Location": ["Building A", "Building B", "Building C", "Building D"]
})

pd.merge(df, dept_info, on="Department")
Name Age Department Salary Location
0 Alice 30 Engineering 85000 Building A
1 Bob 25 Marketing 62000 Building B
2 Charlie 28 Sales 70000 Building C
3 Diana 35 Engineering 92000 Building A
4 Ethan 40 HR 78000 Building D

df.groupby()#

Group data by one or more columns and perform aggregation.

df.groupby("Department")["Salary"].mean()
Department
Engineering    88500.0
HR             78000.0
Marketing      62000.0
Sales          70000.0
Name: Salary, dtype: float64

df.isnull() and df.fillna()#

Check for and handle missing values.

df_with_missing = df.copy()
df_with_missing.loc[2, "Salary"] = None

# Check missing
df_with_missing.isnull()

# Fill missing
df_with_missing.fillna(0)
Name Age Department Salary
0 Alice 30 Engineering 85000.0
1 Bob 25 Marketing 62000.0
2 Charlie 28 Sales 0.0
3 Diana 35 Engineering 92000.0
4 Ethan 40 HR 78000.0

These commands form the foundation of most pandas workflows. You’ll use them frequently in tasks like data cleaning, transformation, analysis, and preprocessing.
To explore more commands and detailed usage, refer to the pandas documentation.