3.2.2.2. DataFrame#
In the data science world, a DataFrame is a 2-dimensional data structure used to store and manipulate tabular data. The pandas library in Python provides powerful tools to work with DataFrames, allowing you to:
Read and write data from various file formats
Perform data transformations
Handle missing values
Filter, sort, and aggregate data
We’ll explore how to read data from different file formats into a Pandas DataFrame and how to save a DataFrame back into one of those formats.
Sample DataFrame
| Name | Age | Department | Salary | |
|---|---|---|---|---|
| 0 | Alice | 30 | Engineering | 85000 |
| 1 | Bob | 25 | Marketing | 62000 |
| 2 | Charlie | 28 | Sales | 70000 |
| 3 | Diana | 35 | Engineering | 92000 |
| 4 | Ethan | 40 | HR | 78000 |
You’ll be working with DataFrames extensively in the upcoming sections, so it’s important to get comfortable with them early on.
Common Operations#
df.head()#
Shows the first few rows of the DataFrame.
df.head()
| Name | Age | Department | Salary | |
|---|---|---|---|---|
| 0 | Alice | 30 | Engineering | 85000 |
| 1 | Bob | 25 | Marketing | 62000 |
| 2 | Charlie | 28 | Sales | 70000 |
| 3 | Diana | 35 | Engineering | 92000 |
| 4 | Ethan | 40 | HR | 78000 |
df.describe()#
Gives statistical summary of numerical columns.
df.describe()
| Age | Salary | |
|---|---|---|
| count | 5.00000 | 5.000000 |
| mean | 31.60000 | 77400.000000 |
| std | 5.94138 | 11865.917579 |
| min | 25.00000 | 62000.000000 |
| 25% | 28.00000 | 70000.000000 |
| 50% | 30.00000 | 78000.000000 |
| 75% | 35.00000 | 85000.000000 |
| max | 40.00000 | 92000.000000 |
df.info()#
Provides a concise summary of the DataFrame: number of rows, columns, data types, and memory usage.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Department 5 non-null object
3 Salary 5 non-null int64
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
Filtering rows#
You can filter rows using conditions.
# Show employees older than 30
df[df["Age"] > 30]
| Name | Age | Department | Salary | |
|---|---|---|---|---|
| 3 | Diana | 35 | Engineering | 92000 |
| 4 | Ethan | 40 | HR | 78000 |
df.sort_values()#
Sort the DataFrame based on one or more columns.
df.sort_values(by="Salary", ascending=False)
| Name | Age | Department | Salary | |
|---|---|---|---|---|
| 3 | Diana | 35 | Engineering | 92000 |
| 0 | Alice | 30 | Engineering | 85000 |
| 4 | Ethan | 40 | HR | 78000 |
| 2 | Charlie | 28 | Sales | 70000 |
| 1 | Bob | 25 | Marketing | 62000 |
pd.concat()#
Concatenates multiple DataFrames vertically or horizontally.
df1 = df.iloc[:3]
df2 = df.iloc[3:]
pd.concat([df1, df2], axis=0)
| Name | Age | Department | Salary | |
|---|---|---|---|---|
| 0 | Alice | 30 | Engineering | 85000 |
| 1 | Bob | 25 | Marketing | 62000 |
| 2 | Charlie | 28 | Sales | 70000 |
| 3 | Diana | 35 | Engineering | 92000 |
| 4 | Ethan | 40 | HR | 78000 |
pd.merge()#
Merge two DataFrames based on a common column.
dept_info = pd.DataFrame({
"Department": ["Engineering", "Marketing", "Sales", "HR"],
"Location": ["Building A", "Building B", "Building C", "Building D"]
})
pd.merge(df, dept_info, on="Department")
| Name | Age | Department | Salary | Location | |
|---|---|---|---|---|---|
| 0 | Alice | 30 | Engineering | 85000 | Building A |
| 1 | Bob | 25 | Marketing | 62000 | Building B |
| 2 | Charlie | 28 | Sales | 70000 | Building C |
| 3 | Diana | 35 | Engineering | 92000 | Building A |
| 4 | Ethan | 40 | HR | 78000 | Building D |
df.groupby()#
Group data by one or more columns and perform aggregation.
df.groupby("Department")["Salary"].mean()
Department
Engineering 88500.0
HR 78000.0
Marketing 62000.0
Sales 70000.0
Name: Salary, dtype: float64
df.isnull() and df.fillna()#
Check for and handle missing values.
df_with_missing = df.copy()
df_with_missing.loc[2, "Salary"] = None
# Check missing
df_with_missing.isnull()
# Fill missing
df_with_missing.fillna(0)
| Name | Age | Department | Salary | |
|---|---|---|---|---|
| 0 | Alice | 30 | Engineering | 85000.0 |
| 1 | Bob | 25 | Marketing | 62000.0 |
| 2 | Charlie | 28 | Sales | 0.0 |
| 3 | Diana | 35 | Engineering | 92000.0 |
| 4 | Ethan | 40 | HR | 78000.0 |
These commands form the foundation of most pandas workflows. You’ll use them frequently in tasks like data cleaning, transformation, analysis, and preprocessing.
To explore more commands and detailed usage, refer to the pandas documentation.