DataFrame

3.2.2.2. DataFrame

A DataFrame is a two-dimensional, labeled data structure for storing and manipulating tabular data: each column has a name and a data type, and each row has an index label. The pandas library in Python provides powerful tools to work with DataFrames, allowing you to:

Read and write data from various file formats
Perform data transformations
Handle missing values
Filter, sort, and aggregate data

We’ll explore how to read data from different file formats into a pandas DataFrame and how to save a DataFrame back into one of those formats.
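As a small preview of reading and writing, here is a minimal sketch of a round trip through CSV. It uses an in-memory text buffer instead of a real file so it runs anywhere; the frame `people` and the buffer are illustrative assumptions, not part of the original material.

```python
import io

import pandas as pd

# A tiny illustrative frame (hypothetical data for this sketch)
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 25]})

# Write the frame as CSV text; index=False leaves out the row labels
buffer = io.StringIO()
people.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

With a real file you would pass a path such as `"people.csv"` in place of the buffer.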

Sample DataFrame

	Name	Age	Department	Salary
0	Alice	30	Engineering	85000
1	Bob	25	Marketing	62000
2	Charlie	28	Sales	70000
3	Diana	35	Engineering	92000
4	Ethan	40	HR	78000
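The page does not show how `df` is created. One possible way to build the sample table above is from a dictionary of lists; this constructor call is an assumption about how the table was made, chosen because it is the shortest route.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# Rows receive a default integer index starting at 0
print(df)
```

The examples in the rest of this section assume `pd` and `df` are defined this way.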

You’ll be working with DataFrames extensively in the upcoming sections, so it’s important to get comfortable with them early on.

Common Operations

`df.head()`

Shows the first rows of the DataFrame (five by default).

df.head()

	Name	Age	Department	Salary
0	Alice	30	Engineering	85000
1	Bob	25	Marketing	62000
2	Charlie	28	Sales	70000
3	Diana	35	Engineering	92000
4	Ethan	40	HR	78000
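Because the sample table has only five rows, the default call above returns everything. A short sketch of limiting the row count, and of the companion method `df.tail()`, assuming the sample data shown earlier:

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# Pass a number to control how many rows are returned
first_two = df.head(2)

# tail() works the same way from the end of the DataFrame
last_two = df.tail(2)

print(first_two)
print(last_two)
```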

`df.describe()`

Gives a statistical summary of the numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum.

df.describe()

	Age	Salary
count	5.00000	5.000000
mean	31.60000	77400.000000
std	5.94138	11865.917579
min	25.00000	62000.000000
25%	28.00000	70000.000000
50%	30.00000	78000.000000
75%	35.00000	85000.000000
max	40.00000	92000.000000
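The summary is itself a DataFrame, so individual statistics can be looked up by label. The sketch below also passes `include="all"` to bring the text columns into the summary; it assumes the sample data from this section.

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# Look up one statistic by row label and column name
summary = df.describe()
mean_salary = summary.loc["mean", "Salary"]

# include="all" adds non-numeric columns, with rows such as "unique" and "top"
full_summary = df.describe(include="all")
unique_departments = full_summary.loc["unique", "Department"]

print(mean_salary)
print(unique_departments)
```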

`df.info()`

Provides a concise summary of the DataFrame: the index range, the columns with their non-null counts and data types, and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Department  5 non-null      object
 3   Salary      5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
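`df.info()` prints its report rather than returning it. When the same facts are needed as values, attributes such as `shape` and `dtypes` and the method `memory_usage()` can be used instead. A minimal sketch, assuming the sample data from this section:

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# (number of rows, number of columns)
n_rows, n_cols = df.shape

# Data type of each column, as a Series indexed by column name
column_types = df.dtypes

# Bytes used per column; deep=True also measures the text stored in string columns
bytes_per_column = df.memory_usage(deep=True)

print(n_rows, n_cols)
print(column_types)
print(bytes_per_column)
```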

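Filtering rows

You can filter rows with a boolean condition. The expression inside the brackets produces a True/False value for each row, and only the rows where it is True are kept.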

# Show employees older than 30
df[df["Age"] > 30]

	Name	Age	Department	Salary
3	Diana	35	Engineering	92000
4	Ethan	40	HR	78000
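Conditions can also be combined. The sketch below shows three common variations: `&` to require two conditions at once, `isin()` to match a list of values, and `query()` to write the condition as a string. It assumes the sample data from this section.

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
senior_engineers = df[(df["Age"] >= 30) & (df["Department"] == "Engineering")]

# Keep rows whose value appears in a list
sales_or_hr = df[df["Department"].isin(["Sales", "HR"])]

# The same kind of filter written as a query string
well_paid = df.query("Salary > 80000")

print(senior_engineers)
print(sales_or_hr)
print(well_paid)
```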

`df.sort_values()`

Sort the DataFrame based on one or more columns.

df.sort_values(by="Salary", ascending=False)

	Name	Age	Department	Salary
3	Diana	35	Engineering	92000
0	Alice	30	Engineering	85000
4	Ethan	40	HR	78000
2	Charlie	28	Sales	70000
1	Bob	25	Marketing	62000
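To sort on more than one column, pass lists to `by` and `ascending`. The following sketch, assuming the sample data from this section, orders departments alphabetically and lists the highest salary first within each department.

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# Department ascending, then Salary descending within each department
by_dept_then_salary = df.sort_values(
    by=["Department", "Salary"],
    ascending=[True, False],
)

# sort_values returns a new DataFrame; reset_index renumbers its rows from 0
renumbered = by_dept_then_salary.reset_index(drop=True)

print(renumbered)
```

The original `df` keeps its order, because `sort_values()` does not modify the DataFrame unless `inplace=True` is passed.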

`pd.concat()`

Concatenates multiple DataFrames vertically (`axis=0`, stacking rows) or horizontally (`axis=1`, placing columns side by side).

df1 = df.iloc[:3]
df2 = df.iloc[3:]

pd.concat([df1, df2], axis=0)

	Name	Age	Department	Salary
0	Alice	30	Engineering	85000
1	Bob	25	Marketing	62000
2	Charlie	28	Sales	70000
3	Diana	35	Engineering	92000
4	Ethan	40	HR	78000
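Two variations are worth seeing next to the example above: `ignore_index=True`, which renumbers the rows of the result, and `axis=1`, which joins column-wise. This is a sketch assuming the sample data from this section; the names `top`, `bottom`, `left`, and `right` are illustrative.

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

top = df.iloc[:3]
bottom = df.iloc[3:]

# Stack in a different order; ignore_index=True replaces the old labels with 0..n-1
stacked = pd.concat([bottom, top], ignore_index=True)

# Split the columns into two pieces, then glue them back side by side
left = df[["Name", "Age"]]
right = df[["Department", "Salary"]]
side_by_side = pd.concat([left, right], axis=1)

print(stacked)
print(side_by_side)
```

Without `ignore_index=True`, the stacked result would keep the original labels 3, 4, 0, 1, 2.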

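`pd.merge()`

Merges two DataFrames on a common column, much like a SQL join. By default it performs an inner join, keeping only rows whose key appears in both DataFrames.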

dept_info = pd.DataFrame({
    "Department": ["Engineering", "Marketing", "Sales", "HR"],
    "Location": ["Building A", "Building B", "Building C", "Building D"]
})

pd.merge(df, dept_info, on="Department")

	Name	Age	Department	Salary	Location
0	Alice	30	Engineering	85000	Building A
1	Bob	25	Marketing	62000	Building B
2	Charlie	28	Sales	70000	Building C
3	Diana	35	Engineering	92000	Building A
4	Ethan	40	HR	78000	Building D
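In the example above every department has a match, so the join type makes no difference. The sketch below deliberately leaves HR out of the lookup table (an assumption made only for illustration) to show how the default inner join differs from `how="left"`.

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# Lookup table with HR intentionally missing (illustrative assumption)
partial_dept_info = pd.DataFrame({
    "Department": ["Engineering", "Marketing", "Sales"],
    "Location": ["Building A", "Building B", "Building C"],
})

# Inner join (the default): rows without a match are dropped
inner = pd.merge(df, partial_dept_info, on="Department")

# Left join: every row of df is kept; unmatched rows get a missing Location
left = pd.merge(df, partial_dept_info, on="Department", how="left")

print(inner)
print(left)
```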

`df.groupby()`

Group data by one or more columns and perform aggregation.

df.groupby("Department")["Salary"].mean()

Department
Engineering    88500.0
HR             78000.0
Marketing      62000.0
Sales          70000.0
Name: Salary, dtype: float64
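Several aggregations can be computed in one pass with `agg()`. The sketch below uses named aggregation, where each keyword is an output column name paired with a (source column, function) tuple. The output names `headcount`, `mean_salary`, and `max_age` are made up for this example, which assumes the sample data from this section.

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# One row per department, one column per named aggregation
dept_stats = df.groupby("Department").agg(
    headcount=("Name", "count"),
    mean_salary=("Salary", "mean"),
    max_age=("Age", "max"),
)

print(dept_stats)
```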

`df.isnull()` and `df.fillna()`

Check for and handle missing values.

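df_with_missing = df.copy()
# Cast to float so the column can hold NaN; newer pandas versions may warn
# about (or reject) assigning a missing value into an integer column
df_with_missing["Salary"] = df_with_missing["Salary"].astype("float64")
df_with_missing.loc[2, "Salary"] = None

# Check missing: returns a DataFrame of True/False flags
df_with_missing.isnull()

# Fill missing: returns a new DataFrame with NaN replaced by 0
df_with_missing.fillna(0)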

	Name	Age	Department	Salary
0	Alice	30	Engineering	85000.0
1	Bob	25	Marketing	62000.0
2	Charlie	28	Sales	0.0
3	Diana	35	Engineering	92000.0
4	Ethan	40	HR	78000.0
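Filling with 0 is not always appropriate, since it would drag down an average salary. Two common alternatives are sketched below: filling with the column mean, and dropping the incomplete rows. The example assumes the sample data from this section, and the choice of the mean as a fill value is an illustration, not a recommendation.

```python
import pandas as pd

# Sample data from this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "Age": [30, 25, 28, 35, 40],
    "Department": ["Engineering", "Marketing", "Sales", "Engineering", "HR"],
    "Salary": [85000, 62000, 70000, 92000, 78000],
})

# Make Salary a float column, then blank out one value
df_with_missing = df.copy()
df_with_missing["Salary"] = df_with_missing["Salary"].astype("float64")
df_with_missing.loc[2, "Salary"] = float("nan")

# Count missing values per column
missing_per_column = df_with_missing.isnull().sum()

# mean() skips NaN by default, so this averages the four known salaries
mean_salary = df_with_missing["Salary"].mean()

# Fill only the Salary column, using a {column: value} mapping
filled = df_with_missing.fillna({"Salary": mean_salary})

# Alternatively, drop the rows where Salary is missing
dropped = df_with_missing.dropna(subset=["Salary"])

print(missing_per_column)
print(filled)
print(dropped)
```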

These commands form the foundation of most pandas workflows. You’ll use them frequently in tasks like data cleaning, transformation, analysis, and preprocessing.
To explore more commands and detailed usage, refer to the pandas documentation.
