5.1.2. Data Preview#

Before doing any analysis, it’s essential to look at the data in its raw form.

5.1.2.1. First Look at the Data#

For large datasets, it might not be practical to view all rows, but you should always inspect at least the first 5–10 records manually before doing anything else.

This exercise helps you:

  • Familiarize yourself with the columns and their contents

  • Spot obvious issues like strange encodings or missing values

  • Start thinking about what could be features and targets

Pandas provides the .head() method to quickly preview rows.
It takes an optional parameter indicating the number of rows to display.

Example: Previewing Data with .head()#

Let’s use the Chinook database, which contains tables such as artists, albums, and tracks.

import sqlite3
import pandas as pd

# Connect to Chinook sample database
conn = sqlite3.connect("../data/chinook.db")

# Load the 'albums' table into a dataframe
df_albums = pd.read_sql_query("SELECT * FROM albums", conn)

# Show the first 5 rows
df_albums.head()
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3
# Load the 'albums' table into a dataframe
df_tracks = pd.read_sql_query("SELECT * FROM tracks", conn)

# Show the first 5 rows
df_tracks.head()
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
0 1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
1 2 Balls to the Wall 2 2 1 None 342562 5510424 0.99
2 3 Fast As a Shark 3 2 1 F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho... 230619 3990994 0.99
3 4 Restless and Wild 3 2 1 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D... 252051 4331779 0.99
4 5 Princess of the Dawn 3 2 1 Deaffy & R.A. Smith-Diesel 375418 6290521 0.99

Sometimes datasets have a large number of columns, making it hard to explore everything in a scrollable table. It’s good practice to print:

  • The shape (number of rows and columns). You can get this using df.shape.

  • The list of column names. Pandas provides df.columns for this.

  • The data types of each column. df.dtypes gives you an overview of column types.

This helps you:

  • Understand the size and complexity of the dataset

  • Start identifying which fields are potential features or labels

  • Spot data types that might need cleaning (e.g., numbers stored as text)

5.1.2.2. Summary Overview#

Once we have previewed the data and built a basic understanding, it helps to look at summary views of individual features and the dataset as a whole. This further improves your familiarity with:

  • How many non-null values each column has

  • What data types are assigned

  • Memory usage

Pandas provides the .info() method for this.

Example: Using .info()#

Let’s switch to a richer example using the tracks table:

# Show a concise summary of the dataframe
df_tracks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3503 entries, 0 to 3502
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TrackId       3503 non-null   int64  
 1   Name          3503 non-null   object 
 2   AlbumId       3503 non-null   int64  
 3   MediaTypeId   3503 non-null   int64  
 4   GenreId       3503 non-null   int64  
 5   Composer      2525 non-null   object 
 6   Milliseconds  3503 non-null   int64  
 7   Bytes         3503 non-null   int64  
 8   UnitPrice     3503 non-null   float64
dtypes: float64(1), int64(6), object(2)
memory usage: 246.4+ KB

5.1.2.3. Statistical Summaries#

If you have numeric data, statistical summaries give you useful metrics like:

  • mean

  • min and max

  • standard deviation

  • percentiles

These help you quickly see the distribution and spread of values in each numeric column.

Pandas provides the .describe() method to generate these statistics.

Example: Using .describe()#

# Show descriptive statistics for numeric columns
df_tracks.describe()
TrackId AlbumId MediaTypeId GenreId Milliseconds Bytes UnitPrice
count 3503.000000 3503.000000 3503.000000 3503.000000 3.503000e+03 3.503000e+03 3503.000000
mean 1752.000000 140.929489 1.208393 5.725378 3.935992e+05 3.351021e+07 1.050805
std 1011.373324 81.775395 0.580443 6.190204 5.350054e+05 1.053925e+08 0.239006
min 1.000000 1.000000 1.000000 1.000000 1.071000e+03 3.874700e+04 0.990000
25% 876.500000 70.500000 1.000000 1.000000 2.072810e+05 6.342566e+06 0.990000
50% 1752.000000 141.000000 1.000000 3.000000 2.556340e+05 8.107896e+06 0.990000
75% 2627.500000 212.000000 1.000000 7.000000 3.216450e+05 1.026679e+07 0.990000
max 3503.000000 347.000000 5.000000 25.000000 5.286953e+06 1.059546e+09 1.990000

If your dataset has categorical columns and you want to include them in the summary, you can pass include="all":

# Describe all columns, including object types
df_tracks.describe(include="all")
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
count 3503.000000 3503 3503.000000 3503.000000 3503.000000 2525 3.503000e+03 3.503000e+03 3503.000000
unique NaN 3257 NaN NaN NaN 852 NaN NaN NaN
top NaN The Trooper NaN NaN NaN Steve Harris NaN NaN NaN
freq NaN 5 NaN NaN NaN 80 NaN NaN NaN
mean 1752.000000 NaN 140.929489 1.208393 5.725378 NaN 3.935992e+05 3.351021e+07 1.050805
std 1011.373324 NaN 81.775395 0.580443 6.190204 NaN 5.350054e+05 1.053925e+08 0.239006
min 1.000000 NaN 1.000000 1.000000 1.000000 NaN 1.071000e+03 3.874700e+04 0.990000
25% 876.500000 NaN 70.500000 1.000000 1.000000 NaN 2.072810e+05 6.342566e+06 0.990000
50% 1752.000000 NaN 141.000000 1.000000 3.000000 NaN 2.556340e+05 8.107896e+06 0.990000
75% 2627.500000 NaN 212.000000 1.000000 7.000000 NaN 3.216450e+05 1.026679e+07 0.990000
max 3503.000000 NaN 347.000000 5.000000 25.000000 NaN 5.286953e+06 1.059546e+09 1.990000

This combination of .info() and .describe() gives you a comprehensive overview of your dataset before you begin cleaning or modeling.