Data Preview

5.1.2. Data Preview

Before doing any analysis, it’s essential to look at the data in its raw form.

5.1.2.1. First Look at the Data

For large datasets, it might not be practical to view all rows, but you should always inspect at least the first 5–10 records manually before doing anything else.

This exercise helps you:

Familiarize yourself with the columns and their contents
Spot obvious issues like strange encodings or missing values
Start thinking about what could be features and targets

Pandas provides the .head() method to quickly preview rows.
It takes an optional parameter indicating the number of rows to display.

Example: Previewing Data with `.head()`

Let’s use the Chinook database, which contains tables such as artists, albums, and tracks.

import sqlite3
import pandas as pd

# Connect to Chinook sample database
conn = sqlite3.connect("../data/chinook.db")

# Load the 'albums' table into a dataframe
df_albums = pd.read_sql_query("SELECT * FROM albums", conn)

# Show the first 5 rows
df_albums.head()

	AlbumId	Title	ArtistId
0	1	For Those About To Rock We Salute You	1
1	2	Balls to the Wall	2
2	3	Restless and Wild	2
3	4	Let There Be Rock	1
4	5	Big Ones	3

# Load the 'tracks' table into a dataframe
df_tracks = pd.read_sql_query("SELECT * FROM tracks", conn)

# Show the first 5 rows
df_tracks.head()

	TrackId	Name	AlbumId	MediaTypeId	GenreId	Composer	Milliseconds	Bytes	UnitPrice
0	1	For Those About To Rock (We Salute You)	1	1	1	Angus Young, Malcolm Young, Brian Johnson	343719	11170334	0.99
1	2	Balls to the Wall	2	2	1	None	342562	5510424	0.99
2	3	Fast As a Shark	3	2	1	F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho...	230619	3990994	0.99
3	4	Restless and Wild	3	2	1	F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...	252051	4331779	0.99
4	5	Princess of the Dawn	3	2	1	Deaffy & R.A. Smith-Diesel	375418	6290521	0.99
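Both previews above rely on the default of five rows. As a minimal sketch of the optional row-count parameter, the snippet below builds a small DataFrame from made-up values (`df_sample` and its contents are hypothetical, not taken from the Chinook database) and asks for a different number of rows.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the Chinook database
df_sample = pd.DataFrame(
    {
        "TrackId": [1, 2, 3, 4, 5, 6, 7, 8],
        "UnitPrice": [0.99, 0.99, 0.99, 1.99, 0.99, 0.99, 1.99, 0.99],
    }
)

# With no argument, .head() returns the first 5 rows
first_five = df_sample.head()

# Pass a row count to see fewer (or more) rows
first_three = df_sample.head(3)

print(first_three)
```

The same call works on `df_albums` or `df_tracks`, for example `df_tracks.head(10)` to inspect the first ten records.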

Sometimes datasets have a large number of columns, making it hard to explore everything in a scrollable table. It’s good practice to print:

The shape (number of rows and columns). You can get this using df.shape.
The list of column names. Pandas provides df.columns for this.
The data types of each column. df.dtypes gives you an overview of column types.

This helps you:

Understand the size and complexity of the dataset
Start identifying which fields are potential features or labels
Spot data types that might need cleaning (e.g., numbers stored as text)
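The three attributes above can be printed together in a few lines. The sketch below uses a small hypothetical DataFrame (`df_sample`, with made-up values) in which a price column has deliberately been stored as text, so the data type listing reveals a column that needs cleaning.

```python
import pandas as pd

# Hypothetical rows with made-up values; UnitPrice is deliberately text
df_sample = pd.DataFrame(
    {
        "TrackId": [1, 2, 3],
        "Name": ["Track A", "Track B", "Track C"],
        "Milliseconds": [200000, 250000, 300000],
        "UnitPrice": ["0.99", "0.99", "1.99"],  # numbers stored as text
    }
)

# Shape is a (rows, columns) tuple
n_rows, n_cols = df_sample.shape
print(f"{n_rows} rows, {n_cols} columns")

# Column names as a plain Python list
print(df_sample.columns.tolist())

# One dtype per column; UnitPrice shows up as a text type, not a float
print(df_sample.dtypes)

# One possible fix once the column is confirmed to hold only numbers
df_sample["UnitPrice"] = pd.to_numeric(df_sample["UnitPrice"])
print(df_sample.dtypes)
```

Whether a conversion like `pd.to_numeric` is appropriate depends on the data; it is shown here only to illustrate what a dtype check can lead to.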

5.1.2.2. Summary Overview

Once you have previewed the data and built a basic understanding, it helps to look at a summary of the dataset as a whole. Such a summary tells you:

How many non-null values each column has
What data types are assigned
Memory usage

Pandas provides the .info() method for this.

Example: Using `.info()`

Let’s use the tracks table loaded earlier, which has more columns than albums:

# Show a concise summary of the dataframe
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3503 entries, 0 to 3502
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TrackId       3503 non-null   int64  
 1   Name          3503 non-null   object 
 2   AlbumId       3503 non-null   int64  
 3   MediaTypeId   3503 non-null   int64  
 4   GenreId       3503 non-null   int64  
 5   Composer      2525 non-null   object 
 6   Milliseconds  3503 non-null   int64  
 7   Bytes         3503 non-null   int64  
 8   UnitPrice     3503 non-null   float64
dtypes: float64(1), int64(6), object(2)
memory usage: 246.4+ KB
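In the output above, Composer has 2525 non-null values out of 3503 entries, so 978 rows are missing a composer. The trailing "+" in the memory figure indicates that pandas reports an estimate when text columns are present. The sketch below shows one way to get the missing counts directly and to request a fuller memory measurement; `df_sample` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows with made-up values; two composers are missing
df_sample = pd.DataFrame(
    {
        "Name": ["Track A", "Track B", "Track C", "Track D"],
        "Composer": ["Composer X", None, "Composer Y", None],
    }
)

# Count missing values per column (the complement of the non-null count)
missing_per_column = df_sample.isna().sum()
print(missing_per_column)

# memory_usage="deep" asks pandas to measure the stored strings themselves
df_sample.info(memory_usage="deep")
```

The deep measurement can be slow on very large tables, so the default estimate is usually enough for a first look.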

5.1.2.3. Statistical Summaries

If you have numeric data, statistical summaries give you useful metrics like:

mean
min and max
standard deviation
percentiles

These help you quickly see the distribution and spread of values in each numeric column.

Pandas provides the .describe() method to generate these statistics.

Example: Using `.describe()`

# Show descriptive statistics for numeric columns
df_tracks.describe()

	TrackId	AlbumId	MediaTypeId	GenreId	Milliseconds	Bytes	UnitPrice
count	3503.000000	3503.000000	3503.000000	3503.000000	3.503000e+03	3.503000e+03	3503.000000
mean	1752.000000	140.929489	1.208393	5.725378	3.935992e+05	3.351021e+07	1.050805
std	1011.373324	81.775395	0.580443	6.190204	5.350054e+05	1.053925e+08	0.239006
min	1.000000	1.000000	1.000000	1.000000	1.071000e+03	3.874700e+04	0.990000
25%	876.500000	70.500000	1.000000	1.000000	2.072810e+05	6.342566e+06	0.990000
50%	1752.000000	141.000000	1.000000	3.000000	2.556340e+05	8.107896e+06	0.990000
75%	2627.500000	212.000000	1.000000	7.000000	3.216450e+05	1.026679e+07	0.990000
max	3503.000000	347.000000	5.000000	25.000000	5.286953e+06	1.059546e+09	1.990000
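One way to read this table: the mean of Milliseconds (about 393,599) sits well above the median (255,634), which suggests that a few very long tracks pull the average up. The sketch below reproduces that pattern on a tiny hypothetical column (`df_sample`, made-up values), requests custom percentiles through the `percentiles` parameter, and pulls individual statistics out of the result.

```python
import pandas as pd

# Hypothetical durations with one extreme value
df_sample = pd.DataFrame({"Milliseconds": [1000, 2000, 3000, 4000, 100000]})

# Request the 10th and 90th percentiles instead of the default quartiles
summary = df_sample.describe(percentiles=[0.1, 0.9])
print(summary)

# The result is itself a DataFrame, so single statistics can be looked up
mean_ms = summary.loc["mean", "Milliseconds"]
max_ms = summary.loc["max", "Milliseconds"]
median_ms = df_sample["Milliseconds"].median()

# A mean far above the median hints at a skewed column
print(f"mean={mean_ms}, median={median_ms}, max={max_ms}")
```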

If your dataset has categorical columns and you want to include them in the summary, you can pass include="all":

# Describe all columns, including object types
df_tracks.describe(include="all")

	TrackId	Name	AlbumId	MediaTypeId	GenreId	Composer	Milliseconds	Bytes	UnitPrice
count	3503.000000	3503	3503.000000	3503.000000	3503.000000	2525	3.503000e+03	3.503000e+03	3503.000000
unique	NaN	3257	NaN	NaN	NaN	852	NaN	NaN	NaN
top	NaN	The Trooper	NaN	NaN	NaN	Steve Harris	NaN	NaN	NaN
freq	NaN	5	NaN	NaN	NaN	80	NaN	NaN	NaN
mean	1752.000000	NaN	140.929489	1.208393	5.725378	NaN	3.935992e+05	3.351021e+07	1.050805
std	1011.373324	NaN	81.775395	0.580443	6.190204	NaN	5.350054e+05	1.053925e+08	0.239006
min	1.000000	NaN	1.000000	1.000000	1.000000	NaN	1.071000e+03	3.874700e+04	0.990000
25%	876.500000	NaN	70.500000	1.000000	1.000000	NaN	2.072810e+05	6.342566e+06	0.990000
50%	1752.000000	NaN	141.000000	1.000000	3.000000	NaN	2.556340e+05	8.107896e+06	0.990000
75%	2627.500000	NaN	212.000000	1.000000	7.000000	NaN	3.216450e+05	1.026679e+07	0.990000
max	3503.000000	NaN	347.000000	5.000000	25.000000	NaN	5.286953e+06	1.059546e+09	1.990000
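With `include="all"`, every cell that does not apply to a column's type is filled with NaN, which makes the table wide and sparse. A possible alternative, sketched below on a hypothetical `df_sample` with made-up values, is to summarize only the non-numeric columns by excluding numeric types.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with made-up values; one composer is missing
df_sample = pd.DataFrame(
    {
        "TrackId": [1, 2, 3, 4],
        "Composer": ["Composer X", "Composer X", "Composer Y", None],
    }
)

# Exclude numeric columns so only the text column is described
text_summary = df_sample.describe(exclude=[np.number])
print(text_summary)

# count ignores missing values; top is the most frequent value, freq its count
top_composer = text_summary.loc["top", "Composer"]
top_count = text_summary.loc["freq", "Composer"]
```

Here the summary contains only the rows count, unique, top, and freq, matching the first four rows of the combined table above.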

This combination of .info() and .describe() gives you a comprehensive overview of your dataset before you begin cleaning or modeling.
