5.1.2. Data Preview#
Before doing any analysis, it’s essential to look at the data in its raw form.
5.1.2.1. First Look at the Data#
For large datasets, it might not be practical to view all rows, but you should always inspect at least the first 5–10 records manually before doing anything else.
This exercise helps you:
Familiarize yourself with the columns and their contents
Spot obvious issues like strange encodings or missing values
Start thinking about what could be features and targets
Pandas provides the .head() method to quickly preview rows.
It takes an optional parameter indicating the number of rows to display.
Example: Previewing Data with .head()#
Let’s use the Chinook database, which contains tables such as artists, albums, and tracks.
import sqlite3
import pandas as pd
# Connect to Chinook sample database
conn = sqlite3.connect("../data/chinook.db")
# Load the 'albums' table into a dataframe
df_albums = pd.read_sql_query("SELECT * FROM albums", conn)
# Show the first 5 rows
df_albums.head()
| AlbumId | Title | ArtistId | |
|---|---|---|---|
| 0 | 1 | For Those About To Rock We Salute You | 1 |
| 1 | 2 | Balls to the Wall | 2 |
| 2 | 3 | Restless and Wild | 2 |
| 3 | 4 | Let There Be Rock | 1 |
| 4 | 5 | Big Ones | 3 |
# Load the 'albums' table into a dataframe
df_tracks = pd.read_sql_query("SELECT * FROM tracks", conn)
# Show the first 5 rows
df_tracks.head()
| TrackId | Name | AlbumId | MediaTypeId | GenreId | Composer | Milliseconds | Bytes | UnitPrice | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
| 1 | 2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
| 2 | 3 | Fast As a Shark | 3 | 2 | 1 | F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho... | 230619 | 3990994 | 0.99 |
| 3 | 4 | Restless and Wild | 3 | 2 | 1 | F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D... | 252051 | 4331779 | 0.99 |
| 4 | 5 | Princess of the Dawn | 3 | 2 | 1 | Deaffy & R.A. Smith-Diesel | 375418 | 6290521 | 0.99 |
Sometimes datasets have a large number of columns, making it hard to explore everything in a scrollable table. It’s good practice to print:
The shape (number of rows and columns). You can get this using
df.shape.The list of column names. Pandas provides
df.columnsfor this.The data types of each column.
df.dtypesgives you an overview of column types.
This helps you:
Understand the size and complexity of the dataset
Start identifying which fields are potential features or labels
Spot data types that might need cleaning (e.g., numbers stored as text)
5.1.2.2. Summary Overview#
Once we have previewed the data and built a basic understanding, it helps to look at summary views of individual features and the dataset as a whole. This further improves your familiarity with:
How many non-null values each column has
What data types are assigned
Memory usage
Pandas provides the .info() method for this.
Example: Using .info()#
Let’s switch to a richer example using the tracks table:
# Show a concise summary of the dataframe
df_tracks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3503 entries, 0 to 3502
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TrackId 3503 non-null int64
1 Name 3503 non-null object
2 AlbumId 3503 non-null int64
3 MediaTypeId 3503 non-null int64
4 GenreId 3503 non-null int64
5 Composer 2525 non-null object
6 Milliseconds 3503 non-null int64
7 Bytes 3503 non-null int64
8 UnitPrice 3503 non-null float64
dtypes: float64(1), int64(6), object(2)
memory usage: 246.4+ KB
5.1.2.3. Statistical Summaries#
If you have numeric data, statistical summaries give you useful metrics like:
mean
min and max
standard deviation
percentiles
These help you quickly see the distribution and spread of values in each numeric column.
Pandas provides the .describe() method to generate these statistics.
Example: Using .describe()#
# Show descriptive statistics for numeric columns
df_tracks.describe()
| TrackId | AlbumId | MediaTypeId | GenreId | Milliseconds | Bytes | UnitPrice | |
|---|---|---|---|---|---|---|---|
| count | 3503.000000 | 3503.000000 | 3503.000000 | 3503.000000 | 3.503000e+03 | 3.503000e+03 | 3503.000000 |
| mean | 1752.000000 | 140.929489 | 1.208393 | 5.725378 | 3.935992e+05 | 3.351021e+07 | 1.050805 |
| std | 1011.373324 | 81.775395 | 0.580443 | 6.190204 | 5.350054e+05 | 1.053925e+08 | 0.239006 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.071000e+03 | 3.874700e+04 | 0.990000 |
| 25% | 876.500000 | 70.500000 | 1.000000 | 1.000000 | 2.072810e+05 | 6.342566e+06 | 0.990000 |
| 50% | 1752.000000 | 141.000000 | 1.000000 | 3.000000 | 2.556340e+05 | 8.107896e+06 | 0.990000 |
| 75% | 2627.500000 | 212.000000 | 1.000000 | 7.000000 | 3.216450e+05 | 1.026679e+07 | 0.990000 |
| max | 3503.000000 | 347.000000 | 5.000000 | 25.000000 | 5.286953e+06 | 1.059546e+09 | 1.990000 |
If your dataset has categorical columns and you want to include them in the summary, you can pass include="all":
# Describe all columns, including object types
df_tracks.describe(include="all")
| TrackId | Name | AlbumId | MediaTypeId | GenreId | Composer | Milliseconds | Bytes | UnitPrice | |
|---|---|---|---|---|---|---|---|---|---|
| count | 3503.000000 | 3503 | 3503.000000 | 3503.000000 | 3503.000000 | 2525 | 3.503000e+03 | 3.503000e+03 | 3503.000000 |
| unique | NaN | 3257 | NaN | NaN | NaN | 852 | NaN | NaN | NaN |
| top | NaN | The Trooper | NaN | NaN | NaN | Steve Harris | NaN | NaN | NaN |
| freq | NaN | 5 | NaN | NaN | NaN | 80 | NaN | NaN | NaN |
| mean | 1752.000000 | NaN | 140.929489 | 1.208393 | 5.725378 | NaN | 3.935992e+05 | 3.351021e+07 | 1.050805 |
| std | 1011.373324 | NaN | 81.775395 | 0.580443 | 6.190204 | NaN | 5.350054e+05 | 1.053925e+08 | 0.239006 |
| min | 1.000000 | NaN | 1.000000 | 1.000000 | 1.000000 | NaN | 1.071000e+03 | 3.874700e+04 | 0.990000 |
| 25% | 876.500000 | NaN | 70.500000 | 1.000000 | 1.000000 | NaN | 2.072810e+05 | 6.342566e+06 | 0.990000 |
| 50% | 1752.000000 | NaN | 141.000000 | 1.000000 | 3.000000 | NaN | 2.556340e+05 | 8.107896e+06 | 0.990000 |
| 75% | 2627.500000 | NaN | 212.000000 | 1.000000 | 7.000000 | NaN | 3.216450e+05 | 1.026679e+07 | 0.990000 |
| max | 3503.000000 | NaN | 347.000000 | 5.000000 | 25.000000 | NaN | 5.286953e+06 | 1.059546e+09 | 1.990000 |
This combination of .info() and .describe() gives you a comprehensive overview of your dataset before you begin cleaning or modeling.