5.1.1. Data Nomenclature

Before diving into data, let’s first familiarize ourselves with some common terminology used to refer to different components of data.

When working with datasets, especially in tabular form, we use a small set of terms to describe the different parts of the data. These terms are common in machine learning, data analysis, and statistics, and they extend to non-tabular or unstructured data as well.

Example dataset: the tracks table from the Chinook sample database. We will use this dataset in all the examples below.

import sqlite3
import pandas as pd

# Connect to the Chinook sample database
conn = sqlite3.connect("../data/chinook.db")
# Load the 'tracks' table from the Chinook database
df_tracks = pd.read_sql_query("SELECT * FROM tracks", conn)
# Close the connection; the data now lives in the DataFrame
conn.close()

# View the first 5 rows of the dataset
df_tracks.head(5)
   TrackId  Name                                     AlbumId  MediaTypeId  GenreId  Composer                                           Milliseconds  Bytes     UnitPrice
0  1        For Those About To Rock (We Salute You)  1        1            1        Angus Young, Malcolm Young, Brian Johnson          343719        11170334  0.99
1  2        Balls to the Wall                        2        2            1        None                                               342562        5510424   0.99
2  3        Fast As a Shark                          3        2            1        F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho...  230619        3990994   0.99
3  4        Restless and Wild                        3        2            1        F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...  252051        4331779   0.99
4  5        Princess of the Dawn                     3        2            1        Deaffy & R.A. Smith-Diesel                         375418        6290521   0.99

5.1.1.1. Record

A record is a single unit in the dataset, representing one observation. In tabular data, a record corresponds to a row. A record is also called a sample, object, or instance.

In our dataset, the first row (index 0) is one record:

print(df_tracks.iloc[0])
TrackId                                                 1
Name              For Those About To Rock (We Salute You)
AlbumId                                                 1
MediaTypeId                                             1
GenreId                                                 1
Composer        Angus Young, Malcolm Young, Brian Johnson
Milliseconds                                       343719
Bytes                                            11170334
UnitPrice                                            0.99
Name: 0, dtype: object
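
The same idea can be sketched without pandas. The snippet below is only an illustration: it assumes the first two rows shown above are typed in by hand as Python dictionaries, and the names records and first_record are ours, not part of the dataset. Each dictionary is one record.

```python
# Two records copied by hand from the first rows of the tracks table above.
records = [
    {"TrackId": 1, "Name": "For Those About To Rock (We Salute You)",
     "AlbumId": 1, "MediaTypeId": 1, "GenreId": 1,
     "Composer": "Angus Young, Malcolm Young, Brian Johnson",
     "Milliseconds": 343719, "Bytes": 11170334, "UnitPrice": 0.99},
    {"TrackId": 2, "Name": "Balls to the Wall",
     "AlbumId": 2, "MediaTypeId": 2, "GenreId": 1,
     "Composer": None,
     "Milliseconds": 342562, "Bytes": 5510424, "UnitPrice": 0.99},
]

# One record = one observation = one row of the table.
first_record = records[0]

print(len(records))          # number of records in this small sample
print(first_record["Name"])  # one value belonging to the first record
```

Whether a record is stored as a DataFrame row, a dictionary, or a database row, it still describes exactly one observation.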

5.1.1.2. Feature

A feature is a property or characteristic that describes each record. In tabular data, a feature corresponds to a column. A feature is also called an attribute, field, or covariate.

# List all column names
print("Columns:", df_tracks.columns.tolist())
print()
# Show data types
print("Data Types:")
print(df_tracks.dtypes)
Columns: ['TrackId', 'Name', 'AlbumId', 'MediaTypeId', 'GenreId', 'Composer', 'Milliseconds', 'Bytes', 'UnitPrice']

Data Types:
TrackId           int64
Name             object
AlbumId           int64
MediaTypeId       int64
GenreId           int64
Composer         object
Milliseconds      int64
Bytes             int64
UnitPrice       float64
dtype: object
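
To make the row/column distinction concrete, here is a minimal pure-Python sketch. It assumes three of the nine features are copied by hand from the five rows shown earlier; the variable names are illustrative only. A feature is then the list of values stored under one key across all records.

```python
# A hand-typed sample of three features for the first five tracks.
records = [
    {"Name": "For Those About To Rock (We Salute You)", "Milliseconds": 343719, "UnitPrice": 0.99},
    {"Name": "Balls to the Wall", "Milliseconds": 342562, "UnitPrice": 0.99},
    {"Name": "Fast As a Shark", "Milliseconds": 230619, "UnitPrice": 0.99},
    {"Name": "Restless and Wild", "Milliseconds": 252051, "UnitPrice": 0.99},
    {"Name": "Princess of the Dawn", "Milliseconds": 375418, "UnitPrice": 0.99},
]

# The feature names are the keys shared by every record.
feature_names = list(records[0].keys())

# One feature = the values of one key across all records (a column).
milliseconds = [record["Milliseconds"] for record in records]

# Python type of each feature, read from the first record.
feature_types = {name: type(value).__name__ for name, value in records[0].items()}

print(feature_names)
print(milliseconds)
print(feature_types)
```

The Python types str, int, and float play the role that the pandas dtypes object, int64, and float64 play in the output above.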

5.1.1.3. Dimension

Dimension refers to the number of features in the dataset. In the example above, there are 9 features (not counting the index). In some literature, dimension instead refers to the shape of the dataset: number of records × number of features.

# Check the shape
print("Shape:", df_tracks.shape)
Shape: (3503, 9)
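
Both meanings of dimension can be computed by hand. The sketch below assumes a small sample: the five rows shown earlier, restricted to their seven numeric columns and typed in as tuples. The names rows, n_records, and n_features are ours.

```python
# Five rows copied by hand from the table shown earlier, restricted to the
# numeric columns: TrackId, AlbumId, MediaTypeId, GenreId, Milliseconds,
# Bytes, UnitPrice.
rows = [
    (1, 1, 1, 1, 343719, 11170334, 0.99),
    (2, 2, 2, 1, 342562, 5510424, 0.99),
    (3, 3, 2, 1, 230619, 3990994, 0.99),
    (4, 3, 2, 1, 252051, 4331779, 0.99),
    (5, 3, 2, 1, 375418, 6290521, 0.99),
]

n_records = len(rows)             # number of rows
n_features = len(rows[0])         # number of columns: dimension in the first sense
shape = (n_records, n_features)   # dimension in the second sense

print(n_features)
print(shape)
```

For this sample the dimension is 7 and the shape is (5, 7); for the full table it is 9 and (3503, 9), as reported by df_tracks.shape.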

5.1.1.4. Feature Vector

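A feature vector is the collection of all feature values for a single record. It can be represented as a point in a multi-dimensional feature space, with one axis per feature. When the dataset has a label (the value we want to predict), the label is not part of the feature vector.

Example (no column is treated as a label here, so the vector contains every value in the first record):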

print(df_tracks.iloc[0].to_list())
[np.int64(1), 'For Those About To Rock (We Salute You)', np.int64(1), np.int64(1), np.int64(1), 'Angus Young, Malcolm Young, Brian Johnson', np.int64(343719), np.int64(11170334), np.float64(0.99)]
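
The following sketch shows how a label is kept out of the feature vector, and what "a point in feature space" means in practice. It rests on assumptions made purely for illustration: GenreId is treated as the label, and only the three numeric features Milliseconds, Bytes, and UnitPrice are used. The helper split_record is hypothetical, not part of pandas or the dataset.

```python
import math

# Hand-typed values for the first two tracks shown earlier.
track_0 = {"Milliseconds": 343719, "Bytes": 11170334, "UnitPrice": 0.99, "GenreId": 1}
track_1 = {"Milliseconds": 342562, "Bytes": 5510424, "UnitPrice": 0.99, "GenreId": 1}

LABEL = "GenreId"  # assumption for illustration: the value we want to predict
FEATURES = ["Milliseconds", "Bytes", "UnitPrice"]  # illustrative choice of features


def split_record(record):
    """Return (feature_vector, label) for one record."""
    feature_vector = [record[name] for name in FEATURES]
    return feature_vector, record[LABEL]


x0, y0 = split_record(track_0)
x1, y1 = split_record(track_1)

# Each feature vector is a point in a 3-dimensional feature space,
# so the distance between two records can be measured.
distance = math.dist(x0, x1)

print(x0, y0)
print(x1, y1)
print(distance)
```

Here the distance is dominated by Bytes, because the three features are on very different scales.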

5.1.1.5. Classes

Classes are the set of possible categories a feature can take. The term usually applies to categorical features, not continuous numerical ones.

Example: GenreId takes only a limited set of distinct values.

print(df_tracks['GenreId'].unique())
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25]
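
Listing the classes is often followed by counting how many records fall into each one. A minimal sketch, assuming only the MediaTypeId values of the five rows shown earlier (typed in by hand), could look like this:

```python
from collections import Counter

# MediaTypeId values of the first five tracks shown earlier.
media_type_ids = [1, 2, 2, 2, 2]

classes = sorted(set(media_type_ids))    # distinct categories
class_counts = Counter(media_type_ids)   # number of records per class

print(classes)
print(class_counts)
```

In this small sample MediaTypeId has two classes, and four of the five records belong to class 2.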