5.1.1. Data Nomenclature
Before exploring the data, let's familiarize ourselves with some common terminology used to refer to its different components.
When working with datasets, especially in tabular form, we often encounter certain terms that describe different parts of the data. These terms are common in machine learning, data analysis, and statistics, and they can be extended to non-tabular or unstructured data as well.
Example Dataset: The examples below use the tracks table from the Chinook sample database.
import sqlite3
import pandas as pd
# Connect to Chinook sample database
conn = sqlite3.connect("../data/chinook.db")
# Load the 'tracks' table from Chinook database
df_tracks = pd.read_sql_query("SELECT * FROM tracks", conn)
# view first 5 rows of the dataset
df_tracks.head(5)
| | TrackId | Name | AlbumId | MediaTypeId | GenreId | Composer | Milliseconds | Bytes | UnitPrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
| 1 | 2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
| 2 | 3 | Fast As a Shark | 3 | 2 | 1 | F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho... | 230619 | 3990994 | 0.99 |
| 3 | 4 | Restless and Wild | 3 | 2 | 1 | F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D... | 252051 | 4331779 | 0.99 |
| 4 | 5 | Princess of the Dawn | 3 | 2 | 1 | Deaffy & R.A. Smith-Diesel | 375418 | 6290521 | 0.99 |
5.1.1.1. Record
A record is a single unit in the dataset, representing one observation. In tabular data, a record corresponds to a row. It is also referred to as a sample, object, or instance.
In the dataset, the first row (Index = 0) is one record:
print(df_tracks.iloc[0])
TrackId 1
Name For Those About To Rock (We Salute You)
AlbumId 1
MediaTypeId 1
GenreId 1
Composer Angus Young, Malcolm Young, Brian Johnson
Milliseconds 343719
Bytes 11170334
UnitPrice 0.99
Name: 0, dtype: object
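As a minimal sketch of working with records, the snippet below counts the rows and converts each one into a plain dictionary. It assumes a small hand-built frame (the name `df_small` is illustrative) that copies a few columns from the first three rows of the tracks table shown earlier, so it runs without the database file.

```python
import pandas as pd

# Hand-copied subset of the first three rows shown earlier (illustrative only)
df_small = pd.DataFrame(
    {
        "TrackId": [1, 2, 3],
        "Name": [
            "For Those About To Rock (We Salute You)",
            "Balls to the Wall",
            "Fast As a Shark",
        ],
        "Milliseconds": [343719, 342562, 230619],
        "UnitPrice": [0.99, 0.99, 0.99],
    }
)

# Each row is one record, so len() gives the number of records
n_records = len(df_small)
print("Number of records:", n_records)

# Convert the rows to a list of dictionaries, one dictionary per record
records = df_small.to_dict(orient="records")
print(records[0])
```

The same two calls should work unchanged on `df_tracks`, where each dictionary would hold all nine columns.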
5.1.1.2. Feature
A feature is a property or characteristic describing each record. In tabular data, a feature corresponds to a column. It is also called an attribute, field, or covariate.
# List all column names
print("Columns:", df_tracks.columns.tolist())
print()
# Show data types
print("Data Types:")
print(df_tracks.dtypes)
Columns: ['TrackId', 'Name', 'AlbumId', 'MediaTypeId', 'GenreId', 'Composer', 'Milliseconds', 'Bytes', 'UnitPrice']
Data Types:
TrackId int64
Name object
AlbumId int64
MediaTypeId int64
GenreId int64
Composer object
Milliseconds int64
Bytes int64
UnitPrice float64
dtype: object
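Since a feature is a column, selecting one returns the values of that feature across all records. The sketch below also splits features into numeric and non-numeric groups by data type. It assumes the same kind of small hand-built frame (`df_small`, an illustrative name) copied from the first rows of the tracks table.

```python
import pandas as pd

# Hand-copied subset of the first three rows shown earlier (illustrative only)
df_small = pd.DataFrame(
    {
        "TrackId": [1, 2, 3],
        "Name": [
            "For Those About To Rock (We Salute You)",
            "Balls to the Wall",
            "Fast As a Shark",
        ],
        "Milliseconds": [343719, 342562, 230619],
        "UnitPrice": [0.99, 0.99, 0.99],
    }
)

# One feature = one column, returned as a pandas Series
durations = df_small["Milliseconds"]
print(durations.tolist())

# Group feature names by whether their data type is numeric
numeric_features = df_small.select_dtypes(include="number").columns.tolist()
text_features = df_small.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_features)
print("Non-numeric:", text_features)
```

Note that a numeric data type does not by itself make a feature a meaningful quantity: `TrackId` is stored as an integer but acts as an identifier.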
5.1.1.3. Dimension
The dimension of a dataset is its number of features. The example above has 9 features (the index is not counted). Some literature instead uses dimension for the shape of the dataset: number of records × number of features.
# Check the shape
print("Shape:", df_tracks.shape)
Shape: (3503, 9)
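The two meanings of dimension can be read directly off the shape tuple. The sketch below uses a small hand-built frame (`df_small`, an illustrative name, with values copied from the first rows of the tracks table) and also prints pandas' `ndim` attribute, which counts array axes and is a different notion from the number of features.

```python
import pandas as pd

# Hand-copied subset of the first three rows shown earlier (illustrative only)
df_small = pd.DataFrame(
    {
        "TrackId": [1, 2, 3],
        "AlbumId": [1, 2, 3],
        "Milliseconds": [343719, 342562, 230619],
        "UnitPrice": [0.99, 0.99, 0.99],
    }
)

# shape is (number of records, number of features)
n_records, n_features = df_small.shape
print("Records:", n_records)
print("Features (dimension):", n_features)

# ndim counts the axes of the table (rows and columns), not the features
print("Axes:", df_small.ndim)
```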
5.1.1.4. Feature Vector
A feature vector is the collection of all feature values for a single record. It can be represented as a point in a multi-dimensional feature space, with one axis per feature. When the dataset has a label (the value we want to predict), the label is not part of the feature vector.
Example:
print(df_tracks.iloc[0].to_list())
[np.int64(1), 'For Those About To Rock (We Salute You)', np.int64(1), np.int64(1), np.int64(1), 'Angus Young, Malcolm Young, Brian Johnson', np.int64(343719), np.int64(11170334), np.float64(0.99)]
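To illustrate how a label is kept out of the feature vector, the sketch below treats `UnitPrice` as the label. That choice is purely hypothetical and made only for illustration; the frame `df_small` is likewise a hand-built copy of a few numeric columns from the first rows of the tracks table.

```python
import pandas as pd

# Hand-copied numeric columns from the first three rows shown earlier
df_small = pd.DataFrame(
    {
        "AlbumId": [1, 2, 3],
        "GenreId": [1, 1, 1],
        "Milliseconds": [343719, 342562, 230619],
        "Bytes": [11170334, 5510424, 3990994],
        "UnitPrice": [0.99, 0.99, 0.99],
    }
)

# Hypothetical choice of label, for illustration only
label_column = "UnitPrice"

# X holds the features, y holds the label
X = df_small.drop(columns=[label_column])
y = df_small[label_column]

# The feature vector of the first record, as a NumPy array
feature_vector = X.iloc[0].to_numpy()
print(feature_vector)
print("Vector length:", feature_vector.shape[0])
print("Label:", y.iloc[0])
```

Here the vector has four entries, one per remaining feature. In practice, identifier-like columns such as `AlbumId` and `GenreId` would usually be encoded or dropped before being treated as coordinates.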
5.1.1.5. Classes
Classes are the set of possible categories that a feature can take. The term usually applies to categorical features, not continuous numerical ones, and is most often used for the categories of the label in a classification task.
Example: GenreId can take only a limited set of values.
print(df_tracks['GenreId'].unique())
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25]
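Beyond listing the classes, it is often useful to count how many records fall into each one. The sketch below does this for the `MediaTypeId` values of the five rows shown at the top of this section; the Series is built by hand so the snippet runs without the database.

```python
import pandas as pd

# MediaTypeId values copied from the five rows shown earlier
media_type = pd.Series([1, 2, 2, 2, 2], name="MediaTypeId")

# The classes are the distinct values of the feature
classes = sorted(media_type.unique().tolist())
print("Classes:", classes)
print("Number of classes:", media_type.nunique())

# Count the records in each class
counts = media_type.value_counts()
print(counts)

# Mark the feature as categorical so pandas tracks its classes explicitly
as_category = media_type.astype("category")
print(as_category.cat.categories.tolist())
```

Converting to the `category` dtype is one possible design choice: it signals that the integers are labels, not quantities, so arithmetic on them is not meaningful.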