3.2.1. Data Dictionary#

Before storing data in a structured format, it is useful to develop an understanding of what the data represents. We record these observations in a concise table, referred to as a data dictionary. While other representations are possible, a data dictionary provides a centralized record of findings and serves as a useful reference for future work.

You may create the data dictionary in any format; however, a tabular format is generally preferred as it provides clear, intuitive, and easy access to information.

The data dictionary typically has the following structure:

  • attribute_name: Name of the feature or attribute

  • type: Whether the attribute is qualitative or quantitative

  • subtype: Whether the attribute is nominal, ordinal, interval, or ratio

  • numeric_nature: Whether the attribute is discrete or continuous. This applies only to quantitative attributes

  • source: Where the variable was sourced from

  • description: A textual explanation describing the meaning of the variable and its significance

  • data_type: The underlying data representation, such as text, number, or date
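As a rough sketch, one entry of such a dictionary could be modeled as a small Python record. The class name `DictionaryEntry`, the `SUBTYPES` lookup, and the sample values are illustrative assumptions rather than a standard API; the fields simply mirror the structure listed above, and the two checks encode the rules that a subtype must match its type and that numeric_nature applies only to quantitative attributes.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed subtypes for each type, following the classification in this section.
SUBTYPES = {
    "qualitative": {"nominal", "ordinal"},
    "quantitative": {"interval", "ratio"},
}


@dataclass
class DictionaryEntry:
    """One row of a data dictionary (hypothetical helper class)."""
    attribute_name: str
    type: str                      # "qualitative" or "quantitative"
    subtype: str                   # "nominal", "ordinal", "interval", or "ratio"
    numeric_nature: Optional[str]  # "discrete", "continuous", or None
    source: str
    description: str
    data_type: str                 # e.g. "TEXT", "INTEGER", "DATE"

    def __post_init__(self):
        # The subtype must belong to the declared type.
        if self.subtype not in SUBTYPES.get(self.type, set()):
            raise ValueError(f"{self.subtype!r} is not a subtype of {self.type!r}")
        # numeric_nature applies only to quantitative attributes.
        if self.type == "qualitative" and self.numeric_nature is not None:
            raise ValueError("numeric_nature applies only to quantitative attributes")


# Made-up example entry; the source and description are placeholders.
entry = DictionaryEntry(
    attribute_name="age",
    type="quantitative",
    subtype="ratio",
    numeric_nature="discrete",
    source="Survey",
    description="Age of the respondent in completed years",
    data_type="INTEGER",
)
print(entry.attribute_name, entry.type, entry.subtype)
```

Constructing an entry with an inconsistent combination, such as a qualitative attribute with subtype "ratio", raises a `ValueError`, which catches classification mistakes at the time the dictionary is written.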

3.2.1.1. Types of Attributes#

The nature of the data in each column often differs considerably. This section introduces the classes into which attributes can be grouped, which helps us better understand the nature of our data.

Attributes are classified by their type. Here, type refers to the physical meaning of the property, not the variable type used in programming languages.

Quantitative vs Qualitative#

The first broad classification of attributes is based on whether they represent a quantity (quantitative) or a quality (qualitative).

  • Quantitative attributes represent measurable magnitudes. Examples: age, income, weight, temperature.

  • Qualitative attributes describe non-numeric characteristics or categories. Examples: color (red, green), marital status (married, single).

Note

A numeric feature is not necessarily quantitative. We must always consider the physical meaning of the attribute. For instance, suppose we have a column named Location with values 1, 2, and 3, where:

  • 1 → Metro

  • 2 → Urban

  • 3 → Rural

Even though these values are numbers, the attribute is qualitative because the numbers are merely category labels, not magnitudes.
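The point of this note can be illustrated with a short sketch. The sample column `codes` is made up for illustration; only the code-to-label mapping comes from the note.

```python
from statistics import mean

# Coding scheme from the note above.
LOCATION_LABELS = {1: "Metro", 2: "Urban", 3: "Rural"}

codes = [1, 3, 3, 2, 3]  # made-up sample column

# Arithmetic on the codes runs without error, but the result has no
# physical meaning: a mean of 2.4 does not correspond to any location.
print(mean(codes))

# Decoding to labels makes the qualitative nature explicit.
locations = [LOCATION_LABELS[code] for code in codes]
print(locations)
```

The language will happily compute with the codes, so it is up to the analyst (and the data dictionary) to record that the column is qualitative.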

For example, for a dataset with the columns TID, Refund, Marital Status, Taxable Income, and Cheat, the classification might look like this:

  • Qualitative: TID, Refund, Marital Status, Cheat

  • Quantitative: Taxable Income

Subclasses of Qualitative and Quantitative Data#

Qualitative#

Qualitative attributes can be further divided into two subtypes.

  • Nominal: Values are used only to distinguish categories. They have no inherent order or magnitude. Examples: employee ID, zip code, gender.

  • Ordinal: Similar to nominal, but the categories have a meaningful order. There is still no magnitude, but there is a ranking. Example: Customer satisfaction score (negative, slightly negative, neutral, slightly positive, positive).
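A minimal sketch of what ordering adds, using the satisfaction levels from the example above. The `responses` list and the `RANK` helper are assumptions made for illustration.

```python
# Satisfaction levels from the example above, lowest to highest.
SATISFACTION_ORDER = [
    "negative",
    "slightly negative",
    "neutral",
    "slightly positive",
    "positive",
]
# Position of each level in the ordering (hypothetical helper).
RANK = {level: position for position, level in enumerate(SATISFACTION_ORDER)}

responses = ["neutral", "positive", "negative", "slightly positive"]  # made-up data

# Ordinal data supports ranking, so sorting and comparisons are meaningful.
ranked = sorted(responses, key=RANK.get)
print(ranked)

# The rank numbers encode order only. The gap between two ranks is not a
# measured magnitude, so differences or averages of ranks are not meaningful.
is_higher = RANK["positive"] > RANK["neutral"]
print(is_higher)
```

For a nominal attribute such as employee ID, no such ordering exists, and the only meaningful comparison is whether two values are equal.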

Quantitative#

Quantitative attributes can likewise be split into two subtypes.

  • Interval: The difference (interval) between values is meaningful and there is a unit of measurement, but ratios are not meaningful because the zero point is arbitrary. Example: Temperature in °F or °C. The difference between 95°F and 75°F is meaningful (20°F), but 100°F is not “twice as hot” as 50°F.

  • Ratio: Like interval data, but here both differences and ratios are meaningful. Examples: age, weight, length.

Note

Temperature in °C or °F is interval data because the zero point of each scale is arbitrary, so ratios don’t make physical sense. Temperature in kelvin, however, is ratio data because 0 K is absolute zero: 100 K is twice as hot as 50 K.
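The following sketch works through the numbers from the interval example. The function name `fahrenheit_to_kelvin` is chosen here for illustration; the conversion formula itself is the standard one.

```python
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to kelvin."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


# Differences are meaningful on an interval scale.
difference_f = 95 - 75
print(difference_f)

# A ratio of Fahrenheit readings can be computed, but it is not physically
# meaningful because 0°F is an arbitrary zero point.
naive_ratio = 100 / 50
print(naive_ratio)

# Converting to kelvin first gives the physically meaningful ratio,
# which is far from 2.
kelvin_ratio = fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)
print(round(kelvin_ratio, 3))
```

On the kelvin scale, 100°F is only about 10% hotter than 50°F, which is why the naive ratio of 2 is misleading.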

Note

Dates are a tricky case. One could argue that a date is qualitative, since it labels a particular day without expressing a magnitude. However, one could also treat dates as quantitative, since differences between two dates (in days) are meaningful.
Typically, dates are classified as interval data, but the choice depends on the interpretation relevant to the problem at hand.
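Python's standard `datetime` module happens to reflect the interval interpretation, as this small sketch shows. The two dates are arbitrary examples.

```python
from datetime import date

start = date(2024, 1, 1)
end = date(2024, 1, 31)

# Differences between dates are meaningful (interval-like behavior).
elapsed = end - start
print(elapsed.days)

# Adding two dates is undefined, just as ratios of interval data are
# meaningless; the datetime module raises TypeError for this operation.
try:
    start + end
except TypeError:
    print("date + date is not defined")
```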

Numerical Data: Discrete vs Continuous#

Numerical attributes can be further divided into discrete and continuous:

  • Discrete: Attributes with a finite or countably infinite set of possible values. Examples: counts, such as the number of children.

  • Continuous: Attributes that can take any real value within a range. Examples: temperature, height, weight.

Symmetric vs Asymmetric Attributes#

Another way to classify numeric attributes is by considering the importance of zero values:

  • Asymmetric attributes: Only non-zero values are important. Example: In a word count vector of an article, the fact that a word appears 45 times is important, but the absence of another word is not.

  • Symmetric attributes: Zero values are just as meaningful as non-zero values. Example: Temperature, where both the value itself and the fact that it could be zero are important to interpretation.
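The word count case can be sketched with `collections.Counter`, which stores only the words that actually occur. The `article` text is made up for illustration.

```python
from collections import Counter

article = "data about data is called metadata"  # made-up text

# Counter stores only the words that occur, i.e. the non-zero counts.
word_counts = Counter(article.split())
print(word_counts["data"])

# Absent words are not stored, yet a lookup still returns 0.
print(word_counts["temperature"])

# Only the distinct words that appear take up space.
print(len(word_counts))
```

This kind of sparse representation is a natural fit for asymmetric attributes, because the many zero entries carry little information and need not be stored.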

3.2.1.2. Example#

Consider a sample dataset that represents a small music catalog. Each row corresponds to a song, along with basic information about the artist and album it belongs to. The dataset could be used for tasks such as browsing music, generating playlists, or basic analytics like counting songs per artist.

Sample Data#

| track_id | track_name | artist_name | album_name | duration_sec | price_usd |
|---|---|---|---|---|---|
| 1 | Morning Light | Aurora Sky | First Dawn | 210 | 0.99 |
| 2 | Silent Roads | Aurora Sky | First Dawn | 185 | 0.99 |
| 3 | City Nights | Neon Pulse | After Dark | 240 | 1.29 |
| 4 | Echoes | Neon Pulse | Reflections | 200 | 1.29 |
| 5 | Open Horizons | Wild Trails | Long Journey | 275 | 0.89 |
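As a small illustration of the analytics mentioned above, the sample rows can be loaded into plain Python tuples and the songs counted per artist. Representing the rows as tuples is an assumption made here for brevity; any tabular structure would do.

```python
from collections import Counter

# Sample rows as (track_id, track_name, artist_name, album_name,
# duration_sec, price_usd).
tracks = [
    (1, "Morning Light", "Aurora Sky", "First Dawn", 210, 0.99),
    (2, "Silent Roads", "Aurora Sky", "First Dawn", 185, 0.99),
    (3, "City Nights", "Neon Pulse", "After Dark", 240, 1.29),
    (4, "Echoes", "Neon Pulse", "Reflections", 200, 1.29),
    (5, "Open Horizons", "Wild Trails", "Long Journey", 275, 0.89),
]

# Count songs per artist (artist_name is the third field).
songs_per_artist = Counter(track[2] for track in tracks)
print(songs_per_artist)

# duration_sec is a ratio attribute, so summing it is meaningful.
total_duration_sec = sum(track[4] for track in tracks)
print(total_duration_sec)
```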

Sample Data Dictionary#

The table below documents each attribute in the dataset, describing its meaning, type, and representation.

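| attribute_name | data_type | type | subtype | numeric_nature | source | description |
|---|---|---|---|---|---|---|
| track_id | INTEGER | Qualitative | Nominal | N/A | System | Unique identifier for each track |
| track_name | TEXT | Qualitative | Nominal | N/A | Metadata | Name of the song |
| artist_name | TEXT | Qualitative | Nominal | N/A | Metadata | Name of the artist who performed the song |
| album_name | TEXT | Qualitative | Nominal | N/A | Metadata | Name of the album the song belongs to |
| duration_sec | INTEGER | Quantitative | Ratio | Discrete | Audio data | Length of the track in seconds |
| price_usd | NUMERIC | Quantitative | Ratio | Continuous | Pricing data | Price of the track in US dollars |

Note that track_id is stored as an integer but is classified as qualitative and nominal: its values only distinguish one track from another and carry no magnitude.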

This example demonstrates how a data dictionary complements a dataset by clearly defining each attribute, making the data easier to understand, use, and maintain over time.
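One practical use of the dictionary is checking incoming rows against the documented data types. The sketch below is one possible approach, not a prescribed method: the mapping in `PYTHON_TYPES` from dictionary data types to Python types and the helper `find_type_mismatches` are assumptions introduced for illustration.

```python
# data_type column of the sample data dictionary, keyed by attribute name.
DATA_TYPES = {
    "track_id": "INTEGER",
    "track_name": "TEXT",
    "artist_name": "TEXT",
    "album_name": "TEXT",
    "duration_sec": "INTEGER",
    "price_usd": "NUMERIC",
}

# Assumed mapping from dictionary data types to Python types.
PYTHON_TYPES = {"INTEGER": int, "TEXT": str, "NUMERIC": (int, float)}


def find_type_mismatches(row: dict) -> list:
    """Return the names of attributes that are missing or have the wrong type."""
    return [
        name
        for name, data_type in DATA_TYPES.items()
        if not isinstance(row.get(name), PYTHON_TYPES[data_type])
    ]


good_row = {
    "track_id": 1,
    "track_name": "Morning Light",
    "artist_name": "Aurora Sky",
    "album_name": "First Dawn",
    "duration_sec": 210,
    "price_usd": 0.99,
}
# Same row, but with the duration stored as text instead of an integer.
bad_row = dict(good_row, duration_sec="210")

print(find_type_mismatches(good_row))
print(find_type_mismatches(bad_row))
```

A check like this only covers the data_type column; the type, subtype, and numeric_nature columns describe meaning and still need human judgment.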

We will learn more about the types and numeric nature in the Data Overview section later.