5.3.3. Relating Data Points#
Correlation helps us understand how features relate to one another. In many tasks, however, we are interested in a different question:
How similar are two data points when considering one or more features?
Similarity and distance measures allow us to compare entire examples, not just individual variables. These measures form the foundation of methods such as clustering, nearest neighbors, recommender systems, and anomaly detection.
5.3.3.1. Similarity and Dissimilarity#
Similarity quantifies how alike two objects are, while dissimilarity, often referred to as distance, quantifies how different they are.
High similarity implies the objects are close or alike
High distance implies the objects are far apart or different
Similarity is often normalized to lie between 0 and 1, while distance measures typically start at 0 and grow without a fixed upper bound.
5.3.3.2. Distance Metrics#
A distance function \( d(p, q) \) is considered a valid metric if it satisfies the following properties:
Non-negativity: \( d(p, q) \ge 0 \)
Identity: \( d(p, q) = 0 \) if and only if \( p = q \)
Symmetry: \( d(p, q) = d(q, p) \)
Triangle inequality: \( d(p, q) + d(q, r) \ge d(p, r) \)
Metrics that satisfy these properties behave intuitively in geometric space and are widely used in data analysis.
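These properties can be checked numerically. The sketch below uses the absolute difference between real numbers as a candidate metric (this is the one-dimensional distance defined later in this section) and verifies each property on random points:

```python
import random

def d(p, q):
    # absolute difference: a simple candidate metric on real numbers
    return abs(p, q) if False else abs(p - q)

random.seed(0)
for _ in range(1000):
    p, q, r = (random.uniform(-10, 10) for _ in range(3))
    assert d(p, q) >= 0                            # non-negativity
    assert d(p, q) == d(q, p)                      # symmetry
    assert d(p, q) + d(q, r) >= d(p, r) - 1e-12    # triangle inequality
assert d(p, p) == 0                                # identity
```

A function that fails any of these checks, such as squared difference (which violates the triangle inequality), is a dissimilarity measure but not a metric.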
5.3.3.3. Similarity for Single Attributes#
Similarity can be defined differently depending on the type of attribute being compared.
Nominal Attributes#
Nominal attributes represent categories with no inherent order.
\( d(p, q) = \begin{cases} 0 & \text{if } p = q \\ 1 & \text{if } p \ne q \end{cases} \qquad s(p, q) = 1 - d(p, q) \)
Ordinal Attributes#
Ordinal attributes have a meaningful order but no fixed spacing.
Mapping the values to integer ranks \( 0, 1, \ldots, n - 1 \), where \( n \) is the number of ordinal levels:
\( d(p, q) = \frac{|p - q|}{n - 1} \qquad s(p, q) = 1 - d(p, q) \)
Interval and Ratio Attributes#
Numeric attributes support arithmetic operations.
\( d(p, q) = |p - q| \)
Similarity is often derived by normalization, for example:
\( s(p, q) = \frac{1}{1 + d(p, q)} \)
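The three single-attribute similarities above can be written directly as small functions. This is a minimal sketch; the function names are illustrative, and ordinal values are assumed to already be encoded as integer ranks \( 0, \ldots, n - 1 \):

```python
def nominal_similarity(p, q):
    # categories: 1 if they match, 0 otherwise
    return 1.0 if p == q else 0.0

def ordinal_similarity(p, q, n):
    # p, q are integer ranks in {0, ..., n-1}; n is the number of levels
    return 1.0 - abs(p - q) / (n - 1)

def numeric_similarity(p, q):
    # map the distance |p - q| into (0, 1]
    return 1.0 / (1.0 + abs(p - q))
```

For example, with a five-level ordinal scale, the two extreme ranks 0 and 4 give similarity 0, while adjacent ranks give 0.75.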
5.3.3.4. Similarity for Multi-Attribute Data#
Real datasets typically contain multiple features. In this case, similarity and distance are computed by aggregating differences across dimensions.
Manhattan Distance#
\( d(p, q) = \sum_i |p_i - q_i| \)
This metric emphasizes absolute differences across features.
Euclidean Distance#
\( d(p, q) = \sqrt{\sum_i (p_i - q_i)^2} \)
Euclidean distance is sensitive to scale, which is why normalization and standardization are important.
Minkowski Distance#
\( d(p, q) = \left( \sum_i |p_i - q_i|^r \right)^{1/r} \)
\( r = 1 \): Manhattan distance
\( r = 2 \): Euclidean distance
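Since Manhattan and Euclidean distance are special cases of the Minkowski distance, a single function covers all three. A minimal sketch using NumPy:

```python
import numpy as np

def minkowski(p, q, r):
    # Minkowski distance: (sum of |p_i - q_i|^r) ^ (1/r)
    return float(np.sum(np.abs(p - q) ** r) ** (1.0 / r))

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
manhattan = minkowski(p, q, 1)   # 3 + 4 = 7
euclidean = minkowski(p, q, 2)   # sqrt(9 + 16) = 5
```

As \( r \) grows, the largest coordinate difference increasingly dominates; in the limit \( r \to \infty \) the Minkowski distance becomes the maximum absolute difference (the Chebyshev distance).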
5.3.3.5. Binary Similarity Measures#
For binary attributes, specialized similarity measures are often more appropriate.
Simple Matching Coefficient (SMC)#
SMC is the ratio of matching binary feature values to the total number of binary features. Here \( f_{ab} \) denotes the number of features where \( p \) takes value \( a \) and \( q \) takes value \( b \):
\( \text{SMC} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} \)
Jaccard Similarity#
Jaccard similarity is like SMC, except that matches on the negative class (\( f_{00} \)) are omitted from both numerator and denominator:
\( \text{Jaccard} = \frac{f_{11}}{f_{11} + f_{10} + f_{01}} \)
Jaccard similarity ignores double negatives and is well suited for sparse data.
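The difference between the two measures is easiest to see on a concrete pair of binary vectors. A minimal sketch (function names are illustrative):

```python
def binary_counts(x, y):
    # f_ab = number of positions where x == a and y == b
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00

def smc(x, y):
    f11, f10, f01, f00 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def jaccard(x, y):
    f11, f10, f01, f00 = binary_counts(x, y)
    return f11 / (f11 + f10 + f01)

x = [1, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0]
# f11 = 1, f10 = 1, f01 = 1, f00 = 2
# smc(x, y) = 3/5 = 0.6,  jaccard(x, y) = 1/3
```

On sparse data, where most features are 0 for both objects, \( f_{00} \) dominates and SMC is inflated toward 1; Jaccard avoids this by discarding those shared zeros.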
5.3.3.6. Cosine Similarity#
Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes.
\( \text{Cosine}(p, q) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert} \)
This makes it effective for high-dimensional and sparse data such as text.
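The formula translates directly into code: the dot product divided by the product of the vector norms. A minimal sketch with NumPy:

```python
import numpy as np

def cosine_similarity(p, q):
    # dot product normalized by the product of vector lengths
    return float(p @ q) / (np.linalg.norm(p) * np.linalg.norm(q))

a = np.array([1.0, 1.0, 0.0])
b = np.array([2.0, 2.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 1.0])   # orthogonal to a
```

Here `cosine_similarity(a, b)` is 1 even though the Euclidean distance between `a` and `b` is nonzero, while `cosine_similarity(a, c)` is 0. This magnitude-invariance is why cosine similarity works well for text, where document length should not dominate the comparison.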
5.3.3.7. Correlation as a Similarity Measure#
Pearson correlation can also be interpreted as a similarity measure between standardized vectors. Unlike distance, it captures similarity in pattern of variation rather than absolute values.
This connection explains why correlation often appears alongside cosine similarity in high-dimensional analysis.
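Concretely, Pearson correlation equals the cosine similarity of the mean-centered vectors, which can be verified numerically (a sketch under that standard identity):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([10.0, 20.0, 30.0, 40.0])   # same pattern of variation, different scale

# cosine similarity after subtracting each vector's mean
pc, qc = p - p.mean(), q - q.mean()
cos_centered = float(pc @ qc) / (np.linalg.norm(pc) * np.linalg.norm(qc))

# Pearson correlation from NumPy
pearson = float(np.corrcoef(p, q)[0, 1])
```

Both quantities equal 1 here: the vectors differ greatly in absolute value, yet their pattern of variation is identical.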
5.3.3.8. Summary#
Similarity and distance measures provide a way to compare entire data points across multiple features. The choice of measure depends on attribute type, scale, and the structure of the data.
These measures allow us to move from feature–feature analysis to example–example comparison, enabling tasks such as clustering, retrieval, and pattern discovery.
In the next section, we use these ideas to visualize structure and relationships in high-dimensional data.