Entity Resolution - Identifying Real-World Entities in Noisy Data

Fundamental Theories and Python Implementations

Explore the technical details of fundamental entity resolution approaches using a benchmark dataset.

Adapted from the original article by Tomonori Masui, Towards Data Science


Prerequisites and Installation

Using Python 3.12.9, install the following libraries:

pipenv install pandas numpy requests nltk scikit-learn scipy jupyter spacy
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = None

Overview of Entity Resolution

The standard entity resolution (ER) framework consists of several steps:

  1. Blocking: Reducing the search space by grouping similar records.
  2. Block Processing: Refining blocks to eliminate unnecessary comparisons.
  3. Entity Matching: Comparing records and identifying matches.
  4. Clustering: Grouping matched records into entities.

1. Blocking

This is the first step in entity resolution.

It aims to reduce the search space for identifying records of the same entity by dividing the dataset into smaller, manageable blocks.

These blocks contain records that share similar attributes, making the subsequent comparison more efficient.

2. Block Processing

This step refines the blocks to minimize the number of comparisons.

Unnecessary comparisons are discarded:

  1. the redundant ones, which are repeated across multiple blocks,
  2. the superfluous ones, which involve records unlikely to match.

3. Entity Matching

Comparing records within blocks to find matches based on the similarity of the records.

Various similarity metrics and matching algorithms can be employed to classify pairs of records as matches or non-matches.

4. Clustering

Clustering involves grouping the matched records into clusters based on their similarity.

The created clusters can be used to get a consolidated view of entities.

Entity Resolution workflow

Benchmark Dataset

The dataset, sourced from the database group at the University of Leipzig, is derived from actual records concerning songs from the MusicBrainz database but has been deliberately altered using the DAPO data pollution tool.

This tool injects both duplicates and errors into the dataset; as a result, it contains duplicates for 50% of the original records, spread across two to five sources.

Each record represents a song, with attributes such as artist, title, album, and year.

Please note that this benchmark is a single dataset. If you have multiple data sources for which you want to resolve entities, you need to standardize their data schemas and consolidate them into a unified dataset before proceeding with the subsequent steps.

Loading the Benchmark Dataset

We can load the data with the following code.

import requests
from io import BytesIO


# url = "https://cise.ufl.edu/~cgrant/files/er/musicbrainz_200k.csv"
url = "https://cise.ufl.edu/~cgrant/files/er/musicbrainz_20k.csv"

res = requests.get(url)
df = pd.read_csv(BytesIO(res.content))

# Check the shape of the dataset
df.shape
(193750, 12)

Example records from the dataset

df[df.CID == 41689]
TID CID CTID SourceID id number title length artist album year language
57699 57700 41689 3 4 14690-A031 006-My Father 3m 19sec Bill Cosby When I Was a Kid (2005) NaN Eng.
80716 80717 41689 1 2 MBox281897-HH Bill Cosby - My Father 199 NaN When I Wa sa Kid 05 English
116076 116077 41689 2 3 4534714MB-01 My Father - When I Was a Kid 3.325 Bill-Cosby NaN '05 ENGLISH

CID is the cluster ID; records with the same CID are duplicates (in the example above, all three records represent the same song).

(You can find field descriptions in this link).

Let's focus on English songs...

Cleaning: English-only records

The code below identifies records with cluster IDs that have English songs.

english_cids = df[
    df.language.str.lower().str.contains("^en|^eg", na=False)
].CID.unique()

df = df[df.CID.isin(english_cids)].reset_index(drop=True)

Cleaning: Standardizing string fields

We are also preprocessing some of the string fields to get standardized values.

for col in ["title", "artist", "album"]:
    df[col] = (
        df[col]
        .str.lower()
        .replace("[^a-z0-9]", " ", regex=True)  # replacing special characters with a space
        .replace(" +", " ", regex=True)         # removing consecutive spaces
        .str.strip()                            # removing leading and trailing spaces
    )

df.loc[df.number.notna(), "number"] = (
    df[df.number.notna()]
        .number.replace("[^0-9]", "", regex=True)             # removing non-digits
        .apply(lambda x: str(int(x)) if len(x) > 0 else None) # removing leading zeros
)

Blocking

Blocking is the first step in entity resolution; it groups similar records together based on certain attributes. By doing so, the process narrows its search to comparisons within each block, rather than examining all possible record pairs in the dataset. This significantly reduces the number of comparisons and accelerates the ER process. Because it skips many comparisons, it can also lead to missed true matches. Therefore, blocking should achieve a good balance between efficiency and accuracy.
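
As a rough back-of-the-envelope illustration of the savings (the block count and block sizes below are hypothetical, not measured from our dataset):

n = 193_750                          # records in the dataset
all_pairs = n * (n - 1) // 2         # 18,769,434,375 pairs without blocking

# Suppose blocking yielded 10,000 blocks of roughly 20 records each (hypothetical):
block_sizes = [20] * 10_000
blocked_pairs = sum(s * (s - 1) // 2 for s in block_sizes)  # 1,900,000 pairs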

We will explore three different blocking approaches:

Standard Blocking

Standard blocking partitions the dataset into blocks based on a specific attribute. For example, in our dataset, one might create blocks based on the artist or title field.

This approach is intuitive and easy to implement, but it is very sensitive to noise: the slightest difference in the blocking keys of duplicates places them in different blocks.

Standard Blocking on Artist field

Implementing Standard Blocking

We can get standard blocks with the function below.
The dictionary blocks stores blocking keys (key) and the corresponding indices (idx) of the blocked records.

from collections import defaultdict

def standard_blocking(field_values: pd.Series) -> dict[str, list]:
    blocks = defaultdict(list)
    for idx, key in enumerate(field_values):
        if key is not None:
            blocks[key].append(idx)
    return blocks

Standard blocks for title, artist, and album

We can create three independent standard blocks using the fields of title, artist, and album.

sb_title = standard_blocking(df.title)
sb_artist = standard_blocking(df.artist)
sb_album = standard_blocking(df.album)

Token Blocking

Token blocking focuses on tokenizing attribute values into smaller units, then using these tokens to create blocks for comparison.

  • Tokens are typically single words or small n-grams (substrings of length `n`) extracted from the text.

  • A block is created for every distinct token value: two records land in the same block if they share a token in any of their attributes, regardless of the associated attributes.

  • High recall, due to redundancy (a single record can belong to multiple blocks), at the cost of low precision: errors in blocks may lead to false positives.

Token Blocking Example

Example of Token Blocking

Token Blocking Implementation

import string
import nltk
from nltk.tokenize import word_tokenize # Tokenizing words
from nltk.corpus import stopwords
nltk.download('punkt') # Punkt tokenizer models used by word_tokenize
nltk.download('stopwords') # stopwords (e.g. "a", "the", "is", etc.)

## Alternatively, with spaCy:

import spacy
from string import punctuation
from spacy.lang.en import stop_words as spacy_stop_words

stop_words = spacy_stop_words.STOP_WORDS
punctuations = list(punctuation)

## OR load the stop words from a spaCy model:

nlp = spacy.load('en_core_web_sm')
stop_words = nlp.Defaults.stop_words

Token Blocking Implementation Function

The function below generates token blocks based on word tokens.

def token_blocking(df: pd.DataFrame, stop_words: set) -> dict[str, list]:

    blocks = defaultdict(list)

    for i, row in enumerate(df.itertuples(index=False)):

        # Concatenate the record's fields and tokenize
        # (index=False keeps the row index out of the token set)
        record_str = " ".join([str(value) for value in row if not pd.isna(value)])
        tokens = set(
            word for word in word_tokenize(record_str) if word not in stop_words
        )

        # Add the record to the block of every token it contains
        for token in tokens:
            blocks[token].append(i)

    return blocks

Token Blocking for title, artist, and album

columns = ['title', 'artist', 'album']
stop_words = set(stopwords.words('english') + list(string.punctuation))
token_blocks = token_blocking(df[columns], stop_words)

Sorted Neighborhood

Sorted Neighborhood sorts records by specific fields' values in alphabetical order.

A fixed-size window slides over the sorted records and generates candidate pairs from the records within the window.

  • Effectively handles noise in the sorting fields.
  • A smaller window sacrifices recall in favor of precision.
  • A larger window has higher recall with lower precision.

Sorted Neighborhood Example

Example of Sorted Neighborhood with window size 3

Sorted Neighborhood Implementation

def sorted_neighborhood(df: pd.DataFrame,
                        keys: list, window_size: int = 3) -> np.ndarray:

    sorted_indices = (
        df[keys].dropna(how="all").sort_values(keys).index.tolist()
    )
    pairs = []
    for window_end in range(1, len(sorted_indices)):
        window_start = max(0, window_end - window_size)
        for i in range(window_start, window_end):
            pairs.append([sorted_indices[i], sorted_indices[window_end]])

    return np.array(pairs)

Sorted Neighborhood for title, artist, and album (Window Size 3)

Sorted Neighborhood with window size 3, using the fields of title, artist, and album as the sorting keys.

columns = ['title', 'artist', 'album']
sn_pairs = sorted_neighborhood(df, columns, window_size=3)

We will compare the performance of the three blocking approaches after performing block processing and entity matching in the next two sections.

Block Processing

Block processing improves the precision of blocks while maintaining a comparable level of recall. It reduces unnecessary and redundant comparisons within the input set of blocks B, resulting in a new set of blocks B′ with improved precision. We will explore some of the major block-processing techniques in this section.

Block Purging

Sets an upper limit on the block size and purges blocks if their sizes go over the limit.

It assumes:

  • excessively large blocks are dominated by redundant comparisons
  • duplicates contained in large blocks are more likely to appear in other smaller blocks.

Block Purging Implementation with Threshold

  • Purges blocks over the limit (set as 1000 records here).
  • Filters out blocks with just one record; they do not create pairs.
def purge_blocks(
    blocks: dict[str, list], purging_threshold: int = 1000
) -> dict[str, list]:

    blocks_purged = {
        key: indices
        for key, indices in blocks.items()
        if len(indices) < purging_threshold and len(indices) > 1
    }

    return blocks_purged

Block Purging for Standard Blocks and Token Blocks

token_blocks = purge_blocks(token_blocks)
sb_title = purge_blocks(sb_title)
sb_artist = purge_blocks(sb_artist)
sb_album = purge_blocks(sb_album)

Meta-blocking

Meta-blocking transforms the input block collection into a graph (or adjacency matrix).
Each node corresponds to a record, and edges link every pair of records that co-occur in a block.

An edge weight represents the frequency of the pair's co-occurrence across blocks.

  • Higher weights indicate a greater likelihood of a match.
  • Edges with low weights are pruned, as they likely represent superfluous comparisons.

For each retained edge, a new refined block is generated (or a list of pairs, as each refined block only contains a single pair of records).

Meta-blocking Example

Example of Meta Blocking

Meta-blocking Implementation (Part 1: Build Graph)

Creates a list of pairs from the token blocks. Then converts it into an adjacency matrix.

import itertools
from scipy.sparse import csr_matrix


def get_pairs_from_blocks(blocks: dict[str, list]) -> list[list]:
    return [
        pair
        for indices in blocks.values()
        for pair in list(itertools.combinations(indices, 2))
    ]


def get_adjacency_matrix_from_pairs(
    pairs: list[list], matrix_shape: tuple[int, int]) -> csr_matrix:

    idx1 = [pair[0] for pair in pairs]
    idx2 = [pair[1] for pair in pairs]
    ones = np.ones(len(idx1))

    return csr_matrix(
        (ones, (idx1, idx2)), shape=matrix_shape, dtype=np.int8
    )

Meta-blocking for Token Blocks (Part 1: Build Graph)

We are performing meta-blocking only on token blocks as they have many overlaps across the blocks.

pairs = get_pairs_from_blocks(token_blocks)
adj_matrix = get_adjacency_matrix_from_pairs(pairs, (len(df), len(df)))

Meta-blocking Implementation (Part 2: Prune the Graph by Edge Weight)

Next, we prune edges in the adjacency matrix based on their edge weights.

def prune_edges(
    adj_matrix: csr_matrix,
    edge_weight_threshold: float,
    verbose: bool = False,
) -> csr_matrix:

    adj_matrix_pruned = adj_matrix >= edge_weight_threshold

    if verbose:
        print(f"original edge count: {adj_matrix.nonzero()[0].shape[0]:,}")
        print(
            f"edge count after pruning: {adj_matrix_pruned.nonzero()[0].shape[0]:,}"
        )

    return adj_matrix_pruned

Meta-blocking on Token Blocks (Part 2: Prune Graph)

Here we prune edges with weight 1 (keeping only those with weight 2 or more); afterwards, we get pairs from the pruned adjacency matrix.

def get_pairs_from_adj_matrix(adjacency_matrix: csr_matrix) -> np.ndarray:
    return np.array(adjacency_matrix.nonzero()).T

adj_matrix = prune_edges(adj_matrix, edge_weight_threshold=2, verbose=True)

    original edge count: 70,337,023
    edge count after pruning: 2,501,622

tb_pairs = get_pairs_from_adj_matrix(adj_matrix)

Meta-blocking: Union of the Standard Blocks

In the case of the standard blocks, we obtain a union of the three independent blocks. First, we convert the blocks into a list of adjacency matrices.

adj_matrix_list = []
for blocks in [sb_title, sb_artist, sb_album]:
    pairs = get_pairs_from_blocks(blocks)
    adj_matrix_list.append(
        get_adjacency_matrix_from_pairs(pairs, (len(df), len(df)))
    )

Then, we get a union of the matrices and candidate pairs from it.

def get_union_of_adj_matrices(adj_matrix_list: list) -> csr_matrix:

    adj_matrix = csr_matrix(adj_matrix_list[0].shape)
    for matrix in adj_matrix_list:
        adj_matrix += matrix

    return adj_matrix

Generate the standard blocking pairs from the union of the adjacency matrices.

adj_matrix_union = get_union_of_adj_matrices(adj_matrix_list)
sb_pairs = get_pairs_from_adj_matrix(adj_matrix_union)

Blocking Results

The final number of candidate pairs from the three different blocking approaches is shown below. We will determine which one is best for our data by looking at the matching results in the next section.

pd.DataFrame(
    [
        ["Standard Blocking", f"{len(sb_pairs):,}"],
        ["Token Blocking", f"{len(tb_pairs):,}"],
        ["Sorted Neighborhood", f"{len(sn_pairs):,}"],
    ],
    columns=["Blocking Approach", "N of Candidate Pairs"]
)
Blocking Approach N of Candidate Pairs
0 Standard Blocking 1,338,965
1 Token Blocking 2,501,622
2 Sorted Neighborhood 382,296


Entity Matching

The goal is to identify matched pairs among the candidate pairs generated in the previous step. While there are various methods to find matches, one straightforward approach can be outlined as follows:

1. Measure similarity on each attribute

You can use any similarity metric, such as cosine similarity, Jaccard similarity, or Levenshtein-based similarity, depending on suitability for your data or specific requirements. Text fields may need to be tokenized before computing similarity for some of these metrics (a brief illustration follows the list below).

2. Compute overall similarity

Combine the per-attribute similarities into an overall similarity score, using manually defined rules or a machine-learning model.

3. Apply threshold

Apply a threshold to the overall similarity score to determine matches.
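
As a brief illustration of step 1, the sketch below computes three similarity measures on an invented pair of strings: Jaccard similarity on word tokens, an edit-based ratio from the standard library's difflib, and cosine similarity on character 3-grams (the approach used later in this article).

from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

s1, s2 = "when i was a kid", "when i wa sa kid"

# Jaccard similarity on word-token sets
tokens1, tokens2 = set(s1.split()), set(s2.split())
jaccard = len(tokens1 & tokens2) / len(tokens1 | tokens2)

# Edit-based similarity ratio from the standard library
ratio = SequenceMatcher(None, s1, s2).ratio()

# Cosine similarity on character 3-gram counts
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit([s1, s2])
m = normalize(vectorizer.transform([s1, s2]))
cosine = m[0].multiply(m[1]).sum()

print(f"jaccard={jaccard:.2f} ratio={ratio:.2f} cosine={cosine:.2f}")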

Entity Matching Example

Example of Entity Matching

The function get_field_similarity_scores below takes care of step 1 above. If sim_type is set to "fuzzy", it calculates cosine similarity; otherwise, it performs an exact match. The cosine similarity is computed on character-level 3-grams, which are vectorized from the input strings using the CountVectorizer module from scikit-learn. We compute the cosine similarity for the title, artist, and album fields, while performing an exact match on the number field.

Field Similarity Scores

from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer


def get_field_similarity_scores(
    df: pd.DataFrame, pairs: np.ndarray, field_config: dict[str, str]
) -> dict[str, np.ndarray]:
    """
    Measuring similarity by field. It is either cosine similarity
    (if sim_type == 'fuzzy') or exact match 0/1 (if sim_type == 'exact').
    Each attribute's similarity scores are stored in the field_scores
    dictionary with the field name as key.
    """

    field_scores = {}

    for field, sim_type in field_config.items():
        if sim_type == "fuzzy":
            field_scores[field] = cosine_similarities(
                df[field].fillna(""), pairs
            )
        else:
            field_scores[field] = exact_matches(df[field], pairs)

    return field_scores

Cosine Similarities

def cosine_similarities(
    field_values: pd.Series, pairs: np.ndarray
) -> np.ndarray:
    """
    Computing cosine similarities on pairs
    """

    token_matrix_1, token_matrix_2 = get_token_matrix_pair(
        field_values, pairs
    )
    cos_sim = cosine_similarities_on_pair_matrices(
        token_matrix_1, token_matrix_2
    )

    return cos_sim

Get Token Matrix Pair

def get_token_matrix_pair(
    field_values: pd.Series, pairs: np.ndarray,
) -> tuple[csr_matrix, csr_matrix]:
    """
    Converting pairs into matrices of token counts (matrix of records
    by tokens filled with token counts).
    """

    all_idx = np.unique(pairs)
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
    vectorizer.fit(field_values.loc[all_idx])
    token_matrix_1 = vectorizer.transform(field_values.loc[pairs[:, 0]])
    token_matrix_2 = vectorizer.transform(field_values.loc[pairs[:, 1]])

    return token_matrix_1, token_matrix_2

Cosine Similarities on Pair Matrices

def cosine_similarities_on_pair_matrices(
    token_matrix_1: csr_matrix, token_matrix_2: csr_matrix
) -> np.ndarray:
    """
    Computing cosine similarities on pair of token count matrices.
    It normalizes each record (axis=1) first, then computes dot product
    for each pair of records.
    """

    token_matrix_1 = normalize(token_matrix_1, axis=1)
    token_matrix_2 = normalize(token_matrix_2, axis=1)
    cos_sim = np.asarray(
        token_matrix_1.multiply(token_matrix_2).sum(axis=1)
    ).flatten()

    return cos_sim

Exact Matches

def exact_matches(
    field_values: pd.Series, pairs: np.ndarray
) -> np.ndarray:
    """
    Performing exact matches on pairs
    """

    arr1 = field_values.loc[pairs[:, 0]].values
    arr2 = field_values.loc[pairs[:, 1]].values

    return ((arr1 == arr2) & (~pd.isna(arr1)) & (~pd.isna(arr2))).astype(int)

Field Configuration and Similarity Scores

field_config = {
    # <field>: <sim_type>
    "title": "fuzzy",
    "artist": "fuzzy",
    "album": "fuzzy",
    "number": "exact",
}

field_scores_sb = get_field_similarity_scores(df, sb_pairs, field_config)

Rule-based matching


Overall Similarity Scores and Matching

After computing field-specific similarity scores, we combine them into an overall similarity score, as outlined in step 2 above. We take a very simple approach here: we calculate the average of the attributes' scores, and then apply a score threshold to identify matches (step 3). The threshold value below has already been tuned, but you may want to tune it by looking at examples of matched/unmatched pairs when you work on your own dataset.

def calc_overall_scores(field_scores: dict[str, np.ndarray]) -> np.ndarray:
    return np.array(list(field_scores.values())).mean(axis=0)

def find_matches(scores: np.ndarray, threshold: float) -> np.ndarray:
    return scores >= threshold

Matching with Standard Blocking, Token Blocking, and Sorted Neighborhood

The code below performs matching on the pairs from the standard blocks, then extends the process to the pairs from the token blocks and the sorted neighborhood, allowing us to compare their performances.

scores_sb = calc_overall_scores(field_scores_sb)
is_matched_sb = find_matches(scores_sb, threshold=0.64)
field_scores_tb = get_field_similarity_scores(df, tb_pairs, field_config)
scores_tb = calc_overall_scores(field_scores_tb)
is_matched_tb = find_matches(scores_tb, threshold=0.64)
field_scores_sn = get_field_similarity_scores(df, sn_pairs, field_config)
scores_sn = calc_overall_scores(field_scores_sn)
is_matched_sn = find_matches(scores_sn, threshold=0.64)

The code below summarizes the comparison in a table.

from IPython.display import display
from collections import Counter

def show_results(
    is_matched_list: list[np.ndarray],
    blocking_approach_name_list: list[str],
):

    result = pd.DataFrame(
        [Counter(is_matched).values() for is_matched in is_matched_list],
        columns=["Unmatch", "Match"],
    )
    result["Blocking Approach"] = blocking_approach_name_list
    result["Matching Rate"] = result.Match / (
        result.Match + result.Unmatch
    )
    result["Matching Rate"] = result["Matching Rate"].map("{:.1%}".format)
    result["Match"] = result["Match"].map("{:,}".format)
    result["Unmatch"] = result["Unmatch"].map("{:,}".format)

    display(
        result[["Blocking Approach", "Match", "Unmatch", "Matching Rate"]]
    )
is_matched_list = [is_matched_sb, is_matched_tb, is_matched_sn]
blocking_approach_name_list = [
    "Standard Blocking",
    "Token Blocking",
    "Sorted Neighborhood",
]
show_results(is_matched_list, blocking_approach_name_list)
Blocking Approach Match Unmatch Matching Rate
0 Standard Blocking 58,303 1,280,662 4.4%
1 Token Blocking 66,765 2,434,857 2.7%
2 Sorted Neighborhood 23,875 358,421 6.2%

Takeaways

  • Token Blocking yields the highest number of matches.
  • Sorted Neighborhood has the highest matching rate.

As Token Blocking likely has the fewest missed matches, we will proceed with the outcome of this approach. Our small dataset does not present scalability concerns; however, for larger datasets where Token Blocking may not be feasible, you may want to consider the other, more scalable approaches.

Machine-learning matching

If you have labeled data or have manually labeled sample pairs as matches or non-matches, you can train a machine-learning model to predict matched pairs. As our data has cluster labels (CID), we convert these into matching labels for pairs, train a model, and then compare its performance with the rule-based approach from the previous section.

Machine-learning Matching Implementation

The following code generates the model input X and the corresponding target variable y. Pairs within the same cluster (CID) are designated as matches (y = 1); pairs outside the same cluster are labeled as non-matches (y = 0).
def get_x_y(
    field_scores: dict[str, np.ndarray],
    pairs: np.ndarray,
    df: pd.DataFrame,
) -> tuple[pd.DataFrame, np.ndarray]:

    X = pd.DataFrame(field_scores)
    y = df.loc[pairs[:, 0], "CID"].values == df.loc[pairs[:, 1], "CID"].values

    return X, y

X, y = get_x_y(field_scores_tb, tb_pairs, df)

Splitting into Training and Testing Sets

Training a logistic regression model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

model = LogisticRegression(random_state=0).fit(X_train, y_train)

Model Evaluation Score

The code below compares its performance (f1_score) with the rule-based approach.

from sklearn.metrics import f1_score

y_pred = model.predict(X_test)
print(f"Model f1_score: {f1_score(y_test, y_pred):.3f}")

y_rule_base = is_matched_tb[X_test.index.values]
print(f"Rule-base f1_score: {f1_score(y_test, y_rule_base):.3f}")
Model f1_score: 0.866
Rule-base f1_score: 0.721

While the model's performance is better, the performance of the rule-based approach may still be reasonably good. For the following steps, we will use the matches identified through the rule-based approach, considering that in many real-world cases manual data labeling is not practical due to resource constraints.

The code below extracts the matched pairs and their similarity scores from the candidate pairs and their scores on token blocking.

matched_pairs = tb_pairs[is_matched_tb]
matched_scores = scores_tb[is_matched_tb]


Clustering

In this step, we create entity clusters based on the matched pairs from the previous step. Each cluster includes all records corresponding to a distinct real-world entity.

Clustering for entity resolution has several requirements:

1. Unconstrained algorithm
The algorithms should not require any domain-specific parameters as input, such as the number of clusters or the diameter of the clusters.

2. Capability of handling an incomplete similarity matrix
As the entity resolution process does not compute similarity on every possible pair (which could be described as an N by N matrix), the algorithms must be able to handle an incomplete similarity matrix (or a list of matched pairs).

3. Scalability
Entity resolution often handles sizable datasets, making it important that algorithms are capable of handling such data.

There are three major single-pass clustering algorithms, all of which satisfy these requirements. They are highly efficient, creating clusters in a single scan (O(n) complexity) of the list of candidate pairs, although some of them require the list to be sorted by similarity score:

Single-pass clustering algorithms (source: http://www.vldb.org/pvldb/vol2/vldb09-1025.pdf)
  1. Partitioning (i.e. connected components)
  2. Center Clustering
  3. Merge-Center Clustering

Partitioning/Connected Components

This algorithm starts by assigning each node to its own cluster. It then conducts a single scan of the list of matched pairs; whenever it finds connected nodes that do not belong to the same cluster, it merges their clusters. In short, it forms a cluster by grouping all the nodes connected via edges (i.e. matched records via pairs). Note that it may create clusters that connect dissimilar records via long paths.

Connected components clustering can be performed with the SciPy module, as in the code below. Before performing it, we convert the list of pairs into an adjacency matrix.

from scipy.sparse.csgraph import connected_components

def connected_components_from_pairs(
    pairs: np.ndarray, dim: int
) -> np.ndarray:

    adjacency_matrix = get_adjacency_matrix_from_pairs(pairs, (dim, dim))
    _, clusters = connected_components(
        csgraph=adjacency_matrix, directed=False, return_labels=True
    )

    return clusters
cc_clusters = connected_components_from_pairs(matched_pairs, len(df))

Center Clustering

This algorithm [5] performs clustering where each cluster has a center and all records in each cluster are similar to the center of the cluster. It requires the list of similar pairs to be sorted by descending order of similarity scores. The algorithm then performs clustering by a single scan of the sorted list. When a node u is encountered for the first time in the scan, it's designated as the cluster center. Any subsequent nodes v that are similar to u (i.e., appear in a pair (u, v) in the list) are assigned to the cluster of u and are not considered again during the process.

Example of Center Clustering

Merge-Center Clustering

This algorithm [6] performs similarly to Center clustering, but merges two clusters cᵢ and cⱼ whenever a record that is similar to the center of cluster cᵢ is also similar to the center of cⱼ. Note that when two clusters are merged, a single center node is not chosen, which means that merged clusters can have multiple center nodes. This algorithm can likewise be performed by a single scan of the list of similar pairs, while keeping track of the records that are connected through a merged cluster.

Example of Merge-Center Clustering

To perform Center/Merge-Center clustering, we first need to sort the list of pairs by descending order of the similarity scores.

def sort_pairs(pairs: np.ndarray, scores: np.ndarray) -> np.ndarray:
    sorted_ids = (-1 * scores).argsort()
    return pairs[sorted_ids]

pairs_sorted = sort_pairs(matched_pairs, matched_scores)

Next, the code below yields two sets of pairs: center-child pairs, denoted as center_cluster_pairs, and merged node pairs, referred to as merge_cluster_pairs. We can then generate Center clusters and Merge-Center clusters by applying connected components to these lists of pairs.

def get_center_cluster_pairs(pairs, dim):

    """
    cluster_centers:
        list tracking cluster center for each record.
        indices of the list correspond to the original df indices
        and the values represent assigned cluster centers' indices
    center_cluster_pairs:
        list of pairs of indices representing center-child pairs
    merge_cluster_pairs:
        list of pairs of merged nodes' indices
    """
    cluster_centers = [None] * dim
    center_cluster_pairs = []
    merge_cluster_pairs = []

    for idx1, idx2 in pairs:

        if (
            cluster_centers[idx1] is None
            or cluster_centers[idx1] == idx1
            or cluster_centers[idx2] is None
            or cluster_centers[idx2] == idx2
        ):
            # if at least one node is not already a child, the nodes are merged
            merge_cluster_pairs.append([idx1, idx2])

        if cluster_centers[idx1] is None and cluster_centers[idx2] is None:
            # if both weren't seen before, idx1 becomes center and idx2 gets child
            cluster_centers[idx1] = idx1
            cluster_centers[idx2] = idx1
            center_cluster_pairs.append([idx1, idx2])
        elif cluster_centers[idx2] is None:
            if cluster_centers[idx1] == idx1:
                # if idx1 is center, idx2 is assigned to that cluster
                cluster_centers[idx2] = idx1
                center_cluster_pairs.append([idx1, idx2])
            else:
                # if idx1 is not center, idx2 becomes new center
                cluster_centers[idx2] = idx2
        elif cluster_centers[idx1] is None:
            if cluster_centers[idx2] == idx2:
                # if idx2 is center, idx1 is assigned to that cluster
                cluster_centers[idx1] = idx2
                center_cluster_pairs.append([idx1, idx2])
            else:
                # if idx2 is not center, idx1 becomes new center
                cluster_centers[idx1] = idx1

    return center_cluster_pairs, merge_cluster_pairs
center_cluster_pairs, merge_cluster_pairs = get_center_cluster_pairs(pairs_sorted, len(df))
ct_clusters = connected_components_from_pairs(center_cluster_pairs, len(df))
mc_clusters = connected_components_from_pairs(merge_cluster_pairs, len(df))


Cluster Evaluation

As we have the cluster labels, we can evaluate the quality of the clusters using Rand Index or adjusted Rand Index.
Rand Index is a cluster evaluation metric that represents the proportion of pairs that are correctly clustered together or apart. It is defined as follows:

TP = Number of pairs that are clustered together in both predicted and true clusters.

TN = Number of pairs that are clustered apart in both predicted and true clusters.

Rand Index = (TP + TN) / Total number of possible pairs

Example of Rand Index Calculation
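
As a minimal worked example of this definition (toy labels, invented for illustration):

from sklearn.metrics.cluster import rand_score

labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 1, 2]

# Of the 6 possible pairs: 1 is together in both clusterings (TP) and
# 4 are apart in both (TN); the pair (2, 3) is together in the true
# clusters but apart in the predicted ones.
print(f"{rand_score(labels_true, labels_pred):.3f}")  # (1 + 4) / 6 = 0.833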

Adjusted Rand Index is a modified version of Rand Index that is corrected for chance; the adjustment accounts for the agreement expected from randomly assigned clusterings. In general form:

Adjusted Rand Index = (Rand Index − Expected Rand Index) / (Max Rand Index − Expected Rand Index)

We won't delve into how each term is calculated here; anyone interested in this topic can refer to the paper from KY Yeung, which explains the metric with some examples.
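
To see why the correction matters, the sketch below compares the two scores on hypothetical random labelings: with many clusters, most pairs are apart in both clusterings, so even a random clustering earns a high Rand Index, while the Adjusted Rand Index stays near zero.

import numpy as np
from sklearn.metrics.cluster import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 100, size=1_000)         # "true" clusters
random_labels = rng.integers(0, 100, size=1_000)  # a random clustering

print(f"{rand_score(labels, random_labels):.3f}")           # should be close to 1
print(f"{adjusted_rand_score(labels, random_labels):.3f}")  # should be close to 0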

The code below gives us a comparison of the clusters using those metrics along with some additional basic statistics.

Get Stats Function

from sklearn.metrics.cluster import rand_score, adjusted_rand_score
from IPython.display import display

def get_stats(labels, clusters):

    stats = []
    stats.append(f"{rand_score(labels, clusters):.3f}")
    stats.append(f"{adjusted_rand_score(labels, clusters):.3f}")
    clus_dist = pd.Series(clusters).value_counts()
    stats.append(f"{len(clus_dist):,}")
    stats.append(f"{clus_dist.mean():.3f}")
    stats.append(f"{clus_dist.min():,}")
    stats.append(f"{clus_dist.max():,}")

    return stats

Compare Clusters Function

def compare_clusters(cluster_list, cluster_names, labels):

    stats_dict = {}
    for clusters, name in zip(cluster_list, cluster_names):
        stats = get_stats(labels, clusters)
        stats_dict[name] = stats

    display(
        pd.DataFrame(
            stats_dict,
            index=[
                "Rand Index",
                "Adjusted Rand Index",
                "Cluster Count",
                "Mean Cluster Size",
                "Min Cluster Size",
                "Max Cluster Size",
            ],
        )
    )

Cluster Comparison

cluster_list = [cc_clusters, ct_clusters, mc_clusters]
cluster_names = ["Connected Components", "Center", "Merge-Center"]
compare_clusters(cluster_list, cluster_names, df.CID)
Metric Connected Components Center Merge-Center
Rand Index 1.000 1.000 1.000
Adjusted Rand Index 0.782 0.591 0.784
Cluster Count 79,229 90,994 79,267
Mean Cluster Size 1.609 1.401 1.608
Min Cluster Size 1 1 1
Max Cluster Size 86 6 86

  • Connected components produce the largest clusters with the fewest clusters, while the gap between connected components and Merge-Center is minimal.
  • Center clustering yields the smallest clusters with the highest cluster count.
  • All three clusterings have a near-perfect Rand Index, as they produce a large number of clusters, making inter-cluster pairs dominant (i.e. even random clustering yields a respectable Rand Index).
  • By the Adjusted Rand Index, Merge-Center clustering is the best, although its difference from connected components is marginal.

Another Metric: B-Cubed

This metric from Bagga et al. is another clustering evaluation metric that considers precision and recall. For each record e, let C(e) be its predicted cluster and L(e) its true cluster; then

B-Cubed Precision(e) = |C(e) ∩ L(e)| / |C(e)|
B-Cubed Recall(e) = |C(e) ∩ L(e)| / |L(e)|

and the overall B-Cubed precision and recall are the averages of these per-record scores. The metric is typically used with text documents, but it can be applied to any clustering task.
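
A minimal sketch of the metric, assuming dense label arrays; the O(n²) loop is written for clarity, not efficiency:

import numpy as np

def b_cubed(labels_true: np.ndarray, labels_pred: np.ndarray) -> tuple[float, float]:
    """Average per-record B-Cubed precision and recall."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    precisions, recalls = [], []
    for i in range(len(labels_true)):
        in_pred_cluster = labels_pred == labels_pred[i]  # records in i's predicted cluster
        in_true_cluster = labels_true == labels_true[i]  # records in i's true cluster
        overlap = (in_pred_cluster & in_true_cluster).sum()
        precisions.append(overlap / in_pred_cluster.sum())
        recalls.append(overlap / in_true_cluster.sum())
    return float(np.mean(precisions)), float(np.mean(recalls))

# e.g. b_cubed(df.CID.values, mc_clusters)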

Conclusion

That concludes our exploration of the entity resolution framework.
How you proceed with the created clusters depends on your specific business requirements or use case.
If you aim to establish a canonical representation for each cluster, you can achieve this by extracting the most representative value (such as the most frequent value) for each field within each cluster.
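
For instance, a minimal sketch along those lines (the field selection and tie-breaking here are arbitrary assumptions): attach the cluster labels to the dataframe and take the modal value of each field within each cluster.

# Most frequent non-null value per field within each cluster
clustered = df.assign(cluster=mc_clusters)

def most_frequent(values: pd.Series):
    modes = values.mode()  # NaNs are dropped by default
    return modes.iloc[0] if not modes.empty else None

canonical = (
    clustered.groupby("cluster")[["title", "artist", "album", "number"]]
    .agg(most_frequent)
)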

References

[1] Christophides et al., End-to-End Entity Resolution for Big Data: A Survey (2019)

[2] Papadakis et al., Comparative analysis of approximate blocking techniques for entity resolution (2016)

[3] Papadakis et al., A survey of blocking and filtering techniques for entity resolution (2020)

[4] Hassanzadeh et al., Framework for evaluating clustering algorithms in duplicate detection (2009)

[5] Haveliwala et al., Scalable techniques for clustering the web (2000)

[6] Hassanzadeh & Miller, Creating probabilistic databases from duplicated data (2009)
