Entity Resolution - Identifying Real-World Entities in Noisy Data

Fundamental Theories and Python Implementations

Explore the technical details of fundamental entity resolution approaches using a benchmark dataset.

Adapted from the original article by Tomonori Masui, Towards Data Science


Prerequisites and Installation

Using Python 3.12.9, install the following libraries:

pipenv install pandas numpy requests nltk scikit-learn scipy jupyter spacy
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = None

Overview of Entity Resolution

The standard entity resolution (ER) framework consists of several steps:

  1. Blocking: Reducing the search space by grouping similar records.
  2. Block Processing: Refining blocks to eliminate unnecessary comparisons.
  3. Entity Matching: Comparing records and identifying matches.
  4. Clustering: Grouping matched records into entities.

1. Blocking

This is the first step in entity resolution.

It aims to reduce the search space for identifying records of the same entity by dividing the dataset into smaller, manageable blocks.

These blocks contain records that share similar attributes, making the subsequent comparison more efficient.

2. Block Processing

This step refines the blocks to minimize the number of comparisons.

Unnecessary comparisons are discarded:

  1. the redundant ones, which are repeated across multiple blocks,
  2. the superfluous ones, which involve records unlikely to match.

3. Entity Matching

Comparing records within blocks to find matches based on the similarity of the records.

Various similarity metrics and matching algorithms can be employed to classify pairs of records as matches or non-matches.

4. Clustering

Clustering involves grouping the matched records into clusters based on their similarity.

The created clusters can be used to get a consolidated view of entities.

Entity Resolution workflow

Benchmark Dataset

The dataset, sourced from the database group at the University of Leipzig, is derived from actual records concerning songs from the MusicBrainz database but has been deliberately altered using the DAPO data pollution tool.

This tool injects both duplicates and errors into the dataset; as a result, it contains duplicates for 50% of the original records, spread across two to five sources.

Each record represents a song, with attributes such as artist, title, album, and year.

Please note that this benchmark is a single dataset. If you have multiple data sources for which you want to resolve entities, you need to standardize their data schemas and consolidate them into a unified dataset before proceeding with the subsequent steps.

Loading the Benchmark Dataset

We can load the data with the following code.

import requests
from io import BytesIO


# url = "https://cise.ufl.edu/~cgrant/files/er/musicbrainz_200k.csv"
url = "https://cise.ufl.edu/~cgrant/files/er/musicbrainz_20k.csv"

res = requests.get(url)
df = pd.read_csv(BytesIO(res.content))

# Check the shape of the dataset
df.shape
(193750, 12)

Example records from the dataset

df[df.CID == 41689]
TID CID CTID SourceID id number title length artist album year language
57699 57700 41689 3 4 14690-A031 006-My Father 3m 19sec Bill Cosby When I Was a Kid (2005) NaN Eng.
80716 80717 41689 1 2 MBox281897-HH Bill Cosby - My Father 199 NaN When I Wa sa Kid 05 English
116076 116077 41689 2 3 4534714MB-01 My Father - When I Was a Kid 3.325 Bill-Cosby NaN '05 ENGLISH

CID is the cluster ID; records with the same CID are duplicates (in the example above, all three records represent the same song).

(You can find field descriptions in this link).

Let's focus on English songs...

Cleaning: English-only records

The code below identifies records with cluster IDs that have English songs.

english_cids = df[
    df.language.str.lower().str.contains("^en|^eg", na=False)
].CID.unique()

df = df[df.CID.isin(english_cids)].reset_index(drop=True)

Cleaning: Standardizing string fields

We are also preprocessing some of the string fields to get standardized values.

for col in ["title", "artist", "album"]:
    df[col] = (
        df[col]
        .str.lower()
        .replace("[^a-z0-9]", " ", regex=True)  # replacing special characters with a space
        .replace(" +", " ", regex=True)         # removing consecutive spaces
        .str.strip()                            # removing leading and trailing spaces
    )

df.loc[df.number.notna(), "number"] = (
    df[df.number.notna()]
        .number.replace("[^0-9]", "", regex=True)             # removing non-digits
        .apply(lambda x: str(int(x)) if len(x) > 0 else None) # removing leading zeros
)

Blocking

Blocking is the first step in entity resolution; it groups similar records together based on certain attributes. By doing so, the process narrows its search to comparisons within each block, rather than examining all possible record pairs in the dataset. This significantly reduces the number of comparisons and accelerates the ER process. Because it skips many comparisons, it can also lead to missed true matches. Therefore, blocking should achieve a good balance between efficiency and accuracy.
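
As a rough back-of-the-envelope illustration of the savings (the block count and block sizes below are hypothetical, not measured from our dataset):

n = 193_750                          # records in the dataset
all_pairs = n * (n - 1) // 2         # 18,769,434,375 pairs without blocking

# Suppose blocking yielded 10,000 blocks of roughly 20 records each (hypothetical):
block_sizes = [20] * 10_000
blocked_pairs = sum(s * (s - 1) // 2 for s in block_sizes)  # 1,900,000 pairs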

We will explore three different blocking approaches:

Standard Blocking

Standard blocking partitions the dataset into blocks based on a specific attribute. For example, in our dataset, one might create blocks based on the artist or title field.

This approach is intuitive and easy to implement, but it is very sensitive to noise: the slightest difference in the blocking keys of duplicates places them in different blocks.

Standard Blocking on Artist field

Implementing Standard Blocking

We can get standard blocks with the function below.
The dictionary blocks stores blocking keys (key) and the corresponding indices (idx) of the blocked records.

from collections import defaultdict

def standard_blocking(field_values: pd.Series) -> dict[str, list]:
    blocks = defaultdict(list)
    for idx, key in enumerate(field_values):
        if key is not None:
            blocks[key].append(idx)
    return blocks

Standard blocks for title, artist, and album

We can create three independent standard blocks using the fields of title, artist, and album.

sb_title = standard_blocking(df.title)
sb_artist = standard_blocking(df.artist)
sb_album = standard_blocking(df.album)

Token Blocking

Token blocking focuses on tokenizing attribute values into smaller units, then using these tokens to create blocks for comparison.

  • Tokens are typically single words or small n-grams (substrings of length `n`) extracted from the text.

  • A block is created for every distinct token value: two records land in the same block if they share a token in any of their attributes, regardless of the associated attributes.

  • High recall, due to redundancy (a single record can belong to multiple blocks), at the cost of low precision: errors in blocks may lead to false positives.

Token Blocking Example

Example of Token Blocking

Token Blocking Implementation

import string
import nltk
from nltk.tokenize import word_tokenize # Tokenizing words
from nltk.corpus import stopwords
nltk.download('punkt') # Punkt tokenizer models used by word_tokenize
nltk.download('stopwords') # stopwords (e.g. "a", "the", "is", etc.)

## Alternatively, with spaCy:

import spacy
from string import punctuation
from spacy.lang.en import stop_words as spacy_stop_words

stop_words = spacy_stop_words.STOP_WORDS
punctuations = list(punctuation)

## OR load the stop words from a spaCy model:

nlp = spacy.load('en_core_web_sm')
stop_words = nlp.Defaults.stop_words

Token Blocking Implementation Function

The function below generates token blocks based on word tokens.

def token_blocking(df: pd.DataFrame, stop_words: set) -> dict[str, list]:

    blocks = defaultdict(list)

    for i, row in enumerate(df.itertuples(index=False)):

        # Concatenate the record's fields and tokenize
        # (index=False keeps the row index out of the token set)
        record_str = " ".join([str(value) for value in row if not pd.isna(value)])
        tokens = set(
            word for word in word_tokenize(record_str) if word not in stop_words
        )

        # Add the record to the block of every token it contains
        for token in tokens:
            blocks[token].append(i)

    return blocks

Token Blocking for title, artist, and album

columns = ['title', 'artist', 'album']
stop_words = set(stopwords.words('english') + list(string.punctuation))
token_blocks = token_blocking(df[columns], stop_words)

Sorted Neighborhood

Sorted Neighborhood sorts records by specific fields' values in alphabetical order.

A fixed-size window slides over the sorted records and generates candidate pairs from the records within the window.

  • Effectively handles noise in the sorting fields.
  • A smaller window sacrifices recall in favor of precision.
  • A larger window has higher recall with lower precision.

Sorted Neighborhood Example

Example of Sorted Neighborhood with window size 3

Sorted Neighborhood Implementation

def sorted_neighborhood(df: pd.DataFrame,
                        keys: list, window_size: int = 3) -> np.ndarray:

    sorted_indices = (
        df[keys].dropna(how="all").sort_values(keys).index.tolist()
    )
    pairs = []
    for window_end in range(1, len(sorted_indices)):
        window_start = max(0, window_end - window_size)
        for i in range(window_start, window_end):
            pairs.append([sorted_indices[i], sorted_indices[window_end]])

    return np.array(pairs)

Sorted Neighborhood for title, artist, and album (Window Size 3)

Sorted Neighborhood with window size 3, using the fields of title, artist, and album as the sorting keys.

columns = ['title', 'artist', 'album']
sn_pairs = sorted_neighborhood(df, columns, window_size=3)

We will compare the performance of the three blocking approaches after performing block processing and entity matching in the next two sections.

Block Processing

Block processing improves the precision of blocks while maintaining a comparable level of recall. It reduces unnecessary and redundant comparisons within the input set of blocks B, resulting in a new set of blocks B′ with improved precision. We will explore some of the major block-processing techniques in this section.

Block Purging

Sets an upper limit on the block size and purges blocks if their sizes go over the limit.

It assumes:

  • excessively large blocks are dominated by redundant comparisons
  • duplicates contained in large blocks are more likely to appear in other smaller blocks.

Block Purging Implementation with Threshold

  • Purges blocks over the limit (set as 1000 records here).
  • Filters out blocks with just one record; they do not create pairs.
def purge_blocks(
    blocks: dict[str, list], purging_threshold: int = 1000
) -> dict[str, list]:

    blocks_purged = {
        key: indices
        for key, indices in blocks.items()
        if len(indices) < purging_threshold and len(indices) > 1
    }

    return blocks_purged

Block Purging for Standard Blocks and Token Blocks

token_blocks = purge_blocks(token_blocks)
sb_title = purge_blocks(sb_title)
sb_artist = purge_blocks(sb_artist)
sb_album = purge_blocks(sb_album)

Meta-blocking

Meta-blocking transforms the input block collection into a graph (or adjacency matrix).
Each node corresponds to a record, and edges link every pair of records that co-occur in a block.

An edge weight represents the frequency of the pair's co-occurrence across blocks.

  • Higher weights indicate a greater likelihood of a match.
  • Edges with low weights are pruned, as they likely represent superfluous comparisons.

For each retained edge, a new refined block is generated (or a list of pairs, as each refined block only contains a single pair of records).

Meta-blocking Example

Example of Meta Blocking

Meta-blocking Implementation (Part 1: Build Graph)

Creates a list of pairs from the token blocks. Then converts it into an adjacency matrix.

import itertools
from scipy.sparse import csr_matrix


def get_pairs_from_blocks(blocks: dict[str, list]) -> list[list]:
    return [
        pair
        for indices in blocks.values()
        for pair in list(itertools.combinations(indices, 2))
    ]


def get_adjacency_matrix_from_pairs(
    pairs: list[list], matrix_shape: tuple[int, int]) -> csr_matrix:

    idx1 = [pair[0] for pair in pairs]
    idx2 = [pair[1] for pair in pairs]
    ones = np.ones(len(idx1))

    return csr_matrix(
        (ones, (idx1, idx2)), shape=matrix_shape, dtype=np.int8
    )

Meta-blocking for Token Blocks (Part 1: Build Graph)

We are performing meta-blocking only on token blocks as they have many overlaps across the blocks.

pairs = get_pairs_from_blocks(token_blocks)
adj_matrix = get_adjacency_matrix_from_pairs(pairs, (len(df), len(df)))

Meta-blocking Implementation (Part 2: Prune the Graph by Edge Weight)

Next, we prune edges in the adjacency matrix based on their edge weights.

def prune_edges(
    adj_matrix: csr_matrix,
    edge_weight_threshold: float,
    verbose: bool = False,
) -> csr_matrix:

    adj_matrix_pruned = adj_matrix >= edge_weight_threshold

    if verbose:
        print(f"original edge count: {adj_matrix.nonzero()[0].shape[0]:,}")
        print(
            f"edge count after pruning: {adj_matrix_pruned.nonzero()[0].shape[0]:,}"
        )

    return adj_matrix_pruned

Meta-blocking on Token Blocks (Part 2: Prune Graph)

Here we prune edges with weight 1 (keeping only those with weight 2 or more); afterwards, we get pairs from the pruned adjacency matrix.

def get_pairs_from_adj_matrix(adjacency_matrix: csr_matrix) -> np.ndarray:
    return np.array(adjacency_matrix.nonzero()).T

adj_matrix = prune_edges(adj_matrix, edge_weight_threshold=2, verbose=True)

    original edge count: 70,337,023
    edge count after pruning: 2,501,622

tb_pairs = get_pairs_from_adj_matrix(adj_matrix)

Meta-blocking: Union of the Standard Blocks

In the case of the standard blocks, we obtain a union of the three independent blocks. First, we convert the blocks into a list of adjacency matrices.

adj_matrix_list = []
for blocks in [sb_title, sb_artist, sb_album]:
    pairs = get_pairs_from_blocks(blocks)
    adj_matrix_list.append(
        get_adjacency_matrix_from_pairs(pairs, (len(df), len(df)))
    )

Then, we get a union of the matrices and candidate pairs from it.

def get_union_of_adj_matrices(adj_matrix_list: list) -> csr_matrix:

    adj_matrix = csr_matrix(adj_matrix_list[0].shape)
    for matrix in adj_matrix_list:
        adj_matrix += matrix

    return adj_matrix

Generate the standard blocking pairs from the union of the adjacency matrices.

adj_matrix_union = get_union_of_adj_matrices(adj_matrix_list)
sb_pairs = get_pairs_from_adj_matrix(adj_matrix_union)

Blocking Results

The final number of candidate pairs from the three different blocking approaches is shown below. We will determine which one is best for our data by looking at the matching results in the next section.

pd.DataFrame(
    [
        ["Standard Blocking", f"{len(sb_pairs):,}"],
        ["Token Blocking", f"{len(tb_pairs):,}"],
        ["Sorted Neighborhood", f"{len(sn_pairs):,}"],
    ],
    columns=["Blocking Approach", "N of Candidate Pairs"]
)
Blocking Approach N of Candidate Pairs
0 Standard Blocking 1,338,965
1 Token Blocking 2,501,622
2 Sorted Neighborhood 382,296


Entity Matching

The goal is to identify matched pairs among the candidate pairs generated in the previous step. While there are various methods to find matches, one straightforward approach can be outlined as follows:

1. Measure similarity on each attribute

You can use any similarity metric, such as cosine similarity, Jaccard similarity, or Levenshtein-based similarity, depending on suitability for your data or specific requirements. Text fields may need to be tokenized before computing similarity for some of these metrics (a brief illustration follows the list below).

2. Compute overall similarity

Combine the per-attribute similarities into an overall similarity score, using manually defined rules or a machine-learning model.

3. Apply threshold

Apply a threshold to the overall similarity score to determine matches.
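
As a brief illustration of step 1, the sketch below computes three similarity measures on an invented pair of strings: Jaccard similarity on word tokens, an edit-based ratio from the standard library's difflib, and cosine similarity on character 3-grams (the approach used later in this article).

from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

s1, s2 = "when i was a kid", "when i wa sa kid"

# Jaccard similarity on word-token sets
tokens1, tokens2 = set(s1.split()), set(s2.split())
jaccard = len(tokens1 & tokens2) / len(tokens1 | tokens2)

# Edit-based similarity ratio from the standard library
ratio = SequenceMatcher(None, s1, s2).ratio()

# Cosine similarity on character 3-gram counts
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit([s1, s2])
m = normalize(vectorizer.transform([s1, s2]))
cosine = m[0].multiply(m[1]).sum()

print(f"jaccard={jaccard:.2f} ratio={ratio:.2f} cosine={cosine:.2f}")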

Entity Matching Example

Example of Entity Matching

The function get_field_similarity_scores below takes care of step 1 above. If sim_type is set to "fuzzy", it calculates cosine similarity; otherwise, it performs an exact match. The cosine similarity is computed on character-level 3-grams, which are vectorized from the input strings using the CountVectorizer module from scikit-learn. We compute the cosine similarity for the title, artist, and album fields, while performing an exact match on the number field.

Field Similarity Scores

from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer


def get_field_similarity_scores(
    df: pd.DataFrame, pairs: np.ndarray, field_config: dict[str, str]
) -> dict[str, np.ndarray]:
    """
    Measuring similarity by field. It is either cosine similarity
    (if sim_type == 'fuzzy') or exact match 0/1 (if sim_type == 'exact').
    Each attribute's similarity scores are stored in the field_scores
    dictionary with the field name as key.
    """

    field_scores = {}

    for field, sim_type in field_config.items():
        if sim_type == "fuzzy":
            field_scores[field] = cosine_similarities(
                df[field].fillna(""), pairs
            )
        else:
            field_scores[field] = exact_matches(df[field], pairs)

    return field_scores

Cosine Similarities

def cosine_similarities(
    field_values: pd.Series, pairs: np.ndarray
) -> np.ndarray:
    """
    Computing cosine similarities on pairs
    """

    token_matrix_1, token_matrix_2 = get_token_matrix_pair(
        field_values, pairs
    )
    cos_sim = cosine_similarities_on_pair_matrices(
        token_matrix_1, token_matrix_2
    )

    return cos_sim

Get Token Matrix Pair

def get_token_matrix_pair(
    field_values: pd.Series, pairs: np.ndarray,
) -> tuple[csr_matrix, csr_matrix]:
    """
    Converting pairs into matrices of token counts (matrix of records
    by tokens filled with token counts).
    """

    all_idx = np.unique(pairs)
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
    vectorizer.fit(field_values.loc[all_idx])
    token_matrix_1 = vectorizer.transform(field_values.loc[pairs[:, 0]])
    token_matrix_2 = vectorizer.transform(field_values.loc[pairs[:, 1]])

    return token_matrix_1, token_matrix_2

Cosine Similarities on Pair Matrices

def cosine_similarities_on_pair_matrices(
    token_matrix_1: csr_matrix, token_matrix_2: csr_matrix
) -> np.ndarray:
    """
    Computing cosine similarities on pair of token count matrices.
    It normalizes each record (axis=1) first, then computes dot product
    for each pair of records.
    """

    token_matrix_1 = normalize(token_matrix_1, axis=1)
    token_matrix_2 = normalize(token_matrix_2, axis=1)
    cos_sim = np.asarray(
        token_matrix_1.multiply(token_matrix_2).sum(axis=1)
    ).flatten()

    return cos_sim

Exact Matches

def exact_matches(
    field_values: pd.Series, pairs: np.ndarray
) -> np.ndarray:
    """
    Performing exact matches on pairs
    """

    arr1 = field_values.loc[pairs[:, 0]].values
    arr2 = field_values.loc[pairs[:, 1]].values

    return ((arr1 == arr2) & (~pd.isna(arr1)) & (~pd.isna(arr2))).astype(int)

Field Configuration and Similarity Scores

field_config = {
    # <field>: <sim_type>
    "title": "fuzzy",
    "artist": "fuzzy",
    "album": "fuzzy",
    "number": "exact",
}

field_scores_sb = get_field_similarity_scores(df, sb_pairs, field_config)

Rule-based matching


Overall Similarity Scores and Matching

After computing field-specific similarity scores, we combine them into an overall similarity score, as outlined in step 2 above. We take a very simple approach here: we calculate the average of the attributes' scores, and then apply a score threshold to identify matches (step 3). The threshold value below has already been tuned, but you may want to tune it by looking at examples of matched/unmatched pairs when you work on your own dataset.

def calc_overall_scores(field_scores: dict[str, np.ndarray]) -> np.ndarray:
    return np.array(list(field_scores.values())).mean(axis=0)

def find_matches(scores: np.ndarray, threshold: float) -> np.ndarray:
    return scores >= threshold

Matching with Standard Blocking, Token Blocking, and Sorted Neighborhood

The code below performs matching on the pairs from the standard blocks, then extends the process to the pairs from the token blocks and the sorted neighborhood, allowing us to compare their performances.

scores_sb = calc_overall_scores(field_scores_sb)
is_matched_sb = find_matches(scores_sb, threshold=0.64)
field_scores_tb = get_field_similarity_scores(df, tb_pairs, field_config)
scores_tb = calc_overall_scores(field_scores_tb)
is_matched_tb = find_matches(scores_tb, threshold=0.64)
field_scores_sn = get_field_similarity_scores(df, sn_pairs, field_config)
scores_sn = calc_overall_scores(field_scores_sn)
is_matched_sn = find_matches(scores_sn, threshold=0.64)

The code below summarizes the comparison in a table.

from IPython.display import display
from collections import Counter

def show_results(
    is_matched_list: list[np.ndarray],
    blocking_approach_name_list: list[str],
):

    result = pd.DataFrame(
        [Counter(is_matched).values() for is_matched in is_matched_list],
        columns=["Unmatch", "Match"],
    )
    result["Blocking Approach"] = blocking_approach_name_list
    result["Matching Rate"] = result.Match / (
        result.Match + result.Unmatch
    )
    result["Matching Rate"] = result["Matching Rate"].map("{:.1%}".format)
    result["Match"] = result["Match"].map("{:,}".format)
    result["Unmatch"] = result["Unmatch"].map("{:,}".format)

    display(
        result[["Blocking Approach", "Match", "Unmatch", "Matching Rate"]]
    )
is_matched_list = [is_matched_sb, is_matched_tb, is_matched_sn]
blocking_approach_name_list = [
    "Standard Blocking",
    "Token Blocking",
    "Sorted Neighborhood",
]
show_results(is_matched_list, blocking_approach_name_list)
Blocking Approach Match Unmatch Matching Rate
0 Standard Blocking 58,303 1,280,662 4.4%
1 Token Blocking 66,765 2,434,857 2.7%
2 Sorted Neighborhood 23,875 358,421 6.2%

Takeaways

  • Token Blocking yields the highest number of matches.
  • Sorted Neighborhood has the highest matching rate.

As Token Blocking likely has the fewest missed matches, we will proceed with the outcome of this approach. Our small dataset does not present scalability concerns; however, for larger datasets where Token Blocking may not be feasible, you may want to consider the other, more scalable approaches.

Machine-learning matching

If you have labeled data or have manually labeled sample pairs as matches or non-matches, you can train a machine-learning model to predict matched pairs. As our data has cluster labels (CID), we convert these into matching labels for pairs, train a model, and then compare its performance with the rule-based approach from the previous section.

Machine-learning Matching Implementation

The following code generates the model input X and the corresponding target variable y. Pairs within the same cluster (CID) are designated as matches (y = 1); pairs outside the same cluster are labeled as non-matches (y = 0).
def get_x_y(
    field_scores: dict[str, np.ndarray],
    pairs: np.ndarray,
    df: pd.DataFrame,
) -> tuple[pd.DataFrame, np.ndarray]:

    X = pd.DataFrame(field_scores)
    y = df.loc[pairs[:, 0], "CID"].values == df.loc[pairs[:, 1], "CID"].values

    return X, y

X, y = get_x_y(field_scores_tb, tb_pairs, df)

Splitting into Training and Testing Sets

Training a logistic regression model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

model = LogisticRegression(random_state=0).fit(X_train, y_train)

Model Evaluation Score

The code below compares its performance (f1_score) with the rule-based approach.

from sklearn.metrics import f1_score

y_pred = model.predict(X_test)
print(f"Model f1_score: {f1_score(y_test, y_pred):.3f}")

y_rule_base = is_matched_tb[X_test.index.values]
print(f"Rule-base f1_score: {f1_score(y_test, y_rule_base):.3f}")
Model f1_score: 0.866
Rule-base f1_score: 0.721

While the model's performance is better, the performance of the rule-based approach may still be reasonably good. For the following steps, we will use the matches identified through the rule-based approach, considering that in many real-world cases manual data labeling is not practical due to resource constraints.

The code below extracts the matched pairs and their similarity scores from the candidate pairs and their scores on token blocking.

matched_pairs = tb_pairs[is_matched_tb]
matched_scores = scores_tb[is_matched_tb]


Clustering

In this step, we create entity clusters based on the matched pairs from the previous step. Each cluster includes all records corresponding to a distinct real-world entity.

Clustering for entity resolution has several requirements:

1. Unconstrained algorithm
The algorithms should not require any domain-specific parameters as input, such as the number of clusters or the diameter of the clusters.

2. Capability of handling an incomplete similarity matrix
As the entity resolution process does not compute similarity on every possible pair (which could be described as an N by N matrix), the algorithms must be able to handle an incomplete similarity matrix (or a list of matched pairs).

3. Scalability
Entity resolution often handles sizable datasets, making it important that algorithms are capable of handling such data.

There are three major single-pass clustering algorithms, all of which satisfy these requirements. They are highly efficient, creating clusters in a single scan (O(n) complexity) of the list of candidate pairs, although some of them require the list to be sorted by similarity score:

Single-pass clustering algorithms (source: http://www.vldb.org/pvldb/vol2/vldb09-1025.pdf)
  1. Partitioning (i.e. connected components)
  2. Center Clustering
  3. Merge-Center Clustering

Partitioning/Connected Components

This algorithm starts by assigning each node to its own cluster. It then conducts a single scan of the list of matched pairs; whenever it finds connected nodes that do not belong to the same cluster, it merges their clusters. In short, it forms a cluster by grouping all the nodes connected via edges (i.e. matched records via pairs). Note that it may create clusters that connect dissimilar records via long paths.

Connected components clustering can be performed with the SciPy module, as in the code below. Before performing it, we convert the list of pairs into an adjacency matrix.

from scipy.sparse.csgraph import connected_components

def connected_components_from_pairs(
    pairs: np.ndarray, dim: int
) -> np.ndarray:

    adjacency_matrix = get_adjacency_matrix_from_pairs(pairs, (dim, dim))
    _, clusters = connected_components(
        csgraph=adjacency_matrix, directed=False, return_labels=True
    )

    return clusters
cc_clusters = connected_components_from_pairs(matched_pairs, len(df))

Center Clustering

This algorithm [5] performs clustering where each cluster has a center and all records in each cluster are similar to the center of the cluster. It requires the list of similar pairs to be sorted by descending order of similarity scores. The algorithm then performs clustering by a single scan of the sorted list. When a node u is encountered for the first time in the scan, it's designated as the cluster center. Any subsequent nodes v that are similar to u (i.e., appear in a pair (u, v) in the list) are assigned to the cluster of u and are not considered again during the process.

Example of Center Clustering

Merge-Center Clustering

This algorithm [6] performs similarly to Center clustering, but merges two clusters cᵢ and cⱼ whenever a record that is similar to the center of cluster cᵢ is also similar to the center of cⱼ. Note that when two clusters are merged, a single center node is not chosen, which means that merged clusters can have multiple center nodes. This algorithm can likewise be performed by a single scan of the list of similar pairs, while keeping track of the records that are connected through a merged cluster.

Example of Merge-Center Clustering

To perform Center/Merge-Center clustering, we first need to sort the list of pairs by descending order of the similarity scores.

def sort_pairs(pairs: np.ndarray, scores: np.ndarray) -> np.ndarray:
    sorted_ids = (-1 * scores).argsort()
    return pairs[sorted_ids]

pairs_sorted = sort_pairs(matched_pairs, matched_scores)

Next, the code below yields two sets of pairs: center-child pairs, denoted as center_cluster_pairs, and merged node pairs, referred to as merge_cluster_pairs. We can then generate Center clusters and Merge-Center clusters by applying connected components to these lists of pairs.

def get_center_cluster_pairs(pairs, dim):

    """
    cluster_centers:
        list tracking cluster center for each record.
        indices of the list correspond to the original df indices
        and the values represent assigned cluster centers' indices
    center_cluster_pairs:
        list of pairs of indices representing center-child pairs
    merge_cluster_pairs:
        list of pairs of merged nodes' indices
    """
    cluster_centers = [None] * dim
    center_cluster_pairs = []
    merge_cluster_pairs = []

    for idx1, idx2 in pairs:

        if (
            cluster_centers[idx1] is None
            or cluster_centers[idx1] == idx1
            or cluster_centers[idx2] is None
            or cluster_centers[idx2] == idx2
        ):
            # if at least one node is not already a child, the nodes are merged
            merge_cluster_pairs.append([idx1, idx2])

        if cluster_centers[idx1] is None and cluster_centers[idx2] is None:
            # if both weren't seen before, idx1 becomes center and idx2 gets child
            cluster_centers[idx1] = idx1
            cluster_centers[idx2] = idx1
            center_cluster_pairs.append([idx1, idx2])
        elif cluster_centers[idx2] is None:
            if cluster_centers[idx1] == idx1:
                # if idx1 is center, idx2 is assigned to that cluster
                cluster_centers[idx2] = idx1
                center_cluster_pairs.append([idx1, idx2])
            else:
                # if idx1 is not center, idx2 becomes new center
                cluster_centers[idx2] = idx2
        elif cluster_centers[idx1] is None:
            if cluster_centers[idx2] == idx2:
                # if idx2 is center, idx1 is assigned to that cluster
                cluster_centers[idx1] = idx2
                center_cluster_pairs.append([idx1, idx2])
            else:
                # if idx2 is not center, idx1 becomes new center
                cluster_centers[idx1] = idx1

    return center_cluster_pairs, merge_cluster_pairs
center_cluster_pairs, merge_cluster_pairs = get_center_cluster_pairs(pairs_sorted, len(df))
ct_clusters = connected_components_from_pairs(center_cluster_pairs, len(df))
mc_clusters = connected_components_from_pairs(merge_cluster_pairs, len(df))


Cluster Evaluation

As we have the cluster labels, we can evaluate the quality of the clusters using Rand Index or adjusted Rand Index.
Rand Index is a cluster evaluation metric that represents the proportion of pairs that are correctly clustered together or apart. It is defined as follows:

TP = Number of pairs that are clustered together in both predicted and true clusters.

TN = Number of pairs that are clustered apart in both predicted and true clusters.

Rand Index = (TP + TN) / Total number of possible pairs

Example of Rand Index Calculation
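
As a minimal worked example of this definition (toy labels, invented for illustration):

from sklearn.metrics.cluster import rand_score

labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 1, 2]

# Of the 6 possible pairs: 1 is together in both clusterings (TP) and
# 4 are apart in both (TN); the pair (2, 3) is together in the true
# clusters but apart in the predicted ones.
print(f"{rand_score(labels_true, labels_pred):.3f}")  # (1 + 4) / 6 = 0.833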

Adjusted Rand Index is a modified version of Rand Index that is corrected for chance; the adjustment accounts for the agreement expected from randomly assigned clusterings. In general form:

Adjusted Rand Index = (Rand Index − Expected Rand Index) / (Max Rand Index − Expected Rand Index)

We won't delve into how each term is calculated here; anyone interested in this topic can refer to the paper from KY Yeung, which explains the metric with some examples.
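
To see why the correction matters, the sketch below compares the two scores on hypothetical random labelings: with many clusters, most pairs are apart in both clusterings, so even a random clustering earns a high Rand Index, while the Adjusted Rand Index stays near zero.

import numpy as np
from sklearn.metrics.cluster import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 100, size=1_000)         # "true" clusters
random_labels = rng.integers(0, 100, size=1_000)  # a random clustering

print(f"{rand_score(labels, random_labels):.3f}")           # should be close to 1
print(f"{adjusted_rand_score(labels, random_labels):.3f}")  # should be close to 0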

The code below gives us a comparison of the clusters using those metrics along with some additional basic statistics.

Get Stats Function

from sklearn.metrics.cluster import rand_score, adjusted_rand_score
from IPython.display import display

def get_stats(labels, clusters):

    stats = []
    stats.append(f"{rand_score(labels, clusters):.3f}")
    stats.append(f"{adjusted_rand_score(labels, clusters):.3f}")
    clus_dist = pd.Series(clusters).value_counts()
    stats.append(f"{len(clus_dist):,}")
    stats.append(f"{clus_dist.mean():.3f}")
    stats.append(f"{clus_dist.min():,}")
    stats.append(f"{clus_dist.max():,}")

    return stats

Compare Clusters Function

def compare_clusters(cluster_list, cluster_names, labels):

    stats_dict = {}
    for clusters, name in zip(cluster_list, cluster_names):
        stats = get_stats(labels, clusters)
        stats_dict[name] = stats

    display(
        pd.DataFrame(
            stats_dict,
            index=[
                "Rand Index",
                "Adjusted Rand Index",
                "Cluster Count",
                "Mean Cluster Size",
                "Min Cluster Size",
                "Max Cluster Size",
            ],
        )
    )

Cluster Comparison

cluster_list = [cc_clusters, ct_clusters, mc_clusters]
cluster_names = ["Connected Components", "Center", "Merge-Center"]
compare_clusters(cluster_list, cluster_names, df.CID)
Metric Connected Components Center Merge-Center
Rand Index 1.000 1.000 1.000
Adjusted Rand Index 0.782 0.591 0.784
Cluster Count 79,229 90,994 79,267
Mean Cluster Size 1.609 1.401 1.608
Min Cluster Size 1 1 1
Max Cluster Size 86 6 86

  • Connected components produce the largest clusters with the fewest clusters, while the gap between connected components and Merge-Center is minimal.
  • Center clustering yields the smallest clusters with the highest cluster count.
  • All three clusterings have a near-perfect Rand Index, as they produce a large number of clusters, making inter-cluster pairs dominant (i.e. even random clustering yields a respectable Rand Index).
  • By the Adjusted Rand Index, Merge-Center clustering is the best, although its difference from connected components is marginal.

Another Metric: B-Cubed

This metric from Bagga et al. is another clustering evaluation metric that considers precision and recall. For each record e, let C(e) be its predicted cluster and L(e) its true cluster; then

B-Cubed Precision(e) = |C(e) ∩ L(e)| / |C(e)|
B-Cubed Recall(e) = |C(e) ∩ L(e)| / |L(e)|

and the overall B-Cubed precision and recall are the averages of these per-record scores. The metric is typically used with text documents, but it can be applied to any clustering task.
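
A minimal sketch of the metric, assuming dense label arrays; the O(n²) loop is written for clarity, not efficiency:

import numpy as np

def b_cubed(labels_true: np.ndarray, labels_pred: np.ndarray) -> tuple[float, float]:
    """Average per-record B-Cubed precision and recall."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    precisions, recalls = [], []
    for i in range(len(labels_true)):
        in_pred_cluster = labels_pred == labels_pred[i]  # records in i's predicted cluster
        in_true_cluster = labels_true == labels_true[i]  # records in i's true cluster
        overlap = (in_pred_cluster & in_true_cluster).sum()
        precisions.append(overlap / in_pred_cluster.sum())
        recalls.append(overlap / in_true_cluster.sum())
    return float(np.mean(precisions)), float(np.mean(recalls))

# e.g. b_cubed(df.CID.values, mc_clusters)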

Conclusion

That concludes our exploration of the entity resolution framework.
How you proceed with the created clusters depends on your specific business requirements or use case.
If you aim to establish a canonical representation for each cluster, you can achieve this by extracting the most representative value (such as the most frequent value) for each field within each cluster.
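
For instance, a minimal sketch along those lines (the field selection and tie-breaking here are arbitrary assumptions): attach the cluster labels to the dataframe and take the modal value of each field within each cluster.

# Most frequent non-null value per field within each cluster
clustered = df.assign(cluster=mc_clusters)

def most_frequent(values: pd.Series):
    modes = values.mode()  # NaNs are dropped by default
    return modes.iloc[0] if not modes.empty else None

canonical = (
    clustered.groupby("cluster")[["title", "artist", "album", "number"]]
    .agg(most_frequent)
)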

References

[1] Christophides et al., End-to-End Entity Resolution for Big Data: A Survey (2019)

[2] Papadakis et al., Comparative analysis of approximate blocking techniques for entity resolution (2016)

[3] Papadakis et al., A survey of blocking and filtering techniques for entity resolution (2020)

[4] Hassanzadeh et al., Framework for evaluating clustering algorithms in duplicate detection (2009)

[5] Haveliwala et al., Scalable techniques for clustering the web (2000)

[6] Hassanzadeh & Miller, Creating probabilistic databases from duplicated data (2009)
