Clustering for entity resolution has several requirements:
1. Unconstrained algorithm
The algorithms should not require any domain-specific parameters as input, such as the number of clusters or the diameter of the clusters.
2. Capability of handling an incomplete similarity matrix
As the entity resolution process does not compute similarity on every possible pair (which could be described as an N by N matrix), the algorithms must be able to handle an incomplete similarity matrix (or a list of matched pairs).
3. Scalability
Entity resolution often handles sizable datasets, making it important that algorithms are capable of handling such data.
There are three major single-pass clustering algorithms:
Center Clustering
This algorithm [5] performs clustering where each cluster has a center and all records in each cluster are similar to the center of the cluster. It requires the list of similar pairs to be sorted by descending order of similarity scores. The algorithm then performs clustering by a single scan of the sorted list. When a node u is encountered for the first time in the scan, it's designated as the cluster center. Any subsequent nodes v that are similar to u (i.e., appear in a pair (u, v) in the list) are assigned to the cluster of u and are not considered again during the process.
Example of Center Clustering
Merge-Center Clustering
This algorithm [6] works similarly to Center clustering, but it merges two clusters cᵢ and cⱼ whenever a record that is similar to the center of cᵢ is also similar to the center of cⱼ. Note that when two clusters are merged, no single center node is chosen, which means that merged clusters can have multiple center nodes. This algorithm can likewise be performed with a single scan of the list of similar pairs, while keeping track of the records that become connected through merged clusters.
Example of Merge-Center Clustering
To perform Center/Merge-Center clustering, we first need to sort the list of pairs by descending order of the similarity scores.
Next, the code below yields two sets of pairs: center-child pairs, denoted as center_cluster_pairs, and merged node pairs, referred to as merge_cluster_pairs. We can then generate Center clusters and Merge-Center clusters by applying connected components to these lists of pairs.
```python
def get_center_cluster_pairs(pairs, dim):
    """
    cluster_centers:
        list tracking the cluster center of each record.
        indices of the list correspond to the original df indices
        and the values represent the assigned cluster centers' indices
    center_cluster_pairs:
        list of pairs of indices representing center-child pairs
    merge_cluster_pairs:
        list of pairs of merged nodes' indices
    """
    cluster_centers = [None] * dim
    center_cluster_pairs = []
    merge_cluster_pairs = []
    for idx1, idx2 in pairs:
        if (
            cluster_centers[idx1] is None or cluster_centers[idx1] == idx1
            or cluster_centers[idx2] is None or cluster_centers[idx2] == idx2
        ):
            # if neither node is a child, those nodes are merged
            merge_cluster_pairs.append([idx1, idx2])

        if cluster_centers[idx1] is None and cluster_centers[idx2] is None:
            # if neither was seen before, idx1 becomes the center and idx2 its child
            cluster_centers[idx1] = idx1
            cluster_centers[idx2] = idx1
            center_cluster_pairs.append([idx1, idx2])
        elif cluster_centers[idx2] is None:
            if cluster_centers[idx1] == idx1:
                # if idx1 is a center, idx2 is assigned to that cluster
                cluster_centers[idx2] = idx1
                center_cluster_pairs.append([idx1, idx2])
            else:
                # if idx1 is not a center, idx2 becomes a new center
                cluster_centers[idx2] = idx2
        elif cluster_centers[idx1] is None:
            if cluster_centers[idx2] == idx2:
                # if idx2 is a center, idx1 is assigned to that cluster
                cluster_centers[idx1] = idx2
                center_cluster_pairs.append([idx1, idx2])
            else:
                # if idx2 is not a center, idx1 becomes a new center
                cluster_centers[idx1] = idx1
    return center_cluster_pairs, merge_cluster_pairs
```
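As a rough usage sketch (assuming `candidate_pairs` holds the matched index pairs with their similarity scores and `dim` is the number of records; both names are placeholders), we can sort the pairs, build the two pair lists, and turn each into cluster labels with SciPy's connected components:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def pairs_to_cluster_labels(pairs, dim):
    """Build a sparse adjacency matrix from index pairs and label its connected components."""
    pairs = np.array(pairs, dtype=int).reshape(-1, 2)
    adjacency = csr_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(dim, dim)
    )
    _, labels = connected_components(adjacency, directed=False)
    return labels

# candidate_pairs: [(idx1, idx2, similarity_score), ...]  -- placeholder structure
sorted_pairs = [
    (i, j) for i, j, _ in sorted(candidate_pairs, key=lambda p: p[2], reverse=True)
]
center_pairs, merge_pairs = get_center_cluster_pairs(sorted_pairs, dim)
center_labels = pairs_to_cluster_labels(center_pairs, dim)
merge_center_labels = pairs_to_cluster_labels(merge_pairs, dim)
```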
Now that we have the cluster labels, we can evaluate the quality of the clusters using the Rand Index or the Adjusted Rand Index.
Rand Index is a cluster evaluation metric that represents the proportion of pairs that are correctly clustered together or apart. It is defined as follows:
TP = Number of pairs that are clustered together in both predicted and true clusters.
TN = Number of pairs that are clustered apart in both predicted and true clusters.
Rand Index = (TP + TN) / Total number of possible pairs
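As a quick illustration, suppose we have three records a, b, and c, the true clusters are {a, b} and {c}, and the predicted clustering puts all three records in one cluster. Of the three possible pairs, only (a, b) is together in both clusterings (TP = 1) and no pair is apart in both (TN = 0), so the Rand Index is (1 + 0) / 3 ≈ 0.33.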
Example of Rand Index Calculation
Adjusted Rand Index is a modified version of Rand Index that is corrected for chance. The adjustment accounts for the possibility of random agreement from clustering results that were randomly assigned.
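In its general form, the adjustment can be written as:

Adjusted Rand Index = (Rand Index - Expected Rand Index) / (Max Rand Index - Expected Rand Index)

where the expected and maximum values are computed under random cluster assignments with the same cluster sizes.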
We won't delve into how each term in the above equation is calculated here, but anyone interested in this topic can refer to the paper by K. Y. Yeung, which explains the metric with some examples.
The code below gives us a comparison of the clusters using those metrics along with some additional basic statistics.
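A minimal sketch of such a comparison, assuming the cluster label arrays `cc_labels` (connected components), `center_labels`, and `merge_center_labels` plus ground-truth labels `true_labels` (all placeholder names):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score, rand_score

def cluster_summary(name, labels, true_labels):
    """Basic cluster statistics plus (adjusted) Rand Index against the ground truth."""
    _, sizes = np.unique(labels, return_counts=True)
    return {
        "clustering": name,
        "num_clusters": len(sizes),
        "max_cluster_size": sizes.max(),
        "mean_cluster_size": sizes.mean(),
        "rand_index": rand_score(true_labels, labels),
        "adjusted_rand_index": adjusted_rand_score(true_labels, labels),
    }

comparison = pd.DataFrame(
    [
        cluster_summary(name, labels, true_labels)
        for name, labels in [
            ("connected components", cc_labels),
            ("center", center_labels),
            ("merge-center", merge_center_labels),
        ]
    ]
)
print(comparison)
```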
Connected components produce the largest clusters with the fewest clusters overall, while the gap between connected components and Merge-Center clusters is minimal.
Center clustering yields the smallest clusters with the highest cluster count.
All three clusterings have a perfect Rand Index.
Looking at the Adjusted Rand Index, Merge-Center clustering is the best, although its difference from connected components is marginal.
Another Metric: B-Cubed
This metric from Bagga et al. is another evaluation metric for clustering that considers precision and recall. For each record e, it is defined as follows:

B-Cubed Precision(e) = |C(e) ∩ L(e)| / |C(e)|
B-Cubed Recall(e) = |C(e) ∩ L(e)| / |L(e)|

where C(e) is the predicted cluster and L(e) is the true cluster for record e. The overall B-Cubed precision and recall are the averages of these per-record values.
This is typically used with text documents, but it can be applied to any clustering task.
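A minimal sketch of computing B-Cubed precision and recall from predicted and true label arrays (assuming `pred_labels` and `true_labels` are aligned 1-D arrays; both names are placeholders):

```python
import numpy as np

def bcubed_precision_recall(pred_labels, true_labels):
    """Average per-record precision/recall of the predicted cluster vs. the true cluster."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    precisions, recalls = [], []
    for pred, true in zip(pred_labels, true_labels):
        pred_cluster = pred_labels == pred   # records sharing the predicted cluster
        true_cluster = true_labels == true   # records sharing the true cluster
        overlap = np.sum(pred_cluster & true_cluster)
        precisions.append(overlap / pred_cluster.sum())
        recalls.append(overlap / true_cluster.sum())
    return np.mean(precisions), np.mean(recalls)
```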
Conclusion
That concludes our exploration of the entity resolution framework.
How you proceed with the created clusters depends on your specific business requirements or use case.
If you aim to establish a canonical representation for each cluster, you can achieve this by extracting the most representative value (such as the most frequent value) for each field within each cluster.
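For instance, a rough sketch with pandas, assuming the records live in a DataFrame `df` with a cluster label column `cluster_id` (names and columns are placeholders):

```python
import pandas as pd

def most_frequent(values: pd.Series):
    """Return the most frequent non-null value of a field within a cluster, if any."""
    values = values.dropna()
    return values.mode().iloc[0] if not values.empty else None

# one canonical record per cluster, built field by field
canonical_records = (
    df.groupby("cluster_id")[["Artist", "Title", "Album", "Year"]].agg(most_frequent)
)
```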
Each record represents a song with attributes such as artist, title, album, year, etc.
Please note that this benchmark is a single dataset; if you have multiple data sources whose entities you want to resolve, you need to standardize their data schemas and consolidate them into a unified dataset before proceeding with the subsequent steps.
By doing so, the process narrows its search to only consider comparisons within each block, rather than examining all possible record pairs in the dataset.
This significantly reduces the number of comparisons and accelerates the ER process.
Because it skips many comparisons, it can lead to missed true matches.
Therefore, blocking keys need to be chosen carefully so that true duplicates are still likely to share a block.
For example, in our dataset, one might create blocks based on the `Artist` or `Title` field.
The slightest difference in the blocking keys of duplicates places them in different blocks.
```python
import pandas as pd
from collections import defaultdict

def standard_blocking(field_values: pd.Series) -> dict[str, list]:
    # group record indices by the exact value of the blocking field
    blocks = defaultdict(list)
    for idx, key in enumerate(field_values):
        if key is not None:
            blocks[key].append(idx)
    return blocks
```
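For instance, a hypothetical call on the songs data, assuming it is loaded in a DataFrame `df` with `Artist` and `Title` columns:

```python
artist_blocks = standard_blocking(df["Artist"])
title_blocks = standard_blocking(df["Title"])
print(f"{len(artist_blocks)} artist blocks, {len(title_blocks)} title blocks")
```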
substrings of length `n`
, regardless of the associated attributes
We will compare the performance of the three approaches discussed in this section after performing block processing and entity matching in the next two sections.
We will explore some of the major block-processing techniques in this section.
(or a list of pairs as each of the refined blocks only has a single pair of records).
Afterwards, we get pairs from the pruned adjacency matrix.
In the case of the standard blocks, we obtain a union of the three independent blocks.
First, we convert the blocks into a list of adjacency matrices.
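A minimal sketch of that conversion, assuming each blocking method returned a dict like the one from `standard_blocking` and `dim` is the number of records (`blocks_to_adjacency_matrix` is a hypothetical helper name):

```python
import numpy as np
from itertools import combinations
from scipy.sparse import csr_matrix

def blocks_to_adjacency_matrix(blocks: dict, dim: int) -> csr_matrix:
    """Mark every within-block record pair in a sparse 0/1 adjacency matrix."""
    rows, cols = [], []
    for record_ids in blocks.values():
        for idx1, idx2 in combinations(sorted(record_ids), 2):
            rows.append(idx1)
            cols.append(idx2)
    matrix = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(dim, dim))
    matrix.data[:] = 1  # binarize pairs that co-occur in several blocks
    return matrix
```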
We will determine which one is the best for our data by looking at a matching result in the next section.
While there are various methods to find matches, one straightforward approach can be outlined as follows:
1. Compute a similarity score for each field of a candidate pair.
2. Combine the field-level scores into an overall similarity score.
3. Identify the pair as a match if the overall score exceeds a threshold.
For some of the similarity metrics, text fields may need to be tokenized before computing similarity.
After computing field-specific similarity scores, we want to combine them into an overall similarity score, as outlined in step 2 above.
We take a very simple approach here: we just calculate the average of the attributes' scores and then apply a score threshold to identify matches (step 3).
The threshold value used here is already tuned, but you may want to tune it by looking at some examples of matched and unmatched pairs when you work on your own dataset.
The code above performs matching on the pairs from the standard block.
Additionally, we extend this matching process to the pairs from the token block and the sorted neighborhood, allowing us to compare their performances.
As Token Blocking likely has the fewest missed matches, we will proceed with the outcome from this approach. It is worth noting that our small dataset does not present scalability concerns. However, for larger datasets where Token Blocking may not be feasible, you may want to consider the other more scalable approaches.
If you have labeled data or have manually labeled sample pairs as matches or non-matches, you can train a machine-learning model to predict matched pairs. As our data has cluster labels `CID`, we will convert these into matching labels (match/unmatch) for pairs and train a machine-learning model, subsequently comparing its performance with the rule-based approach performed in the previous section.
The following code generates the model input `X` and the corresponding target variable `y`.
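A rough sketch of what this could look like, assuming each candidate pair's per-field similarity scores are stored in a DataFrame `pair_scores` with the two record indices in `idx1`/`idx2` and one column per field, and the records with their `CID` labels in `df` (all names are placeholders):

```python
feature_cols = [c for c in pair_scores.columns if c not in ("idx1", "idx2")]

# model input: one row of field-level similarity scores per candidate pair
X = pair_scores[feature_cols].values

# target: 1 if both records share the same true cluster id, else 0
y = (
    df.loc[pair_scores["idx1"], "CID"].values
    == df.loc[pair_scores["idx2"], "CID"].values
).astype(int)
```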
While the model's performance is better, the rule-based approach may still be reasonably good.
For the following steps, we will use matches identified through the rule-based approach, considering that in many real-world cases, manual data labeling might not be practical due to resource constraints.
In this step, we are creating entity clusters based on the matched pairs from the previous step.
Each cluster includes all records corresponding to a distinct real-world entity.
all of which satisfy the requirements. These algorithms are highly efficient as they create clusters by a single scan (or O(n) complexity) of the list of candidate pairs, although some of them require the list to be sorted by similarity score.
This algorithm starts by assigning each node to its own cluster.
Then, it conducts a single scan of the list of matched pairs.
If it identifies connected nodes that do not belong to the same cluster, it merges their clusters.
In short, any two records connected by a path of matched pairs end up in the same cluster.
The algorithm may create clusters that connect dissimilar records via long paths.
Connected components clustering can be performed using the Scipy module as you can see in the code below.
Before performing it, you want to convert the list of pairs into an adjacency matrix.
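A minimal sketch of this, assuming `matched_pairs` is the list of matched index pairs and `dim` is the number of records (both placeholder names):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rows = [i for i, _ in matched_pairs]
cols = [j for _, j in matched_pairs]
adjacency = csr_matrix((np.ones(len(matched_pairs)), (rows, cols)), shape=(dim, dim))

# each connected component of the match graph becomes one entity cluster
n_clusters, cluster_labels = connected_components(adjacency, directed=False)
```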
, as they have a large number of clusters, making inter-cluster pairs dominant (i.e., even random clustering yields a respectable Rand Index).