6.3.4.2. Frequent Itemsets#

Now that we’ve covered the key terminology, let’s return to our main goal: identifying frequent patterns.

As the name suggests, frequent itemsets are combinations of items that appear together frequently in transactions. Instead of only stating whether an itemset is frequent or not, we assign a score that reflects how often it appears. These scores help us decide which patterns are significant.

Frequent Itemset Metrics#

We have a few different ways to measure how frequent an itemset is:

  • Support Count: This is the most basic and intuitive metric. The support count of an itemset is simply the number of transactions in which the itemset appears.
From our example market-basket transactions, the support count of the itemset {Milk, Bread} is 3 (it appears in transactions 1, 3, and 5).

  • Support: This gives us a relative measure of frequency. It helps us compare itemsets fairly across datasets of different sizes. The support of an itemset is the fraction of total transactions that contain it.

    \[ \text{Support}(X) = \frac{\text{Support Count}(X)}{\text{Total Number of Transactions}} \]

In the example market-basket transactions dataset, the support of {Milk, Bread} is:

    \[ \text{Support} = \frac{3}{5} = 0.6 \]

    This means 60% of the transactions include both Milk and Bread.
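Both metrics are easy to compute directly. The sketch below uses a hypothetical five-transaction dataset chosen to be consistent with the counts quoted in this section (the original example table is assumed, not reproduced here):

```python
# Hypothetical 5-transaction dataset, consistent with the support
# counts quoted in this section (the original table is assumed).
transactions = [
    {"Milk", "Bread", "Butter"},  # transaction 1
    {"Bread", "Butter"},          # transaction 2
    {"Milk", "Bread"},            # transaction 3
    {"Butter", "Beer"},           # transaction 4
    {"Milk", "Bread"},            # transaction 5
]

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions containing `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread"}, transactions))  # 3
print(support({"Milk", "Bread"}, transactions))        # 0.6
```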

Calculating Frequent Itemsets#

To identify which itemsets are frequent, we define a threshold:

  • Minimum Support (e.g., 0.6) or

  • Minimum Support Count (e.g., 3 if we have 5 transactions)

Only the itemsets that meet or exceed this threshold are considered frequent itemsets.

Example: Finding Frequent Itemsets#

Let’s use the dataset above again and set:

  • Minimum Support = 0.6

Step 1: Generate All Itemsets

We list all possible itemsets from the dataset.

  • 1-itemsets: {Milk}, {Bread}, {Butter}, {Beer}

  • 2-itemsets: {Milk, Bread}, {Milk, Butter}, {Milk, Beer}, {Bread, Butter}, {Bread, Beer}, {Butter, Beer}

  • 3-itemsets: {Milk, Bread, Butter}, etc.

Step 2: Compute Support and Filter

| Itemset | Support Count | Support |
|---|---|---|
| {Milk} | 3 | 0.6 ✅ |
| {Bread} | 4 | 0.8 ✅ |
| {Butter} | 3 | 0.6 ✅ |
| {Beer} | 1 | 0.2 ❌ |
| {Milk, Bread} | 3 | 0.6 ✅ |
| {Milk, Butter} | 1 | 0.2 ❌ |
| {Bread, Butter} | 2 | 0.4 ❌ |
| {Milk, Bread, Butter} | 1 | 0.2 ❌ |

✅ = Frequent Itemsets (Support ≥ 0.6)

So, our frequent itemsets are:

  • {Milk}

  • {Bread}

  • {Butter}

  • {Milk, Bread}
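The filtering above can be reproduced by brute force: enumerate every candidate itemset and keep those whose support meets the threshold. A minimal sketch, assuming a five-transaction dataset reconstructed to match the counts in the table (the original transaction table is not shown here):

```python
from itertools import combinations

# Assumed dataset, consistent with the support counts in the table above.
transactions = [
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter"},
    {"Milk", "Bread"}, {"Butter", "Beer"}, {"Milk", "Bread"},
]
min_support = 0.6
items = sorted(set().union(*transactions))

frequent = {}
for k in range(1, len(items) + 1):          # 1-itemsets, 2-itemsets, ...
    for cand in map(frozenset, combinations(items, k)):
        sup = sum(1 for t in transactions if cand <= t) / len(transactions)
        if sup >= min_support:
            frequent[cand] = sup

for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

This recovers exactly the four ✅ itemsets from the table. Brute force is fine at this scale, but the candidate count explodes exponentially with the number of items, which is what the algorithms later in this section address.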

Types of Frequent Itemsets#

Frequent itemsets can be further categorized based on their properties and the nature of their supersets. These specialized forms help reduce redundancy and optimize rule generation.

Maximal Frequent Itemsets#

An itemset is maximal frequent if none of its immediate supersets is frequent. In other words, it is frequent, but adding any additional item makes it infrequent.

Example: If {A, B, C} is frequent, but none of {A, B, C, D}, {A, B, C, E} etc. are frequent, then {A, B, C} is a maximal frequent itemset.

Closed Frequent Itemsets#

An itemset is closed if none of its immediate supersets has the same support count as the itemset itself.

This helps in retaining complete information about the frequency of itemsets, while reducing the total number of itemsets compared to all frequent itemsets.

Example: If both {A, B} and {A, B, C} are frequent, and both have a support count of 3, then {A, B} is not closed, because {A, B, C} (its superset) has the same support.

On the other hand, if {A, B} has a support count of 3, and {A, B, C} has a support count of 2, then {A, B} is a closed itemset.

Example: Maximal vs Closed Frequent Itemsets#

Transaction Table:

| TID | Items |
|---|---|
| 1 | {A, B} |
| 2 | {B, C, D} |
| 3 | {A, B, C, D} |
| 4 | {A, B, D} |
| 5 | {A, B, C, D} |

Support threshold: 3

Let’s evaluate some itemsets:

  • {A, B} appears in TIDs 1, 3, 4, 5 → support = 4

  • {A, B, D} appears in TIDs 3, 4, 5 → support = 3

  • {A, B, C} appears in TIDs 3, 5 → support = 2

Now:

  • {A, B} is frequent and has a superset {A, B, D} which is also frequent → so {A, B} is not maximal

  • {A, B, D} has no frequent supersets → it is maximal

  • {A, B} has support 4, while its supersets {A, B, C} and {A, B, D} have supports 2 and 3 → no superset has the same support, so {A, B} is closed

  • {A, B, D} has no superset with the same support → it is closed

Conclusion:

  • {A, B, D} is both closed and maximal

  • {A, B} is frequent and closed, but not maximal (its superset {A, B, D} is still frequent)
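These classifications can be checked mechanically over the transaction table above. One useful fact: for a frequent itemset, any superset with equal support is itself frequent, so it suffices to compare against frequent supersets only.

```python
from itertools import combinations

transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"A", "B", "C", "D"},
]
min_count = 3

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = {
    s: count(s)
    for k in range(1, len(items) + 1)
    for s in map(frozenset, combinations(items, k))
    if count(s) >= min_count
}

# Maximal: frequent, with no frequent proper superset.
maximal = {s for s in frequent if not any(s < t for t in frequent)}
# Closed: frequent, with no proper superset of equal support.
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}

print(sorted(map(sorted, maximal)))  # [['A', 'B', 'D'], ['B', 'C', 'D']]
print(sorted(map(sorted, closed)))
```

Running this confirms the conclusion: {A, B, D} (and also {B, C, D}) is both closed and maximal, while {A, B} is closed but not maximal.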

Note

Given d items, there are 2^d possible itemsets. Compact representations such as closed and maximal frequent itemsets help reduce redundancy and the computational load during pattern mining.

Generating Itemsets#

Generating all itemsets manually or via brute-force is simple conceptually but computationally expensive. If a dataset contains d unique items, the number of possible itemsets is:

\[ 2^d - 1 \]

This grows exponentially as the number of items increases.

Itemset Generation Approaches#

  • Bottom-Up

    • Start with all 1-itemsets.

    • Use only frequent 1-itemsets to generate 2-itemsets.

    • Filter out non-frequent 2-itemsets.

    • Continue to generate larger itemsets from smaller frequent ones.

    • This process prunes the search space significantly.

  • Top-Down

    • Use compressed data structures like FP-trees to capture item frequencies and co-occurrences.

    • Recursively explore and split into frequent patterns.

    • Generally faster than Apriori for large datasets.

Between the two, bottom-up approaches are easier to understand and implement manually, whereas top-down is more efficient for large-scale mining.

Apriori Algorithm#

The Apriori Algorithm is one of the foundational algorithms in frequent pattern mining. It uses the Apriori Principle to efficiently reduce the search space of candidate itemsets.

Apriori Principle: Downward Closure Property#

If an itemset is frequent, then all of its subsets must also be frequent.

This is also known as the anti-monotonic property. It helps us eliminate large chunks of the search space without checking all possible combinations.

Mathematical Expression:

Given two itemsets \(X \subseteq Y\), the Apriori principle states:

\[ s(X) \geq s(Y) \]

Where:

  • \(s(X)\) is the support of itemset \(X\)

  • \(s(Y)\) is the support of its superset \(Y\)
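The inequality can be verified exhaustively on any small dataset, since adding an item to an itemset can only shrink (or keep equal) the set of transactions that contain it. A quick check, using an assumed toy dataset:

```python
from itertools import combinations

# Assumed toy dataset for illustration.
transactions = [
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter"},
    {"Milk", "Bread"}, {"Butter", "Beer"}, {"Milk", "Bread"},
]

def s(itemset):  # support of an itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = set().union(*transactions)
# For every itemset X and every superset Y = X ∪ {extra}: s(X) >= s(Y).
for k in range(1, len(items)):
    for X in map(set, combinations(sorted(items), k)):
        for extra in items - X:
            assert s(X) >= s(X | {extra})
print("anti-monotonicity holds for every subset/superset pair")
```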

Visualizing the Apriori Search Space#

To better understand how Apriori prunes the search tree, compare the following two images:

Full Search Tree: Itemset Tree

Pruned Tree Using Apriori: Pruned Tree

Instead of evaluating all \(2^5 = 32\) subsets from 5 items, Apriori reduces it drastically using support-based pruning.

Example#

Let’s walk through the Apriori process using the dataset illustrated:

| Transaction ID | Items Purchased |
|---|---|
| 1 | Milk, Bread |
| 2 | Beer, Eggs, Diapers, Bread |
| 3 | Milk, Diapers, Beer, Coke |
| 4 | Beer, Diapers, Bread, Milk |
| 5 | Milk, Diapers, Bread, Coke |

Minimum Support Threshold: 3

Step 1: 1-itemsets (Single Items)#

| Item | Count |
|---|---|
| Bread | 4 |
| Coke | 2 |
| Milk | 4 |
| Beer | 3 |
| Diapers | 4 |
| Eggs | 1 |

→ Remove infrequent items (Coke, Eggs) → Remaining items: {Bread, Milk, Beer, Diapers}

Step 2: 2-itemsets (Pairs)#

| Itemset | Count |
|---|---|
| {Bread, Milk} | 3 |
| {Bread, Diapers} | 3 |
| {Milk, Diapers} | 3 |
| {Beer, Diapers} | 3 |

→ Infrequent pairs like {Bread, Beer}, {Milk, Beer} are pruned

Step 3: 3-itemsets (Triplets)#

The only candidate triplet whose 2-item subsets are all frequent is {Bread, Milk, Diapers}:

| Itemset | Count |
|---|---|
| {Bread, Milk, Diapers} | 2 |

It appears only in transactions 4 and 5, so its support count of 2 falls below the threshold of 3 and it is pruned; the algorithm then stops. The frequent itemsets are therefore the four single items from Step 1 and the four pairs from Step 2. Note that candidates such as {Beer, Bread, Diapers} were never counted at all, because their subset {Bread, Beer} is already infrequent → this is the Apriori Principle at work.
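The whole level-wise procedure can be sketched in a few lines. This is a simplified Apriori (join frequent (k−1)-itemsets into candidates, prune candidates with an infrequent subset, then count), not an optimized implementation:

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread"},
    {"Beer", "Eggs", "Diapers", "Bread"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Beer", "Diapers", "Bread", "Milk"},
    {"Milk", "Diapers", "Bread", "Coke"},
]

def apriori(transactions, min_count):
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = set().union(*transactions)
    level = {frozenset([i]) for i in items
             if count(frozenset([i])) >= min_count}
    frequent = {s: count(s) for s in level}
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that yield k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori principle): drop candidates that have an
        # infrequent (k-1)-subset before counting them at all.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates if count(c) >= min_count}
        frequent.update((s, count(s)) for s in level)
        k += 1
    return frequent

freq = apriori(transactions, min_count=3)
for s in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(sorted(s), freq[s])
```

With a minimum support count of 3 this reproduces Steps 1 and 2 exactly: four frequent single items and four frequent pairs.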

FP-Growth (Frequent Pattern Growth)#

FP-Growth is an efficient algorithm for mining frequent itemsets without the need to generate candidate itemsets, unlike the Apriori algorithm. It uses a compact data structure called the FP-Tree (Frequent Pattern Tree).

How It Works#

  1. Build the FP-Tree:

    • Scan the dataset once to compute item frequencies.

    • Discard infrequent items (based on min support threshold).

    • Sort frequent items in descending order and insert transactions into a prefix tree (FP-Tree), sharing common prefixes.

  2. Mine the FP-Tree:

    • Start from the least frequent item (bottom-up).

    • Extract conditional pattern bases (subsets of paths).

    • Construct conditional FP-Trees and recursively mine them.
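The tree-building phase (step 1) can be sketched as follows; the recursive mining phase is omitted for brevity. The `Node` class is illustrative, and the transactions are those of the worked example below:

```python
from collections import Counter

# Example transactions (min support count = 3), as in the example below.
transactions = [
    {"A", "B", "D", "E"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
    {"B", "E"}, {"A", "B", "C", "E"},
]
min_count = 3

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

# Pass 1: count item frequencies and drop infrequent items.
freq = Counter(i for t in transactions for i in t)
freq = {i: n for i, n in freq.items() if n >= min_count}

# Pass 2: insert each transaction with its surviving items sorted by
# descending frequency, so common prefixes share tree nodes.
root = Node(None)
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i)):
        node = node.children.setdefault(item, Node(item))
        node.count += 1

# Resulting tree: B:5 -> E:5 -> {A:3 -> C:2, C:1}
print(root.children["B"].count)  # 5
```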

FP-Growth Example#

Transactions:

| Transaction | Itemset |
|---|---|
| T1 | {A, B, D, E} |
| T2 | {B, C, E} |
| T3 | {A, B, C, E} |
| T4 | {B, E} |
| T5 | {A, B, C, E} |

Minimum Support Threshold = 3

Step 1: Count item frequencies:

| Itemset | Count |
|---|---|
| {A} | 3 |
| {B} | 5 |
| {C} | 3 |
| {D} | 1 |
| {E} | 5 |

→ Remove D (support < 3)

Step 2: Order frequent items by frequency:

| Itemset | Count |
|---|---|
| {B} | 5 |
| {E} | 5 |
| {A} | 3 |
| {C} | 3 |

Step 3: Build the FP-Tree:

Insert reordered transactions into the tree:

| Transaction | Itemset |
|---|---|
| T1 | {B, E, A} |
| T2 | {B, E, C} |
| T3 | {B, E, A, C} |
| T4 | {B, E} |
| T5 | {B, E, A, C} |

FP-Tree (simplified view):

       [B:5]
        |
       [E:5]
      /     \
   [A:3]   [C:1]
     |
   [C:2]

Step 4: Mine patterns (bottom-up):

Start with C:

  • Conditional Pattern Base for C (prefix paths of C, with C’s counts):

    • {B, E, A}: 2

    • {B, E}: 1

  • Conditional FP-Tree for C: [B:3], [E:3] (A is dropped, since its count of 2 is below the threshold)

  • Frequent patterns with C:

    • {B, C}: support = 3

    • {E, C}: support = 3

    • {B, E, C}: support = 3

    • ({B, E, A, C} is not generated: A was pruned from C’s conditional tree, as its support with C is only 2)

Continue similarly for A, E, B.
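Each pattern involving C can be double-checked by counting directly in the original transactions:

```python
from itertools import combinations

transactions = [
    {"A", "B", "D", "E"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
    {"B", "E"}, {"A", "B", "C", "E"},
]

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Support count of every itemset that contains C.
for k in range(4):
    for combo in combinations(sorted({"A", "B", "E"}), k):
        pattern = set(combo) | {"C"}
        print(sorted(pattern), count(pattern))
```

Only {C}, {B, C}, {C, E}, and {B, E, C} reach the threshold of 3; every pattern pairing A with C has support 2.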

Key Advantages#

  • No Candidate Generation: Unlike Apriori, avoids costly generation of candidate sets.

  • Compact FP-Tree: Shares common prefixes, reducing space usage.

  • Faster for Dense Data: Especially efficient when transactions share many items.