CAP 5771 Spring 25

This is the web page for Introduction to Data Science at the University of Florida.

Implementing Frequent Pattern Mining in Database Management Systems

Apriori is a classical data-mining algorithm for frequent itemset mining and association rule learning over transactional databases. It is designed to operate on databases containing transactions, such as purchases by customers of a store.

In this tutorial, we will implement the Apriori and FP-Tree algorithms in a database engine. To complete the queries below you can use either SQLite or DuckDB. SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. DuckDB is an embeddable SQL OLAP database management system written in C++ and designed for high performance on analytical queries. Both are open-source databases that support SQL: DuckDB is optimized for analytical queries, while SQLite is a general-purpose database. Both are easy to use and can run in-memory or on disk.

We will also discuss implementations of these algorithms in Python using the mlxtend library.



Installation

Using the GNU/Linux command line, you can install the DuckDB CLI and its Python package with the following commands:

curl https://install.duckdb.org | sh
pip install duckdb

You can install SQLite3 in Ubuntu using the following command:

sudo apt install sqlite3 libsqlite3-dev

To run a REPL shell of SQLite 3 or DuckDB, you can use the following commands:

sqlite3 mydatabase.db
duckdb mydatabase.db

SQL WITH and WITH RECURSIVE Clause

The WITH clause in SQL is used to define a temporary table that can be referenced in the main query. It is useful for breaking down complex queries into smaller, more manageable parts. The WITH clause is also known as a Common Table Expression (CTE). The WITH RECURSIVE clause is used to define a recursive CTE. It allows you to define a CTE that references itself, enabling you to perform recursive queries. We will use the WITH clause below.
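
As a quick illustration (independent of the transactions table used later), the recursive CTE below generates the numbers 1 through 5 and runs unchanged in both SQLite and DuckDB:

WITH RECURSIVE numbers(n) AS (
    SELECT 1                                 -- base case
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 5    -- recursive step
)
SELECT n FROM numbers;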

For more resources, see the SQLite and DuckDB documentation on the WITH clause and common table expressions.

Two-step Example

Step one: Create the transactions table

CREATE TABLE transactions (
    transaction_id INTEGER,
    item TEXT
);

INSERT INTO transactions (transaction_id, item) VALUES
(1, 'Milk'), (1, 'Bread'), (1, 'Butter'),
(2, 'Milk'), (2, 'Bread'),
(3, 'Milk'), (3, 'Diapers'),
(4, 'Bread'), (4, 'Butter'),
(5, 'Diapers'), (5, 'Juice'),
(6, 'Milk'), (6, 'Juice'),
(7, 'Bread'), (7, 'Butter'), (7, 'Diapers');

Step two: Implement the Apriori Algorithm

To implement this, the key steps in SQL are as follows:

  1. Generate Frequent 1-itemsets: Count item occurrences.
  2. Generate Frequent 2-itemsets: Find frequent pairs of items occurring together.
  3. Filter by a Minimum Support: Remove itemsets below a frequency threshold.

Below is an example; see the comments in the code for more details.

The minimum support below is 2, meaning we are only interested in itemsets that occur in at least 2 transactions. You can adjust this threshold to your needs.

-- Step 1: Count the frequency of each item (1-itemsets)
WITH item_counts AS (
    SELECT item, COUNT(*) AS frequency
    FROM transactions
    GROUP BY item
    HAVING COUNT(*) >= 2  -- Minimum support threshold
),

-- Step 2: Generate item pairs (2-itemsets) from transactions
item_pairs AS (
    SELECT 
        t1.item AS item1, 
        t2.item AS item2, 
        COUNT(*) AS frequency
    FROM transactions t1
    JOIN transactions t2 
        ON t1.transaction_id = t2.transaction_id 
        AND t1.item < t2.item  -- Avoid duplicates
    GROUP BY item1, item2
    HAVING COUNT(*) >= 2  -- Minimum support threshold
)

-- Step 3: Display frequent itemsets
SELECT * FROM item_pairs;

  1. Frequent 1-itemsets: We count each item’s occurrences in transactions.
  2. Frequent 2-itemsets: We self-join the table on transaction_id to find pairs of items bought together.
  3. Filtering by Minimum Support: We keep only itemsets that appear at least twice.
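
With the sample data above, only the pairs {Bread, Butter} and {Bread, Milk} clear the threshold, so the query should return something like:

item1|item2|frequency
Bread|Butter|3
Bread|Milk|2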

Three-Item Sets

Create the table again

CREATE TABLE transactions (
    transaction_id INTEGER,
    item TEXT
);

INSERT INTO transactions (transaction_id, item) VALUES
(1, 'Milk'), (1, 'Bread'), (1, 'Butter'),
(2, 'Milk'), (2, 'Bread'),
(3, 'Milk'), (3, 'Diapers'),
(4, 'Bread'), (4, 'Butter'),
(5, 'Diapers'), (5, 'Juice'),
(6, 'Milk'), (6, 'Juice'),
(7, 'Bread'), (7, 'Butter'), (7, 'Diapers');

Get 1-sets

WITH item_counts AS (
    SELECT item, COUNT(*) AS frequency
    FROM transactions
    GROUP BY item
    HAVING COUNT(*) >= 2  -- Minimum support threshold
)
SELECT * FROM item_counts;
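
With the sample data, every item clears the threshold of 2: Bread and Milk appear in 4 transactions each, Butter and Diapers in 3, and Juice in 2.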

Get 2-sets

WITH item_pairs AS (
    SELECT 
        t1.item AS item1, 
        t2.item AS item2, 
        COUNT(*) AS frequency
    FROM transactions t1
    JOIN transactions t2 
        ON t1.transaction_id = t2.transaction_id 
        AND t1.item < t2.item  -- Avoid duplicate pairs
    GROUP BY item1, item2
    HAVING COUNT(*) >= 2  -- Minimum support threshold
)
SELECT * FROM item_pairs;

Get 3-sets

WITH item_triples AS (
    SELECT 
        t1.item AS item1, 
        t2.item AS item2, 
        t3.item AS item3, 
        COUNT(*) AS frequency
    FROM transactions t1
    JOIN transactions t2 ON t1.transaction_id = t2.transaction_id AND t1.item < t2.item
    JOIN transactions t3 ON t1.transaction_id = t3.transaction_id AND t2.item < t3.item
    GROUP BY item1, item2, item3
    HAVING COUNT(*) >= 2  -- Minimum support threshold
)
SELECT * FROM item_triples;
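
With the sample data, no triple of items occurs in two or more transactions ({Bread, Butter, Milk} and {Bread, Butter, Diapers} each occur only once), so this query should return an empty result.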

Generate Association Rules

Association rules are of the form {A, B} → {C}, meaning if a user buys A and B, they are likely to buy C. The confidence of the rule is the probability of buying C given A and B, calculated as \(P(C \mid A, B) = \frac{P(A, B, C)}{P(A, B)}\).

In the example below, we use SQL to calculate the confidence of {A} → {B} and of {B} → {A}, i.e., \(P(B \mid A)\) and \(P(A \mid B)\).

WITH pair_support AS (
    SELECT item1, item2, COUNT(*) AS pair_count
    FROM (
        SELECT t1.transaction_id, t1.item AS item1, t2.item AS item2
        FROM transactions t1
        JOIN transactions t2 
            ON t1.transaction_id = t2.transaction_id 
            AND t1.item < t2.item
    ) pairs
    GROUP BY item1, item2
),
item_support AS (
    SELECT item, COUNT(*) AS item_count
    FROM transactions
    GROUP BY item
),
association_rules AS (
    SELECT 
        p.item1, 
        p.item2, 
        p.pair_count,
        i1.item_count AS item1_support,
        i2.item_count AS item2_support,
        (1.0 * p.pair_count / i1.item_count) AS confidence_A_to_B,
        (1.0 * p.pair_count / i2.item_count) AS confidence_B_to_A
    FROM pair_support p
    JOIN item_support i1 ON p.item1 = i1.item
    JOIN item_support i2 ON p.item2 = i2.item
    WHERE p.pair_count >= 2
)
SELECT * FROM association_rules ORDER BY confidence_A_to_B DESC;
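
As a worked check against the sample data: Bread and Butter appear together in 3 transactions, Bread appears in 4, and Butter in 3, so \(confidence(Bread \to Butter) = 3/4 = 0.75\) while \(confidence(Butter \to Bread) = 3/3 = 1.0\).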

Lift Calculation

Lift measures how much more likely two items are bought together than they would be if purchased independently. It is calculated as \(Lift(A, B) = \frac{P(A \cap B)}{P(A) \, P(B)}\). Note that the probabilities require the total number of transactions \(N\) (e.g., \(P(A) = count(A)/N\)), so the query below adds a CTE for it. The query reports:

  1. Frequent itemsets: Lists items that appear together frequently.
  2. Confidence scores: If a customer buys Milk, what is the probability they will also buy Bread?
  3. Lift values: Measures how much more likely two items are bought together than independently.

WITH total AS (
    -- Total number of transactions, needed to turn counts into probabilities
    SELECT COUNT(DISTINCT transaction_id) AS n
    FROM transactions
),
pair_support AS (
    SELECT item1, item2, COUNT(*) AS pair_count
    FROM (
        SELECT t1.transaction_id, t1.item AS item1, t2.item AS item2
        FROM transactions t1
        JOIN transactions t2 
            ON t1.transaction_id = t2.transaction_id 
            AND t1.item < t2.item
    ) pairs
    GROUP BY item1, item2
),
item_support AS (
    SELECT item, COUNT(*) AS item_count
    FROM transactions
    GROUP BY item
),
association_rules AS (
    SELECT 
        p.item1, 
        p.item2, 
        p.pair_count,
        i1.item_count AS item1_support,
        i2.item_count AS item2_support,
        (1.0 * p.pair_count / i1.item_count) AS confidence_A_to_B,
        (1.0 * p.pair_count / i2.item_count) AS confidence_B_to_A,
        -- lift = P(A,B) / (P(A) * P(B)) = pair_count * N / (count(A) * count(B))
        (1.0 * p.pair_count * t.n / (i1.item_count * i2.item_count)) AS lift
    FROM pair_support p
    JOIN item_support i1 ON p.item1 = i1.item
    JOIN item_support i2 ON p.item2 = i2.item
    CROSS JOIN total t
    WHERE p.pair_count >= 2
)
SELECT * FROM association_rules ORDER BY lift DESC;
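
As a worked check: with \(N = 7\) transactions, \(Lift(Bread, Butter) = \frac{3/7}{(4/7)(3/7)} = \frac{3 \times 7}{4 \times 3} = 1.75\), meaning Bread and Butter are bought together 1.75 times more often than independence would predict.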



FP-Tree

The FP-Tree (frequent-pattern tree) is a compact data structure used to mine frequent itemsets in transactional databases. It is the core of the FP-Growth algorithm, an alternative to Apriori that avoids repeated candidate generation by storing the transactions in a prefix tree.

Step one: Count Item Frequency & Sort Transactions

  1. Count occurrences of each item (i.e., frequent 1-itemsets).
  2. Sort items in descending frequency order for each transaction.

WITH item_counts AS (
    SELECT item, COUNT(*) AS frequency
    FROM transactions
    GROUP BY item
),
sorted_transactions AS (
    SELECT t.transaction_id, t.item
    FROM transactions t
    JOIN item_counts ic ON t.item = ic.item
    ORDER BY t.transaction_id, ic.frequency DESC
)
SELECT * FROM sorted_transactions;

Step two: Construct the FP-Tree Using Recursive CTE

  1. Base Case: Each transaction starts a new FP-tree branch.
  2. Recursive Step: Recursively appends items to paths within each transaction.
  3. Filtering: Keeps only frequent paths (support ≥ 2).

WITH RECURSIVE fp_tree AS (
    -- Base case: Root of FP-tree (no parent)
    SELECT 
        transaction_id, 
        item AS path, 
        item AS last_item, 
        1 AS depth
    FROM transactions

    UNION ALL

    -- Recursive case: Add items to the FP-tree
    SELECT 
        t.transaction_id, 
        ft.path || ',' || t.item AS path,  -- Extend item path
        t.item AS last_item,
        ft.depth + 1 AS depth
    FROM fp_tree ft
    JOIN transactions t 
        ON ft.transaction_id = t.transaction_id
        AND t.item > ft.last_item  -- Ensure order
)
SELECT path, COUNT(*) AS support
FROM fp_tree
GROUP BY path
HAVING COUNT(*) >= 2  -- Minimum support
ORDER BY LENGTH(path) DESC, support DESC;
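
Note that this recursion enumerates every lexicographically ordered item combination within each transaction, so the frequent paths it reports agree with the pairs and triples found by the Apriori queries above. A full FP-Growth implementation would walk a shared prefix tree instead of re-joining the transactions table, but the recursive CTE keeps the example self-contained.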

Step three: Extract Frequent Itemsets

The query below works as written only in DuckDB: it uses DuckDB's LIST aggregate, and SQLite does not support aggregate functions in recursive CTEs, which is why the support counts are computed outside the recursive part.

Once we have the FP-tree paths, we extract frequent patterns.

  1. Starts with individual items as patterns.
  2. Recursively extends patterns by adding items.
  3. Filters based on minimum support.

WITH RECURSIVE item_counts AS (
    -- Step 1: Count frequency of each item in transactions
    SELECT item, COUNT(*) AS frequency
    FROM transactions
    GROUP BY item
    HAVING COUNT(*) >= 2  -- Minimum support threshold
),
sorted_transactions AS (
    -- Step 2: Keep only frequent items, ordered by frequency within each transaction
    SELECT t.transaction_id, t.item
    FROM transactions t
    JOIN item_counts ic ON t.item = ic.item
    ORDER BY t.transaction_id, ic.frequency DESC
),
fp_tree AS (
    -- Step 3: Materialize each transaction's itemset as an FP-tree path
    -- (uses DuckDB's LIST aggregate; kept for inspection only)
    SELECT transaction_id, LIST(item) AS itemset
    FROM sorted_transactions
    GROUP BY transaction_id
),
fp_growth AS (
    -- Base case: single-item patterns, one row per transaction
    SELECT 
        transaction_id,
        item AS pattern, 
        item AS last_item
    FROM sorted_transactions

    UNION ALL

    -- Recursive step: extend each pattern with a lexicographically
    -- later item from the same transaction
    SELECT 
        t.transaction_id,
        fg.pattern || ',' || t.item AS pattern,
        t.item AS last_item
    FROM fp_growth fg
    JOIN sorted_transactions t 
        ON fg.transaction_id = t.transaction_id
        AND fg.last_item < t.item
)

-- Step 4: Count support per pattern and keep the frequent ones
SELECT pattern, COUNT(*) AS support
FROM fp_growth
GROUP BY pattern
HAVING COUNT(*) >= 2  -- Minimum support threshold
ORDER BY LENGTH(pattern) DESC, support DESC;

Computing efficiency

The Apriori algorithm can be computationally expensive for large datasets, so it is worth looking at the efficiency of its SQL implementation. SQL is a declarative language: the programmer specifies the desired result, and the database engine determines the most efficient way to compute it. To do so, the engine builds a query plan, a set of steps it will take to execute the query, often represented as a tree in which each node is an operation to perform. The engine can apply various optimization techniques, such as indexing, caching, and parallel processing, to improve performance. In both DuckDB and SQLite, you can use the EXPLAIN command to inspect the query plan and tune the query.

To create a query plan, you only need to prefix the SQL with EXPLAIN or EXPLAIN ANALYZE. Query plans are read bottom-up, starting with the base tables and moving up to the final result. Each node in the plan represents an operation the database engine will perform, such as a scan, join, or aggregation. The EXPLAIN command generates the expected query plan; EXPLAIN ANALYZE executes the query and reports the actual plan, including the time taken by each step.

Database   Command
DuckDB     EXPLAIN
DuckDB     EXPLAIN ANALYZE
SQLite     EXPLAIN QUERY PLAN
SQLite     ANALYZE
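
In SQLite, prefix the query with EXPLAIN QUERY PLAN instead; it prints a short tree of the scan and grouping steps (the exact output depends on the SQLite version). For example:

EXPLAIN QUERY PLAN
SELECT item, COUNT(*) AS frequency
FROM transactions
GROUP BY item;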

Below is an example of how to use the EXPLAIN command in DuckDB.

The query plan from query 1:

-- Step 1: Count the frequency of each item (1-itemsets)
EXPLAIN WITH item_counts AS (
    SELECT item, COUNT(*) AS frequency
    FROM transactions
    GROUP BY item
    HAVING COUNT(*) >= 2  -- Minimum support threshold
),

-- Step 2: Generate item pairs (2-itemsets) from transactions
item_pairs AS (
    SELECT 
        t1.item AS item1, 
        t2.item AS item2, 
        COUNT(*) AS frequency
    FROM transactions t1
    JOIN transactions t2 
        ON t1.transaction_id = t2.transaction_id 
        AND t1.item < t2.item  -- Avoid duplicates
    GROUP BY item1, item2
    HAVING COUNT(*) >= 2  -- Minimum support threshold
)

-- Step 3: Display frequent itemsets
SELECT * FROM item_pairs;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│    ────────────────────   │
│    (count_star() >= 2)    │
│                           │
│          ~0 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│__internal_decompress_strin│
│           g(#0)           │
│__internal_decompress_strin│
│           g(#1)           │
│             #2            │
│                           │
│          ~4 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       HASH_GROUP_BY       │
│    ────────────────────   │
│          Groups:          │
│             #0            │
│             #1            │
│                           │
│        Aggregates:        │
│        count_star()       │
│                           │
│          ~4 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│           item1           │
│           item2           │
│                           │
│          ~18 Rows         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│__internal_compress_string_│
│        ubigint(#0)        │
│__internal_compress_string_│
│        ubigint(#1)        │
│                           │
│          ~18 Rows         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        │
│      transaction_id =     ├──────────────┐
│       transaction_id      │              │
│        item < item        │              │
│                           │              │
│          ~18 Rows         │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│    Table: transactions    ││    Table: transactions    │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           ││                           │
│        Projections:       ││        Projections:       │
│       transaction_id      ││       transaction_id      │
│            item           ││            item           │
│                           ││                           │
│          ~16 Rows         ││          ~16 Rows         │
└───────────────────────────┘└───────────────────────────┘
EXPLAIN the query that constructs the FP-tree:

EXPLAIN WITH RECURSIVE fp_tree AS (
    -- Base case: Root of FP-tree (no parent)
    SELECT
        transaction_id,
        item AS path,
        item AS last_item,
        1 AS depth
    FROM transactions

    UNION ALL

    -- Recursive case: Add items to the FP-tree
    SELECT
        t.transaction_id,
        ft.path || ',' || t.item AS path,  -- Extend item path
        t.item AS last_item,
        ft.depth + 1 AS depth
    FROM fp_tree ft
    JOIN transactions t
        ON ft.transaction_id = t.transaction_id
        AND t.item > ft.last_item  -- Ensure order
)
SELECT path, COUNT(*) AS support
FROM fp_tree
GROUP BY path
HAVING COUNT(*) >= 2  -- Minimum support
ORDER BY LENGTH(path) DESC, support DESC;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│          ORDER_BY         │
│    ────────────────────   │
│          #2 DESC          │
│     count_star() DESC     │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            path           │
│          support          │
│        length(path)       │
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│    (count_star() >= 2)    │
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       HASH_GROUP_BY       │
│    ────────────────────   │
│         Groups: #0        │
│                           │
│        Aggregates:        │
│        count_star()       │
│                           │
│          ~8 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            path           │
│                           │
│          ~16 Rows         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          REC_CTE          │
│    ────────────────────   │
│     CTE Name: fp_tree     ├──────────────┐
│       Table Index: 0      │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│       transaction_id      ││       transaction_id      │
│            path           ││            path           │
│         last_item         ││         last_item         │
│           depth           ││           depth           │
│                           ││                           │
│          ~16 Rows         ││          ~16 Rows         │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         HASH_JOIN         │
│    ────────────────────   ││    ────────────────────   │
│    Table: transactions    ││      Join Type: INNER     │
│   Type: Sequential Scan   ││                           │
│                           ││        Conditions:        │
│        Projections:       ││      transaction_id =     ├──────────────┐
│       transaction_id      ││       transaction_id      │              │
│            item           ││      item > last_item     │              │
│                           ││                           │              │
│          ~16 Rows         ││          ~16 Rows         │              │
└───────────────────────────┘└─────────────┬─────────────┘              │
                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                             │         SEQ_SCAN          ││        REC_CTE_SCAN       │
                             │    ────────────────────   ││    ────────────────────   │
                             │    Table: transactions    ││        CTE Index: 0       │
                             │   Type: Sequential Scan   ││                           │
                             │                           ││                           │
                             │          ~16 Rows         ││          ~0 Rows          │
                             └───────────────────────────┘└───────────────────────────┘

Python Implementation

The mlxtend library in Python provides an implementation of the Apriori algorithm for frequent itemset mining. It was created by Sebastian Raschka and is available on GitHub. The library is easy to use and can be installed using pip install mlxtend.

Below is an example of how to use the mlxtend library to mine frequent itemsets and association rules.


import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Apriori Algorithm 
apriori_frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(apriori_frequent_itemsets)

# FP-Growth Algorithm
fp_growth_frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
print(fp_growth_frequent_itemsets)

The function apriori takes a DataFrame of transactions and a minimum support threshold as input and returns the frequent itemsets. You can also use the fpgrowth function to mine frequent itemsets using the FP-Growth algorithm. The parameter min_support specifies the minimum support threshold for an itemset to be considered frequent.
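
The example above only mines frequent itemsets. Association rules can then be derived with mlxtend's association_rules function; below is a minimal sketch (the metric and min_threshold values are illustrative, and the exact keyword arguments may vary slightly between mlxtend versions):

from mlxtend.frequent_patterns import association_rules

# Derive rules from the frequent itemsets mined above, keeping rules
# with confidence >= 0.7; the result is a DataFrame with columns such
# as antecedents, consequents, support, confidence, and lift.
rules = association_rules(apriori_frequent_itemsets,
                          metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])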

