This tutorial is part of Introduction to Data Science at the University of Florida.
Apriori is a classical data-mining algorithm for frequent itemset mining and association rule learning over transactional databases. It is designed to operate on databases containing transactions, such as purchases by customers of a store.
In this tutorial, we will implement the Apriori and FP-Tree algorithms in a database engine. To complete the queries below you can use either SQLite or DuckDB. SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. DuckDB is an embeddable SQL OLAP database management system written in C++ and designed for high performance on analytical queries. Both are open-source databases that support SQL: DuckDB is optimized for analytical queries, while SQLite is a general-purpose database. Both are easy to use and can run in-memory or on disk.
We will also discuss implementations of the algorithms in Python using the mlxtend library.
Using the GNU/Linux command line, you can install the DuckDB CLI, or its Python package via pip, using the following commands:
curl https://install.duckdb.org | sh
pip install duckdb
You can install SQLite3 in Ubuntu using the following command:
sudo apt install sqlite3 libsqlite3-dev
To run a REPL shell of SQLite 3 or DuckDB, you can use the following commands:
sqlite3 mydatabase.db
duckdb mydatabase.db
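Both shells can also run without a database file, in which case they operate on a temporary in-memory database:
sqlite3
duckdb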
The WITH and WITH RECURSIVE Clauses
The WITH clause in SQL is used to define a temporary result set that can be referenced in the main query. It is useful for breaking down complex queries into smaller, more manageable parts. The WITH clause is also known as a Common Table Expression (CTE). The WITH RECURSIVE clause defines a recursive CTE, that is, a CTE that references itself, enabling recursive queries. We will use the WITH clause below.
WITH clause: DuckDB Documentation. WITH RECURSIVE clause: SQLite Documentation.
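As a minimal sketch of the recursive form (separate from the Apriori queries), the CTE below generates the numbers 1 through 5; it runs in both SQLite and DuckDB:
WITH RECURSIVE counter(n) AS (
    SELECT 1            -- base case
    UNION ALL
    SELECT n + 1        -- recursive case: references counter itself
    FROM counter
    WHERE n < 5
)
SELECT n FROM counter;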
We start by creating a small table of sample transactions:
CREATE TABLE transactions (
transaction_id INTEGER,
item TEXT
);
INSERT INTO transactions (transaction_id, item) VALUES
(1, 'Milk'), (1, 'Bread'), (1, 'Butter'),
(2, 'Milk'), (2, 'Bread'),
(3, 'Milk'), (3, 'Diapers'),
(4, 'Bread'), (4, 'Butter'),
(5, 'Diapers'), (5, 'Juice'),
(6, 'Milk'), (6, 'Juice'),
(7, 'Bread'), (7, 'Butter'), (7, 'Diapers');
To implement Apriori in SQL, the key steps are: (1) count the frequency of each item, (2) generate candidate itemsets by joining the transactions table with itself, and (3) keep only the itemsets that meet the minimum support threshold.
Below is an example; also see the comments in the code for more details.
The minimum support below is 2. This means that we are only interested in itemsets that occur in at least 2 transactions. You can adjust this threshold to your needs.
-- Step 1: Count the frequency of each item (1-itemsets)
WITH item_counts AS (
SELECT item, COUNT(*) AS frequency
FROM transactions
GROUP BY item
HAVING COUNT(*) >= 2 -- Minimum support threshold
),
-- Step 2: Generate item pairs (2-itemsets) from transactions
item_pairs AS (
SELECT
t1.item AS item1,
t2.item AS item2,
COUNT(*) AS frequency
FROM transactions t1
JOIN transactions t2
ON t1.transaction_id = t2.transaction_id
AND t1.item < t2.item -- Avoid duplicates
GROUP BY item1, item2
HAVING COUNT(*) >= 2 -- Minimum support threshold
)
-- Step 3: Display frequent itemsets
SELECT * FROM item_pairs;
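On the sample data, this returns the two frequent pairs: (Bread, Butter) with frequency 3 (transactions 1, 4, and 7) and (Bread, Milk) with frequency 2 (transactions 1 and 2).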
The query joins the transactions table with itself on transaction_id to find pairs of items bought together. The queries below repeat each step individually, starting with the frequent single items:
WITH item_counts AS (
SELECT item, COUNT(*) AS frequency
FROM transactions
GROUP BY item
HAVING COUNT(*) >= 2 -- Minimum support threshold
)
SELECT * FROM item_counts;
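All five items meet the minimum support of 2: Milk and Bread appear in 4 transactions each, Butter and Diapers in 3, and Juice in 2. Next we compute the frequent pairs: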
WITH item_pairs AS (
SELECT
t1.item AS item1,
t2.item AS item2,
COUNT(*) AS frequency
FROM transactions t1
JOIN transactions t2
ON t1.transaction_id = t2.transaction_id
AND t1.item < t2.item -- Avoid duplicate pairs
GROUP BY item1, item2
HAVING COUNT(*) >= 2 -- Minimum support threshold
)
SELECT * FROM item_pairs;
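The same pattern extends to 3-itemsets by adding a third self-join, as shown below: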
WITH item_triples AS (
SELECT
t1.item AS item1,
t2.item AS item2,
t3.item AS item3,
COUNT(*) AS frequency
FROM transactions t1
JOIN transactions t2 ON t1.transaction_id = t2.transaction_id AND t1.item < t2.item
JOIN transactions t3 ON t1.transaction_id = t3.transaction_id AND t2.item < t3.item
GROUP BY item1, item2, item3
HAVING COUNT(*) >= 2 -- Minimum support threshold
)
SELECT * FROM item_triples;
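On the sample data no 3-itemset occurs in two or more transactions ({Bread, Butter, Milk} and {Bread, Butter, Diapers} each occur once), so this query returns an empty result, and candidate generation stops here.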
Association rules are of the form \(\{A, B\} \rightarrow \{C\}\), meaning if a user buys A and B, they are likely to buy C.
The confidence of the rule is the probability of buying C given A and B, and it is calculated as \(P(C \mid A, B) = \frac{P(A, B, C)}{P(A, B)}\).
In the example below, we use SQL to calculate \(\{A\} \rightarrow \{B\}\) and \(\{B\} \rightarrow \{A\}\), whose confidences are \(P(B \mid A)\) and \(P(A \mid B)\) respectively.
WITH pair_support AS (
SELECT item1, item2, COUNT(*) AS pair_count
FROM (
SELECT t1.transaction_id, t1.item AS item1, t2.item AS item2
FROM transactions t1
JOIN transactions t2
ON t1.transaction_id = t2.transaction_id
AND t1.item < t2.item
) pairs
GROUP BY item1, item2
),
item_support AS (
SELECT item, COUNT(*) AS item_count
FROM transactions
GROUP BY item
),
association_rules AS (
SELECT
p.item1,
p.item2,
p.pair_count,
i1.item_count AS item1_support,
i2.item_count AS item2_support,
(1.0 * p.pair_count / i1.item_count) AS confidence_A_to_B,
(1.0 * p.pair_count / i2.item_count) AS confidence_B_to_A
FROM pair_support p
JOIN item_support i1 ON p.item1 = i1.item
JOIN item_support i2 ON p.item2 = i2.item
WHERE p.pair_count >= 2
)
SELECT * FROM association_rules ORDER BY confidence_A_to_B DESC;
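For example, Bread appears in four transactions and the pair {Bread, Butter} in three, so confidence(Bread → Butter) = 3/4 = 0.75, while Butter appears in three transactions, so confidence(Butter → Bread) = 3/3 = 1.0.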
Lift measures how much more likely two items are bought together than they would be if purchased independently. It is calculated as \(\mathrm{Lift}(A, B) = \frac{P(A \cap B)}{P(A)\,P(B)}\). Since each probability is a count divided by the total number of transactions \(N\), in terms of raw counts this is \(N \cdot \mathrm{count}(A, B) / (\mathrm{count}(A) \cdot \mathrm{count}(B))\); note the factor of \(N\). The query below calculates the lift for each pair of items.
WITH pair_support AS (
SELECT item1, item2, COUNT(*) AS pair_count
FROM (
SELECT t1.transaction_id, t1.item AS item1, t2.item AS item2
FROM transactions t1
JOIN transactions t2
ON t1.transaction_id = t2.transaction_id
AND t1.item < t2.item
) pairs
GROUP BY item1, item2
),
item_support AS (
SELECT item, COUNT(*) AS item_count
FROM transactions
GROUP BY item
),
association_rules AS (
SELECT
p.item1,
p.item2,
p.pair_count,
i1.item_count AS item1_support,
i2.item_count AS item2_support,
(1.0 * p.pair_count / i1.item_count) AS confidence_A_to_B,
(1.0 * p.pair_count / i2.item_count) AS confidence_B_to_A,
(1.0 * p.pair_count * (SELECT COUNT(DISTINCT transaction_id) FROM transactions) / (i1.item_count * i2.item_count)) AS lift -- N * count(A,B) / (count(A) * count(B))
FROM pair_support p
JOIN item_support i1 ON p.item1 = i1.item
JOIN item_support i2 ON p.item2 = i2.item
WHERE p.pair_count >= 2
)
SELECT * FROM association_rules ORDER BY lift DESC;
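On the sample data (N = 7), lift(Bread, Butter) = 7 * 3 / (4 * 3) = 1.75. A lift above 1 indicates that the two items co-occur more often than independence would predict.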
The FP-Tree (Frequent Pattern Tree) is a data structure used to mine frequent itemsets in transactional databases. It is the core of the FP-Growth algorithm, an alternative to Apriori that compresses the transactions into a prefix tree and avoids generating candidate itemsets. The first step is to sort the items within each transaction by descending frequency:
WITH item_counts AS (
SELECT item, COUNT(*) AS frequency
FROM transactions
GROUP BY item
),
sorted_transactions AS (
SELECT t.transaction_id, t.item
FROM transactions t
JOIN item_counts ic ON t.item = ic.item
ORDER BY t.transaction_id, ic.frequency DESC
)
SELECT * FROM sorted_transactions;
WITH RECURSIVE fp_tree AS (
-- Base case: Root of FP-tree (no parent)
SELECT
transaction_id,
item AS path,
item AS last_item,
1 AS depth
FROM transactions
UNION ALL
-- Recursive case: Add items to the FP-tree
SELECT
t.transaction_id,
ft.path || ',' || t.item AS path, -- Extend item path
t.item AS last_item,
ft.depth + 1 AS depth
FROM fp_tree ft
JOIN transactions t
ON ft.transaction_id = t.transaction_id
AND t.item > ft.last_item -- Ensure order
)
SELECT path, COUNT(*) AS support
FROM fp_tree
GROUP BY path
HAVING COUNT(*) >= 2 -- Minimum support
ORDER BY LENGTH(path) DESC, support DESC;
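With the threshold of 2, this returns the same frequent itemsets as the Apriori queries above: the paths Bread,Butter (support 3) and Bread,Milk (support 2), followed by the five frequent single items.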
The query below only works in DuckDB: SQLite supports neither the LIST aggregate nor aggregate functions inside a recursive CTE.
Once we have the FP-tree paths, we extract frequent patterns.
WITH RECURSIVE item_counts AS (
-- Step 1: Count frequency of each item in transactions
SELECT item, COUNT(*) AS frequency
FROM transactions
GROUP BY item
HAVING COUNT(*) >= 2 -- Minimum support threshold
),
sorted_transactions AS (
-- Step 2: Sort items by frequency within each transaction
SELECT t.transaction_id, t.item
FROM transactions t
JOIN item_counts ic ON t.item = ic.item
ORDER BY t.transaction_id, ic.frequency DESC
),
fp_tree AS (
-- Step 3: Collect each transaction's sorted items as a list
-- (DuckDB's LIST aggregate; shown for illustration, not referenced below)
SELECT transaction_id, LIST(item) AS itemset
FROM sorted_transactions
GROUP BY transaction_id
),
fp_growth AS (
-- Base case: one row per (transaction, item)
SELECT
transaction_id,
item AS pattern,
item AS last_item
FROM sorted_transactions
),
fp_growth_exposed AS (SELECT * FROM fp_growth), -- Expose the table to avoid self-join limitation
fp_growth_recursive AS (
-- Extend existing patterns with a later item from the same transaction
SELECT
fg.pattern || ',' || t.item AS pattern,
COUNT(*) AS support
FROM fp_growth_exposed fg
JOIN sorted_transactions t
ON fg.transaction_id = t.transaction_id -- Only extend within the same transaction
AND fg.last_item < t.item -- Maintain lexicographic order
GROUP BY fg.pattern, t.item
HAVING COUNT(*) >= 2 -- Apply minimum support threshold
UNION ALL
-- Add the single-item frequent patterns
SELECT pattern, COUNT(*) AS support
FROM fp_growth_exposed
GROUP BY pattern
HAVING COUNT(*) >= 2 -- Minimum support
)
-- Step 4: Display frequent patterns
SELECT * FROM fp_growth_recursive
ORDER BY LENGTH(pattern) DESC, support DESC;
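On the sample data this again yields Bread,Butter (support 3) and Bread,Milk (support 2), plus the five frequent single items, matching the Apriori results. Note that as written the patterns are extended by a single item only; a full FP-Growth implementation would iterate until no pattern can be extended.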
The Apriori algorithm can be computationally expensive for large datasets.
Because of this we should look at the efficiency of the algorithm in its SQL implementation.
SQL is a declarative language, and the database engine is responsible for optimizing the query execution.
The programmer only needs to specify the desired result, and the database engine will determine the most efficient way to compute it.
The database engine can compute query plans. A query plan is a set of steps that the database engine will take to execute the query. The plans are often represented as a tree structure, where each node represents an operation that needs to be performed.
The database engine can use various optimization techniques to improve query performance, such as indexing, caching, and parallel processing.
In DuckDB and SQLite, you can use the EXPLAIN command to see the query plan and optimize the query. To create the query plans, you only need to prefix the SQL with EXPLAIN or EXPLAIN ANALYZE.
Query plans are read bottom-up, starting with the base tables and moving up to the final result.
Each node in the query plan represents an operation that the database engine will perform, such as a scan, join, or aggregation.
The database engine will choose the most efficient way to execute the query based on the query plan.
The EXPLAIN command generates an expected query plan. The EXPLAIN ANALYZE command generates an actual query plan, including the time taken to execute each step.
Database | Command |
---|---|
DuckDB | EXPLAIN |
DuckDB | EXPLAIN ANALYZE |
SQLite | EXPLAIN QUERY PLAN |
SQLite | ANALYZE |
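For SQLite, prefix the query with EXPLAIN QUERY PLAN instead; for example:
EXPLAIN QUERY PLAN
SELECT item, COUNT(*) AS frequency
FROM transactions
GROUP BY item;
The output is a short textual tree of scan and index choices rather than the diagram DuckDB prints.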
Below is an example of how to use the EXPLAIN command in DuckDB.
-- Step 1: Count the frequency of each item (1-itemsets)
EXPLAIN WITH item_counts AS (
SELECT item, COUNT(*) AS frequency
FROM transactions
GROUP BY item
HAVING COUNT(*) >= 2 -- Minimum support threshold
),
-- Step 2: Generate item pairs (2-itemsets) from transactions
item_pairs AS (
SELECT
t1.item AS item1,
t2.item AS item2,
COUNT(*) AS frequency
FROM transactions t1
JOIN transactions t2
ON t1.transaction_id = t2.transaction_id
AND t1.item < t2.item -- Avoid duplicates
GROUP BY item1, item2
HAVING COUNT(*) >= 2 -- Minimum support threshold
)
-- Step 3: Display frequent itemsets
SELECT * FROM item_pairs;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ FILTER │
│ ──────────────────── │
│ (count_star() >= 2) │
│ │
│ ~0 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│__internal_decompress_strin│
│ g(#0) │
│__internal_decompress_strin│
│ g(#1) │
│ #2 │
│ │
│ ~4 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ──────────────────── │
│ Groups: │
│ #0 │
│ #1 │
│ │
│ Aggregates: │
│ count_star() │
│ │
│ ~4 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ item1 │
│ item2 │
│ │
│ ~18 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│__internal_compress_string_│
│ ubigint(#0) │
│__internal_compress_string_│
│ ubigint(#1) │
│ │
│ ~18 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ │
│ Conditions: │
│ transaction_id = ├──────────────┐
│ transaction_id │ │
│ item < item │ │
│ │ │
│ ~18 Rows │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Table: transactions ││ Table: transactions │
│ Type: Sequential Scan ││ Type: Sequential Scan │
│ ││ │
│ Projections: ││ Projections: │
│ transaction_id ││ transaction_id │
│ item ││ item │
│ ││ │
│ ~16 Rows ││ ~16 Rows │
└───────────────────────────┘└───────────────────────────┘
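The same approach works for the recursive FP-tree query from earlier: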
EXPLAIN WITH RECURSIVE fp_tree AS (
-- Base case: Root of FP-tree (no parent)
SELECT
transaction_id,
item AS path,
item AS last_item,
1 AS depth
FROM transactions
UNION ALL
-- Recursive case: Add items to the FP-tree
SELECT
t.transaction_id,
ft.path || ',' || t.item AS path, -- Extend item path
t.item AS last_item,
ft.depth + 1 AS depth
FROM fp_tree ft
JOIN transactions t
ON ft.transaction_id = t.transaction_id
AND t.item > ft.last_item -- Ensure order
)
SELECT path, COUNT(*) AS support
FROM fp_tree
GROUP BY path
HAVING COUNT(*) >= 2 -- Minimum support
ORDER BY LENGTH(path) DESC, support DESC;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ ORDER_BY │
│ ──────────────────── │
│ #2 DESC │
│ count_star() DESC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ path │
│ support │
│ length(path) │
│ │
│ ~1 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ (count_star() >= 2) │
│ │
│ ~1 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ──────────────────── │
│ Groups: #0 │
│ │
│ Aggregates: │
│ count_star() │
│ │
│ ~8 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ path │
│ │
│ ~16 Rows │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ REC_CTE │
│ ──────────────────── │
│ CTE Name: fp_tree ├──────────────┐
│ Table Index: 0 │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION │
│ ──────────────────── ││ ──────────────────── │
│ transaction_id ││ transaction_id │
│ path ││ path │
│ last_item ││ last_item │
│ depth ││ depth │
│ ││ │
│ ~16 Rows ││ ~16 Rows │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ HASH_JOIN │
│ ──────────────────── ││ ──────────────────── │
│ Table: transactions ││ Join Type: INNER │
│ Type: Sequential Scan ││ │
│ ││ Conditions: │
│ Projections: ││ transaction_id = ├──────────────┐
│ transaction_id ││ transaction_id │ │
│ item ││ item > last_item │ │
│ ││ │ │
│ ~16 Rows ││ ~16 Rows │ │
└───────────────────────────┘└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ REC_CTE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Table: transactions ││ CTE Index: 0 │
│ Type: Sequential Scan ││ │
│ ││ │
│ ~16 Rows ││ ~0 Rows │
└───────────────────────────┘└───────────────────────────┘
The mlxtend library in Python provides an implementation of the Apriori algorithm for frequent itemset mining. It was created by Sebastian Raschka and is available on GitHub. The library is easy to use and can be installed using pip install mlxtend.
Below is an example of how to use the mlxtend library to mine frequent itemsets and association rules.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
# Apriori Algorithm
apriori_frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(apriori_frequent_itemsets)
# FP-Growth Algorithm
fp_growth_frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
print(fp_growth_frequent_itemsets)

# Association rules from the frequent itemsets, keeping rules with confidence >= 0.7
rules = association_rules(apriori_frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules)
The function apriori takes a DataFrame of one-hot-encoded transactions and a minimum support threshold as input and returns the frequent itemsets. You can also use the fpgrowth function to mine frequent itemsets using the FP-Growth algorithm. The parameter min_support specifies the minimum support threshold as a fraction of all transactions: an itemset is frequent if it appears in at least that share of transactions (0.6 above means 60%). The association_rules function then derives rules from the frequent itemsets, filtered by a metric such as confidence or lift.