Problem Setting

Implement the "Itemset mining algorithm (LCM)" to find out the most frequent closed patterns.

Frequent Itemset Mining¶

Finding all "frequent" sets of elements(items) appearing $\sigma$ times or more</font> in a database.

Terminology¶

Name	description
Itemset	$$I = \left\{1,\ldots,M\right\}$$
Transaction	$$t\subset I$$
Transaction database	$$\mathcal{T}=\left\{t_1,\ldots,t_N\right\}$$
Pattern	$$P \subset I$$
Occurrence	$$P\subset\exists t\in\mathcal{T}$$
Denotation	$$\mathcal{T}(P) = \left\{t\in\mathcal{T}\mid P\subset t\right\}$$
Support/Frequency	$$n_{\mathcal{T}}(P) = \mid\mathcal{T}(P)\mid$$

Implementation¶

data¶

In [1]:

from kerasy.utils import flatten_dual

In [2]:

# retail_1based_500.txt
with open("itemset_mining/retail_1based_500.txt", mode="r") as f:
    retail_500 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
    print(f"The number of transactions: {len(retail_500)}")
    print(f"The number of unique items: {len(set(flatten_dual(retail_500)))}")

The number of transactions: 500
The number of unique items: 2058

In [3]:

# retail_1based_1000.txt
with open("itemset_mining/retail_1based_1000.txt", mode="r") as f:
    retail_1000 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
    print(f"The number of transactions: {len(retail_1000)}")
    print(f"The number of unique items: {len(set(flatten_dual(retail_1000)))}")

The number of transactions: 1000
The number of unique items: 3182

In [4]:

from kerasy.search.itemset import FrequentSet
from kerasy.search.itemset import create_one_hot

def mine(method, data_name="500", threshold=10):
    retail = {
        "500"  : retail_500,
        "1000" : retail_1000,
    }.get(data_name)
    database, idx2data = create_one_hot(retail)
    
    model = FrequentSet(threshold=threshold)
    model.fit(database, method=method)
    print(f"num frequent sets: {len(model.all)-1}")
    
    fn = f"tree_structure-retail_{data_name}-{method}.png"
    ret = model.export_graphviz(fn, class_names=idx2data)
    if ret:
        print(f"Graph image was saved at: `{fn}`")
    else:
        raise ValueError("Graph was not generated.")
        
    print()itemset_mining/

Algorithm¶

In [5]:

mine(method="all", data_name="500", threshold=10)

num frequent sets: 68
Graph image was saved at: `tree_structure-retail_500-all.png`

500-all

In [6]:

mine(method="all", data_name="1000", threshold=10)

num frequent sets: 217
Graph image was saved at: `tree_structure-retail_1000-all.png`

1000-all

Summary

Itemset mining is the simplest of all mining algorithms.
Need to maintain occurrence of each pattern in database.
Tree by lexicographical order is (implicitly) used.

Closed Itemset mining¶

PROBLEM in Frequent Pattern Mining¶

Huge Number of frequent itemsets
Hard to analyze
Most of them are similar

SOLUTION in Frequent Pattern Mining¶

Find only closed patterns
Observation: Most frequent itemset $X$ can be extended without changing occurrence by adding new elements.
definition: An itemset $X$ is a "closed set" iff there is no proper superset of $X$ with the same frequency (thus the same occurrence set).
A closed itemset is the maximal set among all itemsets with the same occurrences.
Equivalence class $[X] =\left\{Y | \mathrm{Occ}(X)=\mathrm{Occ}(Y)\right\}$

Implementation (LCM)¶

In [7]:

mine(method="closed", data_name="500", threshold=10)

num frequent sets: 59
Graph image was saved at: `tree_structure-retail_500-closed.png`

500-closed

In [8]:

mine(method="closed", data_name="1000", threshold=10)

num frequent sets: 213
Graph image was saved at: `tree_structure-retail_1000-closed.png`

1000-closed

Compare¶

#	all	closed
`retail_1based_500.txt`
`retail_1based_1000.txt`

Optimization¶

[ ] Occurrence deliver
- when num_items large and sparse transaction.
- Exclude many potential children.
database reduction
- [x] Create small transaction database in any iteration by removing.
- [ ] transaction not containing $C$.
- [ ] items of $C$.
- [x] items with frequency less than minimal support.
- [x] transaction id and item id re-indexing.
- [ ] duplicated transactions are merged with weight factor.
[x] Bit matrix

Reference¶

In [ ]:

Pro.4 Itemset mining algorithm

Problem Setting

Frequent Itemset Mining¶

Terminology¶

Implementation¶

data¶

Algorithm¶

Summary

Closed Itemset mining¶

PROBLEM in Frequent Pattern Mining¶

SOLUTION in Frequent Pattern Mining¶

Implementation (LCM)¶

Compare¶

Optimization¶

Reference¶

Table of Contents

Published

Last Updated

Category

Tags

Contact

Other contents