Problem Setting
Implement the "Itemset mining algorithm (LCM)" to find out the most frequent closed patterns.
Frequent Itemset Mining¶
Finding all "frequent" sets of elements(items) appearing $\sigma$ times or more</font> in a database.
Terminology¶
| Name | description | 
|---|---|
| Itemset | $$I = \left\{1,\ldots,M\right\}$$ | 
| Transaction | $$t\subset I$$ | 
| Transaction database | $$\mathcal{T}=\left\{t_1,\ldots,t_N\right\}$$ | 
| Pattern | $$P \subset I$$ | 
| Occurrence | $$P\subset\exists t\in\mathcal{T}$$ | 
| Denotation | $$\mathcal{T}(P) = \left\{t\in\mathcal{T}\mid P\subset t\right\}$$ | 
| Support/Frequency | $$n_{\mathcal{T}}(P) = \mid\mathcal{T}(P)\mid$$ | 
Implementation¶
data¶
In [1]:
from kerasy.utils import flatten_dual
In [2]:
# retail_1based_500.txt
with open("itemset_mining/retail_1based_500.txt", mode="r") as f:
    retail_500 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
    print(f"The number of transactions: {len(retail_500)}")
    print(f"The number of unique items: {len(set(flatten_dual(retail_500)))}")
In [3]:
# retail_1based_1000.txt
with open("itemset_mining/retail_1based_1000.txt", mode="r") as f:
    retail_1000 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
    print(f"The number of transactions: {len(retail_1000)}")
    print(f"The number of unique items: {len(set(flatten_dual(retail_1000)))}")
In [4]:
from kerasy.search.itemset import FrequentSet
from kerasy.search.itemset import create_one_hot
def mine(method, data_name="500", threshold=10):
    retail = {
        "500"  : retail_500,
        "1000" : retail_1000,
    }.get(data_name)
    database, idx2data = create_one_hot(retail)
    
    model = FrequentSet(threshold=threshold)
    model.fit(database, method=method)
    print(f"num frequent sets: {len(model.all)-1}")
    
    fn = f"tree_structure-retail_{data_name}-{method}.png"
    ret = model.export_graphviz(fn, class_names=idx2data)
    if ret:
        print(f"Graph image was saved at: `{fn}`")
    else:
        raise ValueError("Graph was not generated.")
        
    print()itemset_mining/
Algorithm¶
In [5]:
mine(method="all", data_name="500", threshold=10)

In [6]:
mine(method="all", data_name="1000", threshold=10)

Summary
- Itemset mining is the simplest of all mining algorithms.
- Need to maintain occurrence of each pattern in database.
- Tree by lexicographical order is (implicitly) used.
Closed Itemset mining¶
PROBLEM in Frequent Pattern Mining¶
- Huge Number of frequent itemsets
- Hard to analyze
- Most of them are similar
SOLUTION in Frequent Pattern Mining¶
- Find only closed patterns
- Observation: Most frequent itemset $X$ can be extended without changing occurrence by adding new elements.
- definition: An itemset $X$ is a "closed set" iff there is no proper superset of $X$ with the same frequency (thus the same occurrence set).
- A closed itemset is the maximal set among all itemsets with the same occurrences.
- Equivalence class $[X] =\left\{Y | \mathrm{Occ}(X)=\mathrm{Occ}(Y)\right\}$
Implementation (LCM)¶
In [7]:
mine(method="closed", data_name="500", threshold=10)

In [8]:
mine(method="closed", data_name="1000", threshold=10)

Compare¶
| # | all | closed | 
|---|---|---|
| retail_1based_500.txt |  |  | 
| retail_1based_1000.txt |  |  | 
Optimization¶
- [ ] Occurrence deliver- when num_itemslarge and sparse transaction.
- Exclude many potential children.
 
- when 
- database reduction- [x] Create small transaction database in any iteration by removing.
- [ ] transaction not containing $C$.
- [ ] items of $C$.
- [x] items with frequency less than minimal support.
- [x] transaction id and item id re-indexing.
- [ ] duplicated transactions are merged with weight factor.
 
- [x] Bit matrix
In [ ]: