Problem Setting
Implement the "Itemset mining algorithm (LCM)" to find out the most frequent closed patterns.
Frequent Itemset Mining¶
Finding all "frequent" sets of elements(items) appearing $\sigma$ times or more</font> in a database.
Terminology¶
Name | description |
---|---|
Itemset | $$I = \left\{1,\ldots,M\right\}$$ |
Transaction | $$t\subset I$$ |
Transaction database | $$\mathcal{T}=\left\{t_1,\ldots,t_N\right\}$$ |
Pattern | $$P \subset I$$ |
Occurrence | $$P\subset\exists t\in\mathcal{T}$$ |
Denotation | $$\mathcal{T}(P) = \left\{t\in\mathcal{T}\mid P\subset t\right\}$$ |
Support/Frequency | $$n_{\mathcal{T}}(P) = \mid\mathcal{T}(P)\mid$$ |
Implementation¶
data¶
In [1]:
from kerasy.utils import flatten_dual
In [2]:
# retail_1based_500.txt
with open("itemset_mining/retail_1based_500.txt", mode="r") as f:
retail_500 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
print(f"The number of transactions: {len(retail_500)}")
print(f"The number of unique items: {len(set(flatten_dual(retail_500)))}")
In [3]:
# retail_1based_1000.txt
with open("itemset_mining/retail_1based_1000.txt", mode="r") as f:
retail_1000 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
print(f"The number of transactions: {len(retail_1000)}")
print(f"The number of unique items: {len(set(flatten_dual(retail_1000)))}")
In [4]:
from kerasy.search.itemset import FrequentSet
from kerasy.search.itemset import create_one_hot
def mine(method, data_name="500", threshold=10):
retail = {
"500" : retail_500,
"1000" : retail_1000,
}.get(data_name)
database, idx2data = create_one_hot(retail)
model = FrequentSet(threshold=threshold)
model.fit(database, method=method)
print(f"num frequent sets: {len(model.all)-1}")
fn = f"tree_structure-retail_{data_name}-{method}.png"
ret = model.export_graphviz(fn, class_names=idx2data)
if ret:
print(f"Graph image was saved at: `{fn}`")
else:
raise ValueError("Graph was not generated.")
print()itemset_mining/
Algorithm¶
In [5]:
mine(method="all", data_name="500", threshold=10)
In [6]:
mine(method="all", data_name="1000", threshold=10)
Summary
- Itemset mining is the simplest of all mining algorithms.
- Need to maintain occurrence of each pattern in database.
- Tree by lexicographical order is (implicitly) used.
Closed Itemset mining¶
PROBLEM in Frequent Pattern Mining¶
- Huge Number of frequent itemsets
- Hard to analyze
- Most of them are similar
SOLUTION in Frequent Pattern Mining¶
- Find only closed patterns
- Observation: Most frequent itemset $X$ can be extended without changing occurrence by adding new elements.
- definition: An itemset $X$ is a "closed set" iff there is no proper superset of $X$ with the same frequency (thus the same occurrence set).
- A closed itemset is the maximal set among all itemsets with the same occurrences.
- Equivalence class $[X] =\left\{Y | \mathrm{Occ}(X)=\mathrm{Occ}(Y)\right\}$
Implementation (LCM)¶
In [7]:
mine(method="closed", data_name="500", threshold=10)
In [8]:
mine(method="closed", data_name="1000", threshold=10)
Compare¶
# | all | closed |
---|---|---|
retail_1based_500.txt |
||
retail_1based_1000.txt |
Optimization¶
- [ ] Occurrence deliver
- when
num_items
large and sparse transaction. - Exclude many potential children.
- when
- database reduction
- [x] Create small transaction database in any iteration by removing.
- [ ] transaction not containing $C$.
- [ ] items of $C$.
- [x] items with frequency less than minimal support.
- [x] transaction id and item id re-indexing.
- [ ] duplicated transactions are merged with weight factor.
- [x] Bit matrix
In [ ]: