Problem Setting
Implement the "Itemset mining algorithm (LCM)" to find out the most frequent closed patterns.
Frequent Itemset Mining¶
Finding all "frequent" sets of elements(items) appearing σ times or more</font> in a database.
Terminology¶
Name | description |
---|---|
Itemset | I={1,…,M} |
Transaction | t⊂I |
Transaction database | T={t1,…,tN} |
Pattern | P⊂I |
Occurrence | P⊂∃t∈T |
Denotation | T(P)={t∈T∣P⊂t} |
Support/Frequency | nT(P)=∣T(P)∣ |
Implementation¶
data¶
In [1]:
from kerasy.utils import flatten_dual
In [2]:
# retail_1based_500.txt
with open("itemset_mining/retail_1based_500.txt", mode="r") as f:
retail_500 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
print(f"The number of transactions: {len(retail_500)}")
print(f"The number of unique items: {len(set(flatten_dual(retail_500)))}")
In [3]:
# retail_1based_1000.txt
with open("itemset_mining/retail_1based_1000.txt", mode="r") as f:
retail_1000 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
print(f"The number of transactions: {len(retail_1000)}")
print(f"The number of unique items: {len(set(flatten_dual(retail_1000)))}")
In [4]:
from kerasy.search.itemset import FrequentSet
from kerasy.search.itemset import create_one_hot
def mine(method, data_name="500", threshold=10):
retail = {
"500" : retail_500,
"1000" : retail_1000,
}.get(data_name)
database, idx2data = create_one_hot(retail)
model = FrequentSet(threshold=threshold)
model.fit(database, method=method)
print(f"num frequent sets: {len(model.all)-1}")
fn = f"tree_structure-retail_{data_name}-{method}.png"
ret = model.export_graphviz(fn, class_names=idx2data)
if ret:
print(f"Graph image was saved at: `{fn}`")
else:
raise ValueError("Graph was not generated.")
print()itemset_mining/
Algorithm¶
In [5]:
mine(method="all", data_name="500", threshold=10)
In [6]:
mine(method="all", data_name="1000", threshold=10)
Summary
- Itemset mining is the simplest of all mining algorithms.
- Need to maintain occurrence of each pattern in database.
- Tree by lexicographical order is (implicitly) used.
Closed Itemset mining¶
PROBLEM in Frequent Pattern Mining¶
- Huge Number of frequent itemsets
- Hard to analyze
- Most of them are similar
SOLUTION in Frequent Pattern Mining¶
- Find only closed patterns
- Observation: Most frequent itemset X can be extended without changing occurrence by adding new elements.
- definition: An itemset X is a "closed set" iff there is no proper superset of X with the same frequency (thus the same occurrence set).
- A closed itemset is the maximal set among all itemsets with the same occurrences.
- Equivalence class [X]={Y|Occ(X)=Occ(Y)}
Implementation (LCM)¶
In [7]:
mine(method="closed", data_name="500", threshold=10)
In [8]:
mine(method="closed", data_name="1000", threshold=10)
Compare¶
# | all | closed |
---|---|---|
retail_1based_500.txt |
![]() |
![]() |
retail_1based_1000.txt |
![]() |
![]() |
Optimization¶
- [ ] Occurrence deliver
- when
num_items
large and sparse transaction. - Exclude many potential children.
- when
- database reduction
- [x] Create small transaction database in any iteration by removing.
- [ ] transaction not containing C.
- [ ] items of C.
- [x] items with frequency less than minimal support.
- [x] transaction id and item id re-indexing.
- [ ] duplicated transactions are merged with weight factor.
- [x] Bit matrix
In [ ]: