3A
  • Portfolio Top
  • Categories
  • Tags
  • Archives

Pro.4 Itemset mining algorithm

Problem Setting

Implement the "Itemset mining algorithm (LCM)" to find out the most frequent closed patterns.

Frequent Itemset Mining¶

Finding all "frequent" sets of elements(items) appearing $\sigma$ times or more</font> in a database.

Terminology¶

Name description
Itemset $$I = \left\{1,\ldots,M\right\}$$
Transaction $$t\subset I$$
Transaction database $$\mathcal{T}=\left\{t_1,\ldots,t_N\right\}$$
Pattern $$P \subset I$$
Occurrence $$P\subset\exists t\in\mathcal{T}$$
Denotation $$\mathcal{T}(P) = \left\{t\in\mathcal{T}\mid P\subset t\right\}$$
Support/Frequency $$n_{\mathcal{T}}(P) = \mid\mathcal{T}(P)\mid$$

Implementation¶

data¶
In [1]:
from kerasy.utils import flatten_dual
In [2]:
# retail_1based_500.txt
with open("itemset_mining/retail_1based_500.txt", mode="r") as f:
    retail_500 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
    print(f"The number of transactions: {len(retail_500)}")
    print(f"The number of unique items: {len(set(flatten_dual(retail_500)))}")
The number of transactions: 500
The number of unique items: 2058
In [3]:
# retail_1based_1000.txt
with open("itemset_mining/retail_1based_1000.txt", mode="r") as f:
    retail_1000 = [name.rstrip(" \n").split(" ") for name in f.readlines()]
    print(f"The number of transactions: {len(retail_1000)}")
    print(f"The number of unique items: {len(set(flatten_dual(retail_1000)))}")
The number of transactions: 1000
The number of unique items: 3182
In [4]:
from kerasy.search.itemset import FrequentSet
from kerasy.search.itemset import create_one_hot

def mine(method, data_name="500", threshold=10):
    retail = {
        "500"  : retail_500,
        "1000" : retail_1000,
    }.get(data_name)
    database, idx2data = create_one_hot(retail)
    
    model = FrequentSet(threshold=threshold)
    model.fit(database, method=method)
    print(f"num frequent sets: {len(model.all)-1}")
    
    fn = f"tree_structure-retail_{data_name}-{method}.png"
    ret = model.export_graphviz(fn, class_names=idx2data)
    if ret:
        print(f"Graph image was saved at: `{fn}`")
    else:
        raise ValueError("Graph was not generated.")
        
    print()itemset_mining/
Algorithm¶
In [5]:
mine(method="all", data_name="500", threshold=10)
num frequent sets: 68
Graph image was saved at: `tree_structure-retail_500-all.png`

500-all

In [6]:
mine(method="all", data_name="1000", threshold=10)
num frequent sets: 217
Graph image was saved at: `tree_structure-retail_1000-all.png`

1000-all

Summary

  • Itemset mining is the simplest of all mining algorithms.
  • Need to maintain occurrence of each pattern in database.
  • Tree by lexicographical order is (implicitly) used.

Closed Itemset mining¶

PROBLEM in Frequent Pattern Mining¶
  • Huge Number of frequent itemsets
  • Hard to analyze
  • Most of them are similar
SOLUTION in Frequent Pattern Mining¶
  • Find only closed patterns
  • Observation: Most frequent itemset $X$ can be extended without changing occurrence by adding new elements.
  • definition: An itemset $X$ is a "closed set" iff there is no proper superset of $X$ with the same frequency (thus the same occurrence set).
  • A closed itemset is the maximal set among all itemsets with the same occurrences.
  • Equivalence class $[X] =\left\{Y | \mathrm{Occ}(X)=\mathrm{Occ}(Y)\right\}$

Implementation (LCM)¶

In [7]:
mine(method="closed", data_name="500", threshold=10)
num frequent sets: 59
Graph image was saved at: `tree_structure-retail_500-closed.png`

500-closed

In [8]:
mine(method="closed", data_name="1000", threshold=10)
num frequent sets: 213
Graph image was saved at: `tree_structure-retail_1000-closed.png`

1000-closed

Compare¶

# all closed
retail_1based_500.txt 500-all 500-closed
retail_1based_1000.txt 1000-all 1000-closed

Optimization¶

  • [ ] Occurrence deliver
    • when num_items large and sparse transaction.
    • Exclude many potential children.
  • database reduction
    • [x] Create small transaction database in any iteration by removing.
    • [ ] transaction not containing $C$.
    • [ ] items of $C$.
    • [x] items with frequency less than minimal support.
    • [x] transaction id and item id re-indexing.
    • [ ] duplicated transactions are merged with weight factor.
  • [x] Bit matrix

Reference¶

  • LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets
  • Statistical significance of combinatorial regulations.
  • A Fast Method of Statistical Assessment for Combinatorial Hypotheses Based on Frequent Itemset Enumeration
In [ ]:
 

  • « Pro.3 Gene Network Inference
  • 生物物理学 第7回(小テスト) »
hidden
Table of Contents
Published
Nov 5, 2019
Last Updated
Nov 5, 2019
Category
情報基礎実験(木立)
Tags
  • 3A 127
  • 情報基礎実験(木立) 20
Contact
Other contents
  • Home
  • Blog
  • Front-End
  • Kerasy
  • Python-Charmers
  • Translation-Gummy
    • 3A - Shuto's Notes
    • MIT
    • Powered by Pelican. Theme: Elegant