Data Mining
- A formal study of efficient methods for extracting interesting rules and patterns from massive data.
- Frequent itemset mining.
- Closed pattern mining.
- Structured data mining.
- Itemset Mining
- Frequent Itemset Mining
- Closed Itemset Mining
- Closure Extension
- LCM
Itemset Mining
Frequent Itemset Mining
Finding all "frequent" sets of elements (items) that appear $\sigma$ times or more in a database.
Definitions
- Database
- A set $\Sigma = \{1,\ldots,n\}$ of items (elements).
- Transaction database
- A set $T=\{t_1,\ldots,t_m\}$ of subsets of $\Sigma$
- Each subset $t\subseteq\Sigma$ is called a transaction
- Frequent sets
- Itemset $X$ appears in transaction $t$: $X\subseteq t$
- Occurrence of $X$ in database $T$: $$\mathrm{Occ}(X,T) = \left\{t\in T: X\subseteq t\right\}$$
- Frequency of $X$: $\mathrm{Fr}(X,T) = |\mathrm{Occ}(X,T)|$
- Minimum support (minsup): $0\leq\sigma\leq|T|$
- $X$ is frequent in $T$ if $\mathrm{Fr}(X,T)\geq\sigma$ (a toy example follows below)
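To make these definitions concrete, here is a minimal sketch on a tiny made-up transaction database; the transactions, the threshold, and the helper names `occ`/`fr` below are all invented for illustration.

```python
from itertools import combinations

# Toy transaction database over Sigma = {1, 2, 3, 4} (made-up data).
T = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 2, 4}]
sigma = 2  # minimum support (minsup)

def occ(X, T):
    """Occurrence set Occ(X, T) = {t in T : X is a subset of t}."""
    return [t for t in T if X <= t]

def fr(X, T):
    """Frequency Fr(X, T) = |Occ(X, T)|."""
    return len(occ(X, T))

# Naively enumerate every non-empty itemset and keep the frequent ones.
Sigma = sorted(set().union(*T))
frequent = [set(X) for r in range(1, len(Sigma) + 1)
            for X in combinations(Sigma, r)
            if fr(set(X), T) >= sigma]
print(frequent)  # -> [{1}, {2}, {3}, {4}, {1, 2}, {2, 3}, {2, 4}] for sigma = 2
```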
Market Basket Data
- A popular application of itemset mining.
- Business and market data analysis.
※ Visit "The Instacart Online Grocery Shopping Dataset 2017" page and download the "instacart_2017_05_01" folder.
In [1]:
! tree instacart_2017_05_01
In [2]:
import numpy as np
import pandas as pd
products.csv
In [3]:
df_products = pd.read_csv("./instacart_2017_05_01/products.csv")
df_products.head(3)
Out[3]:
In [4]:
print(f"`products.csv` data num: {len(df_products)}")
In [5]:
id2name = dict(zip(df_products.product_id, df_products.product_name))
order_products__train.csv
In [6]:
df_order_products = pd.read_csv("./instacart_2017_05_01/order_products__train.csv")
df_order_products.head(3)
Out[6]:
In [7]:
N_transactions_total = len(df_order_products.groupby("order_id"))
print(f"Total number of transactions: {N_transactions_total}")
Extract Only the Top-N Products
In [8]:
N = 10
In [9]:
# Take the N most frequently ordered product ids.
extracted_ids = (df_order_products.groupby("product_id").size()
                 .reset_index()
                 .sort_values(by=0, ascending=False)
                 .product_id[:N].values)
In [10]:
# Only focus on N products.
df_order_products_extracted = df_order_products[df_order_products.product_id.isin(extracted_ids)]
In [11]:
N_transactions_extracted = len(df_order_products_extracted.groupby("order_id"))
print(f"Transactions: {N_transactions_extracted}/{N_transactions_total}")
In [12]:
for i,extracted_id in enumerate(extracted_ids):
print(f"No.{i+1:>0{len(str(N))}}: {id2name[extracted_id]}")
In [13]:
df_product_id_OneHot = pd.get_dummies(df_order_products_extracted.product_id)
In [14]:
df_order_OneHot = pd.concat([df_order_products_extracted.order_id, df_product_id_OneHot], axis=1)
In [15]:
df = df_order_OneHot.groupby("order_id").sum().reset_index()
In [16]:
df.head(3)
Out[16]:
In [17]:
database = np.array(df[df.columns[1:]])
In [18]:
database
Out[18]:
In [19]:
N,M = database.shape
In [20]:
print(f"n_samples: {N}")
print(f"n_features: {M}")
In [21]:
from kerasy.search.itemset import FrequentSet
In [22]:
model = FrequentSet(threshold=100)
In [23]:
model.fit(database)
In [24]:
# Frequent itemsets (frequency >= threshold).
model.all
Out[24]:
Association Rule Mining
- Confidence: $\mathrm{Supp}(A\text{ and }B)/\mathrm{Supp}(A)$ (the probability of $B$ given $A$)
- Which items are likely to be bought when $A$ is bought?
- Search for rules with both large support and large confidence.
- A post-processing step after itemset mining (a small sketch follows below).
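A small sketch of this post-processing step, computed directly on the 0/1 `database` array built above. The helper names are mine, and support here is expressed as a fraction of transactions rather than the absolute count used by `FrequentSet`.

```python
def support(items, database):
    """Fraction of transactions that contain every column index in `items`."""
    return database[:, sorted(items)].all(axis=1).mean()

def confidence(A, B, database):
    """Supp(A and B) / Supp(A): an estimate of P(B | A)."""
    return support(A | B, database) / support(A, database)

# Example: how likely is column 1 bought in a transaction that contains column 0?
print(confidence({0}, {1}, database))
```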
Summary
- Itemset mining is the simplest of all pattern mining problems.
- We need to maintain the occurrence set of each pattern in the database.
- A search tree ordered lexicographically is used (implicitly), as sketched below.
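A sketch of that implicit lexicographic search tree, assuming the 0/1 `database` array from above; the function names are mine, and this is not the kerasy implementation. Each itemset is extended only with items larger than its maximum element, so no itemset is generated twice.

```python
import numpy as np

def frequent_itemsets(database, threshold):
    """Depth-first backtracking over the lexicographic tree of itemsets."""
    n_items = database.shape[1]
    results = []

    def backtrack(itemset, occ_mask):
        # occ_mask marks the transactions that contain `itemset`.
        start = itemset[-1] + 1 if itemset else 0
        for e in range(start, n_items):
            new_mask = occ_mask & (database[:, e] > 0)
            if new_mask.sum() >= threshold:  # frequency only decreases down the tree, so prune here
                child = itemset + [e]
                results.append(child)
                backtrack(child, new_mask)

    backtrack([], np.ones(len(database), dtype=bool))
    return results

# Should agree with FrequentSet(threshold=100) above
# (the representation and ordering may differ).
print(frequent_itemsets(database, threshold=100))
```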
Closed Itemset Mining
[Problem] in Frequent Pattern Mining
- Huge Number of frequent itemsets
- Hard to analyze
- Most of them are similar
[Solution] Closed Pattern Mining
- Find only closed patterns
- Observation: most frequent itemsets $X$ can be extended by adding new elements without changing their occurrence sets.
- Definition: an itemset $X$ is a "closed set" if and only if there is no proper superset of $X$ with the same frequency (and thus the same occurrence set).
- A closed itemset is the maximal set among all itemsets with the same occurrences.
- Equivalence class $[X] = \left\{Y \mid \mathrm{Occ}(X)=\mathrm{Occ}(Y)\right\}$ (see the sketch below).
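A minimal sketch of the closure operation (the intersection of all transactions in $\mathrm{Occ}(X)$, i.e. the unique maximal member of $[X]$); the toy data and names below are invented for illustration.

```python
# {1, 2} and {1, 2, 3} occur in exactly the same transactions here,
# so they belong to the same equivalence class [X]; only {1, 2, 3} is closed.
T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 4}]

def closure(X, T):
    """Intersection of all transactions containing X: the maximal set in [X]."""
    occ = [t for t in T if X <= t]
    return set.intersection(*occ) if occ else set().union(*T)

print(closure({1, 2}, T))     # -> {1, 2, 3}: so {1, 2} is not closed
print(closure({1, 2, 3}, T))  # -> {1, 2, 3}: closed
```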
Implementation
Brute-force: Stupid Baseline
- ALGORITHM: Brute-force
- First, generate all frequent itemsets.
- Check them one by one via a maximality test.
- Finally, extract all closed sets.
- Maximality test for each candidate frequent set $X$:
- Add some element $e\in\Sigma$ to $X$.
- If $\mathrm{Fr}(X\cup\{e\})$ equals $\mathrm{Fr}(X)$ for some $e\notin X$, then reject $X$ (it is not maximal).
※ The number of frequent patterns is usually exponential in the input size, so the brute-force approach has exponential delay (a sketch is given below).
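A sketch of this brute-force baseline on a tiny made-up database; the data and names are mine, and it reuses the naive frequent-set enumeration from the earlier sketch.

```python
from itertools import combinations

T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 4}, {1, 2}]
sigma = 2
Sigma = sorted(set().union(*T))

def fr(X, T):
    return sum(1 for t in T if X <= t)

# Step 1: generate all frequent itemsets (exponentially many in general).
frequent = [set(X) for r in range(1, len(Sigma) + 1)
            for X in combinations(Sigma, r) if fr(set(X), T) >= sigma]

# Step 2: maximality test -- keep X only if adding any new element
# strictly decreases the frequency.
closed = [X for X in frequent
          if all(fr(X | {e}, T) < fr(X, T) for e in Sigma if e not in X)]
print(closed)  # -> [{2}, {1, 2}, {2, 4}, {1, 2, 3}]
```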
Reverse Search
- A general mathematical framework to design enumeration algorithms.
- Can be used to prove the correctness of the algorithm.
- Popular in computational geometry.
- Data mining algorithms can be explained with remarkable simplicity.
- How to do reverse search?
- Naive backtracking → duplication.
- Duplication check by marking → exponential memory.
- Reduction map: a mapping from each child to its parent.
LCM (Linear-time Closed Itemset Miner)
- Key algorithm: prefix-preserving closure extension = generating children from the reduction map (see the sketch below).
- Reference (in Japanese): "A rule discovery method from large-scale databases using a closed-set enumeration algorithm" (飽和集合列挙アルゴリズムを用いた大規模データベースからのルール発見手法).
- Python implementation: work in progress XD
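Since the kerasy version is still in progress, here is a rough, simplified sketch of the prefix-preserving closure (ppc) extension on a toy database. The data, names, and recursion structure below are my own simplification for illustration, not the official LCM code.

```python
# Toy database; items are nonnegative integers so they have a natural order.
T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 4}, {1, 2}]
sigma = 1
Sigma = sorted(set().union(*T))

def closure(X, T):
    """Intersection of the transactions containing X."""
    occ = [t for t in T if X <= t]
    return set.intersection(*occ) if occ else set(Sigma)

def lcm(P, core_i, T):
    """Enumerate closed itemsets reachable from the closed set P by ppc extension."""
    print(sorted(P))
    for e in Sigma:
        if e in P or e <= core_i:
            continue
        if sum(1 for t in T if (P | {e}) <= t) < sigma:
            continue                                   # frequency test
        Q = closure(P | {e}, T)
        # ppc test: adding e must leave the prefix below e unchanged,
        # which guarantees that every closed set is generated exactly once.
        if {q for q in Q if q < e} == {p for p in P if p < e}:
            lcm(Q, e, T)

# Start from the closure of the empty set (the smallest closed set);
# core_i=-1 works because all items here are nonnegative.
lcm(closure(set(), T), core_i=-1, T=T)
```

Each recursive call jumps directly from one closed set to another; the real LCM implements this jump so that the delay per output is linear in the database size.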
Summary
- Closure Extension: Jump from closed set to closed set.
- LCM: Linear Delay.
- Very fast in practice, too (winner of FIMI'04, the Frequent Itemset Mining Implementations workshop).
- Relation to clique enumeration.