Data Mining
- A formal study of efficient methods for extracting interesting rules and patterns from massive data.
- Frequent itemset mining.
- Closed pattern mining.
- Structured data mining.
- Itemset Mining
- Frequent Itemset Mining
- Closed Itemset Mining
- Closure Extension
- LCM
Itemset Mining
Frequent Itemset Mining
Finding all "frequent" sets of elements (items) that appear $\sigma$ times or more in a database.
Definitions
- Database
- A set $\Sigma = \{1,\ldots,n\}$ of items (elements).
- Transaction database
- A set $T=\{t_1,\ldots,t_m\}$ of subsets of $\Sigma$
- Each subset $t\subseteq\Sigma$ is called a transaction
- Frequent sets
- Itemset $X$ appears in transaction $t$: $X\subseteq t$
- Occurrence of $X$ in database $T$: $$\mathrm{Occ}(X,T) = \left\{t\in T: X\subseteq t\right\}$$
- Frequency of $X$: $\mathrm{Fr}(X,T) = |\mathrm{Occ}(X,T)|$
- Minimum support (minsup): $0\leq\sigma\leq|T|$
- $X$ is frequent in $T$ if $\mathrm{Fr}(X,T)\geq\sigma$ (a toy example follows below)
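To make these definitions concrete, here is a minimal sketch on a tiny made-up transaction database; the transactions, the threshold, and the helper names `occ`/`fr` below are all invented for illustration.

```python
from itertools import combinations

# Toy transaction database over Sigma = {1, 2, 3, 4} (made-up data).
T = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 2, 4}]
sigma = 2  # minimum support (minsup)

def occ(X, T):
    """Occurrence set Occ(X, T) = {t in T : X is a subset of t}."""
    return [t for t in T if X <= t]

def fr(X, T):
    """Frequency Fr(X, T) = |Occ(X, T)|."""
    return len(occ(X, T))

# Naively enumerate every non-empty itemset and keep the frequent ones.
Sigma = sorted(set().union(*T))
frequent = [set(X) for r in range(1, len(Sigma) + 1)
            for X in combinations(Sigma, r)
            if fr(set(X), T) >= sigma]
print(frequent)  # -> [{1}, {2}, {3}, {4}, {1, 2}, {2, 3}, {2, 4}] for sigma = 2
```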
Market Basket Data
- A popular application of itemset mining.
- Business and market data analysis.
※ Visit "The Instacart Online Grocery Shopping Dataset 2017" page and download the "instacart_2017_05_01" folder.
In [1]:
! tree instacart_2017_05_01
In [2]:
import numpy as np
import pandas as pd
products.csv
In [3]:
df_products = pd.read_csv("./instacart_2017_05_01/products.csv")
df_products.head(3)
Out[3]:
In [4]:
print(f"`products.csv` data num: {len(df_products)}")
In [5]:
id2name = dict(zip(df_products.product_id, df_products.product_name))
order_products__train.csv
In [6]:
df_order_products = pd.read_csv("./instacart_2017_05_01/order_products__train.csv")
df_order_products.head(3)
Out[6]:
In [7]:
N_transactions_total = len(df_order_products.groupby("order_id"))
print(f"Total number of transactions: {N_transactions_total}")
Extract Only the Top-N Products
In [8]:
N = 10
In [9]:
# Take the N most frequently ordered product ids.
extracted_ids = (df_order_products.groupby("product_id").size()
                 .reset_index()
                 .sort_values(by=0, ascending=False)
                 .product_id[:N].values)
In [10]:
# Only focus on N products.
df_order_products_extracted = df_order_products[df_order_products.product_id.isin(extracted_ids)]
In [11]:
N_transactions_extracted = len(df_order_products_extracted.groupby("order_id"))
print(f"Transactions: {N_transactions_extracted}/{N_transactions_total}")
In [12]:
for i,extracted_id in enumerate(extracted_ids):
print(f"No.{i+1:>0{len(str(N))}}: {id2name[extracted_id]}")
In [13]:
df_product_id_OneHot = pd.get_dummies(df_order_products_extracted.product_id)
In [14]:
df_order_OneHot = pd.concat([df_order_products_extracted.order_id, df_product_id_OneHot], axis=1)
In [15]:
df = df_order_OneHot.groupby("order_id").sum().reset_index()
In [16]:
df.head(3)
Out[16]:
In [17]:
database = np.array(df[df.columns[1:]])
In [18]:
database
Out[18]:
In [19]:
N,M = database.shape
In [20]:
print(f"n_samples: {N}")
print(f"n_features: {M}")
In [21]:
from kerasy.search.itemset import FrequentSet
In [22]:
model = FrequentSet(threshold=100)
In [23]:
model.fit(database)
In [24]:
# Frequent itemsets (frequency >= threshold).
model.all
Out[24]:
Association Rule Mining
- Confidence: $\mathrm{Supp}(A\text{ and }B)/\mathrm{Supp}(A)$ (the probability of $B$ given $A$)
- Which items are likely to be bought when $A$ is bought?
- Search for rules with both large support and large confidence.
- A post-processing step after itemset mining (a small sketch follows below).
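A small sketch of this post-processing step, computed directly on the 0/1 `database` array built above. The helper names are mine, and support here is expressed as a fraction of transactions rather than the absolute count used by `FrequentSet`.

```python
def support(items, database):
    """Fraction of transactions that contain every column index in `items`."""
    return database[:, sorted(items)].all(axis=1).mean()

def confidence(A, B, database):
    """Supp(A and B) / Supp(A): an estimate of P(B | A)."""
    return support(A | B, database) / support(A, database)

# Example: how likely is column 1 bought in a transaction that contains column 0?
print(confidence({0}, {1}, database))
```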
Summary
- Itemset mining is the simplest of all pattern mining problems.
- We need to maintain the occurrence set of each pattern in the database.
- A search tree ordered lexicographically is used (implicitly), as sketched below.
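A sketch of that implicit lexicographic search tree, assuming the 0/1 `database` array from above; the function names are mine, and this is not the kerasy implementation. Each itemset is extended only with items larger than its maximum element, so no itemset is generated twice.

```python
import numpy as np

def frequent_itemsets(database, threshold):
    """Depth-first backtracking over the lexicographic tree of itemsets."""
    n_items = database.shape[1]
    results = []

    def backtrack(itemset, occ_mask):
        # occ_mask marks the transactions that contain `itemset`.
        start = itemset[-1] + 1 if itemset else 0
        for e in range(start, n_items):
            new_mask = occ_mask & (database[:, e] > 0)
            if new_mask.sum() >= threshold:  # frequency only decreases down the tree, so prune here
                child = itemset + [e]
                results.append(child)
                backtrack(child, new_mask)

    backtrack([], np.ones(len(database), dtype=bool))
    return results

# Should agree with FrequentSet(threshold=100) above
# (the representation and ordering may differ).
print(frequent_itemsets(database, threshold=100))
```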
Closed Itemset Mining
[Problem] in Frequent Pattern Mining
- Huge Number of frequent itemsets
- Hard to analyze
- Most of them are similar
[Solution] Closed Pattern Mining
- Find only closed patterns
- Observation: most frequent itemsets $X$ can be extended by adding new elements without changing their occurrence sets.
- Definition: an itemset $X$ is a "closed set" if and only if there is no proper superset of $X$ with the same frequency (and thus the same occurrence set).
- A closed itemset is the maximal set among all itemsets with the same occurrences.
- Equivalence class $[X] = \left\{Y \mid \mathrm{Occ}(X)=\mathrm{Occ}(Y)\right\}$ (see the sketch below).
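A minimal sketch of the closure operation (the intersection of all transactions in $\mathrm{Occ}(X)$, i.e. the unique maximal member of $[X]$); the toy data and names below are invented for illustration.

```python
# {1, 2} and {1, 2, 3} occur in exactly the same transactions here,
# so they belong to the same equivalence class [X]; only {1, 2, 3} is closed.
T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 4}]

def closure(X, T):
    """Intersection of all transactions containing X: the maximal set in [X]."""
    occ = [t for t in T if X <= t]
    return set.intersection(*occ) if occ else set().union(*T)

print(closure({1, 2}, T))     # -> {1, 2, 3}: so {1, 2} is not closed
print(closure({1, 2, 3}, T))  # -> {1, 2, 3}: closed
```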
Implementation
Brute-force: Stupid Baseline
- ALGORITHM: Brute-force
- First, generate all frequent itemsets.
- Check them one by one via a maximality test.
- Finally, extract all closed sets.
- Maximality test for each candidate frequent set $X$:
- Add some element $e\in\Sigma$ to $X$.
- If $\mathrm{Fr}(X\cup\{e\})$ equals $\mathrm{Fr}(X)$ for some $e\notin X$, then reject $X$ (it is not maximal).
※ The number of frequent patterns is usually exponential in the input size, so the brute-force approach has exponential delay (a sketch is given below).
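A sketch of this brute-force baseline on a tiny made-up database; the data and names are mine, and it reuses the naive frequent-set enumeration from the earlier sketch.

```python
from itertools import combinations

T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 4}, {1, 2}]
sigma = 2
Sigma = sorted(set().union(*T))

def fr(X, T):
    return sum(1 for t in T if X <= t)

# Step 1: generate all frequent itemsets (exponentially many in general).
frequent = [set(X) for r in range(1, len(Sigma) + 1)
            for X in combinations(Sigma, r) if fr(set(X), T) >= sigma]

# Step 2: maximality test -- keep X only if adding any new element
# strictly decreases the frequency.
closed = [X for X in frequent
          if all(fr(X | {e}, T) < fr(X, T) for e in Sigma if e not in X)]
print(closed)  # -> [{2}, {1, 2}, {2, 4}, {1, 2, 3}]
```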
Reverse Search
- A general mathematical framework to design enumeration algorithms.
- Can be used to prove the correctness of the algorithm.
- Popular in computational geometry.
- Data mining algorithms can be explained with remarkable simplicity.
- How to do reverse search?
- Naive backtracking → duplication.
- Duplication check by marking → exponential memory.
- Reduction map: a mapping from each child to its parent.
LCM (Linear-time Closed Itemset Miner)
- Key algorithm: prefix-preserving closure extension = generating children from the reduction map (see the sketch below).
- Reference (in Japanese): "A rule discovery method from large-scale databases using a closed-set enumeration algorithm" (飽和集合列挙アルゴリズムを用いた大規模データベースからのルール発見手法).
- Python implementation: work in progress XD
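Since the kerasy version is still in progress, here is a rough, simplified sketch of the prefix-preserving closure (ppc) extension on a toy database. The data, names, and recursion structure below are my own simplification for illustration, not the official LCM code.

```python
# Toy database; items are nonnegative integers so they have a natural order.
T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 4}, {1, 2}]
sigma = 1
Sigma = sorted(set().union(*T))

def closure(X, T):
    """Intersection of the transactions containing X."""
    occ = [t for t in T if X <= t]
    return set.intersection(*occ) if occ else set(Sigma)

def lcm(P, core_i, T):
    """Enumerate closed itemsets reachable from the closed set P by ppc extension."""
    print(sorted(P))
    for e in Sigma:
        if e in P or e <= core_i:
            continue
        if sum(1 for t in T if (P | {e}) <= t) < sigma:
            continue                                   # frequency test
        Q = closure(P | {e}, T)
        # ppc test: adding e must leave the prefix below e unchanged,
        # which guarantees that every closed set is generated exactly once.
        if {q for q in Q if q < e} == {p for p in P if p < e}:
            lcm(Q, e, T)

# Start from the closure of the empty set (the smallest closed set);
# core_i=-1 works because all items here are nonnegative.
lcm(closure(set(), T), core_i=-1, T=T)
```

Each recursive call jumps directly from one closed set to another; the real LCM implements this jump so that the delay per output is linear in the database size.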
Summary
- Closure Extension: Jump from closed set to closed set.
- LCM: Linear Delay.
- Very fast in practice, too (winner of FIMI'04, the Frequent Itemset Mining Implementations workshop).
- Relation to clique enumeration.