
Frequent Pattern Mining

  • Lecturer: Koji Tsuda (津田宏治)
  • Materials: 生物データマイニング論 (Biological Data Mining, 2019)
  • Reference: 頻出パターン発見アルゴリズム入門 -アイテム集合からグラフまで- (An Introduction to Frequent Pattern Mining Algorithms: From Itemsets to Graphs)

Data Mining

  • A formal study of efficient methods for extracting interesting rules and patterns from massive data.
  • Frequent itemset mining.
  • Closed pattern mining.
  • Structured data mining.
  • Itemset Mining
    • Frequent Itemset Mining
  • Closed Itemset Mining
    • Closure Extension
    • LCM

Itemset Mining

Frequent Itemset Mining

Finding all "frequent" sets of elements (items) that appear $\sigma$ times or more in a database.

Definitions¶

  • Database
    • A set $\Sigma = \{1,\ldots,n\}$ of items(elements).
    • Transaction database
      • A set $T=\{t_1,\ldots,t_m\}$ of subsets of $\Sigma$
      • Each subset $t\subseteq\Sigma$ is called a transaction
  • Frequent sets
    • Itemset $X$ appears in transaction $t$: $X\subseteq t$
    • Occurrence of $X$ in database $T$: $$\mathrm{Occ}(X,T) = \left\{t\in T: X\subseteq t\right\}$$
    • Frequency of $X$: $\mathrm{Fr}(X,T) = |\mathrm{Occ}(X,T)|$
    • Minimum support (minsup): $0\leq\sigma\leq|T|$
    • $X$ is frequent in $T$ if $\mathrm{Fr}(X,T)\geq\sigma$
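The definitions above can be stated directly in Python on a toy transaction database (the items and transactions below are made up for illustration, and `occ`/`freq` are illustrative names):

```python
# Toy transaction database over the items Σ = {1, ..., 5}.
T = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 2, 3, 5}]

def occ(X, T):
    """Occ(X, T): the transactions of T that contain itemset X."""
    return [t for t in T if X <= t]  # for Python sets, X <= t means X ⊆ t

def freq(X, T):
    """Fr(X, T) = |Occ(X, T)|."""
    return len(occ(X, T))

sigma = 3  # minimum support
X = {1, 2}
print(freq(X, T))           # 3
print(freq(X, T) >= sigma)  # True: X is frequent in T
```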

Market Basket Data¶

  • Popular application of itemset mining.
  • Business and Market data analysis.

※ Visit "The Instacart Online Grocery Shopping Dataset 2017" and download the "instacart_2017_05_01" folder.

In [1]:
! tree instacart_2017_05_01
instacart_2017_05_01
├── aisles.csv
├── departments.csv
├── order_products__prior.csv
├── order_products__train.csv
├── orders.csv
└── products.csv

0 directories, 6 files
In [2]:
import numpy as np
import pandas as pd
products.csv¶
In [3]:
df_products = pd.read_csv("./instacart_2017_05_01/products.csv")
df_products.head(3)
Out[3]:
product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
In [4]:
print(f"`products.csv` data num: {len(df_products)}")
`products.csv` data num: 49688
In [5]:
id2name = dict(zip(df_products.product_id, df_products.product_name))
order_products__train.csv¶
In [6]:
df_order_products = pd.read_csv("./instacart_2017_05_01/order_products__train.csv")
df_order_products.head(3)
Out[6]:
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
In [7]:
N_transactions_total = len(df_order_products.groupby("order_id"))
print(f"Total number of transactions: {N_transactions_total}")
Total number of transactions: 131209
Extract Only N products¶
In [8]:
N = 10
In [9]:
extracted_ids = (df_order_products.groupby("product_id").size()
                 .reset_index().sort_values(by=0, ascending=False)
                 .product_id[:N].values)
In [10]:
# Only focus on the N most frequent products.
df_order_products_extracted = df_order_products[df_order_products.product_id.isin(extracted_ids)]
In [11]:
N_transactions_extracted = len(df_order_products_extracted.groupby("order_id"))
print(f"Transactions: {N_transactions_extracted}/{N_transactions_total}")
Transactions: 59887/131209
In [12]:
for i,extracted_id in enumerate(extracted_ids):
    print(f"No.{i+1:>0{len(str(N))}}: {id2name[extracted_id]}")
No.01: Banana
No.02: Bag of Organic Bananas
No.03: Organic Strawberries
No.04: Organic Baby Spinach
No.05: Large Lemon
No.06: Organic Avocado
No.07: Organic Hass Avocado
No.08: Strawberries
No.09: Limes
No.10: Organic Raspberries
In [13]:
df_product_id_OneHot = pd.get_dummies(df_order_products_extracted.product_id)
In [14]:
df_order_OneHot = pd.concat([df_order_products_extracted.order_id, df_product_id_OneHot], axis=1)
In [15]:
df = df_order_OneHot.groupby("order_id").sum().reset_index()
In [16]:
df.head(3)
Out[16]:
order_id 13176 16797 21137 21903 24852 26209 27966 47209 47626 47766
0 1 1 0 0 0 0 0 0 1 0 0
1 96 0 0 0 0 0 0 1 0 0 0
2 98 1 0 0 0 0 0 1 0 0 0
In [17]:
database = np.array(df[df.columns[1:]])
In [18]:
database
Out[18]:
array([[1, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=uint8)
In [19]:
N,M = database.shape  # NOTE: reassigns N from the product count (10) to the sample count.
In [20]:
print(f"n_samples:  {N}")
print(f"n_features: {M}")
n_samples:  59887
n_features: 10
In [21]:
from kerasy.search.itemset import FrequentSet
In [22]:
model = FrequentSet(threshold=100)
In [23]:
model.fit(database)
In [24]:
# Frequent Itemset. (frq >= threshold)
model.all
Out[24]:
[None,
 array([0]),
 array([0, 1]),
 array([0, 1, 3]),
 array([0, 2]),
 array([0, 2, 3]),
 array([0, 2, 3, 6]),
 array([0, 2, 3, 7]),
 array([0, 2, 5]),
 array([0, 2, 6]),
 array([0, 2, 6, 7]),
 array([0, 2, 7]),
 array([0, 2, 8]),
 array([0, 2, 9]),
 array([0, 3]),
 array([0, 3, 5]),
 array([0, 3, 6]),
 array([0, 3, 6, 7]),
 array([0, 3, 7]),
 array([0, 3, 8]),
 array([0, 3, 9]),
 array([0, 5]),
 array([0, 5, 6]),
 array([0, 5, 7]),
 array([0, 5, 8]),
 array([0, 5, 9]),
 array([0, 6]),
 array([0, 6, 7]),
 array([0, 6, 8]),
 array([0, 6, 9]),
 array([0, 7]),
 array([0, 7, 8]),
 array([0, 8]),
 array([0, 8, 9]),
 array([0, 9]),
 array([1]),
 array([1, 3]),
 array([1, 3, 4]),
 array([1, 3, 9]),
 array([1, 4]),
 array([1, 4, 5]),
 array([1, 4, 8]),
 array([1, 4, 9]),
 array([1, 5]),
 array([1, 5, 8]),
 array([1, 6]),
 array([1, 7]),
 array([1, 8]),
 array([1, 8, 9]),
 array([1, 9]),
 array([2]),
 array([2, 3]),
 array([2, 3, 4]),
 array([2, 3, 5]),
 array([2, 3, 6]),
 array([2, 3, 7]),
 array([2, 3, 8]),
 array([2, 3, 9]),
 array([2, 4]),
 array([2, 4, 5]),
 array([2, 4, 6]),
 array([2, 4, 7]),
 array([2, 4, 8]),
 array([2, 4, 9]),
 array([2, 5]),
 array([2, 5, 6]),
 array([2, 5, 7]),
 array([2, 5, 8]),
 array([2, 5, 9]),
 array([2, 6]),
 array([2, 6, 7]),
 array([2, 6, 8]),
 array([2, 6, 9]),
 array([2, 7]),
 array([2, 7, 8]),
 array([2, 8]),
 array([2, 8, 9]),
 array([2, 9]),
 array([3]),
 array([3, 4]),
 array([3, 4, 5]),
 array([3, 4, 5, 8]),
 array([3, 4, 6]),
 array([3, 4, 7]),
 array([3, 4, 8]),
 array([3, 4, 8, 9]),
 array([3, 4, 9]),
 array([3, 5]),
 array([3, 5, 7]),
 array([3, 5, 8]),
 array([3, 5, 8, 9]),
 array([3, 5, 9]),
 array([3, 6]),
 array([3, 6, 7]),
 array([3, 6, 8]),
 array([3, 6, 9]),
 array([3, 7]),
 array([3, 7, 8]),
 array([3, 8]),
 array([3, 8, 9]),
 array([3, 9]),
 array([4]),
 array([4, 5]),
 array([4, 5, 7]),
 array([4, 5, 8]),
 array([4, 5, 8, 9]),
 array([4, 5, 9]),
 array([4, 6]),
 array([4, 6, 8]),
 array([4, 6, 9]),
 array([4, 7]),
 array([4, 7, 8]),
 array([4, 8]),
 array([4, 8, 9]),
 array([4, 9]),
 array([5]),
 array([5, 6]),
 array([5, 6, 7]),
 array([5, 6, 8]),
 array([5, 7]),
 array([5, 7, 8]),
 array([5, 8]),
 array([5, 8, 9]),
 array([5, 9]),
 array([6]),
 array([6, 7]),
 array([6, 8]),
 array([6, 8, 9]),
 array([6, 9]),
 array([7]),
 array([7, 8]),
 array([8]),
 array([8, 9]),
 array([9])]

Association Rule Mining¶

  • Confidence: $\mathrm{Supp}(A\cup B)/\mathrm{Supp}(A)$ (the probability of $B$ given $A$).
  • Tells what item is likely to be bought when $A$ is bought.
  • Search for rules with both large support and large confidence.
  • Post-processing of itemset mining.
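Since confidence is pure post-processing of supports, it only takes a few lines; a minimal sketch on toy basket data (helper names are illustrative):

```python
# Toy basket data for illustration.
T = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]

def support(X, T):
    """Supp(X): number of transactions containing X."""
    return sum(1 for t in T if X <= t)

def confidence(A, B, T):
    """conf(A → B) = Supp(A ∪ B) / Supp(A): empirical P(B | A)."""
    return support(A | B, T) / support(A, T)

# Of the 3 baskets containing item 1, 2 also contain item 2.
print(confidence({1}, {2}, T))  # 0.666...
```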

Summary

  • Itemset mining is the simplest of all mining algorithms.
  • The occurrence set of each pattern in the database must be maintained.
  • A tree in lexicographical order is (implicitly) used for the enumeration.
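The lexicographic tree from the summary can be made explicit in a short backtracking sketch over a 0/1 matrix database like the one built above (`frequent_itemsets` is an illustrative function, not the `kerasy` API):

```python
import numpy as np

def frequent_itemsets(database, threshold):
    """Enumerate all frequent itemsets by depth-first search over the
    lexicographic tree: a node X is extended only with items larger than
    max(X), so every itemset is generated exactly once."""
    n_samples, n_items = database.shape
    results = []

    def expand(itemset, occ_rows):
        # occ_rows: indices of the transactions containing `itemset`.
        for item in range(itemset[-1] + 1 if itemset else 0, n_items):
            new_occ = occ_rows[database[occ_rows, item] == 1]
            if len(new_occ) >= threshold:  # anti-monotonicity prunes here
                child = itemset + [item]
                results.append(child)
                expand(child, new_occ)

    expand([], np.arange(n_samples))
    return results

demo = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1],
                 [1, 0, 1]], dtype=np.uint8)
print(frequent_itemsets(demo, threshold=2))
# [[0], [0, 1], [0, 2], [1], [1, 2], [2]]
```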

Closed Itemset Mining

[Problem] in Frequent Pattern Mining¶

  • Huge Number of frequent itemsets
  • Hard to analyze
  • Most of them are similar

[Solution] Closed Pattern Mining¶

  • Find only closed patterns
  • Observation: most frequent itemsets $X$ can be extended by adding new elements without changing their occurrence set.
  • Definition: an itemset $X$ is a "closed set" if and only if there is no proper superset of $X$ with the same frequency (and thus the same occurrence set).
  • A closed itemset is the maximal set among all itemsets with the same occurrences.
  • Equivalence class: $[X] = \left\{Y \mid \mathrm{Occ}(X)=\mathrm{Occ}(Y)\right\}$
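The closed set of an equivalence class $[X]$ can be computed directly: it is the intersection of all transactions in $\mathrm{Occ}(X)$. A minimal sketch on toy data:

```python
T = [{1, 2, 3}, {1, 2}, {1, 2, 3, 4}]

def closure(X, T):
    """The unique maximal itemset with the same occurrence set as X:
    the intersection of all transactions containing X (Occ(X) assumed
    nonempty)."""
    return set.intersection(*[t for t in T if X <= t])

def is_closed(X, T):
    return closure(X, T) == X

print(closure({3}, T))       # {1, 2, 3}: every transaction with item 3 also has 1 and 2
print(is_closed({3}, T))     # False
print(is_closed({1, 2}, T))  # True
```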

Implementation.¶

Brute-force: Stupid Baseline¶

  • ALGORITHM: Brute-force
    • First, generate all frequent itemsets.
    • Check them one by one via maximality test
    • Finally, extract all closed sets
  • Maximality test for each candidate frequent set $X$:
    • Add each element $e\in\Sigma\setminus X$ to $X$.
    • If $\mathrm{Fr}(X\cup\{e\}) = \mathrm{Fr}(X)$ for some $e$, reject $X$; if the frequency strictly decreases for every $e$, then $X$ is closed.

※ The number of patterns is usually exponential in the input size, so brute-force has exponential delay w.r.t. the pattern size.
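The brute-force baseline fits in a few lines; a sketch on toy data, where the exponential cost is of course invisible:

```python
from itertools import combinations

T = [{1, 2, 3}, {1, 2}, {1, 2, 3, 4}, {2, 3}]
items = {1, 2, 3, 4}
sigma = 2

def support(X, T):
    return sum(1 for t in T if X <= t)

# Step 1: generate all frequent itemsets (exponentially many in general).
frequent = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(sorted(items), r)
            if support(set(c), T) >= sigma]

# Step 2: maximality test — X survives only if adding any new element
# strictly decreases the frequency.
closed = [X for X in frequent
          if all(support(X | {e}, T) < support(X, T) for e in items - X)]

print(closed)  # [{2}, {1, 2}, {2, 3}, {1, 2, 3}]
```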

Reverse Search¶

  • A general mathematical framework to design enumeration algorithms.
  • Can be used to prove the correctness of the algorithm.
  • Popular in computational geometry.
  • Data mining algorithms can be explained in remarkable simplicity.
  • How to do reverse search?
    • Naive backtracking → duplication.
    • Duplication check by marking → exponential memory.
    • Reduction map: a mapping from each child to its parent.

LCM(Linear Time Closed Sets Miner)¶

  • Key algorithm: Prefix-Preserving Closure extension = child generation from the reduction map.
  • 飽和集合列挙アルゴリズムを用いた大規模データベースからのルール発見手法 (a rule-discovery method for large databases based on a closed-set enumeration algorithm).
  • Python implementation: I'm working on it now XD
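The real LCM is far more engineered (occurrence deliver, database reduction), but the prefix-preserving closure extension itself can be sketched naively; `lcm_sketch` below is an illustration under the assumption that items are positive integers, not the actual LCM or the `kerasy` API:

```python
def lcm_sketch(T, items, sigma):
    """Enumerate all frequent closed itemsets by prefix-preserving
    closure extension: from a closed set P, take closure(P ∪ {e}) for
    items e beyond the last extension item, and keep the child only if
    the closure adds no item smaller than e outside P (so each closed
    set has a unique parent and is visited exactly once)."""
    def support(X):
        return sum(1 for t in T if X <= t)

    def closure(X):
        return set.intersection(*[t for t in T if X <= t])

    results = []

    def expand(P, last_e):
        results.append(P)
        for e in sorted(items - P):
            if e <= last_e or support(P | {e}) < sigma:
                continue
            C = closure(P | {e})
            if all(x in P or x >= e for x in C):  # prefix preserved below e
                expand(C, e)

    root = closure(set())  # closure of the empty set (Occ = all of T)
    if support(root) >= sigma:
        expand(root, 0)    # assumes items are positive integers
    return results

T = [{1, 2, 3}, {1, 2}, {1, 2, 3, 4}, {2, 3}]
print(lcm_sketch(T, {1, 2, 3, 4}, sigma=2))
# [{2}, {1, 2}, {1, 2, 3}, {2, 3}] — each closed set visited exactly once
```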

Summary

  • Closure Extension: Jump from closed set to closed set.
  • LCM: Linear Delay.
  • Very fast in practice, too (winner of FIMI'04, the Frequent Itemset Mining Implementations workshop).
  • Relation to clique enumeration.

Published
Oct 28, 2019
Last Updated
Oct 28, 2019
Category
生物データマイニング論