This is a simple implementation of Apriori Algorithm in Python Jupyter. It takes in a csv file with a list of transactions, and results out the association rules. The values for minimum_support and minimum_confidence need to be specified in the notebook.
1 We expect an input file to be a csv of the following format:
item1, item2, item3, ... so on
, t, ...
t, t, t,...
t, t, ...
... so on...
2 We input the file using dataframes from pandas library:
df = pd.read_csv("myDataFile.csv", low_memory=False)3 Next, we extract the header from the file and assign an index to each item.
* We first read all the column headers into item_list.
* Then, we index these headers. This is done by making a dictionary such that item_dict[item_name] = index_value.
* These index values are serial integers for every item in item_list.
item_list = list(df.columns)
item_dict = dict()
for i, item in enumerate(item_list):
item_dict[item] = i + 1- Now, we need to extract individual transactions from the data file.
- First we create an empty list
transactions. - Then, we iterate through the rows in the dataframe
df. - We create a local variable
transactionas an empty set. - Now we iterate through all the possible column items from
item_dict, and check if the value of that column in this row ist, i.e., true. If it is, then we add the assigned index value of this item to thistransactionset. - Finally, we append this
transactionto the list of alltransactions. - Thus,
transactionsis a list, where each transaction is a set of item index values.
- First we create an empty list
transactions = list()
for i, row in df.iterrows():
transaction = set()
for item in item_dict:
if row[item] == 't':
transaction.add(item_dict[item])
transactions.append(transaction)-
We first define some Utility Functions first.
-
get_support(transactions, item_set): This function calculates the support value for the givenitem_setfrom the provided list oftransactions.- It initialises a local variable
match_countto store the number of transactions where the givenitem_setis found. - It then iterates through the the list of
transactions - For each
transaction, it is checked whether the givenitem_setis a subset of thetransactionor not. If it is,match_countis incremented. - Finally support value calculated by dividing the
match_countby total number oftransactionsis returned
- It initialises a local variable
def get_support(transactions, item_set):
match_count = 0
for transaction in transactions:
if item_set.issubset(transaction):
match_count += 1
return float(match_count/len(transactions))self_join(frequent_item_sets_per_level, level): This function performs self join in the given list of frequent itemsets of previous level, and generates the candidate itemsets for the current level.- It takes 2 inputs:
frequent_item_sets_per_levelis a map of level to the list of itemsets found to be frequent for that level. Second argument is the currentlevelnumber. - It first initialises the
current_level_candidatesas an empty list, andlast_level_itemsas the list of frequent itemsets from the previous level. - If there are no frequent itemsets from the previous level, it returns an empty list for
current_level_candidates. Otherwise, it iterates through eachitemsetinlast_level_itemsstarting from0for indexi, and for eachitemsetinlast_level_itemsstarting from1for indexj. - It performs the union of itemsets at indices
iandj. - If this
union_setis not already present incurrent_level_candidatesand the number of elements in theunion_setis equal to the level number, then thisunion_setis appended intocurrent_level_candidates. - We have the check for the number of elements in
union_setto ensure that thecurrent_level_candidatescontain only the sets of fixed length. This is a requirement for Apriori Algorithm - Finally,
current_level_candidatesis returned.
- It takes 2 inputs:
def self_join(frequent_item_sets_per_level, level):
current_level_candidates = list()
last_level_items = frequent_item_sets_per_level[level - 1]
if len(last_level_items) == 0:
return current_level_candidates
for i in range(len(last_level_items)):
for j in range(i+1, len(last_level_items)):
itemset_i = last_level_items[i][0]
itemset_j = last_level_items[j][0]
union_set = itemset_i.union(itemset_j)
if union_set not in current_level_candidates and len(union_set) == level:
current_level_candidates.append(union_set)
return current_level_candidatesget_single_drop_subsets(item_set): This function returns the subsets of the givenitem_setwith one item less.- We first initialize the variable
single_drop_subsetsas an empty list. - Next, for each item in
item_set, we create a temporary set temp as a copy of theitem_setgiven. - We then remove this item from the temp set. It results in a subset of
item_setwithout the item, i.e., a subset of length one less than the length of theitem_set - We then append this temp set to the
single_drop_subsets - Finally, we return the list
single_drop_subsets.
- We first initialize the variable
def get_single_drop_subsets(item_set):
single_drop_subsets = list()
for item in item_set:
temp = item_set.copy()
temp.remove(item)
single_drop_subsets.append(temp)
return single_drop_subsetsis_valid_set(item_set, prev_level_sets): This checks if the givenitem_setis valid, i.e., has all its subsets with support value greater than the minimum support value. It relies on the fact thatprev_level_setscontains only thoseitem_setswhich are frequent, i.e., have support value greater than the minimum support value.- It first generates all the subsets of the given
item_setwith length one less than the length of the originalitem_set. This is done using the above described functionget_single_drop_subsets(). These subsets are stored insingle_drop_subsetsvariable - It then iterates through the
single_drop_subsetslist. - For each
single_drop_subset, it checks if it was present in theprev_level_sets. If it wasn’t it means the givenitem_setis a superset of a non-frequentitem_set. Thus, it returnsFalse - If all the
single_drop_subsetsare frequent itemsets, and are present in theprev_level_sets, it returnsTrue
- It first generates all the subsets of the given
def is_valid_set(item_set, prev_level_sets):
single_drop_subsets = get_single_drop_subsets(item_set)
for single_drop_set in single_drop_subsets:
if single_drop_set not in prev_level_sets:
return False
return Truepruning(frequent_item_sets_per_level, level, candidate_set): This function performs the pruning step of the Apriori Algorithm. It takes a listcandidate_setof all the candidate itemsets for the currentlevel, and for each candidate itemset checks if all its subsets are frequent itemsets. If not, it prunes it, If yes, it adds it to the list ofpost_pruning_set.- It first initialises an empty list variable
post_pruning_set. This is to store the list of frequent itemsets for the current level after performing pruning operation on the given list of candidate sets. - If there are no
candidate_set, it returns an empty listpost_pruning_set. - Otherwise, it first creates a list of frequent itemsets from the previous level and stores it in
prev_level_sets. - Then, it iterates over each
item_setincandidate_setlist. - For each
item_set, it checks whether it is a valid itemset or not. This is done using the above described functionis_valid_set(). This function uses the fact that all the subsets of the givenitem_set(formed by removing one item) need to be frequent itemsets for thisitem_setto be valid. - If this
item_setis valid, it is appended to the list ofpost_pruning_set. - Finally
post_pruning_setis returned.
- It first initialises an empty list variable
def pruning(frequent_item_sets_per_level, level, candidate_set):
post_pruning_set = list()
if len(candidate_set) == 0:
return post_pruning_set
prev_level_sets = list()
for item_set, _ in frequent_item_sets_per_level[level - 1]:
prev_level_sets.append(item_set)
for item_set in candidate_set:
if is_valid_set(item_set, prev_level_sets):
post_pruning_set.append(item_set)
return post_pruning_setapriori(min_support): This is the main function which uses all the above described Utility functions to implement the Apriori Algorithm and generate the list of frequent itemsets for each level for the providedtransactionsandmin_supportvalue.- It first creates a default empty dictionary
frequent_item_sets_per_level, which maps level numbers to the list of frequent itemsets for that level. - Next, it handles the first level itemsets. It means all the itemsets with only one item. To generate such itemsets, we iterate through the list of all items
item_list. We calculate thesupportvalue of each item using the utility functionget_support(). If this support value is greater than or equal to the providedmin_supportvalue, thisitem_setis added to the list of frequent itemsets for thislevel. - One thing to note here is that every itemset is stored as a pair of 2 values:
- The itemset
- The
supportvalue calculated for this itemset
- Now, we handle the levels greater than 1
- For each level greater than 1, we first generate the
current_level_candidatesitemsets by performingself_join()on the frequent itemsets of the previous level. - Next, we perform the pruning operation on these
current_level_candidatesusing thepruning()utility function described above, and obtain the results inpost_pruning_candidates - Now, if there is no itemset left after pruning, we break the loop. It means there is no point in processing for further levels.
- Otherwise, for each
item_setinpost_pruning_candidates, we calculate the support value using theget_support()utility function. - If this support value is greater than or equal to the given
min_support, we append thisitem_setinto the list of frequent itemsets for thislevel. - Note that this append operation also happens in pair format as described above.
- Finally, we return the dictionary
frequent_item_sets_per_level.
- It first creates a default empty dictionary
from collections import defaultdict
def apriori(min_support):
frequent_item_sets_per_level = defaultdict(list)
print("level : 1", end = " ")
for item in range(1, len(item_list) + 1):
support = get_support(transactions, {item})
if support >= min_support:
frequent_item_sets_per_level[1].append(({item}, support))
for level in range(2, len(item_list) + 1):
print(level, end = " ")
current_level_candidates = self_join(frequent_item_sets_per_level, level)
post_pruning_candidates = pruning(frequent_item_sets_per_level, level, current_level_candidates)
if len(post_pruning_candidates) == 0:
break
for item_set in post_pruning_candidates:
support = get_support(transactions, item_set)
if support >= min_support:
frequent_item_sets_per_level[level].append((item_set, support))
return frequent_item_sets_per_level- We specify the minimum support value for the given data here in variable
min_supportand invoke theapriori()function to generate thefrequent_item_sets_per_level.
min_support = 0.05
frequent_item_sets_per_level = apriori(min_support)- Below code produces a dictionary called
item_support_dictfromfrequent_item_sets_per_levelthat maps items to their support values.- First, a dictionary called
item_support_dictis created to store key value pairs of items and their support values, and an empty list called item_list is created to store the name of items corresponding toitem_dictvalues retrieved fromfrequent_item_sets_per_level. - Keys and values are retrieved from the
item_dictand put inside a list for the later use. - For each level in
frequent_item_sets_per_level, for each item-support pair, name of the item retrieved from thekey_listthat corresponds to the number inset_support_pair, and names are added to theitem_list. - Items names and their support values are mapped in the
item_support_dictas a frozenset-float number pair.
- First, a dictionary called
item_support_dict = dict()
item_list = list()
key_list = list(item_dict.keys())
val_list = list(item_dict.values())
for level in frequent_item_sets_per_level:
for set_support_pair in frequent_item_sets_per_level[level]:
for i in set_support_pair[0]:
item_list.append(key_list[val_list.index(i)])
item_support_dict[frozenset(item_list)] = set_support_pair[1]
item_list = list()find_subset(item, item_length): This function takes each item from theitem_support_dictand its lengthitem_lengthas parameter, and returns all possible combinations of elements inside the items.- It first creates an empty array called
combsto store a list of combinations. - It appends a list of all possible combinations of items to the
combsarray. - To reach the combinations from an array directly, for
combarray incombsarray, and for eacheltincombarray, each element appended to thesubsetsarray.
- It first creates an empty array called
def find_subset(item, item_length):
combs = []
for i in range(1, item_length + 1):
combs.append(list(combinations(item, i)))
subsets = []
for comb in combs:
for elt in comb:
subsets.append(elt)
return subsetsassociation_rules(min_confidence, support_dict): This function generates the association rules in accordance with the given minimum confidence value and the provided dictionary of itemsets against their support values. It takesmin_confidenceandsupport_dictas a parameter, and returnsrulesas a list.- For itemsets of more than one element, it first finds all their
subsetscalling thefind_subset(item, item_length)function. - For every subset
A, it calculates the setB = itemset-A. - If
Bis not empty, theconfidenceofBis calculated. - If this value is more than minimum confidence value, the rule
A->Bis added to the listruleswith the corresponding confidence value ofB.
- For itemsets of more than one element, it first finds all their
def association_rules(min_confidence, support_dict):
rules = list()
for item, support in support_dict.items():
item_length = len(item)
if item_length > 1:
subsets = find_subset(item, item_length)
for A in subsets:
B = item.difference(A)
if B:
A = frozenset(A)
AB = A | B
confidence = support_dict[AB] / support_dict[A]
if confidence >= min_confidence:
rules.append((A, B, confidence))
return rules- Output of the
association_rules(min_confidence, support_dict)function is calculated for givenmin_confidence=0.6below.
association_rules = association_rules(min_confidence = 0.6, support_dict = item_support_dict)- Number of rules and
association_rulesare printed. Rules are printed in the form ofA -> B <confidence: ... >, whereAandBcan be a comma separated list, if they consist of more than one item.
print("Number of rules: ", len(association_rules), "\n")
for rule in association_rules:
print('{0} -> {1} <confidence: {2}>'.format(set(rule[0]), set(rule[1]), rule[2]))