Process Mining using Python

Hussam
7 min read · Jan 9, 2021


Process mining is a family of techniques in the field of process management that support the analysis of business processes based on event logs. During process mining, specialized data mining algorithms are applied to event-log data in order to identify trends, patterns and details contained in event logs recorded by an information system.

Process mining aims to improve process efficiency and the understanding of processes. It captures the “digital footprints” from any number of systems throughout an organization and organizes them in a way that shows each step of the journey to complete that process, along with any deviations from the “expected path”.
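Concretely, an event log can be pictured as a collection of cases (process instances), each an ordered list of events. A minimal sketch in plain Python (the key names follow the XES convention used later in this tutorial; the two toy cases are made up for illustration):

```python
from datetime import datetime

# One "case" (process instance) is an ordered list of events.
# Keys follow the XES naming convention: "concept:name" is the
# activity label, "time:timestamp" is when it happened.
event_log = [
    [  # case 1: a loan application that was submitted then declined
        {"concept:name": "A_SUBMITTED", "time:timestamp": datetime(2011, 10, 1, 8, 8)},
        {"concept:name": "A_DECLINED",  "time:timestamp": datetime(2011, 10, 1, 8, 15)},
    ],
    [  # case 2: submitted and pre-accepted
        {"concept:name": "A_SUBMITTED",   "time:timestamp": datetime(2011, 10, 1, 9, 0)},
        {"concept:name": "A_PREACCEPTED", "time:timestamp": datetime(2011, 10, 1, 9, 30)},
    ],
]

# The "trace" of a case is just its sequence of activity names
traces = [tuple(e["concept:name"] for e in case) for case in event_log]
print(traces)
```

Everything process mining does (discovering models, counting variants, finding bottlenecks) is derived from this simple structure.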

I often get comments like:

“Only data scientists can understand these complicated terms!” or “You need to buy expensive licenses for fancy software to perform process mining.”

Not really! I still advocate using a commercial tool (there are many on the market) because it is easy and usually comes with good guidance and support, but you can do the simple things with free technologies like Python or R.

In this tutorial, I will present a powerful open-source process mining library for Python called PM4PY, along with techniques for extracting information from any event data. Not only will you learn surprisingly powerful analysis techniques, you can also directly use the code provided here to improve processes and systems, diagnose deviations, and understand bottlenecks in your own process.

**For references and more information, please visit the PM4PY page; most of the code provided below comes from the official PM4PY documentation.**

The data used in this tutorial are real logs obtained from https://www.win.tue.nl/bpi/doku.php?id=2012:challenge

As described on their website, it is data from a loan application process (personal loan or overdraft) within a global Dutch financial organization. The goal is to try to detect weaknesses and inefficiencies in the process.

Having a very high-level understanding of Python will surely help. After creating a separate environment and installing the needed packages:
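One way to do that setup (the library is published on PyPI as `pm4py`, matching the imports below; pin a version if you need to reproduce this tutorial's exact API):

```shell
# create and activate an isolated environment
python -m venv pm_env
source pm_env/bin/activate   # on Windows: pm_env\Scripts\activate

# install the libraries used in this tutorial
pip install pm4py pandas
```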

  1. Load the needed packages
import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter
from pm4py.objects.log.importer.xes import importer as xes_importer


# process mining
from pm4py.algo.discovery.alpha import algorithm as alpha_miner
from pm4py.algo.discovery.inductive import algorithm as inductive_miner
from pm4py.algo.discovery.heuristics import algorithm as heuristics_miner
from pm4py.algo.discovery.dfg import algorithm as dfg_discovery


# viz
from pm4py.visualization.petrinet import visualizer as pn_visualizer
from pm4py.visualization.process_tree import visualizer as pt_visualizer
from pm4py.visualization.heuristics_net import visualizer as hn_visualizer
from pm4py.visualization.dfg import visualizer as dfg_visualization


# misc
from pm4py.objects.conversion.process_tree import converter as pt_converter


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

2. Load the dataset: here I am loading an XES file; however, you can do the same with a CSV (the code is slightly different; see the PM4PY documentation for more information, or get in touch and I will try to help):

log = xes_importer.apply('financial_log.xes.gz')
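For the CSV route, the usual pattern is to rename the case-id, activity, and timestamp columns to the XES-style names and then hand the DataFrame to the `log_converter` imported in step 1. A minimal sketch on a toy frame (the column names on the left are hypothetical; the ones on the right are what PM4PY expects):

```python
import pandas as pd

# toy event data, as it might come out of a CSV export
df = pd.DataFrame({
    "case_id":   ["173691", "173691", "173692"],
    "activity":  ["A_SUBMITTED", "A_DECLINED", "A_SUBMITTED"],
    "timestamp": ["2011-10-01 08:08", "2011-10-01 08:15", "2011-10-01 09:00"],
})

# rename to the XES-style column names PM4PY expects
df = df.rename(columns={
    "case_id":   "case:concept:name",
    "activity":  "concept:name",
    "timestamp": "time:timestamp",
})
df["time:timestamp"] = pd.to_datetime(df["time:timestamp"])

# then: log = log_converter.apply(df)   # using the converter imported in step 1
print(df.columns.tolist())
```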

3. Check the log

## Printing the first trace
log[0]
## Printing the first event in the first trace
log[0][0]

Outputs:

{'attributes': {'REG_DATE': datetime.datetime(2011, 10, 1, 8, 8, 58, 256000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), 'concept:name': '173691', 'AMOUNT_REQ': '5000'}, 'events': [{'org:resource': '112', 'lifecycle:transition': 'COMPLETE', 'concept:name': 'A_SUBMITTED', 'time:timestamp': datetime.datetime(2011, 10, 1, 8, 8, 58, 256000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))}, '..', {'org:resource': '10809', 'lifecycle:transition': 'COMPLETE', 'concept:name': 'W_Valideren aanvraag', 'time:timestamp': datetime.datetime(2011, 10, 10, 14, 17, 34, 633000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))}]}

{'org:resource': '112', 'lifecycle:transition': 'COMPLETE', 'concept:name': 'A_SUBMITTED', 'time:timestamp': datetime.datetime(2011, 10, 1, 0, 38, 44, 546000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))}

4. Some high-level analysis:

from pm4py.algo.filtering.log.start_activities import start_activities_filter
from pm4py.algo.filtering.log.end_activities import end_activities_filter


log_start = start_activities_filter.get_start_activities(log)
end_activities = end_activities_filter.get_end_activities(log)
log_start # Printing the start activity in our log

Outputs:
{'A_SUBMITTED': 13087}




end_activities # Printing the end activity in our log


Outputs:


{'W_Valideren aanvraag': 2747,
'W_Wijzigen contractgegevens': 4,
'A_DECLINED': 3429,
'W_Completeren aanvraag': 1939,
'A_CANCELLED': 655,
'W_Nabellen incomplete dossiers': 452,
'W_Afhandelen leads': 2234,
'W_Nabellen offertes': 1290,
'W_Beoordelen fraude': 57,
'O_CANCELLED': 279,
'A_REGISTERED': 1}

It looks like our process has one start activity and multiple end activities. All 13087 cases (loans) started with “A_SUBMITTED”, while the majority ended with either “A_DECLINED” or “W_Valideren aanvraag”.

A process variant is a unique path a loan application took from the very beginning to the very end of the process. Here we see that our 13087 loans were processed in 4366 different ways!

from pm4py.algo.filtering.log.variants import variants_filter
from pm4py.statistics.traces.log import case_statistics
variants = variants_filter.get_variants(log)

print(f"We have: {len(variants)} variants in our log")

Outputs:

We have: 4366 variants in our log

Let’s try to understand how many cases (loan applications) those variants contain.

variants_count = case_statistics.get_variant_statistics(log)
variants_count = sorted(variants_count, key=lambda x: x['count'], reverse=True)
## Printing the top 10 variants by case number
variants_count[:10]

Outputs:

[{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,A_DECLINED', 'count': 3429},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,W_Afhandelen leads,W_Afhandelen leads,A_DECLINED,W_Afhandelen leads',
'count': 1872},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,W_Afhandelen leads,W_Afhandelen leads,W_Afhandelen leads,W_Afhandelen leads,A_DECLINED,W_Afhandelen leads',
'count': 271},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,W_Afhandelen leads,W_Afhandelen leads,A_PREACCEPTED,W_Completeren aanvraag,W_Afhandelen leads,W_Completeren aanvraag,A_DECLINED,W_Completeren aanvraag',
'count': 209},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED,W_Completeren aanvraag,W_Completeren aanvraag,A_DECLINED,W_Completeren aanvraag',
'count': 160},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED,W_Completeren aanvraag,W_Completeren aanvraag,A_CANCELLED,W_Completeren aanvraag',
'count': 134},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,W_Afhandelen leads,W_Afhandelen leads,A_PREACCEPTED,W_Completeren aanvraag,W_Afhandelen leads,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,A_DECLINED,W_Completeren aanvraag',
'count': 126},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,A_DECLINED,W_Completeren aanvraag',
'count': 93},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,A_CANCELLED,W_Completeren aanvraag',
'count': 87},
{'variant': 'A_SUBMITTED,A_PARTLYSUBMITTED,W_Afhandelen leads,W_Afhandelen leads,A_PREACCEPTED,W_Completeren aanvraag,W_Afhandelen leads,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,A_DECLINED,W_Completeren aanvraag',
'count': 74}]

This is very interesting! Out of the 13087 loans in our event log, 3429 (i.e. 26%) follow a single variant. Just 1 variant out of 4366, and when we examine it closely, it is a 3-step variant, most probably for unqualified loans, since the application was declined directly. Maybe by implementing a step to filter out those unqualified applications before the process starts, we could reduce some pressure on our loan application pipeline.
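To quantify how concentrated the log is, you can compute the cumulative share of cases covered by the top variants straight from `variants_count`. The figures below are the counts printed above; 13087 is the total number of cases:

```python
# counts of the top-10 variants, taken from the output above
top_counts = [3429, 1872, 271, 209, 160, 134, 126, 93, 87, 74]
total_cases = 13087

share_top1 = top_counts[0] / total_cases        # share of the single biggest variant
share_top10 = sum(top_counts) / total_cases     # share of the top 10 together

print(f"top 1 variant covers {share_top1:.0%} of cases")    # ~26%
print(f"top 10 variants cover {share_top10:.0%} of cases")  # ~49%
```

So just 10 of the 4366 variants already cover about half the loans, while the remaining variants form a very long tail.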

Let’s see what activities we have in our event log, including their frequencies and considering all cases/loans (without applying any filters):

from pm4py.algo.filtering.log.attributes import attributes_filter
activities = attributes_filter.get_attribute_values(log, "concept:name")
activities

Outputs:
{'A_SUBMITTED': 13087,
'A_PARTLYSUBMITTED': 13087,
'A_PREACCEPTED': 7367,
'W_Completeren aanvraag': 54850,
'A_ACCEPTED': 5113,
'O_SELECTED': 7030,
'A_FINALIZED': 5015,
'O_CREATED': 7030,
'O_SENT': 7030,
'W_Nabellen offertes': 52016,
'O_SENT_BACK': 3454,
'W_Valideren aanvraag': 20809,
'A_REGISTERED': 2246,
'A_APPROVED': 2246,
'O_ACCEPTED': 2243,
'A_ACTIVATED': 2246,
'O_CANCELLED': 3655,
'W_Wijzigen contractgegevens': 12,
'A_DECLINED': 7635,
'A_CANCELLED': 2807,
'W_Afhandelen leads': 16566,
'O_DECLINED': 802,
'W_Nabellen incomplete dossiers': 25190,
'W_Beoordelen fraude': 664}

A few activities stand out:

  • “W_Completeren aanvraag”
  • “W_Nabellen offertes”
  • “W_Nabellen incomplete dossiers”

They have a lot of events, which could indicate some sort of self-loop or rework, or there may be other reasons of course, but clearly we should do something to prevent them from turning into bottlenecks.
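One quick way to check the self-loop suspicion, independent of any mining algorithm, is to count directly-follows pairs where an activity immediately follows itself. A plain-Python sketch on toy traces (in the real log you would build each trace from the events’ `concept:name` values per case):

```python
from collections import Counter

# toy traces; each is the ordered activity sequence of one case
traces = [
    ["A_SUBMITTED", "W_Completeren aanvraag", "W_Completeren aanvraag", "A_DECLINED"],
    ["A_SUBMITTED", "W_Completeren aanvraag", "A_DECLINED"],
]

self_loops = Counter()
for trace in traces:
    for prev, curr in zip(trace, trace[1:]):
        if prev == curr:            # activity directly follows itself
            self_loops[curr] += 1

print(self_loops)  # which activities repeat back-to-back, and how often
```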

Let’s have some fun and start applying a few well-known algorithms.

  1. Alpha Miner: the starting points for the Alpha Miner algorithm are ordering relations (sorted by timestamp, of course). So we consider neither the frequencies nor any other attributes (other features in the event log, such as the resource performing the action, etc.).
net, initial_marking, final_marking = alpha_miner.apply(log)
gviz = pn_visualizer.apply(net, initial_marking, final_marking)
pn_visualizer.view(gviz)



## Adding frequency will make it more informative.

parameters = {pn_visualizer.Variants.FREQUENCY.value.Parameters.FORMAT: "png"}
gviz = pn_visualizer.apply(net, initial_marking, final_marking,
parameters=parameters,
variant=pn_visualizer.Variants.FREQUENCY,
log=log)
pn_visualizer.view(gviz)

2. Inductive Miner: as per the PM4PY documentation, the basic idea of the Inductive Miner is to detect a ‘cut’ in the log (e.g. sequential cut, parallel cut, concurrent cut or loop cut) and then recur on the sublogs found by applying the cut, until a base case is reached. The Directly-Follows variant avoids recursion on the sublogs and uses the Directly-Follows Graph instead.

  • “*” is the loop operator
  • “->” is the sequence operator
  • “X” is the exclusive choice operator
tree = inductive_miner.apply_tree(log)

gviz = pt_visualizer.apply(tree)
pt_visualizer.view(gviz)

The image was too big to show; however, this is how it will look :) We can also convert the process tree produced by the Inductive Miner into a Petri net using the code below:

net, initial_marking, final_marking = pt_converter.apply(tree, 
variant=pt_converter.Variants.TO_PETRI_NET)
gviz = pn_visualizer.apply(net, initial_marking, final_marking)
pn_visualizer.view(gviz)

3. Heuristics Miner: an algorithm that acts on the Directly-Follows Graph. The output of the Heuristics Miner is a Heuristics Net.

heu_net = heuristics_miner.apply_heu(log, 
parameters={heuristics_miner.Variants.CLASSIC.value.Parameters.DEPENDENCY_THRESH: 0.5})

gviz = hn_visualizer.apply(heu_net)
hn_visualizer.view(gviz)

Here we can clearly see the level of inefficiency, or rework, in all our steps. For example, in stage “W_Nabellen offertes” we have 52016 events (i.e. activities), of which 36084 were reprocessing (self-loops). In other words, our loan applications were processed several times within the same stage before moving on to the next one. When we say “processed”, we mean an action took place on the loan application; it could be just a status change, which also qualifies as an action.

4. DFG — Directly-Follows Graph with frequency and time: this is the algorithm most of the commercial tools use. This paper describes the pros and cons of this simple algorithm, so people conducting process mining projects are really advised to have at least a high-level understanding of how these algorithms create the models.

dfg = dfg_discovery.apply(log)


from pm4py.visualization.dfg import visualizer as dfg_visualization

gviz = dfg_visualization.apply(dfg, log=log, variant=dfg_visualization.Variants.FREQUENCY)
dfg_visualization.view(gviz)

The same algorithm, but now showing time instead of frequency on the net:

This is interesting! We pointed out that stage “W_Nabellen offertes” has some inefficiencies and rework; from this graph we can see that on average it takes 1 day for a loan application to pass through this stage, while other stages take only a few minutes on average.
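Under the hood, the performance view boils down to averaging the elapsed time between consecutive events for each directly-follows pair. A minimal sketch of that computation on a single toy case (the timestamps are made up for illustration):

```python
from collections import defaultdict
from datetime import datetime

# toy case: (activity, timestamp) pairs in chronological order
trace = [
    ("A_SUBMITTED",         datetime(2011, 10, 1, 8, 0)),
    ("W_Nabellen offertes", datetime(2011, 10, 1, 8, 5)),
    ("A_DECLINED",          datetime(2011, 10, 2, 8, 5)),
]

# accumulate elapsed seconds per directly-follows pair
gaps = defaultdict(list)
for (a, t1), (b, t2) in zip(trace, trace[1:]):
    gaps[(a, b)].append((t2 - t1).total_seconds())

# average the gaps: this is the "performance" annotation on each DFG edge
mean_gap = {pair: sum(v) / len(v) for pair, v in gaps.items()}
print(mean_gap)
```

Running this over every case in the log, instead of one toy trace, gives the per-edge averages the performance DFG displays.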

We can convert the DFG into a workflow net by doing:

from pm4py.objects.conversion.dfg import converter as dfg_mining
net, im, fm = dfg_mining.apply(dfg)
gviz = pn_visualizer.apply(net, im, fm)
pn_visualizer.view(gviz)

This code was implemented in a Jupyter notebook, and of course you can run it in any IDE of your choosing.

Hope this helps! For more information regarding the algorithms and code used here, I’d recommend checking out the PM4PY documentation, or get in touch and I will be happy to help.
