Title: | Data Quality Assessment for Process-Oriented Data |
---|---|
Description: | Provides a variety of methods to identify data quality issues in process-oriented data, which are useful to verify data quality in a process mining context. Builds on the class for activity logs implemented in the package 'bupaR'. Methods to identify data quality issues either consider each activity log entry independently (e.g. missing values, activity duration outliers,...), or focus on the relation amongst several activity log entries (e.g. batch registrations, violations of the expected activity order,...). |
Authors: | Niels Martin [aut, cre], Greg Van Houdt [ctb], Gert Janssenswillen [ctb] |
Maintainer: | Niels Martin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.1 |
Built: | 2025-02-28 02:45:29 UTC |
Source: | https://github.com/bupaverse/daqapo |
This package is designed to perform data quality assessment on process-oriented data.
Function that detects activity frequency anomalies per case
detect_activity_frequency_violations( activitylog, ..., details, filter_condition )
activitylog |
The activity log |
... |
Named vectors with name of the activity, and value of the threshold. |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
tbl_df providing an overview of cases for which activities are executed too many times
data("hospital_actlog")
detect_activity_frequency_violations(activitylog = hospital_actlog, "Registration" = 1, "Clinical exam" = 1)
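The idea behind this check can be sketched in base R without the package. The snippet below is only an illustration on a hypothetical toy log (the column names `case_id` and `activity` and all values are assumptions, and this is not the package's implementation): it counts executions per case and activity and keeps the pairs that exceed a named threshold.

```r
# Toy activity log (hypothetical data, not the package's hospital_actlog)
log <- data.frame(
  case_id  = c(1, 1, 1, 2, 2, 3),
  activity = c("Registration", "Registration", "Triage",
               "Registration", "Triage", "Registration"),
  stringsAsFactors = FALSE
)

# Named thresholds: maximum number of executions allowed per case
thresholds <- c("Registration" = 1, "Triage" = 1)

# Count executions for every case/activity combination
counts <- as.data.frame(table(case_id = log$case_id, activity = log$activity),
                        stringsAsFactors = FALSE)

# Look up the threshold that applies to each activity
counts$limit <- unname(thresholds[counts$activity])

# Keep the combinations that occur more often than allowed
violations <- counts[!is.na(counts$limit) & counts$Freq > counts$limit, ]
print(violations)
```

In this toy log only case 1 is reported, because "Registration" is recorded twice for it.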
Function detecting violations in activity order. Having additional or less activity types than those specified in activity_order is no violation, but the activity types present should occur in the specified order, and only once.
detect_activity_order_violations(activitylog, activity_order, timestamp, details, filter_condition)
## S3 method for class 'activitylog'
detect_activity_order_violations(activitylog, activity_order, timestamp = c("both", "start", "complete"), details = TRUE, filter_condition = NULL)
activitylog |
The activity log |
activity_order |
Vector expressing the activity order that needs to be checked (using activity names) |
timestamp |
Type of timestamp that needs to be taken into account in the analysis (either "start", "complete" or "both") |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
tbl_df providing an overview of detected activity orders which violate the specified activity order
activitylog: Detect activity order violations in activity log.
data("hospital_actlog")
detect_activity_order_violations(activitylog = hospital_actlog, activity_order = c("Registration", "Triage", "Clinical exam", "Treatment", "Treatment evaluation"))
Function detecting violations of dependencies between attributes (i.e. condition(s) that should hold when (an)other condition(s) hold(s))
detect_attribute_dependencies( activitylog, antecedent, consequent, details = TRUE, filter_condition = NULL, ... )
activitylog |
The activity log |
antecedent |
(Vector of) condition(s) which serve as an antecedent (if the condition(s) in antecedent hold, then the condition(s) in consequent should also hold) |
consequent |
(Vector of) condition(s) which serve as a consequent (if the condition(s) in antecedent hold, then the condition(s) in consequent should also hold) |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
... |
Named vectors with name of the activity, and value of the threshold. |
activitylog containing the rows of the original activity log for which the dependencies between attributes are violated
data("hospital_actlog")
detect_attribute_dependencies(activitylog = hospital_actlog, antecedent = activity == "Registration", consequent = startsWith(originator, "Clerk"))
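The antecedent/consequent logic amounts to an implication: a row violates the dependency when the antecedent holds but the consequent does not. A minimal base R sketch of that logic, on a hypothetical toy log (values are assumptions; this is not the package's implementation):

```r
# Toy activity log (hypothetical values)
log <- data.frame(
  activity   = c("Registration", "Registration", "Triage"),
  originator = c("Clerk 1", "Nurse 2", "Nurse 2"),
  stringsAsFactors = FALSE
)

# Antecedent: the row describes a registration
antecedent <- log$activity == "Registration"

# Consequent: the row was recorded by a clerk
consequent <- startsWith(log$originator, "Clerk")

# A violation is a row where the antecedent holds and the consequent fails
violating_rows <- log[antecedent & !consequent, ]
print(violating_rows)
```

Rows for which the antecedent is false (the "Triage" row here) are never reported, whatever the consequent evaluates to.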
Function detecting gaps in the sequence of case identifiers
detect_case_id_sequence_gaps(activitylog, details, filter_condition)
activitylog |
The activity log |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
data.frame providing an overview of the case identifiers which are expected, but which are not present in the activity log
data("hospital_actlog")
detect_case_id_sequence_gaps(activitylog = hospital_actlog)
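For numeric case identifiers, a gap check of this kind can be approximated by comparing the observed identifiers with the full integer sequence between the smallest and the largest one. The sketch below uses hypothetical identifiers and assumes that identifiers increase in steps of 1; it is not the package's implementation.

```r
# Case identifiers observed in a log (hypothetical values)
case_ids <- c(510, 511, 512, 515, 516, 520)

# Identifiers expected if the sequence had no gaps
expected <- seq(min(case_ids), max(case_ids))

# Expected identifiers that never occur in the log
missing_ids <- setdiff(expected, case_ids)
print(missing_ids)
```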
Function detecting violations of conditional activity presence (i.e. an activity/activities that should be present when (a) particular condition(s) hold(s))
detect_conditional_activity_presence( activitylog, condition, activities, details, filter_condition )
activitylog |
The activity log |
condition |
Condition which serves as an antecedent (if the condition in condition holds, then the activity/activities in activities should be present) |
activities |
Vector of activity/activities which serve as a consequent (if the condition in condition holds, then the activity/activities in activities should be recorded) |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
Numeric vector containing the case identifiers of cases for which the specified conditional activity presence is violated
data("hospital_actlog")
detect_conditional_activity_presence(activitylog = hospital_actlog, condition = specialization == "TRAU", activities = "Clinical exam")
Function detecting duration outliers for a particular activity
detect_duration_outliers(activitylog, ..., details, filter_condition)
activitylog |
The activity log |
... |
for each activity to be checked, an argument "activity_name" = duration_within(...) to define bounds. See ?duration_within |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
activitylog containing the rows of the original activity log for which activity duration outliers are detected. Information on the presence of activity duration outliers.
data("hospital_actlog")
detect_duration_outliers(activitylog = hospital_actlog, Treatment = duration_within(bound_sd = 1))
Function detecting inactive periods, i.e. periods of time in which no activity executions/arrivals are recorded in the activity log
detect_inactive_periods( activitylog, threshold, type, timestamp, start_activities, details, filter_condition )
activitylog |
The activity log |
threshold |
Threshold after which a period without activity executions/arrivals is considered as an inactive period (expressed in minutes) |
type |
Type of inactive periods you want to detect. "arrivals" will look for periods without new cases arriving. "activities" will look for periods where no activities occur. |
timestamp |
Type of timestamp that needs to be taken into account in the analysis (either "start", "complete" or "both") |
start_activities |
List of activity labels marking the first activity in the process. When specified, an inactive period only occurs when the time between two consecutive arrivals exceeds the specified threshold (arrival is proxied by the activity/activities specified in this argument). |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
tbl_df providing an overview of the start and end of the inactive periods that have been detected, together with the length of the inactive period
data("hospital_actlog")
detect_inactive_periods(activitylog = hospital_actlog, threshold = 30)
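Conceptually, an inactive period is a gap between two consecutive timestamps that exceeds the threshold. The following base R sketch illustrates this on hypothetical start timestamps; it is a simplification that ignores the type, timestamp and start_activities options, and it is not the package's implementation.

```r
# Start timestamps of recorded activity instances (hypothetical values)
starts <- as.POSIXct(c("2017-11-20 10:00:00", "2017-11-20 10:10:00",
                       "2017-11-20 11:00:00", "2017-11-20 11:20:00"),
                     tz = "UTC")
threshold <- 30  # minutes

# Order the timestamps and compute the gap to the next one, in minutes
starts <- sort(starts)
from <- starts[-length(starts)]
to   <- starts[-1]
gaps <- as.numeric(difftime(to, from, units = "mins"))

# Keep only the gaps that exceed the threshold
is_inactive <- gaps > threshold
inactive <- data.frame(
  period_start = from[is_inactive],
  period_end   = to[is_inactive],
  minutes      = gaps[is_inactive]
)
print(inactive)
```

With a 30-minute threshold, the only period reported in this toy data is the 50-minute gap between 10:10 and 11:00.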
Function detecting incomplete cases in terms of the activities that need to be recorded for a case. The function only checks the presence of activities, not the completeness of the rows describing the activity executions.
detect_incomplete_cases(activitylog, activities, details, filter_condition)
activitylog |
The activity log |
activities |
A vector of activity names which should be present for a case |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
tbl_df providing an overview of the traces (i.e. the activities executed for a particular case) in which the specified activities are not present, together with its occurrence frequency and cases having this trace
data("hospital_actlog")
detect_incomplete_cases(activitylog = hospital_actlog, activities = c("Registration", "Triage", "Clinical exam", "Treatment", "Treatment evaluation"))
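The presence check can be pictured as a set difference per case: the required activities minus the activities actually recorded. A minimal base R sketch on a hypothetical toy log (column names and values are assumptions; the package reports traces rather than this simple list):

```r
# Toy activity log (hypothetical data)
log <- data.frame(
  case_id  = c(1, 1, 2, 2, 2, 3),
  activity = c("Registration", "Triage",
               "Registration", "Triage", "Treatment",
               "Triage"),
  stringsAsFactors = FALSE
)

# Activities that should be recorded for every case
required <- c("Registration", "Triage", "Treatment")

# For each case, list the required activities that are absent
missing_by_case <- lapply(split(log$activity, log$case_id),
                          function(acts) setdiff(required, acts))

# Keep only the cases that miss at least one required activity
incomplete <- missing_by_case[lengths(missing_by_case) > 0]
print(incomplete)
```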
Function returning the incorrect activity labels in the log as indicated by the user. If details are requested, the entire activity log's rows containing incorrect activities are returned.
detect_incorrect_activity_names( activitylog, allowed_activities, details, filter_condition )
activitylog |
The activity log |
allowed_activities |
Vector with correct activity labels. If NULL, user input will be asked. |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
activitylog containing the rows of the original activity log having incorrect activity labels
data("hospital_actlog")
detect_incorrect_activity_names(activitylog = hospital_actlog, allowed_activities = c("Registration", "Triage", "Clinical exam", "Treatment", "Treatment evaluation"))
Function detecting missing values at different levels of aggregation
overview: presents an overview of the absolute and relative number of missing values for each column
column: presents an overview of the absolute and relative number of missing values for a particular column
activity: presents an overview of the absolute and relative number of missing values for each column, aggregated by activity
detect_missing_values( activitylog, level_of_aggregation, column, details, filter_condition )
activitylog |
The activity log |
level_of_aggregation |
Level of aggregation at which missing values are identified (either "overview", "column" or "activity") |
column |
Column name of the column that needs to be analyzed when the level of aggregation is "column" |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
activitylog containing the rows of the original activity log which contain a missing value
data("hospital_actlog")
detect_missing_values(activitylog = hospital_actlog)
detect_missing_values(activitylog = hospital_actlog, level_of_aggregation = "activity")
detect_missing_values(activitylog = hospital_actlog, level_of_aggregation = "column", column = "triagecode")
Function detecting multi-registration for the same case or by the same resource at the same point in time
detect_multiregistration( activitylog, level_of_aggregation, timestamp, threshold_in_seconds, details, filter_condition )
activitylog |
The activity log (renamed/formatted using functions rename_activity_log and convert_timestamp_format) |
level_of_aggregation |
Level of aggregation at which multi-registration should be detected (either "resource" or "case") |
timestamp |
Type of timestamp that needs to be taken into account in the analysis (either "start", "complete" or "both") |
threshold_in_seconds |
Threshold which is applied to determine whether multi-registration occurs (expressed in seconds) (time gaps smaller than threshold are considered as multi-registration) |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
activitylog containing the rows of the original activity log for which multi-registration is present
data("hospital_actlog")
detect_multiregistration(activitylog = hospital_actlog, threshold_in_seconds = 10)
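At the resource level, multi-registration can be thought of as registrations by the same resource that lie closer together in time than the threshold. The base R sketch below illustrates this on a hypothetical toy log (the column names `originator` and `complete_ts` and all values are assumptions); it only looks at one timestamp type and is not the package's implementation.

```r
# Toy log with completion timestamps per resource (hypothetical values)
log <- data.frame(
  originator  = c("Nurse 1", "Nurse 1", "Nurse 1", "Clerk 2"),
  complete_ts = as.POSIXct(c("2017-11-20 10:00:00", "2017-11-20 10:00:04",
                             "2017-11-20 10:30:00", "2017-11-20 10:00:02"),
                           tz = "UTC"),
  stringsAsFactors = FALSE
)
threshold_in_seconds <- 10

# Sort by resource and time so that neighbouring rows are consecutive registrations
log <- log[order(log$originator, log$complete_ts), ]
secs <- as.numeric(log$complete_ts)

# Time gap (in seconds) to the previous and next registration of the same resource
gap_prev <- ave(secs, log$originator, FUN = function(x) c(NA, diff(x)))
gap_next <- ave(secs, log$originator, FUN = function(x) c(diff(x), NA))

# A row is part of a multi-registration if either neighbour is closer than the threshold
close_to_neighbour <- (!is.na(gap_prev) & gap_prev < threshold_in_seconds) |
  (!is.na(gap_next) & gap_next < threshold_in_seconds)
multi <- log[close_to_neighbour, ]
print(multi)
```

The two "Nurse 1" rows recorded 4 seconds apart are flagged; the "Clerk 2" row is not, because gaps are only computed within the same resource.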
Detect overlapping activity instances
detect_overlaps(activitylog, details, level_of_aggregation, filter_condition)
activitylog |
The activity log |
details |
Boolean indicating whether details of the results need to be shown |
level_of_aggregation |
Look for overlapping activity instances within a case or within a resource. |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
tbl_df providing an overview of activities which are performed in parallel by a resource, together with the occurrence frequency of the overlap and the average time overlap in minutes
data("hospital_actlog")
detect_overlaps(activitylog = hospital_actlog)
Function that tries to detect spelling mistakes in a given activity log column
detect_similar_labels( activitylog, column_labels, max_edit_distance = 3, show_NA = FALSE, ignore_capitals = FALSE, filter_condition = NULL )
activitylog |
The activity log |
column_labels |
The name of the column(s) in which to search for spelling mistakes |
max_edit_distance |
The maximum number of insertions, deletions and substitutions that are allowed to be executed in order for two strings to be considered similar. |
show_NA |
A boolean indicating if labels that do not show similarities with others should be shown in the output |
ignore_capitals |
A boolean indicating if capitalization should be included or excluded when calculating the edit distance between two strings |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
tbl_df providing an overview of similar labels for the indicated column
data("hospital_actlog")
detect_similar_labels(activitylog = hospital_actlog, column_labels = "activity", max_edit_distance = 3)
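The notion of edit distance used by max_edit_distance can be illustrated with base R's `adist()`, which computes the generalized Levenshtein distance (insertions, deletions and substitutions). This is a sketch on hypothetical labels; whether the package uses `adist()` internally is not stated here, so treat that as an assumption.

```r
# Distinct labels found in a column (hypothetical values with two typos)
labels <- c("Registration", "Registratoin", "Triage", "Trage", "Treatment")
max_edit_distance <- 3

# Pairwise Levenshtein distances between all labels
d <- adist(labels)
dimnames(d) <- list(labels, labels)

# Keep each unordered pair once (upper triangle) if it is within the allowed distance
pairs <- which(d > 0 & d <= max_edit_distance & upper.tri(d), arr.ind = TRUE)
similar <- data.frame(
  label      = labels[pairs[, "row"]],
  similar_to = labels[pairs[, "col"]],
  distance   = d[pairs],
  stringsAsFactors = FALSE
)
print(similar)
```

Passing `ignore.case = TRUE` to `adist()` would mimic the effect of ignoring capitalization.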
Function detecting time anomalies, which can refer to activities with negative or zero duration
detect_time_anomalies( activitylog, anomaly_type = c("both", "negative", "zero"), details = TRUE, filter_condition = NULL )
activitylog |
The activity log |
anomaly_type |
Type of anomalies that need to be detected (either "negative", "zero" or "both") |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
activitylog containing the rows of the original activity log for which a negative or zero duration is detected, together with the duration value and whether it constitutes a zero or negative duration
data("hospital_actlog")
detect_time_anomalies(activitylog = hospital_actlog)
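A time anomaly in this sense follows directly from the difference between the complete and start timestamps. The base R sketch below classifies durations on a hypothetical toy log (the column names `start_ts` and `complete_ts`, the values, and the labels in `type` are assumptions; this is not the package's implementation).

```r
# Toy activity log with start and complete timestamps (hypothetical values)
log <- data.frame(
  activity    = c("Triage", "Treatment", "Clinical exam"),
  start_ts    = as.POSIXct(c("2017-11-20 10:00:00", "2017-11-20 10:30:00",
                             "2017-11-20 11:00:00"), tz = "UTC"),
  complete_ts = as.POSIXct(c("2017-11-20 10:05:00", "2017-11-20 10:30:00",
                             "2017-11-20 10:50:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Duration in minutes; negative when completion precedes the start
log$duration <- as.numeric(difftime(log$complete_ts, log$start_ts, units = "mins"))

# Label zero and negative durations, leave regular rows as NA
log$type <- ifelse(log$duration < 0, "negative duration",
                   ifelse(log$duration == 0, "zero duration", NA))

# Rows with an anomaly
anomalies <- log[!is.na(log$type), ]
print(anomalies)
```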
Function that lists all distinct combinations of the given columns in the activity log
detect_unique_values(activitylog, column_labels, filter_condition = NULL)
activitylog |
The activity log |
column_labels |
The names of columns in the activity log for which you want to show the different combinations found in the log. If only one column is provided, this results in a list of unique values in that column. |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
activitylog containing the unique (distinct) values (combinations) in the indicated column(s)
data("hospital_actlog")
detect_unique_values(activitylog = hospital_actlog, column_labels = "activity")
detect_unique_values(activitylog = hospital_actlog, column_labels = c("activity", "originator"))
Function detecting violations of the value range, i.e. values outside the range of tolerable values
detect_value_range_violations(activitylog, ..., details, filter_condition)
activitylog |
The activity log |
... |
Define domain range using domain_numeric, domain_categorical and/or domain_time for each column |
details |
Boolean indicating whether details of the results need to be shown |
filter_condition |
Condition that is used to extract a subset of the activity log prior to the application of the function |
activitylog containing the rows of the original activity log for which the provided value range is violated
domain_categorical, domain_time, domain_numeric
data("hospital_actlog")
detect_value_range_violations(activitylog = hospital_actlog, triagecode = domain_numeric(from = 0, to = 5))
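For a numeric column, a range check like the one in the example reduces to comparing each value with the lower and upper limit. The base R sketch below uses hypothetical triage codes and assumes that both limits are inclusive and that missing values are not reported as violations; how the package treats these cases is not specified here.

```r
# Values of a numeric column (hypothetical triage codes)
triagecode <- c(3, 0, 7, NA, 5, -1)

# Tolerable range, mirroring domain_numeric(from = 0, to = 5)
from <- 0
to   <- 5

# Flag values outside the range; NA is not flagged (an assumption of this sketch)
violates <- !is.na(triagecode) & (triagecode < from | triagecode > to)

# Positions of the violating values
violating_rows <- which(violates)
print(violating_rows)
```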
Define allowable range of values
domain_categorical(allowed)
allowed |
Allowed values of categorical column (character or factor) |
No return value, called for side effects
Define allowable range of values
domain_numeric(from, to)
from |
Minimum of allowed range |
to |
Maximum of allowed range |
No return value, called for side effects
Define allowable time range
domain_time(from, to, format = ymd_hms)
from |
Start time interval |
to |
End time interval |
format |
Format of to and from (either ymd_hms, dmy_hms, ymd_hm, ymd, dmy, ...). Both from and to should have the same format. |
No return value, called for side effects
Function to define bounds on the duration of an activity during detection of duration outliers.
duration_within(bound_sd = 3, lower_bound = NA, upper_bound = NA)
bound_sd |
Number of standard deviations from the mean duration which is used to define an outlier in the absence of lower_bound and upper_bound (default value of 3 is used) |
lower_bound |
Lower bound for activity duration used during outlier detection (expressed in minutes). This means disregarding the sd and bound_sd for lower bound |
upper_bound |
Upper bound for activity duration used during outlier detection (expressed in minutes). This means disregarding the sd and bound_sd for upper bound |
No return value, called for side effects
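How the three arguments interact can be sketched in base R: by default the bounds are the mean duration plus or minus bound_sd standard deviations, and an explicit lower_bound or upper_bound replaces the corresponding computed bound. The snippet uses hypothetical durations in minutes and is an illustration of the rule described above, not the package's implementation.

```r
# Durations of one activity in minutes (hypothetical values)
durations <- c(10, 12, 11, 9, 13, 60)

# Settings mirroring duration_within(bound_sd = 1)
bound_sd    <- 1
lower_bound <- NA
upper_bound <- NA

# Use an explicit bound when given, otherwise mean -/+ bound_sd standard deviations
lower <- if (is.na(lower_bound)) mean(durations) - bound_sd * sd(durations) else lower_bound
upper <- if (is.na(upper_bound)) mean(durations) + bound_sd * sd(durations) else upper_bound

# Durations outside the bounds are treated as outliers
outliers <- durations[durations < lower | durations > upper]
print(outliers)
```

Note that a single extreme value inflates the standard deviation: with the default bound_sd = 3 the value 60 would fall inside the bounds for this toy data, which is one reason to set explicit bounds when plausible limits are known.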
Function that filters detected anomalies from the activity log
filter_anomalies(activity_log, anomaly_log)
activity_log |
The activity log (renamed/formatted using functions rename_activity_log and convert_timestamp_format) |
anomaly_log |
The anomaly log generated from the different DAQAPO tests |
activitylog in which the anomaly rows are filtered out
Fix problems
fix(detected_problems, ...)
detected_problems |
Output of a detect_ function. Currently supported: detect_resource_inconsistencies. |
... |
Additional parameters, depending on the type of anomalies to fix. |
No return value, called for side effects
A dataset containing the logged activities in an illustrative hospital process. 20 patients are described in the log. Process activities include Registration, Triage, Clinical exam, Treatment and Treatment evaluation.
hospital
A data frame with 53 rows and 7 variables:
the patient's identifier
the executed activity
the resource performing the activity execution
the timestamp at which the activity was started
the timestamp at which the activity was completed
a case attribute describing the triage code
a case attribute describing the specialization
An illustrative example developed in-house for demonstrational purposes.
A dataset containing the logged activities in an illustrative hospital process. 20 patients are described in the log. Process activities include Registration, Triage, Clinical exam, Treatment and Treatment evaluation.
hospital_actlog
An activity log with 53 rows and 7 variables:
the patient's identifier
the executed activity
the resource performing the activity execution
the timestamp at which the activity was started
the timestamp at which the activity was completed
a case attribute describing the triage code
a case attribute describing the specialization
An illustrative example developed in-house for demonstrational purposes.
A dataset containing the logged activities in an illustrative hospital process. 20 patients are described in this log. Process activities include Registration, Triage, Clinical exam, Treatment and Treatment evaluation.
hospital_events
A data frame with 53 rows and 7 variables:
the patient's identifier
the executed activity
the resource performing the activity execution
the state the activity is in at the given timestamp
the moment in time the lifecycle state was reached
a case attribute describing the triage code
a case attribute describing the specialization
a specification of which events form a pair in the log
An illustrative example developed in-house for demonstrational purposes.