The goal of the project was to verify whether "anonymized" objects can be classified by an interpretable method, and how well. The dataset contains 28 anonymized attributes plus the transaction amount. It is not known what these attributes represent, but they are assumed to be sufficient to determine the class of a transaction (fraudulent or not).
CSV field types:
- id – integer – the unique number of the data row in the set
- V1…V28 – floating point – transformed (anonymized) transaction attributes
- Amount – decimal – transaction amount
- Class – integer (0 or 1) – whether the transaction belongs to the fraudulent class
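For illustration, a row of this CSV can be represented by a simple record type. The field names mirror the columns above; the exact types and structure used in the project's code may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transaction:
    id: int            # unique row number
    v: List[float]     # V1..V28, the anonymized attributes
    amount: float      # transaction amount
    is_fraud: bool     # Class: 1 = fraudulent, 0 = regular
```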
Interval pattern structures were used to solve the problem, since all attributes are floating-point numbers. Building a fixed number of categories per attribute, with the split into categories driven by the distribution of the attribute values, was also considered; however, such a categorization would reflect the representativeness of the sample rather than a property of the transaction itself, so this path was not implemented.
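For reference, in an interval pattern structure the description of an object (or a set of objects) is a vector of intervals, one per attribute, and the similarity of two descriptions is taken componentwise. A minimal sketch of this operation is shown below; the function names are illustrative, not taken from the project's code.

```python
from typing import List, Tuple

Interval = Tuple[float, float]          # [low, high] for one attribute

def meet(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Componentwise similarity of two interval descriptions:
    the smallest interval vector covering both arguments."""
    return [(min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]

def from_point(values: List[float]) -> List[Interval]:
    """A single transaction is described by degenerate intervals [v, v]."""
    return [(v, v) for v in values]
```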
To test the classifier, a few rows are drawn from the input file with a small probability and set aside as the test set.
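A minimal sketch of such a split, assuming the selection probability is a parameter (the default value below is illustrative, not the one used in the project):

```python
import csv
import random

def load_and_split(path: str, test_prob: float = 0.0001):
    """Read the CSV and move each row into the test set with a small probability."""
    train, test = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            t = Transaction(
                id=int(row["id"]),
                v=[float(row[f"V{i}"]) for i in range(1, 29)],
                amount=float(row["Amount"]),
                is_fraud=row["Class"] == "1",
            )
            (test if random.random() < test_prob else train).append(t)
    return train, test
```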
The classification algorithm is as follows (a sketch of these steps follows the list):
- For the incoming element, a set of interval hypotheses is constructed together with the "training" set.
- For each interval in this set, its similarity to every other interval in the set is checked.
- If the intervals are similar in all attributes, the positive or negative counter of the interval under consideration is updated depending on whether the target attributes (fraudulent / not fraudulent) match: if the target attribute is the same, the positive counter is incremented, otherwise the negative one.
- Once the counters have been filled in for all hypothesis intervals, the "bad" hypotheses are screened out. A hypothesis is considered bad if the ratio min(positive, negative) / max(positive, negative) exceeds 1%; this is enough to weed out the "fuzzy" hypotheses.
- Finally, the number of intersections with the remaining positive hypotheses and the number of intersections with the remaining negative hypotheses are calculated.
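A minimal sketch of the counting and screening steps, assuming a hypothesis is just an interval description paired with a target class and that "similar in all attributes" means one description covers the other (both assumptions are mine, not stated in the project):

```python
def score_hypotheses(hypotheses, threshold=0.01):
    """hypotheses: list of (intervals, is_fraud) pairs.
    Returns only the hypotheses that survive the 1% screening rule."""
    kept = []
    for i, (h, h_fraud) in enumerate(hypotheses):
        pos = neg = 0
        for j, (g, g_fraud) in enumerate(hypotheses):
            if i == j:
                continue
            # Assumed similarity test: every interval of g lies inside the
            # corresponding interval of h.
            if all(hl <= gl and gh <= hh for (hl, hh), (gl, gh) in zip(h, g)):
                if h_fraud == g_fraud:
                    pos += 1
                else:
                    neg += 1
        # Keep the hypothesis unless min/max of its counters exceeds the threshold.
        if max(pos, neg) == 0 or min(pos, neg) / max(pos, neg) <= threshold:
            kept.append((h, h_fraud))
    return kept
```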
A transaction is classified as fraudulent if:

a / c > b / d

where:
- a is the number of intersections with hypotheses about reliably fraudulent transactions,
- c is the total number of hypotheses about reliably fraudulent transactions,
- b is the number of intersections with hypotheses about ordinary transactions,
- d is the total number of hypotheses about ordinary transactions.
The confidence of the verdict, as a percentage, is calculated as follows (the notation is the same):

1 - min(a / c, b / d) / max(a / c, b / d)
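Put together, the verdict and its confidence can be computed from the four counters. A small sketch (function and parameter names are illustrative):

```python
def classify(fraud_hits, fraud_total, regular_hits, regular_total):
    """Verdict and confidence from the a, b, c, d counters defined above."""
    if fraud_total == 0 or regular_total == 0:
        return False, 0.0                      # degenerate case: nothing to compare
    a_over_c = fraud_hits / fraud_total        # a / c
    b_over_d = regular_hits / regular_total    # b / d
    verdict = a_over_c > b_over_d              # fraudulent if a/c exceeds b/d
    top, bottom = min(a_over_c, b_over_d), max(a_over_c, b_over_d)
    confidence = (1 - top / bottom) * 100 if bottom else 0.0
    return verdict, confidence
```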
As part of the test runs, 10 classifications were performed: 5 on positive examples, 5 on negative ones.
The following distribution was obtained:
No. | Reliably fraudulent | Intersections with fraudulent hypotheses (a) | Intersections with regular hypotheses (b) | Total fraudulent hypotheses (c) | Total regular hypotheses (d) |
---|---|---|---|---|---|
1 | no | 284159 | 284177 | 284159 | 284190 |
2 | no | 284159 | 284190 | 284159 | 284190 |
3 | no | 96555 | 284172 | 284159 | 284190 |
4 | no | 266939 | 284185 | 284159 | 284190 |
5 | no | 284159 | 284189 | 284159 | 284190 |
6 | yes | 284159 | 284182 | 284159 | 284190 |
7 | yes | 284159 | 284190 | 284159 | 284190 |
8 | yes | 284159 | 284183 | 284159 | 284190 |
9 | yes | 284137 | 284190 | 284159 | 284190 |
10 | yes | 284149 | 254758 | 284159 | 284190 |
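As a check of the formulas on example 3: a / c = 96555 / 284159 ≈ 0.3398 and b / d = 284172 / 284190 ≈ 0.9999, so a / c < b / d, the verdict is "not fraudulent", and the confidence is 1 - 0.3398 / 0.9999 ≈ 66.02%.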
Applying the formulas above to all ten examples gives the following verdicts:
No. | Reliably fraudulent | Classifier verdict | Classifier confidence | Classifier correctness |
---|---|---|---|---|
1 | no | yes | 0% | unclassified |
2 | no | no | 0% | unclassified |
3 | no | no | 66.02% | right |
4 | no | no | 6.06% | right |
5 | no | yes | 0% | unclassified |
6 | yes | yes | 0% | unclassified |
7 | yes | no | 0% | unclassified |
8 | yes | yes | 0% | unclassified |
9 | yes | no | 0.01% | unclassified |
10 | yes | yes | 10.35% | right |
If we assume that the classifier counts as having classified an example only when its confidence exceeds 1%, then it never made a mistake, but at the same time it was unable to classify 70% of the data.
This is most likely due to peculiarities of the dataset (the way the data were "anonymized" during its preparation), since the results were better on a closed dataset, for which the classifier was originally developed.
When the classifier was still single-threaded, it looked like this (in the commit it is multithreaded and more complicated).
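The original single-threaded source is not reproduced here; the following is a minimal sketch that ties together the helper sketches above (Transaction, from_point, meet, score_hypotheses, classify), under the assumptions already noted. It is an illustration, not the code from the commit.

```python
def classify_transaction(x_values, train, threshold=0.01):
    """Single-threaded sketch of the whole pipeline for one transaction.
    x_values: attribute values of the element to classify;
    train: list of Transaction records. Returns (verdict, confidence_percent)."""
    x_desc = from_point(x_values)

    # One hypothesis interval per training element, keeping its target class.
    hypotheses = [(meet(x_desc, from_point(t.v)), t.is_fraud) for t in train]

    # Screen out the "fuzzy" hypotheses with the 1% rule.
    kept = score_hypotheses(hypotheses, threshold)

    # Totals c and d.
    fraud_total = sum(1 for _, is_fraud in kept if is_fraud)
    regular_total = len(kept) - fraud_total

    # Intersections a and b: hypotheses whose intervals cover the element's description.
    def covers(h):
        return all(hl <= xl and xh <= hh for (hl, hh), (xl, xh) in zip(h, x_desc))

    fraud_hits = sum(1 for h, is_fraud in kept if is_fraud and covers(h))
    regular_hits = sum(1 for h, is_fraud in kept if not is_fraud and covers(h))

    return classify(fraud_hits, fraud_total, regular_hits, regular_total)
```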