Have you ever received a call or message from your bank asking whether you had made purchases of products you do not usually buy, or at merchants you had never even heard of?
Hopefully you have never been through this situation. But how would the bank identify that the transactions are suspicious and possibly fraudulent? Banks often use algorithms that analyze customer transaction patterns and classify each transaction as normal or potentially anomalous. One technique for identifying these abnormal behaviors is called anomaly detection.
Anomaly detection is the name given to the task of identifying rare occurrences in a dataset. D. Hawkins defined an anomaly as follows: “an observation which deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism”.
In the literature, the term anomaly can be found under different names, for example, outlier, outbreak, event, alteration, fraud, discordant observation, exception, aberration, surprise, peculiarity, etc. Several areas of computer science research address problems related to anomaly detection, using different techniques such as data mining, machine learning and artificial intelligence. Some more recent work also applies deep learning, an approach known as Deep Anomaly Detection (DAD).
All studies in these areas are based on data mining and/or algorithm training so that normal data can be distinguished from anomalies. The technique used depends not only on the type of problem but also on the nature of the data used.
In addition to presenting numerous computing challenges, anomaly detection can also be applied to problem-solving in various business areas such as utilities, payment, healthcare, and industry. Therefore, there is great interest in developing algorithms capable of performing this task.
Types of anomaly detection
In order to carry out anomaly detection, the system must be able to distinguish normal from abnormal data.
Thus, like any other algorithm involving learning mechanisms, we can divide anomaly detection techniques into three types:
- Supervised training: assumes the availability of a training dataset in which each record is already labeled as normal or anomalous. The labels are given, and the algorithm learns to reproduce that categorization on new data;
- Semi-supervised training: uses a small set of data already labeled as normal or anomalous together with a larger set of unlabeled data. The main idea is to exploit similarities between labeled and unlabeled records and place similar data in the same groups;
- Unsupervised training: requires no previously labeled data; intrinsic characteristics of the data are used to separate normal records from anomalies.
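As a concrete illustration of the unsupervised case, here is a minimal sketch (the function name and the sample amounts are invented for illustration) that flags values far from the rest using the median absolute deviation, with no labels required:

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    Uses the median absolute deviation (MAD), which, unlike the
    standard deviation, is not inflated by the outliers themselves.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:  # at least half the values sit exactly at the median
        return []
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]

# Mostly ordinary purchase amounts plus one extreme charge.
amounts = [25.0, 30.0, 28.0, 32.0, 27.0, 29.0, 31.0, 26.0, 950.0]
print(mad_outliers(amounts))  # [950.0]
```

The robust statistic matters here: with a plain mean and standard deviation, the 950.0 charge would inflate both and could mask itself.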
In the literature we find more studies using unsupervised techniques, or semi-supervised techniques based only on normal data, than supervised approaches or semi-supervised techniques trained on anomalous examples. This is precisely because labeling training data is difficult and because anomalies are, by definition, rare events.
When it comes to data manipulation and analysis, we always face the difficulties of the 3 Vs: Volume, Velocity and Variety (or Diversity). That is, the volume of information is very large, the data often needs to be consumed in near real time, and it arrives in a wide variety of formats.
As anomalies are rare, the amount of data reflecting normal behavior is usually much larger than the amount with anomalous behavior. This imbalance between normal and abnormal data can impair the training of machine learning algorithms.
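A small sketch of why this imbalance is dangerous (the label counts below are invented for illustration): a model that simply predicts "normal" for everything looks excellent on accuracy while catching none of the anomalies:

```python
# 990 normal (0) and 10 anomalous (1) records -- a typical imbalance.
labels = [0] * 990 + [1] * 10

# A "classifier" that always predicts normal still scores 99% accuracy...
predictions = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# ...while its recall on the anomaly class is zero.
recall = sum(p == y == 1 for p, y in zip(predictions, labels)) / labels.count(1)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

This is why evaluation on imbalanced data should look at per-class metrics such as recall or precision on the anomaly class, not overall accuracy.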
Most of the time, labeling training data is manual work: a person decides, record by record, whether each data point is an anomaly or normal. Depending on the volume, this becomes an unfeasible task. To make matters worse, the quality of the labels depends on the experience and knowledge of the person assigning them.
Nobel laureate Daniel Kahneman said, “Humans are incorrigibly inconsistent in summary judgments of complex information. Surprisingly, people often provide different answers when requested to evaluate the same information twice. For example, experienced radiologists who evaluate chest radiographs as normal and abnormal contradict each other 20% of the time when they see the same image on separate occasions.”
In addition, anomalous behavior is often dynamic in nature. New types of anomalies may arise for which there is no training data. And when the anomaly is malicious, as in a cyberattack, the attacker actively tries to make it look like normal data.
Thus, defining a boundary between normal data and an anomaly is not a simple task either. Anomalous data may have characteristics very close to those of data considered normal. Moreover, what is defined as an anomaly today will not necessarily be one in the future, so classifying anomalies is a task of constant redefinition.
There are a multitude of other applications and areas where anomaly detection techniques can be used, such as fraud detection in payment methods (credit cards) and in auto insurance; identification of financial inconsistencies in institutions such as hospitals; and even detection of cyberattacks.
Intrusion detection refers to the identification of malicious activity, such as cyberattacks, in a computer system. An intrusion differs from normal system behavior, so anomaly detection techniques apply naturally to this problem. Our Utilities Head, Frederico Gonçalves, addressed the topic of cyberattacks in the electricity sector in his recent article.
In healthcare, anomaly detection lets us find financial inconsistencies in medical procedures, because each procedure has a usual set of materials and costs. Thus, if the same procedure shows very discrepant materials or values, the system can flag it for analysis and check whether an error has occurred. In industry, anomaly detection can be used to detect failures in the production process, supporting process planning and machine maintenance, among other uses.
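The healthcare case can be sketched as a simple rule: compare each charge against the typical value for its procedure. The procedure codes, amounts and threshold below are invented for illustration:

```python
import statistics

# Hypothetical billed amounts grouped by procedure code.
billing = {
    "X-RAY": [120, 125, 118, 122, 119, 121, 480],
    "MRI":   [900, 920, 910, 905, 915],
}

def flag_discrepant(records, factor=2.0):
    """Flag charges more than `factor` times the median for their procedure."""
    flagged = {}
    for procedure, amounts in records.items():
        typical = statistics.median(amounts)
        flagged[procedure] = [a for a in amounts if a > factor * typical]
    return flagged

print(flag_discrepant(billing))  # {'X-RAY': [480], 'MRI': []}
```

The key point is that "discrepant" is always relative to the procedure's own baseline: 480 is unremarkable for an MRI but clearly anomalous for an X-ray.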
As stated earlier, detecting anomalies is not always simple. Large volumes of data need to be checked quickly, and the anomalies are not always clear-cut: an impossible task if done manually. However, anomaly detection does not have to be done entirely by algorithms or entirely by people; several hybrid possibilities are already being explored in the area.
For example, correctly classifying anomalous behavior can involve considerable uncertainty in some cases. To handle such situations, we can build interactive solutions in which the records most likely to be anomalies are highlighted and then reviewed by people, who confirm whether or not they are anomalies. The confirmed labels feed back into the algorithm, so it can be continuously retrained.
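A minimal sketch of that interactive loop (the scores and verdicts below are hypothetical): rank records by the detector's anomaly score and queue only the top few for human confirmation:

```python
def top_k_for_review(scores, k=3):
    """Return the indices of the k highest-scoring records, so a human
    reviewer only needs to confirm the most suspicious cases."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Hypothetical anomaly scores from some detector (higher = more suspicious).
scores = [0.05, 0.91, 0.12, 0.08, 0.77, 0.10, 0.64]
review_queue = top_k_for_review(scores)
print(review_queue)  # [1, 4, 6]

# The reviewer's verdicts then become labels for retraining the detector.
confirmed = {1: True, 4: False, 6: True}  # hypothetical human decisions
```

This keeps the human workload bounded at k reviews per batch while the algorithm handles the bulk of the data.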
Thus, when the challenges of anomaly detection are combined with the wide range of possibilities (and needs) for its application across different segments, the importance of developing the area becomes evident. Exploring ways of training artificial intelligence for anomaly detection is therefore more crucial than ever.