A two-part course presented by Pavlo Mozharovskyi in a hybrid format.
Anomaly detection (Chandola et al., 2009) is a branch of machine learning that aims to identify observations exhibiting abnormal behavior. Whether these are measurement errors, disease development, severe weather, production quality defects, failed equipment, financial fraud or crisis events, their timely identification, isolation and explanation constitute an important task in almost any branch of industry and science. When the data are presented as a table containing properties of individuals (the typical structure of a database), multivariate anomaly detection methods (Rousseeuw & Hubert, 2018) should be employed. If the data are functions of an argument, e.g., time (such as time series), either projection onto a multivariate sub-basis or functional anomaly detection methods (Hubert et al., 2015) can be used.
Both multivariate and functional anomaly detection follow the same general steps: first, observations are ordered with respect to their normality/outlyingness; then, an application-specific threshold is chosen to distinguish abnormal observations. Defining an appropriate ordering is thus the main task of anomaly detection methods. This ordering can of course be a direct extension of the probability density (Breunig et al., 2000; Polonik, 1997), but such an approach quickly suffers from the curse of dimensionality, which is the rule rather than the exception in contemporary data analysis. For this reason, non-parametric ordering methods (Schölkopf et al., 2001; Liu et al., 2008), and in particular the notion of data depth (Zuo & Serfling, 2000; Mosler, 2013), have recently attracted increasing attention.
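The two steps described above (ordering, then thresholding) can be sketched in Python as follows. This is a minimal illustration and not the course's own code: the function name `rank_and_flag`, the `contamination` parameter, the simulated data, and the choice of Euclidean distance to the coordinate-wise median as the outlyingness score are all assumptions made for the example.

```python
import numpy as np

def rank_and_flag(scores, contamination=0.05):
    """Order observations by outlyingness and flag those above a quantile threshold.

    `scores` holds one outlyingness value per observation (larger = more abnormal).
    `contamination` is a hypothetical, application-specific share of anomalies.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)                       # most outlying first
    threshold = np.quantile(scores, 1.0 - contamination)
    flags = scores > threshold                        # True = declared abnormal
    return order, threshold, flags

# Simulated data (assumption): 195 standard normal points plus 5 shifted ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:5] += 8.0

# Stand-in outlyingness score: distance to the coordinate-wise median.
center = np.median(X, axis=0)
scores = np.linalg.norm(X - center, axis=1)

order, threshold, flags = rank_and_flag(scores, contamination=0.025)
print("indices flagged as abnormal:", np.flatnonzero(flags))
```

Any other ordering, such as one induced by a data depth, can replace the distance-based score without changing the thresholding step.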
Among non-parametric orderings, data depth occupies a special place today. Given an observation, it measures how typical (or deep) this observation is with respect to other available observations of the same nature. Multivariate data depth possesses attractive properties such as robustness and affine invariance, and it can be further extended to functional depth (Gijbels & Nagy, 2017). In this tutorial, the methodology is addressed in two parts.
Part I: After formalizing the task and briefly reviewing the most widely used existing methods, we discuss the concept of data depth in the multivariate setting, that is, for data representable as points in a Euclidean space. We then review the most common notions of the depth function: halfspace (Tukey, 1975), projection (Zuo & Serfling, 2000), zonoid (Mosler, 2002), and spatial depth (Koltchinskii, 1997). Finally, illustrative examples in R and Python are provided.
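As a small illustration of one of the depth notions listed above, the sample spatial depth of a point x with respect to observations X_1, ..., X_n can be written as one minus the Euclidean norm of the average unit vector pointing from the observations to x. The sketch below implements this basic, non-affine-invariant version with NumPy. The function name `spatial_depth` and the toy data are assumptions for the example and are not taken from the course materials.

```python
import numpy as np

def spatial_depth(x, data):
    """Sample spatial depth of point x with respect to the rows of `data`.

    Computed as 1 - || mean of unit vectors (x - X_i) / ||x - X_i|| ||.
    Observations coinciding with x contribute a zero vector to the mean.
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    diffs = x - data                                  # shape (n, d)
    norms = np.linalg.norm(diffs, axis=1)
    nonzero = norms > 0                               # skip points equal to x
    units = diffs[nonzero] / norms[nonzero, None]
    return 1.0 - np.linalg.norm(units.sum(axis=0) / len(data))

# Toy data (assumption): four points placed symmetrically around the origin.
data = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

depth_center = spatial_depth([0.0, 0.0], data)    # central point: unit vectors cancel
depth_far = spatial_depth([100.0, 0.0], data)     # remote point: unit vectors align
print(depth_center, depth_far)
```

A central point receives a depth close to one and a remote point a depth close to zero, so low depth can serve as an outlyingness ordering.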
Keywords: Anomaly detection, machine learning, data depth, multivariate ordering, functional ordering, robustness, outliers, ranking, computational statistics, time series.
Slides: Slides on Part I (pdf)
Link to complete materials: Section ‘Multivariate and functional anomaly detection’ (html)