Statistics is concerned with the analysis of data. Traditionally, a dataset consists of data for a number of observations and a number of variables. For example, the grades of every mathematics student (observation) for all of his different course units (variables) can form a dataset. Outliers are observations that deviate from the pattern exhibited by the majority of the observations. When analysing the data, these outliers can weigh heavily on the results. Robust statistics is the branch of statistics that deals with the effect of outliers. Detecting outliers is important in this branch, but so are the analysis of these outliers and the development of methods that are less sensitive to them. In this master thesis we mainly address the following question: once an outlier has been detected in a dataset, how do we find the variables that contribute most to the deviating behaviour of that outlier? To answer this question, we discuss a method based on the direction in which the deviation of the outlier is maximal. The variables that determine this direction the most are then selected. Depending on whether the dataset is high-dimensional, i.e. consists of relatively many variables and few observations, or low-dimensional, different techniques are required. This master thesis consists, on the one hand, of the necessary background and the theoretical foundation of the methods. On the other hand, we discuss a range of simulation examples and two real datasets to test our methods in various situations. The practical value of our methods lies mainly in datasets with a large number of variables, where studying outliers is particularly difficult.
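As an illustration of the direction-based idea (not the thesis' own algorithm): in the low-dimensional case, the direction in which the standardised deviation of a flagged outlier x0 from the robust centre is maximal is proportional to the inverse covariance matrix times the deviation. A minimal R sketch, assuming a hypothetical data matrix X and outlier x0, and using covMcd from the robustbase package:

library(robustbase)                        # for covMcd
fit <- covMcd(X)                           # robust centre and covariance
v   <- solve(fit$cov, x0 - fit$center)     # direction of maximal standardised deviation (up to scale)
contrib <- setNames(abs(v) / sum(abs(v)), colnames(X))
sort(contrib, decreasing = TRUE)           # variables contributing most to the outlying behaviour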
In this master thesis, an adaptation of the concordance probability to the specific needs of the technical pricing of a non-life insurance product is proposed. Separate measures are developed for the frequency data, i.e. the data used to estimate the expected number of claims, and for the severity data, i.e. the data used to estimate the expected claim size. Since such insurance data sets are typically rather large, the estimation methods need to be adapted to handle this larger data volume. To this end, two estimation methods are proposed, each with their own strengths and weaknesses, as investigated in a very extensive simulation study. Note that both estimation methods can be applied to the calculation of any version of the concordance probability, including the very widely used AUC measure. Well-functioning confidence intervals are designed for both approximations as well. In the extensive application chapter of this master thesis, the proposed measures are applied to a real insurance data set to select the optimal main and interaction effects of the industry standard model, i.e. the Generalized Linear Model or GLM. To this end, a genetic algorithm or GA is used, as well as an adaptation of an existing binning technique for continuous explanatory variables. In a second analysis, two Machine Learning or ML models are applied to the data, and the GLM that shows the highest correlation with the considered ML model is searched for using a GA and the proposed measures. Indeed, the best correlating GLM could be used as a proxy for (partially) explaining the predictions of the considered ML model. Finally, the discriminatory ability of the selected models is investigated in more detail using the proposed measures.
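As an illustration (not the thesis' approximations for large data sets), a minimal R sketch of the exact, pairwise concordance probability for a binary outcome, which coincides with the usual AUC; pred and y are hypothetical vectors of model predictions and observed outcomes:

concordance <- function(pred, y) {
  p1 <- pred[y == 1]                       # predictions for the "event" group
  p0 <- pred[y == 0]                       # predictions for the "non-event" group
  cmp <- outer(p1, p0, ">") + 0.5 * outer(p1, p0, "==")
  mean(cmp)                                # P(pred_event > pred_non-event), ties counted as 1/2
}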
In this thesis we investigate a new regression method for modelling binary outcomes. Such methods are commonly used in various fields and provide data-driven insights in situations where the outcome of interest is a choice between two alternatives. Some of the potential questions that such models can answer are the following: “How likely is it that a customer will buy an offered product?”, “How likely is it that a customer will default on its mortgage payments?”, “How likely is it that a patient will experience complications following a certain medical procedure?”. To answer these questions, there is a need for data. Such recorded (historical) data contains several observations, each consisting of predictors and an outcome. The predictors can be all kinds of quantitative or qualitative data; for the example of the medical procedure, the patient's age, weight, known allergies and other risk factors are possible predictors. A very popular type of regression model for binary outcomes is the logistic regression model, since it has a strong statistical foundation in the Generalized Linear Model (GLM) framework and an easy interpretation in terms of odds ratios. However, the fitting procedure of choice for logistic regression, Maximum Likelihood Estimation (MLE), is not very suitable for many modern real-life applications. In the age of big data, the abundance of data is both a blessing and a burden: a large array of potential predictors is often available, but many of these are likely to be uninformative. Even more extreme is the case with more potential predictors than observations (n < p problems), which makes such models impossible to fit using maximum likelihood. Ideally, a sparse regression method should perform a selection of informative predictors in an automatic and statistically justified manner. Another issue is the unwanted effect that outliers have on the maximum likelihood estimates. An outlier is a peculiar observation (an anomaly) that can completely change the resulting fitted regression, potentially removing any usefulness the model had for all other observations. Ideally, a robust regression method has an automatic way to detect and remove the effects of such outliers. A recent invention is the Elastic Net – Least Trimmed Squares estimator (ENet-LTS), a regression method that automatically performs the selection of useful predictors as well as the removal of outlying observations for logistic regression. Given the very recent invention of this method, this thesis investigated potential improvements. One of these is the use of information criteria as an alternative to the computationally heavy cross-validation procedure for the tuning of hyperparameters. These hyperparameters are parameters that need to be determined outside of the usual fitting procedure but are crucial for the performance of the model. To calculate the information criteria, a new way of estimating the degrees of freedom was introduced. The performances of these criteria were compared in a simulation study. It was found that the use of information criteria is competitive with cross-validation and provides shorter computation times. Finally, an important aspect of this thesis is the inclusion of a thoroughly revised and updated version of the existing software implementation in the statistical software R. The new version contains the information criteria approach (with the degrees of freedom calculation), an improved reweighting step and several other improvements and bugfixes.
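A minimal R sketch of the information-criterion idea, assuming the standard glmnet package and taking the degrees of freedom as the number of nonzero coefficients; the thesis' ENet-LTS estimator adds the trimming of outliers and a refined degrees-of-freedom estimate, which are not shown here. X and y are a hypothetical design matrix and binary response:

library(glmnet)
fit <- glmnet(X, y, family = "binomial", alpha = 0.5)    # elastic-net path over a grid of lambda
bic <- deviance(fit) + log(fit$nobs) * fit$df            # BIC along the path; df = nonzero coefficients
lambda_bic <- fit$lambda[which.min(bic)]                 # penalty chosen by BIC instead of cross-validation
coef(fit, s = lambda_bic)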
In this thesis, a new robust and sparse method for logistic regression based on the gamma-divergence measure is proposed and investigated in detail, both theoretically and numerically. The motivation for a robust and sparse regression method is two-fold. Firstly, high-dimensional data, in which the number of explanatory variables p is (much) larger than the number of observations n, emerge from various research domains. The enormous number of explanatory variables poses a challenge to model selection and parameter estimation, and researchers would certainly prefer a model that is more concise and interpretable. Another issue we often encounter in real-world data is that some observations do not follow the pattern of the majority of the data. The presence of these so-called outliers may distort the results of conventional regression methods and lead to a poor model. A possible solution to the first issue is to introduce an extra constraint or penalty term that imposes sparsity among the variables, and a solution to the second issue is to replace the non-robust loss function with a robust counterpart. Combining the two ideas above, a new type of robust loss function for logistic regression based on the gamma-divergence measure is proposed. It assigns smaller weights to outliers through a power transformation of the density function. A penalty term that produces a sparse solution is added to this loss function. The robustness property is shown both in theory and from the perspective of the redescending behaviour of its psi-function. An iterative procedure, the MM-algorithm, is introduced to overcome the optimization difficulty caused by the non-convexity of the robust loss function. In addition, a new robust tuning-parameter selection criterion built on the same divergence measure is discussed. The proposed method is then compared with other existing methods through numerical simulations and real data analysis, and it is shown to outperform them, producing more stable and efficient results.
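As a sketch, one formulation of the gamma-cross-entropy loss for logistic regression as it appears in the gamma-divergence literature (the exact normalisation and the penalty term used in the thesis may differ); X, y and beta are hypothetical:

gamma_loss <- function(beta, X, y, gamma = 0.5) {
  p <- plogis(as.vector(X %*% beta))                     # P(Y = 1 | x)
  f <- ifelse(y == 1, p, 1 - p)                          # model density at the observed outcome
  norm <- (p^(1 + gamma) + (1 - p)^(1 + gamma))^(gamma / (1 + gamma))
  -1 / gamma * log(mean(f^gamma / norm))                 # outliers enter with weight ~ f^gamma
}

Adding a sparsity-inducing penalty to such a loss and minimising it with an MM-type algorithm gives an estimator of the kind studied in the thesis.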
Statistics is the methodology of collecting and studying data. A collection of data, or a dataset, is usually presented in tabular form. Each row represents a particular observation and each column represents a variable. For example, the results of all pupils in a class (observations) for the various subjects (variables) can form a dataset. As is often the case in science, statistics consists of several subdomains. Robust statistics, for instance, deals with the effect of strongly deviating observations in a dataset. These observations, called outliers, deviate from the pattern formed by the large majority of the observations. When analysing data, they can weigh heavily on the results. For that reason, robust methods have been developed, which are hardly influenced by outliers. Moreover, such methods are able to detect strongly deviating observations. When an outlier is detected in a dataset, this does not necessarily mean that the observation behaves abnormally in all variables. For instance, a pupil may be considered an outlier because of weak results at school, even though only the scores for French and mathematics are very low and the remaining scores lie around the median. In line with this, this master thesis focuses mainly on the following problem: once an outlier has been detected in a dataset, how can the variables that contribute most to the deviating behaviour of the outlier be identified? To answer this question, we can look at the direction in which the deviation of the outlier is maximal. This direction is determined by nearly all variables, although some matter more than others. Those that weigh heavily in determining the direction have a large impact on the deviating behaviour of the outlier. In this master thesis we discuss two algorithms based on a similar direction. However, that direction is no longer determined by all variables, but only by a subset of them. This group of variables is ultimately selected as the group contributing most to the deviating behaviour of the outlier. The above problem is especially challenging when dealing with high-dimensional data, i.e. data with a very large number of variables. In genetics, for example, datasets typically contain hundreds or thousands of genes (variables). It is perfectly possible that an observation shows abnormal values for only a few of these genes.
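As a naive first look (not the thesis' sparse-direction algorithms), one can compute robust per-variable z-scores of a flagged outlier; this ignores the joint, multivariate structure of the data that the direction-based methods exploit, but it works even when there are far more variables than observations. A minimal R sketch with a hypothetical data matrix X and outlier x0:

z <- (x0 - apply(X, 2, median)) / apply(X, 2, mad)   # robust standardised deviations per variable
sort(abs(z), decreasing = TRUE)                      # variables with the most abnormal values first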
Rheumatoid Arthritis (RA) is a chronic, inflammatory, autoimmune disease characterized by painful and swollen joints. Because novel guidelines recommend treating RA early, intensively, and to-target, access to specialized treatment has come under stress in several countries. In this thesis, we explored two approaches to reducing the number of clinical visits required for patients in the early phase (<2 years) of RA treatment. For the first approach, a literature search was performed for both classical techniques in survival analysis and more recently proposed techniques inspired by the machine learning community. Three models were selected: the Cox Proportional Hazards model (Cox PH), Linear Multi-Task Logistic Regression (L-MTLR) and Random Survival Forests (RSF). The L-MTLR and RSF were proposed more recently to relax assumptions and increase the flexibility of the classical methods. These methods were applied to the Care in Rheumatoid Arthritis (CareRA) trial. This indicated that the more flexible L-MTLR and RSF techniques did not improve the predictive performance compared to the Cox PH. The models obtained a reasonable ranking of patients’ risk but did not manage to discriminate individual visits. Interestingly, our analysis confirmed that prognostic factors obtained after initial treatment response improved the relative ranking of patients' risk for the need for later intervention over the first two years of treatment. For the second approach to visit discrimination, we turned our attention to the collected patient-reported outcome questionnaires (PROs): the Multi-Dimensional Fatigue Index (MFI), Short Form 36 (SF36), Illness Perception Questionnaire (IPQ), Utrecht Coping List (UCL), Social Support List (SSL), Pittsburgh Sleep Questionnaire (PSQ), RA Quality of Life (RAQOL) and patient-reported global health, pain and fatigue as measured by the Visual Analogue Scale (VAS). Such PROs, consisting of many questions, are usually summarized by one or several (sub)scores encoding different aspects of the unmet need assessed by the questionnaire. Relations between the (sub)scores of the PROs were investigated using the GLASSO graphical model. This analysis showed that the PRO subscores cluster into groups, with most edges connecting subscores within the same group. Three such groups were observed: (1) illness consequences perception, (2) coping strategies, and (3) mental, social, and physical functioning. The presence of cluster (3) confirmed the interplay between mental, social, and physical functioning in patients undergoing RA treatment. The overall structure of the estimated relations was also shown to be robust over the first two years of treatment by comparing the estimated graphs at baseline, after one year, and after two years. Given the high connectedness between PRO subscores obtained by the GLASSO, a preliminary analysis was performed that leverages the estimated relations to suggest subscores whose information can be reasonably reconstructed from the remaining PROs. Requiring R^2 > 0.70 suggested that VAS PGA, MFI Physical Function, SF36 Vitality, RAQoL, MFI Reduction in Motivation, SSL Emotional Problems, and SF36 Bodily Pain largely contain information already included in the other PRO subscores.
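A minimal R sketch of the GLASSO step, assuming a hypothetical matrix pro_scores of PRO subscores (patients in rows, subscores in columns) and using the glasso package; the penalty choice and preprocessing used in the thesis are not reproduced here:

library(glasso)
S    <- cor(pro_scores, use = "pairwise.complete.obs")   # correlation matrix of the subscores
fit  <- glasso(S, rho = 0.1)                             # rho controls the sparsity of the graph
P    <- fit$wi                                           # estimated sparse precision matrix
pcor <- -cov2cor(P); diag(pcor) <- 1                     # partial correlations; nonzero entries = edges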
Data is currently growing at an explosive rate and is often used to gain insight into a certain process. Such insights can be obtained by predicting one response variable from all the other, explanatory, variables; this is called regression. Once such a model is created, an idea of the actual relationship between the response variable and the explanatory variables is formed. In this thesis we focus on Poisson regression, which means that the variable we want to predict follows a Poisson distribution (used to model event occurrences). As Poisson regression is a generalized linear model, the focus is first set on these models. A frequently encountered problem in regression is the presence of extreme observations, possibly caused by an error. It is undesirable that such an observation changes the entire model and thus modifies the interpretation of the process. Methods that are not affected by this type of observation are called robust, and an example of such an estimator is the Mallows quasi-likelihood estimator, proposed by Cantoni and Ronchetti. Another recurring issue in regression is the presence of too many explanatory variables, which makes the interpretation very hard. Moreover, many of them often have no significant influence on the response variable. A solution to this issue is selecting the most relevant explanatory variables, resulting in a so-called sparse solution for the regression coefficients. One of the methods able to obtain such a result is the lasso. In this thesis, we define the sparse Mallows quasi-likelihood estimator (SMQLE) by combining the lasso with the Mallows quasi-likelihood estimator. This estimator is used to determine sparse and robust regression coefficients for a Poisson model. With the theoretical definition of this new estimator in mind, an algorithm is created to determine this estimator in practice for real data as well. This algorithm is implemented in R and is called smqleIRWPLS. Here, IRWPLS stands for iteratively reweighted penalized least squares, which is the internal procedure used in the algorithm. With the algorithm implemented, simulations are carried out. They confirm that the SMQLE behaves as desired, as long as the data set contains enough observations, meaning that it is indeed a robust algorithm that is able to select the right explanatory variables. However, before the algorithm can start, it needs a value for the so-called tuning parameter associated with the lasso penalty. If the data originate from simulations, it is rather easy to determine this value, since it is known beforehand how many explanatory variables should be selected. This approach is no longer possible when the estimator needs to be calculated for a real data set. Therefore, a new method is proposed to determine the most appropriate value for the tuning parameter, based on robust adaptations of the Bayesian information criterion. Although the algorithm is able to select the most appropriate tuning parameter automatically, it is advised to also check the criterion graphically, especially in situations with few observations. In these cases, the tuning parameter is often chosen too large, which results in a regression model that is sparser than it ideally should be. With this point of attention in mind, the final part of this master thesis is the application of the smqleIRWPLS algorithm to several real data sets.
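A small R sketch of the Mallows/Huber-type reweighting idea behind such a robust Poisson fit (this is not the thesis' smqleIRWPLS implementation, and the lasso penalty of the penalized least-squares step is omitted); X is a hypothetical design matrix that includes an intercept column:

huber_weight <- function(r, k = 1.345) pmin(1, k / abs(r))      # weight 1 for small residuals, < 1 for large ones

robust_poisson_step <- function(X, y, beta) {
  mu <- exp(as.vector(X %*% beta))              # current fitted Poisson means
  r  <- (y - mu) / sqrt(mu)                     # Pearson residuals
  w  <- huber_weight(r)                         # downweight observations with large residuals
  glm.fit(X, y, weights = w, family = poisson())$coefficients   # one weighted refit
}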
Billions of dollars are lost every year due to fraudulent credit card transactions. The design of efficient machine learning techniques for detecting fraud could help reduce these losses. Each time a credit card is used, all kinds of transactional data composed of different attributes (e.g. credit card identifier, transaction date, transaction amount, country where the transaction took place) are recorded. Automatic systems for detecting fraud are nowadays essential, since the large number of transactions and variables available usually makes it impossible for human analysts to notice fraudulent behaviour. The ultimate goal of fraud detection algorithms is to label new transactions as legitimate or fraudulent. For this purpose, two different types of techniques can be used: supervised techniques that make use of labelled transactions and unsupervised techniques that do not use any labelling. For supervised techniques, we assume that reliable class labels of past transactions are available. These labelled observations are used for predicting the class labels of new transactions. Unsupervised techniques, on the other hand, make no use of the classes of transactions. Here one tries to find fraudulent behaviour by grouping transactions together or by finding rare observations that do not correspond to the usual behaviour of the majority of the data. The latter is also called anomaly detection. Both the supervised and the unsupervised methods rely strongly on a clear notion of similarity: the goal is to learn from instances based on how similar they are. The unsupervised learning technique groups similar objects together, while the supervised technique needs nearby objects to decide on the label of new instances. Many machine learning systems require that the input attributes are numerical in order to measure similarity in an appropriate way; however, these techniques often ignore categorical attributes or do not handle them properly. The aim of this thesis is to use heterogeneous distance functions in various machine learning techniques for credit card fraud detection. These heterogeneous distance functions are designed to handle applications with numerical, categorical, or both numerical and categorical attributes. Adding nominal attributes and using a heterogeneous distance function are shown to be useful, leading to improved versions of the existing (numerical) techniques.
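As an illustration, a small R sketch of one such heterogeneous distance, the Heterogeneous Euclidean-Overlap Metric (HEOM), between two transactions x and z; is_cat flags the categorical attributes and rng holds the precomputed range of each numerical attribute (all names are hypothetical, and this need not be the exact distance used in the thesis):

heom <- function(x, z, is_cat, rng) {
  d <- numeric(length(x))
  for (j in seq_along(x)) {
    if (is.na(x[[j]]) || is.na(z[[j]])) {
      d[j] <- 1                                    # maximal distance when a value is missing
    } else if (is_cat[j]) {
      d[j] <- as.numeric(x[[j]] != z[[j]])         # overlap: 0 if equal, 1 otherwise
    } else {
      d[j] <- abs(as.numeric(x[[j]]) - as.numeric(z[[j]])) / rng[j]   # range-normalised difference
    }
  }
  sqrt(sum(d^2))
}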
In this master thesis the optimality of combining two data sources (paid and incurred data) in claims reserving is discussed. Joint estimation in the chain-ladder framework is possible by means of the Munich Chain-Ladder model developed by Quarg and Mack (2004) and the General Multivariate Chain-Ladder (GMCL) model developed by Zhang (2010). The first part of this work focuses on the univariate chain-ladder models that replicate the standard chain-ladder results, with particular attention to the differences in variability between the models. In a second part, both multivariate models are described, together with their advantages over the separate chain-ladder models. Since optimality does not hold in general, emphasis is put on the drivers that can lead to it. All described models are applied to data available in the ChainLadder package in R and the results are discussed in this work.
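A minimal R sketch using the ChainLadder package mentioned above and its built-in MCLpaid and MCLincurred triangles; the exact model settings used in the thesis may differ:

library(ChainLadder)
mack <- MackChainLadder(MCLpaid, est.sigma = "Mack")                # univariate benchmark on the paid triangle
mcl  <- MunichChainLadder(Paid = MCLpaid, Incurred = MCLincurred)   # joint paid/incurred model (Quarg & Mack)
gmcl <- MultiChainLadder(list(MCLpaid, MCLincurred))                # general multivariate chain ladder (Zhang)
summary(mcl)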
In statistics, a dataset consists of information from measurements (i.e. variables) on different samples. It is well known that real datasets often contain outliers. An outlier is an observation (i.e. sample) that deviates from the majority of the observations. It is of great importance to be able to detect the outlying samples in a dataset. Contaminated observations can be detected by measuring the ‘statistical distance’ between each observation and the center of the regular (clean) samples, where the dispersion of the data has to be taken into account. If the distance of an observation is larger than a particular threshold, then we flag that observation as an outlier. In order to compute the dispersion of the data we need a so-called covariance matrix. A matrix is a rectangular array of numbers, arranged in rows and columns. A covariance matrix contains the underlying correlations (i.e. ‘relations’) between the different variables. The center (or mean) of the regular observations and the covariance matrix have to be estimated from the dataset by estimators. Of course, these estimators should not be affected by (or at least be less sensitive to) possible contamination in the data. This topic goes by the name of robust statistics. Most covariance estimators require that the number of available samples is larger than the number of variables in the dataset. However, modern datasets are often high-dimensional, meaning that the number of variables exceeds the sample size. The goal of this thesis is to find a covariance estimator that is able to handle high-dimensional data and at the same time is robust against potential outliers. The starting point of the thesis is the popular Minimum Covariance Determinant (MCD) estimator of Rousseeuw (1984). This estimator is highly robust against outliers, but it is not suited for high-dimensional datasets. Boudt et al. (2016) found a way to generalize and extend the MCD estimator to a high number of dimensions, based on the shrinkage principle proposed by Ledoit and Wolf (2003). The result is the Minimum Regularized Covariance Determinant (MRCD) estimator. We explore the MRCD method in great depth and show that the new method is indeed well suited for robustly estimating both the center and the covariance matrix of a large, high-dimensional dataset. We explore further improvements of the MRCD estimator and investigate its performance by means of a simulation study. The MRCD estimator allows us to analyze real high-dimensional datasets and detect potential contamination in the data. Furthermore, we explore the value of the MRCD method in a financial application: the construction of investment portfolios. The portfolio theory of Markowitz (1952) explains how to allocate capital to different assets (i.e. ‘diversification’) such that the risk of the portfolio is minimal for a given level of expected return. The risk of the portfolio is defined as the variability of the portfolio return. In order to measure this risk, we need the covariance matrix of the returns on each of the assets in the portfolio. Financial portfolios often include several hundreds of stocks or other assets, so the covariance matrix should be estimated by a method that is well suited for these high-dimensional cases. Therefore, asset allocation problems offer a great opportunity to explore the value of the MRCD method in addressing (some of) the problems with the Markowitz model.
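A minimal R sketch of this combination, assuming a hypothetical n x p matrix returns of asset returns and using CovMrcd from the rrcov package (the thesis' own refinements are not reproduced):

library(rrcov)
fit   <- CovMrcd(returns, alpha = 0.75)          # robust, regularised covariance estimate (MRCD)
Sigma <- getCov(fit)
mu    <- getCenter(fit)
w     <- solve(Sigma, rep(1, ncol(Sigma)))
w     <- w / sum(w)                              # global minimum-variance portfolio weights
rd    <- sqrt(mahalanobis(returns, mu, Sigma))   # robust distances to flag outlying observations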