In this thesis we investigate a new regression method for modelling binary outcomes. Such methods are widely used across fields and provide data-driven insights whenever the outcome of interest is a choice between two alternatives. Typical questions these models can answer include: “How likely is it that a customer will buy an offered product?”, “How likely is it that a customer will default on their mortgage payments?”, “How likely is it that a patient will experience complications following a certain medical procedure?”.

Answering such questions requires data. The recorded (historical) data contain a number of observations, each consisting of predictors and an outcome. The predictors can be quantitative or qualitative; for the medical example, they could include the patient's age, weight, known allergies, and other risk factors.

A very popular regression model for binary outcomes is the logistic regression model, since it has a strong statistical foundation in the Generalized Linear Model (GLM) framework and a straightforward interpretation in terms of odds ratios (see the first sketch below). However, the standard fitting procedure for logistic regression, Maximum Likelihood Estimation (MLE), is poorly suited to many modern real-life applications. In the age of big data, the abundance of data is both a blessing and a burden: a large array of potential predictors is often available, but many of them are likely to be uninformative. The extreme case, with more potential predictors than observations (n < p problems), cannot be fitted by maximum likelihood at all. Ideally, a sparse regression method should select the informative predictors in an automatic and statistically justified manner. A second issue is the undue influence that outliers have on the maximum likelihood estimates. An outlier is a peculiar observation (an anomaly) that can completely change the fitted regression, potentially destroying the model's usefulness for all other observations. Ideally, a robust regression method detects such outliers automatically and removes their effect.

A recent proposal is the Elastic Net – Least Trimmed Squares estimator (ENet-LTS), a regression method for logistic regression that automatically selects useful predictors and removes outlying observations. Given how recent this method is, this thesis investigates potential improvements. One of these is the use of information criteria as an alternative to the computationally heavy cross-validation procedure for tuning the hyperparameters: parameters that must be determined outside the usual fitting procedure but are crucial for the performance of the model. To compute the information criteria, a new way of estimating the degrees of freedom was introduced. The performances of these criteria were compared in a simulation study, which found that information criteria are competitive with cross-validation while requiring shorter computation times (the second sketch below illustrates the idea).

Finally, an important part of this thesis is a thoroughly revised and updated version of the existing software implementation in the statistical software R. The new version contains the information criteria approach (including the degrees-of-freedom calculation), an improved reweighting step, and several other improvements and bug fixes.
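The following is a minimal sketch of a classical logistic regression fit via maximum likelihood, using base R's glm(); the data and variable names (age, weight, complication) are simulated here purely for illustration and are not from the thesis.

set.seed(1)
n <- 200
age <- rnorm(n, 60, 10)
weight <- rnorm(n, 75, 12)
# simulate a binary outcome whose log-odds depend linearly on age and weight
complication <- rbinom(n, 1, plogis(-10 + 0.12 * age + 0.02 * weight))

# fit by maximum likelihood in the GLM framework
fit <- glm(complication ~ age + weight, family = binomial)

# exponentiated coefficients are odds ratios: the multiplicative change
# in the odds of a complication per unit increase of each predictor
exp(coef(fit))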
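The second sketch illustrates the general idea of tuning a penalty hyperparameter with an information criterion instead of cross-validation. It uses the CRAN package glmnet as a stand-in for the ENet-LTS implementation discussed in the thesis, and the degrees-of-freedom estimate (the number of nonzero coefficients) is the classical heuristic for penalized regression, not the new estimator introduced in the thesis.

library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))  # only two informative predictors

# elastic net path for logistic regression over a grid of lambda values
fit <- glmnet(x, y, family = "binomial", alpha = 0.5)

# BIC = deviance + log(n) * df, evaluated at every lambda on the path;
# no refitting over folds is needed, unlike cross-validation
bic <- deviance(fit) + log(n) * fit$df
best <- which.min(bic)

# sparse coefficient vector at the BIC-optimal penalty
coef(fit, s = fit$lambda[best])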