Koh-Lanta, missing data imputation in supervised context

Abstract

Imputation is the process of estimating missing values. The simplest approaches replace each missing value with a fixed value, such as the mean or median of the observed values of the variable concerned. Under the missing at random (MAR) assumption, multivariate approaches can instead estimate missing data from the entire dataset. Most modern algorithms follow this approach, and the most widely used implementations are probably MICE, which uses linear models; missForest, which uses Random Forests; missMDA, which uses regularized PCA models; and k-Nearest Neighbors imputation. To account for sampling variability, following Rubin [4], it is better to propose m values for each missing value instead of a single one. This multiple imputation procedure yields proper imputation, in contrast to improper imputation. In practice, m≈5 is often sufficient. Most existing methods are not well suited to the high-dimensional context, where the sample size n is much smaller than the number of variables p, often written n≪p. In supervised analysis, the response y must be explained by the covariates x. The part of x associated with y can therefore be hard to find, especially in the high-dimensional context where classical imputation methodologies struggle. In this communication, we present a new methodology, called Koh-Lanta, that handles missing values in a supervised context, using multiple imputation and tackling the high-dimensional issues. For this, missing values are considered only in the x part.
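As an illustration of the difference between single and multiple imputation described above, here is a minimal Python sketch on a single variable. It is not the Koh-Lanta method: the helper names (`mean_impute`, `abb_impute`, `pool_rubin`, `multiple_impute_mean`) and the toy data are hypothetical, and the stochastic step is a simple hot-deck draw in the spirit of the approximate Bayesian bootstrap, chosen only because it fits in a few lines. The m estimates are combined with Rubin's rules, where the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
import statistics


def mean_impute(values):
    """Single imputation: replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)
    return [mu if v is None else v for v in values]


def abb_impute(values, rng):
    """One stochastic imputation (assumed stand-in for a real imputation model).

    Resample the observed values with replacement to form a donor pool,
    then fill each None with a random donor.
    """
    observed = [v for v in values if v is not None]
    donors = rng.choices(observed, k=len(observed))
    return [rng.choice(donors) if v is None else v for v in values]


def pool_rubin(estimates, variances):
    """Pool m estimates with Rubin's rules: (pooled estimate, total variance)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # sample variance across imputations
    return q_bar, within + (1 + 1 / m) * between


def multiple_impute_mean(values, m=5, seed=0):
    """Estimate the mean of one variable from m imputed copies of it."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = abb_impute(values, rng)
        estimates.append(statistics.fmean(completed))
        # Squared standard error of the mean within this completed dataset
        variances.append(statistics.variance(completed) / len(completed))
    return pool_rubin(estimates, variances)


# Toy data (hypothetical): None marks a missing value
x = [2.1, None, 3.4, 1.8, None, 2.9, 3.1, None]
single = mean_impute(x)
pooled_mean, total_var = multiple_impute_mean(x, m=5)
print("mean-imputed:", single)
print("pooled mean:", pooled_mean, "total variance:", total_var)
```

Mean imputation returns one completed dataset and hides the uncertainty due to the missing entries, whereas the pooled variance from the m=5 draws includes a between-imputation term that reflects it.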

Date
Location
Brest, France