Query Optimization for Dynamic Imputation


Missing values are common in data analysis and present a usability challenge. Users are forced to pick between removing tuples with missing values or creating a cleaned version of their data by applying a relatively expensive imputation strategy. Our system, ImputeDB, incorporates imputation into a cost- based query optimizer, performing necessary imputations on- the-fly for each query. This allows users to immediately explore their data, while the system picks the optimal placement of imputation operations. We evaluate this approach on three real-world survey-based datasets. Our experiments show that our query plans execute between 10 and 140 times faster than first imputing the base tables. Furthermore, we show that the query results from on-the-fly imputation differ from the traditional base-table imputation approach by 0–20%. Finally, we show that while dropping tuples with missing values that fail query constraints discards 6–78% of the data, on-the-fly imputation loses only 0–21%.

In VLDB 2017