
ngntrgduc/seminar


My Seminar coursework at the University of Science - VNUHCM, 2024.

  • Topic: Handling Missing Data (Xử lý dữ liệu khuyết)
  • Language: Vietnamese
  • Supervisor: Dr. Hoang Van Ha

Overview

In this course, I tried to understand, rephrase, and implement the neural network from the paper "NeuMiss networks: differentiable programming for supervised learning with missing values" (NeurIPS 2020). Although the paper contains some small errors, it is an interesting work that handles missing data in linear regression problems with a neural network architecture called NeuMiss.

Repository structure

  • notebooks/: Contains Jupyter notebooks that demonstrate some experiments
    • Neumann_series_approximation.ipynb: Numerical experiment on approximating a matrix inverse with the Neumann series
    • NeuMiss_network.ipynb: Reimplementation of the NeuMiss network architecture, with experiments under different settings
    • NeuMiss_sota_network.ipynb: Tests of NeuMiss from the authors' later work, "What’s a good imputation to predict with missing values?" (2021)
    • NeuMiss_vs_Others.ipynb: Experiments comparing NeuMiss with impute-then-regress methods
  • report/: Contains the report's PDF and LaTeX source
  • slide/: Contains the slides' PDF and LaTeX source
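
The matrix inverse approximation mentioned for Neumann_series_approximation.ipynb can be sketched as follows. This is only a minimal illustration, not the notebook's code: the function name `neumann_inverse` and the example matrix are assumptions chosen for demonstration. It relies on the standard fact that the sum of (I - A)^k over k converges to the inverse of A when the spectral radius of I - A is below 1.

```python
import numpy as np

def neumann_inverse(A, n_terms):
    """Approximate inv(A) by the truncated Neumann series sum_{k<n_terms} (I - A)^k.

    Hypothetical helper for illustration; the series only converges when the
    spectral radius of (I - A) is strictly less than 1.
    """
    identity = np.eye(A.shape[0])
    residual = identity - A
    term = identity.copy()                    # holds (I - A)^k, starting at k = 0
    approx = np.zeros_like(A, dtype=float)
    for _ in range(n_terms):
        approx += term
        term = term @ residual
    return approx

# Example matrix (an assumption): I - A has spectral radius of about 0.32.
A = np.array([[1.0, 0.2],
              [0.2, 0.8]])
exact = np.linalg.inv(A)
for n_terms in (1, 5, 20, 50):
    error = np.linalg.norm(neumann_inverse(A, n_terms) - exact)
    print(f"{n_terms:>2} terms: error = {error:.2e}")
```

The printed error should shrink as more terms are added, which is the behaviour that makes a truncated series usable in place of an exact inverse.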

Future work

Because of my limited skills and the shortage of time ⌛, I could not learn and do more in this course 🥲. However, here are some ideas, questions, and to-dos I wish I had had time to work on:

  • Make the network run on GPU
  • Study the improved architecture from the authors' later work
    • Implement support for classification problems
  • The data assumption (Gaussian), the MNAR setting (Gaussian self-masking), and the other assumptions are still strong/restrictive.
    • Integrate with Random Matrix Theory (?) -> remove the Gaussian data assumption.
    • Agnostic statistics/agnostic learning (?)
  • Compare with more methods:
    • Mixture models: Gaussian Mixture Model (GMM)
    • Hierarchical models
    • Imputation methods: Optimal Transport, PCA, Matrix Completion, ...
    • Neural network models: GAIN, MisGAN, MIWAE, StableMiss, ...
    • NeuMiss (Le Morvan et al., 2021): NeuMiss can be used for non-linear models by combining it with an MLP
    • NeuMISE: for missingness shift (which can be viewed as data drift/shift) -> usable for real-time applications?
  • NeuMiss is considered a deep neural network. What if only a small amount of data is available? Which method works best then?
  • Experiment with more real-world datasets on linear regression problems
  • How can NeuMiss be extended to work on large datasets?
  • How do outliers affect model performance?
  • How do categorical and continuous variables affect the network?
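
To make the Gaussian self-masking (MNAR) setting from the list above more concrete, here is a rough sketch of how such missingness could be simulated. It assumes, as I understand the setting, that each entry goes missing with a probability that is a Gaussian-shaped function of its own value. The function name `gaussian_self_mask` and the parameters `center`, `width`, and `peak` are hypothetical and not taken from the paper or the notebooks.

```python
import numpy as np

def gaussian_self_mask(X, center=0.0, width=1.0, peak=0.5, seed=None):
    """Hide entries of X with probability peak * exp(-0.5 * ((x - center) / width)^2).

    Hypothetical helper: the chance that a value is missing depends only on
    that value itself, which is what makes the mechanism MNAR.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    prob_missing = peak * np.exp(-0.5 * ((X - center) / width) ** 2)
    mask = rng.random(X.shape) < prob_missing   # True where the value is hidden
    X_missing = np.where(mask, np.nan, X)
    return X_missing, mask

# Simulated Gaussian data (an assumption made only for this demonstration).
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(1000, 3))
X_missing, mask = gaussian_self_mask(X, center=1.0, width=0.5, peak=0.8, seed=1)
print("fraction missing per column:", mask.mean(axis=0))
```

Values close to `center` are hidden most often, so the observed values are no longer a representative sample of the full data.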

Resources: