
GSoC 2015 Proposal: scikit-learn: Cross-validation and Meta-estimators for Semi-supervised learning

Vinayak Mehta edited this page Mar 27, 2015 · 5 revisions

Student Information

Name: Vinayak Mehta

E-mail: [email protected]

Telephone: +91-9868525444

Time Zone: GMT+5:30 (New Delhi, India)

IRC: vortex_ape

GitHub: vortex-ape

Twitter: vortex_ape

Blog: http://vinayakmehta.me

GSoC Blog RSS feed: http://vinayakmehta.me/feeds/all.rss.xml

My background and programming experience:

I am a third-year undergraduate student of computer science and engineering at Bharati Vidyapeeth’s College of Engineering, Delhi. I started programming in C++ in class 11, where I made a couple of projects. In my second semester of college I took an introductory programming course in which I learned C, followed by a Java-based programming course in my fifth semester. I also have experience with MATLAB (from my Digital Signal Processing assignments). I’m a MOOC fan and have taken a number of online courses, including Algorithms: Design and Analysis (in addition to an algorithms course taught at my college), Introduction to Machine Learning (Udacity), and the Machine Learning course offered by Georgia Tech at Udacity for their Online MS CS program. At the start of my fifth semester I was introduced to Python, and it was love at first sight. I have coded a couple of projects in Python, using scikit-learn for the latter two courses; they are listed in the Additional Information section of this proposal. Currently, I’m taking Prof. Ng’s Machine Learning course at Coursera as well as an Artificial Intelligence course at edX, both scheduled to finish in April 2015. I like to take part in programming contests and have qualified for the onsite rounds of ACM ICPC multiple times. Programming contests have given me experience in implementing complex algorithms.

University Information

University: Guru Gobind Singh Indraprastha University

Major: Computer Science and Engineering

Current Year and Expected Graduation Date: Third year, 1 August, 2016

Degree: Bachelor of Technology

Project Proposal Information

Title: scikit-learn: Cross-validation and Meta-estimators for Semi-supervised learning

Abstract:

This proposal aims to make the semi_supervised module a first-class citizen in scikit-learn. The main step is to improve the cross_validation module’s support for semi-supervised estimators. In addition, a self-learning meta-estimator will be implemented, new algorithms will be added to the semi_supervised module, and the existing ones will be improved.

Motivation:

Semi-supervised learning is still not a first-class citizen in scikit-learn, even though it is an important class of machine learning methods that can outperform purely supervised models when labelled data is scarce. Acquiring large amounts of labelled data is difficult, and users should be able to concentrate on the research task at hand instead of spending their time on labelling.

I have been contributing to scikit-learn for over a month now, and I understand the code base and the project’s coding procedures well enough to complete the proposed project successfully. I also have machine learning experience from the projects mentioned in the Additional Information section.

Milestones:

  • Improve the cross_validation module to accommodate semi-supervised learning better, ensuring that the changes play well with the rest of scikit-learn.
  • Improve the existing semi_supervised module to accommodate the changes made to cross_validation.
  • Implement a meta-estimator based on the self-learning method.
  • Implement the proposed semi-supervised learning algorithms.
  • Write benchmarks, tests, documentation and examples for the newly added components.

Details:

The current LabelPropagation implementation marks unlabelled samples with the label -1 and expects them to be mixed in with the labelled data. When the cross_validation module is used, it splits the whole dataset, unlabelled samples included, so unlabelled samples can end up in the test folds. This causes problems because the -1 labels are meaningless for scoring.
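
One way to picture the intended behaviour is a splitter that keeps every unlabelled sample in the training fold and draws the test folds from labelled samples only. The following is a minimal plain-Python sketch under that assumption. The function name `semi_supervised_kfold` and its `n_folds` parameter are hypothetical, not part of scikit-learn, and the actual design is still to be discussed with the community.

```python
UNLABELLED = -1  # marker LabelPropagation uses for unlabelled samples


def semi_supervised_kfold(y, n_folds=3):
    """Yield (train, test) index lists; test folds hold labelled samples only.

    Hypothetical helper for illustration, not an existing scikit-learn API.
    """
    labelled = [i for i, label in enumerate(y) if label != UNLABELLED]
    unlabelled = [i for i, label in enumerate(y) if label == UNLABELLED]
    if n_folds < 2 or n_folds > len(labelled):
        raise ValueError("n_folds must be between 2 and the number "
                         "of labelled samples")
    for k in range(n_folds):
        test = labelled[k::n_folds]  # every n_folds-th labelled sample
        held_out = set(test)
        # Unlabelled samples always stay on the training side.
        train = sorted([i for i in labelled if i not in held_out] + unlabelled)
        yield train, test


y = [0, -1, 1, -1, 0, 1, -1, 1, 0]
folds = list(semi_supervised_kfold(y, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

With a split like this, a scorer never sees a -1 label, while the estimator can still use the unlabelled samples during fitting.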

The iterative self-learning method [1] [2] will be used to implement a meta-estimator, SelfLearner, which can turn any estimator into a semi-supervised one.
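
The self-learning loop can be sketched as follows: fit the base estimator on the labelled samples, pseudo-label the unlabelled samples it is confident about, and repeat until nothing new is added. This is only an illustration in plain Python. The constructor parameters `threshold` and `max_iter`, the `labels_` attribute, and the toy `CentroidClassifier` are all assumptions made for the example and are not the final SelfLearner API.

```python
import math

UNLABELLED = -1  # same marker LabelPropagation uses for unlabelled samples


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


class SelfLearner:
    """Hypothetical sketch: self-training around any probabilistic classifier."""

    def __init__(self, base, threshold=0.9, max_iter=10):
        self.base = base            # needs fit, predict, predict_proba, classes_
        self.threshold = threshold  # minimum confidence to accept a pseudo-label
        self.max_iter = max_iter

    def _fit_labelled(self, X, y):
        known = [i for i, label in enumerate(y) if label != UNLABELLED]
        self.base.fit([X[i] for i in known], [y[i] for i in known])

    def fit(self, X, y):
        y = list(y)  # copy, so the caller's labels are not modified
        for _ in range(self.max_iter):
            self._fit_labelled(X, y)
            unknown = [i for i, label in enumerate(y) if label == UNLABELLED]
            if not unknown:
                break
            probs = self.base.predict_proba([X[i] for i in unknown])
            confident = [(i, argmax(p)) for i, p in zip(unknown, probs)
                         if max(p) >= self.threshold]
            if not confident:
                break  # no prediction is confident enough; stop iterating
            for i, k in confident:
                y[i] = self.base.classes_[k]
        else:
            self._fit_labelled(X, y)  # include labels added in the last round
        self.labels_ = y
        return self

    def predict(self, X):
        return self.base.predict(X)


class CentroidClassifier:
    """Toy 1-D nearest-centroid classifier, used only to exercise the wrapper."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = [sum(x for x, l in zip(X, y) if l == c) / y.count(c)
                           for c in self.classes_]
        return self

    def predict_proba(self, X):
        rows = []
        for x in X:
            weights = [math.exp(-abs(x - c)) for c in self.centroids_]
            total = sum(weights)
            rows.append([w / total for w in weights])
        return rows

    def predict(self, X):
        return [self.classes_[argmax(p)] for p in self.predict_proba(X)]


X = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 5.2]
y = [0, -1, -1, 1, -1, -1, -1]
model = SelfLearner(CentroidClassifier(), threshold=0.9).fit(X, y)
print(model.labels_)
print(model.predict([0.2, 8.0]))
```

In this toy run the ambiguous sample at 5.2 never reaches the confidence threshold, so it stays unlabelled. The threshold is the main design choice: a lower value labels more samples but risks reinforcing early mistakes.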

I also propose implementing the following algorithm, in addition to improving the existing graph-based ones:

  • Transductive SVM [3]
  • More algorithms will be added upon discussion with the community.

Timeline:

  • Pre GSoC (Today to April 27th)

I will design a plan to fix the cross_validation module for semi_supervised estimators, making sure that it plays well with the rest of scikit-learn. I will also continue fixing as many issues as possible to get more familiar with the codebase.

Milestone: Getting more familiar with the codebase

  • Community Bonding Period, Week 1 (April 27th to May 31st)

I will read about the implementations of the algorithms highlighted in this proposal and start fixing the cross_validation module.

Milestone: Complete most of the work on cross-validation, since end-term exams fall in Weeks 2 and 3

  • Week 2, 3 (June 1st to June 14th)

I have my university end-term exams during this period, so my contribution rate may be slower. I will try to fix any issues that arise with the improved cross_validation module and write tests for it.

Milestone: Make the work on cross_validation mergeable for the mid-term evaluation

  • Week 4 (June 15th to June 21st)

The priority will be to make the work mergeable for the mid-term evaluation. I will also study the implementation details of the proposed algorithms further.

Milestone: Make the cross_validation work mergeable

  • Week 5, 6 (June 22nd to July 5th)

Milestone: Implement the self-learning meta-estimator

  • Week 7, 8 (July 6th to July 19th)

I will start implementing the proposed algorithms and improve existing ones.

Milestone: Implementation of algorithms while writing documentation

  • Week 9 (July 20th to July 26th)

Milestone: Complete benchmarking

  • Week 10 (July 27th to August 2nd)

Milestone: Complete tests

  • Week 11 (August 3rd to August 9th)

Milestone: Complete documentation

  • Week 12 (August 10th to August 16th)

Milestone: Complete examples

August 17th (Suggested ‘pencils down’ date.)

During this week, I will finalize the tests, documentation and examples.

August 21st (Firm ‘pencils down’ date.)

Links to a patch/code sample (sorted by date):

Additional Information

References

[1] Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Yarowsky, 1995.

[2] Learning Subjective Nouns Using Extraction Pattern Bootstrapping. Riloff et al., CoNLL-2003. http://www.cs.utah.edu/~riloff/pdfs/conll03.pdf

[3] Large Scale Transductive SVMs http://www.jmlr.org/papers/volume7/collobert06a/collobert06a.pdf

[4] Meta-estimator for semi-supervised learning https://github.com/scikit-learn/scikit-learn/issues/1243

[5] cross-validation generators broken for semi-supervised learning https://github.com/scikit-learn/scikit-learn/issues/2593

[6] Using Cross Validation on semi-supervised classifiers https://github.com/scikit-learn/scikit-learn/issues/3688

[7] http://www.acad.bg/ebook/ml/MITPress-%20SemiSupervised%20Learning.pdf
