GSoC 2015 Proposal: scikit learn: Cross validation and Meta estimators for Semi supervised learning
Name: Vinayak Mehta
E-mail: [email protected]
Telephone: +91-9868525444
Time Zone: GMT+5:30 (New Delhi, India)
IRC: vortex_ape
GitHub: vortex-ape
Twitter: vortex_ape
Blog: http://vinayakmehta.me
GSoC Blog RSS feed: http://vinayakmehta.me/feeds/all.rss.xml
My background and programming experience:
I am a third-year undergraduate student of computer science and engineering at Bharati Vidyapeeth’s College of Engineering, Delhi. I started programming in C++ in class 11, where I built a couple of projects. In my second semester of college, I took an introductory programming course in which I learned C, followed by a Java-based programming course in my fifth semester. I also have experience with MATLAB (from my Digital Signal Processing assignments). I’m a MOOC fan and have taken a number of online courses, including Algorithms: Design and Analysis (in addition to an algorithms course taught at my college), Introduction to Machine Learning (Udacity), and the Machine Learning course offered by Georgia Tech on Udacity for its Online MS CS program. At the start of my fifth semester, I was introduced to Python, and it was love at first sight. I have coded a couple of projects in Python, using scikit-learn, for the latter two courses; they are listed in the Additional Information section of this proposal. Currently, I’m taking Prof. Ng’s Machine Learning course on Coursera and an Artificial Intelligence course on edX, both scheduled to finish in April 2015. I like to take part in programming contests and have qualified for the onsite rounds of ACM ICPC multiple times. Programming contests have given me experience in implementing complex algorithms.
University: Guru Gobind Singh Indraprastha University
Major: Computer Science and Engineering
Current Year and Expected Graduation Date: Third year, 1 August, 2016
Degree: Bachelor of Technology
Title: scikit-learn: Cross-validation and Meta-estimators for Semi-supervised learning
Abstract:
This proposal aims to make the semi_supervised module a first-class citizen in scikit-learn. This can be achieved by improving the cross_validation module’s support for it. A self-learning meta-estimator will be implemented. New algorithms will be added to the semi_supervised module, and the existing ones will be improved.
Motivation:
Semi-supervised learning is still not a first-class citizen in scikit-learn, even though it is an important class of machine learning methods that can outperform purely supervised models when labelled data is scarce. Users should be able to concentrate on the research task at hand instead of spending their time acquiring large amounts of labelled data, which is often difficult.
I have been contributing to scikit-learn for over a month now and understand the codebase and its coding procedures well enough to complete the proposed project successfully. I also have machine learning experience from the projects mentioned under additional information.
Milestones:
- Improve the cross_validation module to accommodate semi-supervised learning better, ensuring that the new changes play well with the rest of scikit-learn.
- Improve the existing semi_supervised module to accommodate the changes made to cross_validation.
- Implement a meta-estimator based on the self-learning method.
- Implement the proposed semi-supervised learning algorithms.
- Write benchmarks, tests, documentation and examples for the newly added components.
Details:
The current implementation in LabelPropagation assumes that unlabelled data carries the label -1, and these samples are mixed together with the labelled data. If cross_validation is used, it splits the whole dataset, unlabelled data included. This causes problems because the -1 labels make no sense for scoring.
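One possible way around the splitting problem described above is sketched here. It is only an illustration of the idea, not the design that will be adopted: the helper name semi_supervised_kfold and its parameters are hypothetical and do not exist in scikit-learn, and the sketch uses plain Python lists to stay dependency-free. Labelled samples are rotated through the test folds, while every sample marked -1 stays in the training set of every fold.

```python
import random

UNLABELLED = -1  # marker for unlabelled samples, as assumed by LabelPropagation


def semi_supervised_kfold(y, n_folds=3, seed=0):
    """Yield (train_indices, test_indices) pairs.

    Hypothetical helper, not part of scikit-learn: only labelled samples
    are rotated through the test folds, while every unlabelled sample
    stays in the training set of every fold.
    """
    labelled = [i for i, label in enumerate(y) if label != UNLABELLED]
    unlabelled = [i for i, label in enumerate(y) if label == UNLABELLED]
    if n_folds < 2 or n_folds > len(labelled):
        raise ValueError(
            "n_folds must be between 2 and the number of labelled samples"
        )

    shuffled = labelled[:]
    random.Random(seed).shuffle(shuffled)

    # Deal the labelled indices round-robin into n_folds test folds.
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train + unlabelled), sorted(test)


y = [0, 1, -1, 0, -1, 1, -1, 0, 1]
for train, test in semi_supervised_kfold(y, n_folds=3):
    # No test index points at a -1 label, so scoring stays meaningful.
    assert all(y[i] != UNLABELLED for i in test)
```

With this kind of split, a scorer only ever compares predictions against real labels, and the unlabelled samples remain available to the semi-supervised estimator during fitting.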
The iterative self-learning method [1] [2] will be used to implement a meta-estimator, named SelfLearner, which can turn any estimator into a semi-supervised one.
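A minimal sketch of how such a self-learning wrapper could work is given below. Everything in it is an assumption made for illustration: the constructor parameters threshold and max_iter, the labels_ attribute, and the toy CentroidClassifier (included only so the example runs without extra dependencies) are not part of the proposal or of scikit-learn. The wrapper only requires that the base estimator offers fit, predict_proba and classes_.

```python
import math

UNLABELLED = -1  # marker for unlabelled samples


class CentroidClassifier:
    """Toy base estimator (illustration only) with fit / predict_proba."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = []
        for c in self.classes_:
            rows = [x for x, label in zip(X, y) if label == c]
            self.centroids_.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, X):
        probs = []
        for x in X:
            # Softmax over negative distances to each class centroid.
            scores = [math.exp(-math.dist(x, c)) for c in self.centroids_]
            total = sum(scores)
            probs.append([s / total for s in scores])
        return probs


class SelfLearner:
    """Hypothetical self-learning wrapper; names and defaults are assumptions."""

    def __init__(self, estimator, threshold=0.75, max_iter=10):
        self.estimator = estimator
        self.threshold = threshold
        self.max_iter = max_iter

    def _fit_labelled(self, X, y):
        keep = [i for i, label in enumerate(y) if label != UNLABELLED]
        self.estimator.fit([X[i] for i in keep], [y[i] for i in keep])

    def fit(self, X, y):
        y = list(y)  # work on a copy so the caller's labels are untouched
        self._fit_labelled(X, y)
        for _ in range(self.max_iter):
            unlabelled = [i for i, label in enumerate(y) if label == UNLABELLED]
            if not unlabelled:
                break
            probs = self.estimator.predict_proba([X[i] for i in unlabelled])
            # Keep only predictions the base estimator is confident about.
            confident = []
            for i, p in zip(unlabelled, probs):
                best = max(range(len(p)), key=p.__getitem__)
                if p[best] >= self.threshold:
                    confident.append((i, self.estimator.classes_[best]))
            if not confident:
                break
            for i, label in confident:
                y[i] = label
            self._fit_labelled(X, y)  # retrain with the new pseudo-labels
        self.labels_ = y
        return self

    def predict(self, X):
        probs = self.estimator.predict_proba(X)
        classes = self.estimator.classes_
        return [classes[max(range(len(p)), key=p.__getitem__)] for p in probs]


X = [[0.0], [0.5], [4.0], [4.5], [0.2], [4.2], [2.4]]
y = [0, 0, 1, 1, -1, -1, -1]
model = SelfLearner(CentroidClassifier()).fit(X, y)
print(model.labels_)
print(model.predict([[0.1], [4.4]]))
```

In this toy run the two unlabelled points close to a class centre receive pseudo-labels, while the ambiguous point near the middle stays at -1 because no prediction for it reaches the confidence threshold.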
Also, I propose implementing the following algorithm, in addition to improving the existing graph-based ones.
- Transductive SVM [3]
- More algorithms will be added upon discussion with the community.
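For orientation, the objective that a Transductive SVM minimizes can be sketched for a linear model as below. This follows the commonly cited formulation (to my reading, the one that [3] builds on): a margin term, a hinge loss on labelled samples, and a symmetric hinge loss on unlabelled samples that pushes them away from the decision boundary. The function names and the parameters C and C_star are illustrative assumptions, the class-balancing constraint is omitted, and labels here are in {-1, +1}, which differs from the -1 marker scikit-learn uses for unlabelled data.

```python
def hinge(t):
    """Hinge loss H_1(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)


def tsvm_objective(w, b, X_labelled, y_labelled, X_unlabelled,
                   C=1.0, C_star=1.0):
    """Objective value for a linear model f(x) = w.x + b.

    Illustrative sketch only: labels are in {-1, +1} and the balancing
    constraint on the unlabelled predictions is not included.
    """
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    margin_term = 0.5 * sum(wi * wi for wi in w)
    # Labelled samples are penalized when on the wrong side of the margin.
    labelled_loss = sum(hinge(y * f(x)) for x, y in zip(X_labelled, y_labelled))
    # Unlabelled samples are penalized when they fall inside the margin.
    unlabelled_loss = sum(hinge(abs(f(x))) for x in X_unlabelled)
    return margin_term + C * labelled_loss + C_star * unlabelled_loss


value = tsvm_objective(
    w=[1.0], b=0.0,
    X_labelled=[[2.0], [-0.5]], y_labelled=[1, -1],
    X_unlabelled=[[0.25], [3.0]],
    C=1.0, C_star=0.5,
)
print(value)
```

The absolute value in the unlabelled term makes the problem non-convex, which is the main difficulty any implementation has to address.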
Timeline:
- Pre GSoC (Today to April 27th)
I will work on designing a plan to fix the cross_validation module for semi_supervised, keeping in mind that it must play well with the rest of scikit-learn. I will continue fixing as many issues as possible to get more familiar with the codebase.
Milestone: Getting more familiar with the codebase
- Community Bonding Period, Week 1 (April 27th to May 31st)
I will read about the implementations of the algorithms highlighted in this proposal and start fixing the cross_validation module.
Milestone: Complete most of the work on cross-validation, because of the end-term exams in Weeks 2 and 3
- Week 2, 3 (June 1st to June 14th)
I have my university end-term exams during this period, so my contribution rate may be slow. I will try to fix any issues that arise with the improved cross_validation module and write tests for it.
Milestone: Make the work on cross_validation mergeable for the mid-term evaluation
- Week 4 (June 15th to June 21st)
The priority will be to make the work mergeable for the mid-term evaluation. I will also study the implementation details of the proposed algorithms further.
Milestone: Make the cross_validation work mergeable
- Week 5, 6 (June 22nd to July 5th)
Milestone: Implement the self-learning meta-estimator
- Week 7, 8 (July 6th to July 19th)
I will start implementing the proposed algorithms and improve existing ones.
Milestone: Implementation of algorithms while writing documentation
- Week 9 (July 20th to July 26th)
Milestone: Complete benchmarking
- Week 10 (July 27th to August 2nd)
Milestone: Complete tests
- Week 11 (August 3rd to August 9th)
Milestone: Complete documentation
- Week 12 (August 10th to August 16th)
Milestone: Complete examples
- August 17th (suggested ‘pencils down’ date)
In this week, I will finalize the tests, documentation and examples.
- August 21st (firm ‘pencils down’ date)
- https://github.com/scikit-learn/scikit-learn/pull/4313 - SpectralClustering should be explicit about include_self (Merged)
- https://github.com/scikit-learn/scikit-learn/pull/4314 - Make clear that RBFSampler implements a variant of Random Kitchen Sinks (Merged)
- https://github.com/scikit-learn/scikit-learn/pull/4350 - Running tests should not print anything on stdout / stderr or warnings (Merged)
- https://github.com/scikit-learn/scikit-learn/pull/4356 - DictVectorizer.restrict docstring unclear (Open)
- https://github.com/scikit-learn/scikit-learn/pull/4377 - LinearSVC(intercept_scaling=0) breaks (Merged)
- https://github.com/scikit-learn/scikit-learn/pull/4389 - Move newton_cg test out of optimize (Merged)
- https://github.com/scikit-learn/scikit-learn/pull/4416 - SVC (and SVR) docstring not informative for kernel="precomputed" (Merged)
- https://github.com/scikit-learn/scikit-learn/pull/4421 - Renaming LDA and QDA to LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis (Open)
- https://github.com/scikit-learn/scikit-learn/pull/4423 - Raising an error when n_clusters <= 0 in AgglomerativeClustering (Open)
- https://github.com/scikit-learn/scikit-learn/pull/4431 - Deprecates load_lfw_pairs and load_lfw_people (Open)
Apart from my two weeks of end-term exams, I have no other commitments during the GSoC period. I will try to make up for these two weeks by starting early in the community bonding period. If, by any chance, my work is not merged by the end of GSoC, I will keep working towards merging it beyond the summer.
I have done the following projects related to machine learning:
- Person of Interest identifier: https://github.com/vortex-ape/POI-Identifier
- Movie Recommender System: https://github.com/vortex-ape/Movie-Recommender-System
[1] Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Yarowsky, 1995
[2] Learning subjective nouns using extraction pattern bootstrapping. Riloff et al., CoNLL-2003 http://www.cs.utah.edu/~riloff/pdfs/conll03.pdf
[3] Large Scale Transductive SVMs http://www.jmlr.org/papers/volume7/collobert06a/collobert06a.pdf
[4] Meta-estimator for semi-supervised learning https://github.com/scikit-learn/scikit-learn/issues/1243
[5] cross-validation generators broken for semi-supervised learning https://github.com/scikit-learn/scikit-learn/issues/2593
[6] Using Cross Validation on semi-supervised classifiers https://github.com/scikit-learn/scikit-learn/issues/3688
[7] http://www.acad.bg/ebook/ml/MITPress-%20SemiSupervised%20Learning.pdf