
GSoC 2015 Proposal: Cross validation and Meta estimators for Semi supervised Learning

Boyuan Deng edited this page Mar 23, 2015 · 7 revisions

Student Information

Name: Boyuan Deng

Email: [email protected]

Telephone: 008618906286868

Time zone: Central European Time (UTC+1)

IRC handle including network: [email protected]

Source control username: bryandeng on GitHub

Twitter: @boyuandeng

Blog: http://boyuandeng-gsoc2015.blogspot.de

GSoC Blog RSS feed: http://boyuandeng-gsoc2015.blogspot.com/feeds/posts/default

University Information

University: Saarland University (Universität des Saarlandes)

Major: Erasmus Mundus LCT (mainly natural language processing; I am also doing machine learning and information retrieval work at the Max Planck Institute for Informatics. After GSoC I will move to another university, as arranged by the program.)

Current Year and Expected Graduation Date: Year 1, expected to graduate in the second half of 2016.

Degree: MSc

Other Related Backgrounds

KDD Cup 2013 - Author-Paper Identification Challenge (Track 1) 6th/554

Project Proposal

Abstract

Although scikit-learn is the de facto statistical machine learning library for Python, its support for semi-supervised learning is still incomplete.

The goal of this project is to provide new algorithm implementations for the sklearn.semi_supervised subpackage, and enable it to interact smoothly and correctly with other components. We particularly want to support cross validation for semi-supervised learning.

Details

Cross Validation for Semi-supervised Learning

Currently the sklearn.cross_validation module is unaware of unlabeled data. When splitting a dataset, it blindly puts unlabeled samples into the test set, which is meaningless and also confuses the scoring function, because there is no true label to score those samples against.

We have to modify the current cross validation infrastructure so that it works correctly for semi-supervised algorithms (including newly added ones), while trying to maintain backward compatibility: existing cross validation code for supervised learning should run without modification.

Because this work changes the cross validation API, it should be done before the new algorithm implementations.

If anyone takes on the “Multiple Metric Support for Cross Validation and Gridsearches” project, the API design needs to be discussed thoroughly with that participant and our mentors.
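As a rough illustration of the splitting behavior we are aiming for, here is a minimal sketch. It is not the proposed API: the helper name `semi_supervised_kfold` and its parameters are hypothetical, and it uses only the standard library so it runs without scikit-learn. The one convention taken from the library is that unlabeled samples carry the label -1, as in sklearn.semi_supervised.

```python
import random

# Marker used by scikit-learn's semi_supervised estimators for unlabeled samples.
UNLABELED = -1


def semi_supervised_kfold(y, n_folds=3, seed=0):
    """Yield (train_indices, test_indices) pairs.

    Hypothetical helper, not part of scikit-learn: only labeled samples are
    rotated through the test folds, and every unlabeled sample stays in the
    training set of every fold.
    """
    labeled = [i for i, label in enumerate(y) if label != UNLABELED]
    unlabeled = [i for i, label in enumerate(y) if label == UNLABELED]
    if n_folds < 2 or n_folds > len(labeled):
        raise ValueError(
            "n_folds must be between 2 and the number of labeled samples"
        )

    # Shuffle labeled indices reproducibly, then deal them out to the folds.
    random.Random(seed).shuffle(labeled)
    for k in range(n_folds):
        test = sorted(labeled[k::n_folds])
        test_set = set(test)
        train = sorted([i for i in labeled if i not in test_set] + unlabeled)
        yield train, test


# Toy labels: six labeled samples and four unlabeled ones.
y = [0, 1, -1, 0, -1, 1, 0, -1, 1, -1]
folds = list(semi_supervised_kfold(y, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

With a splitter of this shape, a scoring function would only ever see labeled test samples, while supervised datasets (no -1 labels) would be split like an ordinary shuffled k-fold. Whether this belongs in a new splitter class or in the existing ones is exactly the API question to settle in week 1.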

New Algorithm Implementations for Semi-supervised Learning

So far the sklearn.semi_supervised subpackage only provides graph-based label propagation algorithms. It would be useful to add more methods.

We plan to implement the “self-taught learning” algorithm as specified in [1]. (It’s generally a semi-supervised algorithm because it uses both labeled and unlabeled data, though the authors emphasize that labeled and unlabeled data don’t necessarily share the same distribution and that’s why it’s different from traditional semi-supervised learning.)
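To make the idea concrete, the following is a minimal sketch of the general self-taught learning pipeline, assuming scikit-learn and NumPy are installed. It is not the planned implementation: the existing `MiniBatchDictionaryLearning` class stands in for the sparse-coding step of [1], and the synthetic data, the hyperparameter values, and the choice of `LogisticRegression` as the final classifier are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic stand-in data (assumption): many unlabeled samples, few labeled ones.
X_unlabeled = rng.randn(200, 16)
X_labeled = rng.randn(40, 16)
y_labeled = (X_labeled[:, 0] > 0).astype(int)

# Step 1: learn a dictionary (set of basis vectors) from the unlabeled data only.
dico = MiniBatchDictionaryLearning(
    n_components=8,
    alpha=1.0,
    transform_algorithm="lasso_lars",  # L1-penalized coding at transform time
    transform_alpha=0.1,
    random_state=0,
)
dico.fit(X_unlabeled)

# Step 2: re-represent the labeled samples as sparse codes over that dictionary.
codes = dico.transform(X_labeled)

# Step 3: train an ordinary supervised classifier on the new features.
clf = LogisticRegression().fit(codes, y_labeled)
predictions = clf.predict(codes)
print(codes.shape, predictions.shape)
```

Because the dictionary is learned without looking at labels, the unlabeled samples do not need to come from the same classes as the labeled ones, which is the distinction the authors draw from traditional semi-supervised learning.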

Timeline

Week 1 (May 25 - May 31) : API design (may start early in the community bonding period).

Week 2, 3 (Jun 1 - Jun 14) : Implement the new API for cross validation.

Week 4, 5 (Jun 15 - Jun 28) : Continue implementation and start writing tests.

Week 6, 7 (Jun 29 - Jul 12) : Finish tests and update documentation. The new API should be mergeable at this point.

Week 8, 9 (Jul 13 - Jul 26) : Implement the self-taught learning algorithm and write the corresponding documentation.

Week 10, 11 (Jul 27 - Aug 9) : Continue implementation and write tests.

Week 12 (Aug 10 - Aug 16) : Improve documentation.

Link to a patch

https://github.com/scikit-learn/scikit-learn/pull/4409

References

Raina, Rajat, et al. "Self-taught learning: transfer learning from unlabeled data." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.

https://github.com/scikit-learn/scikit-learn/issues/1243

https://github.com/scikit-learn/scikit-learn/issues/2593

Other Schedule Information

GSoC overlaps with the teaching period (summer semester) in Germany. However, my course workload will be light thanks to the extra credits I earned earlier this year. And of course, I’m glad to work on weekends for GSoC.

On June 8-9, I’ll attend a meeting in Groningen, the Netherlands.
