GSoC 2019
DFFML participated in Google Summer of Code under the Python Software Foundation umbrella. You can read all about what this means at http://python-gsoc.org/
Huge thanks to our students of the 2019 GSoC program, who significantly grew DFFML's capabilities in using machine learning models and accessing various data sources, and who also contributed various bug fixes.
Sudharsana @sudharsana-kjl
Project: Labeled and Versioned data sources and expansion of data source backends.
- Data Source for HDFS
- Data Source for MySQL protocol databases
- CSVSource allows for setting the Repo's src_url from a csv column
- source: json: Load JSON and patch label in dump
- Labels for JSON sources
- Labels for CSV sources
- source: csv: Add update functionality
- Added support for zip file source
- Git feature cloc logs if no binaries are in path
- util: testing: FileSourceTest fix random call
Yash @yashlamba
Project: Addition of new Machine Learning Models written from scratch and using SciKit sklearn APIs.
- Multiple Scikit Models with dynamic config
- model: scikit: corrected entry points
- Simple Linear Regression model from scratch
- Scikit Linear Regression model
- Added support for Gzip file source
- Added support for lzma file source
- Added support for xz file source
- Added support for bz2 file source
- docs: Fixed a few typos and grammatical errors
- README: Add link to CONTRIBUTING.md
DFFML is a plugin-based library / framework for machine learning. It allows users to wrap high- or low-level implementations of models that use various machine learning libraries, so that they can interact with lots of different model implementations in the same way.
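To make the idea of wrapping different implementations behind one interface concrete, here is a minimal sketch. It is not DFFML's actual API: `SketchModel`, `MeanModel`, `LineModel`, `train`, and `predict` are hypothetical names chosen for illustration, and both models are tiny from-scratch implementations rather than wrappers around a real library.

```python
import abc
from statistics import mean


class SketchModel(abc.ABC):
    """Hypothetical common interface (an assumption, not DFFML's real base class)."""

    @abc.abstractmethod
    def train(self, xs, ys):
        """Fit the model to paired inputs and outputs."""

    @abc.abstractmethod
    def predict(self, x):
        """Return a prediction for a single input."""


class MeanModel(SketchModel):
    """Ignores the input and always predicts the mean of the training outputs."""

    def train(self, xs, ys):
        self.value = mean(ys)

    def predict(self, x):
        return self.value


class LineModel(SketchModel):
    """Simple linear regression fitted by least squares."""

    def train(self, xs, ys):
        mx, my = mean(xs), mean(ys)
        numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        denominator = sum((x - mx) ** 2 for x in xs)
        self.slope = numerator / denominator
        self.intercept = my - self.slope * mx

    def predict(self, x):
        return self.slope * x + self.intercept


def train_and_predict(model, xs, ys, x):
    # The caller only relies on the shared interface, never on the concrete class.
    model.train(xs, ys)
    return model.predict(x)


xs, ys = [0, 1, 2], [1, 3, 5]
predictions = [train_and_predict(m, xs, ys, 3) for m in (MeanModel(), LineModel())]
```

Because `train_and_predict` only uses the shared methods, swapping one implementation for another requires no change to the calling code, which is the property the paragraph above describes.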
DFFML is also a tool for dataset generation. DFFML defines a Feature abstract base class which is responsible for generating feature data given a unique key.
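The following is a rough sketch of what "an abstract base class that generates feature data for a unique key" could look like. The names `SketchFeature`, `NAME`, `calc`, `KeyLength`, `DotCount`, and `generate_row` are all hypothetical and are not taken from DFFML's real Feature class, whose methods may differ.

```python
import abc


class SketchFeature(abc.ABC):
    """Hypothetical stand-in for a feature base class (not DFFML's real API)."""

    NAME = "sketch"

    @abc.abstractmethod
    def calc(self, key):
        """Return this feature's data for the given unique key."""


class KeyLength(SketchFeature):
    """Toy feature: the number of characters in the key."""

    NAME = "key_length"

    def calc(self, key):
        return len(key)


class DotCount(SketchFeature):
    """Toy feature: how many dots the key contains."""

    NAME = "dot_count"

    def calc(self, key):
        return key.count(".")


def generate_row(features, key):
    # One dataset row: every feature evaluated against the same unique key.
    return {feature.NAME: feature.calc(key) for feature in features}


row = generate_row([KeyLength(), DotCount()], "example.com")
```

A dataset would then be a collection of such rows, one per key. Real features would typically do more work per key, for example inspecting a Git repository identified by its URL.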
We currently have three project ideas. You can read about them and discuss them in their respective issues:
- GSoC 2019 Project Idea: File Source Compression. (Difficulty: easy)
- GSoC 2019 Project Idea: Labeled and Versioned Datasets. (Difficulty: intermediate)
- GSoC 2019 Project Idea: YOLO/darknet Model (Difficulty: hard)
If you've got a brilliant idea you'd like to propose, please make a new issue with the 'gsoc' and 'project' tags to discuss it! Students are also welcome to add "stretch goal" ideas to their application if they'd like to start with one of our ideas but have a few extra feature ideas of their own they'd like to work on at the end of the summer if everything stays on schedule. Take a look at the current open issues to see what users want. Issues where we've talked to someone who would use the result as part of a product or service for their business have the label 'customer'. Those are cool because we know they will get used!
- Follow the README and make sure you can run the TensorFlow and Git examples. Looking at the Travis CI may come in handy here.
- Run the tests. DFFML has unit tests which are at about 90% coverage (the percentage of lines of code tested) for the main library, the Git features, and the TensorFlow model. Make sure you know how to run them, and if you've never written Python unit tests before you might want to read up on Python's unittest library. Figure out how to run a single test! Running one test instead of all of them will speed up your workflow when you are writing your tests!
- Make your first contribution!
- Work on anything labeled good first issue.
- Help us increase the test coverage in any of the packages (check out the Python package 'coverage' to learn how to do this).
- Write a new feature! Features can do anything you want. They generate some data based on a unique key, so think of them like a scraper. See the new feature guide for more info. Make sure to include tests!
- Write a new model! Models are wrappers around any machine learning implementation or library. See the new model guide for more info. Make sure to include tests!
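For the advice above about running a single test, here is a small self-contained sketch using Python's standard unittest library. The test case `ExampleTests` and the helper `run_single` are made up for illustration and are not part of DFFML's test suite.

```python
import unittest


class ExampleTests(unittest.TestCase):
    """Hypothetical test case, used only to demonstrate test selection."""

    def test_length(self):
        self.assertEqual(len("dffml"), 5)

    def test_upper(self):
        self.assertEqual("gsoc".upper(), "GSOC")


def run_single(test_case, method_name):
    # Build a suite holding exactly one test method and run only that.
    suite = unittest.TestSuite([test_case(method_name)])
    return unittest.TextTestRunner(verbosity=2).run(suite)


result = run_single(ExampleTests, "test_length")
```

From the command line, unittest accepts a dotted path to a single test, in the form `python -m unittest package.module.TestClass.test_method`; the exact path depends on how the tests are laid out. With the 'coverage' package installed, `coverage run -m unittest discover` followed by `coverage report -m` is one common way to see which lines the tests exercise, assuming the tests are discoverable from the directory you run it in.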
Instructions on how to apply can be found on the Python GSoC website. Please don't forget to use our name (dffml) in your application title!
Most of our communication will take place in the issue tracker under the label 'gsoc'. Not sure where to ask? Try here!
IRC: Contact us using the main python-gsoc channel, #python-gsoc on freenode (How to connect). Note that all our developers are currently located in the US Pacific time zone.
Thanks to Terri for helping DFFML be a part of GSoC and for letting us copy the format she used for CVE Binary Tool, another awesome project with a security focus that's a part of GSoC 2019.