Wednesday 30 March 2016

Cool Data Science GitHub Repos



By: Gordon Fleetwood – ODSC data science team contributor


GitHub is the most popular home for open-source projects, and with the field's rapid rise over recent years a good portion of those projects are now Data Science related. Here are a few cool Data Science repos to check out.

TPOT is the brainchild of Randal Olson, a postdoctoral researcher at the University of Pennsylvania. It uses genetic algorithms to automate feature handling, model selection, and hyperparameter optimization in order to find the best pipeline for your data. It's built on top of scikit-learn, so the best model pipeline is delivered in a familiar package.
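To get a feel for the idea, here is a toy genetic search written with only the standard library. It is not TPOT's implementation or API, just a minimal sketch under some assumptions: the data is synthetic, a "pipeline" is reduced to a feature mask plus the `k` of a hand-rolled k-nearest-neighbours classifier, and every function name below is made up for illustration.

```python
import random

def make_data(rng, n=60):
    """Synthetic rows: two informative features followed by two pure-noise features."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = [rng.gauss(2.0 * label, 1.0), rng.gauss(-2.0 * label, 1.0)]
        noise = [rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0)]
        rows.append((informative + noise, label))
    return rows

def knn_accuracy(train, valid, mask, k):
    """Validation accuracy of k-NN using only the features switched on in mask."""
    if not any(mask):
        return 0.0
    def dist(a, b):
        return sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m)
    correct = 0
    for features, label in valid:
        nearest = sorted(train, key=lambda row: dist(row[0], features))[:k]
        votes = sum(lbl for _, lbl in nearest)
        correct += (1 if 2 * votes > k else 0) == label
    return correct / len(valid)

def random_candidate(rng, n_features):
    return ([rng.random() < 0.5 for _ in range(n_features)], rng.choice([1, 3, 5, 7]))

def crossover(rng, a, b):
    """Child takes each gene from one of its two parents at random."""
    mask = [rng.choice(pair) for pair in zip(a[0], b[0])]
    return (mask, rng.choice([a[1], b[1]]))

def mutate(rng, candidate):
    """Flip one feature on or off and occasionally resample k."""
    mask, k = list(candidate[0]), candidate[1]
    i = rng.randrange(len(mask))
    mask[i] = not mask[i]
    if rng.random() < 0.3:
        k = rng.choice([1, 3, 5, 7])
    return (mask, k)

def evolve(train, valid, generations=5, population_size=8, seed=0):
    rng = random.Random(seed)
    n_features = len(train[0][0])
    fitness = lambda c: knn_accuracy(train, valid, c[0], c[1])
    population = [random_candidate(rng, n_features) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]  # elitism: the top half survives
        children = [
            mutate(rng, crossover(rng, rng.choice(parents), rng.choice(parents)))
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best), history

data = make_data(random.Random(42))
train, valid = data[:40], data[40:]
best, score, history = evolve(train, valid)
print("best (feature mask, k):", best, "validation accuracy:", score)
```

Because the top half of each generation is carried over unchanged, the best score can never get worse from one generation to the next. A real tool searches a far larger space of preprocessing steps and models, and scores candidates with cross-validation rather than a single held-out split.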

There are a few Machine Learning engines out there in the shadows of the most popular frameworks like scikit-learn or caret. One of these is Leaf, a Rust project by the startup Autumn. Rust may not be a traditional Data Science staple, but the claims made in the readme about Leaf's foundations, flexibility, and operational capacity are intriguing enough to make its progress worth watching.

Some of the more popular Python libraries for topic modeling are gensim, TextBlob, pattern, and nltk. Into this crowded field comes topik, a topic modeling package from Continuum Analytics, the same people who brought Data Science the Anaconda package distribution. However, topik isn't out to compete with these packages. It is built upon them, and its goal is to provide a high-level interface for users. It's a great way to do topic modeling out of the box rather than going through the Lego-assembling process that the usual frameworks require.

explainr, an R package from Etsy's Hilary Parker, fits quite smoothly into the nominative determinism slot. The idea is simple. It serves as a wrapper around R objects, turning their results into full-length explanations of what the numbers mean. It's a great tool for those just getting into Data Science, and would probably be a nice reference for experienced practitioners as well.

Pivot tables are a mainstay of the Business Intelligence world, where Excel and Tableau rule the roost. There are interfaces for pivot tables in Python and R (pandas' implementation, for one), but they are a step below both Excel and Tableau in terms of ease of use. pivottablejs fills this gap beautifully for users of Python and the Jupyter notebook. Once you give the package a dataframe, you get an interactive interface for building a pivot table within the notebook. All you have to do is drag and drop.
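For comparison, this is roughly what the code-first route looks like with pandas' `pivot_table`. The `sales` data is invented for the example. The commented-out lines at the end show the pivottablejs call as I understand it from the project's readme; it only makes sense inside a Jupyter notebook, so treat it as an assumption rather than tested code.

```python
import pandas as pd

# Made-up sales records, purely for illustration.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "revenue": [100, 150, 200, 50, 25, 75],
})

# Rows are regions, columns are products, cells are summed revenue.
table = pd.pivot_table(
    sales,
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)
print(table)

# Drag-and-drop version, assuming pivottablejs is installed and you are in a notebook:
# from pivottablejs import pivot_ui
# pivot_ui(sales)
```

Every change of rows, columns, or aggregation means editing and rerunning the call, which is exactly the friction an interactive interface removes.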

What's going on inside deep neural networks? This highly effective application of Machine Learning is famously opaque, and efforts to make it less so are ongoing. One such attempt is tdb, a visual debugger for deep learning. It's built on top of TensorFlow and uses data visualizations to give the user a sense of how data is flowing through the network.

Quantopian's qgrid gives you extra flexibility in how you use your dataframes. Instead of filtering or sorting values using code, qgrid lets you do all of this by pointing and clicking. It even allows for interactive changing of values.

En garde, Monsieur! As unit testing is to software engineering, so is data validation to Data Science. Engarde is a library that aids in the validation process to make sure that your assumptions don't become headaches later on.
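The general pattern of stating your assumptions next to the code that depends on them can be sketched in plain Python. This is not Engarde's API: the decorators `require_no_missing` and `require_in_range` and the function `load_people` are hypothetical names, and the rows are plain dictionaries rather than dataframes to keep the sketch dependency-free.

```python
import functools

def require_no_missing(*columns):
    """Decorator: raise if any returned row has None in one of the given columns."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rows = func(*args, **kwargs)
            for i, row in enumerate(rows):
                for col in columns:
                    if row.get(col) is None:
                        raise ValueError(f"row {i}: missing value in {col!r}")
            return rows
        return wrapper
    return decorator

def require_in_range(column, low, high):
    """Decorator: raise if any returned row has a value outside [low, high]."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rows = func(*args, **kwargs)
            for i, row in enumerate(rows):
                if not low <= row[column] <= high:
                    raise ValueError(f"row {i}: {column!r}={row[column]!r} outside [{low}, {high}]")
            return rows
        return wrapper
    return decorator

# The innermost decorator runs first, so missing values are caught before the range check.
@require_in_range("age", 0, 120)
@require_no_missing("name", "age")
def load_people(raw_rows):
    return [dict(row) for row in raw_rows]

people = load_people([{"name": "Ada", "age": 36}, {"name": "Grace", "age": 85}])
print(len(people), "rows passed validation")
```

If a later data refresh sneaks in a missing name or an age of 300, `load_people` fails loudly at the point where the data enters the analysis instead of quietly skewing the results downstream.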

This R package seeks to standardize the setup for analysis projects. At the outset it generates a complete project directory, with folders and helper files for the various operations that pop up in a project. These include connecting to a database, data validation, loading data into memory, and even logging. The obvious benefit is the strength such a framework lends to the reproducibility of analyses.

These are just some of the interesting Data Science works on display through GitHub. We'll keep looking and post another set of notable projects soon.
