Wednesday 30 March 2016

Cool Data Science GitHub Repos



By: Gordon Fleetwood – ODSC data science team contributor

We’ll be addressing this topic and others just like it at our upcoming Big Data Science Conference, ODSC East. 100+ speakers, 30+ workshops and 15+ training sessions all under one roof.

GitHub is the most popular central storage space for open-source projects, a growing portion of which are Data Science related given the field's rapid rise in recent years. Here are a few cool Data Science repos to check out.

TPOT is the brainchild of Randal Olson, a post-doc researcher at the University of Pennsylvania. It uses genetic algorithms to automate feature preprocessing, model selection, and hyperparameter optimization in order to find the best pipeline for your data. It's built on top of scikit-learn, so the best model pipeline is delivered in a familiar package.
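
A rough sketch of how it is used (the class name and arguments have shifted a bit across TPOT versions, and the pre-split X_train/X_test/y_train/y_test arrays are assumed here):

    from tpot import TPOTClassifier

    # Evolve candidate scikit-learn pipelines with a small genetic-programming budget
    tpot = TPOTClassifier(generations=5, population_size=20, random_state=42)
    tpot.fit(X_train, y_train)            # searches and cross-validates pipelines
    print(tpot.score(X_test, y_test))     # scores the best pipeline found
    tpot.export('best_pipeline.py')       # writes that pipeline out as plain scikit-learn code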

There are a couple of Machine Learning engines out there in the shadows of the most popular frameworks like scikit-learn and caret. One of these is Leaf, a Rust project by the startup Autumn. Rust may not be a traditional Data Science language, but the claims made in the readme about Leaf's foundations, flexibility, and operational capacity are intriguing enough to keep a watch on its progress.

Some of the more popular Python libraries for topic modeling are gensim, TextBlob, pattern, and nltk. Into this crowded field comes topik, a topic modeling package from Continuum Analytics, the same people who brought Data Science the Anaconda package distribution. However, topik isn't trying to compete with these packages. It is built upon them, and its goal is to provide a high-level interface for users. It's a great way to do topic modeling out of the box rather than going through the Lego-assembly process that the usual frameworks require.
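
To see the kind of lower-level assembly topik abstracts away, here is a minimal LDA run using gensim alone (the three toy documents are made up for illustration):

    from gensim import corpora, models

    docs = [["whale", "ocean", "migration"],
            ["election", "vote", "district"],
            ["ocean", "shipping", "whale"]]

    dictionary = corpora.Dictionary(docs)                # map each token to an id
    corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words vectors
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
    print(lda.print_topics())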

This R package comes from Etsy's Hilary Parker, and fits quite smoothly into the nominative determinism slot. The idea is simple: it wraps R objects and turns their output into full-length explanations of what the numbers mean. It's a great tool for those just getting into Data Science, and would probably be a nice reference for experienced practitioners as well.

Pivot tables are a mainstay of the Business Intelligence world where Excel and Tableau rule the roost. There are interfaces for pivot tables in Python and R - pandas' implementation for one - but the interface is a step below that of either Excel or Tableau in terms of ease of use. pivottablejs takes care of this gap beautifully for users of Python and the Jupyter notebook. Once you give the package a dataframe you'll get an interactive interface to create a pivot table within the notebook. All you have to do is drag and drop.
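
The whole interface is essentially one function call. A minimal sketch, assuming a Jupyter notebook session (the CSV file name is just a placeholder):

    import pandas as pd
    from pivottablejs import pivot_ui

    df = pd.read_csv('sales.csv')   # any tidy DataFrame will do
    pivot_ui(df)                    # renders a drag-and-drop pivot table inside the notebook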

What's going on inside deep neural networks? This highly effective application of Machine Learning is famously opaque, and efforts to make it less so are ongoing. One such attempt is tdb, a visual debugger for deep learning. It's built on top of TensorFlow and uses data visualizations to give the user a sense of how data flows through the network.

Quantopian's qgrid gives you extra flexibility in how you use your dataframes. Instead of filtering or sorting values with code, qgrid lets you do all of this by pointing and clicking. It even allows for interactive changing of values.
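
A minimal sketch, assuming a notebook session with a DataFrame df already loaded (the setup steps have varied between qgrid releases):

    import qgrid

    qgrid.show_grid(df)   # interactive sorting, filtering, and in-place editing of df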

En garde, Monsieur! As unit testing is to software engineering, so is data validation to Data Science. Engarde is a library that aids in the validation process, making sure that your assumptions about the data don't become headaches later on.
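
Engarde leans on decorators that check the DataFrame a function returns. A hedged sketch (the column name, value range, and file name are made up for illustration):

    import pandas as pd
    import engarde.decorators as ed

    @ed.none_missing()                    # fail fast if any value is missing
    @ed.within_range({'age': (0, 120)})   # fail if 'age' falls outside a sane range
    def load_data(path):
        return pd.read_csv(path)

    df = load_data('patients.csv')        # raises an error as soon as an assumption breaks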

This R package seeks to standardize the setup of analysis projects. At the outset it generates a complete directory structure, subdivided into folders and helper files, for the various operations that pop up in a project: connecting to a database, validating data, loading data into memory, and even logging. The obvious benefit is the strength such a framework lends to the reproducibility of analyses.

These are just some of the interesting Data Science works on display through GitHub. We'll keep looking and post another set of notable projects soon.

Find this blog useful?

Help others find it by commenting and sharing.

Monday 28 March 2016

Influencing Data Visualization Forever


By: Gordon Fleetwood – ODSC data science team contributor

Data Viz, Big Data and other related topics will be discussed at our Data Science Workshops, training sessions and conferences. Take a look at some of Data Science’s most influential players.

Santiago Giraldo works for CartoDB and does a lot of data visualization work with geospatial data. One of his projects addressed gerrymandering, the manipulation of the boundaries of electoral constituencies to favor a political party. Specifically, he looked at this phenomenon in New York, and how it could be tied to income inequality.

Larry Buchanan's experience in visualization stretches back years before his current position at the New York Times, where he created the piece linked to above. Every subway stop in New York City is shown, along with the median income of the neighborhood surrounding it. It is both awesome and sobering to follow the rise and fall of each graph's points as the lines move through boroughs and neighborhoods.

The map shows the magnitude of people needing to take advantage of feeding programs across the world in 2013. The percentage of these people suffering from kwashiorkor, a disease characterized by a severe lack of protein, is highlighted.

Mr. Cherven is a Data Visualization Specialist at General Motors, and his talk at ODSC East will focus on the open-source tools available for visualizing complex networks. His work in this field is extensive, and much of its public face focuses on baseball.

Visualizing scientific concepts is extremely important in communicating their key ideas. This is where Bang Wong, Creative Director of the Broad Institute of MIT and Harvard, and others like him come in. Vis Skunkworks is one of his initiatives which seeks to provide clarity for concepts in genomics through visualization.

The visualization highlighted here, however, is a network of voting patterns in the United States House of Representatives in the Fall of 2014. With a click of a button you can see which votes passed or failed, and who on either side of the aisle voted for it.

Find this article useful?

Help others see it by commenting and sharing.

Wednesday 23 March 2016

The Pros and Cons of Deep Learning


 
We will be discussing Deep Learning and related topics at our Big Data Conferences, training sessions and workshops.

Deep learning has been all over the news lately. In a presentation I gave at Boston Data Festival 2013 and at a recent PyData Boston meetup I provided some history of the method and a sense of what it is being used for presently. This post aims to cover the first half of that presentation, focusing on the question of why we have been hearing so much about deep learning lately.

The content is aimed at data scientists who might have heard a little about deep learning and are interested in a bit more context. Regardless of your background, hopefully you will see how deep learning might be relevant for you. At the very least, you should be able to separate the signal from the noise as the media hype around deep learning increases.

Deep learning is a collection of statistical machine learning techniques used to learn feature hierarchies, often based on artificial neural networks. That's it. Not so scary after all.

For something that sounds so innocuous under the hood, there's a lot of rumble in the news about what might be done with DL in the future. Let's start with an example of what has already been done, to motivate why it is proving interesting to so many.

What does it do that couldn't be done before? We'll first talk a bit about deep learning in the context of the 2013 Kaggle-hosted quest to save the whales. The competition asks its participants the following question: given a set of 2-second sound clips from buoys in the ocean, can you classify each sound clip as having a call from a North Atlantic right whale or not?

The practical application of the competition is that if we can detect where the whales are migrating by picking up their calls, we can route shipping traffic to avoid them, a positive both for effective shipping and whale preservation.

Find this blog useful?

Help others read it by commenting and sharing.

Monday 21 March 2016

7 Important Model Evaluation Error Metrics Everyone Should Know


Finding the right model for your prediction problem is important, but it depends heavily on the metric you use to assess the quality of the predictions and the predictive power of the model. In this article, Tavish Srivastava demystifies 7 evaluation metrics: their definition, their usage, and their influence on model selection.

The metrics include the classics: RMSE, ROC-AUC, and the confusion matrix; alternative metrics such as the Gini coefficient and gain chart, which are frequently used in Kaggle competitions; and less common ones such as the Kolmogorov-Smirnov chart and the concordant-discordant ratio.

A very well illustrated article that offers a good introduction to the importance of the chosen metric for model selection.
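
For the classics, scikit-learn already covers most of the ground in a few lines (the labels and probabilities below are made up):

    from sklearn.metrics import mean_squared_error, roc_auc_score, confusion_matrix

    y_true = [0, 1, 1, 0, 1]
    y_prob = [0.2, 0.8, 0.6, 0.4, 0.9]          # predicted probabilities
    y_pred = [int(p > 0.5) for p in y_prob]     # thresholded class predictions

    print(mean_squared_error(y_true, y_prob) ** 0.5)   # RMSE
    print(roc_auc_score(y_true, y_prob))               # ROC-AUC
    print(confusion_matrix(y_true, y_pred))            # confusion matrix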

Classifying Bees With Google TensorFlow

The Bees Classifier Metis Challenge on DrivenData.org consisted of predicting the type of bee appearing in a set of 4,000 images. Given the set of images, it was up to the participants to build their own set of features using image processing techniques.

In this article, Philippe Dagher, a data scientist and Kaggler, builds a basic Google TensorFlow model to determine the genus, Apis (honey bee) or Bombus (bumble bee), based on photographs of the insects. It is a good coding example of how to apply Google TensorFlow to a real-life dataset.
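
To give a rough sense of what such a model looks like, here is a minimal softmax classifier written in the TensorFlow 1.x style of the period; the image size, two-class setup, and the train_images/train_labels arrays are assumptions, not the author's actual code:

    import tensorflow as tf

    # Flattened 48x48 grayscale images, two classes: Apis vs. Bombus
    x = tf.placeholder(tf.float32, [None, 48 * 48])
    y = tf.placeholder(tf.float32, [None, 2])

    W = tf.Variable(tf.zeros([48 * 48, 2]))
    b = tf.Variable(tf.zeros([2]))
    logits = tf.matmul(x, W) + b

    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_step, feed_dict={x: train_images, y: train_labels})   # one gradient step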

Find this blog useful?

Help others see it by commenting and sharing.

Learn more about this topic and others like it at our Data Visualization Conference, trainings and workshops.

References:

http://nasdag.github.io/blog/2016/01/19/classifying-bees-with-google-tensorflow
http://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics
 

Wednesday 9 March 2016

The Library of Python Data Sets




Python’s move towards being a language for data analysis has seen it copy many features from R, a language that was designed for dealing with data.

Such features include the dataframe, the statsmodels package for building linear models, and the Python port of ggplot2. The latest addition to this list is the PyDataset package, a resource modeled on the data sets that come pre-packaged with R.
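
Usage boils down to a single function (a quick sketch; 'iris' is one of the bundled R datasets):

    from pydataset import data

    data()                 # lists every bundled dataset with a short description
    iris = data('iris')    # loads R's iris data as a pandas DataFrame
    print(iris.head())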

7 Mistakes to Avoid in Machine Learning

It's fairly straightforward to get started with Machine Learning thanks to the availability of several superb open-source APIs. Mastery of the subject, however, can only be achieved by deepening one's knowledge.

One such facet involves learning how to deal with the assumptions and drawbacks of the various algorithms being used. In a post for KDnuggets, ex-Google engineer Cheng-Tao Chu goes into seven mistakes to avoid for the aspiring Machine Learning expert.

Among his seven points, Chu talks about picking an evaluation metric that fits the domain in which your model is being applied, being cognizant of outliers and handling them carefully, and avoiding models that tend to overfit when the number of features exceeds the number of data points.
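
That last point is easy to reproduce. In the toy sketch below (ours, not Chu's), an unpenalized linear model fits pure noise perfectly once the features outnumber the rows, while a regularized one does not:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(20, 100)   # 20 rows, 100 features: far more features than data points
    y = rng.randn(20)        # the target is pure noise

    print(LinearRegression().fit(X, y).score(X, y))   # ~1.0: a perfect fit to noise
    print(Ridge(alpha=10.0).fit(X, y).score(X, y))    # a much lower training R^2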

Find this article useful?

Help others find it by sharing and commenting below.

Learn more about our Data Science Conference, speakers and workshops. Hurry, some discount tickets are still available.

Monday 7 March 2016

Analyzing Data with Salt Viz Library




There are many data visualization libraries available for the data scientist to create narratives: Plotly, matplotlib, seaborn, ggplot2, D3.js, … the list is long. However, when the data gets large and memory gets low, these libraries struggle to keep up.

Salt is a visualization library that leverages Apache Spark to create big-data visualizations. It is built around two concepts:

1) Dimension reduction: transforming the data space into a smaller visualization space.

2) Data aggregation: values in the visualization space which are close to each other are grouped via a collection of seven sample aggregators.

These simple concepts allow for the creation of a variety of powerful big-data visualizations, and several well explained examples are available in the GitHub repository. You will need Docker, a Java compiler, Gradle (a build automation tool), and Node + npm installed locally in order to run these examples.
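
Salt itself is a Scala library for Spark, but its two concepts are easy to illustrate in a few lines of plain Python (a conceptual sketch only, not Salt's API): bin raw points onto a small grid, then aggregate whatever lands in each cell.

    import numpy as np

    points = np.random.rand(1000000, 2)   # one million raw (x, y) points
    values = np.random.rand(1000000)      # a value attached to each point

    # Dimension reduction: map every point onto a 256x256 visualization grid
    bins = (points * 256).astype(int).clip(0, 255)

    # Data aggregation: sum the values that fall into the same grid cell
    grid = np.zeros((256, 256))
    np.add.at(grid, (bins[:, 0], bins[:, 1]), values)   # grid can now be rendered as a heatmap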

Did you find this blog post useful?

Help others see it too by commenting and sharing below.

Also, check out our latest Disruptive Data Science Conference, speakers and workshops.