Friday 15 April 2016

Big Data and the Internet of Things

By Melanie Jutras


It seems hard to believe, but Big Data is about to become even bigger. This is due, in part, to the growing volume of data produced by the Internet of Things (IoT), which could involve as many as 50 billion connected devices by the year 2020.(1) Consider the staggering amount of data that many devices might generate. How can we access it, how can we analyze it, and how can we put it to good use? Dr. Kirk Borne, one of the most knowledgeable Data Science speakers in the world, will address these questions at ODSC East, where he will share his views on open data and how it can be put to work.

Learn more about this topic and others just like it at one of our upcoming Open Data Science Workshops, trainings and tutorials.

Although Dr. Borne is an expert in the field of Data Science, he feels strongly that people in every profession need to become data literate.(2) Everybody needs to understand what Big Data is and how it can be used. With the Internet of Things about to unleash a surge of new data, data literacy and user-friendly tools will only become more important.

When we talk about the Internet of Things, people tend to think of personal devices and gadgets used by individuals. While gadgets such as wearable technology and smart-home monitoring devices are part of the picture, they barely scratch the surface. The IoT is essentially made up of sensors that can be placed anywhere to collect data, in industries ranging from agriculture to automotive to retail.

Among other things, these sensors will be used for security, monitoring and automation. If you stop to think about how much data 50 billion connected devices might produce, you soon realize that the Internet of Things is not really about the devices themselves; it is about the enormous stream of data these “things” will generate.

A recent blog post by Dr. Vincent Granville highlights a number of sensor data set repositories.(3) Data sets collected across many sectors, including energy, healthcare, weather and transportation, are available for viewing and analysis. One of these, published by Microsoft Research, provides 15 million data points collected from taxi cab sensors in order to research driving directions.(4)

More information on this data can be found in the research paper T-Drive: Driving Directions Based on Taxi Traces.(5) Another data set consists of 160 million observations recorded by 20,000 weather stations, published on datahub by the Linking Open Data Cloud organization.(6) These are just a couple of examples of the data being collected every day and made available for analysis.

While it is powerful to be able to gather millions of sensor readings, the next obvious problem is dealing with them. How does one go about managing and analyzing such a large data set? There are various products available for analyzing and visualizing enterprise data; a list of some of them can be found in a recent Data Science Central blog post, Eight IoT Analytics Products.(7)
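Before reaching for a dedicated platform, though, a Data Scientist can get surprisingly far with open-source tooling. The sketch below is purely illustrative: it assumes a hypothetical CSV export of sensor readings (the file name and column names are made up) and shows one common pattern for aggregating a file too large to fit in memory, processing it in chunks with pandas.

```python
import pandas as pd

# Hypothetical export of a sensor data set; the file name and column
# names are illustrative, not taken from the repositories cited above.
CSV_PATH = "weather_observations.csv"

# Stream the file in chunks so tens of millions of rows never have to
# sit in memory at once, keeping running sums and counts per station.
totals = {}
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    grouped = chunk.groupby("station_id")["temperature"].agg(["sum", "count"])
    for station, row in grouped.iterrows():
        s, c = totals.get(station, (0.0, 0))
        totals[station] = (s + row["sum"], c + row["count"])

# Combine the partial aggregates into a final mean per station.
mean_temperature = {station: s / c for station, (s, c) in totals.items()}
```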

Among the products highlighted are popular commercial offerings such as Dell Statistica and the IBM IoT Platform. These are valuable for professionals in many fields who are not data science researchers but still need to work with Big Data. Another product on the list is Intel's IoT Analytics Platform.

Intel's IoT cloud analytics site is provided as a service to the IoT development community, and its Internet of Things group has played a key role in an ongoing project to build an open, cloud-based platform to accelerate cancer research.(8) This is just one example of using data for social good. We all ought to be thinking about what kinds of questions Big Data can answer and how we can put it to good use. With the amount of data expected from connected devices, the possibilities for analysis are endless.

Find this blog useful?

Help others read it by commenting and sharing.

  1.     Zdnet.com/article/the-internet-of-things-and-big-data-unlocking-the-power/
  2.     Searchdatamanagement.techtarget.com/feature/Kirk-Borne-on-data-science-and-big-data-analytics-data-literacy
  3.     Datasciencecentral.com/profiles/blogs/great-sensor-datasets-to-prepare-your-next-career-move-in-iot-int
  4.     Research.microsoft.com/apps/pubs/?id=152883
  5.     Research.microsoft.com/en-us/projects/tdrive/
  6.     Datahub.io/dataset/knoesis-linked-sensor-data
  7.     Datasciencecentral.com/profiles/blogs/eight-iot-analytics-products
  8.     Eweek.com/cloud/intel-unveils-analytics-technologies-for-big-data-iot.html

Monday 4 April 2016

Jupyter Developer Meetings


By: Gordon Fleetwood – ODSC data science team contributor

The IPython/Jupyter notebook is a staple of the Data Scientist's toolbox thanks to its great visual and practical functionality. It turns out the minds behind this platform hold regular developer meetings, and the recordings are available to watch on YouTube.

It's a fascinating behind-the-scenes look at the additions the developers are working on. Some of these include the nascent Apache Toree project, splitting one notebook into two with the click of a button, easily injecting code from one notebook into another, adding to-do lists, and the exciting option to turn Jupyter notebooks into a dashboard or a web app.

This last idea could be a game changer, especially in the Python ecosystem. Unlike R, which has Shiny as a native framework for building data-centric web apps, Python users have to adapt other tools.

Candidates include the general-purpose web frameworks Django and Flask, as well as the more Shiny-esque DataSpyre and Pyxley, neither of which seems to have caught on widely. If Jupyter notebooks could become web apps with the click of a button, Shiny might finally find a worthy rival in the space.

While injecting code from another notebook may not be as superficially attractive as web apps, it would add important flexibility to a Data Scientist’s workflow. The ultimate goal seems to be enabling a person to import another notebook as easily as one would import a package.

Such functionality would allow different parts of analyses to be linked together seamlessly and allow for greater modularity.
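There is no official one-click way to do this yet, but a rough approximation is already possible with the nbformat library: read another notebook and execute its code cells in a namespace. The sketch below is just that, an approximation; the notebook name and the helper function it supposedly defines are hypothetical, and it assumes the imported notebook only defines functions and variables rather than running heavy computations.

```python
import nbformat

def run_notebook_cells(path, namespace=None):
    """Execute the code cells of a notebook, roughly 'importing' it."""
    namespace = {} if namespace is None else namespace
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == "code":
            # Run each code cell in order, collecting its definitions.
            exec(cell.source, namespace)
    return namespace

# Hypothetical usage: pull helper functions defined in another notebook.
helpers = run_notebook_cells("data_cleaning.ipynb")
clean_dataframe = helpers["clean_dataframe"]  # assumes that notebook defines it
```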

Learn more about this topic and others like it at our Open Data Science Workshops, training sessions and conferences.

Find this blog useful?

Help others find it by commenting and sharing.

Friday 1 April 2016

Great Data Science Books


By: Gordon Fleetwood – ODSC data science team contributor

Writing any book is a momentous task. At ODSC, we have been fortunate to have speakers at our Big Data Science Conference who have completed this task and added to the rich library of Data Science literature. Here are a few of these books.

Applied Predictive Modeling, Max Kuhn

Max Kuhn is a superstar in the R world, known for his creation of the caret library, an all-purpose package that is R's equivalent of scikit-learn. Applied Predictive Modeling is a tome dedicated to every aspect of the model-building process, from data pre-processing and feature engineering to model selection. All of the theory is accompanied by R code snippets showing the practical application of each concept through caret.
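Since caret is described above as R's equivalent of scikit-learn, a loose Python analogue of the workflow the book walks through might look like this. This is a minimal sketch using scikit-learn's bundled iris data, not an example from the book.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pre-processing and model bundled into one pipeline, then evaluated
# with cross-validation -- the same shape of workflow caret automates in R.
X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```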

Think Bayes, Allen Downey

Bayesian statistics is at the heart of Data Science, and Think Bayes is a superb way to go from Bayesian apprentice to Bayesian master. The thoroughness and clarity that Mr. Downey has brought to his numerous talks over the years are on display here in written form.
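As a taste of the style of reasoning the book teaches (this toy example is not taken from the book), here is a Bayesian update estimating a coin's bias over a discrete grid of hypotheses:

```python
# Candidate values for P(heads) and a uniform prior over them.
hypotheses = [i / 100 for i in range(101)]
posterior = [1 / len(hypotheses)] * len(hypotheses)

# Made-up observations; each flip re-weights the hypotheses by likelihood.
for flip in ["H", "H", "T", "H"]:
    likelihoods = [p if flip == "H" else 1 - p for p in hypotheses]
    posterior = [po * li for po, li in zip(posterior, likelihoods)]
    total = sum(posterior)
    posterior = [po / total for po in posterior]  # renormalize

# Posterior mean estimate of the coin's bias after the data.
print(sum(p * po for p, po in zip(hypotheses, posterior)))
```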

Python for Data Analysis, Wes McKinney

Once there were only R dataframes, and Python was left in the dark with only the csv module for company. Then came Wes McKinney, and dataframes arrived in Python through the pandas library. Python for Data Analysis is written around pandas and the operations it supports on the dataframes it produces. Mr. McKinney covers data cleaning, visualization and aggregation, all the way through to time series. It's pretty comprehensive, to say the least.
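For readers who have not yet touched pandas, the split-apply-combine pattern the book spends a lot of time on looks roughly like this (the data below is made up):

```python
import pandas as pd

# A toy dataframe of made-up daily sales records.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "day": pd.to_datetime(["2016-04-01", "2016-04-02",
                           "2016-04-01", "2016-04-02", "2016-04-03"]),
    "revenue": [120.0, 98.5, 210.0, 187.3, 199.9],
})

# Group by store and summarize revenue: the split-apply-combine pattern.
summary = sales.groupby("store")["revenue"].agg(["mean", "sum"])
print(summary)
```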

R for Everyone: Advanced Analytics and Graphics, Jared Lander

Jared Lander is another name R users will be instantly familiar with. R for Everyone lives up to its name by starting from the very basics of the language and working its way up to advanced usage such as running statistical tests, building models, and making R packages. A companion video series is the icing on the cake of this comprehensive look at one of Data Science's most popular languages.

ODSC conferences give attendees the opportunity to speak with authors and have their books signed. Don’t miss out at ODSC East!

Find this blog useful?

Help others find it by commenting and sharing.