Wednesday 9 November 2016

Here is What President Trump and Biased Polls Can Show Us About Data



Donald Trump is, of course, correct. Survey polls are biased. 

Bias is certainly nothing new to statisticians, who have been trained to correct for it since their Stats 101 class. In fact, pollsters go to great lengths to explain their polling methodology and the statistical bias in their results.

Despite the fact that a lot of effort and quite a bit of science go into correcting for bias, charges of poll manipulation in favor of one party over another are frequent. Reputable pollsters, cognizant of this scrutiny, are transparent about the methodology they employ to correct for bias. Statistical bias in polling takes many forms, including response bias, undersampling, and nonresponse bias. More recent issues have arisen around the mix of cell phone and landline users in samples.

Take a recent ABC/Post poll that gave Hillary Clinton a 12-point lead, 50% to 38%. The fine print of that poll, at the beginning of page 5, identified party divisions at 36% Democrats, 27% Republicans, and 31% Independents. Quite a few people took issue with this poll, arguing that it had too many Democratic respondents and was therefore unrepresentative of registered voters.

Critics call for pollsters to weight the poll by party affiliation. However, these skeptics miss a key point. Reputable pollsters don’t weight their survey data by party identification, and for good reason. Pollsters adjust for demographic items to ensure a survey sample does not under- or over-represent a demographic such as age, race, or location. Demographic items should be verifiable (against census data, for example). Party affiliation, however, is not a demographic but something the poll seeks to measure, and thus should not be used for weighting.
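To make the idea of demographic weighting concrete, here is a minimal sketch of one common approach, post-stratification on a single variable. It is not any particular pollster’s method. The sample, the age groups, the census shares, and the function names are all hypothetical and made up for illustration.

```python
from collections import Counter


def poststratification_weights(respondents, population_shares):
    """Return one weight per respondent so weighted group shares match the population."""
    n = len(respondents)
    counts = Counter(r["age_group"] for r in respondents)
    # weight = (share of the group in the population) / (share of the group in the sample)
    return [
        population_shares[r["age_group"]] / (counts[r["age_group"]] / n)
        for r in respondents
    ]


def weighted_share(respondents, weights, candidate):
    """Weighted fraction of respondents who picked `candidate`."""
    total = sum(weights)
    support = sum(w for r, w in zip(respondents, weights) if r["choice"] == candidate)
    return support / total


# Hypothetical sample of 100 people in which younger respondents are under-represented.
sample = (
    [{"age_group": "18-44", "choice": "A"}] * 15
    + [{"age_group": "18-44", "choice": "B"}] * 5
    + [{"age_group": "45+", "choice": "A"}] * 32
    + [{"age_group": "45+", "choice": "B"}] * 48
)
# Hypothetical census shares (invented numbers, not real census data).
population_shares = {"18-44": 0.5, "45+": 0.5}

weights = poststratification_weights(sample, population_shares)
raw = weighted_share(sample, [1.0] * len(sample), "A")
adjusted = weighted_share(sample, weights, "A")
print(f"raw support for A: {raw:.3f}")
print(f"weighted support for A: {adjusted:.3f}")
```

In this toy sample the raw share for candidate A is 0.47, and it rises to about 0.575 once the under-represented age group is weighted up. Note that the weights are built only from a verifiable demographic (age group), never from the answer the poll is trying to measure.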

Identifying likely voters is another area of contention. Pollsters rely on registered voter surveys early in the election cycle and typically switch to likely voters around September, when respondents are more likely to know whether they will actually vote and for whom. Pollsters contend that the pool of those who vote is not typically representative of the total eligible population, hence the need to determine likely voters. This is a difficult undertaking for a number of reasons, including respondents who do not actually vote and the a priori judgments involved. And so likely voter models are another area where critics say polls are skewed.

Response bias is also a well-known phenomenon that reputable pollsters account for with practices such as excluding leading questions. A response bias getting extra attention this election season is social desirability bias. We have a tendency to present ourselves in a favorable light, even to perfect strangers cold-calling us, so respondents may give socially desirable answers. Many have argued that Donald Trump’s candidacy has increased response bias in this presidential election.

Consequently, despite reputable pollsters’ best efforts, party affiliation, likely voters, and response bias are just a few of the areas blamed for ‘rigged’ polls. So, what does all this teach you about your data? Well, there are quite a few lessons to be drawn.

Primarily, at some point in your career, expect your peers, your boss, or the public to question your data sources and any biases contained therein.

Reputable pollsters have long understood the importance of transparency regarding their data sources, their data collection techniques, and the biases inherent in the models and data they employ. The primary lesson for any data scientist is that reputable pollsters are transparent about their data and apply strict principles of disclosure. Data scientists should consider adhering to standards of data quality to ensure that the data collected ‘correctly represents the real-world construct to which it refers’.

The era of big data compounds this problem rather than solving it. More and faster data collection means that pitfalls can occur more frequently. Don’t fall into the trap of believing that big data, or n=all, will reduce your bias problem. Pollsters also know that size isn’t everything.

Polls typically rely on 1,000 or fewer respondents. The important rule in sampling is not how many respondents are polled but how pollsters select them, using techniques such as random sampling. Some questions, though not all, are best answered with “small data” before scaling up to big data. Outlier bias is particularly common in big data because the bigger the dataset, the harder it is to find outliers. Note that correcting these anomalies may make sense in some cases, but not when the outliers are what you are looking for, as with manufacturing defects.
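The point that selection matters more than size can be illustrated with a short simulation. This is a rough sketch under stated assumptions, not a model of any real poll. The population, the response rates, and the function names are hypothetical, and the margin of error uses the textbook formula for a simple random sample at roughly 95% confidence (z = 1.96).

```python
import math
import random


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)


def simulate(seed=0):
    """Compare a small random sample with a much larger but self-selected one."""
    rng = random.Random(seed)
    # Hypothetical population of 200,000 people, exactly half of whom support a candidate.
    population = [1] * 100_000 + [0] * 100_000
    # Small data: 1,000 people chosen uniformly at random.
    small = rng.sample(population, 1_000)
    # Big data: supporters are assumed to be twice as likely to respond (60% vs 30%).
    big = [x for x in population if rng.random() < (0.6 if x == 1 else 0.3)]
    return sum(small) / len(small), sum(big) / len(big), len(big)


small_estimate, big_estimate, big_n = simulate()
print(f"margin of error for n=1,000: +/-{margin_of_error(1_000):.3f}")
print(f"random sample of 1,000 estimates support at {small_estimate:.3f}")
print(f"self-selected sample of {big_n:,} estimates support at {big_estimate:.3f}")
```

With n = 1,000 the margin of error is about 3 percentage points, which is why polls of that size are common. The self-selected sample is roughly ninety times larger, yet its estimate lands near two thirds rather than the true one half, because the extra volume only makes the biased answer more precise.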

Another lesson is that new sources of data will create new biases. A January 2016 Federal Trade Commission report highlighted how the era of big data can lead to bias against certain demographics, such as low-income and underserved populations, due to their inclusion in or exclusion from large data sets. This suggests that data scientists need to account for their models’ bias. The report notes, “If the process that generated the underlying data reflects biases in favor of or against certain types of individuals, then some statistical relationships revealed by that data could perpetuate those biases.”

Social media may seem like a panacea for our unwillingness to take part in interview-based polls, but new data sources contain new forms of bias that need to be, at the very least, explained and, ideally, quantified and corrected for. Take unstructured data and, specifically, social media. Though sentiment analysis using Twitter continues to improve, social media also has inherent bias issues. Millennial bias (a skew toward a younger audience), influencer bias, and access bias are but a few.

Data scientists are often accused of being enamored with models and not heeding data quality and data bias issues. Polls are a good example of the scrutiny and criticism data sets are subject to. As data science permeates more aspects of our lives, expect the public to rightly question your data sources and quality. When a Donald Trump questions your data and your methods, how well prepared are you to answer?

As you advance in your career you will put greater emphasis on connecting with fellow data scientists who are in the trenches and can guide you on what coding languages, tools, and practices they find useful.

Applied data science conferences such as ODSC West are an excellent way to accomplish and accelerate this goal. ODSC events give you the opportunity to connect with your peers, and learn the latest languages, tools, and topics associated with programming for data science. You also get to hear and learn from some of the top coders who brought you your favorite open source tools and libraries.

This blog was originally posted here: https://www.opendatascience.com/blog/what-donald-trump-and-biased-polls-teach-us-about-data/

Thursday 20 October 2016

Data Science: Navigating New Frontiers


This post focuses on the use of big data as a tool to handle, process, and analyze the large amounts of data now available to companies.

At a data science event I recently attended, different speakers gave varying perspectives on issues facing companies trying to analyze data and generate profit.

The first speaker, himself a board director, described how board directors are sometimes hesitant to recognize the value that a data scientist can bring to a company.

He mentioned that a lack of scientific knowledge on the boards’ side makes this problem common, and he suggested ways to improve cooperation between boards and data science teams so that critical decisions are passed from the latter to the former.

The second speaker, the tech leader of a data science team, presented how his team follows an agile methodology to tackle his company’s critical business problems. He also described specific characteristics that are really valuable for a data scientist in the current era.

The final speaker gave some examples of good and bad cooperation between business departments and data science teams and suggested several things that could improve performance and the results delivered.

All in all, the whole meetup was really interesting, because the speakers were from different backgrounds and managed to touch the field of Data Science from completely different perspectives.

Friday 9 September 2016

Data Viz and Climate Change

With the passing of Labor Day, another summer is on the books, and unfortunately the planet’s streak of record-breaking summers every year this century continues. July was the hottest month ever in recorded human history, and it looks like 2016 will most likely go down as the hottest year on record.

With climate change’s increasingly adverse impact on the environment, data is playing a key role in understanding and demonstrating the effect of climate change on our world. Climate change data visualization has come a long way from the famous “hockey stick” chart and now uses a multitude of high-tech tools and mediums to help communicate the gravity of this issue.

One work of temperature data visualization that went viral this summer is the following GIF which shows the change in temperature for each month for every year going back to 1850. What makes this display so impactful is that you witness the drastic change in temperature over the course of 166 years in a dozen seconds.

The rings denoting 1.5 and 2.0 degrees Celsius inform the audience of the scale of the change and show that those two seemingly small numbers would mean grave consequences for the Earth if we ever reached them.

The graphic was created by climate scientist Ed Hawkins of the National Centre for Atmospheric Science at the University of Reading and was roundly praised in media outlets such as Gizmodo and Mashable.



Last month, the New York Times published an eye-opening article about future temperatures in a world altered by climate change. The Times created a series of maps (shown below) that display how much of the country experiences days of 100°F or more at certain points in time.

It begins with the following map, which shows the historical averages. Based on past data, we can see that only a fraction of the country is subject to triple-digit temperatures five or more days per year. It’s worth noting that everywhere east of the Mississippi River is shaded grey.


The following two maps demonstrate our grim and scorching future.


In about 45 years, we can expect that the majority of the country will experience at least five triple-digit days per year, with huge swathes of the eastern half of the country now in the orange shade. In 2060, the cities of Phoenix, Las Vegas, and Pecos will all have more triple-digit days than there are days in summer.

And what should we expect for the year 2100?



In the year 2100, 100-degree weather will be a common occurrence in almost every single square mile of America. Many major cities in America can expect at least a month or two’s worth of triple-digit weather.

These visualizations do an excellent job of hitting home the dystopic realities climate change holds for mankind. There’s a data viz lesson to be learned from the simple yet extremely effective aesthetic expressed in these graphics.
