Pornhub.com: Terrible Data

Some years ago, Pornhub.com was asked to gather and analyze data for BuzzFeed to answer a seemingly simple yet statistically involved question: “Who watches more porn, the Reds (Republicans) or the Blues (Democrats)?” Leaving aside whether this was the best use of anyone's time, BuzzFeed relied entirely on the prowess of Pornhub’s statisticians and declared that, according to the analysis, the Democrats took the lead with a 13% difference. It turned out that the data mining process was flawed, and it produced a number of plausible-looking insights that were later shown to be inaccurate.

Here are a few of the mistakes they made:


  • The first step was to obtain the state-wise distribution of Republicans versus Democrats. You would think this could come from the percentage of voters for each party in each state. Instead, they initially assumed that because a state had been won by the Democrats (or the Republicans), all of its porn viewership should be counted towards the winning party. It is easy to see why this is flawed, and it was corrected in time.
  • No company, in the porn business or otherwise, can violate voters' basic right to anonymity in order to gather data. Pornhub therefore collected viewing data that carried no information about who each viewer was or how they voted. This is the first point at which they should have realized their mistake: the number of people watching porn in a state has nothing to do with the number of people who go to vote, so state-level totals cannot say how any individual viewer voted. Trying to draw a correlation between the two is a wasted effort.
  • The technology used to gather data directly affects its accuracy, and indirectly affects how much effort is needed to clean it. A good data scientist always accounts for the uncertainty in the input data, even when it comes from bleeding-edge technology. The data gathered by Pornhub.com was based on its repository of IP addresses, which were turned into locations using a technique called geocoding. Geocoding an IP address gives only a “likely” location, because networks such as VPNs report the location of their own servers rather than that of the user. Even if we assume that very few people use VPNs, proxies and similar relays have the same effect. (Incognito mode in a browser, often assumed to help here, does not hide the IP address.) The most surprising result, once the data was laid out, was that Kansas appeared to have the most porn viewers. The reason was that when geocoding fails to find a precise location, it returns a vague “middle of the USA” location, and that is where Kansas sits geographically.
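
The Kansas problem is easy to check for once you know it exists. Below is a minimal Python sketch of that check, not what Pornhub or BuzzFeed actually ran. It assumes each record is a (state, latitude, longitude) tuple and that failed lookups all land on a single default point. The `FALLBACK` coordinate, the helper names and the sample records are made up for illustration; a real geolocation database may use a different default.

```python
from collections import Counter

# Assumed fallback coordinate: roughly the geographic centre of the
# contiguous United States, which falls in Kansas. This value is an
# illustration only; check your geolocation vendor's documentation.
FALLBACK = (39.83, -98.58)


def is_fallback(lat, lon, tol=0.01):
    """Return True if a point lies within `tol` degrees of the fallback."""
    return abs(lat - FALLBACK[0]) <= tol and abs(lon - FALLBACK[1]) <= tol


def count_by_state(records, drop_fallback=True):
    """Count views per state, optionally skipping unresolved lookups.

    `records` is an iterable of (state, lat, lon) tuples.
    Returns (per-state Counter, number of skipped records).
    """
    counts = Counter()
    unresolved = 0
    for state, lat, lon in records:
        if drop_fallback and is_fallback(lat, lon):
            unresolved += 1
            continue
        counts[state] += 1
    return counts, unresolved


# Made-up sample: two genuine Kansas lookups and three failed lookups
# that the geocoder placed at the default point.
records = [
    ("KS", 39.05, -95.68),   # Topeka
    ("KS", 37.69, -97.34),   # Wichita
    ("KS", 39.83, -98.58),   # fallback
    ("KS", 39.83, -98.58),   # fallback
    ("KS", 39.83, -98.58),   # fallback
    ("NY", 40.71, -74.01),
    ("CA", 34.05, -118.24),
]

naive, _ = count_by_state(records, drop_fallback=False)
cleaned, unresolved = count_by_state(records)
print(naive["KS"], cleaned["KS"], unresolved)
```

On this toy sample the naive count credits Kansas with five views, while the cleaned count keeps two and sets three aside as unresolved. Reporting the unresolved total separately, rather than silently dropping it, keeps the uncertainty visible to whoever reads the analysis.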

In conclusion, bad data practices lead to ineffective data analysis, and most of them can be prevented by not letting data take the place of common sense. Data can aid an analysis, but it should not become the only thing the analysis relies on.

In case you're wondering how old this is: it happened in 2014.

References


  • https://source.opennews.org/en-US/learning/distrust-your-data/
