
Showing posts from June, 2015

Data Types in R

For understanding data types in any language, it often helps to identify the most elementary type, because the other types are usually built on top of it. In R, vectors are the most elementary, so let's define a vector first. Vectors - They store ordered data (data where order matters) of the same base type, i.e. you can store 1 and 3.4 in one vector because they share the same base type - numeric. There are other base types like logical (TRUE, FALSE) and character ("a", "b"). If that's confusing, think about coordinates in a plane or in space. The order of the numbers used to specify a coordinate matters (i.e. (2,1) is different from (1,2)), and hence we would use a vector to store things like that. Arrays (multi-dimensional) - Arrays are the most common data type in other languages, but not in R. R stores arrays as vectors with two other parameters - the number of dimensions and names f...
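The ideas above can be sketched in a few lines of R (the values are illustrative, not from the post):

```r
# Vectors hold ordered values of a single base type.
v <- c(1, 3.4)           # numeric vector: both values share base type "double"
typeof(v)                # "double"
flags <- c(TRUE, FALSE)  # logical vector
chars <- c("a", "b")     # character vector

# Mixing base types coerces everything to a common type:
mixed <- c(1, "a")
typeof(mixed)            # "character" -- the 1 became "1"

# An array is stored as a vector plus extra attributes (dim, dimnames):
a <- array(1:6, dim = c(2, 3))
dim(a)                   # 2 3
a[2, 3]                  # 6 (R stores arrays in column-major order)
```

The coercion example is worth noting: because a vector must have one base type, R silently converts mixed inputs rather than raising an error.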

Data Mining and R/Rattle : First Experiment

Data mining is the activity of harnessing useful insights from vast amounts of data based on a certain model. Rather than one particular model or method of analysis, data mining is context-based: it may employ multiple models and can work on a variety of data sources including text, audio, video and images. Data mining is an ongoing process, and scenarios where the development of the data model stagnates are relatively few, because the mined insights act as feedback for a better model; in that sense the process is much like agile software development. The role of a data miner in an organization begins with an understanding of the domain and of the data itself. This understanding is crucial to defining and refining the model that is the heart of the mining process. A model is constructed and then critiqued by domain experts and data experts. This completes one cycle of data mining, and the cycle continues in the same way, each time using the insights generated from th...

PornHub.com : Terrible Data

Some years ago Pornhub.com was asked to gather and analyze data for BuzzFeed to answer a seemingly simple yet statistically involved question: “Who watches more porn, the Reds (Republicans) or the Blues (Democrats)?” Aside from the fact that this question was probably not the most efficient use of their time, BuzzFeed relied completely on the prowess of Pornhub's statisticians and declared that the analysis clearly showed the Democrats in the lead by 13%. It turned out that the data mining process was flawed. This led to a lot of seemingly plausible insights which were later shown to be inaccurate. Here are a few of the mistakes they made: The first step was to obtain data about the state-wise distribution of Republicans versus Democrats. You would think this could come from data about the percentage of voters per state. But they initially assumed that just because there was a Democratic victory (or a Republican victo...
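The aggregation mistake described here can be made concrete with a tiny R sketch. All numbers below are made up for illustration; they are not Pornhub's or BuzzFeed's actual figures:

```r
# Hypothetical per-state data: traffic and actual vote shares (made up).
traffic   <- c(stateA = 1000, stateB = 800)   # pageviews from each state
dem_share <- c(stateA = 0.51, stateB = 0.48)  # actual Democratic vote share

# The flawed winner-take-all assumption: a 51% Democratic state
# is treated as 100% Democratic.
winner_dem         <- ifelse(dem_share > 0.5, 1, 0)
flawed_dem_views   <- sum(traffic * winner_dem)    # 1000

# Weighting by actual vote share instead:
weighted_dem_views <- sum(traffic * dem_share)     # 510 + 384 = 894
```

Even in this two-state toy example the winner-take-all figure overstates Democratic traffic relative to the vote-share-weighted one; aggregated over fifty states, that distortion can easily produce a headline-sized gap.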

Knewton : Gravitating towards data-driven education

There are multiple companies that have sprung up in the past decade that use big data to create novel products which might completely change the way we think about education itself. Knewton (a famous name deliberately misspelled) is a company in the data-driven education space that derives insights from students' extended interaction with their course material and adapts the material to the learner's pace, strengths and weaknesses. The platform provides features like tracking a student's day-to-day activity to determine their “most efficient” study time and which materials suit them, and then dynamically generating varied learning plans tailored to that student's specific needs. What is more exciting about this company is that it was started in 2008 and has now grown to involve textbook publishers, universities and software companies. Knewton's CEO, Jose Ferreira, provides an approximate classification of the data-gathering categories that the platform uses. ...

Data Privacy : How Little We Know

Data privacy has always been a concern, and now, with a growing data-hungry corporate sector in which everyone wants user-specific data, the privacy of individuals is in grave danger. Here I describe, in two sections, cases covered by popular media. It wouldn't be far-fetched to assume that there are plenty more that haven't struck the media as newsworthy. The Facebook Moods Experiment I guess everyone came across this article at some point, or heard about it in the news, because guess who it's about? Facebook! Yes, our good old social network that has become second nature to surfing the internet dented the beliefs of its users (well, certainly some of us) by conducting an experiment that involved manipulating the emotions of 700,000 Facebook users. The experiment was done to study the effects that the news feed had on people's emotions and how those emotions spread through networks. The study was published in the Proceedings of the National Academy of Sciences, USA....

The P vs NP question : Relativization and its importance

1.0: Relativization In this section we discuss the importance of relativization as a technique and how it has opened new directions in complexity theory. We will also see why this otherwise powerful technique is insufficient to resolve the P vs NP dilemma, yet provides counter-arguments against other techniques and has hence remained useful in complexity theory. 1.1: Introduction Relativization was first introduced by Baker, Gill and Solovay [2.1] and has continued to be an important technique because it provides arguments against proofs that try to resolve the P versus NP question. The general idea is as follows: suppose we can show for some statement S that there exists an oracle A such that S fails relative to A in some oracle model. Then any proof that S holds must not relativize in that model, for otherwise the statement would also hold relative to A. If we can also find an oracle relative to which S holds, then no relativizable technique can decide the tr...
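The Baker–Gill–Solovay result itself instantiates this idea for the statement "P = NP": it exhibits two oracles with opposite behavior, which in the standard formulation reads

```latex
\exists A \;:\; \mathrm{P}^{A} = \mathrm{NP}^{A}
\qquad \text{and} \qquad
\exists B \;:\; \mathrm{P}^{B} \neq \mathrm{NP}^{B}
```

Since the statement holds relative to one oracle and fails relative to another, no proof technique that relativizes can settle P versus NP in either direction.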

Data Journalism : Do we trust the numbers ?

Data journalism can be rephrased as data-aided journalism. The ability to extract information from various sources is at the heart of the big data process called data mining. The current trend in journalism is to rely more on data mining techniques to gather data that allows for meaningful exploitation of facts and insights that would otherwise be difficult to analyze manually. A lot of news stations have started investing in big data over the past few years, but this has led to another problem: how can we trust the analysis performed by the news channels? Not only does the use of big data create a myth that data analysis serves as conclusive evidence, it also disengages the audience from conducting its own inquiry. I'm not implying the curation of viewers' minds here, but rather pointing to the false trust generated by statistical analysis in general. Statistical analysis, which is at the heart of big data, is only used to indicate the presence of trends ...