Data Mining and R/Rattle: First Experiment
Data mining is the activity of harvesting useful insights from vast amounts of data based on a model. Rather than relying on one particular model or method of analysis, data mining is context-based: it may employ multiple models and can work on a variety of data sources including text, audio, video, and images. Data mining is an ongoing process, and scenarios where the development of the data model stagnates are relatively few, because the mined insights act as feedback for a better model; in that sense the process is much like agile software development. The role of a data miner in an organization begins with an understanding of the domain and the data itself. This understanding is crucial to defining and refining the model that is the heart of the mining process. A model is constructed and then critiqued by domain experts and data experts. This completes one cycle of data mining, and the cycle continues, each iteration using the insights generated by the previous one.
The book presents Rattle, a free and open-source GUI for the statistical language R. A typical data mining project in Rattle goes through the phases of importing the dataset, deciding which variables to focus on, exploring the data from that perspective, cleaning and transforming the data before inferring anything from it, developing models around the data, and evaluating them. Once this is done, we deploy our models and continually monitor their performance.
R is a command line tool that can be easily installed on all major operating systems. Rattle is loaded from the R prompt with 'library(rattle)' and its interface is launched with 'rattle()'. Rattle is a tab-based graphical user interface where each tab deals with a different part of the data mining process.
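For reference, the whole setup fits in a few lines at the R prompt. This is a minimal sketch; the 'install.packages' step is only needed the first time:

```r
# One-time setup: install Rattle from CRAN (this pulls in its GUI dependencies).
install.packages("rattle")

# In each session: load the package, then launch the graphical interface.
library(rattle)
rattle()
```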
The tab-based division of work is as follows:
1. Loading a dataset: A simple .csv file can be imported into Rattle via the GUI. In R, a loaded dataset is represented as a data frame. The GUI also supports other data formats that I have not yet experimented with.
2. Creating a model: Models are what I would term a constrained representation of the data with stronger semantics. I tried viewing the default weather.csv file as a tree, following the example in the book. As I understand it, rules are constraints that split the data to form something that looks like a decision tree. The data is split at each decision node, and each leaf represents whatever is filtered along the path from the root of the tree to that leaf by all the rules encountered on the way.
3. The explore tab: This section allows us to examine different data distributions by looking at bar graphs, pie charts, etc. across two or more variables in the data. This seems like a highly useful feature: it gives the first level of insight into the data and strongly influences data cleaning and model structure. I was able to plot sunshine vs. evaporation and rainfall vs. evaporation. This helped me examine my assertion about the relation between humidity (directly proportional to rainfall tomorrow) and evaporation.
4. The evaluate tab: This tab allows us to check the hypotheses formed from the model against previously known values. This is where everything gets interesting and we see how good or bad our model is. I was not yet able to use it to determine the confidence level of my statement.
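The dataset loading in step 1 can also be done directly at the R prompt without the GUI. A minimal sketch, where the file name is just an example:

```r
# Read a CSV file into an R data frame (the file name here is illustrative).
weather <- read.csv("weather.csv", header = TRUE)

str(weather)    # inspect column names and types
head(weather)   # peek at the first few rows
```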
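The tree model from step 2 can be reproduced at the R prompt: Rattle's Tree option is built on the rpart package. A sketch, assuming the weather dataset bundled with the rattle package and a few of its column names:

```r
library(rattle)   # provides the bundled weather dataset
library(rpart)    # the decision tree engine behind Rattle's Tree model

data(weather)

# Fit a classification tree predicting tomorrow's rain from today's
# observations; the predictor columns are a small illustrative subset.
fit <- rpart(RainTomorrow ~ MinTemp + MaxTemp + Sunshine + Humidity3pm,
             data = weather, method = "class")

print(fit)             # each node shows the splitting rule and the data it filters
plot(fit); text(fit)   # draw the tree with its split labels
```

Printing the fitted tree makes the "rules as constraints" view concrete: every line of output is one split, and following the splits from the root down to a leaf gives exactly the filter that leaf represents.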
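The scatter plots from step 3 can also be drawn with base R graphics; the variable names below follow the weather dataset shipped with rattle:

```r
library(rattle)
data(weather)

# Scatter plots of the pairs explored above.
plot(weather$Sunshine, weather$Evaporation,
     xlab = "Sunshine (hours)", ylab = "Evaporation (mm)")
plot(weather$Rainfall, weather$Evaporation,
     xlab = "Rainfall (mm)", ylab = "Evaporation (mm)")
```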
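The kind of check the Evaluate tab performs in step 4 can be sketched by holding out part of the data, fitting a model on the rest, and tabulating predictions against the known values (a confusion matrix). This is a hedged sketch of the idea, not Rattle's exact procedure:

```r
library(rattle)
library(rpart)

data(weather)

# Hold out roughly 30% of the rows as a test set.
set.seed(42)
test <- sample(nrow(weather), round(0.3 * nrow(weather)))

# Fit a tree on the remaining 70%, using an illustrative subset of columns.
fit <- rpart(RainTomorrow ~ MinTemp + MaxTemp + Sunshine + Humidity3pm,
             data = weather[-test, ], method = "class")

# Compare predictions against the known RainTomorrow values.
pred <- predict(fit, weather[test, ], type = "class")
table(predicted = pred, actual = weather$RainTomorrow[test])
```

The resulting table counts how often the model's yes/no predictions agree with what actually happened, which is the simplest way to see how good or bad the model is.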