Data Mining

Posted by Kent Potts on Jan 4, 2011 11:00:00 AM

Data mining (DM) is a broad term that encompasses branches of computer science, applied mathematics and statistics. Generally, DM is concerned with extracting patterns from large and complicated datasets and can be divided into two types of analyses: unsupervised and supervised. Unsupervised analyses try to determine how the data is organized into groups of similar units. Typical examples of unsupervised DM methods include clustering (data segmentation), principal component analysis and association rule analysis (market basket analysis).
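As an illustration of the unsupervised case, the short sketch below clusters synthetic two-dimensional data with k-means using scikit-learn. The data and the choice of three clusters are assumptions made purely for the example.

```python
# A minimal sketch of unsupervised clustering with k-means.
# The synthetic data and the choice of three clusters are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 150 "units" described by two features, drawn from three loose groups.
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k-means segments the data into groups of similar units.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])      # cluster assignment for the first 10 units
print(kmeans.cluster_centers_)  # the discovered group centres
```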

Supervised DM uses a variety of inputs from a dataset to build a function of those inputs that accurately predicts a given output. Example inputs could be the number of product samples received by a physician or the number of podcasts they have listened to; example outputs could be a product’s sales volume or its percent of total market share. Common examples of supervised learning algorithms include linear regression, neural networks, support vector machines, decision trees and random forests. Different algorithms are preferred depending on the goal of the analysis. If predicting the output is of primary concern and describing the roles of the individual inputs in that prediction is not important, methods like neural networks and support vector machines can be very effective. If the goal is to better understand the part each input plays in predicting the output, then regression or decision tree based methods would be favoured.
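To make the supervised case concrete, the hypothetical sketch below predicts a sales figure from two inputs like those mentioned above. The variable names, the simulated data and the coefficients of 5 and 12 are all assumptions for illustration, not anything taken from real data.

```python
# A minimal sketch of supervised learning on hypothetical data:
# predicting a sales figure from two inputs.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 200
samples_received = rng.poisson(10, size=n)  # product samples per physician
podcasts_heard = rng.poisson(3, size=n)     # podcasts listened to
# Assume sales depend on both inputs plus noise.
sales_volume = 5 * samples_received + 12 * podcasts_heard + rng.normal(0, 10, n)

X = np.column_stack([samples_received, podcasts_heard])

# Regression: the coefficients describe each input's role in the prediction.
reg = LinearRegression().fit(X, sales_volume)
print(reg.coef_)  # roughly [5, 12] on this synthetic data

# Decision tree: feature importances offer another view of each input's part.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, sales_volume)
print(tree.feature_importances_)
```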

Supervised learning algorithms can be exceedingly effective at filtering through a very large number of inputs and determining which ones are strongly associated with a given output. These associations can help shed light on the structure of a dataset, but they should not be mistaken for causal relationships. A causal relationship implies that one event (ex: listening to podcasts) caused a second event (ex: sales volume increasing). Properly estimating the magnitude and direction of the relationship between the two events is essential if a valid measure of an input’s return on investment (ROI) is desired.
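The hypothetical simulation below shows how such a mistake arises: an unobserved factor drives both podcast listening and sales, so a naive model reports a strong association even though podcasts have no direct effect at all. All names and numbers are assumed for the example.

```python
# A hypothetical illustration of why association is not causation:
# a hidden factor drives both the input and the output.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 1000
engagement = rng.normal(0, 1, n)                    # unobserved confounder
podcasts = 3 + 2 * engagement + rng.normal(0, 1, n)
sales = 50 + 10 * engagement + rng.normal(0, 1, n)  # no direct podcast effect

# The naive model finds a strong association anyway.
naive = LinearRegression().fit(podcasts.reshape(-1, 1), sales)
print(naive.coef_)  # clearly non-zero, despite podcasts having no causal effect
```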

The gold standard for estimating causal effects is a randomized experiment, where experimental units are randomized to one or more treatments (ex: listening to podcasts or not). Proper randomization balances the sources of bias (ex: gender, age, education …) that could otherwise skew the ROI estimate. Unfortunately, in an enterprise environment, randomized experiments are rarely a feasible option. A more realistic and rapid approach is to take the large volumes of data that already exist, try to control for all the sources of bias present in the data, and then estimate the ROI of an input.
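A quick simulated check, under assumed data, of what randomization buys: with coin-flip assignment, a bias source such as age ends up with nearly the same distribution in both arms.

```python
# A sketch of why randomization works: random assignment balances a
# bias source (here, age) across the treatment arms. Data are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
age = rng.normal(60, 8, n)                    # a potential source of bias
treated = rng.integers(0, 2, n).astype(bool)  # coin-flip assignment

# The two arms have nearly identical age distributions.
print(age[treated].mean(), age[~treated].mean())
```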

Estimating ROI in this manner can be done in two ways. The first uses matching or weighting to create two groups of units, one that received the input and one that did not, that are otherwise comparable (ex: the average age in the two groups is approximately 60). Comparable means that the sources of bias have the same distribution in the two groups. The second method uses probabilistic graphical models (PGMs) to represent the relationships between the input, the output and the sources of bias. PGMs can be very effective when plenty of information is known about the relationships between inputs and outputs in a dataset. If this information is not available, then the former method is preferred when estimating an input’s ROI.
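As a sketch of the first method, the code below fits a propensity score and uses inverse-probability weighting to make the two groups comparable on a single bias source. The data, the age-driven assignment and the true effect of +5 are all assumptions for the example.

```python
# A minimal sketch of the weighting approach on hypothetical data: fit a
# propensity score (probability of receiving the input given the bias
# source), then weight units so the two groups become comparable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5000
age = rng.normal(60, 8, n)  # observed source of bias
# Assume older physicians are more likely to have received the input.
p_input = 1 / (1 + np.exp(-(age - 60) / 8))
received = rng.random(n) < p_input
# Assume the input's true effect on the output is +5.
outcome = 0.5 * age + 5 * received + rng.normal(0, 2, n)

# Naive comparison is biased: the treated group is older on average.
print(outcome[received].mean() - outcome[~received].mean())

# Model the propensity score and form inverse-probability weights.
ps = LogisticRegression().fit(age.reshape(-1, 1), received)
e = ps.predict_proba(age.reshape(-1, 1))[:, 1]
w = np.where(received, 1 / e, 1 / (1 - e))

# The weighted difference in means recovers an effect close to +5.
treated_mean = np.average(outcome[received], weights=w[received])
control_mean = np.average(outcome[~received], weights=w[~received])
print(treated_mean - control_mean)
```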

Topics: edetail, clm platform skura, mobile clm, e detailing software
