Multifactor dimensionality reduction

Multifactor dimensionality reduction (MDR) is a data mining approach for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify interactions among discrete variables that influence a binary outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression.

The basis of the MDR method is a constructive induction algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data.

Illustrative example

Consider the following simple example using the exclusive OR (XOR) function. XOR is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.

Table 1

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0

A data mining algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 are examined and the number of times Y=1 and/or Y=0 is counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1 which is less than our fixed threshold of 1. Since 0/1 < 1 we encode a new attribute (Z) as a 0. When the ratio is greater than one we encode Z as a 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data.

Table 2

Z Y
0 0
1 1
1 1
0 0

The data mining algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data. Decision trees, neural networks, or a naive Bayes classifier could be used.

Data mining with MDR

As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any data mining algorithm there is always concern about overfitting. That is, data mining algorithms are good at finding patterns in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value for the result. These approaches have all been shown to be useful for choosing and evaluating MDR models.

Applications

MDR has mostly been applied[1] to detecting gene-gene interactions or epistasis in genetic studies of common human diseases such as atrial fibrillation, autism, bladder cancer, breast cancer, cardiovascular disease, hypertension, prostate cancer, schizophrenia, and type II diabetes. However, it can be applied to other domains such as economics, engineering, meteorology, etc. where interactions among discrete attributes might be important for predicting a binary outcome.

Software

www.epistasis.org provides an open-source and freely-available MDR software package.

See also

References

Expert Rev Mol Diagn. 2004 Nov;4(6):795-803. PubMed

Hum Hered. 2004;57(1):28-38. PubMed

Eur J Hum Genet. 2005 Jul;13(7):807-14. PubMed

Further reading

References

This article is issued from Wikipedia - version of the 9/22/2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.