Statistical data type

In statistics, groups of individual data points may be classified as belonging to any of various statistical data types, e.g. categorical ("red", "blue", "green"), real number (1.68, -5, 1.7e+6), etc. The data type is a fundamental component of the semantic content of the variable, and controls which sorts of probability distributions can logically be used to describe the variable, the permissible operations on the variable, the type of regression analysis used to predict the variable, etc. The concept of data type is similar to the concept of level of measurement, but more specific. For example, count data require a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data do, yet both fall under the same level of measurement (a ratio scale).

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements have no meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements, but the zero value is arbitrary (as is the case with longitude and with temperature measurements in Celsius or Fahrenheit), and permit any linear transformation (a rescaling combined with a shift of origin). Ratio measurements have both a meaningful zero value and meaningful distances between different measurements, and permit any rescaling transformation (multiplication by a positive constant).
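
The difference between interval and ratio scales can be illustrated with a small sketch. It is only an illustration: the sample values and the helper names `diff_ratio` and `value_ratio` are assumptions made for this example, not part of Stevens's formulation. A ratio of differences survives a change from Celsius to Fahrenheit, whereas a ratio of values survives only a pure rescaling such as metres to feet.

```python
import math

# Hypothetical sample values, chosen only for illustration.
celsius = [10.0, 20.0, 40.0]
fahrenheit = [c * 9.0 / 5.0 + 32.0 for c in celsius]   # rescale and shift origin

metres = [1.0, 2.0, 4.0]
feet = [m * 3.28084 for m in metres]                    # pure rescaling

def diff_ratio(v):
    """Ratio of consecutive differences: meaningful on an interval scale."""
    return (v[2] - v[1]) / (v[1] - v[0])

def value_ratio(v):
    """Ratio of two values: meaningful only on a ratio scale."""
    return v[2] / v[0]

# Interval scale: ratios of differences are unchanged by the conversion...
interval_ok = math.isclose(diff_ratio(celsius), diff_ratio(fahrenheit))
# ...but "40 degrees is four times 10 degrees" does not survive it.
interval_value_ratio_ok = math.isclose(value_ratio(celsius), value_ratio(fahrenheit))
# Ratio scale: "4 m is four times 1 m" holds in feet as well.
ratio_ok = math.isclose(value_ratio(metres), value_ratio(feet))

print(interval_ok, interval_value_ratio_ok, ratio_ok)
```

The first and third checks hold and the second fails, which matches the description above: the arbitrary zero of an interval scale makes ratios of values depend on the unit chosen.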

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, they are sometimes grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which, being numerical, can be either discrete or continuous. Such distinctions can often be loosely correlated with data types in computer science: dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. However, the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.
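
As a rough sketch of this loose correspondence, the following Python fragment stores one variable of each kind. The variable names and the helper `encode_categories` are hypothetical, and the integer codes it assigns are arbitrary labels, so arithmetic on them carries no statistical meaning.

```python
def encode_categories(labels):
    """Assign arbitrary integer codes 0..K-1 to the distinct labels."""
    codes = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes

# Dichotomous categorical variable -> Boolean
is_smoker = True

# Polytomous categorical variable -> arbitrarily assigned integers
blood_types = ["A", "O", "B", "O", "AB"]
blood_codes, code_table = encode_categories(blood_types)

# Continuous variable -> floating point number
height_m = 1.68

print(type(is_smoker).__name__, blood_codes, type(height_m).__name__)
```

Sorting the labels before numbering them is just one convention; any one-to-one assignment of integers would encode the same nominal information.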

Other categorizations have been proposed. For example, Mosteller and Tukey (1977)[1] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990)[2] described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998)[3] and van den Berg (1991).[4]

Whether it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).[5]
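
A small numerical sketch may make the point about invariance concrete. The two groups below are invented for illustration, and the logarithm is just one example of an order-preserving transformation. The statement "group A has the larger mean" changes its truth value under the transformation, while the corresponding statement about medians does not.

```python
import math
from statistics import mean, median

# Invented data: group A contains one very large value.
group_a = [1.0, 1.0, 100.0]
group_b = [20.0, 30.0, 40.0]

# An order-preserving (monotone) transformation of both groups.
log_a = [math.log(x) for x in group_a]
log_b = [math.log(x) for x in group_b]

mean_a_larger_raw = mean(group_a) > mean(group_b)
mean_a_larger_log = mean(log_a) > mean(log_b)

median_a_larger_raw = median(group_a) > median(group_b)
median_a_larger_log = median(log_a) > median(log_b)

print("mean comparison:  ", mean_a_larger_raw, mean_a_larger_log)
print("median comparison:", median_a_larger_raw, median_a_larger_log)
```

For data that are only ordinal, any monotone relabelling is permissible, so a conclusion that depends on comparing means would depend on an arbitrary choice of coding. Whether that matters depends, as the quotation says, on the question being asked.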

Simple data types

The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logically possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.

Data type | Possible values | Example usage | Level of measurement | Distribution | Scale of relative differences | Permissible statistics | Regression analysis
binary | 0, 1 (arbitrary labels) | binary outcome ("yes/no", "true/false", "success/failure", etc.) | nominal scale | Bernoulli | incomparable | mode, Chi-squared | logistic, probit
categorical | 1, 2, ..., K (arbitrary labels) | categorical outcome (specific blood type, political party, word, etc.) | nominal scale | categorical | incomparable | mode, Chi-squared | multinomial logit, multinomial probit
ordinal | integer or real number (arbitrary scale) | relative score, significant only for creating a ranking | ordinal scale | categorical?? | relative comparison | | ordinal regression (ordered logit, ordered probit)
binomial | 0, 1, ..., N | number of successes (e.g. yes votes) out of N possible | interval scale?? | binomial, beta-binomial, etc. | additive?? | mean, median, mode, standard deviation, correlation | binomial regression (logistic, probit)
count | nonnegative integers (0, 1, ...) | number of items (telephone calls, people, molecules, births, deaths, etc.) in given interval/area/volume | ratio scale | Poisson, negative binomial, etc. | multiplicative | all statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation | Poisson, negative binomial regression
real-valued additive | real number | temperature, relative distance, location parameter, etc. (or approximately, anything not varying over a large scale) | interval scale | normal, etc. (usually symmetric about the mean) | additive | mean, median, mode, standard deviation, correlation | standard linear regression
real-valued multiplicative | positive real number | price, income, size, scale parameter, etc. (especially when varying over a large scale) | ratio scale | log-normal, gamma, exponential, etc. (usually a skewed distribution) | multiplicative | all statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation | generalized linear model with logarithmic link
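
The additional statistics listed for ratio-scale data can be computed with Python's standard `statistics` module, as in the sketch below. The sample values and the helper `coefficient_of_variation` are assumptions made for this illustration. The comparison at the end suggests why the coefficient of variation is reserved for ratio scales: it is unchanged when prices are re-expressed in cents, but it changes when temperatures are moved from Celsius to kelvins, because that conversion shifts the zero point.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Ratio-scale example (invented prices): the zero is meaningful.
prices = [10.0, 20.0, 40.0]
prices_in_cents = [p * 100.0 for p in prices]

# Interval-scale example (invented temperatures): the zero is arbitrary.
temps_celsius = [10.0, 20.0, 40.0]
temps_kelvin = [t + 273.15 for t in temps_celsius]

print("geometric mean:", geometric_mean(prices))
print("harmonic mean: ", harmonic_mean(prices))

cv_same_after_rescaling = math.isclose(
    coefficient_of_variation(prices),
    coefficient_of_variation(prices_in_cents),
)
cv_same_after_shift = math.isclose(
    coefficient_of_variation(temps_celsius),
    coefficient_of_variation(temps_kelvin),
)
print(cv_same_after_rescaling, cv_same_after_shift)
```

`geometric_mean` requires Python 3.8 or later.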

Multivariate data types

Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:

These concepts originate in various scientific fields and frequently overlap in usage. As a result, it is very often the case that multiple concepts could potentially be applied to the same problem.

References

  1. Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression. Boston: Addison-Wesley.
  2. Nelder, J. A. (1990). The knowledge needed to computerise the analysis and interpretation of statistical information. In Expert systems and artificial intelligence: the need for information about data. Library Association Report, London, March, 23–27.
  3. Chrisman, N. R. (1998). Rethinking levels of measurement for cartography. Cartography and Geographic Information Science, 25(4), 231–242.
  4. van den Berg, G. (1991). Choosing an analysis method. Leiden: DSWO Press.
  5. Hand, D. J. (2004). Measurement theory and practice: The world through quantification. London, UK: Arnold.
This article is issued from Wikipedia - version of the 9/15/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.