Contingency tables

Statsmodels supports a variety of approaches for analyzing contingency tables, including methods for assessing independence, symmetry, homogeneity, and methods for working with collections of tables from a stratified population.

The methods described here are mainly for two-way tables. Multi-way tables can be analyzed using log-linear models. Statsmodels does not currently have a dedicated API for log-linear modeling, but Poisson regression in statsmodels.genmod.GLM can be used for this purpose.

A contingency table is a multi-way table that describes a data set in which each observation belongs to one category for each of several variables. For example, if there are two variables, one with \(r\) levels and one with \(c\) levels, then we have an \(r \times c\) contingency table. The table can be described in terms of the number of observations that fall into each cell, e.g. \(T_{ij}\) is the number of observations that have level \(i\) for the first variable and level \(j\) for the second variable. Each variable must have a finite number of levels (or categories). In different contexts, the variables defining the axes of a contingency table may be called categorical variables or factor variables. They may be either nominal (if their levels are unordered) or ordinal (if their levels are ordered).

The underlying population for a contingency table is described by a distribution table \(P_{ij}\). The elements of \(P\) are probabilities, and the sum of all elements in \(P\) is 1. Methods for analyzing contingency tables use the data in \(T\) to learn about properties of \(P\).
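The simplest way to learn about \(P\) from \(T\) is to use the observed cell proportions as an estimate. The sketch below does this with NumPy for the 2 x 3 table of counts used on this page. The names `T`, `P_hat`, `row_marg` and `col_marg` are illustrative only.

```python
import numpy as np

# Observed counts T (rows: Placebo, Treated; columns: Marked, None, Some).
T = np.array([[7, 29, 7],
              [21, 13, 7]])

n = T.sum()          # total number of observations
P_hat = T / n        # empirical estimate of the distribution table P

# Marginal distributions of the row and column variables.
row_marg = P_hat.sum(axis=1)
col_marg = P_hat.sum(axis=0)

print(P_hat.round(3))
print(row_marg.round(3), col_marg.round(3))
```

Like \(P\) itself, the estimate sums to 1 over all cells, and each marginal sums to 1.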

statsmodels.stats.Table is the most basic class for working with contingency tables. We can create a Table object directly from any rectangular array-like object containing the contingency table cell counts:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import statsmodels.api as sm

In [4]: df = sm.datasets.get_rdataset("Arthritis", "vcd").data

In [5]: tab = pd.crosstab(df['Treatment'], df['Improved'])

In [6]: tab = tab.loc[:, ["None", "Some", "Marked"]]

In [7]: table = sm.stats.Table(tab)

Alternatively, we can pass the raw data and let the Table class construct the array of cell counts for us:

In [8]: table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])

Independence

Independence is the property that the row and column factors occur independently. Association is the lack of independence. Under independence, the joint distribution can be written as the outer product of the row and column marginal distributions:

\[P_{ij} = \sum_k P_{ik} \cdot \sum_k P_{kj} \quad \forall i, j\]

We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:

In [9]: print(table.table_orig)
Improved   Marked  None  Some
Treatment                    
Placebo         7    29     7
Treated        21    13     7

In [10]: print(table.fittedvalues)

© 2009–2012 Statsmodels Developers
© 2006–2008 Scipy Developers
© 2006 Jonathan E. Taylor
Licensed under the 3-clause BSD License.
http://www.statsmodels.org/stable/contingency_tables.html