Generalized Linear Models¶
Generalized linear models currently support estimation using the one-parameter exponential families.
See Module Reference for commands and arguments.
Examples¶
# Load modules and data
In [1]: import statsmodels.api as sm
In [2]: data = sm.datasets.scotland.load()
In [3]: data.exog = sm.add_constant(data.exog)
# Instantiate a gamma family model with the default link function.
In [4]: gamma_model = sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
In [5]: gamma_results = gamma_model.fit()
In [6]: print(gamma_results.summary())
Detailed examples can be found here:
Technical Documentation¶
The statistical model for each observation i is assumed to be
Y_i \sim F_{EDM}(\cdot|\theta,\phi,w_i) and \mu_i = E[Y_i|x_i] = g^{-1}(x_i^\prime\beta),
where g is the link function and F_{EDM}(\cdot|\theta,\phi,w) is a distribution of the family of exponential dispersion models (EDM) with natural parameter \theta, scale parameter \phi and weight w. Its density is given by
f_{EDM}(y|\theta,\phi,w) = c(y,\phi,w) \exp\left(\frac{y\theta-b(\theta)}{\phi}w\right)\,.
It follows that \mu = b'(\theta) and Var[Y|x]=\frac{\phi}{w}b''(\theta). The inverse of the first equation gives the natural parameter as a function of the expected value \theta(\mu) such that
Var[Y_i|x_i] = \frac{\phi}{w_i} v(\mu_i)
with v(\mu) = b''(\theta(\mu)). Therefore it is said that a GLM is determined by link function g and variance function v(\mu) alone (and x of course).
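As a worked instance of these identities, consider the Gamma family, whose cumulant function is b(\theta) = -\log(-\theta) with \theta < 0. The steps below are an illustrative check of \mu = b'(\theta) and v(\mu) = b''(\theta(\mu)) for that one family only:

```latex
\begin{aligned}
b(\theta) &= -\log(-\theta), \qquad \theta < 0 \\
\mu = b'(\theta) &= -\frac{1}{\theta}
  \;\Longrightarrow\; \theta(\mu) = -\frac{1}{\mu} \\
b''(\theta) &= \frac{1}{\theta^2}
  \;\Longrightarrow\; v(\mu) = b''(\theta(\mu)) = \mu^2 \\
Var[Y_i|x_i] &= \frac{\phi}{w_i}\,\mu_i^2
\end{aligned}
```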
Note that while \phi is the same for every observation y_i and therefore does not influence the estimation of \beta, the weights w_i might be different for every y_i such that the estimation of \beta depends on them.
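The role of \phi and w_i can be illustrated with a minimal sketch in plain Python. It assumes an intercept-only Poisson model with log link, for which the score equation reduces to \sum_i w_i (y_i - e^\beta) / \phi = 0. The function name `fit_intercept_poisson`, the toy data and the bisection solver are all hypothetical choices made for this illustration; this is not how statsmodels fits a GLM.

```python
import math


def fit_intercept_poisson(endog, weights, phi=1.0, tol=1e-12):
    """Solve the weighted score equation of an intercept-only Poisson GLM
    (log link) for beta by bisection. Illustrative sketch only."""

    def score(beta):
        mu = math.exp(beta)
        # Score of the EDM log-likelihood: sum_i w_i * (y_i - mu) / phi
        return sum(w * (y - mu) for y, w in zip(endog, weights)) / phi

    # The score is decreasing in beta; the bracket assumes positive data
    # whose weighted mean lies between exp(-20) and exp(20).
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


endog = [1.0, 2.0, 6.0]          # toy responses (assumed for illustration)
equal_w = [1.0, 1.0, 1.0]
unequal_w = [1.0, 1.0, 4.0]

beta_phi1 = fit_intercept_poisson(endog, equal_w, phi=1.0)
beta_phi5 = fit_intercept_poisson(endog, equal_w, phi=5.0)
beta_weighted = fit_intercept_poisson(endog, unequal_w, phi=1.0)

print(math.exp(beta_phi1), math.exp(beta_phi5), math.exp(beta_weighted))
```

With these inputs the fitted mean e^\beta equals the plain average 3 for both values of \phi, because \phi only rescales the score and leaves its root unchanged. It moves to the weighted average 4.5 once the third observation carries weight 4.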
Distribution | Domain | \mu=E[Y|x] | v(\mu) | \theta(\mu) | b(\theta) | \phi |
---|---|---|---|---|---|---|
Binomial B(n,p) | 0,1,\ldots,n | np | \mu-\frac{\mu^2}{n} | \log\frac{p}{1-p} | n\log(1+e^\theta) | 1 |
Poisson P(\mu) | 0,1,\ldots,\infty | \mu | \mu | \log(\mu) | e^\theta | 1 |
Neg. Binom. NB(\mu,\alpha) | 0,1,\ldots,\infty | \mu | \mu+\alpha\mu^2 | \log(\frac{\alpha\mu}{1+\alpha\mu}) | -\frac{1}{\alpha}\log(1-e^\theta) | 1 |
Gaussian/Normal N(\mu,\sigma^2) | (-\infty,\infty) | \mu | 1 | \mu | \frac{1}{2}\theta^2 | \sigma^2 |
Gamma G(\mu,\nu) | (0,\infty) | \mu | \mu^2 | -\frac{1}{\mu} | -\log(-\theta) | \frac{1}{\nu} |
Inv. Gauss. IG(\mu,\sigma^2) | (0,\infty) | \mu | \mu^3 | -\frac{1}{2\mu^2} | -\sqrt{-2\theta} | \sigma^2 |
Tweedie p\geq 1 | depends on p | \mu | \mu^p | \frac{\mu^{1-p}}{1-p} | \frac{\alpha-1}{\alpha}\left(\frac{\theta}{\alpha-1}\right)^{\alpha} | \phi |
The Tweedie distribution has special cases for p=0,1,2 not listed in the table and uses \alpha=\frac{p-2}{p-1}.
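The relations \mu = b'(\theta) and v(\mu) = b''(\theta(\mu)) can be checked numerically for entries of the table. The sketch below does this with central finite differences for six of the rows, using arbitrarily chosen values \mu = 2, n = 10 and p = 1.5. The helper names `d1` and `d2`, the step sizes and the test point are assumptions made for the illustration.

```python
import math


def d1(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def d2(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


mu = 2.0                            # common test mean (arbitrary choice)
n = 10                              # binomial number of trials
p_bin = mu / n                      # so that n * p equals mu
p_tw = 1.5                          # Tweedie variance power
alpha = (p_tw - 2) / (p_tw - 1)     # Tweedie abbreviation from the table

# Each row: (name, theta(mu), b as a function of theta, v(mu))
rows = [
    ("Binomial", math.log(p_bin / (1 - p_bin)),
     lambda t: n * math.log(1 + math.exp(t)), mu - mu**2 / n),
    ("Poisson", math.log(mu), lambda t: math.exp(t), mu),
    ("Gaussian", mu, lambda t: 0.5 * t**2, 1.0),
    ("Gamma", -1 / mu, lambda t: -math.log(-t), mu**2),
    ("Inv. Gauss.", -1 / (2 * mu**2), lambda t: -math.sqrt(-2 * t), mu**3),
    ("Tweedie", mu**(1 - p_tw) / (1 - p_tw),
     lambda t: (alpha - 1) / alpha * (t / (alpha - 1))**alpha, mu**p_tw),
]

results = {}
for name, theta, b, v in rows:
    # Store (b'(theta), b''(theta), v(mu)) for comparison
    results[name] = (d1(b, theta), d2(b, theta), v)
    print(f"{name:12s} b'={results[name][0]:.6f}  "
          f"b''={results[name][1]:.6f}  v(mu)={v:.6f}")
```

For every row, the first printed value should agree with \mu and the second with v(\mu), up to finite-difference error.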
Correspondence of mathematical variables to code:
- Y and y are coded as `endog`, the variable one wants to model
- x is coded as `exog`, the covariates alias explanatory variables
- \beta is coded as `params`, the parameters one wants to estimate
- \mu is coded as `mu`, the expectation (conditional on x) of Y
- g is coded as `link` argument to the class Family
- \phi is coded as `scale`, the dispersion parameter of the EDM
- w is not yet supported (i.e. w=1), in the future it might be `var_weights`
- p is coded as `var_power` for the power of the variance function v(\mu) of the Tweedie distribution, see table
- \alpha is either
  - Negative Binomial: the ancillary parameter `alpha`, see table
  - Tweedie: an abbreviation for \frac{p-2}{p-1} of the power p of the variance function, see table
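To make the correspondence between symbols and names concrete, the following sketch runs iteratively reweighted least squares (the technique discussed by Green 1984) for a Poisson model with log link, using the names `endog`, `exog`, `params` and `mu` in the roles listed above. It is a plain-Python illustration under several assumptions: the function `irls_poisson` is hypothetical, `exog` has exactly two columns with the constant first, the data are synthetic and noise-free, and w = 1. It is not the statsmodels implementation.

```python
import math


def irls_poisson(endog, exog, n_iter=25):
    """IRLS for a Poisson GLM with log link and a two-column exog.
    Illustrative sketch; assumes the first column of exog is the constant."""
    # Start from the intercept-only model: mu equal to the sample mean.
    params = [math.log(sum(endog) / len(endog)), 0.0]
    for _ in range(n_iter):
        # Accumulate the 2x2 matrix X'WX and the 2-vector X'Wz.
        a00 = a01 = a11 = c0 = c1 = 0.0
        for y, (x0, x1) in zip(endog, exog):
            eta = params[0] * x0 + params[1] * x1  # linear predictor x'beta
            mu = math.exp(eta)                     # mu = g^{-1}(eta), log link
            z = eta + (y - mu) / mu                # working response
            w = mu                                 # 1 / (v(mu) * g'(mu)**2)
            a00 += w * x0 * x0
            a01 += w * x0 * x1
            a11 += w * x1 * x1
            c0 += w * x0 * z
            c1 += w * x1 * z
        # Solve the 2x2 weighted least squares system by Cramer's rule.
        det = a00 * a11 - a01 * a01
        params = [(a11 * c0 - a01 * c1) / det,
                  (a00 * c1 - a01 * c0) / det]
    mu = [math.exp(params[0] * x0 + params[1] * x1) for x0, x1 in exog]
    return params, mu


# Synthetic data: endog equals its mean exactly, so the fit should
# recover the generating coefficients (0.5, 0.2).
exog = [(1.0, float(x)) for x in range(6)]
endog = [math.exp(0.5 + 0.2 * x1) for _, x1 in exog]

params, mu = irls_poisson(endog, exog)
print(params)
```

Because the responses lie exactly on the mean curve, the score equations are satisfied at the generating coefficients, which makes the result easy to verify by eye.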
References¶
- Gill, Jeff. 2000. Generalized Linear Models: A Unified Approach. SAGE QASS Series.
- Green, PJ. 1984. “Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives.” Journal of the Royal Statistical Society, Series B, 46, 149-192.
- Hardin, J.W. and Hilbe, J.M. 2007. “Generalized Linear Models and Extensions.” 2nd ed. Stata Press, College Station, TX.
- McCullagh, P. and Nelder, J.A. 1989. “Generalized Linear Models.” 2nd ed. Chapman & Hall, Boca Raton.
Module Reference¶
Results Class¶
GLMResults(model, params, …[, cov_type, …]) | Class to contain GLM results. |
Families¶
The distribution families currently implemented are
Family(link, variance) | The parent class for one-parameter exponential families. |
Binomial([link]) | Binomial exponential family distribution. |
Gamma([link]) | Gamma exponential family distribution. |
Gaussian([link]) | Gaussian exponential family distribution. |
InverseGaussian([link]) | InverseGaussian exponential family. |
NegativeBinomial([link, alpha]) | Negative Binomial exponential family. |
Poisson([link]) | Poisson exponential family. |
Tweedie([link, var_power, link_power]) | Tweedie family. |
Link Functions¶
The link functions currently implemented are the following. Not all link functions are available for each distribution family. The list of available link functions can be obtained by
>>> sm.families.family.<familyname>.links
Link | A generic link function for one-parameter exponential family. |
CDFLink([dbn]) | Use the CDF of a scipy.stats distribution |
CLogLog | The complementary log-log transform |
Log | The log transform |
Logit | The logit transform |
NegativeBinomial([alpha]) | The negative binomial link function |
Power([power]) | The power transform |
cauchy() | The Cauchy (standard Cauchy CDF) transform |
cloglog | The CLogLog transform link function. |
identity() | The identity transform |
inverse_power() | The inverse transform |
inverse_squared() | The inverse squared transform |
log | The log transform |
logit | The logit transform |
nbinom([alpha]) | The negative binomial link function. |
probit([dbn]) | The probit (standard normal CDF) transform |