Wilkinson Notation
Overview
Wilkinson notation provides a way to describe regression and repeated measures models without specifying coefficient values. This specialized notation identifies the response variable and which predictor variables to include or exclude from the model. You can also include squared and higher-order terms, interaction terms, and grouping variables in the model formula.
Specifying a model using Wilkinson notation provides several advantages:
You can include or exclude individual predictors and interaction terms from the model. For example, using the 'Interactions' name-value pair available in each model fitting function includes interaction terms for all pairs of variables. Using Wilkinson notation instead allows you to include only the interaction terms of interest.
You can change the model formula without changing the design matrix, if your input data uses the table data type. For example, if you fit an initial model using all the available predictor variables, but decide to remove a variable that is not statistically significant, then you can rewrite the model formula to include only the variables of interest. You do not need to make any changes to the input data itself.
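Both advantages can be illustrated with a minimal sketch. It assumes the carsmall sample data and uses the 'interactions' model specification of fitlm as the "all pairs" baseline; the choice of variables and the model names (mdlAll, mdlSel, mdlReduced) are illustrative only.

```matlab
% Sketch: all pairwise interactions versus one selected interaction.
load carsmall
tbl = table(Weight, Acceleration, Horsepower, MPG);

% 'interactions' specification: intercept, three main effects,
% and all three pairwise products. MPG (last variable) is the response.
mdlAll = fitlm(tbl, 'interactions');

% Wilkinson formula: keep only the Weight-by-Horsepower interaction.
mdlSel = fitlm(tbl, 'MPG ~ Weight*Horsepower + Acceleration');

% Drop Acceleration by editing the formula only; tbl is unchanged.
mdlReduced = fitlm(tbl, 'MPG ~ Weight*Horsepower');

disp(mdlAll.CoefficientNames)
disp(mdlSel.CoefficientNames)
disp(mdlReduced.CoefficientNames)
```

Comparing the coefficient names of the three models shows which terms each specification includes.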
Statistics and Machine Learning Toolbox™ offers several model fitting functions that use Wilkinson notation, including:
Linear models (using fitlm and stepwiselm)
Generalized linear models (using fitglm)
Linear mixed-effects models (using fitlme and fitlmematrix)
Generalized linear mixed-effects models (using fitglme)
Repeated measures models (using fitrm)
Cox proportional hazards model (using fitcox)
Formula Specification
A formula for model specification is a character vector or string scalar of the form y ~ terms, where y is the name of the response variable, and terms defines the model using the predictor variable names and the following operators.
Predictor Variables
Predictor Terms in Model | Wilkinson Notation |
---|---|
intercept | 1 |
no intercept | -1 |
x1 | x1 |
x1, x2 | x1 + x2 |
x1, x2, x1·x2 | x1*x2 or x1 + x2 + x1:x2 |
x1·x2 | x1:x2 |
x1, x1² | x1^2 |
x1² | x1^2 - x1 |
Wilkinson notation includes an intercept term in the model by default, even if you do not add 1 to the model formula. To exclude the intercept from the model, use -1 in the formula.
The * operator (for interactions) and the ^ operator (for powers and exponents) automatically include all lower-order terms. For example, if you specify x^3, the model automatically includes x^3, x^2, and x. If you want to exclude certain terms from the model, use the - operator to remove the unwanted terms.
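As a small sketch of how these operators expand, the following fits three models to the carsmall sample data and lists the resulting coefficient names. The choice of Acceleration as the predictor and the model names are assumptions made for illustration.

```matlab
% Sketch: how ^ and - change the set of terms in the fitted model.
load carsmall
tbl = table(Acceleration, MPG);

% ^3 brings in the intercept plus the linear, squared, and cubic terms.
mdlCubic = fitlm(tbl, 'MPG ~ Acceleration^3');

% Subtracting the linear term leaves the intercept and the squared term.
mdlSquareOnly = fitlm(tbl, 'MPG ~ Acceleration^2 - Acceleration');

% -1 removes the default intercept.
mdlNoIntercept = fitlm(tbl, 'MPG ~ -1 + Acceleration');

disp(mdlCubic.CoefficientNames)
disp(mdlSquareOnly.CoefficientNames)
disp(mdlNoIntercept.CoefficientNames)
```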
Random-Effects and Mixed-Effects Models
For random-effects and mixed-effects models, the formula specification includes the names of the predictor variables and the grouping variables. For example, if the predictor variable x1 is a random effect grouped by the variable g, then represent this in Wilkinson notation as follows:
(x1 | g)
Repeated Measures Models
For repeated measures models, the formula specification includes all of the repeated measures as responses, and the factors as predictor variables. Specify the response variables for repeated measures models as described in the following table.
Response Terms in Model | Wilkinson Notation |
---|---|
y1 | y1 |
y1, y2, y3 | y1,y2,y3 |
y1, y2, y3, y4, y5 | y1-y5 |
For example, if you have three repeated measures as responses and the factors x1, x2, and x3 as the predictor variables, then you can define the repeated measures model using Wilkinson notation as follows:
y1,y2,y3 ~ x1 + x2 + x3
or
y1-y3 ~ x1 + x2 + x3
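The two response forms can be tried side by side. The sketch below assumes the fisheriris sample data, treats the four flower measurements as repeated measures, and uses species as the single between-subjects factor. The variable names meas1 through meas4 and the within-subjects table are introduced here for illustration.

```matlab
% Sketch: range and comma-separated response specifications with fitrm.
load fisheriris
t = table(species, meas(:,1), meas(:,2), meas(:,3), meas(:,4), ...
    'VariableNames', {'species','meas1','meas2','meas3','meas4'});

% Within-subjects design: one row per repeated measure.
within = table((1:4)', 'VariableNames', {'Measurements'});

% Range form: all table variables from meas1 through meas4.
rmRange = fitrm(t, 'meas1-meas4 ~ species', 'WithinDesign', within);

% Comma-separated form naming each response explicitly.
rmList = fitrm(t, 'meas1,meas2,meas3,meas4 ~ species', 'WithinDesign', within);

disp(rmRange.ResponseNames)
disp(rmList.ResponseNames)
```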
Variable Names
If the input data (response and predictor variables) is stored in a table or dataset array, you can specify the formula using the variable names. For example, load the carsmall sample data. Create a table containing Weight, Acceleration, and MPG. Name each variable using the 'VariableNames' name-value pair argument of the table function. Then fit the following model to the data:
load carsmall
tbl = table(Weight,Acceleration,MPG, ...
    'VariableNames',{'Weight','Acceleration','MPG'});
mdl = fitlm(tbl,'MPG ~ Weight + Acceleration')

mdl =
Linear regression model:
    MPG ~ 1 + Weight + Acceleration

Estimated Coefficients:
                     Estimate         SE         tStat       pValue
                    __________    __________    _______    __________
    (Intercept)         45.155        3.4659     13.028    1.6266e-22
    Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
    Acceleration       0.19694       0.14743     1.3359       0.18493

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743, Adjusted R-Squared: 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27
The model object display uses the variable names provided in the input table.
If the input data is stored as a matrix, you can specify the formula using default variable names such as y, x1, and x2. For example, load the carsmall sample data. Create a matrix containing the predictor variables Weight and Acceleration. Then fit the following model to the data:
load carsmall
X = [Weight,Acceleration];
y = MPG;
mdl = fitlm(X,y,'y ~ x1 + x2')

mdl =
Linear regression model:
    y ~ 1 + x1 + x2

Estimated Coefficients:
                    Estimate         SE         tStat       pValue
                   __________    __________    _______    __________
    (Intercept)        45.155        3.4659     13.028    1.6266e-22
    x1             -0.0082475    0.00059836    -13.783    5.3165e-24
    x2                0.19694       0.14743     1.3359       0.18493

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743, Adjusted R-Squared: 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27
The term x1 in the model specification formula corresponds to the first column of the predictor variable matrix X. The term x2 corresponds to the second column of the input matrix. The term y corresponds to the response variable.
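If the default names are hard to read in the output, one option is to supply names for matrix input. The sketch below assumes the 'VarNames' name-value argument of fitlm, which lists the names of the columns of X followed by the name of the response, and then uses those names in the formula.

```matlab
% Sketch: descriptive names for matrix input through 'VarNames'.
load carsmall
X = [Weight, Acceleration];
y = MPG;

% Predictor names come first, the response name last.
mdl = fitlm(X, y, 'MPG ~ Weight + Acceleration', ...
    'VarNames', {'Weight','Acceleration','MPG'});

disp(mdl.CoefficientNames)
```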
Linear Model Examples
Use fitlm and stepwiselm to fit linear models.
Intercept and Two Predictors
For a linear regression model with an intercept and two fixed-effects predictors, such as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,$$
specify the model formula using Wilkinson notation as follows:
'y ~ x1 + x2'
No Intercept and Two Predictors
For a linear regression model with no intercept and two fixed-effects predictors, such as
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,$$
specify the model formula using Wilkinson notation as follows:
'y ~ -1 + x1 + x2'
Intercept, Two Predictors, and an Interaction Term
For a linear regression model with an intercept, two fixed-effects predictors, and an interaction term, such as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i,$$
specify the model formula using Wilkinson notation as follows:
'y ~ x1*x2'
or
'y ~ x1 + x2 + x1:x2'
Intercept, Three Predictors, and All Interaction Effects
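For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between all three predictors plus all lower-order terms, such as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1} x_{i2} + \beta_5 x_{i1} x_{i3} + \beta_6 x_{i2} x_{i3} + \beta_7 x_{i1} x_{i2} x_{i3} + \varepsilon_i,$$
specify the model formula using Wilkinson notation as follows: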
'y ~ x1*x2*x3'
Intercept, Three Predictors, and Selected Interaction Effects
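For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between two of the predictors, such as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1} x_{i2} + \varepsilon_i,$$
specify the model formula using Wilkinson notation as follows: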
'y ~ x1*x2 + x3'
or
'y ~ x1 + x2 + x3 + x1:x2'
Intercept, Three Predictors, and Lower-Order Interaction Effects Only
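For a linear regression model with an intercept, three fixed-effects predictors, and pairwise interaction effects between all three predictors, but excluding an interaction effect between all three predictors simultaneously, such as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1} x_{i2} + \beta_5 x_{i1} x_{i3} + \beta_6 x_{i2} x_{i3} + \varepsilon_i,$$
specify the model formula using Wilkinson notation as follows: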
'y ~ x1*x2*x3 - x1:x2:x3'
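The linear model formulas above can be compared by counting the terms each one produces. The following is a minimal sketch on simulated data; the data-generating coefficients and the model names are arbitrary choices for illustration.

```matlab
% Sketch: term counts for three of the formulas above, on simulated data.
rng(0);                                  % reproducible illustration data
X = randn(200, 3);
y = 1 + X*[2; -1; 0.5] + 0.8*X(:,1).*X(:,2) + 0.1*randn(200, 1);

% All main effects, pairwise interactions, and the three-way interaction.
mdlFull = fitlm(X, y, 'y ~ x1*x2*x3');

% Same model with the three-way interaction removed.
mdlPairs = fitlm(X, y, 'y ~ x1*x2*x3 - x1:x2:x3');

% Three main effects and only the x1-by-x2 interaction.
mdlOne = fitlm(X, y, 'y ~ x1*x2 + x3');

disp(mdlFull.CoefficientNames)
disp(mdlPairs.CoefficientNames)
disp(mdlOne.CoefficientNames)
```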
Linear Mixed-Effects Model Examples
Use fitlme and fitlmematrix to fit linear mixed-effects models.
Random Effect Intercept, No Predictors
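For a linear mixed-effects model that contains a random intercept but no predictor terms, such as
$$y_{ij} = \beta_0 + b_{0j} + \varepsilon_{ij},$$
where
$$b_{0j} \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2),$$
and g is the grouping variable with m levels (indexed by j = 1, ..., m), specify the model formula using Wilkinson notation as follows: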
'y ~ (1 | g)'
Random Intercept and Fixed Slope for One Predictor
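For a linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, such as
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0j} + \varepsilon_{ij},$$
where
$$b_{0j} \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2),$$
and g is the grouping variable with m levels (indexed by j = 1, ..., m), specify the model formula using Wilkinson notation as follows: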
'y ~ x1 + (1 | g)'
Random Intercept and Random Slope for One Predictor
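For a linear mixed-effects model that contains a fixed intercept and fixed slope, plus a random intercept and a random slope that have a possible correlation between them, such as
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0j} + b_{1j} x_{ij} + \varepsilon_{ij},$$
where
$$\begin{pmatrix} b_{0j} \\ b_{1j} \end{pmatrix} \sim N\left(0, \sigma^2 D(\theta)\right), \quad \varepsilon_{ij} \sim N(0, \sigma^2),$$
and D is a 2-by-2 symmetric and positive semidefinite covariance matrix, parameterized by a variance component vector θ, specify the model formula using Wilkinson notation as follows: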
'y ~ x1 + (x1 | g)'
The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through fitlme when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the 'CovariancePattern' name-value pair argument in fitlme.
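A minimal sketch of the random intercept and random slope model follows, fitted once with the default covariance pattern and once with 'CovariancePattern' set to 'Diagonal', which treats the two random effects as independent. The data are simulated, and the group count, sample sizes, and coefficient values are arbitrary assumptions.

```matlab
% Sketch: correlated versus independent random intercept and slope.
rng(1);                                  % reproducible illustration data
m = 10;                                  % number of levels of g
n = 20;                                  % observations per level
g = repelem((1:m)', n);
x1 = randn(m*n, 1);
b0 = randn(m, 1);                        % random intercepts
b1 = 0.5*randn(m, 1);                    % random slopes
y = 2 + 3*x1 + b0(g) + b1(g).*x1 + 0.3*randn(m*n, 1);
tbl = table(y, x1, categorical(g), 'VariableNames', {'y','x1','g'});

% Default pattern allows correlation between intercept and slope.
lmeCorr = fitlme(tbl, 'y ~ x1 + (x1 | g)');

% Diagonal pattern assumes the two random effects are independent.
lmeIndep = fitlme(tbl, 'y ~ x1 + (x1 | g)', 'CovariancePattern', 'Diagonal');

disp(fixedEffects(lmeCorr))
disp(fixedEffects(lmeIndep))
```

Both fits estimate two fixed effects and two random effects per level of g; only the assumed structure of the random-effects covariance differs.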
Generalized Linear Model Examples
Use fitglm and stepwiseglm to fit generalized linear models.
In a generalized linear model, the response variable y has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear model requires three parts:
Distribution of the response variable
Link function
Linear predictor
The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function fitglm or stepwiseglm.
The linear predictor portion of the equation, which appears on the right side of the ~ symbol in the model specification formula, uses Wilkinson notation in the same way as for the linear model examples.
A generalized linear model relates the linear predictor to the link function of the mean response, not to the response itself. This is reflected in the output display for the model object.
Intercept and Two Predictors
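For a generalized linear regression model with an intercept and two predictors, such as
$$f(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2},$$
where μ_i is the mean of the response y_i and f is the link function, specify the model formula using Wilkinson notation as follows: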
'y ~ x1 + x2'
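The three parts of the specification can be seen together in a short sketch. It assumes a Poisson response with a log link, chosen here purely for illustration, and uses simulated data with arbitrary coefficient values.

```matlab
% Sketch: Poisson regression with a log link and two predictors.
rng(0);                                  % reproducible illustration data
X = randn(100, 2);
mu = exp(1 + X*[0.5; -0.3]);             % mean response under a log link
y = poissrnd(mu);

% The formula gives the linear predictor; the distribution and link
% are passed as name-value arguments.
mdl = fitglm(X, y, 'y ~ x1 + x2', 'Distribution', 'poisson', 'Link', 'log');

disp(mdl.CoefficientNames)
```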
Generalized Linear Mixed-Effects Model Examples
Use fitglme to fit generalized linear mixed-effects models.
In a generalized linear mixed-effects model, the response variable y has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear mixed-effects model requires three parts:
Distribution of the response variable
Link function
Linear predictor
The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function fitglme.
The linear predictor portion of the equation, which appears on the right side of the ~ symbol in the model specification formula, uses Wilkinson notation in the same way as for the linear mixed-effects model examples.
A generalized linear mixed-effects model relates the linear predictor to the link function of the mean response, not to the response itself. This is reflected in the output display for the model object.
The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through fitglme when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the 'CovariancePattern' name-value pair argument in fitglme.
Random Intercept and Fixed Slope for One Predictor
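For a generalized linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, where the response can be modeled using a Poisson distribution, such as
$$y_{ij} \mid b_{0j} \sim \text{Poisson}(\mu_{ij}), \quad \log(\mu_{ij}) = \beta_0 + \beta_1 x_{ij} + b_{0j},$$
where
$$b_{0j} \sim N(0, \sigma_b^2),$$
and g is the grouping variable with m levels (indexed by j = 1, ..., m), specify the model formula using Wilkinson notation as follows: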
'y ~ x1 + (1 | g)'
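A minimal sketch of fitting this formula with fitglme follows. It assumes a Poisson response with a log link and uses simulated data; the number of groups, sample sizes, and coefficient values are arbitrary.

```matlab
% Sketch: Poisson mixed-effects model with a random intercept per group.
rng(2);                                  % reproducible illustration data
m = 12;                                  % number of levels of g
n = 25;                                  % observations per level
g = repelem((1:m)', n);
x1 = randn(m*n, 1);
b0 = 0.4*randn(m, 1);                    % random intercepts
mu = exp(0.5 + 0.3*x1 + b0(g));          % mean response under a log link
y = poissrnd(mu);
tbl = table(y, x1, categorical(g), 'VariableNames', {'y','x1','g'});

% The formula gives the fixed and random parts of the linear predictor.
glme = fitglme(tbl, 'y ~ x1 + (1 | g)', ...
    'Distribution', 'Poisson', 'Link', 'log');

disp(fixedEffects(glme))
```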
Repeated Measures Model Examples
Use fitrm to fit repeated measures models.
One Predictor
For a repeated measures model with five response measurements and one predictor variable, specify the model formula using Wilkinson notation as follows:
'y1-y5 ~ x1'
Three Predictors and an Interaction Term
For a repeated measures model with five response measurements and three predictor variables, plus an interaction between two of the predictor variables, specify the model formula using Wilkinson notation as follows:
'y1-y5 ~ x1*x2 + x3'