Linear regression models are easy to fit and interpret. Moreover, they are surprisingly accurate in many real-world situations where the relationship between the response and the predictors is approximately linear. However, it is often the case that not all the variables used in a multiple regression model are associated with the response.
Including irrelevant variables leads to unnecessary complexity. By removing them we can obtain a model that is more easily interpreted. Additionally, when the number of predictors is close to the number of samples, linear models tend to overfit and therefore to perform poorly on new data.
The best subset selection approach identifies a subset of the predictors that appear to be related to the response. We can then fit a model using least squares on the reduced set of variables.
Here, we are going to do that using R.
The swiss data set
We will use the classic swiss data set provided with the built-in R datasets package. We can have a look at its documentation.
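For instance, by opening the help page from the R console:

```r
# Open the help page for the swiss data set, which ships with
# the built-in datasets package.
?swiss
```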
There we see that it contains a standardized fertility measure and socio-economic indicators for each of the 47 French-speaking provinces of Switzerland in about 1888. It is composed of 47 observations on 6 variables, each of which is in percent (i.e. in [0, 100]), including:
- Fertility.
- Agriculture as % of men involved in agriculture as occupation.
- Examination as % of draftees receiving highest mark on army examination.
- Education as % of education beyond primary school for draftees.
- Catholic as % of catholic (as opposed to protestant).
- Infant mortality as % of live births who live less than 1 year.
We are interested in predicting infant mortality using a multiple linear regression model.
We can have a quick look at how these variables interact using some plots.
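One simple option is the base pairs() function, which draws a scatterplot matrix of all variable pairs:

```r
# Scatterplot matrix of all pairs of variables in swiss.
pairs(swiss)
```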
And the correlation matrix.
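Something like the following, with the values rounded for readability:

```r
# Correlation matrix of the six variables, rounded to two decimals.
round(cor(swiss), 2)
```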
We can see that Infant.Mortality is positively correlated with Fertility (obviously) and with being Catholic, and negatively with Examination and Education. Additionally, we see that Fertility is positively correlated with being Catholic and with Agriculture, and negatively with Education and Examination.
Let us now select among these predictors using best subset selection. The regsubsets function in the leaps package does exactly this. It performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The summary() command outputs the best set of variables for each model size.
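A minimal sketch of the fitting step follows; the object names regfit and reg_summary are just illustrative choices:

```r
# install.packages("leaps")  # if the package is not installed yet
library(leaps)

# Best subset selection: Infant.Mortality against all other variables.
regfit <- regsubsets(Infant.Mortality ~ ., data = swiss)

# The summary prints the best set of variables for each model size.
reg_summary <- summary(regfit)
reg_summary
```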
The outmat field of the summary contains a matrix with the best subset of predictors for models with 1 to 5 predictors. For example, the best model with two variables includes Fertility and Education as predictors for Infant.Mortality. We can also see that all models include Fertility, and that all models with at least two variables also include Education.
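Assuming the reg_summary object from the sketch above, that matrix can be inspected directly:

```r
# Matrix marking which predictors enter the best model of each size.
reg_summary$outmat
```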
The summary object also includes metrics such as adjusted R², Mallows' Cp, and BIC, which we can use to determine the best overall model.
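These criteria are stored as vectors in the summary object (adjr2, cp, bic), with one entry per model size, so a quick way to compare them is:

```r
# Model size favoured by each criterion.
which.max(reg_summary$adjr2)  # adjusted R^2: larger is better
which.min(reg_summary$cp)     # Mallows' Cp: smaller is better
which.min(reg_summary$bic)    # BIC: smaller is better
```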
Adjusted R² tells us that the best model is the one with two variables, that is, Fertility and Education.
Using Cp we arrive at the same conclusion.
However, using BIC we should go for the model using just Fertility as a predictor. We can plot this information.
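For example, with base graphics (again assuming the reg_summary object from above):

```r
# One curve per criterion, against the number of predictors.
par(mfrow = c(1, 3))
plot(reg_summary$adjr2, xlab = "Number of variables",
     ylab = "Adjusted R^2", type = "l")
plot(reg_summary$cp, xlab = "Number of variables",
     ylab = "Cp", type = "l")
plot(reg_summary$bic, xlab = "Number of variables",
     ylab = "BIC", type = "l")
par(mfrow = c(1, 1))
```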
From the plots we can see that the 2-variable model is not that bad in terms of BIC. So, as a conclusion, we show the coefficients of the 2-variable model using Fertility and Education as inputs to predict Infant.Mortality.
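Assuming the regfit object from the sketch above, coef() with id = 2 extracts them:

```r
# Least squares coefficients of the best 2-variable model.
coef(regfit, id = 2)
```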