Explaining multicollinearity in layman's terms
Say we have a study where we want to run a logistic regression on a group of people, and we want to find out whether one attribute of a person makes them more likely to be a smoker.
So we have smoker yes/no as the dependent variable and let's say we have weight and gender as independent variables.
From this information, I'm guessing weight and gender introduce collinearity, because men generally weigh more than women, and therefore they are not independent variables. Is this correct?
If so, I'm confused, because I thought a regression controls for the other variables: it looks at whether men are more likely to smoke regardless of their weight, and whether heavier people are more likely to smoke regardless of gender.
Or do I have it all wrong?
Thank you.
regression multicollinearity
asked Dec 18 at 11:54 by Paze
3 Answers
When it comes to collinearity, we have to distinguish exact collinearity from approximate collinearity.
The former occurs when one independent variable (IV) is an exact linear combination of other independent variables. Usually this is due to some sort of elementary mistake (like entering weight in pounds and also weight in kilograms) and is easily corrected.
The latter occurs when one IV is almost a linear combination of other IVs. Assessing when "almost" is close enough to be troublesome can be done through various means; the one I think best is condition indexes and the proportion of variance explained. This has been discussed elsewhere, so I won't add more here.
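That said, for readers who want a concrete picture of what a condition-index check can look like, here is a minimal sketch; the simulated data, the toy weight/sex scenario, and the rough cutoff of 30 are illustrative assumptions, not part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
male = rng.integers(0, 2, n)                       # hypothetical 0/1 sex indicator
weight = 70 + 12 * male + rng.normal(0, 10, n)     # men heavier on average, with lots of overlap
X = np.column_stack([np.ones(n), male, weight])

# Belsley-style diagnostic: scale each column to unit length, inspect singular values.
X_scaled = X / np.linalg.norm(X, axis=0)
s = np.linalg.svd(X_scaled, compute_uv=False)
condition_indexes = s.max() / s

print(condition_indexes)   # indexes above roughly 30 are commonly read as a warning sign
```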
In your case, it is very unlikely that the relationship between sex and weight is close enough to cause trouble. In a general population, it's true that men, on average, weigh more than women. But there's lots of overlap: There are plenty of light men and heavy women. If (for some reason) your sample consisted of jockeys and football players, then it would cause problems.
The problems that collinearity causes include high variances for the parameter estimates and extreme sensitivity to changes in the data. In other words, we only have a very poor idea of what the parameter estimates are and a small change in the input data can result in a huge change in the parameter estimates.
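To see that instability numerically, here is a small simulation sketch (every number in it is an arbitrary assumption). It compares how much the fitted coefficient of one predictor varies across repeated samples when its companion predictor is nearly a copy of it versus unrelated to it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 2000

def sd_of_x1_coefficient(collinear):
    """Spread of the fitted x1 coefficient across repeated simulated samples."""
    coefs = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        if collinear:
            x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is nearly a copy of x1
        else:
            x2 = rng.normal(size=n)                    # x2 unrelated to x1
        y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        coefs.append(beta[1])
    return np.std(coefs)

print("spread of x1 coefficient, nearly collinear design:", sd_of_x1_coefficient(True))
print("spread of x1 coefficient, independent design:     ", sd_of_x1_coefficient(False))
# The nearly collinear design gives a far larger spread: the individual
# coefficients are poorly pinned down, even though the fitted values are fine.
```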
answered Dec 18 at 12:13 by Peter Flom♦
The good news is that multicollinearity has absolutely no effect on the estimated coefficients $\beta$ of the model. The bad news is that it inflates their standard errors.
The formula for the standard error of the coefficient $\beta_i$ of the independent variable $X_i$ is
$$ s.e.(\beta_i)=\sqrt{VIF_{X_i}\,\frac{\sigma_{\varepsilon}^2}{n\,\sigma_{X_i}^2}} $$
with $\sigma_{\varepsilon}$ being the standard error of the regression (the smaller this value, the larger the $R^2$, i.e. goodness of fit, of the model), $VIF_{X_i}$ a measure of multicollinearity (1 means no multicollinearity, and the larger the value, the more multicollinearity), $n$ the sample size, and $\sigma_{X_i}^2$ the variance of $X_i$.
The key point is that adding a new variable to the model increases $VIF_{X_i}$ and decreases $\sigma_{\varepsilon}$. If the new variable matters and is not highly correlated with $X_i$, the standard error $s.e.(\beta_i)$ typically goes down rather than up; but it is true that it can also go the other way around.
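As a minimal sketch of how $VIF_{X_i}$ can be computed by hand for the asker's weight/gender setup: regress each predictor on the others and take $1/(1-R^2)$. The data-generating numbers below are my own assumptions, not anything from the question:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
male = rng.integers(0, 2, n).astype(float)              # hypothetical 0/1 indicator
weight = 65 + 12 * male + rng.normal(scale=10, size=n)  # men ~12 kg heavier on average

def vif(target, others):
    """1 / (1 - R^2) from regressing `target` on `others` plus an intercept."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    resid = target - X @ beta
    r_squared = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r_squared)

print("VIF(weight):", vif(weight, [male]))    # only modestly above 1 in this toy setup,
print("VIF(male):  ", vif(male, [weight]))    # i.e. not the troublesome kind of collinearity
```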
answered Dec 18 at 12:21 by Tom Pape
Perfect multicollinearity means the matrix $(X^T X)$ will not be invertible, so an OLS solution doesn't exist.
– Vimal
Dec 18 at 14:47
That is indeed correct, because in this case $VIF_{X_i}$, which comes from regressing $X_i$ on all the other independent variables, would be undefined. However, perfect multicollinearity is very rare in practice, except when working with constructed variables like Value, Cost and Value+Cost all in one model. The far more interesting case is partial multicollinearity.
– Tom Pape
Dec 18 at 15:13
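A tiny numerical sketch of the perfect-collinearity case mentioned in these comments, using the Value, Cost and Value+Cost example (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
value = rng.normal(100, 20, n)
cost = rng.normal(60, 10, n)
total = value + cost                       # exact linear combination of the other two

X = np.column_stack([np.ones(n), value, cost, total])
print(np.linalg.matrix_rank(X), "of", X.shape[1], "columns are independent")   # 3 of 4

# Because one column is redundant, X'X is singular, the textbook OLS solution
# (X'X)^{-1} X'y is not defined, and software typically errors out or drops a column.
```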
This may be too lame ...
In layman's terms: "too many cooks (input series) can spoil the broth."
answered Dec 18 at 14:38 by IrishStat
Wouldn't that be overfitting, not multicollinearity?
– Paze
Dec 18 at 15:27
That's a possibility ..
– IrishStat
Dec 18 at 16:27
When you bring in a second series that is nearly equivalent to the first series, that is both over-fitting and a cause of multicollinearity. I don't know how to define over-fitting other than that the new series doesn't add anything (a second cook, for example, perhaps a student of the first cook) when one cook was enough.
– IrishStat
Dec 18 at 20:03