Explaining multicollinearity in layman's terms
Say we have a study where we want to run a logistic regression on a group of people, and we want to find out whether one attribute of a person makes them more likely to be a smoker.
So we have smoker yes/no as the dependent variable and let's say we have weight and gender as independent variables.
From this information, I'm guessing weight and gender introduce collinearity, because men generally weigh more than women, and therefore they are not independent variables. Is this correct?
If so, I'm confused, because I thought a regression controls for the other variables: it looks at whether men are more likely to smoke regardless of their weight, and whether heavier people are more likely to smoke regardless of gender.
Or do I have it all wrong?
Thank you.
regression multicollinearity
asked Dec 18 at 11:54 by Paze
3 Answers
When it comes to collinearity, we have to distinguish exact collinearity from approximate collinearity.
The former occurs when one independent variable (IV) is an exact linear combination of other independent variables. Usually this is due to some sort of elementary mistake (like entering weight in pounds and also weight in kilograms) and is easily corrected.
The latter occurs when one IV is almost a linear combination of other IVs. Assessing when "almost" is close enough to be troublesome can be done through various means; the one I think best is condition indexes and the proportion of variance explained. This has been discussed elsewhere, so I won't add more here.
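That said, for readers who want a concrete picture of what a condition-index check can look like, here is a minimal sketch; the simulated data, the toy weight/sex scenario, and the rough cutoff of 30 are illustrative assumptions, not part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
male = rng.integers(0, 2, n)                       # hypothetical 0/1 sex indicator
weight = 70 + 12 * male + rng.normal(0, 10, n)     # men heavier on average, with lots of overlap
X = np.column_stack([np.ones(n), male, weight])

# Belsley-style diagnostic: scale each column to unit length, inspect singular values.
X_scaled = X / np.linalg.norm(X, axis=0)
s = np.linalg.svd(X_scaled, compute_uv=False)
condition_indexes = s.max() / s

print(condition_indexes)   # indexes above roughly 30 are commonly read as a warning sign
```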
In your case, it is very unlikely that the relationship between sex and weight is close enough to cause trouble. In a general population, it's true that men, on average, weigh more than women. But there's lots of overlap: There are plenty of light men and heavy women. If (for some reason) your sample consisted of jockeys and football players, then it would cause problems.
The problems that collinearity causes include high variances for the parameter estimates and extreme sensitivity to changes in the data. In other words, we only have a very poor idea of what the parameter estimates are and a small change in the input data can result in a huge change in the parameter estimates.
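To see that instability numerically, here is a small simulation sketch (every number in it is an arbitrary assumption). It compares how much the fitted coefficient of one predictor varies across repeated samples when its companion predictor is nearly a copy of it versus unrelated to it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 2000

def sd_of_x1_coefficient(collinear):
    """Spread of the fitted x1 coefficient across repeated simulated samples."""
    coefs = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        if collinear:
            x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is nearly a copy of x1
        else:
            x2 = rng.normal(size=n)                    # x2 unrelated to x1
        y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        coefs.append(beta[1])
    return np.std(coefs)

print("spread of x1 coefficient, nearly collinear design:", sd_of_x1_coefficient(True))
print("spread of x1 coefficient, independent design:     ", sd_of_x1_coefficient(False))
# The nearly collinear design gives a far larger spread: the individual
# coefficients are poorly pinned down, even though the fitted values are fine.
```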
answered Dec 18 at 12:13 by Peter Flom♦
The good news is that multicollinearity has absolutely no effect on the estimated coefficients $\beta$ of the model. The bad news is that it inflates their standard errors.
The formula for the standard error of the coefficient $\beta_i$ of the independent variable $X_i$ is
$$ s.e.(\beta_i)=\sqrt{VIF_{X_i}\,\frac{\sigma_{\varepsilon}^2}{n\,\sigma_{X_i}^2}} $$
with $\sigma_{\varepsilon}$ being the standard error of the regression (the smaller this value, the larger the $R^2$, i.e. goodness of fit, of the model), $VIF_{X_i}$ a measure of multicollinearity (1 means no multicollinearity, and the larger the value, the more multicollinearity), $n$ the sample size, and $\sigma_{X_i}^2$ the variance of $X_i$.
The key point is that adding a new variable to the model increases $VIF_{X_i}$ and decreases $\sigma_{\varepsilon}$. If the new variable matters and is not highly correlated with $X_i$, the standard error $s.e.(\beta_i)$ typically goes down rather than up; but it is true that it can also go the other way around.
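As a minimal sketch of how $VIF_{X_i}$ can be computed by hand for the asker's weight/gender setup: regress each predictor on the others and take $1/(1-R^2)$. The data-generating numbers below are my own assumptions, not anything from the question:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
male = rng.integers(0, 2, n).astype(float)              # hypothetical 0/1 indicator
weight = 65 + 12 * male + rng.normal(scale=10, size=n)  # men ~12 kg heavier on average

def vif(target, others):
    """1 / (1 - R^2) from regressing `target` on `others` plus an intercept."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    resid = target - X @ beta
    r_squared = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r_squared)

print("VIF(weight):", vif(weight, [male]))    # only modestly above 1 in this toy setup,
print("VIF(male):  ", vif(male, [weight]))    # i.e. not the troublesome kind of collinearity
```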
answered Dec 18 at 12:21 by Tom Pape
Perfect multicollinearity means the matrix $(X^T X)$ will not be invertible, so an OLS solution doesn't exist.
– Vimal
Dec 18 at 14:47
That is indeed correct, because in this case $VIF_{X_i}$, which comes from regressing $X_i$ on all the other independent variables, would be undefined. However, perfect multicollinearity is very rare in practice, except when working with constructed variables like Value, Cost and Value+Cost all in one model. The far more interesting case is partial multicollinearity.
– Tom Pape
Dec 18 at 15:13
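A tiny numerical sketch of the perfect-collinearity case mentioned in these comments, using the Value, Cost and Value+Cost example (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
value = rng.normal(100, 20, n)
cost = rng.normal(60, 10, n)
total = value + cost                       # exact linear combination of the other two

X = np.column_stack([np.ones(n), value, cost, total])
print(np.linalg.matrix_rank(X), "of", X.shape[1], "columns are independent")   # 3 of 4

# Because one column is redundant, X'X is singular, the textbook OLS solution
# (X'X)^{-1} X'y is not defined, and software typically errors out or drops a column.
```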
This may be too lame ...
In layman's terms: "too many cooks (input series) can spoil the broth."
answered Dec 18 at 14:38 by IrishStat
Wouldn't that be overfitting, not multicollinearity?
– Paze
Dec 18 at 15:27
That's a possibility ..
– IrishStat
Dec 18 at 16:27
When you bring in a second series that is nearly equivalent to the first series, that is both over-fitting and a cause of multicollinearity. I don't know how to define over-fitting other than that the new series doesn't add anything (a second cook, for example, perhaps a student of the first cook) when one cook was enough.
– IrishStat
Dec 18 at 20:03