Finding the states with the three most populous counties
I just started using Python and pandas. My current solution to a problem looks ugly and inefficient, and I would like to know how to improve it. The data file is from the 2010 Census and can be viewed here.
Question:
Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.
This function should return a list of string values.
Current code
def answer_six():
    cdf = census_df[census_df['SUMLEV'] == 50]
    cdf = cdf.groupby('STNAME')
    cdf = cdf.apply(lambda x: x.sort_values('CENSUS2010POP', ascending=False)).reset_index(drop=True)
    cdf = cdf.groupby('STNAME').head(3)
    cdf = cdf.groupby('STNAME').sum()
    cdf = cdf.sort_values('CENSUS2010POP', axis=0, ascending=False).head(3)
    return list(cdf.index)
python performance beginner csv pandas
edited Jan 3 '17 at 15:35
200_success
asked Jan 3 '17 at 8:39
YohanRoth
4 Answers
nlargest could help: it returns the n largest values in a pandas Series.
In the code below, groupby groups the county rows by state name, and apply takes the three largest values of CENSUS2010POP in each group and sums them. The result is a Series indexed by state name whose values are the sums of each state's three most populous counties. Applying nlargest to that Series yields the top three states, as required.
def answer_six():
    return (census_df[census_df['SUMLEV'] == 50]
            .groupby('STNAME')['CENSUS2010POP']
            .apply(lambda x: x.nlargest(3).sum())
            .nlargest(3)
            .index.tolist())
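To see what each step of that chain produces, here is a minimal sketch on a toy frame. The state names and numbers are made up for illustration and are not census figures, and `toy_df` and `top_states` are hypothetical names that do not appear in the original post.

```python
import pandas as pd

# Made-up data: SUMLEV 40 marks a state-level row, 50 a county-level row.
toy_df = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 50, 50, 40, 50, 50, 40, 50, 50, 50],
    'STNAME': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'CENSUS2010POP': [100, 40, 30, 20, 10, 95, 90, 5, 60, 25, 20, 15],
})


def top_states(df, n_counties=3, n_states=3):
    # Keep county rows only, so state totals are not counted as counties.
    counties = df[df['SUMLEV'] == 50]
    # One value per state: the sum of its n_counties largest counties.
    per_state = (counties.groupby('STNAME')['CENSUS2010POP']
                 .apply(lambda x: x.nlargest(n_counties).sum()))
    # Rank the states by that sum and return their names.
    return per_state.nlargest(n_states).index.tolist()


result = top_states(toy_df, n_states=2)
print(result)
```

Note that state B has only two county rows; `nlargest(3)` simply returns both, so states with fewer than three counties need no special handling.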
edited Jan 3 '17 at 21:01
answered Jan 3 '17 at 9:02
sgDysregulation
You should explain a little bit more about the function you're suggesting.
– Mario Santini
Jan 3 '17 at 12:26
SUMLEV is explained here.
You definitely want to use nlargest. Its advantage is that it performs a partial sort, so it runs in roughly linear time rather than the O(n log n) of a full sort.
However, you don't want to groupby twice, so we'll define a helper function that lets us groupby only once.
I'm using .values a lot to access the underlying numpy objects, which improves efficiency a bit.
Lastly, I don't like accessing names from the outer scope without a purpose, so I'm passing the dataframe as a parameter.
import numpy as np
import pandas as pd

def answer_six(df):
    # subset df to the rows (SUMLEV == 50) and columns I care about
    sumlev = df.SUMLEV.values == 50
    data = df[['CENSUS2010POP', 'STNAME', 'CTYNAME']].values[sumlev]
    # build a pandas Series with state and county in the index;
    # values are from CENSUS2010POP
    s = pd.Series(data[:, 0], [data[:, 1], data[:, 2]], dtype=np.int64)
    # define a function that does the nlargest and sum in one step;
    # otherwise you'd have to do a second groupby
    def sum_largest(x, n=3):
        return x.nlargest(n).sum()
    return s.groupby(level=0).apply(sum_largest).nlargest(3).index.tolist()
Demonstration
answer_six(census_df)
['California', 'Texas', 'Illinois']
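The two ideas in this answer, that nlargest picks the same values a full sort would and that grouping on an index level needs only one groupby, can be checked on a small Series. This is a sketch with made-up values; the names `via_nlargest`, `via_sort` and `per_state` are hypothetical.

```python
import pandas as pd

# Made-up populations indexed by (state, county); not real census data.
s = pd.Series(
    [40, 30, 20, 10, 90, 5],
    index=[['A', 'A', 'A', 'A', 'B', 'B'],
           ['a1', 'a2', 'a3', 'a4', 'b1', 'b2']],
)

# nlargest(3) selects the same values as a full descending sort plus head(3).
via_nlargest = s.nlargest(3)
via_sort = s.sort_values(ascending=False).head(3)

# Grouping on index level 0 (the state) needs only one groupby.
per_state = s.groupby(level=0).apply(lambda x: x.nlargest(3).sum())
print(per_state.to_dict())
```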
answered Jan 5 '17 at 20:52
piRSquared
(census_df[census_df['SUMLEV'] == 50]
    .groupby('STNAME')['CENSUS2010POP']
    .apply(lambda x: x.nlargest(3).sum())
    .nlargest(3).index.tolist())
The snippet above seems to be the best way to do it. I found another way, shown below, that is slightly less elegant, but it helped me understand why sgDysregulation's solution works. I hope it helps you too.
(census_df[census_df['SUMLEV'] == 50]
    .groupby('STNAME')['CENSUS2010POP'].nlargest(3)
    .groupby(level='STNAME').sum()
    .nlargest(3).index.tolist())
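The two-step version is easier to follow when the intermediate result is inspected. Below is a minimal sketch on made-up data (the frame `toy_df` and the names `top3` and `sums` are hypothetical): the grouped nlargest returns a Series whose index has two levels, the state name and the original row label, which is why a second grouping on the state level is needed before summing.

```python
import pandas as pd

# Made-up county rows; not real census data.
toy_df = pd.DataFrame({
    'STNAME': ['A', 'A', 'A', 'A', 'B', 'B'],
    'CENSUS2010POP': [40, 30, 20, 10, 90, 5],
})

# Step 1: per-state top 3; the result is indexed by (STNAME, original row label).
top3 = toy_df.groupby('STNAME')['CENSUS2010POP'].nlargest(3)

# Step 2: regroup on the STNAME index level and add up each state's values.
sums = top3.groupby(level='STNAME').sum()
print(sums.to_dict())
```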
answered Sep 21 '17 at 7:53
reka18
Welcome to Code Review! You have presented an alternative solution, but haven't reviewed the code. Please edit it to explain your reasoning (how your solution works and how it improves upon the original) so that everyone can learn from your thought process.
– Toby Speight
Sep 21 '17 at 8:55
This is from the data:
New York 19378102
Illinois 12830632
How does your result exclude New York and include Illinois?
['California', 'Texas', 'Illinois'] != correct answer
New contributor
edited 3 mins ago
Sᴀᴍ Onᴇᴌᴀ
answered 42 mins ago
Jack Andrew
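The two figures quoted in this answer appear to be whole-state totals, whereas the question ranks states by the sum of their three most populous counties only, so the two rankings need not agree. A toy sketch with made-up numbers (the frame `counties` and the state names X and Y are hypothetical) shows how a state can lead on total population yet trail on its top three counties:

```python
import pandas as pd

# Made-up county populations: state X is spread over many mid-sized counties,
# state Y is concentrated in one large county.
counties = pd.DataFrame({
    'STNAME': ['X'] * 6 + ['Y'] * 3,
    'CENSUS2010POP': [30, 30, 30, 30, 30, 30, 100, 10, 5],
})

grouped = counties.groupby('STNAME')['CENSUS2010POP']
total = grouped.sum()                                # whole-state population
top3 = grouped.apply(lambda x: x.nlargest(3).sum())  # three largest counties only

print(total.idxmax(), top3.idxmax())
```

Here X has the larger total (180 against 115), but Y has the larger top-three sum (115 against 90).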