Finding the states with the three most populous counties

I just started using Python and pandas. My current solution to a problem looks ugly and inefficient, and I would like to know how to improve it. The data file is Census 2010, which can be viewed here.

Question:

Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.
This function should return a list of string values.

Current code

def answer_six():
    cdf = census_df[census_df['SUMLEV'] == 50]
    cdf = cdf.groupby('STNAME')
    cdf = cdf.apply(lambda x:x.sort_values('CENSUS2010POP', ascending=False)).reset_index(drop=True)
    cdf = cdf.groupby('STNAME').head(3)
    cdf = cdf.groupby('STNAME').sum()
    cdf = cdf.sort_values('CENSUS2010POP', axis=0, ascending=False).head(3)
    return list(cdf.index)

edited Jan 3 '17 at 15:35

200_success

asked Jan 3 '17 at 8:39

YohanRoth

python performance beginner csv pandas

4 Answers
nlargest could help: it finds the n largest values in a pandas Series.

In the line of code below, groupby groups the frame by state name, then apply finds the three largest values in column CENSUS2010POP for each state and sums them. The resulting Series has the unique state names as its index and the corresponding top-three-county sums as its values; applying nlargest to it yields a Series with the top three states, as required.

return census_df[census_df['SUMLEV'] == 50].groupby(
    'STNAME')['CENSUS2010POP'].apply(
    lambda x: x.nlargest(3).sum()).nlargest(
    3).index.values.tolist()
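To make the intermediate steps of that one-liner visible, here is a minimal sketch on a tiny frame. The frame `toy_df` and all of its numbers are made up for illustration (they are not from the Census file), and the final step takes the top two states rather than three because the toy data only has three states.

```python
import pandas as pd

# Hypothetical stand-in for census_df; every value here is invented.
# SUMLEV 40 marks a state-level row, SUMLEV 50 a county-level row.
toy_df = pd.DataFrame({
    "SUMLEV": [40, 50, 50, 50, 50, 40, 50, 50, 40, 50, 50, 50],
    "STNAME": ["A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "CENSUS2010POP": [100, 40, 30, 20, 10, 100, 70, 30, 50, 25, 15, 10],
})

# Step 1: keep county rows only.
counties = toy_df[toy_df["SUMLEV"] == 50]

# Step 2: per state, sum the three largest county populations.
top3_sum = counties.groupby("STNAME")["CENSUS2010POP"].apply(
    lambda x: x.nlargest(3).sum())

# Step 3: rank the states by that sum and keep the state names.
result = top3_sum.nlargest(2).index.tolist()

print(top3_sum)
print(result)
```

State A has four counties, so its smallest county is dropped before summing; state B has only two counties, and nlargest(3) simply returns both.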

edited Jan 3 '17 at 21:01

answered Jan 3 '17 at 9:02

sgDysregulation

You should explain a little bit more about the function you're suggesting.
– Mario Santini
Jan 3 '17 at 12:26

SUMLEV is explained here

You definitely want to use nlargest.

The advantage of nlargest is that it performs a partial sort and therefore runs in roughly linear time.
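As a small sanity check of what nlargest returns, the sketch below compares it with a full sort followed by head on a made-up Series (the values and labels are invented). It only shows that the two give the same answer when there are no ties; it does not measure speed.

```python
import pandas as pd

# Invented populations with invented labels.
pop = pd.Series([40, 10, 30, 20, 50], index=["a", "b", "c", "d", "e"])

# Partial selection of the three largest values.
via_nlargest = pop.nlargest(3)

# Full sort, then keep the first three rows.
via_sort = pop.sort_values(ascending=False).head(3)

print(via_nlargest)
print(via_nlargest.equals(via_sort))
```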

However, you don't want to groupby twice. So we'll define a helper function in which we only groupby once.

I'm using .values a lot to access the underlying NumPy objects. This should improve efficiency a bit.

Lastly, I don't like accessing names from the outer scope without a purpose so I'm passing the dataframe as a parameter.

import numpy as np
import pandas as pd

def answer_six(df):
    # subset df to the things I care about
    sumlev = df.SUMLEV.values == 50
    data = df[['CENSUS2010POP', 'STNAME', 'CTYNAME']].values[sumlev]

    # build a pandas Series with state and county in the index;
    # values are from CENSUS2010POP
    s = pd.Series(data[:, 0], [data[:, 1], data[:, 2]], dtype=np.int64)

    # define a function that does the nlargest and the sum in one step,
    # otherwise you'd have to do a second groupby
    def sum_largest(x, n=3):
        return x.nlargest(n).sum()

    return s.groupby(level=0).apply(sum_largest).nlargest(3).index.tolist()

Demonstration

answer_six(census_df)

['California', 'Texas', 'Illinois']
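One detail of the `.values` approach that may be worth spelling out: selecting columns of mixed types and calling `.values` gives a single object-dtype NumPy array, which is presumably why the Series constructor above is told `dtype=np.int64`. A minimal sketch on an invented three-row frame (`toy_df` is hypothetical, not the Census data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one state-level row and two county rows.
toy_df = pd.DataFrame({
    "SUMLEV": [40, 50, 50],
    "STNAME": ["A", "A", "A"],
    "CENSUS2010POP": [70, 40, 30],
})

# Comparing the underlying array yields a plain NumPy boolean mask.
mask = toy_df.SUMLEV.values == 50

# Integer and string columns together become one 2-D object array.
data = toy_df[["CENSUS2010POP", "STNAME"]].values[mask]

# Casting the population column back to integers for numeric work.
pop = pd.Series(data[:, 0], index=data[:, 1], dtype=np.int64)

print(type(mask), data.dtype, pop.dtype)
```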

answered Jan 5 '17 at 20:52

piRSquared

census_df[census_df['SUMLEV'] == 50].groupby(
    'STNAME')['CENSUS2010POP'].apply(
    lambda x: x.nlargest(3).sum()).nlargest(
    3).index.values.tolist()

Above: This seems to be the best way to do it. Below: I found another way that is slightly less elegant, but it helped me understand why sgDysregulation's solution works. I hope it can help you also.

census_df[census_df['SUMLEV'] == 50].groupby(
    'STNAME')['CENSUS2010POP'].nlargest(3).groupby(
    'STNAME').sum().nlargest(3).index.values.tolist()
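The two-step version is easier to follow when the intermediate Series is printed. The sketch below does that on a made-up frame (`toy_df` and its numbers are invented, and only the top two states are taken because the toy data has three). It groups the second time with `level="STNAME"`, which is an explicit way of grouping on the index level that the first step creates.

```python
import pandas as pd

# Hypothetical county-level rows; every value is invented.
toy_df = pd.DataFrame({
    "SUMLEV": [50, 50, 50, 50, 50, 50, 50, 50, 50],
    "STNAME": ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
    "CENSUS2010POP": [40, 30, 20, 10, 70, 30, 25, 15, 10],
})

counties = toy_df[toy_df["SUMLEV"] == 50]

# Step 1: the three largest counties per state. The result is a Series
# whose index has two levels: the state name and the original row label.
top3 = counties.groupby("STNAME")["CENSUS2010POP"].nlargest(3)

# Step 2: group again on the state level and sum the kept counties.
per_state = top3.groupby(level="STNAME").sum()

# Step 3: rank the states by that sum.
result = per_state.nlargest(2).index.tolist()

print(top3)
print(per_state)
print(result)
```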

answered Sep 21 '17 at 7:53

reka18

Welcome to Code Review! You have presented an alternative solution, but haven't reviewed the code. Please edit it to explain your reasoning (how your solution works and how it improves upon the original) so that everyone can learn from your thought process.
– Toby Speight
Sep 21 '17 at 8:55

This is from the data:

New York    19378102

Illinois    12830632

How is your result excluding New York and including Illinois?

['California', 'Texas', 'Illinois'] != correct answer
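A possible reading, sketched on made-up numbers: the two figures quoted above match whole-state 2010 totals, whereas the question sums only the three most populous counties of each state. Under that reading the two rankings can disagree. The frame below is hypothetical (invented values, placeholder state names X and Y) and only illustrates the effect.

```python
import pandas as pd

# Invented data: X has many mid-sized counties, Y has one very large county.
toy = pd.DataFrame({
    "STNAME": ["X"] * 5 + ["Y"] * 3,
    "CENSUS2010POP": [30, 30, 30, 30, 30, 100, 10, 5],
})

by_state = toy.groupby("STNAME")["CENSUS2010POP"]

# Ranking by whole-state population.
totals = by_state.sum()

# Ranking by the sum of the three largest counties only.
top3 = by_state.apply(lambda x: x.nlargest(3).sum())

print(totals.idxmax(), top3.idxmax())
```

Here X leads on total population while Y leads on the top-three-county sum, so a state with a larger total can still rank lower under the question's rule.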

edited 3 mins ago

Sᴀᴍ Onᴇᴌᴀ

answered 42 mins ago

Jack Andrew
