How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?












14














I have this script, written in Python 3:

    url = "https://en.wikipedia.org/wiki/Mathematics"
    response = simple_get(url)
    result = {}
    result["url"] = url
    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        title = html.select("#firstHeading")[0].text

As you can see, I can get the article's title, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." down to the table of contents.










      python web-scraping beautifulsoup














      asked Dec 16 at 17:17 by wiki one
      edited Dec 17 at 13:43 by Boann
























          6 Answers


















          13














Select the <p> tags. There are 52 of them on the page. I'm not sure whether you want the whole article, but you can iterate over those tags and store the text however you like. I just print each one to show the output.

    import bs4
    import requests

    response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

    if response is not None:
        html = bs4.BeautifulSoup(response.text, 'html.parser')

        title = html.select("#firstHeading")[0].text
        paragraphs = html.select("p")
        for para in paragraphs:
            print(para.text)

        # just grab the text up to the contents table, as stated in the question
        intro = '\n'.join([para.text for para in paragraphs[0:5]])
        print(intro)
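
The slice paragraphs[0:5] above depends on how many paragraphs the page currently has before the contents table. As a minimal sketch of a less position-dependent variant, the function below collects paragraphs until it meets the table of contents or the first section heading. It assumes the article body sits in an element with id "mw-content-text" and that the contents table has id "toc"; both are assumptions about the page markup, which may change, and intro_paragraphs and SAMPLE are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup


def intro_paragraphs(html):
    """Return the non-empty paragraphs that come before the contents table."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the article body lives in #mw-content-text; fall back to the whole page.
    content = soup.select_one("#mw-content-text")
    if content is None:
        content = soup
    intro = []
    # find_all returns the matching elements in document order.
    for element in content.find_all(["p", "h2", "div"]):
        # Stop at the first section heading or at the contents table (assumed id "toc").
        if element.name == "h2" or element.get("id") == "toc":
            break
        if element.name == "p":
            text = element.get_text().strip()
            if text:
                intro.append(text)
    return intro


# A tiny stand-in for a downloaded page, so the sketch runs without network access.
SAMPLE = """
<div id="mw-content-text">
  <p></p>
  <p>Mathematics includes the study of quantity.</p>
  <p>Mathematicians seek out patterns.</p>
  <div class="toclimit-3"><div id="toc"><p>Contents</p></div></div>
  <h2>History</h2>
  <p>The history of mathematics is long.</p>
</div>
"""

print("\n".join(intro_paragraphs(SAMPLE)))
```

With a real page, the string passed in would be response.text from the requests call above.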
























          • 3




            if response is not None can be rewritten as if response. Also, since the content may change in the future, I would suggest getting the entire div, reading only the p elements, and stopping when you reach the div with class "toclimit-3".
            – PinoSan
            Dec 16 at 17:42






          • 4




            @PinoSan I think it doesn't hurt to check for None explicitly. For example bool('' is not None) is not the same as bool(''). However, in this case the None check is completely unnecessary because response will always be a requests.models.Response object. If the request fails an exception will be raised.
            – t.m.adam
            Dec 16 at 18:42












          • @t.m.adam What you are saying is true, but as you said, the response is not a string. So you just want to check that it is a valid object, not an empty string, None, or an empty dictionary. About the exceptions, I agree we should catch exceptions in case of network errors, but we should also check that the status code is 200.
            – PinoSan
            Dec 16 at 19:17












          • @PinoSan Of course, and I too prefer the if response style, but you know, "Explicit is better than implicit". The problem with if response is that it may produce strange errors that are difficult to debug. But yes, in most cases a simple boolean check should be enough.
            – t.m.adam
            Dec 16 at 19:22












          • Just because you can scrape the page doesn't mean you should. The Wikipedia API has Python packages that allow easy and direct access to articles without undue load on the site or extra work on your end.
            – Noah B. Johnson
            Dec 18 at 1:56
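
The difference discussed in the comments above, between an explicit None check and a plain truthiness check, can be seen in a small sketch. FakeResponse is a hypothetical stand-in and not part of requests; it assumes, as the requests documentation describes, that a Response object is truthy only when its status code is below 400. The helper name describe is likewise made up for illustration.

```python
class FakeResponse:
    """Hypothetical stand-in that mimics the truthiness of a requests Response."""

    def __init__(self, status_code):
        self.status_code = status_code

    def __bool__(self):
        # Assumption: truthy only for status codes below 400.
        return self.status_code < 400


def describe(value):
    """Return (result of `value is not None`, result of `bool(value)`)."""
    return (value is not None, bool(value))


# An empty string is not None, yet it is falsy.
print(describe(""))
# None fails both checks.
print(describe(None))
# A "404" response object exists, so it passes the None check but is falsy.
print(describe(FakeResponse(404)))
# A "200" response passes both checks.
print(describe(FakeResponse(200)))
```

So the two checks answer different questions: "is not None" asks whether an object was returned at all, while the truthiness check depends on how the object defines its own boolean value.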



















          25














          There is a much, much easier way to get information from Wikipedia: the Wikipedia API.

          There is this Python wrapper, which lets you do it in just a few lines with zero HTML parsing:



          import wikipediaapi

          wiki_wiki = wikipediaapi.Wikipedia('en')

          page = wiki_wiki.page('Mathematics')
          print(page.summary)


          Prints:




          Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning")
          includes the study of such topics as quantity, structure, space, and
          change...(omitted intentionally)




          And, in general, try to avoid screen-scraping if there's a direct API available.





































            13














            Use the wikipedia library:



            import wikipedia
            #print(wikipedia.summary("Mathematics"))
            #wikipedia.search("Mathematics")
            print(wikipedia.page("Mathematics").content)
























            • 3




              I'd use wikipediaapi instead; the wikipedia module seems to be unmaintained. Both would do the job in a similar manner, though.
              – alecxe
              Dec 16 at 17:49





















            4














You can get the desired output using the lxml library, as follows:

    import requests
    from lxml.html import fromstring

    url = "https://en.wikipedia.org/wiki/Mathematics"

    res = requests.get(url)
    source = fromstring(res.content)
    # select each <p> whose second following <h2> is the "History" heading
    paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
    print(paragraph)

Using BeautifulSoup:

    from bs4 import BeautifulSoup
    import requests

    res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
    soup = BeautifulSoup(res.text, 'html.parser')
    for item in soup.find_all("p"):
        # stop at the first paragraph of the "History" section
        if item.text.startswith("The history"):
            break
        print(item.text)




































              2














              What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it:




              • Probably the easiest way in your case is to include the parameter action=render in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This will give you just the content HTML, and nothing else.


              • Alternatively, you can also obtain the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.



              The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you'd like to have a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this:



              https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories



              (To also get the page content with the same request, use prop=langlinks|categories|text.)



              There are several Python libraries for the MediaWiki API that can automate some of the nitty-gritty details of using it, although the feature sets they support vary. That said, using the API directly from your code, without a library in between, is perfectly possible too.
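
As a minimal sketch of calling the API directly, the snippet below only builds the request URL for action=parse with a chosen list of properties, using the standard library. The function name build_parse_url and its parameters are made up for illustration; the parameter names in the query string (format, action, page, prop) are the ones used in the URLs above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, props=("langlinks", "categories"), fmt="json"):
    """Return an action=parse URL asking for the given properties of a page."""
    params = {
        "format": fmt,
        "action": "parse",
        "page": page,
        "prop": "|".join(props),  # multiple values are separated by "|"
    }
    # urlencode percent-encodes the "|" separator and any special characters.
    return API + "?" + urlencode(params)


url = build_parse_url("Mathematics", props=("langlinks", "categories", "text"))
print(url)

# Decoding the query string again shows the values the API will receive.
query = parse_qs(urlsplit(url).query)
print(query["prop"])
```

The resulting URL can then be fetched with urllib.request.urlopen or requests.get and the body decoded as JSON or XML, depending on the format parameter.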



































                2














For a cleaner approach using functions, you can just use the JSON API offered by Wikipedia:

    from urllib.request import urlopen
    from urllib.parse import urlencode
    from json import loads


    def getJSON(page):
        # build the query string for the MediaWiki "parse" action
        params = urlencode({
            'format': 'json',
            'action': 'parse',
            'prop': 'text',
            'redirects': 'true',
            'page': page})
        API = "https://en.wikipedia.org/w/api.php"
        response = urlopen(API + "?" + params)
        return response.read().decode('utf-8')


    def getRawPage(page):
        parsed = loads(getJSON(page))
        try:
            title = parsed['parse']['title']
            content = parsed['parse']['text']['*']
            return title, content
        except KeyError:
            # The page doesn't exist
            return None, None

    title, content = getRawPage("Mathematics")

You can then parse it with any library you want to extract what you need :)
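
As one possible follow-up, here is a minimal sketch that pulls the paragraph text out of the returned HTML using only the standard library's html.parser. The class ParagraphText and the function paragraphs_from_html are hypothetical names, and the sample string merely stands in for the content value returned by getRawPage; Beautiful Soup or lxml would work just as well.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of every non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._chunks).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> or <a> also arrives here.
        if self._in_p:
            self._chunks.append(data)


def paragraphs_from_html(content):
    parser = ParagraphText()
    parser.feed(content)
    parser.close()
    return parser.paragraphs


# Stand-in for the HTML string returned by the API.
sample = (
    '<div><p><b>Mathematics</b> includes the study of '
    '<a href="#">quantity</a>.</p><p> </p><h2>History</h2><p>Second.</p></div>'
)
for paragraph in paragraphs_from_html(sample):
    print(paragraph)
```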






The first answer (selecting the <p> tags) was posted by chitown88: answered Dec 16 at 17:24, edited Dec 16 at 17:33.








answered Dec 16 at 17:47 by alecxe (edited Dec 17 at 0:10)





































Use the wikipedia library:

    import wikipedia

    # Alternatives: a short summary, or a title search
    # print(wikipedia.summary("Mathematics"))
    # wikipedia.search("Mathematics")

    # Full plain-text content of the article
    print(wikipedia.page("Mathematics").content)





• I'd use wikipediaapi instead; the wikipedia module seems to be unmaintained. Both would do the job in a similar manner, though. – alecxe, Dec 16 at 17:49


























answered Dec 16 at 17:26 by QHarr (edited Dec 16 at 17:34)






















You can get the desired output using the lxml library as follows:

    import requests
    from lxml.html import fromstring

    url = "https://en.wikipedia.org/wiki/Mathematics"

    res = requests.get(url)
    source = fromstring(res.content)
    # Select each <p> whose second following <h2> is the "History" heading (the lead section)
    paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
    print(paragraph)

Using BeautifulSoup:

    from bs4 import BeautifulSoup
    import requests

    res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
    soup = BeautifulSoup(res.text, 'html.parser')
    for item in soup.find_all("p"):
        # Stop at the first paragraph of the "History" section
        if item.text.startswith("The history"):
            break
        print(item.text)
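Matching a hard-coded sentence such as "The history" can break as soon as the article is edited. A sketch of an alternative stopping rule, collecting paragraphs until the first <h2>, is shown below. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the class name LeadParagraphs, the function lead_text and the sample markup are invented for illustration, and it assumes the input is article content HTML in which no <h2> precedes the lead.

```python
from html.parser import HTMLParser


class LeadParagraphs(HTMLParser):
    """Collect the text of every <p> that appears before the first <h2>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs
        self._buffer = None    # text chunks of the <p> being read, or None
        self._done = False     # set once the first <h2> is seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "h2":
            self._done = True
            self._buffer = None
        elif tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def lead_text(content_html):
    """Return the lead paragraphs of content_html joined by newlines."""
    parser = LeadParagraphs()
    parser.feed(content_html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Hand-written sample markup (not real Wikipedia output)
sample = (
    '<div class="mw-parser-output">'
    "<p><b>Mathematics</b> is an area of knowledge.</p>"
    "<p></p>"
    "<h2>History</h2>"
    "<p>The history of mathematics ...</p>"
    "</div>"
)
print(lead_text(sample))
```

Text inside nested tags (bold, links, footnote markers) is kept as-is, so reference markers such as [1] would still need stripping in real use.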













answered Dec 16 at 19:14 by SIM (edited Dec 16 at 19:54)





































What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it:

• Probably the easiest way in your case is to include the parameter action=render in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This will give you just the content HTML, and nothing else.

• Alternatively, you can also obtain the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.

The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you'd like to have a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this:

    https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories

(To also get the page content with the same request, use prop=langlinks|categories|text.)

There are several Python libraries for using the MediaWiki API that can automate some of the nitty-gritty details of using it, although the feature set they support can vary. That said, just using the API directly from your code without a library in between is perfectly possible, too.
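As a rough illustration of the two URL forms described above, the following sketch builds them with the standard library. The helper names render_url and parse_url are hypothetical, and the format defaults to JSON here purely as a convenience for Python callers.

```python
from urllib.parse import quote, urlencode

API = "https://en.wikipedia.org/w/api.php"


def render_url(title):
    """URL of the bare content HTML (the action=render variant)."""
    # Page URLs use underscores where titles have spaces
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_")) + "?action=render"


def parse_url(title, props=("text",), fmt="json"):
    """URL for action=parse returning the requested props, joined with '|'."""
    params = urlencode({
        "format": fmt,
        "action": "parse",
        "page": title,
        "prop": "|".join(props),
    })
    return API + "?" + params


print(render_url("Mathematics"))
print(parse_url("Mathematics", ("langlinks", "categories", "text"), fmt="xml"))
```

Note that urlencode percent-encodes the `|` separator as `%7C`; this is an ordinary URL escape that the server decodes back to `|`.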
















answered Dec 17 at 2:17 by Ilmari Karonen





































For a proper approach wrapped in functions, you can just use the JSON API offered by Wikipedia:

    from urllib.request import urlopen
    from urllib.parse import urlencode
    from json import loads


    def getJSON(page):
        # Ask the MediaWiki API for the parsed HTML of the page, following redirects
        params = urlencode({
            'format': 'json',
            'action': 'parse',
            'prop': 'text',
            'redirects': 'true',
            'page': page})
        API = "https://en.wikipedia.org/w/api.php"
        response = urlopen(API + "?" + params)
        return response.read().decode('utf-8')


    def getRawPage(page):
        parsed = loads(getJSON(page))
        try:
            title = parsed['parse']['title']
            content = parsed['parse']['text']['*']
            return title, content
        except KeyError:
            # The page doesn't exist
            return None, None

    title, content = getRawPage("Mathematics")

You can then parse it with any library you want to extract what you need :)
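Catching KeyError hides why a lookup failed. A possible variation, sketched below, inspects the reply for an error object and reports it. It assumes that failed requests come back as JSON with a top-level "error" object holding "code" and "info" fields; the function name title_and_html and both sample payloads are hand-written for illustration, not real API output.

```python
import json


def title_and_html(raw_json):
    """Return (title, html) from an action=parse JSON reply, or raise LookupError."""
    payload = json.loads(raw_json)
    if "error" in payload:
        err = payload["error"]
        raise LookupError(f"{err.get('code')}: {err.get('info')}")
    parsed = payload["parse"]
    return parsed["title"], parsed["text"]["*"]


# Hand-written replies in the assumed shapes
ok = json.dumps({"parse": {"title": "Mathematics", "text": {"*": "<p>Hello</p>"}}})
missing = json.dumps({"error": {"code": "missingtitle",
                                "info": "The page you specified doesn't exist."}})

print(title_and_html(ok))
try:
    title_and_html(missing)
except LookupError as exc:
    print("lookup failed:", exc)
```

Raising an exception instead of returning (None, None) forces the caller to handle the failure explicitly; returning a sentinel as in the answer above is equally workable for quick scripts.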
















answered Dec 17 at 17:55 by LaSul





























