How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?
I have this script made in Python 3:
    response = simple_get("https://en.wikipedia.org/wiki/Mathematics")
    result = {}
    result["url"] = url
    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        title = html.select("#firstHeading")[0].text
As you can see I can get the title from the article, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." to the contents table...
python web-scraping beautifulsoup
asked Dec 16 at 17:17 by wiki one
edited Dec 17 at 13:43 by Boann
6 Answers
Select the <p> tags. There are 52 of them on that page. I'm not sure whether you want all of them, but you can iterate through the tags and store the text however you like. I just print each one here to show the output.
    import bs4
    import requests

    response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

    if response is not None:
        html = bs4.BeautifulSoup(response.text, 'html.parser')

        title = html.select("#firstHeading")[0].text
        paragraphs = html.select("p")
        for para in paragraphs:
            print(para.text)

        # just grab the text up to contents as stated in question
        intro = '\n'.join([para.text for para in paragraphs[0:5]])
        print(intro)
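The slice `paragraphs[0:5]` depends on how many paragraphs happen to precede the table of contents today. A minimal sketch of a less brittle variant follows: it walks the `<p>` and `<h2>` elements in document order and stops at the first `<h2>`. The `SAMPLE_HTML` string and the `lead_paragraphs` helper are made up for illustration, and the real Wikipedia markup is more complicated and may change, so treat this as a starting point rather than a tested scraper.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for an article page (an assumption, not real Wikipedia markup).
SAMPLE_HTML = """
<h1 id="firstHeading">Mathematics</h1>
<div id="content">
  <p>Mathematics is an area of knowledge.</p>
  <p>It includes numbers, formulas and shapes.</p>
  <h2>History</h2>
  <p>The history of mathematics is long.</p>
</div>
"""


def lead_paragraphs(html_text):
    """Return the text of every <p> that appears before the first <h2>."""
    soup = BeautifulSoup(html_text, "html.parser")
    lead = []
    # find_all with a list of names returns matching tags in document order
    for tag in soup.find_all(["p", "h2"]):
        if tag.name == "h2":
            break  # first section heading reached: the lead section is over
        text = tag.get_text(strip=True)
        if text:  # skip empty placeholder paragraphs
            lead.append(text)
    return lead


intro = "\n".join(lead_paragraphs(SAMPLE_HTML))
print(intro)
```

On a live page you would pass `response.text` instead of `SAMPLE_HTML`, and you may need to restrict the search to the article's content container first so that headings in the page chrome do not end the loop early.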
`if response is not None` can be rewritten as `if response`. Also, since the content may change in the future, I would have suggested getting the entire div, reading only the `p` elements, and stopping when you reach the div with class "toclimit-3". – PinoSan, Dec 16 at 17:42
@PinoSan I think it doesn't hurt to check for None explicitly. For example, `bool('' is not None)` is not the same as `bool('')`. However, in this case the None check is completely unnecessary because `response` will always be a `requests.models.Response` object. If the request fails, an exception will be raised. – t.m.adam, Dec 16 at 18:42
@t.m.adam What you are saying is true, but as you said, the response is not a string. So you just wanted to check that it was a valid object, not an empty string, None or an empty dictionary, ... About the exceptions, I agree we should check for exceptions in case of network errors, but we should also check that the status code is 200. – PinoSan, Dec 16 at 19:17
@PinoSan Of course, and I too prefer the `if response` style, but you know, "Explicit is better than implicit." The problem with `if response` is that it may produce strange errors that are difficult to debug. But yes, in most cases a simple boolean check should be enough. – t.m.adam, Dec 16 at 19:22
Just because you can scrape the page doesn't mean you should. The Wikipedia API has Python packages that allow for easy and direct access to articles without undue load on the site or extra work on your end. – Noah B. Johnson, Dec 18 at 1:56
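The difference the comments are debating can be seen with plain Python values. The sketch below compares an explicit `is not None` check with a truthiness check; the `describe` helper is hypothetical and exists only for this demonstration.

```python
def describe(value):
    """Return (passes `is not None`, passes truthiness check) for a value."""
    return (value is not None, bool(value))


# None fails both checks; empty strings and containers fail only the truthiness check.
for candidate in (None, "", {}, "<html></html>"):
    print(repr(candidate), describe(candidate))
```

For a `requests` response specifically, truthiness follows `response.ok`, so `if response:` is false for 4xx and 5xx status codes, while calling `response.raise_for_status()` turns such responses into exceptions.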
There is a much easier way to get information from Wikipedia: the Wikipedia API.
There is this Python wrapper, which lets you do it in a few lines with zero HTML parsing:
    import wikipediaapi

    # Newer releases of this package may also require a user_agent argument here.
    wiki_wiki = wikipediaapi.Wikipedia('en')

    page = wiki_wiki.page('Mathematics')
    print(page.summary)
Prints:
    Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning")
    includes the study of such topics as quantity, structure, space, and
    change...(omitted intentionally)
And, in general, try to avoid screen-scraping if there's a direct API available.
Use the wikipedia library:
    import wikipedia

    #print(wikipedia.summary("Mathematics"))
    #wikipedia.search("Mathematics")
    print(wikipedia.page("Mathematics").content)
I'd use `wikipediaapi` instead; the `wikipedia` module seems to be unmaintained. Though, both would do the job in a similar manner. – alecxe, Dec 16 at 17:49
You can get the desired output using the lxml library, as follows.
    import requests
    from lxml.html import fromstring

    url = "https://en.wikipedia.org/wiki/Mathematics"

    res = requests.get(url)
    source = fromstring(res.content)
    # keep only the paragraphs that come before the "History" heading
    paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
    print(paragraph)
Using BeautifulSoup:
    from bs4 import BeautifulSoup
    import requests

    res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
    soup = BeautifulSoup(res.text, 'html.parser')
    for item in soup.find_all("p"):
        # stop at the first paragraph of the History section
        if item.text.startswith("The history"):
            break
        print(item.text)
What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it:
Probably the easiest way in your case is to include the parameter action=render in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This will give you just the content HTML, and nothing else.
Alternatively, you can also obtain the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.
The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you'd like to have a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this:
    https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories
(To also get the page content with the same request, use prop=langlinks|categories|text.)
There are several Python libraries for using the MediaWiki API that can automate some of the nitty-gritty details of using it, although the feature sets they support can vary. That said, just using the API directly from your code without a library in between is perfectly possible, too.
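As a rough sketch of the two options described above, the URLs can be assembled with the standard library. The helper names `render_url` and `parse_api_url` are made up here; the query parameters are the ones quoted in the answer.

```python
from urllib.parse import quote, urlencode

WIKI_BASE = "https://en.wikipedia.org"


def render_url(title):
    """URL that returns only the rendered content HTML of a page."""
    return f"{WIKI_BASE}/wiki/{quote(title)}?action=render"


def parse_api_url(title, props=("text",), fmt="xml"):
    """URL for the MediaWiki API's action=parse with the given prop values."""
    params = {
        "format": fmt,
        "action": "parse",
        "page": title,
        "prop": "|".join(props),  # multiple props are separated by "|"
    }
    # urlencode percent-encodes "|" as %7C
    return f"{WIKI_BASE}/w/api.php?{urlencode(params)}"


print(render_url("Mathematics"))
print(parse_api_url("Mathematics", props=("langlinks", "categories")))
```

Either URL can then be fetched with `requests.get` or `urllib.request.urlopen`; passing `fmt="json"` is usually more convenient when the result is consumed from Python.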
For a cleaner, function-based approach, you can use the JSON API offered by Wikipedia:
    from urllib.request import urlopen
    from urllib.parse import urlencode
    from json import loads

    def getJSON(page):
        params = urlencode({
            'format': 'json',
            'action': 'parse',
            'prop': 'text',
            'redirects': 'true',
            'page': page})
        API = "https://en.wikipedia.org/w/api.php"
        response = urlopen(API + "?" + params)
        return response.read().decode('utf-8')

    def getRawPage(page):
        parsed = loads(getJSON(page))
        try:
            title = parsed['parse']['title']
            content = parsed['parse']['text']['*']
            return title, content
        except KeyError:
            # The page doesn't exist
            return None, None

    title, content = getRawPage("Mathematics")
You can then parse it with any library you want to extract what you need :)
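To make the shape of the response concrete, here is a small offline sketch of the same key lookups applied to two hand-written payloads. Both payload strings are invented stand-ins that only mimic the structure used above (`parse`, then `title` and `text`, then `*`); a real response contains more fields and a much longer HTML fragment.

```python
from json import loads

# Hand-written stand-ins for API responses (assumptions, heavily shortened).
FOUND = '{"parse": {"title": "Mathematics", "text": {"*": "<p>Mathematics is...</p>"}}}'
MISSING = '{"error": {"code": "missingtitle"}}'


def title_and_content(raw_json):
    """Return (title, html) from a parse response, or (None, None) if absent."""
    parsed = loads(raw_json)
    try:
        return parsed["parse"]["title"], parsed["parse"]["text"]["*"]
    except KeyError:
        # No "parse" key, for example because the page does not exist
        return None, None


print(title_and_content(FOUND))
print(title_and_content(MISSING))
```

The HTML string returned as the second element is what you would hand to `BeautifulSoup(content, 'html.parser')` to pull out the paragraphs.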
answered Dec 16 at 17:24 by chitown88, edited Dec 16 at 17:33 (the answer that selects the <p> tags)
answered Dec 16 at 17:47 by alecxe, edited Dec 17 at 0:10 (the wikipediaapi answer)
answered Dec 16 at 17:26 by QHarr, edited Dec 16 at 17:34 (the wikipedia library answer)
answered Dec 16 at 19:14 by SIM, edited Dec 16 at 19:54 (the lxml and BeautifulSoup answer)
answered Dec 17 at 2:17
Ilmari Karonen
A cleaner approach is to wrap the request in functions and use the JSON API offered by Wikipedia:
from json import loads
from urllib.parse import urlencode
from urllib.request import urlopen

def getJSON(page):
    # Build the query string for action=parse, following redirects
    params = urlencode({
        'format': 'json',
        'action': 'parse',
        'prop': 'text',
        'redirects': 'true',
        'page': page})
    API = "https://en.wikipedia.org/w/api.php"
    # Fetch the response and decode the body as UTF-8 text
    with urlopen(API + "?" + params) as response:
        return response.read().decode('utf-8')

def getRawPage(page):
    parsed = loads(getJSON(page))
    try:
        title = parsed['parse']['title']
        content = parsed['parse']['text']['*']
        return title, content
    except KeyError:
        # No 'parse' key: the API returned an error (e.g. the page doesn't exist)
        return None, None

title, content = getRawPage("Mathematics")
You can then parse the returned HTML with any library you want to extract what you need :)
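For instance, to get only the introduction (the text before the table of contents), one option is to collect the paragraphs that come before the first h2 heading. The sketch below uses the standard library's html.parser rather than Beautiful Soup so that it runs without extra installs. LeadTextParser and lead_text are names invented for this example, and it assumes that the lead paragraphs are p elements and that the first h2 marks either the contents box or the first section.

```python
from html.parser import HTMLParser


class LeadTextParser(HTMLParser):
    """Collect the text of <p> elements that appear before the first <h2>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished, non-empty paragraphs
        self._buffer = None   # text pieces of the open <p>, or None outside one
        self._done = False    # set once the first <h2> has been seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "h2":
            # Everything from here on belongs to the contents box or a section.
            self._done = True
            self._buffer = None
        elif tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        # Text inside nested tags (<b>, <a>, <sup>, ...) is kept as well.
        if self._buffer is not None:
            self._buffer.append(data)


def lead_text(content_html):
    """Return the lead paragraphs of an article's HTML as plain text."""
    parser = LeadTextParser()
    parser.feed(content_html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Calling lead_text(content) on the content returned by getRawPage should give the introduction as plain text. Footnote markers such as [1] are kept, so strip them separately if you don't want them.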
answered Dec 17 at 17:55
LaSul