How to extract data between two different xml tags
I have looked but haven't been able to find anyone else with the same sort of problem I have.
I have an xml file like this:
<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>3</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>4</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>
Basically a whole bunch of data all on one line, no line breaks.
I need to extract the info (preferably as-is, with tags intact) between a specific <ID> tag (e.g. <ID>2) and the very next </dateAccessed> tag. I have about 50 files to check for a particular ID and the related data that follows it. I get that this is not standard; there is no nesting.
I originally tried to do this using grep and sed, but I just get the whole file returned, which seems odd to me. Can't I just treat this like a text file?
EDIT:
I didn't realise the formatter removed text that was in enclosing < and > , so after re-reading my question this morning, I realised it's asking something completely different.
TL;DR
I need what is between a specific value in ID tags and the next closing dateAccessed tag, not what is between matching opening and closing tags (i.e. not between ID and /ID).
So I can get something like this result:
<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>
text-processing xml
edited Feb 27 '17 at 23:36
asked Feb 27 '17 at 1:37
averagescripter
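On the "can't I just treat this like a text file?" point: grep and sed work line by line, and here the whole file is one line, so any match prints everything unless only the matched part is requested. A minimal sketch of the text-only approach in Python follows, assuming the flat, un-nested layout shown in the question. The function name `extract_record` and the `SAMPLE` string are made up for illustration, and this is not an XML parser.

```python
import re

# Two records copied from the question's layout: one line, no root element.
SAMPLE = (
    "<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
    "<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
)


def extract_record(text, wanted_id, open_tag="ID", close_tag="dateAccessed"):
    """Return the text from <ID>wanted_id</ID> up to the next closing
    dateAccessed tag, or None when the ID is not present."""
    pattern = re.compile(
        "<{0}>{1}</{0}>.*?</{2}>".format(
            re.escape(open_tag), re.escape(str(wanted_id)), re.escape(close_tag)
        ),
        re.DOTALL,  # let '.' cross line breaks in case a file does contain some
    )
    found = pattern.search(text)
    return found.group(0) if found else None


print(extract_record(SAMPLE, 2))
```

The non-greedy `.*?` is what stops the match at the very next closing tag rather than the last one in the file. For the 50 files, the same function could be called on the contents of each file in a loop.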
I can't help but feel this is the "wrong question". If you're working with XML files you should really be using an XML parser (such as xmlstarlet). I appreciate this won't give you an unbalanced segment, and so is not a suitable answer to your question as asked. But trying to treat XML as text will almost certainly lead to unintended consequences down the road. It's not a good place to be. Really.
– roaima
Oct 18 '17 at 7:14
5 Answers
As noted in the comments, your data isn't well-formed XML and it isn't completely clear what the structure of your document is, e.g. judging by your example data, it looks like you have no nested elements - is that really the case?
With that caveat in mind, here's a Python script that uses the BeautifulSoup4 parsing library to do what you want (i.e. it produces the desired output data for the given example input data):
#!/usr/bin/env python
# coding: ascii
"""extract.py

Extract everything between two XML tags
in a (possibly poorly formed) XML document."""

from bs4 import BeautifulSoup
import sys

# Set the opening tag name and value
opening_name = "ID"
opening_text = "2"

# Set the closing tag name
closing_name = "dateAccessed"

# Get the XML data from a file and instantiate a BeautifulSoup parser
# We add a root node because the input data is missing a root
with open(sys.argv[1], 'r') as xmlfile:
    xmldoc = "<root>" + xmlfile.read() + "</root>"
    soup = BeautifulSoup(xmldoc, 'xml')

# Iterate through the elements of the XML data and collect
# all of the elements in between the opening and closing tags
elements = []
match = False
for e in soup.find_all():
    if match is True:
        elements.append(str(e))
        if e.name == closing_name:
            break
    else:
        try:
            if e.name == opening_name and e.text == opening_text:
                match = True
                elements.append(str(e))
        except AttributeError:
            pass

# Output the results on a single line
print("".join(elements))
You would run it something like this:
python extract.py data.xml
For your given example data:
<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>3</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed><ID>4</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>
It produces the following output:
<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>
answered Jul 11 at 22:23
igal
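The same idea can be sketched with only the Python standard library, for anyone who would rather not install BeautifulSoup. This is a rough equivalent under the same assumptions as the answer above (flat data, a wrapper root added by hand), not a drop-in replacement. The function name `extract_until` and the `SAMPLE` string are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two records in the question's layout (no root element, no nesting).
SAMPLE = (
    "<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
    "<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
)


def extract_until(xml_text, open_name, open_text, close_name):
    """Collect elements from the first <open_name>open_text</open_name>
    through the next <close_name> element, serialized on one line."""
    # The input has no root element, so wrap it before parsing.
    root = ET.fromstring("<root>" + xml_text + "</root>")
    collected = []
    matching = False
    for child in root:
        if not matching and child.tag == open_name and child.text == open_text:
            matching = True
        if matching:
            collected.append(ET.tostring(child, encoding="unicode"))
            if child.tag == close_name:
                break
    return "".join(collected)


print(extract_until(SAMPLE, "ID", "2", "dateAccessed"))
```

Unlike BeautifulSoup, `ET.fromstring` raises an error on input that is still not well formed after wrapping, so this variant only suits files as regular as the example.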
Assuming that the XML document actually has a root tag (your XML does not and is therefore not well formed), then you may use XMLstarlet like this:
xmlstarlet sel -t -m '//ID[. = 2]' \
    -c . -c './following-sibling::*[position()<5]' -nl file.xml
For the given data (modified to insert <root> at the start and </root> at the end), this would return
<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX><dateAccessed>somedate</dateAccessed>
The XMLstarlet query selects any ID node whose contents is 2 (-m '//ID[. = 2]'). For each of these nodes (only one in the given data), it returns a copy of the node itself (-c .) along with a copy of the following four sibling nodes (-c './following-sibling::*[position()<5]'), ending the output by inserting a newline (-nl).
The <root> start and end tags could be inserted into the document itself, or be handed to XMLstarlet like so:
{ echo '<root>'; cat file.xml; echo '</root>'; } |
xmlstarlet sel -t -m '//ID[. = 2]' \
    -c . -c './following-sibling::*[position()<5]' -nl
edited Nov 23 at 23:29
answered Nov 23 at 23:22
Kusalananda
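To make the selection logic of that XPath query concrete, here is a small Python sketch that mimics it with list slicing: find each ID element with the wanted value, then take it plus a fixed number of following siblings. It is only an illustration of the idea, not how XMLstarlet works internally. The names `id_with_following` and `SAMPLE` are hypothetical, and the sketch compares the ID as a string where the XPath predicate compares it as a number.

```python
import xml.etree.ElementTree as ET

# Two records in the question's layout (no root element, no nesting).
SAMPLE = (
    "<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
    "<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
)


def id_with_following(xml_text, wanted_id, sibling_count=4):
    """For every ID element equal to wanted_id, return that element plus the
    next sibling_count siblings, serialized as one string per match."""
    root = ET.fromstring("<root>" + xml_text + "</root>")
    children = list(root)
    results = []
    for index, child in enumerate(children):
        if child.tag == "ID" and child.text == str(wanted_id):
            # position()<5 in the XPath corresponds to the next four siblings.
            selected = children[index:index + 1 + sibling_count]
            results.append(
                "".join(ET.tostring(e, encoding="unicode") for e in selected)
            )
    return results


for record in id_with_following(SAMPLE, 2):
    print(record)
```

A fixed sibling count relies on every record having exactly the same number of elements. If that can vary between files, stopping at the dateAccessed element instead is the safer choice.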
Grep
grep -oE '<data>[^<]*</data>' yourxmlfile
Bash
tag='data'
tL="<$tag>" tR="</$tag>"
xml=$(< yourxmlfile)
while case $xml in *"$tL"* ) :;; * ) break;; esac; do
   t1=${xml#*"$tL"} t2=${t1%%"$tR"*} xml=${t1#*"$tR"}
   echo "${tL}${t2}${tR}"
done
Perl
perl -lne "print for/<$tag>.*?</$tag>/g" yourxmlfile
Sed
sed -e "
   s|<$tag>|\n&|
   s/.*\n//
   s|</$tag>|&\n|
   /\n/P;D
" yourxmlfile
Output
<data>asdf</data>
<data>asdf</data>
<data>asdf</data>
<data>asdf</data>
edited Feb 27 '17 at 4:33
answered Feb 27 '17 at 3:48
Rakesh Sharma
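For comparison, the grep -o variant above can be sketched in Python with `re.findall`, which likewise returns only the matched pieces rather than whole lines. Like the commands in this answer, it lists every element with one tag name and does not select a record by ID. The function name `find_elements` and the `SAMPLE` string are made up for the example.

```python
import re

# Two records in the question's layout.
SAMPLE = (
    "<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
    "<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
)


def find_elements(text, tag):
    """Return every <tag>...</tag> element whose content has no nested tags."""
    # [^<]* mirrors the grep pattern: it stops at the first '<' after the opening tag.
    pattern = re.compile("<{0}>[^<]*</{0}>".format(re.escape(tag)))
    return pattern.findall(text)


for element in find_elements(SAMPLE, "data"):
    print(element)
```

Because the opening tag is matched literally, `data` does not also pick up `data2` or `dataX`.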
If you want to extract the ID value, and assuming ID always comes as the first tag, you can use this:
awk -F"[<>]" '{print $3}' input.txt
If you want to search for a specific tag, try this awk command instead. Change the value of input=ID to the tag name you need:
awk -F"[<>]" '{for(i=1;i<=NF;i++)if($i~input){print $(i+1);next}}' input=ID input.txt
answered Feb 27 '17 at 3:32
Kamaraj
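What the -F"[<>]" field splitting does may be easier to see spelled out. The sketch below reproduces it in Python: splitting on either angle bracket turns the line into alternating tag names and values, so awk's $3 (index 2 in Python, which counts from zero) is the first ID value. The helper names are hypothetical. The second function follows the awk loop only approximately, in that it does not look past the last field.

```python
import re

LINE = "<ID>1</ID><data>asdf</data><data2>asdf</data2>"


def split_fields(line):
    """Split on '<' or '>', like awk -F"[<>]"."""
    return re.split(r"[<>]", line)


def first_id_value(line):
    """Equivalent of awk's $3: the text right after the first opening tag."""
    fields = split_fields(line)
    return fields[2] if len(fields) > 2 else ""


def value_after(line, name_pattern):
    """Return the field following the first field that matches name_pattern.
    As with awk's ~ operator, this is a regex match, not an exact comparison."""
    fields = split_fields(line)
    for index, field in enumerate(fields[:-1]):
        if re.search(name_pattern, field):
            return fields[index + 1]
    return None


print(split_fields(LINE)[:5])
print(first_id_value(LINE))
print(value_after(LINE, "data"))
```

Note that this, like the awk commands, returns only the first value on the line and not the whole record the question asks for.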
This assumes the XML has no line breaks.
Why not insert a newline (\n) between each > and the < that follows it, so that every element ends up on its own line?
Example:
I created a file called stack containing the given XML.
The sed command below introduces the line breaks (only the lines for ID 2 are shown):
cat stack|sed -e 's/></>\n</g'
<ID>2</ID>
<data>asdf</data>
<data2>asdf</data2>
<dataX>asdf</dataX>
<dateAccessed>somedate</dateAccessed>
Now you can access the tags you want with ordinary line-based tools.
answered Oct 18 '17 at 7:08
user256118
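The answer above stops at splitting the line. One way the remaining step could look is sketched below in Python: break the text between adjacent tags, then keep the lines from the wanted ID line through the next line that closes dateAccessed. The function names and the `SAMPLE` string are made up for illustration, and the sketch assumes the flat layout from the question.

```python
# Two records in the question's layout.
SAMPLE = (
    "<ID>1</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
    "<ID>2</ID><data>asdf</data><data2>asdf</data2><dataX>asdf</dataX>"
    "<dateAccessed>somedate</dateAccessed>"
)


def split_tags(text):
    """Put each element on its own line, like sed 's/></>\\n</g'."""
    return text.replace("><", ">\n<").splitlines()


def select_record(lines, start_line, end_suffix):
    """Return the lines from start_line through the first later line
    that ends with end_suffix."""
    selected = []
    inside = False
    for line in lines:
        if line == start_line:
            inside = True
        if inside:
            selected.append(line)
            if line.endswith(end_suffix):
                break
    return selected


record = select_record(split_tags(SAMPLE), "<ID>2</ID>", "</dateAccessed>")
print("".join(record))
```

Joining the selected lines with an empty string restores the original one-line form, which matches the "as-is with tags intact" output the question asks for.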