Extract data between two patterns from a huge (forced) text file
I have filename.json. If I inspect it in the terminal with
file filename.json
the output is:
filename.json: UTF-8 Unicode text, with very long lines
wc -l filename.json
1 filename.json
If I parse it as JSON using jq, I have to name the section of the data I want printed, such as id, summary, author, etc. I have thousands of JSON files that are similar in structure, but the section holding the data I want is stored as either "summary", "description", "review", etc. Since there are thousands of JSON files, I don't want to check each one of them. But I know that the data I want resides between two patterns:
"title": and "url":
$ cat filename.json
gives:
{"source":"PhoneArena","author":"","title":"Apple's US Black Friday shopping event has gift cards galore for select iPhones, iPads, and more","description":"As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ...","url":"https://www.phonearena.com/news/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more_id111287","urlToImage":"https://i-cdn.phonearena.com/images/article/111287-two_lead/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more.jpg","publishedAt":"2018-11-23T09:05:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},{"source":"PhoneArena","author":"","title":"Verizon's top Black Friday bargain is a free Moto G6, no trade-in required","description":"That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ...","url":"https://www.phonearena.com/news/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required_id111285","urlToImage":"https://i-cdn.phonearena.com/images/article/111285-two_lead/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required.jpg","publishedAt":"2018-11-23T07:54:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},
So, I want to print everything between the patterns, but in the terminal the file is a single line and the patterns appear multiple times. The only way I could think of is to print between the two patterns until the end of the file.
I tried using sed:
sed -n '^/title/,/^url/p' filename.json
but it prints blank.
I want to feed the data into further language analysis using machine learning techniques.
Any suggestions on other ways to print between the patterns? The patterns repeat multiple times, so I want to print the data between each pair of them.
The expected result is CSV or TSV:
1 "As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ..."
2 "That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ..."
and so on, until the end of the file.
awk sed grep python pattern-matching
4
This does not look like JSON to me. Did you for some reason omit all {}, and :,?
– Kusalananda
Nov 23 at 9:13
Yes, I did omit and just copied it from the browser
– CCC
Nov 23 at 9:21
That makes it very difficult to help you as we can't test on real data.
– Kusalananda
Nov 23 at 9:29
3
Why not supply the JSON in the question? That would allow for a more robust solution than having to do haphazard parsing of nearly free form text.
– Kusalananda
Nov 23 at 9:36
2
Possible duplicate of How to extract data from a JSON file
– RoVo
Nov 23 at 10:02
edited Nov 25 at 23:21 by Isaac
asked Nov 23 at 9:09 by CCC
1 Answer
accepted
TL;DR
In ksh, bash, zsh:
sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\n,g' infile
sed
One-character delimiters.
The canonical solution for one-character delimiters (let's assume @ and # as an example) is:
sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g' infile
For each line of the input file infile, that will
- remove every character from the start that is not a @
- extract the characters between each @ and the first # that follows it.
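For comparison, the same one-character logic can be sketched in Python. This is only an illustration, not part of the sed solution: the function name between_chars and the sample string are made up for the example.

```python
import re

def between_chars(line, start="@", end="#"):
    """Return the text between each start char and the first end char after it."""
    # Same shape as the sed expression: start, a run of non-end chars, then end.
    pattern = re.escape(start) + "([^" + re.escape(end) + "]*)" + re.escape(end)
    return re.findall(pattern, line)

print(between_chars("xx@first#yy@second#zz"))  # ['first', 'second']
```

Everything before the first @ and everything after each # is simply never captured, which is what the two sed substitutions achieve by deleting it.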
General delimiters.
Any other delimiter can be reduced to the solution above by first converting each delimiter string to a single character:
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1 /g' infile
Instead of a space (\1 ), in your case you can use newlines, which for GNU sed are written simply as \1\n:
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\n/g' infile
For other (older) seds, add an explicit newline:
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' infile
If there is a risk that the delimiters used above could appear inside the file, choose other ones that are guaranteed not to exist in the file. If that seems to be a problem, the start and end delimiters could be control characters like Ctrl-A (encoded as ^A, as hex 0x01, or as octal \001). You can type that in a shell console by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:
sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A\([^^B]*\)^B[^^A]*,\1\n,g' infile
Or, if that is too cumbersome to type, either use (ksh, bash, zsh):
sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\n,g' infile
Or, if your sed supports it:
sed -e 's,"title":,\o001,g' -e 's,"url":,\o002,g' -e 's,^[^\o001]*,,' -e 's,\o001\([^\o002]*\)\o002[^\o001]*,\1\o012,g' infile
If the delimiter is "description":
If the starting tag is actually "description": (from your example of output), just use it instead of "title":.
The output from above (from the file you linked before in your question):
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
If you need to number the lines, pipe the result through sed once more:
| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
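The expected output in the question has the number and the text on the same row, separated by whitespace, whereas the sed numbering puts the number on its own line. If the extracted lines are post-processed in Python anyway, a tab-separated numbering could be sketched like this (number_rows is a hypothetical helper name, and the two input strings are placeholders):

```python
def number_rows(rows):
    """Prefix each row with its 1-based position and a tab (TSV style)."""
    return [f"{i}\t{row}" for i, row in enumerate(rows, start=1)]

for line in number_rows(['"first extracted text"', '"second extracted text"']):
    print(line)
```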
AWK
Similar logic implemented in awk:
awk -vone=$'\1' -vtwo=$'\2' '{
gsub(/"title":/,one);
gsub(/"url":/,two);
sub("^[^"one"]*"one,"")
gsub(two"[^"one"]*"one,ORS)
sub(two"[^"two"]*$","")
} 1' infile
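As the comments on the question suggest, parsing the data as JSON is more robust than text matching when the file really is JSON. A hedged sketch in Python follows. It assumes the file is a comma-separated run of objects like the sample (possibly with a trailing comma and without the enclosing brackets), and that the wanted field always sits between the "title" and "url" keys, whatever its name. The function name fields_between is made up for this example.

```python
import json

def fields_between(raw, start_key="title", end_key="url"):
    """Return, per object, the values of the keys between start_key and end_key."""
    # Assumption: raw is a run of JSON objects separated by commas, maybe with
    # a trailing comma and no enclosing [ ]; wrap it to form a valid array.
    text = raw.strip().rstrip(",")
    if not text.startswith("["):
        text = "[" + text + "]"
    rows = []
    for obj in json.loads(text):
        keys = list(obj)  # json.loads keeps the key order found in the file
        if start_key in keys and end_key in keys:
            lo, hi = keys.index(start_key), keys.index(end_key)
            rows.append([obj[k] for k in keys[lo + 1:hi]])
    return rows

sample = '{"title":"T1","description":"D1","url":"u1"},{"title":"T2","summary":"D2","url":"u2"},'
print(fields_between(sample))  # [['D1'], ['D2']]
```

This way the key may be called "summary", "description" or "review" without checking each file, and quotes or commas inside the text cannot confuse the extraction.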
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
TL;DR
In ksh,bash,zsh:
sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'
s,1\([^2]*\)2[^1]*,\1\n,g' infile
sed
One character delimiters.
The canonical solution for one character delimiters
lets assume @
and #
as an example, is:
sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile
That will
- remove every character from the start that is not a @
- extract characters that are between the first @
to the next first #
that follows.
For each line of the input file infile
.
General delimiters.
Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile
Instead of space (1
), in your case, you can use newlines, which written for GNU sed are simply (1n
):
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile
For other (older) seds Add an explicit newline:
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1
/g' infile
If there is the risk that the delimters used above could be inside the file, choose another ones that are warateed to not exist inside the file. If that seems to be a problem, the start and end delimiters could be a control character
like Ctrl-A ( or encoded: ^A
, as hex: Ox01
or as octal 01
). You can type that in a shell console by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:
sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile
Or, if it is too cumbersome to type, either use (ksh,bash,zsh):
sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile
Or, if your sed supports it:
sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile
if delimiter is "description":
If the starting tag is actually "description":
(from your example of output), just use it instead of "title":
The output from above (from the file you linked before in your question):
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
If you need to number the lines, sed it again with sed -n '=;p;g;p'
:
| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
AWK
Similar logic implemented in awk:
awk -vone=$'1' -vtwo=$'2' '{
gsub(/"title":/,one);
gsub(/"url":/,two);
sub("^[^"one"]*"one,"")
gsub(two"[^"one"]*"one,ORS)
sub(two"[^"two"]*$","")
} 1' infile
add a comment |
up vote
1
down vote
accepted
TL;DR
In ksh,bash,zsh:
sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'
s,1\([^2]*\)2[^1]*,\1\n,g' infile
sed
One character delimiters.
The canonical solution for one character delimiters
lets assume @
and #
as an example, is:
sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile
That will
- remove every character from the start that is not a @
- extract characters that are between the first @
to the next first #
that follows.
For each line of the input file infile
.
General delimiters.
Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile
Instead of space (1
), in your case, you can use newlines, which written for GNU sed are simply (1n
):
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile
For other (older) seds Add an explicit newline:
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1
/g' infile
If there is the risk that the delimters used above could be inside the file, choose another ones that are warateed to not exist inside the file. If that seems to be a problem, the start and end delimiters could be a control character
like Ctrl-A ( or encoded: ^A
, as hex: Ox01
or as octal 01
). You can type that in a shell console by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:
sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile
Or, if it is too cumbersome to type, either use (ksh,bash,zsh):
sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile
Or, if your sed supports it:
sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile
if delimiter is "description":
If the starting tag is actually "description":
(from your example of output), just use it instead of "title":
The output from above (from the file you linked before in your question):
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
If you need to number the lines, sed it again with sed -n '=;p;g;p'
:
| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
AWK
Similar logic implemented in awk:
awk -vone=$'1' -vtwo=$'2' '{
gsub(/"title":/,one);
gsub(/"url":/,two);
sub("^[^"one"]*"one,"")
gsub(two"[^"one"]*"one,ORS)
sub(two"[^"two"]*$","")
} 1' infile
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
TL;DR
In ksh,bash,zsh:
sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'
s,1\([^2]*\)2[^1]*,\1\n,g' infile
sed
One character delimiters.
The canonical solution for one character delimiters
lets assume @
and #
as an example, is:
sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile
That will
- remove every character from the start that is not a @
- extract characters that are between the first @
to the next first #
that follows.
For each line of the input file infile
.
General delimiters.
Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile
Instead of space (1
), in your case, you can use newlines, which written for GNU sed are simply (1n
):
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile
For other (older) seds Add an explicit newline:
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1
/g' infile
If there is the risk that the delimters used above could be inside the file, choose another ones that are warateed to not exist inside the file. If that seems to be a problem, the start and end delimiters could be a control character
like Ctrl-A ( or encoded: ^A
, as hex: Ox01
or as octal 01
). You can type that in a shell console by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:
sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile
Or, if it is too cumbersome to type, either use (ksh,bash,zsh):
sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile
Or, if your sed supports it:
sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile
if delimiter is "description":
If the starting tag is actually "description":
(from your example of output), just use it instead of "title":
The output from above (from the file you linked before in your question):
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
If you need to number the lines, sed it again with sed -n '=;p;g;p'
:
| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
AWK
Similar logic implemented in awk:
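awk -v one=$'\1' -v two=$'\2' '{
    gsub(/"title":/, one)
    gsub(/"url":/, two)
    sub("^[^" one "]*" one, "")       # drop everything up to the first start tag
    gsub(two "[^" one "]*" one, ORS)  # turn each end tag up to the next start tag into a newline
    sub(two "[^" two "]*$", "")       # drop the last end tag and whatever follows it
} 1' infile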
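For shells without $'...' quoting (plain POSIX sh, for instance), the two delimiter bytes can be produced with printf instead. The wrapper below is a sketch of the same awk logic reading standard input; the function name extract_awk and the sample line are assumptions made for the demonstration.

```shell
#!/bin/sh
# Same awk logic as above, with the delimiter bytes built by printf.
extract_awk() {
    awk -v one="$(printf '\001')" -v two="$(printf '\002')" '{
        gsub(/"title":/, one)
        gsub(/"url":/, two)
        sub("^[^" one "]*" one, "")       # drop text before the first start tag
        gsub(two "[^" one "]*" one, ORS)  # end tag up to next start tag becomes a newline
        sub(two "[^" two "]*$", "")       # drop the final end tag and the rest
    } 1'
}

# Sample data in the same shape as the question, shortened for illustration.
printf '%s\n' '{"title":"A","description":"d1","url":"u1"},{"title":"B","description":"d2","url":"u2"}' |
    extract_awk
```

Unlike the sed version, this one does not leave an empty line at the end, because the last end tag is removed rather than replaced by a newline.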
edited Nov 25 at 23:52
answered Nov 25 at 23:14
Isaac
4
This does not look like JSON to me. Did you for some reason omit all {}, and :,?
– Kusalananda
Nov 23 at 9:13
Yes, I did omit and just copied it from the browser
– CCC
Nov 23 at 9:21
That makes it very difficult to help you as we can't test on real data.
– Kusalananda
Nov 23 at 9:29
3
Why not supply the JSON in the question? That would allow for a more robust solution than having to do haphazard parsing of nearly free form text.
– Kusalananda
Nov 23 at 9:36
2
Possible duplicate of How to extract data from a JSON file
– RoVo
Nov 23 at 10:02