Extract data between two patterns from a huge (forced) text file

I have a file, filename.json. If I inspect it in a terminal with

file filename.json

the output is:

filename.json: UTF-8 Unicode text, with very long lines

and counting its lines with

wc -l filename.json

gives:

1 filename.json

If I parse it as JSON with jq, I have to name the field I want printed, such as id, summary, or author. I have thousands of JSON files that are similar in structure, but the field holding the data I want is named differently from file to file: "summary", "description", "review", and so on. With that many files, I don't want to check each one by hand. What I do know is that the data I want always sits between two patterns:

"title": and "url":

$ cat filename.json

gives:

{"source":"PhoneArena","author":"","title":"Apple's US Black Friday shopping event has gift cards galore for select iPhones, iPads, and more","description":"As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ...","url":"https://www.phonearena.com/news/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more_id111287","urlToImage":"https://i-cdn.phonearena.com/images/article/111287-two_lead/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more.jpg","publishedAt":"2018-11-23T09:05:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},{"source":"PhoneArena","author":"","title":"Verizon's top Black Friday bargain is a free Moto G6, no trade-in required","description":"That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ...","url":"https://www.phonearena.com/news/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required_id111285","urlToImage":"https://i-cdn.phonearena.com/images/article/111285-two_lead/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required.jpg","publishedAt":"2018-11-23T07:54:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},

So, I want to print everything between the two patterns, but as far as the terminal is concerned the file is a single line, and the patterns appear in it many times. The only approach I could think of is to print what lies between the two patterns, repeatedly, until the end of the file.

I tried using sed:

sed -n '^/title/,/^url/p' filename.json

but it prints nothing.

I want to use the extracted data as input for language analysis with machine learning techniques.

Are there other ways to print the text between the patterns? The patterns repeat many times, so I want the text between each pair to be printed.

The expected result, printed as CSV or TSV:

1 "As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ..."



2 "That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ..."



and so on, until the end of the file.

edited Nov 25 at 23:21

Isaac

9,84111445

asked Nov 23 at 9:09

CCC

447

4

This does not look like JSON to me. Did you for some reason omit all {}, and :,?
– Kusalananda
Nov 23 at 9:13

Yes, I did omit and just copied it from the browser
– CCC
Nov 23 at 9:21

That makes it very difficult to help you as we can't test on real data.
– Kusalananda
Nov 23 at 9:29

3

Why not supply the JSON in the question? That would allow for a more robust solution than having to do haphazard parsing of nearly free form text.
– Kusalananda
Nov 23 at 9:36

2

Possible duplicate of How to extract data from a JSON file
– RoVo
Nov 23 at 10:02


Tags: awk sed grep python pattern-matching

1 Answer
accepted

TL;DR

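In ksh, bash, or zsh:

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\n,g' infile

Inside $'...', \1 and \2 stand for the control characters Ctrl-A and Ctrl-B (octal 001 and 002), which serve as temporary one-character delimiters.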

sed

One character delimiters.

The canonical solution for one-character delimiters (let's assume @ and # as an example) is:

sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g' infile

For each line of the input file infile, that will:
- remove every character from the start of the line that is not an @
- extract the characters between each @ and the first # that follows it.
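As a quick check of the one-character case, here is a minimal sketch run on a made-up line. The sample text and the variable names are assumptions, not taken from the question's file, and the script assumes a sed that uses basic regular expressions, where the group is written \( \) and referenced as \1.

```shell
# Hypothetical sample line: the wanted text sits between @ and #.
line='junk@first#noise@second#tail'

# 1) delete everything before the first @
# 2) replace each "@...#<anything up to the next @>" with the captured text and a space
result=$(printf '%s\n' "$line" | sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g')

printf '[%s]\n' "$result"
```

The result keeps a trailing space after the last item, because every match is replaced by its captured text plus one space.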

General delimiters.

Any other pair of delimiters can be reduced to the case above by first converting each delimiter string to a single character:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1 /g' infile

Instead of a space (\1 followed by a space), in your case you can use newlines, which for GNU sed are simply written as \1\n:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\n/g' infile

For other (older) seds, add an explicit newline, escaped with a backslash:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' infile
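To see the general-delimiter version at work, the following sketch runs the backslash-newline form on a shortened, made-up input with two records. The sample records and variable names are assumptions for illustration; the real file is far longer.

```shell
# Hypothetical two-record sample shaped like the data in the question.
sample='{"title":"A","description":"first text","url":"u1"},{"title":"B","description":"second text","url":"u2"}'

# Convert both delimiter strings to @ and #, drop the leading junk,
# then print each captured span on a line of its own.
result=$(printf '%s\n' "$sample" |
  sed -e 's,"title":,@,g' -e 's,"url":,#,g' \
      -e 's/^[^@]*//' -e 's/@\([^#]*\)#[^@]*/\1\
/g')

printf '%s\n' "$result"
```

Each output line still ends with the comma that precedes "url": in the input, since everything between the two delimiters is kept as it is.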

If there is a risk that the delimiters used above occur inside the file, choose others that are guaranteed not to occur in it. If that seems to be a problem, the start and end delimiters could be control characters such as Ctrl-A (displayed as ^A; 0x01 in hex, \001 in octal) and Ctrl-B. You can enter one at a shell prompt by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A\([^^B]*\)^B[^^A]*,\1\n,g' infile

Or, if that is too cumbersome to type, use $'...' quoting (ksh, bash, zsh):

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\n,g' infile

Or, if your sed supports \oNNN octal escapes:

sed -e 's,"title":,\o001,g' -e 's,"url":,\o002,g' -e 's,^[^\o001]*,,' -e 's,\o001\([^\o002]*\)\o002[^\o001]*,\1\o012,g' infile
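The control-character variant can also be tried without typing the characters by hand. This sketch builds Ctrl-A and Ctrl-B with printf, so it does not depend on $'...' quoting or on \o escapes. The variable names SOH and STX, the sample text, and the | used as a visible output separator are assumptions for illustration.

```shell
# Build the two control characters portably.
SOH=$(printf '\001')   # Ctrl-A, stands in for "title":
STX=$(printf '\002')   # Ctrl-B, stands in for "url":

# Hypothetical sample whose text contains @ and #, which would
# break the @/# version of the command.
sample='{"title":"mail me @ home, room #1","url":"u1"},{"title":"two","url":"u2"}'

result=$(printf '%s\n' "$sample" |
  sed -e "s,\"title\":,${SOH},g" -e "s,\"url\":,${STX},g" \
      -e "s,^[^${SOH}]*,," \
      -e "s,${SOH}\\([^${STX}]*\\)${STX}[^${SOH}]*,\\1|,g")

printf '%s\n' "$result"
```

Because the script is double-quoted here, each backslash meant for sed is doubled, and the delimiter variables are written with braces so the shell does not read the following [ as part of the name.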

If the delimiter is "description":

If the starting tag is actually "description": (from your example of output), just use it instead of "title":
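With "description": as the starting tag, the same approach applied to a shortened, made-up sample looks like the sketch below. The sample records and variable names are assumptions. The extra comma before # in the pattern is an addition that drops the comma JSON places before "url":, and the sketch still assumes that @ and # never occur in the data.

```shell
# Hypothetical two-record sample shaped like the data in the question.
sample='{"title":"T1","description":"first text","url":"u1"},{"title":"T2","description":"second text","url":"u2"}'

# Use "description": as the start delimiter and "url": as the end delimiter,
# printing each description on a line of its own without the trailing comma.
result=$(printf '%s\n' "$sample" |
  sed -e 's,"description":,@,g' -e 's,"url":,#,g' \
      -e 's/^[^@]*//' -e 's/@\([^#]*\),#[^@]*/\1\
/g')

printf '%s\n' "$result"
```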

The output of the "title": to "url": commands above (run on the file you linked earlier in your question):

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

If you need to number the lines, pipe the output through sed -n '=;p;g;p', which prints each line number, the line itself, and then an empty line:

| sed -n '=;p;g;p'

1

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",



2

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",



3

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

AWK

Similar logic implemented in awk:

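awk -v one=$'\1' -v two=$'\2' '{
            # replace both delimiter strings with one control character each
            gsub(/"title":/,one);
            gsub(/"url":/,two);
            # drop everything up to and including the first start delimiter
            sub("^[^"one"]*"one,"")
            # turn each "end delimiter ... next start delimiter" gap into a newline
            gsub(two"[^"one"]*"one,ORS)
            # drop everything from the last end delimiter to the end of the line
            sub(two"[^"two"]*$","")
           } 1' infile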
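As a rough end-to-end sketch, the awk logic can be run on a made-up sample and followed by a second awk that numbers the non-empty lines with a tab, which is close to the TSV layout asked for in the question. The sample records, the variable names, and the numbering step are assumptions. The control characters are built with printf so that the sketch does not rely on $'...' quoting.

```shell
# Build the two control characters portably.
one=$(printf '\001')
two=$(printf '\002')

# Hypothetical two-record sample shaped like the data in the question.
sample='{"title":"A","description":"first","url":"u1"},{"title":"B","description":"second","url":"u2"}'

result=$(printf '%s\n' "$sample" |
  awk -v one="$one" -v two="$two" '{
      gsub(/"title":/, one)             # start delimiter -> Ctrl-A
      gsub(/"url":/, two)               # end delimiter   -> Ctrl-B
      sub("^[^" one "]*" one, "")       # drop the leading junk
      gsub(two "[^" one "]*" one, ORS)  # gap between records -> newline
      sub(two "[^" two "]*$", "")       # drop the trailing junk
  } 1' |
  awk 'NF { printf "%d\t%s\n", ++n, $0 }')   # number the non-empty lines

printf '%s\n' "$result"
```

The numbering step only prefixes a counter and a tab. It does not escape tabs or quotes inside the text, so genuine CSV output would need more care.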

edited Nov 25 at 23:52

answered Nov 25 at 23:14

Isaac

9,84111445

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f483627%2fextract-data-between-two-patterns-from-a-huge-forced-text-file%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

TL;DR

In ksh,bash,zsh:

sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'

         s,1\([^2]*\)2[^1]*,\1\n,g' infile

sed

One character delimiters.

The canonical solution for one character delimiters
lets assume @ and # as an example, is:

sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile

That will
- remove every character from the start that is not a @
- extract characters that are between the first @
to the next first # that follows.

For each line of the input file infile.

General delimiters.

Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile

Instead of space (1), in your case, you can use newlines, which written for GNU sed are simply (1n):

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile

For other (older) seds Add an explicit newline:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1

/g' infile

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile

Or, if it is too cumbersome to type, either use (ksh,bash,zsh):

sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile

Or, if your sed supports it:

sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile

if delimiter is "description":

If the starting tag is actually "description": (from your example of output), just use it instead of "title":

The output from above (from the file you linked before in your question):

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

If you need to number the lines, sed it again with sed -n '=;p;g;p':

| sed -n '=;p;g;p'

1

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",



2

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",



3

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

AWK

Similar logic implemented in awk:

awk -vone=$'1' -vtwo=$'2' '{

            gsub(/"title":/,one);

            gsub(/"url":/,two);

            sub("^[^"one"]*"one,"")

            gsub(two"[^"one"]*"one,ORS)

            sub(two"[^"two"]*$","")

           } 1' infile

edited Nov 25 at 23:52

answered Nov 25 at 23:14

Isaac

9,84111445

add a comment |

up vote
1
down vote

accepted

TL;DR

In ksh,bash,zsh:

sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'

         s,1\([^2]*\)2[^1]*,\1\n,g' infile

sed

One character delimiters.

The canonical solution for one character delimiters
lets assume @ and # as an example, is:

sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile

That will
- remove every character from the start that is not a @
- extract characters that are between the first @
to the next first # that follows.

For each line of the input file infile.

General delimiters.

Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile

Instead of space (1), in your case, you can use newlines, which written for GNU sed are simply (1n):

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile

For other (older) seds Add an explicit newline:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1

/g' infile

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile

Or, if it is too cumbersome to type, either use (ksh,bash,zsh):

sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile

Or, if your sed supports it:

sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile

if delimiter is "description":

If the starting tag is actually "description": (from your example of output), just use it instead of "title":

The output from above (from the file you linked before in your question):

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

If you need to number the lines, sed it again with sed -n '=;p;g;p':

| sed -n '=;p;g;p'

1

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",



2

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",



3

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

AWK

Similar logic implemented in awk:

awk -vone=$'1' -vtwo=$'2' '{

            gsub(/"title":/,one);

            gsub(/"url":/,two);

            sub("^[^"one"]*"one,"")

            gsub(two"[^"one"]*"one,ORS)

            sub(two"[^"two"]*$","")

           } 1' infile

edited Nov 25 at 23:52

answered Nov 25 at 23:14

Isaac

9,84111445

add a comment |

up vote
1
down vote

accepted

TL;DR

In ksh,bash,zsh:

sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'

         s,1\([^2]*\)2[^1]*,\1\n,g' infile

sed

One character delimiters.

The canonical solution for one character delimiters
lets assume @ and # as an example, is:

sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile

That will
- remove every character from the start that is not a @
- extract characters that are between the first @
to the next first # that follows.

For each line of the input file infile.

General delimiters.

Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile

Instead of space (1), in your case, you can use newlines, which written for GNU sed are simply (1n):

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile

For other (older) seds Add an explicit newline:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1

/g' infile

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile

Or, if it is too cumbersome to type, either use (ksh,bash,zsh):

sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile

Or, if your sed supports it:

sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile

if delimiter is "description":

If the starting tag is actually "description": (from your example of output), just use it instead of "title":

The output from above (from the file you linked before in your question):

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

If you need to number the lines, sed it again with sed -n '=;p;g;p':

| sed -n '=;p;g;p'

1

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",



2

"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",



3

"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

AWK

Similar logic implemented in awk:

awk -vone=$'1' -vtwo=$'2' '{

            gsub(/"title":/,one);

            gsub(/"url":/,two);

            sub("^[^"one"]*"one,"")

            gsub(two"[^"one"]*"one,ORS)

            sub(two"[^"two"]*$","")

           } 1' infile

edited Nov 25 at 23:52

answered Nov 25 at 23:14

Isaac

9,84111445

