Extract data between two patterns from a huge (forced) text file











I have a file, filename.json. If I inspect it in the terminal with

file filename.json

the output is:

filename.json: UTF-8 Unicode text, with very long lines

and wc -l reports a single line:

wc -l filename.json
1 filename.json

If I parse it as JSON using jq, I have to name the section of the data I want printed, such as id, summary, author, etc. I have thousands of JSON files that are similar in structure, but the section holding the data I want is stored under "summary", "description", "review", etc., depending on the file. Since there are thousands of files, I don't want to check each one of them. But I do know that the data I want resides between two patterns:

"title": and "url":

$ cat filename.json

gives:



{"source":"PhoneArena","author":"","title":"Apple's US Black Friday shopping event has gift cards galore for select iPhones, iPads, and more","description":"As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ...","url":"https://www.phonearena.com/news/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more_id111287","urlToImage":"https://i-cdn.phonearena.com/images/article/111287-two_lead/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more.jpg","publishedAt":"2018-11-23T09:05:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},{"source":"PhoneArena","author":"","title":"Verizon's top Black Friday bargain is a free Moto G6, no trade-in required","description":"That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ...","url":"https://www.phonearena.com/news/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required_id111285","urlToImage":"https://i-cdn.phonearena.com/images/article/111285-two_lead/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required.jpg","publishedAt":"2018-11-23T07:54:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},


So I want to print everything between the two patterns, but the file is a single line and the patterns appear many times. The only approach I can think of is to print the text between each pair of patterns, all the way to the end of the file.

I tried using sed:

sed -n '^/title/,/^url/p' filename.json

but it prints nothing.

I want to feed the extracted data into language analysis using machine learning techniques.

Is there another way to print the text between the patterns? The patterns repeat many times, so I want the text between each pair of occurrences.

The expected result is CSV or TSV:



1 "As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ..."

2 "That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ..."

and so on, until the end of the file.











  • 4
    This does not look like JSON to me. Did you for some reason omit all {}, and :,?
    – Kusalananda
    Nov 23 at 9:13

  • Yes, I did omit and just copied it from the browser
    – CCC
    Nov 23 at 9:21

  • That makes it very difficult to help you as we can't test on real data.
    – Kusalananda
    Nov 23 at 9:29

  • 3
    Why not supply the JSON in the question? That would allow for a more robust solution than having to do haphazard parsing of nearly free form text.
    – Kusalananda
    Nov 23 at 9:36

  • 2
    Possible duplicate of How to extract data from a JSON file
    – RoVo
    Nov 23 at 10:02
















awk sed grep python pattern-matching







edited Nov 25 at 23:21 by Isaac

asked Nov 23 at 9:09 by CCC


1 Answer

Accepted answer (score 1)

TL;DR

In ksh, bash, or zsh:

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\n,g' infile




sed

One-character delimiters.

The canonical solution for one-character delimiters (let's assume @ and # as an example) is:

sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g' infile

That will
- remove every character from the start of the line that is not a @
- extract the characters between each @ and the first # that follows it, discarding everything from that # up to the next @.

This is done for each line of the input file infile.
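Since the question is tagged python, the same one-character-delimiter idea can be sketched in Python. This is only an illustration and not part of the sed solution: the function name between_single_chars and the sample string are made up for the example. The negated character class plays the same role as [^#]* in the sed expression.

```python
import re

def between_single_chars(line, start="@", end="#"):
    """Return every run of text found between `start` and the next `end`."""
    # "[^<end>]*" cannot run past an end delimiter, like [^#]* in the sed command.
    pattern = re.escape(start) + "([^" + re.escape(end) + "]*)" + re.escape(end)
    return re.findall(pattern, line)

# Hypothetical input: two delimited pieces surrounded by unwanted text.
print(between_single_chars("junk@first#noise@second#tail"))
```

Text before the first @ and after the last # is simply never matched, which corresponds to the two substitutions in the sed command.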



General delimiters.

Any other delimiter can be reduced to the case above by first converting each delimiter string to a single character:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1 /g' infile

Instead of a space (\1 ), in your case you can use newlines, which for GNU sed are simply written as (\1\n):

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\n/g' infile

For other (older) seds, add an explicit newline:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' infile


If there is a risk that the delimiters used above could appear inside the file, choose other ones that are guaranteed not to exist in the file. If that seems to be a problem, the start and end delimiters could be control characters such as Ctrl-A (encoded as ^A, hex 0x01, or octal \001). You can enter one in a shell console by typing Ctrl-V Ctrl-A; you will see a ^A on the command line:

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A\([^^B]*\)^B[^^A]*,\1\n,g' infile

Or, if that is too cumbersome to type, either use this (ksh, bash, zsh):

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\n,g' infile

Or, if your sed supports it:

sed -e 's,"title":,\o001,g' -e 's,"url":,\o002,g' -e 's,^[^\o001]*,,' -e 's,\o001\([^\o002]*\)\o002[^\o001]*,\1\o012,g' infile
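The replace-then-extract approach with control characters can also be mirrored in Python. The sketch below is an illustration under assumptions, not a replacement for the sed commands: the function name between_markers and the shortened sample record are invented here, and it assumes the bytes 0x01 and 0x02 never occur in the data.

```python
import re

START, END = "\x01", "\x02"  # Ctrl-A and Ctrl-B, assumed absent from the data

def between_markers(text, start_tag='"title":', end_tag='"url":'):
    """Swap each tag for a control character, then extract what lies between."""
    marked = text.replace(start_tag, START).replace(end_tag, END)
    # Same shape as the sed expression: START, anything but END, then END.
    return re.findall(START + "([^" + END + "]*)" + END, marked)

# Hypothetical, shortened sample in the same layout as the file in the question.
sample = ('{"title":"A","description":"x","url":"u1","urlToImage":"i1"},'
          '{"title":"B","description":"y","url":"u2","urlToImage":"i2"},')
for piece in between_markers(sample):
    print(piece)
```

Passing start_tag='"description":' narrows each piece to the description text. Note that "urlToImage": is not mistaken for the end tag, because the end tag includes the closing quote and colon.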


If the delimiter is "description":

If the starting tag is actually "description": (judging from your example of expected output), just use it instead of "title": in the commands above.

The output of the sed commands above with "title": as the starting tag (using the file you linked earlier in your question):



"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",


If you need to number the lines, pipe the output through sed once more with sed -n '=;p;g;p':



| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",


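AWK

Similar logic implemented in awk:

awk -v one=$'\1' -v two=$'\2' '{
    gsub(/"title":/, one)              # mark every start tag with Ctrl-A
    gsub(/"url":/, two)                # mark every end tag with Ctrl-B
    sub("^[^" one "]*" one, "")        # drop everything up to and including the first start mark
    gsub(two "[^" one "]*" one, ORS)   # an end mark through the next start mark becomes a newline
    sub(two "[^" two "]*$", "")        # drop the last end mark and everything after it
} 1' infile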
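As the comments on the question point out, a JSON parser is likely to be more robust than text matching. A minimal Python sketch of that alternative follows. It makes several assumptions: the file is a comma-separated run of JSON objects as in the sample, the wanted text is under one of the keys named in the question, and the names CANDIDATE_KEYS, iter_objects, and numbered_rows are invented for this example. It also numbers the rows, as the sed -n '=' step does.

```python
import json

# Field names taken from the question; which one is present is assumed to vary per file.
CANDIDATE_KEYS = ("summary", "description", "review")

def iter_objects(text):
    """Yield each top-level JSON value from a comma-separated run of objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip separators between objects (commas, whitespace, list brackets).
        while pos < len(text) and text[pos] in ", \t\r\n[]":
            pos += 1
        if pos >= len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

def numbered_rows(text):
    """Return 'N<TAB>text' rows, using the first candidate key each object has."""
    rows = []
    for obj in iter_objects(text):
        if not isinstance(obj, dict):
            continue
        for key in CANDIDATE_KEYS:
            if key in obj:
                rows.append(f"{len(rows) + 1}\t{obj[key]}")
                break
    return rows

# Hypothetical two-record sample with the trailing comma seen in the question.
sample = ('{"title":"A","description":"first text","url":"u1"},'
          '{"title":"B","review":"second text","url":"u2"},')
print("\n".join(numbered_rows(sample)))
```

The rows are tab-separated without any escaping, so text that itself contains tabs or newlines would need extra handling, for example with Python's csv module.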





    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    1
    down vote



    accepted










    TL;DR



    In ksh,bash,zsh:



    sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'
    s,1\([^2]*\)2[^1]*,\1\n,g' infile




    sed



    One character delimiters.



    The canonical solution for one character delimiters
    lets assume @ and # as an example, is:



    sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile


    That will
    - remove every character from the start that is not a @
    - extract characters that are between the first @
    to the next first # that follows.



    For each line of the input file infile.



    General delimiters.



    Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.



    sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile


    Instead of space (1), in your case, you can use newlines, which written for GNU sed are simply (1n):



    sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile


    For other (older) seds Add an explicit newline:



    sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1
    /g' infile


    If there is the risk that the delimters used above could be inside the file, choose another ones that are warateed to not exist inside the file. If that seems to be a problem, the start and end delimiters could be a control character
    like Ctrl-A ( or encoded: ^A, as hex: Ox01 or as octal 01 ). You can type that in a shell console by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:



    sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile


    Or, if it is too cumbersome to type, either use (ksh,bash,zsh):



    sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile


    Or, if your sed supports it:



    sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile


    if delimiter is "description":



    If the starting tag is actually "description": (from your example of output), just use it instead of "title":



    The output from above (from the file you linked before in your question):



    "Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
    "LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
    "Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",


    If you need to number the lines, sed it again with sed -n '=;p;g;p':



    | sed -n '=;p;g;p'
    1
    "Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

    2
    "LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

    3
    "Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",


    AWK



    Similar logic implemented in awk:



    awk -vone=$'1' -vtwo=$'2' '{
    gsub(/"title":/,one);
    gsub(/"url":/,two);
    sub("^[^"one"]*"one,"")
    gsub(two"[^"one"]*"one,ORS)
    sub(two"[^"two"]*$","")
    } 1' infile





    share|improve this answer



























      up vote
      1
      down vote



      accepted










      TL;DR



      In ksh,bash,zsh:



      sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'
      s,1\([^2]*\)2[^1]*,\1\n,g' infile




      sed



      One character delimiters.



      The canonical solution for one character delimiters
      lets assume @ and # as an example, is:



      sed 's,^[^@]*,,;s,@([^#]*)#[^@]*,1 ,g' infile


      That will
      - remove every character from the start that is not a @
      - extract characters that are between the first @
      to the next first # that follows.



      For each line of the input file infile.



      General delimiters.



      Any other delimiter could be converted to the answer above by simply converting each delimiter string to one character.



      sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1 /g' infile


      Instead of space (1), in your case, you can use newlines, which written for GNU sed are simply (1n):



      sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1n/g' infile


      For other (older) seds Add an explicit newline:



      sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@([^#]*)#[^@]*/1
      /g' infile


      If there is the risk that the delimters used above could be inside the file, choose another ones that are warateed to not exist inside the file. If that seems to be a problem, the start and end delimiters could be a control character
      like Ctrl-A ( or encoded: ^A, as hex: Ox01 or as octal 01 ). You can type that in a shell console by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:



      sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A([^^B]*)^B[^^A]*,1n,g' infile


      Or, if it is too cumbersome to type, either use (ksh,bash,zsh):



      sed -e $'s,"title":,1,g' -e $'s,"url":,2,g' -e $'s,^[^1]*,,' -e $'s,1\([^2]*\)2[^1]*,\1\n,g' infile


      Or, if your sed supports it:



      sed -e 's,"title":,o001,g' -e 's,"url":,o002,g' -e 's,^[^o001]*,,' -e 's,o001([^o002]*)o002[^o001]*,1o012,g' infile


      if delimiter is "description":



      If the starting tag is actually "description": (from your example of output), just use it instead of "title":



      The output from above (from the file you linked before in your question):



      "Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
      "LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
      "Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",


      If you need to number the lines, sed it again with sed -n '=;p;g;p':



      | sed -n '=;p;g;p'
      1
      "Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

      2
      "LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

      3
      "Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

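
To see what that numbering step does in isolation, here is a small sketch on two made-up input lines: = prints the line number, p prints the line, g replaces the pattern space with the (empty) hold space, and the second p prints the resulting blank separator.

```shell
# Two hypothetical input lines, numbered and separated by blank lines.
number_lines() {
  sed -n '=;p;g;p'
}

printf '%s\n' 'first' 'second' | number_lines
```

This should print 1, first, an empty line, then 2, second, and another empty line.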

      AWK



      Similar logic implemented in awk:



      awk -vone=$'\1' -vtwo=$'\2' '{
      gsub(/"title":/,one);
      gsub(/"url":/,two);
      sub("^[^"one"]*"one,"")
      gsub(two"[^"one"]*"one,ORS)
      sub(two"[^"two"]*$","")
      } 1' infile
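
A runnable sketch of the awk variant on a made-up two-record line, with the control characters generated by printf so that it does not rely on $'...' quoting. The sample string, the shell variables one and two, and the function name extract_awk are illustrative assumptions.

```shell
# Hypothetical one-line sample with two records (not the question's data).
sample='{"a":1,"title":"First","url":"u1","x":2},{"title":"Second","url":"u2"}'

one=$(printf '\001')
two=$(printf '\002')

extract_awk() {
  awk -v one="$one" -v two="$two" '{
    gsub(/"title":/, one)             # start delimiter -> Ctrl-A
    gsub(/"url":/, two)               # end delimiter   -> Ctrl-B
    sub("^[^" one "]*" one, "")       # drop everything up to the first start
    gsub(two "[^" one "]*" one, ORS)  # text between records -> newline
    sub(two "[^" two "]*$", "")       # drop the tail after the last end
  } 1'
}

printf '%s\n' "$sample" | extract_awk
```

Unlike the sed version, this one leaves no trailing empty line, because the final end delimiter is deleted rather than replaced by a newline.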






























        up vote
        1
        down vote



        accepted















        edited Nov 25 at 23:52

























        answered Nov 25 at 23:14









        Isaac

        9,84111445

































