Demonstrate file encoding












0














I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.



Then convert the file to UTF8 Encoding and be able to read the file in Linux.



Is this possible?










share|improve this question
























  • You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
    – 炸鱼薯条德里克
    Dec 17 at 3:00










  • Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
    – 炸鱼薯条德里克
    Dec 17 at 3:01






  • 4




    This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific.  (2) This question has nothing to do with file permissions.  Why are you even mentioning it?
    – G-Man
    Dec 17 at 3:16










  • Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
    – Christopher
    Dec 18 at 1:13


















0














I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.



Then convert the file to UTF8 Encoding and be able to read the file in Linux.



Is this possible?










share|improve this question
























  • You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
    – 炸鱼薯条德里克
    Dec 17 at 3:00










  • Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
    – 炸鱼薯条德里克
    Dec 17 at 3:01






  • 4




    This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific.  (2) This question has nothing to do with file permissions.  Why are you even mentioning it?
    – G-Man
    Dec 17 at 3:16










  • Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
    – Christopher
    Dec 18 at 1:13
















0












0








0







I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.



Then convert the file to UTF8 Encoding and be able to read the file in Linux.



Is this possible?










share|improve this question















I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.



Then convert the file to UTF8 Encoding and be able to read the file in Linux.



Is this possible?







ubuntu files filesystems character-encoding






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 17 at 9:21









SouravGhosh

455311




455311










asked Dec 17 at 2:29









Jeff Christman

1063




1063












  • You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
    – 炸鱼薯条德里克
    Dec 17 at 3:00










  • Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
    – 炸鱼薯条德里克
    Dec 17 at 3:01






  • 4




    This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific.  (2) This question has nothing to do with file permissions.  Why are you even mentioning it?
    – G-Man
    Dec 17 at 3:16










  • Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
    – Christopher
    Dec 18 at 1:13




















  • You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
    – 炸鱼薯条德里克
    Dec 17 at 3:00










  • Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
    – 炸鱼薯条德里克
    Dec 17 at 3:01






  • 4




    This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific.  (2) This question has nothing to do with file permissions.  Why are you even mentioning it?
    – G-Man
    Dec 17 at 3:16










  • Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
    – Christopher
    Dec 18 at 1:13


















You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00




You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00












Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01




Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01




4




4




This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific.  (2) This question has nothing to do with file permissions.  Why are you even mentioning it?
– G-Man
Dec 17 at 3:16




This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific.  (2) This question has nothing to do with file permissions.  Why are you even mentioning it?
– G-Man
Dec 17 at 3:16












Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13






Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13












3 Answers
3






active

oldest

votes


















3














You can use GNU recode to convert between encodings. It reads from stdin and is called like this:



recode from-encoding..to-encoding


For example:



$ recode ascii..ebcdic < file.txt


Or perhaps more relevant, to convert from the Windows-1252 encoding:



$ recode windows-1252..utf8 < extended-latin.txt


So an example demonstration:



$ cat > universal-declaration-french.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
^D
$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
$ cat declaration-1252.txt
Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
Ils sont dou�s de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternit�.

$ recode windows-1252..utf8 < declaration-1252.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.


You can see a list of encodings it supports with 'recode -l'.






share|improve this answer





















  • Perfect ! - Thanks so much
    – Jeff Christman
    Dec 17 at 14:21



















1














If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:



$ cat > utf8.txt <<<"This is 日本語。"
$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
$ head utf*.txt
==> utf16.txt <==
��This is �e,g��0

==> utf8.txt <==
This is 日本語。
$ iconv -f UTF-16 -t UTF-8 utf16.txt
This is 日本語。





share|improve this answer





























    1














    SHIFT-JIS can encode some pretty indecipherable stuff,



    % cat phrase
    ?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
    % hexdump -C phrase
    00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
    00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
    00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
    00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
    00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
    0000004b
    % iconv -f SHIFT-JIS -t UTF-8 < phrase
    Like quills upon the fretful porpentine.


    Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)



    tofu or not



    These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.



    Method to this Madness



    First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z range and translates them up into the full width range:



    #!/usr/bin/env perl
    use 5.24.0;
    use warnings;

    die "Usage: not-ascii ...n" unless @ARGV;

    my $s = '';
    for my $c ( split //, "@ARGV" ) {
    if ( $c =~ m/[a-z]/ ) { # FF41
    $s .= chr( 0xFF41 + ord($c) - 97 );
    } elsif ( $c =~ m/[A-Z]/ ) { # FF21
    $s .= chr( 0xFF21 + ord($c) - 65 );
    } else {
    $s .= $c;
    }
    }
    binmode *STDOUT, ':encoding(UTF-8)';
    say $s;


    we can then take our Shakespeare to full width and encode it in SHIFT-JIS:



    % not-ascii 'Like quills upon the fretful porpentine.' 
    | iconv -f UTF-8 -t SHIFT-JIS > phrase


    SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l. Most other encodings aren't very interesting, or fail to convert the UTF-8:



    #!/bin/sh
    IFS=' '
    iconv -l | while read e unused; do
    printf "$e "
    printf "test phrasen" | iconv -f UTF-8 -t "$e"
    done


    though you really need a hex viewer to inspect the results:



    % ./brutus-iconv > x
    iconv: (stdin):1:0: cannot convert
    iconv: (stdin):1:0: cannot convert
    iconv: (stdin):1:0: cannot convert

    iconv: (stdin):1:4: cannot convert
    iconv: (stdin):1:0: cannot convert
    % hexdump -C x
    00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
    00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
    00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
    00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
    ...





    share|improve this answer























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "106"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f489396%2fdemonstrate-file-encoding%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      3














      You can use GNU recode to convert between encodings. It reads from stdin and is called like this:



      recode from-encoding..to-encoding


      For example:



      $ recode ascii..ebcdic < file.txt


      Or perhaps more relevant, to convert from the Windows-1252 encoding:



      $ recode windows-1252..utf8 < extended-latin.txt


      So an example demonstration:



      $ cat > universal-declaration-french.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.
      ^D
      $ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
      $ cat declaration-1252.txt
      Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
      Ils sont dou�s de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternit�.

      $ recode windows-1252..utf8 < declaration-1252.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.


      You can see a list of encodings it supports with 'recode -l'.






      share|improve this answer





















      • Perfect ! - Thanks so much
        – Jeff Christman
        Dec 17 at 14:21
















      3














      You can use GNU recode to convert between encodings. It reads from stdin and is called like this:



      recode from-encoding..to-encoding


      For example:



      $ recode ascii..ebcdic < file.txt


      Or perhaps more relevant, to convert from the Windows-1252 encoding:



      $ recode windows-1252..utf8 < extended-latin.txt


      So an example demonstration:



      $ cat > universal-declaration-french.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.
      ^D
      $ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
      $ cat declaration-1252.txt
      Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
      Ils sont dou�s de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternit�.

      $ recode windows-1252..utf8 < declaration-1252.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.


      You can see a list of encodings it supports with 'recode -l'.






      share|improve this answer





















      • Perfect ! - Thanks so much
        – Jeff Christman
        Dec 17 at 14:21














      3












      3








      3






      You can use GNU recode to convert between encodings. It reads from stdin and is called like this:



      recode from-encoding..to-encoding


      For example:



      $ recode ascii..ebcdic < file.txt


      Or perhaps more relevant, to convert from the Windows-1252 encoding:



      $ recode windows-1252..utf8 < extended-latin.txt


      So an example demonstration:



      $ cat > universal-declaration-french.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.
      ^D
      $ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
      $ cat declaration-1252.txt
      Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
      Ils sont dou�s de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternit�.

      $ recode windows-1252..utf8 < declaration-1252.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.


      You can see a list of encodings it supports with 'recode -l'.






      share|improve this answer












      You can use GNU recode to convert between encodings. It reads from stdin and is called like this:



      recode from-encoding..to-encoding


      For example:



      $ recode ascii..ebcdic < file.txt


      Or perhaps more relevant, to convert from the Windows-1252 encoding:



      $ recode windows-1252..utf8 < extended-latin.txt


      So an example demonstration:



      $ cat > universal-declaration-french.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.
      ^D
      $ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
      $ cat declaration-1252.txt
      Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
      Ils sont dou�s de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternit�.

      $ recode windows-1252..utf8 < declaration-1252.txt
      Tous les êtres humains naissent libres et égaux en dignité et en droits.
      Ils sont doués de raison et de conscience et doivent agir les uns envers
      les autres dans un esprit de fraternité.


      You can see a list of encodings it supports with 'recode -l'.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Dec 17 at 3:05









      Silas Coker

      1963




      1963












      • Perfect ! - Thanks so much
        – Jeff Christman
        Dec 17 at 14:21


















      • Perfect ! - Thanks so much
        – Jeff Christman
        Dec 17 at 14:21
















      Perfect ! - Thanks so much
      – Jeff Christman
      Dec 17 at 14:21




      Perfect ! - Thanks so much
      – Jeff Christman
      Dec 17 at 14:21













      1














      If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:



      $ cat > utf8.txt <<<"This is 日本語。"
      $ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
      $ head utf*.txt
      ==> utf16.txt <==
      ��This is �e,g��0

      ==> utf8.txt <==
      This is 日本語。
      $ iconv -f UTF-16 -t UTF-8 utf16.txt
      This is 日本語。





      share|improve this answer


























        1














        If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:



        $ cat > utf8.txt <<<"This is 日本語。"
        $ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
        $ head utf*.txt
        ==> utf16.txt <==
        ��This is �e,g��0

        ==> utf8.txt <==
        This is 日本語。
        $ iconv -f UTF-16 -t UTF-8 utf16.txt
        This is 日本語。





        share|improve this answer
























          1












          1








          1






          If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:



          $ cat > utf8.txt <<<"This is 日本語。"
          $ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
          $ head utf*.txt
          ==> utf16.txt <==
          ��This is �e,g��0

          ==> utf8.txt <==
          This is 日本語。
          $ iconv -f UTF-16 -t UTF-8 utf16.txt
          This is 日本語。





          share|improve this answer












          If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:



          $ cat > utf8.txt <<<"This is 日本語。"
          $ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
          $ head utf*.txt
          ==> utf16.txt <==
          ��This is �e,g��0

          ==> utf8.txt <==
          This is 日本語。
          $ iconv -f UTF-16 -t UTF-8 utf16.txt
          This is 日本語。






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Dec 17 at 3:07









          muru

          1




          1























              1














              SHIFT-JIS can encode some pretty indecipherable stuff,



              % cat phrase
              ?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
              % hexdump -C phrase
              00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
              00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
              00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
              00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
              00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
              0000004b
              % iconv -f SHIFT-JIS -t UTF-8 < phrase
              Like quills upon the fretful porpentine.


              Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)



              tofu or not



              These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.



              Method to this Madness



              First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z range and translates them up into the full width range:



              #!/usr/bin/env perl
              use 5.24.0;
              use warnings;

              die "Usage: not-ascii ...n" unless @ARGV;

              my $s = '';
              for my $c ( split //, "@ARGV" ) {
              if ( $c =~ m/[a-z]/ ) { # FF41
              $s .= chr( 0xFF41 + ord($c) - 97 );
              } elsif ( $c =~ m/[A-Z]/ ) { # FF21
              $s .= chr( 0xFF21 + ord($c) - 65 );
              } else {
              $s .= $c;
              }
              }
              binmode *STDOUT, ':encoding(UTF-8)';
              say $s;


              we can then take our Shakespeare to full width and encode it in SHIFT-JIS:



              % not-ascii 'Like quills upon the fretful porpentine.' 
              | iconv -f UTF-8 -t SHIFT-JIS > phrase


              SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l. Most other encodings aren't very interesting, or fail to convert the UTF-8:



              #!/bin/sh
              IFS=' '
              iconv -l | while read e unused; do
              printf "$e "
              printf "test phrasen" | iconv -f UTF-8 -t "$e"
              done


              though you really need a hex viewer to inspect the results:



              % ./brutus-iconv > x
              iconv: (stdin):1:0: cannot convert
              iconv: (stdin):1:0: cannot convert
              iconv: (stdin):1:0: cannot convert

              iconv: (stdin):1:4: cannot convert
              iconv: (stdin):1:0: cannot convert
              % hexdump -C x
              00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
              00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
              00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
              00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
              ...





              share|improve this answer




























                1














                SHIFT-JIS can encode some pretty indecipherable stuff,



                % cat phrase
                ?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
                % hexdump -C phrase
                00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
                00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
                00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
                00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
                00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
                0000004b
                % iconv -f SHIFT-JIS -t UTF-8 < phrase
                Like quills upon the fretful porpentine.


                Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)



                tofu or not



                These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.



                Method to this Madness



                First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z range and translates them up into the full width range:



                #!/usr/bin/env perl
                use 5.24.0;
                use warnings;

                die "Usage: not-ascii ...n" unless @ARGV;

                my $s = '';
                for my $c ( split //, "@ARGV" ) {
                if ( $c =~ m/[a-z]/ ) { # FF41
                $s .= chr( 0xFF41 + ord($c) - 97 );
                } elsif ( $c =~ m/[A-Z]/ ) { # FF21
                $s .= chr( 0xFF21 + ord($c) - 65 );
                } else {
                $s .= $c;
                }
                }
                binmode *STDOUT, ':encoding(UTF-8)';
                say $s;


                we can then take our Shakespeare to full width and encode it in SHIFT-JIS:



                % not-ascii 'Like quills upon the fretful porpentine.' 
                | iconv -f UTF-8 -t SHIFT-JIS > phrase


                SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l. Most other encodings aren't very interesting, or fail to convert the UTF-8:



                #!/bin/sh
                IFS=' '
                iconv -l | while read e unused; do
                printf "$e "
                printf "test phrasen" | iconv -f UTF-8 -t "$e"
                done


                though you really need a hex viewer to inspect the results:



                % ./brutus-iconv > x
                iconv: (stdin):1:0: cannot convert
                iconv: (stdin):1:0: cannot convert
                iconv: (stdin):1:0: cannot convert

                iconv: (stdin):1:4: cannot convert
                iconv: (stdin):1:0: cannot convert
                % hexdump -C x
                00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
                00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
                00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
                00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
                ...





                share|improve this answer


























                  1












                  1








                  1






                  SHIFT-JIS can encode some pretty indecipherable stuff,



                  % cat phrase
                  ?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
                  % hexdump -C phrase
                  00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
                  00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
                  00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
                  00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
                  00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
                  0000004b
                  % iconv -f SHIFT-JIS -t UTF-8 < phrase
                  Like quills upon the fretful porpentine.


                  Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)



                  tofu or not



                  These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.



                  Method to this Madness



                  First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z range and translates them up into the full width range:



                  #!/usr/bin/env perl
                  use 5.24.0;
                  use warnings;

                  die "Usage: not-ascii ...n" unless @ARGV;

                  my $s = '';
                  for my $c ( split //, "@ARGV" ) {
                  if ( $c =~ m/[a-z]/ ) { # FF41
                  $s .= chr( 0xFF41 + ord($c) - 97 );
                  } elsif ( $c =~ m/[A-Z]/ ) { # FF21
                  $s .= chr( 0xFF21 + ord($c) - 65 );
                  } else {
                  $s .= $c;
                  }
                  }
                  binmode *STDOUT, ':encoding(UTF-8)';
                  say $s;


                  we can then take our Shakespeare to full width and encode it in SHIFT-JIS:



                  % not-ascii 'Like quills upon the fretful porpentine.' 
                  | iconv -f UTF-8 -t SHIFT-JIS > phrase


                  SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l. Most other encodings aren't very interesting, or fail to convert the UTF-8:



                  #!/bin/sh
                  IFS=' '
                  iconv -l | while read e unused; do
                  printf "$e "
                  printf "test phrasen" | iconv -f UTF-8 -t "$e"
                  done


                  though you really need a hex viewer to inspect the results:



                  % ./brutus-iconv > x
                  iconv: (stdin):1:0: cannot convert
                  iconv: (stdin):1:0: cannot convert
                  iconv: (stdin):1:0: cannot convert

                  iconv: (stdin):1:4: cannot convert
                  iconv: (stdin):1:0: cannot convert
                  % hexdump -C x
                  00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
                  00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
                  00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
                  00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
                  ...





                  share|improve this answer














                  SHIFT-JIS can encode some pretty indecipherable stuff,



                  % cat phrase
                  ?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
                  % hexdump -C phrase
                  00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
                  00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
                  00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
                  00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
                  00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
                  0000004b
                  % iconv -f SHIFT-JIS -t UTF-8 < phrase
                  Like quills upon the fretful porpentine.


                  Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)



                  tofu or not



                  These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.



                  Method to this Madness



                  First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z range and translates them up into the full width range:



                  #!/usr/bin/env perl
                  use 5.24.0;
                  use warnings;

                  die "Usage: not-ascii ...n" unless @ARGV;

                  my $s = '';
                  for my $c ( split //, "@ARGV" ) {
                  if ( $c =~ m/[a-z]/ ) { # FF41
                  $s .= chr( 0xFF41 + ord($c) - 97 );
                  } elsif ( $c =~ m/[A-Z]/ ) { # FF21
                  $s .= chr( 0xFF21 + ord($c) - 65 );
                  } else {
                  $s .= $c;
                  }
                  }
                  binmode *STDOUT, ':encoding(UTF-8)';
                  say $s;


                  we can then take our Shakespeare to full width and encode it in SHIFT-JIS:



                  % not-ascii 'Like quills upon the fretful porpentine.' 
                  | iconv -f UTF-8 -t SHIFT-JIS > phrase


                  SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l. Most other encodings aren't very interesting, or fail to convert the UTF-8:



                  #!/bin/sh
                  IFS=' '
                  iconv -l | while read e unused; do
                  printf "$e "
                  printf "test phrasen" | iconv -f UTF-8 -t "$e"
                  done


                  though you really need a hex viewer to inspect the results:



                  % ./brutus-iconv > x
                  iconv: (stdin):1:0: cannot convert
                  iconv: (stdin):1:0: cannot convert
                  iconv: (stdin):1:0: cannot convert

                  iconv: (stdin):1:4: cannot convert
                  iconv: (stdin):1:0: cannot convert
                  % hexdump -C x
                  00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
                  00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
                  00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
                  00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
                  ...






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Dec 18 at 1:03

























                  answered Dec 17 at 3:26









                  thrig

                  24.2k23056




                  24.2k23056






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Unix & Linux Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.





                      Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                      Please pay close attention to the following guidance:


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f489396%2fdemonstrate-file-encoding%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Morgemoulin

                      Scott Moir

                      Souastre