Demonstrate file encoding

I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.

Then convert the file to UTF8 Encoding and be able to read the file in Linux.

Is this possible?

edited Dec 17 at 9:21

SouravGhosh

455311

asked Dec 17 at 2:29

Jeff Christman

1063

You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00

Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01

4

This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16

Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13

add a comment |

I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.

Then convert the file to UTF8 Encoding and be able to read the file in Linux.

Is this possible?

edited Dec 17 at 9:21

SouravGhosh

455311

asked Dec 17 at 2:29

Jeff Christman

1063

You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00

Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01

4

This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16

Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13

add a comment |

I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.

Then convert the file to UTF8 Encoding and be able to read the file in Linux.

Is this possible?

edited Dec 17 at 9:21

SouravGhosh

455311

asked Dec 17 at 2:29

Jeff Christman

1063

I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.

Then convert the file to UTF8 Encoding and be able to read the file in Linux.

Is this possible?

ubuntu files filesystems character-encoding

edited Dec 17 at 9:21

SouravGhosh

455311

asked Dec 17 at 2:29

Jeff Christman

1063

edited Dec 17 at 9:21

SouravGhosh

455311

asked Dec 17 at 2:29

Jeff Christman

1063

edited Dec 17 at 9:21

SouravGhosh

455311

edited Dec 17 at 9:21

SouravGhosh

455311

edited Dec 17 at 9:21

SouravGhosh

455311

asked Dec 17 at 2:29

Jeff Christman

1063

asked Dec 17 at 2:29

Jeff Christman

1063

asked Dec 17 at 2:29

Jeff Christman

1063

You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00

Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01

4

This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16

Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13

add a comment |

You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00

Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01

4

This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16

Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13

You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00

Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01

This is why I wish we could downvote comments.   (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16

Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13

add a comment |

3 Answers
3

active

oldest

votes

You can use GNU recode to convert between encodings. It reads from stdin and is called like this:

recode from-encoding..to-encoding

For example:

$ recode ascii..ebcdic < file.txt

Or perhaps more relevant, to convert from the Windows-1252 encoding:

$ recode windows-1252..utf8 < extended-latin.txt

So an example demonstration:

$ cat > universal-declaration-french.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

^D

$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt

$ cat declaration-1252.txt

Tous les �tres humains naissent libres et �gaux en dignit� et en droits.

Ils sont dou�s de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternit�.



$ recode windows-1252..utf8 < declaration-1252.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

You can see a list of encodings it supports with 'recode -l'.

answered Dec 17 at 3:05

Silas Coker

1963

Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21

add a comment |

If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:

$ cat > utf8.txt <<<"This is 日本語。"

$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt

$ head utf*.txt  

==> utf16.txt <==

��This is �e,g��0



==> utf8.txt <==

This is 日本語。

$ iconv -f UTF-16 -t UTF-8 utf16.txt

This is 日本語。

answered Dec 17 at 3:07

muru

add a comment |

SHIFT-JIS can encode some pretty indecipherable stuff,

% cat phrase

?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.

% hexdump -C phrase

00000000  82 6b 82 89 82 8b 82 85  20 82 91 82 95 82 89 82  |.k...... .......|

00000010  8c 82 8c 82 93 20 82 95  82 90 82 8f 82 8e 20 82  |..... ........ .|

00000020  94 82 88 82 85 20 82 86  82 92 82 85 82 94 82 86  |..... ..........|

00000030  82 95 82 8c 20 82 90 82  8f 82 92 82 90 82 85 82  |.... ...........|

00000040  8e 82 94 82 89 82 8e 82  85 2e 0a                 |...........|

0000004b

% iconv -f SHIFT-JIS -t UTF-8 < phrase 

Ｌｉｋｅ ｑｕｉｌｌｓ ｕｐｏｎ ｔｈｅ ｆｒｅｔｆｕｌ ｐｏｒｐｅｎｔｉｎｅ.

Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)

tofu or not

These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.

Method to this Madness

First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z range and translates them up into the full width range:

#!/usr/bin/env perl

use 5.24.0;

use warnings;



die "Usage: not-ascii ...n" unless @ARGV;



my $s = '';

for my $c ( split //, "@ARGV" ) {

    if ( $c =~ m/[a-z]/ ) {    # FF41

        $s .= chr( 0xFF41 + ord($c) - 97 );

    } elsif ( $c =~ m/[A-Z]/ ) {    # FF21

        $s .= chr( 0xFF21 + ord($c) - 65 );

    } else {

        $s .= $c;

    }

}

binmode *STDOUT, ':encoding(UTF-8)';

say $s;

we can then take our Shakespeare to full width and encode it in SHIFT-JIS:

% not-ascii 'Like quills upon the fretful porpentine.' 

| iconv -f UTF-8 -t SHIFT-JIS > phrase

SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l. Most other encodings aren't very interesting, or fail to convert the UTF-8:

#!/bin/sh

IFS=' '

iconv -l | while read e unused; do

   printf "$e "

   printf "test phrasen" | iconv -f UTF-8 -t "$e"

done

though you really need a hex viewer to inspect the results:

% ./brutus-iconv > x

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert



iconv: (stdin):1:4: cannot convert

iconv: (stdin):1:0: cannot convert

% hexdump -C x

00000000  41 4e 53 49 5f 58 33 2e  34 2d 31 39 36 38 20 74  |ANSI_X3.4-1968 t|

00000010  65 73 74 20 70 68 72 61  73 65 0a 55 54 46 2d 38  |est phrase.UTF-8|

00000020  20 74 65 73 74 20 70 68  72 61 73 65 0a 55 54 46  | test phrase.UTF|

00000030  2d 38 2d 4d 41 43 20 74  65 73 74 20 70 68 72 61  |-8-MAC test phra|

...

edited Dec 18 at 1:03

answered Dec 17 at 3:26

thrig

24.2k23056

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f489396%2fdemonstrate-file-encoding%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

You can use GNU recode to convert between encodings. It reads from stdin and is called like this:

recode from-encoding..to-encoding

For example:

$ recode ascii..ebcdic < file.txt

Or perhaps more relevant, to convert from the Windows-1252 encoding:

$ recode windows-1252..utf8 < extended-latin.txt

So an example demonstration:

$ cat > universal-declaration-french.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

^D

$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt

$ cat declaration-1252.txt

Tous les �tres humains naissent libres et �gaux en dignit� et en droits.

Ils sont dou�s de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternit�.



$ recode windows-1252..utf8 < declaration-1252.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

You can see a list of encodings it supports with 'recode -l'.

answered Dec 17 at 3:05

Silas Coker

1963

Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21

add a comment |

You can use GNU recode to convert between encodings. It reads from stdin and is called like this:

recode from-encoding..to-encoding

For example:

$ recode ascii..ebcdic < file.txt

Or perhaps more relevant, to convert from the Windows-1252 encoding:

$ recode windows-1252..utf8 < extended-latin.txt

So an example demonstration:

$ cat > universal-declaration-french.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

^D

$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt

$ cat declaration-1252.txt

Tous les �tres humains naissent libres et �gaux en dignit� et en droits.

Ils sont dou�s de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternit�.



$ recode windows-1252..utf8 < declaration-1252.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

You can see a list of encodings it supports with 'recode -l'.

answered Dec 17 at 3:05

Silas Coker

1963

Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21

add a comment |

You can use GNU recode to convert between encodings. It reads from stdin and is called like this:

recode from-encoding..to-encoding

For example:

$ recode ascii..ebcdic < file.txt

Or perhaps more relevant, to convert from the Windows-1252 encoding:

$ recode windows-1252..utf8 < extended-latin.txt

So an example demonstration:

$ cat > universal-declaration-french.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

^D

$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt

$ cat declaration-1252.txt

Tous les �tres humains naissent libres et �gaux en dignit� et en droits.

Ils sont dou�s de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternit�.



$ recode windows-1252..utf8 < declaration-1252.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

You can see a list of encodings it supports with 'recode -l'.

answered Dec 17 at 3:05

Silas Coker

1963

You can use GNU recode to convert between encodings. It reads from stdin and is called like this:

recode from-encoding..to-encoding

For example:

$ recode ascii..ebcdic < file.txt

Or perhaps more relevant, to convert from the Windows-1252 encoding:

$ recode windows-1252..utf8 < extended-latin.txt

So an example demonstration:

$ cat > universal-declaration-french.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

^D

$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt

$ cat declaration-1252.txt

Tous les �tres humains naissent libres et �gaux en dignit� et en droits.

Ils sont dou�s de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternit�.



$ recode windows-1252..utf8 < declaration-1252.txt

Tous les êtres humains naissent libres et égaux en dignité et en droits.

Ils sont doués de raison et de conscience et doivent agir les uns envers

les autres dans un esprit de fraternité.

You can see a list of encodings it supports with 'recode -l'.

answered Dec 17 at 3:05

Silas Coker

1963

answered Dec 17 at 3:05

Silas Coker

1963

answered Dec 17 at 3:05

Silas Coker

1963

answered Dec 17 at 3:05

Silas Coker

1963

Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21

add a comment |

Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21

Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21

add a comment |

If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:

$ cat > utf8.txt <<<"This is 日本語。"

$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt

$ head utf*.txt  

==> utf16.txt <==

��This is �e,g��0



==> utf8.txt <==

This is 日本語。

$ iconv -f UTF-16 -t UTF-8 utf16.txt

This is 日本語。

answered Dec 17 at 3:07

muru

add a comment |

If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:

$ cat > utf8.txt <<<"This is 日本語。"

$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt

$ head utf*.txt  

==> utf16.txt <==

��This is �e,g��0



==> utf8.txt <==

This is 日本語。

$ iconv -f UTF-16 -t UTF-8 utf16.txt

This is 日本語。

answered Dec 17 at 3:07

muru

add a comment |

If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:

$ cat > utf8.txt <<<"This is 日本語。"

$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt

$ head utf*.txt  

==> utf16.txt <==

��This is �e,g��0



==> utf8.txt <==

This is 日本語。

$ iconv -f UTF-16 -t UTF-8 utf16.txt

This is 日本語。

answered Dec 17 at 3:07

muru

If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:

$ cat > utf8.txt <<<"This is 日本語。"

$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt

$ head utf*.txt  

==> utf16.txt <==

��This is �e,g��0



==> utf8.txt <==

This is 日本語。

$ iconv -f UTF-16 -t UTF-8 utf16.txt

This is 日本語。

answered Dec 17 at 3:07

muru

answered Dec 17 at 3:07

muru

answered Dec 17 at 3:07

muru

answered Dec 17 at 3:07

muru

add a comment |

SHIFT-JIS can encode some pretty indecipherable stuff,

% cat phrase

?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.

% hexdump -C phrase

00000000  82 6b 82 89 82 8b 82 85  20 82 91 82 95 82 89 82  |.k...... .......|

00000010  8c 82 8c 82 93 20 82 95  82 90 82 8f 82 8e 20 82  |..... ........ .|

00000020  94 82 88 82 85 20 82 86  82 92 82 85 82 94 82 86  |..... ..........|

00000030  82 95 82 8c 20 82 90 82  8f 82 92 82 90 82 85 82  |.... ...........|

00000040  8e 82 94 82 89 82 8e 82  85 2e 0a                 |...........|

0000004b

% iconv -f SHIFT-JIS -t UTF-8 < phrase 

Ｌｉｋｅ ｑｕｉｌｌｓ ｕｐｏｎ ｔｈｅ ｆｒｅｔｆｕｌ ｐｏｒｐｅｎｔｉｎｅ.

tofu or not

These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.

Method to this Madness

#!/usr/bin/env perl

use 5.24.0;

use warnings;



die "Usage: not-ascii ...n" unless @ARGV;



my $s = '';

for my $c ( split //, "@ARGV" ) {

    if ( $c =~ m/[a-z]/ ) {    # FF41

        $s .= chr( 0xFF41 + ord($c) - 97 );

    } elsif ( $c =~ m/[A-Z]/ ) {    # FF21

        $s .= chr( 0xFF21 + ord($c) - 65 );

    } else {

        $s .= $c;

    }

}

binmode *STDOUT, ':encoding(UTF-8)';

say $s;

we can then take our Shakespeare to full width and encode it in SHIFT-JIS:

% not-ascii 'Like quills upon the fretful porpentine.' 

| iconv -f UTF-8 -t SHIFT-JIS > phrase

#!/bin/sh

IFS=' '

iconv -l | while read e unused; do

   printf "$e "

   printf "test phrasen" | iconv -f UTF-8 -t "$e"

done

though you really need a hex viewer to inspect the results:

% ./brutus-iconv > x

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert



iconv: (stdin):1:4: cannot convert

iconv: (stdin):1:0: cannot convert

% hexdump -C x

00000000  41 4e 53 49 5f 58 33 2e  34 2d 31 39 36 38 20 74  |ANSI_X3.4-1968 t|

00000010  65 73 74 20 70 68 72 61  73 65 0a 55 54 46 2d 38  |est phrase.UTF-8|

00000020  20 74 65 73 74 20 70 68  72 61 73 65 0a 55 54 46  | test phrase.UTF|

00000030  2d 38 2d 4d 41 43 20 74  65 73 74 20 70 68 72 61  |-8-MAC test phra|

...

edited Dec 18 at 1:03

answered Dec 17 at 3:26

thrig

24.2k23056

add a comment |

SHIFT-JIS can encode some pretty indecipherable stuff,

% cat phrase

?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.

% hexdump -C phrase

00000000  82 6b 82 89 82 8b 82 85  20 82 91 82 95 82 89 82  |.k...... .......|

00000010  8c 82 8c 82 93 20 82 95  82 90 82 8f 82 8e 20 82  |..... ........ .|

00000020  94 82 88 82 85 20 82 86  82 92 82 85 82 94 82 86  |..... ..........|

00000030  82 95 82 8c 20 82 90 82  8f 82 92 82 90 82 85 82  |.... ...........|

00000040  8e 82 94 82 89 82 8e 82  85 2e 0a                 |...........|

0000004b

% iconv -f SHIFT-JIS -t UTF-8 < phrase 

Ｌｉｋｅ ｑｕｉｌｌｓ ｕｐｏｎ ｔｈｅ ｆｒｅｔｆｕｌ ｐｏｒｐｅｎｔｉｎｅ.

tofu or not

These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.

Method to this Madness

#!/usr/bin/env perl

use 5.24.0;

use warnings;



die "Usage: not-ascii ...n" unless @ARGV;



my $s = '';

for my $c ( split //, "@ARGV" ) {

    if ( $c =~ m/[a-z]/ ) {    # FF41

        $s .= chr( 0xFF41 + ord($c) - 97 );

    } elsif ( $c =~ m/[A-Z]/ ) {    # FF21

        $s .= chr( 0xFF21 + ord($c) - 65 );

    } else {

        $s .= $c;

    }

}

binmode *STDOUT, ':encoding(UTF-8)';

say $s;

we can then take our Shakespeare to full width and encode it in SHIFT-JIS:

% not-ascii 'Like quills upon the fretful porpentine.' 

| iconv -f UTF-8 -t SHIFT-JIS > phrase

#!/bin/sh

IFS=' '

iconv -l | while read e unused; do

   printf "$e "

   printf "test phrasen" | iconv -f UTF-8 -t "$e"

done

though you really need a hex viewer to inspect the results:

% ./brutus-iconv > x

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert



iconv: (stdin):1:4: cannot convert

iconv: (stdin):1:0: cannot convert

% hexdump -C x

00000000  41 4e 53 49 5f 58 33 2e  34 2d 31 39 36 38 20 74  |ANSI_X3.4-1968 t|

00000010  65 73 74 20 70 68 72 61  73 65 0a 55 54 46 2d 38  |est phrase.UTF-8|

00000020  20 74 65 73 74 20 70 68  72 61 73 65 0a 55 54 46  | test phrase.UTF|

00000030  2d 38 2d 4d 41 43 20 74  65 73 74 20 70 68 72 61  |-8-MAC test phra|

...

edited Dec 18 at 1:03

answered Dec 17 at 3:26

thrig

24.2k23056

add a comment |

SHIFT-JIS can encode some pretty indecipherable stuff,

% cat phrase

?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.

% hexdump -C phrase

00000000  82 6b 82 89 82 8b 82 85  20 82 91 82 95 82 89 82  |.k...... .......|

00000010  8c 82 8c 82 93 20 82 95  82 90 82 8f 82 8e 20 82  |..... ........ .|

00000020  94 82 88 82 85 20 82 86  82 92 82 85 82 94 82 86  |..... ..........|

00000030  82 95 82 8c 20 82 90 82  8f 82 92 82 90 82 85 82  |.... ...........|

00000040  8e 82 94 82 89 82 8e 82  85 2e 0a                 |...........|

0000004b

% iconv -f SHIFT-JIS -t UTF-8 < phrase 

Ｌｉｋｅ ｑｕｉｌｌｓ ｕｐｏｎ ｔｈｅ ｆｒｅｔｆｕｌ ｐｏｒｐｅｎｔｉｎｅ.

tofu or not

These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.

Method to this Madness

#!/usr/bin/env perl

use 5.24.0;

use warnings;



die "Usage: not-ascii ...n" unless @ARGV;



my $s = '';

for my $c ( split //, "@ARGV" ) {

    if ( $c =~ m/[a-z]/ ) {    # FF41

        $s .= chr( 0xFF41 + ord($c) - 97 );

    } elsif ( $c =~ m/[A-Z]/ ) {    # FF21

        $s .= chr( 0xFF21 + ord($c) - 65 );

    } else {

        $s .= $c;

    }

}

binmode *STDOUT, ':encoding(UTF-8)';

say $s;

we can then take our Shakespeare to full width and encode it in SHIFT-JIS:

% not-ascii 'Like quills upon the fretful porpentine.' 

| iconv -f UTF-8 -t SHIFT-JIS > phrase

#!/bin/sh

IFS=' '

iconv -l | while read e unused; do

   printf "$e "

   printf "test phrasen" | iconv -f UTF-8 -t "$e"

done

though you really need a hex viewer to inspect the results:

% ./brutus-iconv > x

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert



iconv: (stdin):1:4: cannot convert

iconv: (stdin):1:0: cannot convert

% hexdump -C x

00000000  41 4e 53 49 5f 58 33 2e  34 2d 31 39 36 38 20 74  |ANSI_X3.4-1968 t|

00000010  65 73 74 20 70 68 72 61  73 65 0a 55 54 46 2d 38  |est phrase.UTF-8|

00000020  20 74 65 73 74 20 70 68  72 61 73 65 0a 55 54 46  | test phrase.UTF|

00000030  2d 38 2d 4d 41 43 20 74  65 73 74 20 70 68 72 61  |-8-MAC test phra|

...

edited Dec 18 at 1:03

answered Dec 17 at 3:26

thrig

24.2k23056

SHIFT-JIS can encode some pretty indecipherable stuff,

% cat phrase

?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.

% hexdump -C phrase

00000000  82 6b 82 89 82 8b 82 85  20 82 91 82 95 82 89 82  |.k...... .......|

00000010  8c 82 8c 82 93 20 82 95  82 90 82 8f 82 8e 20 82  |..... ........ .|

00000020  94 82 88 82 85 20 82 86  82 92 82 85 82 94 82 86  |..... ..........|

00000030  82 95 82 8c 20 82 90 82  8f 82 92 82 90 82 85 82  |.... ...........|

00000040  8e 82 94 82 89 82 8e 82  85 2e 0a                 |...........|

0000004b

% iconv -f SHIFT-JIS -t UTF-8 < phrase 

Ｌｉｋｅ ｑｕｉｌｌｓ ｕｐｏｎ ｔｈｅ ｆｒｅｔｆｕｌ ｐｏｒｐｅｎｔｉｎｅ.

tofu or not

These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.

Method to this Madness

#!/usr/bin/env perl

use 5.24.0;

use warnings;



die "Usage: not-ascii ...n" unless @ARGV;



my $s = '';

for my $c ( split //, "@ARGV" ) {

    if ( $c =~ m/[a-z]/ ) {    # FF41

        $s .= chr( 0xFF41 + ord($c) - 97 );

    } elsif ( $c =~ m/[A-Z]/ ) {    # FF21

        $s .= chr( 0xFF21 + ord($c) - 65 );

    } else {

        $s .= $c;

    }

}

binmode *STDOUT, ':encoding(UTF-8)';

say $s;

we can then take our Shakespeare to full width and encode it in SHIFT-JIS:

% not-ascii 'Like quills upon the fretful porpentine.' 

| iconv -f UTF-8 -t SHIFT-JIS > phrase

#!/bin/sh

IFS=' '

iconv -l | while read e unused; do

   printf "$e "

   printf "test phrasen" | iconv -f UTF-8 -t "$e"

done

though you really need a hex viewer to inspect the results:

% ./brutus-iconv > x

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert

iconv: (stdin):1:0: cannot convert



iconv: (stdin):1:4: cannot convert

iconv: (stdin):1:0: cannot convert

% hexdump -C x

00000000  41 4e 53 49 5f 58 33 2e  34 2d 31 39 36 38 20 74  |ANSI_X3.4-1968 t|

00000010  65 73 74 20 70 68 72 61  73 65 0a 55 54 46 2d 38  |est phrase.UTF-8|

00000020  20 74 65 73 74 20 70 68  72 61 73 65 0a 55 54 46  | test phrase.UTF|

00000030  2d 38 2d 4d 41 43 20 74  65 73 74 20 70 68 72 61  |-8-MAC test phra|

...

edited Dec 18 at 1:03

answered Dec 17 at 3:26

thrig

24.2k23056

edited Dec 18 at 1:03

answered Dec 17 at 3:26

thrig

24.2k23056

answered Dec 17 at 3:26

thrig

24.2k23056

answered Dec 17 at 3:26

thrig

24.2k23056

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Unix & Linux Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Cfrtjryk