Demonstrate file encoding
I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.
Then convert the file to UTF8 Encoding and be able to read the file in Linux.
Is this possible?
ubuntu files filesystems character-encoding
add a comment |
I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.
Then convert the file to UTF8 Encoding and be able to read the file in Linux.
Is this possible?
ubuntu files filesystems character-encoding
You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00
Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01
4
This is why I wish we could downvote comments. (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16
Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13
add a comment |
I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.
Then convert the file to UTF8 Encoding and be able to read the file in Linux.
Is this possible?
ubuntu files filesystems character-encoding
I am developing some training and would like to demonstrate file encoding. What I am trying to accomplish is create a text file with one type of encoding that show up as nonsense when read by Linux.
Then convert the file to UTF8 Encoding and be able to read the file in Linux.
Is this possible?
ubuntu files filesystems character-encoding
ubuntu files filesystems character-encoding
edited Dec 17 at 9:21
SouravGhosh
455311
455311
asked Dec 17 at 2:29
Jeff Christman
1063
1063
You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00
Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01
4
This is why I wish we could downvote comments. (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16
Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13
add a comment |
You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00
Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01
4
This is why I wish we could downvote comments. (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16
Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13
You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00
You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00
Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01
Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01
4
4
This is why I wish we could downvote comments. (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16
This is why I wish we could downvote comments. (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16
Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13
Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13
add a comment |
3 Answers
3
active
oldest
votes
You can use GNU recode to convert between encodings. It reads from stdin and is called like this:
recode from-encoding..to-encoding
For example:
$ recode ascii..ebcdic < file.txt
Or perhaps more relevant, to convert from the Windows-1252 encoding:
$ recode windows-1252..utf8 < extended-latin.txt
So an example demonstration:
$ cat > universal-declaration-french.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
^D
$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
$ cat declaration-1252.txt
Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
Ils sont dou�s de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternit�.
$ recode windows-1252..utf8 < declaration-1252.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
You can see a list of encodings it supports with 'recode -l'.
Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21
add a comment |
If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:
$ cat > utf8.txt <<<"This is 日本語。"
$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
$ head utf*.txt
==> utf16.txt <==
��This is �e,g��0
==> utf8.txt <==
This is 日本語。
$ iconv -f UTF-16 -t UTF-8 utf16.txt
This is 日本語。
add a comment |
SHIFT-JIS can encode some pretty indecipherable stuff,
% cat phrase
?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
% hexdump -C phrase
00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
0000004b
% iconv -f SHIFT-JIS -t UTF-8 < phrase
Like quills upon the fretful porpentine.
Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)
These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.
Method to this Madness
First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z
range and translates them up into the full width range:
#!/usr/bin/env perl
use 5.24.0;
use warnings;
die "Usage: not-ascii ...n" unless @ARGV;
my $s = '';
for my $c ( split //, "@ARGV" ) {
if ( $c =~ m/[a-z]/ ) { # FF41
$s .= chr( 0xFF41 + ord($c) - 97 );
} elsif ( $c =~ m/[A-Z]/ ) { # FF21
$s .= chr( 0xFF21 + ord($c) - 65 );
} else {
$s .= $c;
}
}
binmode *STDOUT, ':encoding(UTF-8)';
say $s;
we can then take our Shakespeare to full width and encode it in SHIFT-JIS:
% not-ascii 'Like quills upon the fretful porpentine.'
| iconv -f UTF-8 -t SHIFT-JIS > phrase
SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l
. Most other encodings aren't very interesting, or fail to convert the UTF-8:
#!/bin/sh
IFS=' '
iconv -l | while read e unused; do
printf "$e "
printf "test phrasen" | iconv -f UTF-8 -t "$e"
done
though you really need a hex viewer to inspect the results:
% ./brutus-iconv > x
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:4: cannot convert
iconv: (stdin):1:0: cannot convert
% hexdump -C x
00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
...
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f489396%2fdemonstrate-file-encoding%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can use GNU recode to convert between encodings. It reads from stdin and is called like this:
recode from-encoding..to-encoding
For example:
$ recode ascii..ebcdic < file.txt
Or perhaps more relevant, to convert from the Windows-1252 encoding:
$ recode windows-1252..utf8 < extended-latin.txt
So an example demonstration:
$ cat > universal-declaration-french.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
^D
$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
$ cat declaration-1252.txt
Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
Ils sont dou�s de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternit�.
$ recode windows-1252..utf8 < declaration-1252.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
You can see a list of encodings it supports with 'recode -l'.
Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21
add a comment |
You can use GNU recode to convert between encodings. It reads from stdin and is called like this:
recode from-encoding..to-encoding
For example:
$ recode ascii..ebcdic < file.txt
Or perhaps more relevant, to convert from the Windows-1252 encoding:
$ recode windows-1252..utf8 < extended-latin.txt
So an example demonstration:
$ cat > universal-declaration-french.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
^D
$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
$ cat declaration-1252.txt
Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
Ils sont dou�s de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternit�.
$ recode windows-1252..utf8 < declaration-1252.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
You can see a list of encodings it supports with 'recode -l'.
Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21
add a comment |
You can use GNU recode to convert between encodings. It reads from stdin and is called like this:
recode from-encoding..to-encoding
For example:
$ recode ascii..ebcdic < file.txt
Or perhaps more relevant, to convert from the Windows-1252 encoding:
$ recode windows-1252..utf8 < extended-latin.txt
So an example demonstration:
$ cat > universal-declaration-french.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
^D
$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
$ cat declaration-1252.txt
Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
Ils sont dou�s de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternit�.
$ recode windows-1252..utf8 < declaration-1252.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
You can see a list of encodings it supports with 'recode -l'.
You can use GNU recode to convert between encodings. It reads from stdin and is called like this:
recode from-encoding..to-encoding
For example:
$ recode ascii..ebcdic < file.txt
Or perhaps more relevant, to convert from the Windows-1252 encoding:
$ recode windows-1252..utf8 < extended-latin.txt
So an example demonstration:
$ cat > universal-declaration-french.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
^D
$ recode utf8..windows-1252 < universal-declaration-french.txt > declaration-1252.txt
$ cat declaration-1252.txt
Tous les �tres humains naissent libres et �gaux en dignit� et en droits.
Ils sont dou�s de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternit�.
$ recode windows-1252..utf8 < declaration-1252.txt
Tous les êtres humains naissent libres et égaux en dignité et en droits.
Ils sont doués de raison et de conscience et doivent agir les uns envers
les autres dans un esprit de fraternité.
You can see a list of encodings it supports with 'recode -l'.
answered Dec 17 at 3:05
Silas Coker
1963
1963
Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21
add a comment |
Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21
Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21
Perfect ! - Thanks so much
– Jeff Christman
Dec 17 at 14:21
add a comment |
If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:
$ cat > utf8.txt <<<"This is 日本語。"
$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
$ head utf*.txt
==> utf16.txt <==
��This is �e,g��0
==> utf8.txt <==
This is 日本語。
$ iconv -f UTF-16 -t UTF-8 utf16.txt
This is 日本語。
add a comment |
If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:
$ cat > utf8.txt <<<"This is 日本語。"
$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
$ head utf*.txt
==> utf16.txt <==
��This is �e,g��0
==> utf8.txt <==
This is 日本語。
$ iconv -f UTF-16 -t UTF-8 utf16.txt
This is 日本語。
add a comment |
If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:
$ cat > utf8.txt <<<"This is 日本語。"
$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
$ head utf*.txt
==> utf16.txt <==
��This is �e,g��0
==> utf8.txt <==
This is 日本語。
$ iconv -f UTF-16 -t UTF-8 utf16.txt
This is 日本語。
If by "Linux" you mean some program in Linux, assuming your terminal emulator's encoding is set to UTF-8, and your locale is some UTF-8 locale:
$ cat > utf8.txt <<<"This is 日本語。"
$ iconv -f UTF-8 -t UTF-16 utf8.txt > utf16.txt
$ head utf*.txt
==> utf16.txt <==
��This is �e,g��0
==> utf8.txt <==
This is 日本語。
$ iconv -f UTF-16 -t UTF-8 utf16.txt
This is 日本語。
answered Dec 17 at 3:07
muru
1
1
add a comment |
add a comment |
SHIFT-JIS can encode some pretty indecipherable stuff,
% cat phrase
?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
% hexdump -C phrase
00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
0000004b
% iconv -f SHIFT-JIS -t UTF-8 < phrase
Like quills upon the fretful porpentine.
Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)
These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.
Method to this Madness
First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z
range and translates them up into the full width range:
#!/usr/bin/env perl
use 5.24.0;
use warnings;
die "Usage: not-ascii ...n" unless @ARGV;
my $s = '';
for my $c ( split //, "@ARGV" ) {
if ( $c =~ m/[a-z]/ ) { # FF41
$s .= chr( 0xFF41 + ord($c) - 97 );
} elsif ( $c =~ m/[A-Z]/ ) { # FF21
$s .= chr( 0xFF21 + ord($c) - 65 );
} else {
$s .= $c;
}
}
binmode *STDOUT, ':encoding(UTF-8)';
say $s;
we can then take our Shakespeare to full width and encode it in SHIFT-JIS:
% not-ascii 'Like quills upon the fretful porpentine.'
| iconv -f UTF-8 -t SHIFT-JIS > phrase
SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l
. Most other encodings aren't very interesting, or fail to convert the UTF-8:
#!/bin/sh
IFS=' '
iconv -l | while read e unused; do
printf "$e "
printf "test phrasen" | iconv -f UTF-8 -t "$e"
done
though you really need a hex viewer to inspect the results:
% ./brutus-iconv > x
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:4: cannot convert
iconv: (stdin):1:0: cannot convert
% hexdump -C x
00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
...
add a comment |
SHIFT-JIS can encode some pretty indecipherable stuff,
% cat phrase
?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
% hexdump -C phrase
00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
0000004b
% iconv -f SHIFT-JIS -t UTF-8 < phrase
Like quills upon the fretful porpentine.
Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)
These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.
Method to this Madness
First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z
range and translates them up into the full width range:
#!/usr/bin/env perl
use 5.24.0;
use warnings;
die "Usage: not-ascii ...n" unless @ARGV;
my $s = '';
for my $c ( split //, "@ARGV" ) {
if ( $c =~ m/[a-z]/ ) { # FF41
$s .= chr( 0xFF41 + ord($c) - 97 );
} elsif ( $c =~ m/[A-Z]/ ) { # FF21
$s .= chr( 0xFF21 + ord($c) - 65 );
} else {
$s .= $c;
}
}
binmode *STDOUT, ':encoding(UTF-8)';
say $s;
we can then take our Shakespeare to full width and encode it in SHIFT-JIS:
% not-ascii 'Like quills upon the fretful porpentine.'
| iconv -f UTF-8 -t SHIFT-JIS > phrase
SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l
. Most other encodings aren't very interesting, or fail to convert the UTF-8:
#!/bin/sh
IFS=' '
iconv -l | while read e unused; do
printf "$e "
printf "test phrasen" | iconv -f UTF-8 -t "$e"
done
though you really need a hex viewer to inspect the results:
% ./brutus-iconv > x
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:4: cannot convert
iconv: (stdin):1:0: cannot convert
% hexdump -C x
00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
...
add a comment |
SHIFT-JIS can encode some pretty indecipherable stuff,
% cat phrase
?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
% hexdump -C phrase
00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
0000004b
% iconv -f SHIFT-JIS -t UTF-8 < phrase
Like quills upon the fretful porpentine.
Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)
These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.
Method to this Madness
First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z
range and translates them up into the full width range:
#!/usr/bin/env perl
use 5.24.0;
use warnings;
die "Usage: not-ascii ...n" unless @ARGV;
my $s = '';
for my $c ( split //, "@ARGV" ) {
if ( $c =~ m/[a-z]/ ) { # FF41
$s .= chr( 0xFF41 + ord($c) - 97 );
} elsif ( $c =~ m/[A-Z]/ ) { # FF21
$s .= chr( 0xFF21 + ord($c) - 65 );
} else {
$s .= $c;
}
}
binmode *STDOUT, ':encoding(UTF-8)';
say $s;
we can then take our Shakespeare to full width and encode it in SHIFT-JIS:
% not-ascii 'Like quills upon the fretful porpentine.'
| iconv -f UTF-8 -t SHIFT-JIS > phrase
SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l
. Most other encodings aren't very interesting, or fail to convert the UTF-8:
#!/bin/sh
IFS=' '
iconv -l | while read e unused; do
printf "$e "
printf "test phrasen" | iconv -f UTF-8 -t "$e"
done
though you really need a hex viewer to inspect the results:
% ./brutus-iconv > x
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:4: cannot convert
iconv: (stdin):1:0: cannot convert
% hexdump -C x
00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
...
SHIFT-JIS can encode some pretty indecipherable stuff,
% cat phrase
?k?????? ???????????? ???????? ?????? ?????????????? ????????????????????.
% hexdump -C phrase
00000000 82 6b 82 89 82 8b 82 85 20 82 91 82 95 82 89 82 |.k...... .......|
00000010 8c 82 8c 82 93 20 82 95 82 90 82 8f 82 8e 20 82 |..... ........ .|
00000020 94 82 88 82 85 20 82 86 82 92 82 85 82 94 82 86 |..... ..........|
00000030 82 95 82 8c 20 82 90 82 8f 82 92 82 90 82 85 82 |.... ...........|
00000040 8e 82 94 82 89 82 8e 82 85 2e 0a |...........|
0000004b
% iconv -f SHIFT-JIS -t UTF-8 < phrase
Like quills upon the fretful porpentine.
Images are also necessary for encoding problems, as some display software will "tofu" the text (white rectangles) while others display it just fine, or there may be various other disagreements about how things are being rendered only an image will help clear up (well, image, and a hexdump...)
These are from the Unicode fullwidth range that starts around U+FF01. Even more fun might be had with The Confusables.
Method to this Madness
First you'll need some means to generate text in non-standard unicode ranges, either with automation or by manually pasting a phrase together. Here's a converter that takes the a-zA-Z
range and translates them up into the full width range:
#!/usr/bin/env perl
use 5.24.0;
use warnings;
die "Usage: not-ascii ...n" unless @ARGV;
my $s = '';
for my $c ( split //, "@ARGV" ) {
if ( $c =~ m/[a-z]/ ) { # FF41
$s .= chr( 0xFF41 + ord($c) - 97 );
} elsif ( $c =~ m/[A-Z]/ ) { # FF21
$s .= chr( 0xFF21 + ord($c) - 65 );
} else {
$s .= $c;
}
}
binmode *STDOUT, ':encoding(UTF-8)';
say $s;
we can then take our Shakespeare to full width and encode it in SHIFT-JIS:
% not-ascii 'Like quills upon the fretful porpentine.'
| iconv -f UTF-8 -t SHIFT-JIS > phrase
SHIFT-JIS was found to be usable for this prurpose by doing a brute-force search converting the UTF-8 input into all the encodings listed by iconf -l
. Most other encodings aren't very interesting, or fail to convert the UTF-8:
#!/bin/sh
IFS=' '
iconv -l | while read e unused; do
printf "$e "
printf "test phrasen" | iconv -f UTF-8 -t "$e"
done
though you really need a hex viewer to inspect the results:
% ./brutus-iconv > x
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:0: cannot convert
iconv: (stdin):1:4: cannot convert
iconv: (stdin):1:0: cannot convert
% hexdump -C x
00000000 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 20 74 |ANSI_X3.4-1968 t|
00000010 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 2d 38 |est phrase.UTF-8|
00000020 20 74 65 73 74 20 70 68 72 61 73 65 0a 55 54 46 | test phrase.UTF|
00000030 2d 38 2d 4d 41 43 20 74 65 73 74 20 70 68 72 61 |-8-MAC test phra|
...
edited Dec 18 at 1:03
answered Dec 17 at 3:26
thrig
24.2k23056
24.2k23056
add a comment |
add a comment |
Thanks for contributing an answer to Unix & Linux Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f489396%2fdemonstrate-file-encoding%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You should firstly understand what character-encoding is. Characters encoded in one encoding being read as another encoding leads to mojibaka, CONVERTING it to another encoding will not solve the mojibaka.
– 炸鱼薯条德里克
Dec 17 at 3:00
Also this is not unix&linux specific. Readability is affected by file permissions stored as metadata in the filesystem, which is unrelated with text encoding.
– 炸鱼薯条德里克
Dec 17 at 3:01
4
This is why I wish we could downvote comments. (1) The notion of character encoding is not specific to Unix&Linux, but the tools for working with it are OS-specific. (2) This question has nothing to do with file permissions. Why are you even mentioning it?
– G-Man
Dec 17 at 3:16
Mojibaka ... Chewbacca's sibling? Something from Star Bucks? Oh... Mojibake! ;)
– Christopher
Dec 18 at 1:13