How are wide chars handled atomically? [closed]
Some languages have wide chars. A wide char can have multiple bytes. When you type a wide char in console or X, you are actually sending several bytes. A single-byte char is by itself atomic, in the sense that it's either transmitted or not transmitted, received or not received. But this is not true for a wide char. For example, delivering only the first byte of a 3-byte char would generate garbage. How does the underlying system guarantee an application always receives a wide char atomically? A good answer should explain what happens, in sequence, when a user types a wide char in the console, in X, and over ssh, respectively. To start the story: When a user types a wide char, an interrupt is generated...
How does the underlying system guarantee an application always receives a wide char atomically?
Since there are multiple layers in the stack, the terminology in this question has probably confused some people. What I actually mean is: Suppose you are writing a GUI app with an edit box. When you type a wide char, it is either shown in full or not shown, never partially shown. So the underlying system includes everything below the app (in this case, the application framework, GUI library, etc.).
linux keyboard-event
closed as too broad by Rui F Ribeiro, Stephen Harris, Isaac, Thomas Dickey, Stephen Kitt Jan 3 at 9:34
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
Your premise "a single-byte char is atomic" is incorrect, depending on the transmission medium. e.g. on a serial port it's sent as a series of bits (potentially 9 bits, on an 8n1 transmission) and any of those bits could get corrupted. So the character received may not be the character sent. This is why transmission protocols may have error correction built in. This can also apply to network transmissions (e.g. IP checksums, retransmissions, etc).
– Stephen Harris
Jan 3 at 2:55
you are assuming that the software does not wait until several characters are received
– jsotola
Jan 3 at 2:59
The topic of this group is not providing canned interview answers. This one also does not seem to be a particularly good one.
– Rui F Ribeiro
Jan 3 at 3:21
It's not atomic. Depending on how you transport the data, single-byte data might or might not be atomic. The same goes for multi-byte data. And what interrupt are you talking about?
– 炸鱼薯条德里克
Jan 3 at 3:31
@炸鱼薯条德里克 en.wikipedia.org/wiki/Keyboard_interrupt
– Cyker
Jan 3 at 4:07
asked Jan 3 at 2:47 by Cyker, edited Jan 3 at 19:19
1 Answer
They aren't.
A (text) application receives a byte stream on standard input. The application is free to read as many or as few of them as it likes, and indeed a read call is free to return fewer than the requested number of bytes. It is free to interpret that byte stream however it wishes, to request reading more bytes, or to treat everything as opaque binary data. If it wants to interpret some of the bytes as part of one distinct, discrete chunk, it's able to do so, and it's able to keep looking until it decides it has that chunk. Nothing guarantees it gets it all at once unless it's a single byte.
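The short-read behaviour can be seen with a small sketch. It uses Python's os.read, which is a thin wrapper over the read call, on a pipe standing in for standard input; the helper name read_exactly and the choice of the euro sign as the test character are assumptions made for illustration only.

```python
import os

def read_exactly(fd, n):
    """Keep calling os.read until n bytes have arrived or end of input is hit."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = os.read(fd, remaining)  # may return fewer than `remaining` bytes
        if not chunk:                   # an empty result means end of input
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# A pipe stands in for standard input. The writer delivers the 3-byte
# UTF-8 encoding of U+20AC EURO SIGN in two separate pieces.
r, w = os.pipe()
os.write(w, b"\xe2\x82")        # only the first two bytes are available so far
first = os.read(r, 3)           # asks for 3 bytes, receives 2
os.write(w, b"\xac")            # the final byte arrives later
os.close(w)
rest = read_exactly(r, 3 - len(first))
os.close(r)

print(first, rest, (first + rest).decode("utf-8"))
```

The first read returns a fragment that is not valid text on its own; it only becomes a character once the application has combined it with the later byte.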
That is the short answer and you can stop reading now if you like: the premise of the question is faulty, because it is up to the application to do whatever it needs to get the result it wants.
Below we'll carry on with some of the ways things may (or may not) work out, and when they may or may not be "atomic" from someone's point of view.
For the rest of this answer, we're going to ignore network and socket traffic. We're also going to adopt a fairly fuzzy definition of what a "character" is, since that seems to be the starting point, and perhaps refine it as we go. We'll consider things from the final application's perspective. At the very end, we'll look briefly at the (mostly off-topic) path starting from the keyboard hardware. It's still going to be too long.
The application wants to get an input character
The application asks to read some bytes from its input. It is then also free to interpret the bytes it received in any way that it likes, or to have a library do so for it. In the case where it is expecting text, it will interpret those bytes through some character encoding, which determines what a byte sequence means. Hopefully, it agrees with the data source or terminal on what encoding is in use (but it doesn't have to!). For whatever encoding the application decides to interpret things as:
- If that encoding is ASCII, all is well, since everything is single-byte anyway.
- If that encoding is UCS-2, for a 16-bit wchar_t, it had best know how many bytes it's read already so that it can ask for another if need be (since read may return an odd number of bytes at any time, giving you half a code unit).
- If that encoding is UTF-9, it probably needs to do some bit mangling of its own to get the code unit out, and it's virtually impossible to deliver it atomically on modern systems.
- If that encoding is KEIS, atomic delivery at the level you're expecting isn't even possible, because the interpretation of a sequence depends on stateful mode shifts earlier in the stream. The application needs to remember what it saw already to know what this piece of input means.
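For the UCS-2 case, the bookkeeping might look like the following sketch. The class name Ucs2Reader is hypothetical, and little-endian byte order is an assumption; a real stream would need to agree on byte order some other way.

```python
class Ucs2Reader:
    """Buffers incoming bytes and hands back only complete 16-bit code units.

    Assumes little-endian byte order (an assumption for this sketch).
    """

    def __init__(self):
        self._pending = b""   # at most one leftover byte between calls

    def feed(self, data):
        buf = self._pending + data
        usable = len(buf) - (len(buf) % 2)   # largest even prefix
        self._pending = buf[usable:]         # keep the odd trailing byte, if any
        return [int.from_bytes(buf[i:i + 2], "little")
                for i in range(0, usable, 2)]

reader = Ucs2Reader()
units = reader.feed(b"\xe9\x00\xac")   # all of U+00E9, then half of U+20AC
units += reader.feed(b"\x20")          # the other half arrives in a later read
print([hex(u) for u in units])
```

The odd byte left over from the first chunk is held back until its partner arrives, so the caller never sees half a code unit.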
If that encoding is UTF-8, which these days is overwhelmingly likely, the application can synchronise using any single byte because it is a self-synchronising encoding: the high bit is 0 for single-byte characters, and 1 for any part of a multi-byte sequence. The first byte of a 2-, 3-, or 4-byte sequence starts with the bits 110, 1110, or 11110 respectively, and continuation bytes start with 10. Reading one byte tells you how many more you need, or that you're in the middle of something already.
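Those bit patterns translate directly into a classifier for a single byte. This is a minimal sketch; the function name utf8_byte_kind is made up here, and it only checks the prefix bits, so it does not reject lead bytes that can never appear in valid UTF-8 (such as 0xC0 or 0xF5).

```python
def utf8_byte_kind(byte):
    """Classify one byte by its leading bits.

    Returns the total sequence length (1 to 4) for a lead byte,
    or 0 if the byte is a continuation byte.
    """
    if (byte & 0b1000_0000) == 0b0000_0000:
        return 1          # 0xxxxxxx: a complete single-byte character
    if (byte & 0b1100_0000) == 0b1000_0000:
        return 0          # 10xxxxxx: middle of a multi-byte sequence
    if (byte & 0b1110_0000) == 0b1100_0000:
        return 2          # 110xxxxx: one continuation byte follows
    if (byte & 0b1111_0000) == 0b1110_0000:
        return 3          # 1110xxxx: two continuation bytes follow
    if (byte & 0b1111_1000) == 0b1111_0000:
        return 4          # 11110xxx: three continuation bytes follow
    raise ValueError("byte cannot appear in UTF-8: 0x%02x" % byte)

# The euro sign U+20AC is E2 82 AC: one lead byte and two continuation bytes.
print([utf8_byte_kind(b) for b in b"\xe2\x82\xac"])
```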
If necessary, the application may then wish to make a second read for the following byte(s). It needs to remember the partial element it's already read, and combine the two pieces together at the end. It might even need to make a third or fourth read.
It can then process them discretely as it wishes, perhaps emitting a single 32-bit value for the code point. It might be sensible to abstract this away into a function or library for reuse.
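Put together, a reusable function of that kind might look like this sketch. The name read_code_point is hypothetical, the input is a pipe standing in for standard input, and the final validation is delegated to Python's own UTF-8 decoder rather than written out by hand.

```python
import os

def read_code_point(fd):
    """Read exactly one UTF-8 encoded code point from fd.

    Returns the code point as an int, or None at end of input.
    """
    first = os.read(fd, 1)
    if not first:
        return None
    lead = first[0]
    if lead < 0x80:
        need = 0                      # single byte, nothing more to read
    elif lead >> 5 == 0b110:
        need = 1
    elif lead >> 4 == 0b1110:
        need = 2
    elif lead >> 3 == 0b11110:
        need = 3
    else:
        raise ValueError("stream does not start at a sequence boundary")
    buf = first
    while len(buf) < need + 1:        # may take a second, third or fourth read
        chunk = os.read(fd, need + 1 - len(buf))
        if not chunk:
            raise EOFError("input ended inside a multi-byte sequence")
        buf += chunk
    return ord(buf.decode("utf-8"))   # the decoder rejects malformed sequences

r, w = os.pipe()
os.write(w, "\u00e9\u20ac\U0001F600".encode("utf-8"))   # 2-, 3- and 4-byte cases
os.close(w)
points = []
while True:
    cp = read_code_point(r)
    if cp is None:
        break
    points.append(cp)
os.close(r)
print([hex(p) for p in points])
```

Each call returns one 32-bit-sized value regardless of how many reads it took to assemble it, which is the "atomic" delivery the application builds for itself.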
There are other encodings available with all of those properties, and more. Some of those are not permitted by POSIX as system encodings, but might be agreed upon between applications. More likely than not, if you have a multi-byte system encoding on a Unix-like system today, it's UTF-8.
But what is a character anyway?
My multi-byte character "é" might be two bytes of UTF-8 (C3 A9) or three (65 CC 81), because it might be a single code point (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or two (U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT). "👨👨👧👦" is going to be even more complex. What does your application consider atomic? That's up to the application, as in fact everything up to this point has been.
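The two spellings of "é" can be checked with a few lines using Python's standard unicodedata module; the variable names are only for illustration.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(composed.encode("utf-8").hex())     # c3a9
print(decomposed.encode("utf-8").hex())   # 65cc81
print(len(composed), len(decomposed))     # 1 2  (counted in code points)
print(composed == decomposed)             # False

# Normalisation maps one form onto the other, if the application asks for it.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Both strings render the same, but they are different byte sequences and different code point sequences until the application chooses to normalise them.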
Unicode calls these multi-codepoint characters "grapheme clusters". If what you want atomically is one of those, then you do all of the above to read a codepoint - and then you check it against your UAX #29 Text Segmentation state table, and decide whether it is possible for another codepoint to follow in the same cluster. If so, you try to read all over again, and keep doing so until you reach the start of a new cluster, a dead end in the state machine, or the end of input.
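The shape of that loop can be shown with a deliberately simplified stand-in. The sketch below is not the UAX #29 algorithm: it only attaches combining marks to the preceding code point, and ignores zero-width joiners, Hangul syllables, regional indicators and everything else the real rules cover. The function name simple_clusters is hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    A simplification for illustration only, not UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        # A non-zero combining class means "this attaches to what came before".
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(simple_clusters("e\u0301a"))    # two clusters from three code points
```

Note that a cluster cannot be declared finished until the next code point has been examined, or the input has ended, which is exactly why delivery cannot be atomic at this level.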
A library within the application may abstract away any or all of this, and deliver a stream of bytes, code units, grapheme clusters, or something else. If your program is in Swift, for example, you get grapheme clusters. If it's in C, you get bytes, where "you" includes all your libraries - you can choose any integrated behaviour you want as long as you work against the API of the library that provides it.
Other multi-byte sequences, like the console escapes for the arrow keys or backspace (press Ctrl-V Left and see what you get), are a similar thing again: are they one unit, or not? How do you tell? Sometimes you can't, and you have to guess. For example, zsh's zkbd system asks you to press certain keys so it can remember them for keybindings - but it just uses a timeout to tell when one sequence's input is finished. Later on, it knows if it's seeing a prefix of a sequence it recognises, and it looks to see if it gets the whole thing before it acts (what happens over SSH on a bad connection?). In context, that works well enough for what's being done at the time, but in other situations it might not. That's an application question like the others.
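A timeout-based guess of that sort might be sketched as follows. This is not how zsh implements it; the function name read_key, the 50 ms timeout, and the rule for spotting the final byte of an escape sequence are all assumptions made for illustration, and select on a pipe assumes a Unix-like system.

```python
import os
import select

def read_key(fd, timeout=0.05):
    """Read one key's worth of bytes, guessing where an escape sequence ends."""
    first = os.read(fd, 1)
    if first != b"\x1b":
        return first                  # not ESC: treat the byte as a key by itself
    seq = first
    while True:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            break                     # nothing followed in time: call it finished
        chunk = os.read(fd, 1)
        if not chunk:
            break                     # end of input
        seq += chunk
        # Assumed heuristic: after the introducer, a byte in 0x40..0x7E ends it.
        if len(seq) > 2 and 0x40 <= chunk[0] <= 0x7E:
            break
    return seq

r, w = os.pipe()
os.write(w, b"\x1b[D")                # what many terminals send for Left
arrow = read_key(r)
os.write(w, b"\x1b")                  # a bare Escape with nothing after it
escape = read_key(r)                  # returns only after the timeout expires
os.close(w)
os.close(r)
print(arrow, escape)
```

On a slow link the bytes of one sequence can arrive further apart than the timeout, in which case this guess splits one key into several, which is the failure mode hinted at above.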
None of that is addressed by the (Unix-like) system, which only delivers bytes. A byte is atomic at the Unix API level, but essentially nothing else is. Everything else is up to the application to enforce for itself.
Text representation is hard. There is no clear, correct, universal answer. The application is able to construct one that works for itself, but the system doesn't attempt to provide anything more than the ability to construct it.
How did they get there in the first place?
As far as the keyboard goes, it sends a message to the system, which gives it to the kernel keyboard driver, which gives it to X, which gives it to the terminal emulator, which sends some encoded bytes to standard input. Each of those messages has its own encoding and its own means of exchange, and may even be split into multiple parts (or coalesced) relative to the previous layer - for example, XCompose allows converting sequences of one or more key presses into single or multiple virtual inputs - and you can consider how you might go about making all of those messages "atomic" as well if you want.
The first message is likely to be "keys 213 and 71 are down, caps lock is off, num lock is on"; the penultimate an X KeyPress event (try xev); the last a series of bytes as discussed. Essentially none of that is on-topic here, since it falls under systems programming, general programming, or hardware design, but it may be suitable on some other network sites. The whole stack is too broad to answer all at once anywhere.
Thank you very much for presenting such a clearly written answer. Your patience and expertise are greatly appreciated.
– Cyker
Jan 3 at 19:31
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
They aren't.
A (text) application receives a byte stream on standard input. The application is free to read as many or as few of them as it likes, and indeed a read call is free to return fewer than the requested number of bytes. It is free to interpret that byte stream however it wishes, to request reading more bytes, or to treat everything as opaque binary data. If it wants to interpret some of the bytes as part of one distinct, discrete chunk, it's able to do so, and it's able to keep looking until it decides it has that chunk. Nothing guarantees it gets it all at once unless it's a single byte.
That is the short answer and you can stop reading now if you like: the premise of the question is faulty, up to the application being able to do whatever it wants to get the result it chooses.
Below we'll carry on with some of the ways things may (or may not) work out, and when they may or may not be "atomic" from someone's point of view.
For the rest of this answer, we're going to ignore network and socket traffic. We're also going to adopt a fairly fuzzy definition of what a "character" is, since that seems to be the starting point, and perhaps refine it as we go. We'll consider things from the final application's perspective. At the very end, we'll look briefly at the (mostly off-topic) path starting from the keyboard hardware. It's still going to be too long.
The application wants to get an input character
The application asks to read some bytes from its input. It is then also free to interpret the bytes it received in any way that it likes, or to have a library do so for it. In the case where it is expecting text, it will interpret those bytes through some character encoding, that determines what a byte sequence means. Hopefully, it agrees with the data source or terminal on what encoding is in use (but it doesn't have to!). For whatever encoding the application decides to interpret things as:
- If that encoding is ASCII, all is well, since everything is single-byte anyway.
- If that encoding is UCS-2, for a 16-bit
wchar_t, it had best know how many bytes it's read already so that it can ask for another if need be (sincereadmay return an odd number of bytes at any time, giving you half a code unit). - If that encoding is UTF-9, it probably needs to do some bit mangling of its own to get the code unit out, and it's virtually impossible to deliver it atomically on modern systems.
- If that encoding is KEIS, atomic delivery at the level you're expecting isn't even possible, because the interpretation of a sequence depends on stateful mode shifts earlier in the stream. The application needs to remember what it saw already to know what this piece of input means.
If that encoding is UTF-8, which these days is overwhelmingly likely, the application can synchronise using any single byte because it is a self-synchronising encoding: the high bit is 0 for single-byte code units, and 1 for any part of a multi-byte sequence. The first byte of a 2-, 3-, or 4-byte code unit has 110, 1110, or 11110 respectively, and continuation bytes start 10. Reading one byte tells you how many more you need, or that you're in the middle of something already.
If necessary, the application may then wish to make a second
readfor the following byte(s). It needs to remember the partial element it's already read, and combine the two pieces together at the end. It might even need to make a third or fourth read.
It can then process them discretely as it wishes, perhaps emitting a single 32-bit value for the code point. It might be sensible to abstract this away into a function or library for reuse.
There are other encodings available with all of those properties, and more. Some of those are not permitted by POSIX as system encodings, but might be agreed upon between applications. More likely than not, if you have a multi-byte system encoding on a Unix-like system today, it's UTF-8.
But what is a character anyway?
My multi-byte character "é" might be two bytes of UTF-8 (C3 A9) or three (65 CC 81), because it might be a single code point (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or two (U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT). "👨👨👧👦" is going to be even more complex. What does your application consider atomic? That's up to the application, as in fact everything up to this point has been.
Unicode calls these multi-codepoint characters "grapheme clusters". If what you want atomically is one of those, then you do all of the above to read a codepoint - and then you check it against your UAX29 Text Segmentation state table, and decide whether it is possible for another codepoint to follow in the same cluster. If so, you try to read all over again, and keep doing so until you reaches the start of a new cluster, a dead end in the state machine, or the end of input.
A library within the application may abstract away any or all of this, and deliver a stream of bytes, code units, grapheme clusters, or something else. If you program is in Swift, for example, you get grapheme clusters. If it's in C, you get bytes, where "you" includes all your libraries - you can choose any integrated behaviour you want as long as you work against the API of the library that provides it.
Other multi-byte sequences, like the console escapes for the arrow keys or backspace (press Ctrl-V Left and see what you get), are a similar thing again: are they one unit, or not? How do you tell? Sometimes you can't, and you have to guess. For example, zsh's zkbd system asks you to press certain keys so it can remember them for keybindings - but it just uses a timeout to tell when one sequence's input is finished. Later on, it knows if it's seeing a prefix of a sequence it recognises, and it looks to see if it gets the whole thing before it acts (what happens over SSH on a bad connection?). In context, that works well enough for what's being done at the time, but in other situations it might not. That's an application question like the others.
None of that is addressed by the (Unix-like) system, which only delivers bytes. A byte is atomic at the Unix API level, but essentially nothing else is. Everything else is up to the application to enforce for itself.
Text representation is hard. There is no clear, correct, universal answer. The application is able to construct one that works for itself, but the system doesn't attempt to provide anything more than the ability to construct it.
How did they get there in the first place?
As far as the keyboard goes, it sends a message to the system, which gives it to the kernel keyboard driver, which gives it to X, which gives it to the terminal emulator, which sends some encoded bytes to standard input. Each of those messages has its own encoding, and its own way of exchanging, and may even be split into multiple parts from the previous layer (or coalesced) - for example, XCompose allows converting sequences of one or more key presses into single or multiple virtual inputs - and you can consider how you might go about making all of those messages "atomic" as well if you want.
The first message is likely to be "keys 213 and 71 are down, caps lock is off, num lock is on"; the penultimate an X KeyPress event (try xev); the last a series of bytes as discussed. Essentially none of that is on-topic here, as one of systems programming, general programming, or hardware design, but may be suitable on some other network sites. The whole stack is too broad to answer all at once anywhere.
Thank you very much for presenting such a clear written answer. Your patience and expertise are greatly appreciated.
– Cyker
Jan 3 at 19:31
add a comment |
They aren't.
A (text) application receives a byte stream on standard input. The application is free to read as many or as few of them as it likes, and indeed a read call is free to return fewer than the requested number of bytes. It is free to interpret that byte stream however it wishes, to request reading more bytes, or to treat everything as opaque binary data. If it wants to interpret some of the bytes as part of one distinct, discrete chunk, it's able to do so, and it's able to keep looking until it decides it has that chunk. Nothing guarantees it gets it all at once unless it's a single byte.
That is the short answer and you can stop reading now if you like: the premise of the question is faulty, up to the application being able to do whatever it wants to get the result it chooses.
Below we'll carry on with some of the ways things may (or may not) work out, and when they may or may not be "atomic" from someone's point of view.
For the rest of this answer, we're going to ignore network and socket traffic. We're also going to adopt a fairly fuzzy definition of what a "character" is, since that seems to be the starting point, and perhaps refine it as we go. We'll consider things from the final application's perspective. At the very end, we'll look briefly at the (mostly off-topic) path starting from the keyboard hardware. It's still going to be too long.
The application wants to get an input character
The application asks to read some bytes from its input. It is then also free to interpret the bytes it received in any way that it likes, or to have a library do so for it. In the case where it is expecting text, it will interpret those bytes through some character encoding, that determines what a byte sequence means. Hopefully, it agrees with the data source or terminal on what encoding is in use (but it doesn't have to!). For whatever encoding the application decides to interpret things as:
- If that encoding is ASCII, all is well, since everything is single-byte anyway.
- If that encoding is UCS-2, for a 16-bit
wchar_t, it had best know how many bytes it's read already so that it can ask for another if need be (sincereadmay return an odd number of bytes at any time, giving you half a code unit). - If that encoding is UTF-9, it probably needs to do some bit mangling of its own to get the code unit out, and it's virtually impossible to deliver it atomically on modern systems.
- If that encoding is KEIS, atomic delivery at the level you're expecting isn't even possible, because the interpretation of a sequence depends on stateful mode shifts earlier in the stream. The application needs to remember what it saw already to know what this piece of input means.
If that encoding is UTF-8, which these days is overwhelmingly likely, the application can synchronise using any single byte because it is a self-synchronising encoding: the high bit is 0 for single-byte code units, and 1 for any part of a multi-byte sequence. The first byte of a 2-, 3-, or 4-byte code unit has 110, 1110, or 11110 respectively, and continuation bytes start 10. Reading one byte tells you how many more you need, or that you're in the middle of something already.
If necessary, the application may then wish to make a second
readfor the following byte(s). It needs to remember the partial element it's already read, and combine the two pieces together at the end. It might even need to make a third or fourth read.
It can then process them discretely as it wishes, perhaps emitting a single 32-bit value for the code point. It might be sensible to abstract this away into a function or library for reuse.
There are other encodings available with all of those properties, and more. Some of those are not permitted by POSIX as system encodings, but might be agreed upon between applications. More likely than not, if you have a multi-byte system encoding on a Unix-like system today, it's UTF-8.
But what is a character anyway?
My multi-byte character "é" might be two bytes of UTF-8 (C3 A9) or three (65 CC 81), because it might be a single code point (U+00E9 LATIN SMALL LETTER E WITH ACUTE) or two (U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT). "👨👨👧👦" is going to be even more complex. What does your application consider atomic? That's up to the application, as in fact everything up to this point has been.
Unicode calls these multi-code-point characters "grapheme clusters". If what you want atomically is one of those, then you do all of the above to read a code point - and then you check it against your UAX #29 Text Segmentation state table, and decide whether it is possible for another code point to follow in the same cluster. If so, you try to read all over again, and keep doing so until you reach the start of a new cluster, a dead end in the state machine, or the end of input.
A library within the application may abstract away any or all of this, and deliver a stream of bytes, code units, grapheme clusters, or something else. If your program is in Swift, for example, you get grapheme clusters. If it's in C, you get bytes, where "you" includes all your libraries - you can choose any integrated behaviour you want as long as you work against the API of the library that provides it.
Other multi-byte sequences, like the console escapes for the arrow keys or backspace (press Ctrl-V Left and see what you get), are a similar thing again: are they one unit, or not? How do you tell? Sometimes you can't, and you have to guess. For example, zsh's zkbd system asks you to press certain keys so it can remember them for keybindings - but it just uses a timeout to tell when one sequence's input is finished. Later on, it knows if it's seeing a prefix of a sequence it recognises, and it looks to see if it gets the whole thing before it acts (what happens over SSH on a bad connection?). In context, that works well enough for what's being done at the time, but in other situations it might not. That's an application question like the others.
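A prefix-plus-timeout reader of that general kind could be sketched as follows in Python. This is not zsh's actual implementation. The table KNOWN_SEQUENCES, the function read_key and the 50 ms default are all assumptions for illustration, and the arrow-key byte sequences shown are the ones many terminals send, not a guarantee. It assumes a Unix-like system where select works on the file descriptor.

```python
import os
import select

# Hypothetical table; real applications usually take these from terminfo.
KNOWN_SEQUENCES = {
    b"\x1b[A": "Up",
    b"\x1b[B": "Down",
    b"\x1b[C": "Right",
    b"\x1b[D": "Left",
}


def read_key(fd: int, timeout: float = 0.05) -> bytes:
    """Read one key's worth of bytes, guessing boundaries with a timeout."""
    buf = os.read(fd, 1)
    # Keep reading only while buf is a proper prefix of a known sequence.
    while (buf and buf not in KNOWN_SEQUENCES
           and any(seq.startswith(buf) for seq in KNOWN_SEQUENCES)):
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            break                  # nothing more arrived: treat buf as complete
        more = os.read(fd, 1)
        if not more:
            break                  # end of input
        buf += more
    return buf
```

With a short timeout and a slow link, the tail of an arrow-key sequence can arrive too late and be treated as separate keystrokes, which is the failure mode hinted at above.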
None of that is addressed by the (Unix-like) system, which only delivers bytes. A byte is atomic at the Unix API level, but essentially nothing else is. Everything else is up to the application to enforce for itself.
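A small Python sketch shows the system handing over part of a character without complaint. It uses a pipe in place of a terminal, and the function name split_read_demo is made up for the example.

```python
import os


def split_read_demo() -> tuple[bytes, bytes]:
    """Write a 3-byte character to a pipe, then read it back in two pieces."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, "€".encode("utf-8"))  # three bytes: e2 82 ac
        first = os.read(read_fd, 2)              # we asked for 2, we get 2
        rest = os.read(read_fd, 2)               # only 1 byte is left
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return first, rest
```

Neither piece is valid UTF-8 on its own; only the application joining them back together recovers the character.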
Text representation is hard. There is no clear, correct, universal answer. The application is able to construct one that works for itself, but the system doesn't attempt to provide anything more than the ability to construct it.
How did they get there in the first place?
As far as the keyboard goes, it sends a message to the system, which gives it to the kernel keyboard driver, which gives it to X, which gives it to the terminal emulator, which sends some encoded bytes to the application's standard input. Each of those messages has its own encoding and its own means of exchange, and may be split into multiple parts (or coalesced) relative to the previous layer: for example, XCompose allows converting sequences of one or more key presses into single or multiple virtual inputs. You can consider how you might go about making all of those messages "atomic" as well if you want.
The first message is likely to be "keys 213 and 71 are down, caps lock is off, num lock is on"; the penultimate an X KeyPress event (try xev); the last a series of bytes as discussed. Essentially none of that is on-topic here, since it is a matter of systems programming, general programming, or hardware design, but it may be suitable on some other network sites. The whole stack is too broad to answer all at once anywhere.
Thank you very much for presenting such a clear written answer. Your patience and expertise are greatly appreciated.
– Cyker
Jan 3 at 19:31
answered Jan 3 at 6:12
Michael Homer
Your premise "a single-byte char is atomic" is incorrect, depending on the transmission medium. For example, on a serial port it's sent as a series of bits (10 bits on the wire for an 8N1 transmission: a start bit, eight data bits, and a stop bit) and any of those bits could get corrupted. So the character received may not be the character sent. This is why transmission protocols may have error correction built in. This can also apply to network transmissions (e.g. IP checksums, retransmissions, etc).
– Stephen Harris
Jan 3 at 2:55
you are assuming that the software does not wait until several characters are received
– jsotola
Jan 3 at 2:59
The topic of this group is not providing canned interview answers. This one also does not seem a particularly good one.
– Rui F Ribeiro
Jan 3 at 3:21
It's not atomic. Depending on how you transport data, single-byte data might or might not be atomic, and the same goes for multi-byte data. And what interrupt are you talking about?
– 炸鱼薯条德里克
Jan 3 at 3:31
@炸鱼薯条德里克 en.wikipedia.org/wiki/Keyboard_interrupt
– Cyker
Jan 3 at 4:07