Conversation
Notices
-
Santa Claes 🇸🇪🇭🇰🎅 (clacke@libranet.de)'s status on Saturday, 07-May-2022 05:52:46 UTC Santa Claes 🇸🇪🇭🇰🎅 Unicode is a big pile of tables that assign numbers to things, mostly characters. Capital A is 65, for example, and ⍼ is 9084. ⍼? What's that? We'll get back to that.
The tables also contain things that aren't quite characters, like "new line", "combining grave accent" or "zero-width non-joiner". Not everything that is a "Unicode character" has a single number. The question "what's a character?" is a whole story of its own.
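A minimal Rust sketch of the points above, using only the standard library. The helper name `code_points` is made up for illustration; it simply lists the numbers Unicode assigns to each item in a string.

```rust
/// Hypothetical helper: list the Unicode code points of a string as numbers.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    // Capital A is code point 65; U+237C is 9084 in decimal.
    println!("{:?}", code_points("A")); // [65]
    println!("{:?}", code_points("\u{237C}")); // [9084]

    // "New line" is in the tables too, as code point 10.
    println!("{:?}", code_points("\n")); // [10]

    // "e" followed by COMBINING GRAVE ACCENT (U+0300) is displayed as one
    // accented letter, yet it is stored as two code points.
    println!("{:?}", code_points("e\u{0300}")); // [101, 768]
}
```

The last case is one way a single "character" on screen ends up with more than one number.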
dev.to/awwsmm/why-no-modern-pr… -
Santa Claes 🇸🇪🇭🇰🎅 (clacke@libranet.de)'s status on Saturday, 07-May-2022 05:53:55 UTC Santa Claes 🇸🇪🇭🇰🎅 Originally the people who created Unicode 1.0 took every table of characters (and other things) from every national standard and some international standards and gave (almost) every number in those tables some corresponding number in Unicode.
Many numbers from different national standards were assigned a single number in Unicode, which made a lot of people upset, but that's a whole story of its own too[1]. The point was that the numbers representing text in any pre-existing digital document could all be translated to these new numbers from one table. This allows characters that were previously from different standards to appear in one document, and it allows for software and its authors to keep track of fewer evolving standards of tables.
[1] en.wikipedia.org/wiki/CJK_Unif… -
Santa Claes 🇸🇪🇭🇰🎅 (clacke@libranet.de)'s status on Saturday, 07-May-2022 08:20:39 UTC Santa Claes 🇸🇪🇭🇰🎅 @veer66 It's just about the naming. "Character" is a bad name, because it has a lot of baggage and a "character" is not a very well-defined concept.
A code point is a well-defined concept. A grapheme is what most people would feel is a "character", or maybe a "grapheme cluster" is, but those are entities that will always be in flux, with minor updates in every Unicode version. They're not the elementary building blocks that programmers for decades have led themselves to believe they are. -
veer66 (veer66@mstdn.io)'s status on Saturday, 07-May-2022 08:20:40 UTC veer66 @clacke After reading the article, I don't see the problem with character type. :blob_dizzy_face:
Moreover, if s.chars() returned Iterator<u32> instead of Chars, would it help fix anything?
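One way to look at this question is a small Rust sketch using only the standard library. The helper name `scalar_values` is hypothetical. It suggests that a `u32` iterator would yield the same items under a different type, so the change would be about naming and guarantees rather than behavior.

```rust
/// Hypothetical helper: the items `chars()` yields, viewed as plain numbers.
fn scalar_values(s: &str) -> Vec<u32> {
    s.chars().map(u32::from).collect()
}

fn main() {
    let s = "a\u{0300}b"; // 'a', COMBINING GRAVE ACCENT, 'b'

    // What `chars()` yields today: `char` values (Unicode scalar values).
    let as_chars: Vec<char> = s.chars().collect();
    let as_u32 = scalar_values(s);

    // Same number of items either way; only the type differs.
    assert_eq!(as_chars.len(), as_u32.len());
    println!("{:?}", as_u32); // [97, 768, 98]

    // The reverse conversion can fail: `char` excludes surrogates and
    // anything above U+10FFFF, which a bare `u32` would allow.
    assert_eq!(char::from_u32(0x41), Some('A'));
    assert_eq!(char::from_u32(0xD800), None);
    assert_eq!(char::from_u32(0x110000), None);
}
```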
-
veer66 (veer66@mstdn.io)'s status on Saturday, 07-May-2022 08:34:51 UTC veer66 @clacke So renaming Character to CodePoint or Rune should be okay.
-
Santa Claes 🇸🇪🇭🇰🎅 (clacke@libranet.de)'s status on Saturday, 07-May-2022 08:34:51 UTC Santa Claes 🇸🇪🇭🇰🎅 @veer66 Yes. Because now you have some languages like JavaScript where the thing named a Character is an abomination and then in Swift a Character is an Extended Grapheme Cluster.
So it would be nice if we called them code points and grapheme clusters so we knew what we were talking about.
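To make the mismatch concrete, here is a rough Rust sketch (standard library only; the helper name `counts` is invented for illustration) that measures one string three ways. JavaScript's string `length` counts UTF-16 code units, while Swift's `count` would report a single `Character` for the same flag. Rust's standard library has no grapheme cluster segmentation, so that last view is only noted in a comment.

```rust
/// Hypothetical helper: (UTF-8 bytes, UTF-16 code units, code points).
fn counts(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // The Swedish flag emoji: REGIONAL INDICATOR SYMBOL LETTER S + LETTER E.
    let flag = "\u{1F1F8}\u{1F1EA}";

    let (bytes, utf16_units, code_points) = counts(flag);
    println!("UTF-8 bytes:       {}", bytes); // 8
    println!("UTF-16 code units: {}", utf16_units); // 4
    println!("code points:       {}", code_points); // 2
    // Extended grapheme clusters: 1, but computing that needs the Unicode
    // segmentation rules, which are not part of Rust's standard library.
}
```

Three different answers for one visible symbol is a fair argument for naming each unit explicitly.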
Here's the article I wanted to find when I settled for the one posted above. =)
manishearth.github.io/blog/201… -
iooioio (iooioio@fosstodon.org)'s status on Saturday, 07-May-2022 12:00:10 UTC iooioio @clacke This is a really great read. I learned a lot! Thanks for sharing.