@natecull @cdxiao If there were a better solution, the entrenched presence of Unicode would undermine it for decades to come.
That said, I don't think there is one. Unicode is what it is given the problem space, shaped by backwards compatibility that made the standard possible at all, shaped by the forces involved in making it happen, and shaped by the messiness of human textual communication itself.
Anyone who says "why don't we just ..." is probably missing a whole lot of subtle details that were hammered out over the last two decades, or simply disagreeing with some decision that had to be made one way or another.
As with so many things, reform *is* the way forward. Unicode *is* the synthesis of previous attempts to encode human textual communication.
Hey mastodon, is there anything analogous to regex for CJK character sets? How do people, say, filter a Korean text based on the initial sound of each Hangul syllable block? Or find all characters in a Chinese text that contain a certain radical?
Do people just use regular expressions plus some helper libraries that sort out the text encodings and carry some language-specific information?
@cdxiao I feel like this entire field is still rapidly evolving, but HanziJS ( http://www.hanzijs.com/ ) may give you some options.
Near as I can figure, the CJK Decomposition Data (http://cjkdecomp.codeplex.com/) can help break a Chinese character into its subcomponents, and CC-CEDICT (https://cc-cedict.org/wiki/) can map a character to its (Mandarin) Pinyin reading.
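To make the radical question concrete: with a decomposition table like that, "find all characters containing component X" is a recursive lookup. Here's a rough Python sketch; the line format I parse (e.g. 好:a(女,子)) is my guess at the cjk-decomp layout, so treat the parser and the filename as placeholders and check the real data.

```python
import re

def load_decompositions(path):
    """Load a character-decomposition table.

    Assumes one entry per line in the form 好:a(女,子) -- a guess at
    the cjk-decomp format, not a verified parser.
    """
    decomp = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = re.match(r"^(.):[a-z]+\((.*)\)$", line.strip())
            if m:
                char, parts = m.groups()
                decomp[char] = [p for p in parts.split(",") if p]
    return decomp

def contains_component(char, component, decomp, _seen=None):
    """True if `component` appears anywhere in `char`'s decomposition tree."""
    seen = _seen if _seen is not None else set()
    if char in seen:  # guard against cyclic entries
        return False
    seen.add(char)
    return any(
        part == component or contains_component(part, component, decomp, seen)
        for part in decomp.get(char, ())
    )

# Usage: every character in a text that contains the 女 component.
decomp = load_decompositions("cjk-decomp.txt")  # hypothetical filename
text = "你好媽媽"
print([c for c in text if contains_component(c, "女", decomp)])
```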
@natecull @cdxiao Seems to me that if you just NFD Hangul, it's regexable, as long as your regex works on codepoints, not "characters" (whatever those are -- there's a cool blog post somewhere on Unicode's extended grapheme clusters, which seem to be as reasonable a definition as anything).
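A quick Python sketch of that idea: NFD splits each precomposed Hangul syllable into its conjoining jamo, after which the leading consonant is just an ordinary codepoint you can match. The ranges below cover modern jamo only (U+1100-U+1112 leading consonants, U+1161-U+1175 vowels, U+11A8-U+11C2 trailing consonants).

```python
import re
import unicodedata

text = "한국 하늘 서울 학교"

# NFD decomposes each syllable block into jamo:
# 한 (U+D55C) -> ᄒ (U+1112) + ᅡ (U+1161) + ᆫ (U+11AB)
decomposed = unicodedata.normalize("NFD", text)

# Match every syllable whose initial sound is ㅎ (choseong hieuh, U+1112):
# leading consonant, then a vowel, then an optional trailing consonant.
syllable = re.compile(r"\u1112[\u1161-\u1175][\u11A8-\u11C2]?")

# Recompose the matches with NFC so they print as normal syllables again.
print([unicodedata.normalize("NFC", m) for m in syllable.findall(decomposed)])
# -> ['한', '하', '학']
```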
@cdxiao @natecull A late supplement: here's everything you ever wanted to know about grapheme clusters, normalization, case folding, deconstructing normalization (each level of linking below adds to the puzzle) ...
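For anyone who wants to poke at two of those rabbit holes from a Python prompt, here's a tiny demo of normalization forms and case folding; nothing here is specific to the linked posts, it's just the standard library's unicodedata and str.casefold.

```python
import unicodedata

s = "café Straße"

nfd = unicodedata.normalize("NFD", s)  # é -> e + U+0301 combining acute
nfc = unicodedata.normalize("NFC", s)  # recomposed form

print(len(s), len(nfd), len(nfc))  # 11 12 11 -- NFD is one codepoint longer
print(nfd == nfc)                  # False: equal-looking, unequal codepoints
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing

# Case folding goes further than lower(): ß folds to "ss".
print(s.lower())     # café straße
print(s.casefold())  # café strasse
```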