@natecull @cdxiao If there were a better solution, the entrenched presence of Unicode would undermine it for decades to come.
That said, I don't think there is one. Unicode is what it is given the problem space, shaped by backwards compatibility that made the standard possible at all, shaped by the forces involved in making it happen, and shaped by the messiness of human textual communication itself.
Anyone who says "why don't we just ..." is probably missing a whole lot of subtle details that were hammered out over the last two decades, or simply disagreeing with some decision that had to be made one way or another.
As with so many things, reform *is* the way forward. Unicode *is* the synthesis of previous attempts to encode human textual communication.
Hey mastodon, is there anything analogous to regex for CJK character sets? How do people, say, filter a Korean text based on the initial sound of each Hangul syllable block? Or find all characters in a Chinese text that contain a certain radical?
Do people just use regular expressions plus some helper libraries that sort out the text encodings and carry some language-specific information?
@cdxiao I feel like this entire field is still rapidly evolving, but HanziJS ( http://www.hanzijs.com/ ) may give you some options.
Near as I can figure, the CJK Decomposition Data (http://cjkdecomp.codeplex.com/) can help break a Chinese character into its subcomponents, and CC-CEDICT (https://cc-cedict.org/wiki/) can map a character to its (Mandarin) Pinyin reading.
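To make the radical question concrete: with a decomposition table like that, "find all characters containing component X" is a recursive lookup. Here's a rough Python sketch; the line format I parse (e.g. 好:a(女,子)) is my guess at the cjk-decomp layout, so treat the parser and the filename as placeholders and check the real data.

```python
import re

def load_decompositions(path):
    """Load a character-decomposition table.

    Assumes one entry per line in the form 好:a(女,子) -- a guess at
    the cjk-decomp format, not a verified parser.
    """
    decomp = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = re.match(r"^(.):[a-z]+\((.*)\)$", line.strip())
            if m:
                char, parts = m.groups()
                decomp[char] = [p for p in parts.split(",") if p]
    return decomp

def contains_component(char, component, decomp, _seen=None):
    """True if `component` appears anywhere in `char`'s decomposition tree."""
    seen = _seen if _seen is not None else set()
    if char in seen:  # guard against cyclic entries
        return False
    seen.add(char)
    return any(
        part == component or contains_component(part, component, decomp, seen)
        for part in decomp.get(char, ())
    )

# Usage: every character in a text that contains the 女 component.
decomp = load_decompositions("cjk-decomp.txt")  # hypothetical filename
text = "你好媽媽"
print([c for c in text if contains_component(c, "女", decomp)])
```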
@natecull @cdxiao Seems to me that if you just NFD Hangul, it's regexable, as long as your regex works on codepoints, not "characters" (whatever those are -- there's a cool blog post somewhere on Unicode's extended grapheme clusters, which seem to be as reasonable a definition as anything).
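A quick Python sketch of that idea: NFD splits each precomposed Hangul syllable into its conjoining jamo, after which the leading consonant is just an ordinary codepoint you can match. The ranges below cover modern jamo only (U+1100-U+1112 leading consonants, U+1161-U+1175 vowels, U+11A8-U+11C2 trailing consonants).

```python
import re
import unicodedata

text = "한국 하늘 서울 학교"

# NFD decomposes each syllable block into jamo:
# 한 (U+D55C) -> ᄒ (U+1112) + ᅡ (U+1161) + ᆫ (U+11AB)
decomposed = unicodedata.normalize("NFD", text)

# Match every syllable whose initial sound is ㅎ (choseong hieuh, U+1112):
# leading consonant, then a vowel, then an optional trailing consonant.
syllable = re.compile(r"\u1112[\u1161-\u1175][\u11A8-\u11C2]?")

# Recompose the matches with NFC so they print as normal syllables again.
print([unicodedata.normalize("NFC", m) for m in syllable.findall(decomposed)])
# -> ['한', '하', '학']
```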
@cdxiao @natecull A late supplement: here's everything you ever wanted to know about grapheme clusters, normalization, case folding, deconstructing normalization (each level of linking below adds to the puzzle) ...
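For anyone who wants to poke at two of those rabbit holes from a Python prompt, here's a tiny demo of normalization forms and case folding; nothing here is specific to the linked posts, it's just the standard library's unicodedata and str.casefold.

```python
import unicodedata

s = "café Straße"

nfd = unicodedata.normalize("NFD", s)  # é -> e + U+0301 combining acute
nfc = unicodedata.normalize("NFC", s)  # recomposed form

print(len(s), len(nfd), len(nfc))  # 11 12 11 -- NFD is one codepoint longer
print(nfd == nfc)                  # False: equal-looking, unequal codepoints
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing

# Case folding goes further than lower(): ß folds to "ss".
print(s.lower())     # café straße
print(s.casefold())  # café strasse
```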