Reminder. Dear #MySQL #mariadb Users: utf8 isn't utf8. utf8mb4 is utf8. It has been for almost 12 years.
https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
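To make the difference concrete: MySQL's legacy utf8 (also known as utf8mb3) stores at most three bytes per character, while real UTF-8 needs up to four. Below is a minimal Python sketch of which characters are affected. The helper name `fits_in_utf8mb3` is made up for illustration, and it only checks encoded byte length; it does not model what a particular server does with a value that does not fit.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character encodes to at most 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# UTF-8 byte length per character: ASCII, Latin-1, BMP symbol, emoji.
samples = ["a", "é", "€", "😀"]
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}
print(sizes)

print(fits_in_utf8mb3("naïve €"))  # only 1- to 3-byte characters
print(fits_in_utf8mb3("hi 😀"))    # the emoji needs 4 bytes
```

Anything outside the Basic Multilingual Plane (emoji, many rarer CJK characters) lands in the four-byte range, which is the part a three-byte column character set cannot hold.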
@smuglispweenie @jwildeboer I think that excerpt is from a time (2003) when ISO 10646 imagined that Unicode would use codepoints higher than U+10FFFF. There have been ~5 major releases of ISO 10646 since then, and while I haven't found any online, I would venture that they have since limited UTF-8 to four bytes, in line with the RFC.
There are other limits on UTF-8 too, all mentioned in the RFC: encoding surrogate codepoints is illegal; overlong sequences to hide (say) \0 are illegal.
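These limits are easy to poke at with a strict decoder. A rough sketch using Python's built-in `utf-8` codec, which follows the RFC 3629 rules; the helper `is_valid_utf8` and the variable names are just for illustration:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_nul = b"\xc0\x80"           # 2-byte overlong form of U+0000
surrogate = b"\xed\xa0\x80"          # would decode to surrogate U+D800
five_byte = b"\xf8\x88\x80\x80\x80"  # 5-byte form from the older definition
beyond_max = b"\xf4\x90\x80\x80"     # would decode to U+110000
valid_max = b"\xf4\x8f\xbf\xbf"      # U+10FFFF, the highest legal codepoint

for name, raw in [
    ("overlong_nul", overlong_nul),
    ("surrogate", surrogate),
    ("five_byte", five_byte),
    ("beyond_max", beyond_max),
    ("valid_max", valid_max),
]:
    print(name, is_valid_utf8(raw))
```

Only `valid_max` is accepted; the other four are rejected for the reasons given in the comments.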
@jwildeboer Why "mb4" and not "mb6"? From RFC 3629 (or 'man utf-8'): «Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.»
@jwildeboer Interestingly, just found out that RFC3629 supersedes RFC2279, and one of the changes is the reduction of character range. Weird.
@smuglispweenie the 6-byte encoding comes with a hefty price: 2 bytes wasted, as in "always zero". 4 bytes are more than sufficient to even integrate coding systems from other planets and galaxies ;)