Conversation

Notices

Jan Wildeboer 😷:krulorange: (jwildeboer@social.wildeboer.net)'s status on Wednesday, 12-Jan-2022 17:05:24 UTC Jan Wildeboer 😷:krulorange:

Reminder. Dear #MySQL #mariadb users: utf8 isn’t utf8. utf8mb4 is utf8. It has been for almost 12 years.
https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
In conversation Wednesday, 12-Jan-2022 17:05:24 UTC from social.wildeboer.net permalink
Attachments
1. Untitled attachment
  https://cdn.masto.host/socialwildeboernet/media_attachments/files/107/610/535/317/957/285/original/7415a04144dbf9c0.jpg
2. In MySQL, never use “utf8”. Use “utf8mb4”.
  
  from https://adamhooper.medium.com
  
  Today’s bug: I tried to store a UTF-8 string in a MariaDB “utf8”-encoded database, and Rails raised a bizarre error:
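
For readers who want to see the distinction concretely: MySQL's "utf8" (an alias for utf8mb3) stores at most 3 bytes per character, while real UTF-8 needs up to 4. The sketch below is only an illustration in Python, not anything from the linked article; the helper names `fits_in_mysql_utf8mb3` and `chars_needing_utf8mb4` are made up for this example.

```python
def fits_in_mysql_utf8mb3(text):
    """Return True if every character encodes to at most 3 UTF-8 bytes.

    Hypothetical helper: it mirrors the 3-byte-per-character limit of
    MySQL's "utf8" (utf8mb3) character set, nothing more.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


def chars_needing_utf8mb4(text):
    """List the characters above U+FFFF, i.e. those that need 4 bytes."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    for sample in ["naïve café", "日本語", "mask 😷"]:
        print(repr(sample),
              "fits in utf8mb3:", fits_in_mysql_utf8mb3(sample),
              "needs utf8mb4:", chars_needing_utf8mb4(sample))
```

Accented Latin and CJK text fits in 3 bytes per character; emoji such as 😷 (U+1F637) do not, which is why they are the usual trigger for this kind of bug.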
- feld (feld@bikeshed.party)'s status on Wednesday, 12-Jan-2022 21:34:27 UTC feld
  in reply to
  - Reto
  @reto NOPE! Not in MySQL
  
  In conversation Wednesday, 12-Jan-2022 21:34:27 UTC permalink
- feld (feld@bikeshed.party)'s status on Wednesday, 12-Jan-2022 21:34:27 UTC feld
  in reply to
  - feld
  - Reto
  @reto well I should clarify: it DOES but it uses your MySQL client's default encoding
  
  and the default client encoding is latin1 or whatever, so that's what it uses and that's what it puts into the dump. It literally converts the data from e.g., utf8mb4 to latin1 during the dumping process
  
  In conversation Wednesday, 12-Jan-2022 21:34:27 UTC permalink
  
- feld (feld@bikeshed.party)'s status on Wednesday, 12-Jan-2022 21:34:28 UTC feld
  in reply to
  
  @jwildeboer don't forget you have to set the encoding during your database dump and restore too otherwise your backups are fucked!
  
  In conversation Wednesday, 12-Jan-2022 21:34:28 UTC permalink
- Reto (reto@pleroma.labrat.space)'s status on Wednesday, 12-Jan-2022 21:34:28 UTC Reto
  in reply to
  - feld
  @feld it doesn't store that in the table when you dump? oO
  
  In conversation Wednesday, 12-Jan-2022 21:34:28 UTC permalink
- Santa Claes 🇸🇪🇭🇰🎅 (clacke@libranet.de)'s status on Wednesday, 12-Jan-2022 21:34:49 UTC Santa Claes 🇸🇪🇭🇰🎅
  in reply to
  - feld
  - Reto
  @feld @reto Amazing.
  
  In conversation Wednesday, 12-Jan-2022 21:34:49 UTC permalink
- Deborah Pickett (futzle@aus.social)'s status on Thursday, 13-Jan-2022 09:45:17 UTC Deborah Pickett
  in reply to
  - smuglispweenie
  @smuglispweenie @jwildeboer I think that excerpt is from a time (2003) when ISO 10646 imagined that Unicode would use codepoints higher than U+10FFFF. There have been ~5 major releases of ISO 10646 since then and while I haven’t found any online I would venture that they have since limited UTF-8 to four bytes, in line with the RFC.
  There are other limits on UTF-8 too, all mentioned in the RFC: encoding surrogate codepoints is illegal; overlong sequences to hide (say) \0 are illegal.
  
  In conversation Thursday, 13-Jan-2022 09:45:17 UTC permalink
  
- smuglispweenie@mastodon.social's status on Thursday, 13-Jan-2022 09:45:24 UTC smuglispweenie
  in reply to
  
  @jwildeboer Why "mb4" and not "mb6"? From RFC 3629 (or 'man utf-8'): «Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.»
  
  In conversation Thursday, 13-Jan-2022 09:45:24 UTC permalink
- Santa Claes 🇸🇪🇭🇰🎅 (clacke@libranet.de)'s status on Thursday, 13-Jan-2022 11:03:17 UTC Santa Claes 🇸🇪🇭🇰🎅
  in reply to
  - smuglispweenie
  @jwildeboer @smuglispweenie It's still UTF-8, so it's a variable-length encoding which would be *up to* 6 bytes, but as noted, yeah, limiting Unicode to 21 bits limits UTF-8 max length to 4 bytes.
  
  In conversation Thursday, 13-Jan-2022 11:03:17 UTC permalink
- smuglispweenie@mastodon.social's status on Thursday, 13-Jan-2022 11:03:18 UTC smuglispweenie
  in reply to
  
  @jwildeboer Interestingly, just found out that RFC3629 supersedes RFC2279, and one of the changes is the reduction of character range. Weird.
  
  In conversation Thursday, 13-Jan-2022 11:03:18 UTC permalink
- Jan Wildeboer 😷:krulorange: (jwildeboer@social.wildeboer.net)'s status on Thursday, 13-Jan-2022 11:03:18 UTC Jan Wildeboer 😷:krulorange:
  in reply to
  - smuglispweenie
  @smuglispweenie the 6-byte encoding comes with a hefty price: 2 bytes wasted, as in “always zero”. 4 bytes are more than sufficient to even integrate coding systems from other planets and galaxies ;)
  
  In conversation Thursday, 13-Jan-2022 11:03:18 UTC permalink
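
Several claims in the replies above can be checked with Python's built-in codecs, which follow the RFC 3629 rules: U+10FFFF takes 4 bytes, overlong sequences and encoded surrogates are rejected, and pushing utf8mb4 data through latin1 loses characters. This is a rough sketch, not a description of how MySQL or mysqldump work internally; the helper names are made up, and `errors="replace"` only stands in for whatever substitution a given client performs during conversion.

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


def is_valid_utf8(data):
    """True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Largest code point for each encoded length; U+10FFFF is the Unicode maximum.
    for cp in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF):
        print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")

    # Byte sequences that RFC 3629 forbids.
    forbidden = {
        "overlong NUL": b"\xc0\x80",
        "encoded surrogate U+D800": b"\xed\xa0\x80",
        "5-byte sequence (old RFC 2279 form)": b"\xf8\x88\x80\x80\x80",
    }
    for label, raw in forbidden.items():
        print(label, "valid:", is_valid_utf8(raw))

    # Round trip through latin1, as in a dump made with a latin1 client encoding.
    original = "utf8mb4 😷"
    via_latin1 = original.encode("latin-1", errors="replace").decode("latin-1")
    print(repr(original), "->", repr(via_latin1))
```

The last step shows why the dump and restore encoding matters: once a character has been replaced during conversion, restoring the dump cannot bring it back.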
