We have issues with utf8-string comparisons in MySQL 5, regarding case and accents :
from what I gathered, what MySQL implements collations by considering that "groups of characters should be considered equal".
For example, in the utf8_unicode_ci
collation, all the letters "EÉÈÊeéèê" are in the same box (together with other variants of "e").
So if you have a table containing ["video", "vidéo", "vidÉo", "vidÊo", "vidêo", "vidÈo", "vidèo", "vidEo"] (in a varchar column declared with ut8_general_ci collation) :
- when asking MySQL to sort the rows according to this column, the sorting is random (MySQL does not enforce a sorting rule between "é" and "É" for example),
- when asking MySQL to add a Unique Key on this column, it raises an error because it considers all the values are equal.
What setting can we fiddle with to fix these two points ?
PS : on a related note, I do not see any case sensitive collation for the utf8 charset. Did I miss something ?
[edit] I think my initial question still holds some interest, and I will leave it as is (and maybe one day get a positive answer).
It turned out, however, that our problems with string comparisons regarding accents was not linked to the collation of our text columns. It was linked to a configuration problem with the character_set_client
parameter when talking with MySQL - which defaulted to latin1
.
Here is the article that explained it all to us, and allowed us to fix the problem :
Getting out of MySQL character set hell
It is lengthy, but trust me, you need this length to explain both the problem and the fix.
utf8_unicode_ci
.