Make docs more clear that string is list of utf-16 code points #1055

richyliu · 2019-10-23T04:22:36Z

This would close #1047

showell · 2019-10-24T19:56:24Z

I think the correct terminology is "UTF-16 code units", not "UTF-16 code points". Or you could say "Unicode code points encoded with UTF-16".
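
To make the distinction concrete, here is a minimal Elm sketch. It assumes the `elm/core` behavior this PR is about: `String.length` counts UTF-16 code units, while `String.toList` yields one `Char` per Unicode code point. The module and function names (`CodeUnitDemo`, `monkey`, `codeUnitCount`, `codePointCount`) are hypothetical, chosen only for illustration.

```elm
module CodeUnitDemo exposing (codePointCount, codeUnitCount, monkey)

-- U+1F648 lies outside the Basic Multilingual Plane, so UTF-16
-- encodes it as a surrogate pair, i.e. two code units.


monkey : String
monkey =
    "\u{1F648}"



-- Assumption: String.length reports the number of UTF-16 code units.


codeUnitCount : String -> Int
codeUnitCount =
    String.length



-- Assumption: String.toList produces one Char per code point,
-- so the list length is the number of code points.


codePointCount : String -> Int
codePointCount str =
    List.length (String.toList str)
```

Under those assumptions, `codeUnitCount monkey` evaluates to `2` and `codePointCount monkey` to `1`: one code point, two code units.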

showell · 2019-10-31T14:52:30Z

src/String.elm

@@ -81,7 +81,7 @@ import Result exposing (Result)

 A `String` can represent any sequence of [unicode characters][u]. You can use
 the unicode escapes from `\u{0000}` to `\u{10FFFF}` to represent characters
-by their code point. You can also include the unicode characters directly.


I believe that in this place you actually want to say "code point," since this is specifically referring to a Unicode code point (as opposed to an encoding of the point).

showell · 2019-10-31T14:57:11Z

To get this merged, it probably makes sense to address my most recent comment and then squash your two commits together.

I think we want to be really precise in this documentation, since there are subtle differences between Unicode code points and the encoding of Unicode characters. It's also possible that we want to underplay the UTF-16 aspect, since that seems like mostly an internal implementation detail.

@richyliu You may find this useful:

https://www.quora.com/In-the-Unicode-standard-what-is-the-difference-between-a-code-unit-and-a-code-point

I'm not super clear on Unicode terminology myself, but I know enough to spot some things that are ambiguous.

It's possible that Evan or somebody else in the core team will just want to revisit the docs holistically here.

richyliu · 2019-11-01T23:55:09Z

OK, I changed line 84 back to "code point" and squashed the commits.

LiberalArtist · 2022-05-15T06:18:32Z

src/String.elm

-are enclosed in `"double quotes"`. Strings are *not* lists of characters.
+are enclosed in `"double quotes"`. Strings are *not* lists of characters, but
+lists of UTF-16 code units.


I think the existing wording ("Strings are not lists of characters.") was meant to clarify that a `String` is not a `List` (unlike, e.g., in Haskell, where `type String = [Char]`).

I think the reference to UTF-16 is fine (but maybe could be explained in terms of Char), but I don't think we should use the word "list". For example, the Revised^6 Report on the Algorithmic Language Scheme—another language where "list" has a very specific meaning—defines the character and string types like this:

> Characters
> Scheme characters mostly correspond to textual characters. More precisely, they are isomorphic to the scalar values of the Unicode standard.
>
> Strings
> Strings are finite sequences of characters with fixed length and thus represent arbitrary Unicode texts.
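
As a small illustration of the point that `String` and `List Char` are distinct types in Elm, the sketch below converts explicitly in both directions. It uses only `String.toList`, `List.reverse`, and `String.fromList` from `elm/core`; the module and function names (`StringVsList`, `reverseViaList`) are hypothetical.

```elm
module StringVsList exposing (reverseViaList)

-- A String cannot be passed to List functions directly; it has to be
-- converted to a List Char first and converted back afterwards.


reverseViaList : String -> String
reverseViaList str =
    str
        |> String.toList
        |> List.reverse
        |> String.fromList
```

For example, `reverseViaList "abc"` gives `"cba"`. Calling `List.reverse "abc"` directly would be a type error, which is the sense in which a string is "not a list".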

make docs more clear that string is list of utf-16 code points

98b6860

Change code "points" to code "units"

4edf4d7

richyliu force-pushed the pr branch from 3c89fc2 to 4edf4d7 Compare November 1, 2019 23:51
