Make docs more clear that string is list of utf-16 code points #1055
I think the correct terminology is "UTF-16 code units", not "UTF-16 code points." Or you could say "Unicode code points encoded with UTF-16".
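To make the distinction concrete, here is a minimal sketch that counts both quantities for the same text. It is written in Python rather than Elm purely for illustration, and the helper names `code_points` and `utf16_code_units` are hypothetical, not part of any library discussed here.

```python
def code_points(text: str) -> int:
    """Number of Unicode code points in text (Python strings index by code point)."""
    return len(text)


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    # "utf-16-le" omits the byte-order mark; every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # U+0061 and U+00E9 fit in one code unit; U+1F648 needs a surrogate pair.
    for sample in ["a", "\u00e9", "\U0001F648"]:
        print(repr(sample), code_points(sample), utf16_code_units(sample))
```

A code point outside the Basic Multilingual Plane is a single code point but occupies two UTF-16 code units, which is why the two terms are not interchangeable.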
src/String.elm
@@ -81,7 +81,7 @@ import Result exposing (Result)

A `String` can represent any sequence of [unicode characters][u]. You can use
the unicode escapes from `\u{0000}` to `\u{10FFFF}` to represent characters
by their code point. You can also include the unicode characters directly.
I believe that in this place you actually want to say "code point," since this is specifically referring to a Unicode code point (as opposed to an encoding of the point).
To get this merged, it probably makes sense to address my most recent comment and then squash your two commits together. I think we want to be really precise in this documentation, since there are subtle differences between Unicode code points and the encoding of Unicode characters. It's also possible that we want to underplay the UTF-16 aspect, since that seems like mostly an internal implementation detail.

@richyliu You may find this useful: I'm not super clear on Unicode terminology myself, but I know enough to spot some things that are ambiguous. It's possible that Evan or somebody else in the core team will just want to revisit the docs holistically here.
Ok, I changed line 84 back to
-are enclosed in `"double quotes"`. Strings are *not* lists of characters.
+are enclosed in `"double quotes"`. Strings are *not* lists of characters, but
+lists of UTF-16 code units.
I think the existing wording ("Strings are not lists of characters.") was meant to clarify that a `String` is not a `List` (unlike, e.g., in Haskell, where `type String = [Char]`).
I think the reference to UTF-16 is fine (but maybe could be explained in terms of `Char`), but I don't think we should use the word "list". For example, the Revised^6 Report on the Algorithmic Language Scheme (another language where "list" has a very specific meaning) defines the character and string types like this:
Characters
Scheme characters mostly correspond to textual characters. More precisely, they are isomorphic to the scalar values of the Unicode standard.
Strings
Strings are finite sequences of characters with fixed length and thus represent arbitrary Unicode texts.
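As a rough illustration of what "scalar values" means in the quoted definition: a Unicode scalar value is any code point except the surrogate range U+D800 to U+DFFF, which exists only to build UTF-16 pairs. The sketch below is in Python for illustration only, and the function names are hypothetical.

```python
def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode code space U+0000..U+10FFFF."""
    return 0 <= n <= 0x10FFFF


def is_scalar_value(n: int) -> bool:
    """True if n is a code point that is not a surrogate (U+D800..U+DFFF)."""
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)


if __name__ == "__main__":
    # 0x1F648 is a scalar value; 0xD83D is a code point but only a surrogate.
    for n in (0x61, 0x1F648, 0xD83D, 0x110000):
        print(hex(n), is_code_point(n), is_scalar_value(n))
```

Under this definition a single UTF-16 code unit such as 0xD83D is not a character on its own, which is the ambiguity the wording in the docs needs to avoid.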
This would close #1047