Make docs more clear that string is list of utf-16 code points #1055

Open
wants to merge 2 commits into base: master

Conversation

richyliu

This would close #1047

@showell

showell commented Oct 24, 2019

I think the correct terminology is "UTF-16 code units", not "UTF-16 code points." Or you could say "Unicode code points encoded with UTF-16".
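
A minimal sketch of the distinction, written in TypeScript rather than Elm on the assumption that Elm strings behave like the JavaScript strings they compile to. The variable names are illustrative only. One character outside the Basic Multilingual Plane is a single code point but two UTF-16 code units:

```typescript
// U+1F648 (see-no-evil monkey) lies outside the Basic Multilingual Plane.
const monkey: string = "\u{1F648}";

// `length` counts UTF-16 code units, not code points.
const codeUnitCount: number = monkey.length;

// Array.from iterates a string by code point.
const codePointCount: number = Array.from(monkey).length;

// The two code units form a surrogate pair.
const highSurrogate: number = monkey.charCodeAt(0);
const lowSurrogate: number = monkey.charCodeAt(1);

// The single code point that the pair encodes.
const codePoint: number | undefined = monkey.codePointAt(0);

console.log(codeUnitCount, codePointCount);
console.log(highSurrogate.toString(16), lowSurrogate.toString(16));
console.log(codePoint !== undefined ? codePoint.toString(16) : "none");
```

If Elm's `String.length` counts the same way, it would report 2 for this one-character string, which is the kind of surprise the docs wording is trying to address.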

src/String.elm Outdated
@@ -81,7 +81,7 @@ import Result exposing (Result)

A `String` can represent any sequence of [unicode characters][u]. You can use
the unicode escapes from `\u{0000}` to `\u{10FFFF}` to represent characters
by their code point. You can also include the unicode characters directly.

I believe that in this place you actually want to say "code point," since this is specifically referring to a Unicode code point (as opposed to an encoding of the point).
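
To illustrate why the escape refers to a code point and not to its encoding, here is a rough TypeScript sketch of how a single code point maps to UTF-16 code units. The helper name `toUtf16CodeUnits` is hypothetical and not part of Elm or any library; the arithmetic is the standard UTF-16 surrogate-pair construction.

```typescript
// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
function toUtf16CodeUnits(codePoint: number): number[] {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point");
  }
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
    // Surrogate code points are reserved for UTF-16 itself and have no encoding.
    throw new RangeError("surrogate code points cannot be encoded");
  }
  if (codePoint < 0x10000) {
    // Basic Multilingual Plane: the code unit equals the code point.
    return [codePoint];
  }
  // Supplementary planes: split the 20-bit offset into two 10-bit halves.
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);
  const low = 0xdc00 + (offset & 0x3ff);
  return [high, low];
}

// "A" needs one code unit; U+1F648 needs a surrogate pair.
console.log(toUtf16CodeUnits(0x41));
console.log(toUtf16CodeUnits(0x1f648));
```

The escape `\u{1F648}` names the code point itself; the pair of code units is one possible encoding of it.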

@showell

showell commented Oct 31, 2019

To get this merged, it probably makes sense to address my most recent comment and then squash your two commits together.

I think we want to be really precise in this documentation, since there are subtle differences between Unicode code points and the encoding of Unicode characters. It's also possible that we want to underplay the UTF-16 aspect, since that seems like mostly an internal implementation detail.

@richyliu You may find this useful:

https://www.quora.com/In-the-Unicode-standard-what-is-the-difference-between-a-code-unit-and-a-code-point

I'm not super clear on Unicode terminology myself, but I know enough to spot some things that are ambiguous.

It's possible that Evan or somebody else in the core team will just want to revisit the docs holistically here.

@richyliu
Author

richyliu commented Nov 1, 2019

OK, I changed line 84 back to "code point" and squashed the commits.

Comment on lines -16 to +17
- are enclosed in `"double quotes"`. Strings are *not* lists of characters.
+ are enclosed in `"double quotes"`. Strings are *not* lists of characters, but
+ lists of UTF-16 code units.


I think the existing wording ("Strings are not lists of characters.") was meant to clarify that a String is not a List (unlike, e.g., in Haskell, where type String = [Char]).

I think the reference to UTF-16 is fine (but maybe could be explained in terms of Char), but I don't think we should use the word "list". For example, the Revised^6 Report on the Algorithmic Language Scheme—another language where "list" has a very specific meaning—defines the character and string types like this:

Characters
Scheme characters mostly correspond to textual characters. More precisely, they are isomorphic to the scalar values of the Unicode standard.
Strings
Strings are finite sequences of characters with fixed length and thus represent arbitrary Unicode texts.
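
For comparison, a sequence of UTF-16 code units is not always a sequence of Unicode scalar values, because a code unit sequence can contain an unpaired surrogate. The TypeScript sketch below shows one way to check this; the function name `scalarValues` is hypothetical, and TypeScript stands in for Elm on the assumption that both share JavaScript's string representation.

```typescript
// Hypothetical helper: return the Unicode scalar values of a string,
// or null if the string contains an unpaired surrogate.
function scalarValues(text: string): number[] | null {
  const result: number[] = [];
  // Array.from splits by code point; a lone surrogate comes out as its own element.
  const pieces: string[] = Array.from(text);
  for (let i = 0; i < pieces.length; i++) {
    const codePoint = pieces[i].codePointAt(0) as number;
    // Scalar values are all code points except the surrogate range.
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
      return null;
    }
    result.push(codePoint);
  }
  return result;
}

// Well-formed text: every element is a scalar value.
console.log(scalarValues("a\u{1F648}"));

// A lone high surrogate is a valid code unit but not a scalar value.
console.log(scalarValues("\uD83D"));
```

Under the Scheme definition quoted above, the second string could not exist, since every character is a scalar value by construction.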

3 participants