Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance issue with List.sortBy #631

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jvoigtlaender
Copy link
Contributor

The issue fixed here is as follows, testable at http://elm-lang.org/try.

The following program:

import Html exposing (button, text)
import Html.App exposing (program)
import Html.Events exposing (onClick)
import Task exposing (succeed, andThen, perform)
import Time exposing (Time, now)

f n =
    if n < 2 then
        1
    else
        f (n - 1) + f (n - 2)

list =
    [30..35]

sortBy f =
    List.sortBy f

main =
    program { init = ( Nothing, Cmd.none ), view = view, update = update, subscriptions = \_ -> Sub.none }

view model =
    button [ onClick Click ] [ text (toString model) ]

type Msg a
    = Click
    | Finished Time a

update msg model =
    case msg of
        Click ->
            ( model
            , perform identity identity
                <| now
                `andThen` \t ->
                            succeed (sortBy f (List.reverse list))
                                `andThen` \x ->
                                            now
                                                `andThen` \t' -> succeed <| Finished (t' - t) x
            )

        Finished dt x ->
            ( Just ( dt, x ), Cmd.none )

needs, in the browser on my machine, almost 4 seconds after each button click to show the new result. (The time taken, in milliseconds, will be printed as the first component of the pair in the button text.)

The problem is that f is called far too often on individual elements of list.

By just swapping in the following definition instead of the current List.sortBy:

sortBy f =
    List.map snd << List.sortBy fst << List.map (\a -> ( f a, a ))

the 4 seconds from above are reduced to well under 1 second.

This is not just about a constant factor time difference. By changing f and/or list, the "old" List.sortBy can be arbitrarily (even asymptotically) more expensive than the "new" version. The converse is not true. The new implementation is guaranteed to never be asymptotically worse than the current List.sortBy.

The new version is also how the equivalent functionality is implemented in the Haskell standard library.

This pull request makes it so that Elm's List.sortBy uses the more efficient implementation.

@robinheghan
Copy link
Contributor

I recently review this PR as part of a larger List API run-through, and came to the conclusion that while this improves the worst case performance, it reduces stuff like List.sortBy .head ls by 20-30%, which I would assume is the most common use case for List.sortBy.

In Haskell, the overhead of calling map twice is likely very small due to deforestation and other compiler opts. But Elm, being strict, has no choice but to pay that overhead in full.

For people who have complex sort fn's and who would benefit from this implementation, could easily define this in their programs. However, for simpler use cases, it would not be possible to make a new implementation which avoid the "double map" overhead this PR introduces, and so would not be able to avoid the performance degredation.

As such, would not recommend applying this PR.

@jvoigtlaender
Copy link
Contributor Author

Short answer:

If the current implementation of the function is kept, then I think for honesty (and the users' sake) the function should be renamed into something like List.sortByAccessor and/or its documentation should contain a warning that the function should rather not be used with a function argument that performs any nontrivial computation, because that function argument could end up being applied to each list element many times. Maybe it would even be a good idea then to show in the documentation a reference implementation of the "double map" version List.sortByComputation, so that the user knows what to do in such cases.

Long answer:

We probably do not need to talk much about Haskell. I am pretty sure that there is a constant factor overhead of the List.sortByComputation version over the List.sortByAccessor version in Haskell as well, if used with something trivial, "accessor-instead-of-real-computation" kind, like head. Neither laziness, nor the limited forms of deforestation in ghc, should be of much help here. Whether the overhead is also in the 20-30% region, I cannot say. But in any case, I don't think I mentioned Haskell (three years ago) as justification for efficiency. Rather, that seems more like a witness for the correctness of the proposed alternative definition for the Elm function, in case that correctness was not immediately apparent to readers of the PR. So if you agree that the "double map" version is computing the right thing, then we do not need to further consider Haskell.

So let's consider the situation in Elm.

Your assumption that the "accessor case" is the most common use case for List.sortBy is interesting. Is the assumption backed up by any data? I mean, it is easy to assume the opposite, that people do want to perform sorting based on computing some characteristics of list entries. And as soon as this computation becomes nontrivial, the consequences of still using List.sortByAccessor could be extremely bad. So if the API design were to consciously provide the List.sortByAccessor version (but not under a name that conveys the expectation) and not even warn the user of that point, then that decision should maybe not be based on an assumption not backed by data.

There is at least one very big Elm codebase. Maybe someone could check uses of List.sortBy therein? Lest I assume that most uses of List.sortBy in NoRedInk's Elm code are of the form that a list of student data has to be sorted and the criterion by which to sort is to be computed for each student from various aspects like their current performance, former grade, recent courses taken, ... and so on, and where one probably does not want to repeat that complicated weighing formula computation 10 times per student just because there are 1000 students overall. 😄

@robinheghan
Copy link
Contributor

Haskell is being mentioned only because the code from this PR is taken directly from Haskell's sortOn. As far as I know, Haskell doesn't differentiate on sortByComputation or sortByAccessor. There is sort, sortOn and sortBy, the last two corresponds to Elm's sortBy and sortWith, respectively.

When the code mentioned in the PR copies to Haskell implementation, it makes sense to mention that Haskell has certain optimizations that reduces the overhead of such an implementaiton. Optimizations that Elm doesn't have.

We also have to consider that by changing the implementation of sortBy, we change the performance profile for everyone who uses that function. If this PR would be accepted, everyone who uses sortBy with a simple accessor would have their performance drop by 20-30% which, for those affected, would be bad.

Another thing to keep in mind is that if this PR is applied, those who only uses a simple accessor has no other avenue to speed up their code. If this PR is not applied, people could create their own sortByComputation that does the exact same thing as this PR, without affecting those who only uses simple accessors.

I can give you datapoints from the 50KLOC Elm codebase I've got at work. Our uses of sortBy is List.sortBy .name and List.sortBy .price.

I'm convinced that Elm's current List.sortBy doesn't scale well when the provided function is expensive, but I'm less convinced that this PR is the best solution to the problem. We need to look at more options.

@jvoigtlaender
Copy link
Contributor Author

Let's talk about Elm, not Haskell.

Your data is indeed interesting. I guess you would have been happy with the function being called something that actually conveys how it should be used, then?

None of my comment from yesterday was trying to convince anybody (after three years) that this PR should be merged as is.

Yes, there are other options to address the problem (which you acknowledge to exist). What's a way forward? Documentation change? Name change for the function?

I am aware of all the points reiterated above (like the "Another thing to keep in mind ..." part). What I proposed in the "Short answer" part of my comment yesterday would address them, I think.

Also, if the aim were to solve the performance issue itself, rather than relying on the assumption about most uses of this function (but possibly making it a non-silent assumption), then doing it exactly as in Haskell is not the only option. As I pointed out in the original GitHub issue (#485) that later was replaced by this PR: "in details this can be even further improved". I believe I was thinking of solving this in native code. That is, sort and sortBy could be implemented differently in Native.List, where Native.List.sortBy would do the "pre-packaging" and making sure that no function values get recomputed, in JS code, possibly on the fly along with the first sorting pass over the input list, thus preventing at least one of the extra map passes...

@jvoigtlaender
Copy link
Contributor Author

Actually, no integration with the sorting algorithm itself (the "on the fly along with the first sorting pass" mentioned above) should be needed at all.

Looking at the relevant part of Native.List:

return _List_fromArray(_List_toArray(xs).sort(function(a, b) {

there are already essentially two map passes performed anyway, namely the _List_toArray and _List_fromArray calls. Special-casing them to functions that also do the pairing and projecting would mean that the two "additional map traversals" are not needed.

@robinheghan
Copy link
Contributor

To me List.sortBy assumes an accessor. List.sortBy .name makes perfect sense. The issue with expensive functions could perhaps be added to the documentation. Ideally another implementation would exist which made all of that unnecessary, and I'm trying to find one, but I need more time as I've yet to find something that beats the current impl in the simplest and, in my experience, most common case.

Your second comment about modifying native code might make sense. It might be something to add to the JsArray module. I'll explore the idea when I'm revising Array's in a week or two.

@jvoigtlaender
Copy link
Contributor Author

Cool.

Just one "rebuttal" on the naming question: I agree that List.sortBy .name students makes perfect sense. But List.sortBy performance students makes just as much sense, doesn't it? I do not see how List.sortBy assumes an accessor per se.

@robinheghan
Copy link
Contributor

Fair point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants