Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Request: Threadsafety of PersistentHashMap #40

Closed
smillies opened this issue Jul 7, 2020 · 11 comments
Closed

Support Request: Threadsafety of PersistentHashMap #40

smillies opened this issue Jul 7, 2020 · 11 comments

Comments

@smillies
Copy link

smillies commented Jul 7, 2020

Hello there,
I have not been able to determine whether PersistentHashMap is thread-safe. The Javadoc and usage examples don't seem to say. More specifically, say that I have a PersistentHashMap<K, V> map and concurrently execute the following code on it with different key-value combinations, would that be OK or do I need to synchronize?

V oldValue = map.get(key);
V newValue = (oldValue == null) ? value : remappingFunction.apply(oldValue, value);
if (newValue == null) {
      return map.without(key);
} else {
      return map.assoc(key, newValue);
}

The reference to map never changes, and the map itself is never modified, but can I concurrently execute multiple calls to without and assoc and get consistent results?

@GlenKPeterson
Copy link
Owner

I just created a Google Group mailing list for support questions, mainly in the hopes that if more people are looking at what I say, more people can correct me. So that's an option for future support requests: https://groups.google.com/d/forum/paguro

Until a bug is found, I'm assuming that PersistentHashMap, on its own, is thread-safe. I think we may have to zoom out to see whether you are using this thread-safely or not, but I can say a few things.

The reference to map never changes, and the map itself is never modified

The map itself is never modified in place. In order to represent a modification, the reference to the map must change. We need to see that in your code sample in order to determine how thread-safe that change is.

Also, it's a good idea to state why you need multiple threads so that we can understand better the problem you are trying to solve. The JVM has some collections designed for specific types of concurrent modification depending on what you're doing. Paguro's collections are designed to be safe against accidental change which may or may not be what you want. If you post more code, maybe that's a good first use of the mailing list?

@smillies
Copy link
Author

smillies commented Jul 7, 2020

That link to the google group did not work for me (error message "no access" although I logged in to Google). Perhaps it needs a while to become fully available.

Here's the complete code:

/**
     * Exactly like the merge operation on {@code java.util.Map}, except it doesn't return the merged value, but the new map containing it
     * (or not).
     */
    private static <K, V> PersistentHashMap<K, V> merge(PersistentHashMap<K, V> map, K key, V value,
            BiFunction<? super V, ? super V, ? extends V> remappingFunction) {
        Objects.requireNonNull(remappingFunction);
        Objects.requireNonNull(value);
        V oldValue = map.get(key);
        V newValue = (oldValue == null) ? value : remappingFunction.apply(oldValue, value);
        if (newValue == null) {
            return map.without(key);
        } else {
            return map.assoc(key, newValue);
        }
    }

The map in question is a cardinality map representing the marking of a Petri net. I'm doing concurrent alternative traversals of that net, each one running in its own worker thread, starting from a common initial marking. So several threads will concurrently call this method to add or remove tokens. Each thread will update its own map reference to the returned value. All those modifications "go back" to the shared initial marking.

@GlenKPeterson
Copy link
Owner

You probably already know the adage, "Synchronize access to shared mutable state." You must synchronize both read and write access to the shared state using the same lock object (the enclosing class if you don't specify). Presumably you have a private map variable so that you can always synchronize before reading or writing it. Also, you never want to nest synchronized blocks - never synchronized { synchronized { } }.

In any case, I still don't see where the map reference is changed. Passing the map reference to a function is exactly the opposite of what you probably want, because it encourages you to nest synchronized blocks for the read and for the write (which is bad).

I'm running out right now for several hours, but will re-read all this later or tomorrow.

Also, "Java Concurrency in Practice" by Brian Goetz has a nice recipe for this. Joshua Bloch's Effective Java has some good tips too.

@GlenKPeterson
Copy link
Owner

P.S. I made the group visible to all. Thanks for the heads-up!

@smillies
Copy link
Author

smillies commented Jul 7, 2020

Thanks for your help! I know that I should synchronize shared mutable state. My question is, is PersistentHashMap thread-safe?
I don't know how to explain this better. Here's some more (pseudo-)code, calling the merge method that I posted above..

new Thread() {
    private PersistentHashMap<Place, Integer> currentMarking;

    run() {
        // repeatedly fire transactions
    }

    public void fire(Transition transition) {
        PersistentHashMap<Place, Integer> marking = currentMarking;
        // remove tokens
        for (Place p : transition.preset()) {
            marking = merge(marking, p, -1, this::tokenSum);
        }
        // add tokens
        for (Place p : transition.postset()) {
            marking = merge(marking, p, 1, this::tokenSum);
        }
       // save new state
      currentMarking = marking;
    }
}

Imagine there are multiple instances of some class running such code repeatedly and concurrently, with each thread holding its own reference to the map, but that each thread was initialized with the same instance of PersistentHashMap to start with. The program may continually spawn new groups of such threads based on a common initial state, when another bit of the state space of the Petri net needs to be explored. This basically creates a tree of markings (modifications of the initial map) that I'm backtracking over. Can each thread create new modifications concurrently with the others? Are without and assoc atomic?

Furthermore, can one thread read entries from or iterate over the state of the map that it holds a reference to, while other threads create new modifications to the same state of the map? The class is not documented to throw ConcurrentModificationException, so I'd expect to be alright here.

If PersistentHashMap is not thread-safe, I have two options:

  1. synchronize externally (probably the entire merge method)
  2. give each thread an independent copy of the initial state to start with. But that copying would somehow defy the idea of using persistent collections in the first place, so I would rather not do that.

I think that persistent collections ought to be thread-safe by nature. What would be the use of implementing a sophisticated structure-sharing copy-on-write mechanism in a non-thread-safe manner? As a matter of course, persistent collections would be used preferably in concurrent environments. I'm just not sure about PersistentHashMap, because nowhere is it stated that the class is thread-safe. (At least I haven't seen any such statement.)

PS: I've been admitted to the Google group, but I guess it is now too late to move the entire discussion there.

@GlenKPeterson
Copy link
Owner

You might be looking for something like this:

// Must be private
private ImMap<Place, Integer> map = map();

// Put atomic map read-and-write in one place.  Do not pass or return map.
private void merge(
        @Nullable Place key,
        @NotNull Integer value,
        @NotNull Fn2<Integer, Integer, Integer> remappingFunction
) {
    // Synchronize around map read and write - this entire section must be atomic.
    synchronized (this) {
        Integer mapValue = map.get(key);
        Integer newValue = (mapValue == null) ? value
                                              : remappingFunction.apply(mapValue, value);

        map = (newValue == null) ? map.without(key)
                                 : map.assoc(key, newValue);
    }
}

// Map is no longer read or modified directly in this function.
public void fire(Transition transition) {
    // remove tokens
    for (Place p : transition.preset()) {
        merge(p, -1, this::tokenSum);
    }
    // add tokens
    for (Place p : transition.postset()) {
        merge(p, 1, this::tokenSum);
    }
}
@GlenKPeterson
Copy link
Owner

Paguro has collections that are designed to be thread-safe. This means safe from accidental modification, even when that happens in multiple threads. Your example is almost entirely single-threaded code, since the part that does all the work has to be inside a synchronized block.

I think you instead want high-concurrency modification of a shared map using java.util.ConcurrentHashMap. Use .compute() instead of your merge function.

I'm pretty sure that's a much better answer to your question.

@smillies
Copy link
Author

smillies commented Jul 7, 2020

Thanks for that input. I may not have made myself clear, perhaps I'm missing something so very obvious that you take it for granted. I'm sure I do not want modification of a shared ConcurrentHashMap, because a concurrent hash map is modified in place, and I really need access to the whole tree of modified states for state space exploration. Here' a picture:

        /map11
    /map1
  /     \map12
map     /map21
   \map2
        \map22 
etc.

That's a tree of modified states of some initial map. The reference to a modified map gets passed into a recursive call, and when I unwind the call stack during backtracking, I'll get a reference to my previous map state, without having to explicitly undo changes. That's what persistent collections are for, aren't they? I don't want modifications to a map on any branch of the tree to affect any other branch. That works fine single-threadedly.

Now I want the same thing working concurrently, that is with different threads exploring parts of that tree. For example, thread 1 may create map1 from map by calling some combination of without and assoc on it (by executing my merge method), while at the same time thread 2 may create map2 from map by executing another series of modifications of it. My question is, is that OK with PersistentHashMap? The answer appears to be "No", because you yourself suggested putting synchronized blocks around the map-modifying code. But I can't square that with your remark that "Paguro has collections that are designed to be thread-safe".

And what about iteration? Can thread 1 iterate over the keyset() or values() of map, while thread 2 is modifying it to create map2?

@GlenKPeterson
Copy link
Owner

Aha! Now I think I understand what you are doing. Yes, this is probably a good use of PersistentHashMap!

I also understand that this is frustrating. I've been trying to skim what you say because I've been busy and still want to help.

To be clear, PersistentHashMap has been used and has been thread-safe in Clojure and Paguro for years now. It is always possible that there could be a bug, but the likelihood is extremely small in that code.

Each call to "without" or "assoc" does not modify the existing map, but instead returns a new map which you can think of as a modified copy of the original. It's thread-safe not because of synchronization or ConcurrentModificationExceptions, but because it's referentially transparent. It's also very fast, considering what it's doing.

Just because something is thread-safe on its own, doesn't prevent you from misusing it! Consider, "The problems with synchronized collections" in chapter 5 of "Java Concurrency in Practice" by Brian Goetz. He specifically mentions the put-if-absent scenario saying that even though the "get" method and "put" method are individually thread-safe, you may still need to synchronize a block of code to make sure that no other thread modifies the collection between the "get" and the "put." He's got some nice diagrams. I really recommend you read that if you haven't already.

This is not as good as the book, but:

individual operations are thread-safe, but sequences of operations where the control flow depends on the results of previous operations may be subject to data races. The first snippet in Listing 1 shows the common put-if-absent idiom -- if an entry does not already exist in the Map, add it. Unfortunately, as written, it is possible for another thread to insert a value with the same key between the time the containsKey() method returns and the time the put() method is called. If you want to ensure exactly-once insertion, you need to wrap the pair of statements with a synchronized block that synchronizes on the Map m.

Source: https://www.ibm.com/developerworks/library/j-jtp07233/index.html

With PersistentHashMap, the collection itself is totally safe. But the line where you say myMap = newMap is not. Especially suspect is the time between when you say, myMap.contains(x) and myMap = myMap.assoc(x) because another thread could have interrupted your code to do a myMap = myMap.assoc(x) between those two statements.

The code you included in this discussion never initializes currentMarking. How you do that is important for solving this problem.

How do you plan to get the leaf maps? I see how you the leaf maps would include the contents of the root, but they are all declared in a local scope that disappears once the function returns. You might store the finished ImMaps in a ConcurrentHashMap, or other java.util.Concurrent collection. Presumably whatever collection you use to ensure that you process each node exactly once could also be used to store the map for the processing up to that node.

I don't need to know the answers to those questions, but you do, and you need to think about concurrency when you answer them. The information that needs to be communicated between threads.

I hope that helps. I'm probably not going to respond again until Thursday.

@smillies
Copy link
Author

smillies commented Jul 8, 2020

Thank you very much! That helped a lot.

It may be that the idea of a shared reference has been the source of some confusion: I never intended the threads to share the same reference. I only intended to provide for the possibility that at some point in time these different references (thread-local ones) would refer to the same map, and that the code would still work the same as if it were single-threaded. Never at any point in time would one thread be able to update the reference held by another thread, although it might update the same (copy of the) map.

I do not need to communicate any information between the threads. Indeed, I want to isolate the threads so that they cannot exchange any information. I need to collect the information gathered by exploring the branches of the tree I painted above. When recursion unwinds, at each inner node I select the "best" copy of the map to return to the next higher level. Just think of fork/join. Fork-join-tasks also do not usually exchange information.

With that in mind I think that I'm totally OK with my code as it is, when I put everything you say together:

  1. without and assoc are thread-safe and can be called concurrently on the same map instance from different threads. Each thread will get back its own new copy of the map, reflecting only the change it did itself, and neither call will throw an exception.
  2. I'm assuming, although you didn't say, that this also goes for iteration: One thread can iterate over the various set views of a map while another one calls assoc and without on the map, and the iterating thread will neither see any of those concurrent changes nor get an exception, because the result of calling these modifying methods is a new copy of the map that is visible only to the thread doing the modification.
  3. I understand what you say about possible misuse, and re-read that section in "Concurrency in Practice". However, the objection that "another thread could have interrupted your code to do a myMap = myMap.assoc(x) between those two statements" (meaning contains followed by assoc) does not apply in this situation, because the myMap references (although not necessarily the copy of the map on which assoc is executed) are different. So in my case, unlike the put-if-absent example of Brian Goetz's, I do not need to synchronize the complex operation merge.
  4. That is so because as I said each thread holds is own reference to a map, and the copy of the map pointed to by that reference is never changed in place. Only at the beginning, when the threads (tasks) are spawned, several references are set to point to the same map copy (the one representing the root of the tree) by passing the same map reference to the constructors of the task instances. But afterwards, when one thread does a modification, the statement myMap = myMap.assoc(x) will update only the thread-local myMap reference to a new copy of the map. From what you say, that will not have any effects on any copies of the map that are referenced by other threads.

I hope we've got that straightened out now. I'm sorry to have taken up so much of your time, and thank you very much again!

@GlenKPeterson
Copy link
Owner

  1. Yes, exactly so.
  2. Yes. Precisely.

In terms of points 3, 4, and every time one of us uses the term "reference," and your statement, "I do not need to communicate any information between the threads"...

If you just write each thread's output to the command line, or over the network, or maybe to some kind of messaging system, then everything you say makes sense - you don't need to communicate anything between threads. You are correct. Yay!

I was assuming that each thread was going to produce some sort of result that you'd then want reported back to whatever part of the program started those threads. I assumed that all those threads might communicate their results simultaneously, which would then involve some Shared Mutable State which you'd have to think about very carefully in order for it to work correctly and still yield the maximum performance benefits from using multiple threads.

Anyway, it sounds like you've got what you needed. Thanks for your interest! I'll close this ticket in a few days if I don't hear from you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants