Memory optimizations for tupled keys/values #16

cal101 · 2017-03-29T05:30:37Z

Moin,

when migrating to immutable data structures i too often face the challenge to have composite keys or values that are to be combined just for the current map usage.
Paying the price of extra tuple objects hurt. For my own RB-Tree I generated extra variants.
I don't consider this a good solution for a library.

What do you think about an AbtractImmutableHash/TreeMap with customizable Entry-Classes and Comparator?
That weakens the immutable contract but with having custom values and comparators already IMHO
thats OK.

Cal

cal101 · 2017-03-29T06:43:35Z

Looking at the implementation the storage part may be just be some key/value width parameter to count how many slots in the array the key/value needs.
Custom tuples just for in/out.

GlenKPeterson · 2017-03-29T12:27:47Z

Arrays are inherently mutable. I'm using tuples instead because they are immutable.

I did actually lay awake nights thinking about a similar issue here:
https://github.com/GlenKPeterson/Paguro/blob/master/src/main/java/org/organicdesign/fp/StaticImports.java#L230
If we have an array as input, and the vector uses arrays internally, why not use the unmodified input array in the vector? That be faster and more memory efficient! I timed it, and it doesn't make a big difference. The alternative would allow this:

   String[] strArray = { "Hello", "World!" };
   ImList<String> strList = vec(strArray);
   strArray[0] = "Goodbye";

If I had used the unmodified array (because varargs are really just an array), the caller could modify the immutable vector! That must NEVER happen! I'm willing to pay a small performance price for the immutability guarantee.

Even with java.util.HashMap, more than half the time, I end up iterating through map.entries(). If we're going to create tuples anyway, why not create them once for input and let them flow through the system? A valid answer is that it's one more layer of indirection for where the underlying keys are stored in memory. How much of an effect does that have on performance? I don't know.

If you're really worried, compare the performance of assoc(K, V) vs. assoc(Tuple2(K, V)) to add things to the collection. Time the difference with 10, 100, 1000, 10,000, 100,000 items. Post results here, and we can talk. I'd like to see results for just building, and for building + iterating.

Maybe some day someone will rewrite PersistentHashMap to store the input tuples internally instead of splitting them up and storing keys and values in the same arrays?

I'm not planning to change the interface to take an array where the first item is a key and the second item is a value because arrays are inherently mutable. If your code is working and you find that's really what's slowing you down, you can substitute a Java.util.HashMap. I'm actually finding the opposite in my code - the area with the bugs and slowness is the area where I'm concentrating on the complexities of looping and managing mutability. For example:
https://github.com/GlenKPeterson/Paguro/wiki/Looping-vs.-Functional-Transformations#performance-characteristics-of-collections

Remember that the Big O characteristics of these collections are generally the same as their java.util.Collections counterparts.

Jumped around a little bit there. Had I more time I would have written less. I hope that helps.

cal101 · 2017-03-29T16:05:28Z

Bluntly: no. ;-)

I don't care for code performance here but for memory bloat. And i didn't intend to expose an array.
I try to present more code to make my point when i have real keyboard instead of my tablet.

GlenKPeterson · 2017-04-02T23:19:42Z

I see that I missed the point of your question. I think if you use ImMap.assoc(K, V), .contains(K), and .get(K) there are no tuples involved. It's only if you want to iterate it or call certain non-critical methods that it returns tuples. I found after a while with java.util.Map I was continually calling .entrySet() and working with entries, which is why there are a bunch of tuple methods. They're convenient for what I tend to do.

If you are talking about extending some base class to provide the internal implementation the way HashMap extends java.util.AbstractMap, then you might look to ImSortedMap and ImUnsortedMap as starting points. The default methods should allow you a fairly streamlined implementation.

I'm disappointed by a lot of what's in java.util.AbstractMap. .equals() ignores the comparator/ordering of a SortedMap. Default implementations of containsKey() and get() are both O(n) when the whole point of this data structure is to make those operations O(1) (unsorted) or O(log n) (for sorted). I believe in making it convenient for people to do the right thing. This makes it convenient for people to go astray.

What do you think about an AbtractImmutableHash/TreeMap with customizable Entry-Classes and Comparator?

If you're concerned about space, I would think you'd want to implement the critical parts, not inherit them from some generic implementation. Maybe you could tell me more about, "For my own RB-Tree I generated extra variants."

cal101 · 2017-04-03T06:14:47Z

#The problem#

How to model the data structure map((kp1,kp2) -> v), a map from a composite key to a value.
The key parts kp1 and kp2 are values typically already present in memory, often as properties in v.

#The obvious solution#

ImMap<Tuple2<KP1,KP2>,V>

has the problem of adding a real Tuple2 instance for earch entry.

#How to improve?#

Just as Map<K,V> is conceptually a List<Entry<K,V>> you don't store it as such in a Hash Array Trie Map.

You store it as

Array: K1,V1,K2,V2,K3,V3,....

Similiar in a RB-Tree based map you have your tree nodes with fields

Node {
K key;
V value;
}

So what I did with my RB-Trees is generating special versions for composite keys (and composite values) where the nodes have as many key-part-fields and value-part fields as necessary.

Node {
KP1 keyPart1;
KP2 keypart2;
V value;
}

and just use tuple wrappers on the interface where needed.

But I want to replace the RB-Trees with HAMT maps and don't suffer the extra Tuple-Object(s) on each stored entry.

The HAMT could store the example like so

Array: KP1-1,KP2-1,V1,KP1-2,KP2-2,V2, KP1-3,KP2-3,V3

That could be done with generation but I hope it's possible to capture that logic in some abstract base class.

GlenKPeterson · 2017-04-03T12:42:57Z

In each of our conversations, it would be helpful for me to have a better idea of the use case for what you're asking for. I tend to go astray quickly without that.

Yes, you could make a multi-key map without using Tuples. It would be challenging. I don't plan to do it. If you do it, I would consider merging it into Paguro, but I'm really hesitant to provide any public method with an array argument. There are only 3 (or maybe 5?) varargs methods in this project now, all in StaticImports. Arrays don't play nicely with Generics. Arrays are inherently mutable. Paguro uses arrays so end-users don't have to. I'd certainly be willing to link to it for people who want that.

If you're going to write your own immutable collection, I've been working on some copy-on-write array functions that would be helpful for implementing any immutable collection in Java using arrays. Exposing these array functions is OK because the end user of the function provides and receives the arrays. They do not expose mutable internals of immutable data structures.

Paguro may already provide enough of what you want with Equator and ComparisonContext or even the dreaded Tuple2 may already do what you're asking for.

I was working with composite keys last week! I started (in the database) with a many-to-many join table between Books and Authors (a book can have multiple authors, an author can write multiple books). Then I added a field to this join table "role" meaning what role did the author have in writing a specific book? Were they an editor? Did they write the introduction? Co-author? Now the BookAuthorXref table needs to be represented in Java with fields, book_id, author_id, and role.

So I'm taking a guess and switching from Map to Set here. Set is implemented as the corresponding Map where V = K, so it is not so far from what you are asking as it might seem. Anyway, here's an example using an Equator (for a sorted map/set, you would use a ComparisonContext instead):

static class Book {
    private static long maxId = 0;
    // auto-increment and assign new surrogate key
    Book() { id = ++maxId; }
    long id;
}
static class Author {
    private static long maxId = 0;
    Author() { id = ++maxId; }
    long id;
}
class BookAuthor {
    Book book; Author author; String role;
    BookAuthor(Book b, Author a, String r) {
        book = b; author = a; role = r;
    }
}

@Test public void someTest() {

    // Define equality for BookAuthors as equality of the composite key
    // In the real world, you probably want to define this as an Enum singleton
    // for serialization.  I skipped that here for brevity.
    Equator<BookAuthor> eq = new Equator<BookAuthor>() {
        @Override public int hash(BookAuthor ba) {
            return (int) (ba.book.id + ba.author.id);
        }

        // Assumes that book_author has a unique constraint on (book_id, author_id)
        @Override public boolean eq(BookAuthor ba1, BookAuthor ba2) {
            return Objects.equals(ba1.book, ba2.book) &&
                   Objects.equals(ba1.author, ba2.author);
        }
    };

    // Create a new PersistentHashSet that uses our equator.
    // This will ignore BookAutor.equals() and .hashCode().
    // Similarly, passing ComparisonContext to an ImSortedSet/Map will
    // ignore compareTo().  This "trick" works with the java.util sorted collections
    // as well.
    ImSet<BookAuthor> baSet = PersistentHashSet.empty(eq);

    Book book1 = new Book();
    Book book2 = new Book();
    Author author1 = new Author();
    Author author2 = new Author();

    BookAuthor ba1 = new BookAuthor(book1, author1, "editor");
    BookAuthor ba2 = new BookAuthor(book1, author2, "primaryAuthor");
    BookAuthor ba3 = new BookAuthor(book2, author2, "introAuthor");

    ImList<BookAuthor> bookAuthList = vec(ba1, ba2, ba3);

    baSet = bookAuthList.foldLeft(baSet, (s, ba) -> s.put(ba));

    baSet.contains(new BookAuthor(book1, author1, null));
}

In this second example I may be bastardizing Tuples a little bit for memory and speed. Remember they provide equals() and hashCode() for you, so I'm making BookAuthor extend Tuple2, then adding the third field manually (so it won't be part of the equals() and hashCode() comparisons). Not sure I'd recommend this, but it's an idea.

static class Book {
    private static long maxId = 0;
    // auto-increment and assign new surrogate key
    Book() { id = ++maxId; }
    long id;
}
static class Author {
    private static long maxId = 0;
    Author() { id = ++maxId; }
    long id;
}

// Extends tuple2 just for the composite key, so that equals() and hashcode()
// are based on only those fields.
class BookAuthor extends Tuple2<Book,Author> {
    // Role field is not part of the tuple, so ignored by equals() and hashcode().
    String role;
    BookAuthor(Book b, Author a, String r) {
        super(b, a);
        role = r;
    }
    public Book book() { return _1; }
    public Author author() { return _2; }
}

@Test public void someTest() {

    Book book1 = new Book();
    Book book2 = new Book();
    Author author1 = new Author();
    Author author2 = new Author();

    BookAuthor ba1 = new BookAuthor(book1, author1, "editor");
    BookAuthor ba2 = new BookAuthor(book1, author2, "primaryAuthor");
    BookAuthor ba3 = new BookAuthor(book2, author2, "introAuthor");
    
    ImSet<BookAuthor> baSet = set(ba1, ba2, ba3);
    
    //noinspection SuspiciousMethodCalls
    baSet.contains(tup(book1, author1));
}

The above examples should both compile. I did not test that they actually work.

In the real world, you'd probably make Role an enum to control its values. I would also strongly suggest you always implement any Equator and ComparisonContext as enums, NEVER AS LAMBDAS otherwise you're in for a nasty surprise when you serialize and deserialize your collection (lambdas aren't serializable and return from deserialization as null). The same caveat exists with a Comparator.

As a parting thought, using a tuple has a constant time and constant memory cost. A key aspect of Functional Programming is that making things simple lets you focus on Big O performance - which ignores constants! This often ends up producing faster code than what you write when distracted by premature optimization. I'm not saying that's what you're doing. I'm saying that it happens to me all the time. Here's a great example:
https://github.com/GlenKPeterson/Paguro/wiki/Looping-vs.-Functional-Transformations#performance-characteristics-of-collections

cal101 · 2017-04-04T06:50:52Z

I am giving up.

GlenKPeterson added the question label Mar 29, 2017

cal101 closed this as completed Apr 4, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory optimizations for tupled keys/values #16

Memory optimizations for tupled keys/values #16

cal101 commented Mar 29, 2017 •

edited

Loading

cal101 commented Mar 29, 2017

GlenKPeterson commented Mar 29, 2017

cal101 commented Mar 29, 2017

GlenKPeterson commented Apr 2, 2017

cal101 commented Apr 3, 2017

GlenKPeterson commented Apr 3, 2017 •

edited

Loading

cal101 commented Apr 4, 2017

Memory optimizations for tupled keys/values #16

Memory optimizations for tupled keys/values #16

Comments

cal101 commented Mar 29, 2017 • edited Loading

cal101 commented Mar 29, 2017

GlenKPeterson commented Mar 29, 2017

cal101 commented Mar 29, 2017

GlenKPeterson commented Apr 2, 2017

cal101 commented Apr 3, 2017

GlenKPeterson commented Apr 3, 2017 • edited Loading

cal101 commented Apr 4, 2017

cal101 commented Mar 29, 2017 •

edited

Loading

GlenKPeterson commented Apr 3, 2017 •

edited

Loading