Java Collections: The Force Awakens - JAX London

Post on 27-May-2020


Java Collections: The Force Awakens

Darth @RaoulUK

Darth @RichardWarburto

Collection Problems

Java Episode 8 & 9

Persistent & Immutable Collections

HashMaps

Collection bugs

1. Element access (off-by-one error, ArrayIndexOutOfBoundsException)

2. Concurrent modification

3. Check-then-Act

Scenario 1

List<String> jedis = new ArrayList<>(asList("Luke", "yoda"));

for (String jedi: jedis) {

if (Character.isLowerCase(jedi.charAt(0))) {

jedis.remove(jedi);

}

}
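With this input, Scenario 1 fails with a ConcurrentModificationException, because the list is structurally modified while the for-each loop's iterator is still in use. A minimal sketch of two safer alternatives follows; the class and method names are made up for illustration and are not from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class SafeRemoval {

    // Java 8+: removeIf lets the collection perform the removal itself.
    static List<String> withRemoveIf(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.removeIf(name -> Character.isLowerCase(name.charAt(0)));
        return copy;
    }

    // Before Java 8: remove through the iterator that drives the loop.
    static List<String> withIterator(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        for (Iterator<String> it = copy.iterator(); it.hasNext(); ) {
            if (Character.isLowerCase(it.next().charAt(0))) {
                it.remove();
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> jedis = Arrays.asList("Luke", "yoda");
        System.out.println(withRemoveIf(jedis)); // [Luke]
        System.out.println(withIterator(jedis)); // [Luke]
    }
}
```

Both variants work on a copy so that the caller's list is left alone; removing in place works the same way on any mutable list.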

Scenario 2

Map<String, BigDecimal> movieViews = new HashMap<>();

BigDecimal views = movieViews.get(MOVIE);

if(views != null) {

movieViews.put(MOVIE, views.add(BigDecimal.ONE));

}

Check: movieViews.get, views != null

Then

Act: movieViews.put
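One way to close the gap between the check and the act is to hand both to the map in a single call. The sketch below assumes a ConcurrentHashMap, whose computeIfPresent and merge are atomic; the MovieViews class and its method names are hypothetical.

```java
import java.math.BigDecimal;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class MovieViews {
    private final ConcurrentMap<String, BigDecimal> movieViews = new ConcurrentHashMap<>();

    // Same behaviour as Scenario 2 (only count movies that are already present),
    // but the null check and the update happen inside one map operation.
    BigDecimal incrementIfPresent(String movie) {
        return movieViews.computeIfPresent(movie, (key, views) -> views.add(BigDecimal.ONE));
    }

    // The put-if-absent flavour: start at one, otherwise add one.
    BigDecimal increment(String movie) {
        return movieViews.merge(movie, BigDecimal.ONE, BigDecimal::add);
    }

    BigDecimal views(String movie) {
        return movieViews.get(movie);
    }

    public static void main(String[] args) {
        MovieViews counter = new MovieViews();
        System.out.println(counter.incrementIfPresent("Episode VII")); // null, nothing to update
        counter.increment("Episode VII");
        counter.increment("Episode VII");
        System.out.println(counter.views("Episode VII")); // 2
    }
}
```

On a plain HashMap the same methods exist but are not atomic, so they tidy up the code without making it thread-safe.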

Reducing scope for bugs

● ~280 bugs in 28 projects including Cassandra, Lucene

● ~80% check-then-act bugs discovered are put-if-absent

● Library designers can help by updating APIs as new idioms emerge

● Different data structures can provide alternatives by restricting reads & updates to reduce scope for bugs

CHECK-THEN-ACT Misuse of Java Concurrent Collections

http://dig.cs.illinois.edu/papers/checkThenAct.pdf

Java 8 Lazy Collection Initialization

Many allocated HashMaps and ArrayLists are never written to, e.g. with the Null Object pattern

Java 8 adds Lazy Initialization for the default initialization case

Typically 1-2% reduction in memory consumption

http://www.javamagazine.mozaicreader.com/MarApr2016/Twitter#&pageSet=28&page=0

Java 9 API updates

Collection factory methods

● Non-goal to provide persistent immutable collections

● http://openjdk.java.net/jeps/269

java.util.Optional

● ifPresentOrElse(), or(), stream(), getWhenPresent()

● Optional.get() will be deprecated in future

java.util.stream.Stream & java.util.stream.Collectors

● takeWhile, dropWhile

● filtering, flatMapping

java.util.concurrent.CompletableFuture

● orTimeout, completeOnTimeout
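A short sketch of how some of these additions read in code, assuming Java 9 or later. It sticks to methods that are present in the released API (List.of, Optional.ifPresentOrElse, Stream.takeWhile, Collectors.filtering, CompletableFuture.completeOnTimeout); the class and method names are invented for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

final class Java9Updates {

    // JEP 269 factory method: the returned list rejects modification.
    static List<String> jedis() {
        return List.of("Luke", "Yoda");
    }

    // ifPresentOrElse covers both branches without calling Optional.get().
    static String greet(Optional<String> name) {
        StringBuilder out = new StringBuilder();
        name.ifPresentOrElse(
                n -> out.append("Hello ").append(n),
                () -> out.append("Nobody here"));
        return out.toString();
    }

    // takeWhile stops at the first element that fails the predicate.
    static List<Integer> leading(List<Integer> numbers, int limit) {
        return numbers.stream().takeWhile(n -> n < limit).collect(Collectors.toList());
    }

    // filtering runs inside the downstream collector, so a group whose
    // elements are all rejected still shows up with an empty list.
    static Map<Integer, List<String>> capitalisedByLength(List<String> names) {
        return names.stream().collect(Collectors.groupingBy(
                String::length,
                Collectors.filtering(n -> Character.isUpperCase(n.charAt(0)), Collectors.toList())));
    }

    // completeOnTimeout supplies a default if the future is still pending.
    static String withFallback(CompletableFuture<String> future) {
        return future.completeOnTimeout("fallback", 50, TimeUnit.MILLISECONDS).join();
    }

    public static void main(String[] args) {
        System.out.println(jedis());
        System.out.println(greet(Optional.empty()));
        System.out.println(leading(List.of(1, 2, 5, 1), 3));
        System.out.println(capitalisedByLength(List.of("yoda", "Rey")));
        System.out.println(withFallback(new CompletableFuture<>()));
    }
}
```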

Categorising Collections

Mutable: Unsynchronized, Concurrent

Unmodifiable View

Immutable: Non-Persistent, Persistent

(Diagram legend: Available in Core Library)

Mutable

● Popular friends include ArrayList, HashMap, TreeSet

● Memory-efficient modification operations

● State can be accidentally modified

● Can be thread-safe, but requires careful design

Unmodifiable

List<String> jedis = new ArrayList<>();

jedis.add("Luke Skywalker");

List<String> cantChangeMe = Collections.unmodifiableList(jedis);

// java.lang.UnsupportedOperationException

//cantChangeMe.add("Darth Vader");

System.out.println(cantChangeMe); // [Luke Skywalker]

jedis.add("Darth Vader");

System.out.println(cantChangeMe); // [Luke Skywalker, Darth Vader]
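The last line shows that an unmodifiable list is only a view: writes through the original reference still show up. If a stable snapshot is wanted, one option is to copy before wrapping, as in this minimal sketch (the class and method names are illustrative only).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SnapshotVersusView {

    // A view: later changes to the source remain visible through it.
    static List<String> view(List<String> source) {
        return Collections.unmodifiableList(source);
    }

    // A snapshot: copy first, then wrap, so no one else holds the backing list.
    static List<String> snapshot(List<String> source) {
        return Collections.unmodifiableList(new ArrayList<>(source));
    }

    public static void main(String[] args) {
        List<String> jedis = new ArrayList<>();
        jedis.add("Luke Skywalker");

        List<String> view = view(jedis);
        List<String> snapshot = snapshot(jedis);

        jedis.add("Darth Vader");

        System.out.println(view);     // [Luke Skywalker, Darth Vader]
        System.out.println(snapshot); // [Luke Skywalker]
    }
}
```

The copy costs time and memory proportional to the size of the list, which is the trade-off the following slides discuss.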

Immutable & Non-persistent

● No updates

● Flexibility to convert the source into a more efficient representation

● No locking in the context of concurrency

● Satisfies covariant subtyping requirements

● Can be copied with modifications to create a new version (can be expensive)

Immutable vs. Mutable hierarchy

ListIterable is the common parent of ImmutableList and MutableList; MutableList also extends java.util.List.

Conversions: + ImmutableList<T> toImmutable(), + MutableList<T> toList()

Eclipse Collections (formerly GS Collections): https://projects.eclipse.org/projects/technology.collections/

Immutable and Persistent

● Changing the source produces a new version of the collection

● The resulting collection shares structure with the source to avoid full copying on updates

Persistent List (aka Cons)

public final class Cons<T> implements ConsList<T> {

private final T head;

private final ConsList<T> tail;

public Cons(T head, ConsList<T> tail) {

this.head = head; this.tail = tail;

}

@Override

public ConsList<T> add(T e) {

return new Cons<>(e, this);

}

}
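The slide does not show the ConsList interface, so the following is a self-contained sketch under assumptions: the interface shape, the Nil class for the empty list and the ConsDemo class are all hypothetical. It shows that adding to the front never touches the existing nodes, so two new versions can share the same tail.

```java
import java.util.NoSuchElementException;

// Hypothetical interface; the talk only shows the Cons class.
interface ConsList<T> {
    ConsList<T> add(T e); // prepend, returning a new list
    boolean isEmpty();
    T head();
    ConsList<T> tail();
}

// Hypothetical empty list.
final class Nil<T> implements ConsList<T> {
    @Override public ConsList<T> add(T e) { return new Cons<>(e, this); }
    @Override public boolean isEmpty() { return true; }
    @Override public T head() { throw new NoSuchElementException("head of empty list"); }
    @Override public ConsList<T> tail() { throw new NoSuchElementException("tail of empty list"); }
}

final class Cons<T> implements ConsList<T> {
    private final T head;
    private final ConsList<T> tail;

    Cons(T head, ConsList<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    @Override public ConsList<T> add(T e) { return new Cons<>(e, this); }
    @Override public boolean isEmpty() { return false; }
    @Override public T head() { return head; }
    @Override public ConsList<T> tail() { return tail; }
}

final class ConsDemo {
    public static void main(String[] args) {
        ConsList<String> base = new Nil<String>().add("Yoda").add("Luke");
        ConsList<String> withRey = base.add("Rey");
        ConsList<String> withFinn = base.add("Finn");

        // Both new versions reuse the very same nodes as their tail.
        System.out.println(withRey.tail() == base);  // true
        System.out.println(withFinn.tail() == base); // true
        System.out.println(base.head());             // Luke
    }
}
```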

Updating Persistent List

A B C X Y Z

Before

A B D

After

Blue nodes indicate new copies

Purple nodes indicate nodes we wish to update
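The update in the diagram can be sketched as follows: replacing C with D copies the nodes in front of the change (A and B) and shares everything behind it (X, Y, Z). PList and its methods are hypothetical names for a bare-bones singly linked list, not code from the talk.

```java
final class PList {
    final String value;
    final PList next; // null marks the end of the list

    PList(String value, PList next) {
        this.value = value;
        this.next = next;
    }

    static PList of(String... values) {
        PList list = null;
        for (int i = values.length - 1; i >= 0; i--) {
            list = new PList(values[i], list);
        }
        return list;
    }

    // Returns a new list with the element at index replaced.
    // Nodes before index are copied; nodes after index are shared.
    static PList set(PList list, int index, String value) {
        if (list == null) {
            throw new IndexOutOfBoundsException("index is outside the list");
        }
        if (index == 0) {
            return new PList(value, list.next);
        }
        return new PList(list.value, set(list.next, index - 1, value));
    }

    static String render(PList list) {
        StringBuilder sb = new StringBuilder();
        for (PList n = list; n != null; n = n.next) {
            sb.append(n.value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PList before = of("A", "B", "C", "X", "Y", "Z");
        PList after = set(before, 2, "D");

        System.out.println(render(before)); // ABCXYZ
        System.out.println(render(after));  // ABDXYZ
        // The suffix starting at X is the same object in both versions.
        System.out.println(before.next.next.next == after.next.next.next); // true
    }
}
```

The cost of an update grows with the position of the changed element, which is why copying becomes more expensive with larger lists.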

Concatenating Two Persistent Lists

- Poor locality due to pointer chasing

- Copying of nodes

A B C

X Y Z

Before

A B C

After

Persistent List

● Structural sharing: no need to copy full structure

● Poor locality due to pointer chasing

● Copying becomes more expensive with larger lists

● Poor random access, and thus poor data decomposition

Updating Persistent Binary Tree

[Figure: the binary tree before and after the update]

Persistent Array

How do we get the immutability benefits with performance of mutable variants?

Trie

1. Branch factor not limited to binary

2. Leaf nodes contain actual values

3. Picking the right branch is done by using parts of the key as a lookup

[Figure: trie with a root node, branches labelled with key characters (a, b, c, e, f) and numeric values at the leaves]

Persistent Array (Bitmapped Vector Trie)

[Figure: trie whose nodes each hold slots 0 to 31, drawn as Level 1 (root), Level 2 and leaf nodes]

Trade-offs

● Large branching factor facilitates iteration but hinders updates

● Small branching factor facilitates updates but hinders traversal

Java Persistent Collections

- Not available as part of the Java core library

- Existing projects include:

- PCollections: https://github.com/hrldcpr/pcollections

- Port of Clojure DS: https://github.com/krukow/clj-ds

- Port of Scala DS: https://github.com/andrewoma/dexx

- Coming soon to Javaslang

Memory usage survey

10,000,000 elements, heap < 32GB

int[]: 40MB

Integer[]: 160MB

ArrayList<Integer>: 215MB

PersistentVector<Integer>: 214MB (Clojure-DS)

Vector<Integer>: 206MB (Dexx, port of Scala-DS)

Data collected using Java Object Layout: http://openjdk.java.net/projects/code-tools/jol/

Primitive specialised collections

● Collections often hold boxed representations of primitive values

● Java 8 introduced IntStream, LongStream, DoubleStream and primitive specialised functional interfaces

● Other libraries, e.g. Agrona, Koloboke and Eclipse Collections, provide primitive specialised collections today.

● Valhalla investigates primitive specialised generics
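As a small illustration of the Java 8 specialisations mentioned above, the sketch below computes the same sum twice: once with IntStream, which works on int values throughout, and once with Stream<Integer>, which boxes every element. The class and method names are made up for the example, and no performance numbers are claimed.

```java
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class PrimitiveStreams {

    // Stays on primitive ints: IntStream, IntUnaryOperator, sum().
    static int sumOfSquaresPrimitive(int n) {
        return IntStream.rangeClosed(1, n).map(i -> i * i).sum();
    }

    // Same result, but every element is an Integer object along the way.
    static int sumOfSquaresBoxed(int n) {
        return Stream.iterate(1, i -> i + 1)
                .limit(n)
                .map(i -> i * i)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquaresPrimitive(10)); // 385
        System.out.println(sumOfSquaresBoxed(10));     // 385
    }
}
```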

Takeaways

● Immutable collections reduce the scope for bugs

● Always a compromise between programming safety and performance

● Performance of persistent data structures is improving

HashMaps Basics

Han Solo: hash = 72309

Chewbacca: hash = 72309

HashMaps

Chaining: a separate data structure for collision lookups

Probing: store inline and have a probing sequence
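To make the probing idea concrete, here is a deliberately tiny linear probing map. It is a sketch under stated assumptions (fixed capacity, no resizing, no removal, String keys and int values), not how any particular library implements it; ProbingMap is a hypothetical name. The keys "Aa" and "BB" are used because their String hash codes are equal, which forces a collision.

```java
final class ProbingMap {
    private final String[] keys;
    private final int[] values;
    private int size;

    ProbingMap(int capacity) {
        keys = new String[capacity];
        values = new int[capacity];
    }

    // Walks the probing sequence (next slot, wrapping around) until it finds
    // the key or an empty slot. Returns -1 if the table is full and the key is absent.
    private int findSlot(String key) {
        int start = Math.floorMod(key.hashCode(), keys.length);
        for (int i = 0; i < keys.length; i++) {
            int slot = (start + i) % keys.length;
            if (keys[slot] == null || keys[slot].equals(key)) {
                return slot;
            }
        }
        return -1;
    }

    void put(String key, int value) {
        int slot = findSlot(key);
        if (slot < 0) {
            throw new IllegalStateException("table is full");
        }
        if (keys[slot] == null) {
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    // Returns null when the key is absent.
    Integer get(String key) {
        int slot = findSlot(key);
        if (slot < 0 || keys[slot] == null) {
            return null;
        }
        return values[slot];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        ProbingMap map = new ProbingMap(8);
        map.put("Aa", 1); // lands in its home slot
        map.put("BB", 2); // same hash code, so it moves on to the next free slot
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
        System.out.println(map.get("Cc")); // null
    }
}
```

Keys and values live directly in the arrays, with no per-entry node objects, which is where the lower memory consumption of probing maps comes from.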

Aliases: Palpatine vs Darth Sidious

Chaining: aka Closed Addressing, aka Open Hashing

Probing: aka Open Addressing, aka Closed Hashing

Chaining: Linked List Based or Tree Based (java.util.HashMap)

Chaining Based HashMap

Historically maintained a linked list of entries in the case of a collision

Problem: with high collision rates, HashMap lookup approaches O(N)

java.util.HashMap in Java 8

Starts by using a List to store colliding values.

Trees used when there are over 8 elements

Tree based nodes use about twice the memory

Makes the heavy-collision lookup case O(log(N)) rather than O(N)

Relies on keys being Comparable

https://github.com/RichardWarburton/map-visualiser

So which HashMap is best?

Benchmarking is about building a mental model of the performance tradeoffs

Example Jar-Jar Benchmark

call get() on a single value for a map of size 1

No model of the different factors that affect things!

Benchmarking HashMaps

● Load Factor

● Nonlinear key access

● Successful vs Failed get()

● Hash Collisions

● Comparable vs Incomparable keys

● Different Keys and Values

● Cost of hashCode()/equals()

[Charts: Tree Optimization at 60% collisions and at 10% collisions]

Probing vs Chaining

Probing Maps usually have lower memory consumption

Small Maps: Probing never has long clusters, can be up to 91% faster.

In large maps with high collision rates, probing scales poorly and can be significantly slower.

Takeaways

There’s no clearcut “winner”.

JDK implementations try to minimise the worst case.

Linear probing requires a good hashCode() distribution. HashMaps often “precondition” their hashes.

IdentityHashMap has low memory consumption and is fast, use it! (It compares keys by reference, not by equals().)

Third-party libraries offer probing HashMaps, e.g. Koloboke & Eclipse Collections.

Conclusions

Interface Popularity

List 1576210

Set 980763

Map 803171

Queue 62024

Deque 3464

SortedSet 9121

NavigableSet 1735

SortedMap 8677

NavigableMap 1484

Implementation Popularity

ArrayList 225029

LinkedList 26850

ArrayDeque 1086

HashSet 68940

TreeSet 10108

EnumSet 10512

HashMap 137610

TreeMap 7734

WeakHashMap 3473

IdentityHashMap 2443

EnumMap 1904

Evolution can be interesting ... (Java 1.2 to Java 10?)

Any Questions?

www.iteratrlearning.com

● Modern Development with Java 8

● Reactive and Asynchronous Java

● Java Software Development Bootcamp

Further reading

Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays
https://infoscience.epfl.ch/record/64410/files/techlists.pdf

Smaller Footprint for Java Collections
http://www.lirmm.fr/~ducour/Doc-objets/ECOOP2012/ECOOP/ecoop/356.pdf

Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections
http://michael.steindorfer.name/publications/oopsla15.pdf

RRB-Trees: Efficient Immutable Vectors
https://infoscience.epfl.ch/record/169879/files/RMTrees.pdf

Doug Lea’s Analysis of the HashMap implementation tradeoffs
http://www.mail-archive.com/core-libs-dev@openjdk.java.net/msg02147.html

Java Specialists HashMap article
http://www.javaspecialists.eu/archive/Issue235.html

Sample and Benchmark Code
https://github.com/RichardWarburton/Java-Collections-The-Force-Awakens

Debian code search used for popularity
https://codesearch.debian.net/

Small HashMaps

Many HashMaps are small or empty

Lazy Initialization In Java 8+

Specialised Implementations

● Collections.singleton*/Collections.empty*

● Collectors.partitioningBy()

● Specialised Eclipse Collections (e.g. Doubleton)

Probing Sequence

Linear

- Cache Locality

Quadratic

- Tree

Clever ideas

Implementing Persistent Collections

Fat node

● Nodes store updated values in an internal list

● Different versions accessible using an order (e.g. timestamp)

Path copying

● Copy path leading to updated node

● Share rest with previous version
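Path copying can be sketched on an unbalanced binary search tree: an insert creates new nodes only along the path from the root to the insertion point, and every subtree off that path is shared with the previous version. PTree is a hypothetical name and the tree is kept deliberately simple (int keys, no balancing).

```java
final class PTree {
    final int key;
    final PTree left;
    final PTree right;

    PTree(int key, PTree left, PTree right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    // Returns a new version containing key. Only nodes on the search path
    // are copied; untouched subtrees are reused as they are.
    static PTree insert(PTree node, int key) {
        if (node == null) {
            return new PTree(key, null, null);
        }
        if (key < node.key) {
            return new PTree(node.key, insert(node.left, key), node.right);
        }
        if (key > node.key) {
            return new PTree(node.key, node.left, insert(node.right, key));
        }
        return node; // key already present: reuse the whole subtree
    }

    static boolean contains(PTree node, int key) {
        while (node != null) {
            if (key == node.key) {
                return true;
            }
            node = key < node.key ? node.left : node.right;
        }
        return false;
    }

    public static void main(String[] args) {
        PTree v1 = insert(insert(insert(null, 5), 3), 8);
        PTree v2 = insert(v1, 9); // path copied: 5 -> 8 -> (new) 9

        System.out.println(contains(v1, 9));    // false, the old version is intact
        System.out.println(contains(v2, 9));    // true
        System.out.println(v1.left == v2.left); // true, the subtree under 3 is shared
    }
}
```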

Benchmarking HashMaps

Test different Assumptions + Behaviours

Understand costs, don’t just measure them

Be Scientific

Use a framework

Peer Review - Wisdom of crowds

Preconditioning

(h = key.hashCode()) ^ (h >>> 16)
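A minimal sketch of what this preconditioning buys, assuming a table whose size is a power of two and whose bucket is picked from the low bits of the hash. The spread method follows the expression above; the bucket method and the HashSpreading class are illustrative names.

```java
final class HashSpreading {

    // Folds the high 16 bits of the hash code into the low 16 bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // For a power-of-two table size only the low bits select the bucket.
    static int bucket(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        Integer a = 0x10000; // these two keys differ only in their high bits
        Integer b = 0x20000;

        // Without preconditioning both land in bucket 0 of a 16-slot table.
        System.out.println(bucket(a.hashCode(), 16)); // 0
        System.out.println(bucket(b.hashCode(), 16)); // 0

        // With preconditioning the high bits influence the bucket.
        System.out.println(bucket(spread(a), 16)); // 1
        System.out.println(bucket(spread(b), 16)); // 2
    }
}
```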

CopyOnWrite

public boolean add(E e) {

final ReentrantLock lock = this.lock;

lock.lock();

try {

Object[] elements = getArray();

int len = elements.length;

Object[] newElements = Arrays.copyOf(elements, len + 1);

newElements[len] = e;

setArray(newElements);

return true;

} finally {

lock.unlock();

}

}
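Because add() installs a fresh array instead of changing the old one, an iterator that is already running keeps reading the array it started with. The sketch below uses java.util.concurrent.CopyOnWriteArrayList to show that adding during iteration neither fails nor affects the loop; the demo class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class CopyOnWriteDemo {

    // Adds one element per visited element while iterating.
    // Returns a summary of how many elements the loop saw and how many are stored.
    static String addWhileIterating(List<String> initial) {
        List<String> jedis = new CopyOnWriteArrayList<>(initial);
        List<String> seen = new ArrayList<>();
        for (String jedi : jedis) {
            seen.add(jedi);
            jedis.add(jedi + "'s apprentice"); // copies the array, the loop is unaffected
        }
        return seen.size() + " seen, " + jedis.size() + " stored";
    }

    public static void main(String[] args) {
        System.out.println(addWhileIterating(Arrays.asList("Luke", "Yoda"))); // 2 seen, 4 stored
    }
}
```

Every write copies the whole array, so this suits lists that are read far more often than they are modified.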

Persistent Array (Bitmapped Vector Trie)

● Uses the bit pattern of the index number for efficient arithmetic and lookup of elements

● A branching factor of 32 and a depth of 5 can store about 33 million elements (32^5 = 33,554,432) and requires 5 lookups to find an element: “practically constant”
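The index arithmetic can be sketched as follows: with 32 slots per node, each level consumes 5 bits of the index, most significant bits first. TrieIndex and its members are hypothetical names; the sketch only computes the path through the trie and does not build the trie itself.

```java
import java.util.Arrays;

final class TrieIndex {
    static final int BITS = 5;            // 2^5 = 32 slots per node
    static final int MASK = (1 << BITS) - 1;
    static final int DEPTH = 5;           // levels from the root down to the leaves

    // Slot to follow at each level, root first, for the given element index.
    static int[] path(int index) {
        int[] slots = new int[DEPTH];
        for (int level = 0; level < DEPTH; level++) {
            int shift = BITS * (DEPTH - 1 - level);
            slots[level] = (index >>> shift) & MASK;
        }
        return slots;
    }

    // Number of elements addressable with this branching factor and depth.
    static long capacity() {
        return 1L << (BITS * DEPTH);
    }

    public static void main(String[] args) {
        System.out.println(capacity());                   // 33554432
        System.out.println(Arrays.toString(path(33)));    // [0, 0, 0, 1, 1]
        System.out.println(Arrays.toString(path(1029)));  // [0, 0, 1, 0, 5]
    }
}
```

Each lookup is a shift and a mask per level, which is why five levels are described as practically constant.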
