java的hashmap实现


总结算法导论中对hash table的讨论,结合android的源码android-8.0.0_r34,总结在对应的hashmap的实现细节.

hash的作用

hash table是数组的延伸,从两个方面对数组做了改进,第一,可以使用任意数据类型作为key;第二,value不必是连续的. 同时保持了数组的O(1)的存取时间.

hash table的两个关键点

  1. 冲突的处理(collsion) 因为hash过程把一个较大的value空间映射到了一个较小的存储空间,所以必然会出现冲突,负载系数=存储元素数/存储空间,描述了冲突的可能性大小.可以用下面两种方式解决冲突.

    • chainning 当发生冲突时通过chain的方式把相同hash值的元素存储为一个链表,查找时需要先比较hash值再比较key值
    • open addressing 当检测到冲突时,继续向后存储下一存储位置,直到找到一个空余位置,或者hash表已满.
  2. hash函数 一个好的hash函数应该是平均一致的hash,即把每个元素同等概率的映射到每一个slot的位置,完全独立于已经映射过的元素.常用的hash函数为

    • 除法 h(k) = k mod m map key into one of m slots,m选择一个不接近于2的幂的素数,
    • 乘法 h(k) = 「m (kA mod 1)」0<A<1 m的选择对结果没啥影响,一般选2的幂 m = 2^w A= s/2^w = \frac{\sqrt{5} - 1}{2} = 0.618...

hash值的计算-java的hashcode函数

在java的Object 对象中有hashCode() 方法.如果对应的对象做hashmap的key就会使用该方法.

    public int hashCode() {
        return identityHashCode(this);
    }
    private static native int identityHashCodeNative(Object obj);

可见在默认的实现中hashCode是通过native函数返回了一个值,文档如是说(This is typically implemented by converting the internal address of the object into an integer)

hashcode和equal函数有千丝万缕的联系,可以参考这个文章这里摘抄下结论.

  • Equal objects must produce the same hash code as long as they are equal, however unequal objects need not produce distinct hash codes.
  • The equals method provides "deep comparison" by checking if two objects are logically equal as opposed to the "shallow comparison" provided by the equality operator ==.
  • However, the equals method in java.lang.Object class only provides "shallow comparison", same as provided by the equality operator ==.
  • The equals method only takes Java objects as an argument, and not primitives; passing primitives will result in a compile time error.
  • Passing objects of different types to the equals method will never result in a compile time error or runtime error.
  • For standard Java wrapper classes and for java.lang.String, if the equals argument type (class) is different from the type of the object on which the equals method is invoked, it will return false.
  • The class java.lang.StringBuffer does not override the equals method, and hence it inherits the implementation from java.lang.Object class.
  • The equals method must not provide equality comparison with any built in Java class, as it would result in the violation of the symmetry requirement stated in the general contract of the equals method.
  • If null is passed as an argument to the equals method, it will return false.
  • Equal hash codes do not imply that the objects are equal.
  • return 1; is a legal implementation of the hashCode method, however it is a very bad implementation. It is legal because it ensures that equal objects will have equal hash codes, it also ensures that the hash code returned will be consistent for multiple invocations during the same execution. Thus, it does not violate the general contract of the hashCode method. It is a bad implementation because it returns same hash code for all the objects. This explanation applies to all implementations of the hashCode method which return same constant integer value for all the objects.
  • In standard JDK 1.4, the wrapper classes java.lang.Short, java.lang.Byte, java.lang.Character and java.lang.Integer simply return the value they represent as the hash code by typecasting it to an int.
  • Since JDK version 1.3, the class java.lang.String caches its hash code, i.e. it calculates the hash code only once and stores it in an instance variable and returns this value whenever the hashCode method is called. It is legal because java.lang.String represents an immutable string.
  • It is incorrect to involve a random number directly while computing the hash code of the class object, as it would not consistently return the same hash code for multiple invocations during the same execution.

如果自己实现一个类,应该如何重写hashcode方法呢,同样参考上面的文章

  1. Store an arbitrary non-zero constant integer value (say 7) in an int variable, called hash.
  2. Involve significant variables of your object in the calculation of the hash code, all the variables that are part of equals comparison should be considered for this. Compute an individual hash code int var_code for each variable var as follows -
    • If the variable(var) is byte, char, short or int, then var_code = (int)var;
    • If the variable(var) is long, then var_code = (int)(var ^ (var >>> 32));
    • If the variable(var) is float, then var_code = Float.floatToIntBits(var);
    • If the variable(var) is double, then - long bits = Double.doubleToLongBits(var); var_code = (int)(bits ^ (bits >>> 32));
    • If the variable(var) is boolean, then var_code = var ? 1 : 0;
    • If the variable(var) is an object reference, then check if it is null, if yes then var_code = 0; otherwise invoke the hashCode method recursively on this object reference to get the hash code. This can be simplified and given as - var_code = (null == var ? 0 : var.hashCode());
  3. Combine this individual variable hash code var_code in the original hash code hash as follows - hash = 31 * hash + var_code;
  4. Follow these steps for all the significant variables and in the end return the resulting integer hash.
  5. Lastly, review your hashCode method and check if it is returning equal hash codes for equal objects. Also, verify that the hash codes returned for the object are consistently the same for multiple invocations during the same execution.

这里举个例子

public class Test
{
    private int num;
    private String data;

    public boolean equals(Object obj)
    {
        if(this == obj)
            return true;
            if((obj == null) || (obj.getClass() != this.getClass()))
                return false;
            // object must be Test at this point
            Test test = (Test)obj;
            return num == test.num &&
            (data == test.data || (data != null && data.equals(test.data)));
    }

    public int hashCode()
    {
        int hash = 7;
        hash = 31 * hash + num;
        hash = 31 * hash + (null == data ? 0 : data.hashCode());
        return hash;
    }
        // other methods
}

android 的String的hashcode的例子

 public int hashCode() {
        int h = hash;
        final int len = length();
        if (h == 0 && len > 0) {
            for (int i = 0; i < len; i++) {
                h = 31 * h + charAt(i);
            }
            hash = h;
        }
        return h;
    }

可以看到string length长度为0时,hash值也为0,String对每个对象的hash值进行了缓存.

Map

Map.java 定义了map接口的基本操作

public interface Map<K, V> {
    interface Entry<K, V> {
        K getKey();
        V getValue();
        V setValue(V value);
    }
    
    V remove(Object key);
    V put(K key, V value);
    V get(Object key);
    //遍历使用的方法
    Set<K> keySet();
    Collection<V> values();
    Set<Map.Entry<K, V>> entrySet();
    
    int size();
    boolean isEmpty();
    void clear();
}

HashMap

下面看下HashMap对这个接口的实现

public class HashMap<K,V> extends AbstractMap<K,V>
    implements Map<K,V>, Cloneable, Serializable {
    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;
    }
    transient Node<K,V>[] table;
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
    static final int MAXIMUM_CAPACITY = 1 << 30;
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
    int threshold;//capacity * load factor
}

可见hashmap内部使用了一个数组存储key-value值,每个节点是一个单链表,可见是通过chain的方式处理冲突的.

这里要先讨论几个关键量,capacity在这里capacity即为数组table的大小,表示当前hash空间的大小,最开始起始值是DEFAULT_INITIAL_CAPACITY=16. 随着map中的元素越来越多,当元素数量超过threshold=capacity * load factorcapacity会增加,也就是table数组会增大,具体扩大策略后面再说.

    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)//第一次使用创建table
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);//没有旧元素,需要创建一个并插入
        else {//已经有相同的hash元素存在了
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))//table中存在了相同的元素,需要替换
                e = p;
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {//collision 添加在链表后面
                        p.next = newNode(hash, key, value, null);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))//链表里存在相同元素,需要替换
                        break;
                    p = e;
                }
            }
            if (e != null) { // 这里统一替换
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

这里涉及到了hash最核心的操作,首先作为key的Object对象的hashcode方法返回了一个hash值,该值通过(h = key.hashCode()) ^ (h >>> 16) 将高16位和低16位异或,使高位的值也能在hash时起作用,然后是这个32位的hash值如何映射到table数组呢,关键就是p = tab[i = (n - 1) & hash] 可见是把计算出的hash值直接对数组的大小取余,也就是使用除法hash方法.下面看下get方法

   public V get(Object key) {
        Node<K,V> e;
        return (e = getNode(hash(key), key)) == null ? null : e.value;
    }
    final Node<K,V> getNode(int hash, Object key) {
        Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
        if ((tab = table) != null && (n = tab.length) > 0 &&
            (first = tab[(n - 1) & hash]) != null) {
            if (first.hash == hash && // always check first node
                ((k = first.key) == key || (key != null && key.equals(k))))
                return first;
            if ((e = first.next) != null) {
                do {
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        return e;
                } while ((e = e.next) != null);
            }
        }
        return null;
    }

get方法没有什么特别通过hash获得table里对应的位置,遍历table里的单向链表.

   public Collection<V> values() {
        Collection<V> vs = values;
        if (vs == null) {
            vs = new Values();
            values = vs;
        }
        return vs;
    }
    final class Values extends AbstractCollection<V> {
        public final int size()                 { return size; }
        public final void clear()               { HashMap.this.clear(); }
        public final Iterator<V> iterator()     { return new ValueIterator(); }
        public final boolean contains(Object o) { return containsValue(o); }
        public final Spliterator<V> spliterator() {
            return new ValueSpliterator<>(HashMap.this, 0, -1, 0, 0);
        }
        public final void forEach(Consumer<? super V> action) {
            Node<K,V>[] tab;
            if (action == null)
                throw new NullPointerException();
            if (size > 0 && (tab = table) != null) {
                int mc = modCount;
                // Android-changed: Detect changes to modCount early.
                for (int i = 0; (i < tab.length && modCount == mc); ++i) {
                    for (Node<K,V> e = tab[i]; e != null; e = e.next)
                        action.accept(e.value);
                }
                if (modCount != mc)
                    throw new ConcurrentModificationException();
            }
        }
    }

values()方法是直接操作table数组,所以通过values的返回值修改会反馈到map. keySet也是同样.

    public Set<Map.Entry<K,V>> entrySet() {
        Set<Map.Entry<K,V>> es;
        return (es = entrySet) == null ? (entrySet = new EntrySet()) : es;
    }
    final class EntrySet extends AbstractSet<Map.Entry<K,V>> {
        public final int size()                 { return size; }
        public final void clear()               { HashMap.this.clear(); }
        public final Iterator<Map.Entry<K,V>> iterator() {
            return new EntryIterator();
        }
        public final boolean contains(Object o) {
            if (!(o instanceof Map.Entry))
                return false;
            Map.Entry<?,?> e = (Map.Entry<?,?>) o;
            Object key = e.getKey();
            Node<K,V> candidate = getNode(hash(key), key);
            return candidate != null && candidate.equals(e);
        }
        public final boolean remove(Object o) {
            if (o instanceof Map.Entry) {
                Map.Entry<?,?> e = (Map.Entry<?,?>) o;
                Object key = e.getKey();
                Object value = e.getValue();
                return removeNode(hash(key), key, value, true, true) != null;
            }
            return false;
        }
        public final Spliterator<Map.Entry<K,V>> spliterator() {
            return new EntrySpliterator<>(HashMap.this, 0, -1, 0, 0);
        }
        public final void forEach(Consumer<? super Map.Entry<K,V>> action) {
            Node<K,V>[] tab;
            if (action == null)
                throw new NullPointerException();
            if (size > 0 && (tab = table) != null) {
                int mc = modCount;
                // Android-changed: Detect changes to modCount early.
                for (int i = 0; (i < tab.length && modCount == mc); ++i) {
                    for (Node<K,V> e = tab[i]; e != null; e = e.next)
                        action.accept(e);
                }
                if (modCount != mc)
                    throw new ConcurrentModificationException();
            }
        }
    }

下面看下remove操作

    public V remove(Object key) {
        Node<K,V> e;
        return (e = removeNode(hash(key), key, null, false, true)) == null ?
            null : e.value;
    }
    final Node<K,V> removeNode(int hash, Object key, Object value,
                               boolean matchValue, boolean movable) {
        Node<K,V>[] tab; Node<K,V> p; int n, index;
        if ((tab = table) != null && (n = tab.length) > 0 &&
            (p = tab[index = (n - 1) & hash]) != null) {//p指向table里hash对应的元素
            Node<K,V> node = null, e; K k; V v;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                node = p;//node指向需要需要移除的node
            else if ((e = p.next) != null) {
                do {
                    if (e.hash == hash &&
                        ((k = e.key) == key ||
                         (key != null && key.equals(k)))) {
                        node = e;
                        break;
                    }
                    p = e;//p指向需要移除node的父节点
                } while ((e = e.next) != null);
            }
            if (node != null && (!matchValue || (v = node.value) == value ||
                                 (value != null && value.equals(v)))) {
                if (node == p)
                    tab[index] = node.next;
                else
                    p.next = node.next;
                ++modCount;
                --size;
                afterNodeRemoval(node);
                return node;
            }
        }
        return null;
    }

单链表移除元素的常规操作.下面着重看下table是如何扩容的.通过上面的put方法可以看到,在每次put后都会检查当前size是不是超过了threshold,超过了则通过resize()扩容.

final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) {//新老hash对应的位置相同
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }else {//新的位置为旧的位置 + oldCap
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

首先每次扩容threshold和capacity都会是原来的2倍.之后使用新的capacity new一个新的table,将需要移动和不需要移动的数据生成两个列表,然后个放到对应的位置. 可见每次扩容都会有rehash的过程,所以如果能在每次操作前预知hashtable的最大值,设置capacity为最大值/load factor,减少rehash,有助于提高性能.

LinkedHashMap

LinkedHashMap继承自HashMap,在entry内增加了两个指针,用来把所有的entry做双向链表(下面代码里的before,after).

public class LinkedHashMap<K,V>
    extends HashMap<K,V>
    implements Map<K,V>
{
    static class LinkedHashMapEntry<K,V> extends HashMap.Node<K,V> {
        LinkedHashMapEntry<K,V> before, after;
        LinkedHashMapEntry(int hash, K key, V value, Node<K,V> next) {
            super(hash, key, value, next);
        }
    }
    transient LinkedHashMapEntry<K,V> head;
    transient LinkedHashMapEntry<K,V> tail;
}

每次有新的元素加入,新的元素除了做所有hashmap的操作,还会被加入双链表里.

    Node<K,V> newNode(int hash, K key, V value, Node<K,V> e) {
        LinkedHashMapEntry<K,V> p =
            new LinkedHashMapEntry<K,V>(hash, key, value, e);
        linkNodeLast(p);
        return p;
    }
    private void linkNodeLast(LinkedHashMapEntry<K,V> p) {
        LinkedHashMapEntry<K,V> last = tail;
        tail = p;
        if (last == null)
            head = p;
        else {
            p.before = last;
            last.after = p;
        }
    }

删除节点在hashmap的remove里最后会调用afterNodeRemoval这个是linkedhashmap用来更新双向链表的.

    void afterNodeRemoval(Node<K,V> e) { // unlink
        LinkedHashMapEntry<K,V> p =
            (LinkedHashMapEntry<K,V>)e, b = p.before, a = p.after;
        p.before = p.after = null;
        if (b == null)
            head = a;
        else
            b.after = a;
        if (a == null)
            tail = b;
        else
            a.before = b;
    }

Android特有的SparseArray, SparseBooleanArray, SparseIntArray, SparseLongArray

SparseArray存储hash值的table是不连续的,用来减少int做key的hashmap的空间占用.在存储间隔比较大的int型key map有一定节约空间的作用,另外减少了key的autobox的过程,能一定程度提高速度. 其查找key的方式并不是hash,而是通过二分查找,详细细节参见SparseArray代码.

public class SparseArray<E> implements Cloneable {
    private static final Object DELETED = new Object();
    private boolean mGarbage = false;
    private int[] mKeys;
    private Object[] mValues;
    private int mSize;
    public E get(int key, E valueIfKeyNotFound) {
        int i = ContainerHelpers.binarySearch(mKeys, mSize, key);
        if (i < 0 || mValues[i] == DELETED) {
            return valueIfKeyNotFound;
        } else {
            return (E) mValues[i];
        }
    }
    public void put(int key, E value) {
        int i = ContainerHelpers.binarySearch(mKeys, mSize, key);
        if (i >= 0) {
            mValues[i] = value;
        } else {
            i = ~i;
            if (i < mSize && mValues[i] == DELETED) {
                mKeys[i] = key;
                mValues[i] = value;
                return;
            }
            if (mGarbage && mSize >= mKeys.length) {
                gc();
                // Search again because indices may have changed.
                i = ~ContainerHelpers.binarySearch(mKeys, mSize, key);
            }
            mKeys = GrowingArrayUtils.insert(mKeys, mSize, i, key);
            mValues = GrowingArrayUtils.insert(mValues, mSize, i, value);
            mSize++;
        }
    }
}
GrowingArrayUtils.java
    public static int[] insert(int[] array, int currentSize, int index, int element) {
        assert currentSize <= array.length;
        if (currentSize + 1 <= array.length) {
            System.arraycopy(array, index, array, index + 1, currentSize - index);
            array[index] = element;
            return array;
        }
        int[] newArray = ArrayUtils.newUnpaddedIntArray(growSize(currentSize));
        System.arraycopy(array, 0, newArray, 0, index);
        newArray[index] = element;
        System.arraycopy(array, index, newArray, index + 1, array.length - index);
        return newArray;
    }

首先,SparseArray使用两个数组mKeysmValues存储key-values.

其中mKeys是按递增顺序的顺序存储的,这样在查找key的时候才可以使用二分查找,这样的负面就是,如果加入的新的key-value pair的key的位置应该位于keys数组中间时, 这时需要移动key后面的所有数据,同时可能需要扩展数组.

这时使用时就需要自己平衡下,hashmap和sparsearray哪个更合适.

TreeMap

treemap是使用red-black tree数据结构实现的map接口.所以其查找添加操作的时间复杂度都为O(lg(n))

public class TreeMap<K,V>
    extends AbstractMap<K,V>
    implements NavigableMap<K,V>, Cloneable, java.io.Serializable
{
    private final Comparator<? super K> comparator;
    private transient TreeMapEntry<K,V> root;
    private transient int size = 0;
    static final class TreeMapEntry<K,V> implements Map.Entry<K,V> {
        K key;
        V value;
        TreeMapEntry<K,V> left;
        TreeMapEntry<K,V> right;
        TreeMapEntry<K,V> parent;
        boolean color = BLACK;
    }
    
    public V get(Object key) {
        TreeMapEntry<K,V> p = getEntry(key);
        return (p==null ? null : p.value);
    }
    final TreeMapEntry<K,V> getEntry(Object key) {
        // Offload comparator-based version for sake of performance
        if (comparator != null)
            return getEntryUsingComparator(key);
        if (key == null)
            throw new NullPointerException();
        @SuppressWarnings("unchecked")
            Comparable<? super K> k = (Comparable<? super K>) key;
        TreeMapEntry<K,V> p = root;
        while (p != null) {
            int cmp = k.compareTo(p.key);
            if (cmp < 0)
                p = p.left;
            else if (cmp > 0)
                p = p.right;
            else
                return p;
        }
        return null;
    }
}
    

Copyright © FengGuangtu 2017