
Commit 298c313

Author: Jonathan Turner

Rollup merge of rust-lang#36692 - arthurprs:hashmap-layout, r=alexcrichton

Cache conscious hashmap table

Right now the internal HashMap representation is 3 unzipped arrays (`hhhkkkvvv`). I propose to change it to `hhhkvkvkv` (in further iterations `kvkvkvhhh` may allow in-place growth). A previous attempt is at rust-lang#21973.

This layout is generally more cache conscious, as it makes the value immediately accessible after a key matches. Keeping the hash array separate is a _no-brainer_ because of how the RH (Robin Hood) algorithm works, and that part is unchanged.

**Lookups**: Upon a successful match in the hash array, the code can check the key and immediately have access to the value in the same or the next cache line (effectively saving an L1/L2/L3 cache miss compared to the current layout).

**Inserts/Deletes/Resize**: Moving values in the table (robin hooding them) is faster because it touches consecutive cache lines and uses fewer instructions.

Some backing benchmarks (besides the ones below) for the benefits of this layout can be seen here as well: http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/

The obvious drawback is that padding can be wasted between the key and value. Because of that, `keys()`, `values()` and `contains()` can consume more cache and be slower.

Total wasted padding between items (C being the capacity of the table):

* Old layout: C * (K-K padding) + C * (V-V padding)
* Proposed: C * (K-V padding) + C * (V-K padding)

In practice, padding between K-K and V-V *can* be smaller than between K-V and V-K. The overhead is capped(ish) at `sizeof u64 - 1`, so we can actually measure the worst case (a u8 at the end of the key type and a value with alignment of 1, _hardly the average case in practice_). Starting from the worst case, the memory overhead is:

* `HashMap<u64, u8>`: 46% memory overhead (aka *worst case*)
* `HashMap<u64, u16>`: 33% memory overhead
* `HashMap<u64, u32>`: 20% memory overhead
* `HashMap<T, T>`: 0% memory overhead

Worst case based on sizeof K + sizeof V:

| x             | 16   | 24   | 32   | 64   | 128  |
|---------------|------|------|------|------|------|
| (8+x+7)/(8+x) | 1.29 | 1.22 | 1.18 | 1.1  | 1.05 |

I have a test repo to run the benchmarks here: https://github.com/arthurprs/hashmap2/tree/layout

```
➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt
 name                            hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff %
 grow_10_000                     922,064           783,933               -138,131  -14.98%
 grow_big_value_10_000           1,901,909         1,171,862             -730,047  -38.38%
 grow_fnv_10_000                 443,544           418,674                -24,870   -5.61%
 insert_100                      2,469             2,342                     -127   -5.14%
 insert_1000                     23,331            21,536                  -1,795   -7.69%
 insert_100_000                  4,748,048         3,764,305             -983,743  -20.72%
 insert_10_000                   321,744           290,126                -31,618   -9.83%
 insert_int_bigvalue_10_000      749,764           407,547               -342,217  -45.64%
 insert_str_10_000               337,425           334,009                 -3,416   -1.01%
 insert_string_10_000            788,667           788,262                   -405   -0.05%
 iter_keys_100_000               394,484           374,161                -20,323   -5.15%
 iter_keys_big_value_100_000     402,071           620,810                218,739   54.40%
 iter_values_100_000             424,794           373,004                -51,790  -12.19%
 iterate_100_000                 424,297           389,950                -34,347   -8.10%
 lookup_100_000                  189,997           186,554                 -3,443   -1.81%
 lookup_100_000_bigvalue         192,509           189,695                 -2,814   -1.46%
 lookup_10_000                   154,251           145,731                 -8,520   -5.52%
 lookup_10_000_bigvalue          162,315           146,527                -15,788   -9.73%
 lookup_10_000_exist             132,769           128,922                 -3,847   -2.90%
 lookup_10_000_noexist           146,880           144,504                 -2,376   -1.62%
 lookup_1_000_000                137,167           132,260                 -4,907   -3.58%
 lookup_1_000_000_bigvalue       141,130           134,371                 -6,759   -4.79%
 lookup_1_000_000_bigvalue_unif  567,235           481,272                -85,963  -15.15%
 lookup_1_000_000_unif           589,391           453,576               -135,815  -23.04%
 merge_shuffle                   1,253,357         1,207,387              -45,970   -3.67%
 merge_simple                    40,264,690        37,996,903          -2,267,787   -5.63%
 new                             6                 5                           -1  -16.67%
 with_capacity_10e5              3,214             3,256                       42    1.31%
```

```
➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt
 name                           hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff %
 iter_keys_100_000              391,677           382,839                 -8,838   -2.26%
 iter_keys_1_000_000            10,797,360        10,209,898            -587,462   -5.44%
 iter_keys_big_value_100_000    414,736           662,255                247,519   59.68%
 iter_keys_big_value_1_000_000  10,147,837        12,067,938           1,920,101   18.92%
 iter_values_100_000            440,445           377,080                -63,365  -14.39%
 iter_values_1_000_000          10,931,844        9,979,173             -952,671   -8.71%
 iterate_100_000                428,644           388,509                -40,135   -9.36%
 iterate_1_000_000              11,065,419        10,042,427          -1,022,992   -9.24%
```
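
To make the padding trade-off concrete, here is a small standalone snippet (an illustration written for this write-up, not code from the patch) that compares the per-bucket footprint of the two layouts for the worst-case `HashMap<u64, u8>`; the sizes in the comments assume a typical 64-bit target:

```rust
use std::mem::{align_of, size_of};

fn main() {
    // Old layout (hhhkkkvvv): hashes, keys and values live in three separate
    // arrays, so a bucket costs exactly hash + key + value bytes.
    let old_per_bucket = size_of::<u64>() + size_of::<u64>() + size_of::<u8>(); // 17

    // Proposed layout (hhhkvkvkv): hashes in one array, (K, V) pairs in the
    // other. The tuple is padded up to its alignment, so (u64, u8) grows from
    // 9 payload bytes to 16 -- the 7 wasted bytes are exactly the
    // `sizeof u64 - 1` cap mentioned above.
    let pair = size_of::<(u64, u8)>(); // 16, since align_of::<(u64, u8)>() is 8
    let new_per_bucket = size_of::<u64>() + pair; // 24

    println!("pair align: {}", align_of::<(u64, u8)>());
    println!("old: {} bytes/bucket, new: {} bytes/bucket",
             old_per_bucket, new_per_bucket);
}
```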
2 parents e335620 + c5068a4 commit 298c313

1 file changed: +64 −83 lines changed

src/libstd/collections/hash/table.rs

```diff
@@ -24,10 +24,10 @@ use self::BucketState::*;
 const EMPTY_BUCKET: u64 = 0;
 
 /// The raw hashtable, providing safe-ish access to the unzipped and highly
-/// optimized arrays of hashes, keys, and values.
+/// optimized arrays of hashes, and key-value pairs.
 ///
-/// This design uses less memory and is a lot faster than the naive
-/// `Vec<Option<u64, K, V>>`, because we don't pay for the overhead of an
+/// This design is a lot faster than the naive
+/// `Vec<Option<(u64, K, V)>>`, because we don't pay for the overhead of an
 /// option on every element, and we get a generally more cache-aware design.
 ///
 /// Essential invariants of this structure:
@@ -48,17 +48,19 @@ const EMPTY_BUCKET: u64 = 0;
 ///   which will likely map to the same bucket, while not being confused
 ///   with "empty".
 ///
-/// - All three "arrays represented by pointers" are the same length:
+/// - Both "arrays represented by pointers" are the same length:
 ///   `capacity`. This is set at creation and never changes. The arrays
-///   are unzipped to save space (we don't have to pay for the padding
-///   between odd sized elements, such as in a map from u64 to u8), and
-///   be more cache aware (scanning through 8 hashes brings in at most
-///   2 cache lines, since they're all right beside each other).
+///   are unzipped and are more cache aware (scanning through 8 hashes
+///   brings in at most 2 cache lines, since they're all right beside each
+///   other). This layout may waste space in padding such as in a map from
+///   u64 to u8, but is a more cache conscious layout as the key-value pairs
+///   are only very shortly probed and the desired value will be in the same
+///   or next cache line.
 ///
 /// You can kind of think of this module/data structure as a safe wrapper
 /// around just the "table" part of the hashtable. It enforces some
 /// invariants at the type level and employs some performance trickery,
-/// but in general is just a tricked out `Vec<Option<u64, K, V>>`.
+/// but in general is just a tricked out `Vec<Option<(u64, K, V)>>`.
 pub struct RawTable<K, V> {
     capacity: usize,
     size: usize,
@@ -74,10 +76,8 @@ unsafe impl<K: Sync, V: Sync> Sync for RawTable<K, V> {}
 
 struct RawBucket<K, V> {
     hash: *mut u64,
-
     // We use *const to ensure covariance with respect to K and V
-    key: *const K,
-    val: *const V,
+    pair: *const (K, V),
     _marker: marker::PhantomData<(K, V)>,
 }
 
@@ -181,8 +181,7 @@ impl<K, V> RawBucket<K, V> {
     unsafe fn offset(self, count: isize) -> RawBucket<K, V> {
         RawBucket {
             hash: self.hash.offset(count),
-            key: self.key.offset(count),
-            val: self.val.offset(count),
+            pair: self.pair.offset(count),
             _marker: marker::PhantomData,
         }
     }
@@ -370,8 +369,7 @@ impl<K, V, M> EmptyBucket<K, V, M>
     pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
         unsafe {
             *self.raw.hash = hash.inspect();
-            ptr::write(self.raw.key as *mut K, key);
-            ptr::write(self.raw.val as *mut V, value);
+            ptr::write(self.raw.pair as *mut (K, V), (key, value));
 
             self.table.borrow_table_mut().size += 1;
         }
@@ -430,7 +428,7 @@ impl<K, V, M: Deref<Target = RawTable<K, V>>> FullBucket<K, V, M> {
 
     /// Gets references to the key and value at a given index.
     pub fn read(&self) -> (&K, &V) {
-        unsafe { (&*self.raw.key, &*self.raw.val) }
+        unsafe { (&(*self.raw.pair).0, &(*self.raw.pair).1) }
     }
 }
 
@@ -447,13 +445,14 @@ impl<'t, K, V> FullBucket<K, V, &'t mut RawTable<K, V>> {
 
         unsafe {
             *self.raw.hash = EMPTY_BUCKET;
+            let (k, v) = ptr::read(self.raw.pair);
             (EmptyBucket {
                 raw: self.raw,
                 idx: self.idx,
                 table: self.table,
             },
-             ptr::read(self.raw.key),
-             ptr::read(self.raw.val))
+             k,
+             v)
         }
     }
 }
@@ -466,8 +465,7 @@ impl<K, V, M> FullBucket<K, V, M>
     pub fn replace(&mut self, h: SafeHash, k: K, v: V) -> (SafeHash, K, V) {
         unsafe {
             let old_hash = ptr::replace(self.raw.hash as *mut SafeHash, h);
-            let old_key = ptr::replace(self.raw.key as *mut K, k);
-            let old_val = ptr::replace(self.raw.val as *mut V, v);
+            let (old_key, old_val) = ptr::replace(self.raw.pair as *mut (K, V), (k, v));
 
             (old_hash, old_key, old_val)
         }
@@ -479,7 +477,8 @@ impl<K, V, M> FullBucket<K, V, M>
 {
     /// Gets mutable references to the key and value at a given index.
     pub fn read_mut(&mut self) -> (&mut K, &mut V) {
-        unsafe { (&mut *(self.raw.key as *mut K), &mut *(self.raw.val as *mut V)) }
+        let pair_mut = self.raw.pair as *mut (K, V);
+        unsafe { (&mut (*pair_mut).0, &mut (*pair_mut).1) }
     }
 }
 
@@ -492,7 +491,7 @@ impl<'t, K, V, M> FullBucket<K, V, M>
     /// in exchange for this, the returned references have a longer lifetime
     /// than the references returned by `read()`.
     pub fn into_refs(self) -> (&'t K, &'t V) {
-        unsafe { (&*self.raw.key, &*self.raw.val) }
+        unsafe { (&(*self.raw.pair).0, &(*self.raw.pair).1) }
     }
 }
 
@@ -502,7 +501,8 @@ impl<'t, K, V, M> FullBucket<K, V, M>
     /// This works similarly to `into_refs`, exchanging a bucket state
    /// for mutable references into the table.
     pub fn into_mut_refs(self) -> (&'t mut K, &'t mut V) {
-        unsafe { (&mut *(self.raw.key as *mut K), &mut *(self.raw.val as *mut V)) }
+        let pair_mut = self.raw.pair as *mut (K, V);
+        unsafe { (&mut (*pair_mut).0, &mut (*pair_mut).1) }
     }
 }
 
@@ -517,8 +517,7 @@ impl<K, V, M> GapThenFull<K, V, M>
     pub fn shift(mut self) -> Option<GapThenFull<K, V, M>> {
         unsafe {
             *self.gap.raw.hash = mem::replace(&mut *self.full.raw.hash, EMPTY_BUCKET);
-            ptr::copy_nonoverlapping(self.full.raw.key, self.gap.raw.key as *mut K, 1);
-            ptr::copy_nonoverlapping(self.full.raw.val, self.gap.raw.val as *mut V, 1);
+            ptr::copy_nonoverlapping(self.full.raw.pair, self.gap.raw.pair as *mut (K, V), 1);
         }
 
         let FullBucket { raw: prev_raw, idx: prev_idx, .. } = self.full;
@@ -560,49 +559,42 @@ fn test_rounding() {
     assert_eq!(round_up_to_next(5, 4), 8);
 }
 
-// Returns a tuple of (key_offset, val_offset),
+// Returns a tuple of (pairs_offset, end_of_pairs_offset),
 // from the start of a mallocated array.
 #[inline]
 fn calculate_offsets(hashes_size: usize,
-                     keys_size: usize,
-                     keys_align: usize,
-                     vals_align: usize)
+                     pairs_size: usize,
+                     pairs_align: usize)
                      -> (usize, usize, bool) {
-    let keys_offset = round_up_to_next(hashes_size, keys_align);
-    let (end_of_keys, oflo) = keys_offset.overflowing_add(keys_size);
-
-    let vals_offset = round_up_to_next(end_of_keys, vals_align);
+    let pairs_offset = round_up_to_next(hashes_size, pairs_align);
+    let (end_of_pairs, oflo) = pairs_offset.overflowing_add(pairs_size);
 
-    (keys_offset, vals_offset, oflo)
+    (pairs_offset, end_of_pairs, oflo)
 }
 
 // Returns a tuple of (minimum required malloc alignment, hash_offset,
 // array_size), from the start of a mallocated array.
 fn calculate_allocation(hash_size: usize,
                         hash_align: usize,
-                        keys_size: usize,
-                        keys_align: usize,
-                        vals_size: usize,
-                        vals_align: usize)
+                        pairs_size: usize,
+                        pairs_align: usize)
                         -> (usize, usize, usize, bool) {
     let hash_offset = 0;
-    let (_, vals_offset, oflo) = calculate_offsets(hash_size, keys_size, keys_align, vals_align);
-    let (end_of_vals, oflo2) = vals_offset.overflowing_add(vals_size);
+    let (_, end_of_pairs, oflo) = calculate_offsets(hash_size, pairs_size, pairs_align);
 
-    let align = cmp::max(hash_align, cmp::max(keys_align, vals_align));
+    let align = cmp::max(hash_align, pairs_align);
 
-    (align, hash_offset, end_of_vals, oflo || oflo2)
+    (align, hash_offset, end_of_pairs, oflo)
 }
 
 #[test]
 fn test_offset_calculation() {
-    assert_eq!(calculate_allocation(128, 8, 15, 1, 4, 4),
-               (8, 0, 148, false));
-    assert_eq!(calculate_allocation(3, 1, 2, 1, 1, 1), (1, 0, 6, false));
-    assert_eq!(calculate_allocation(6, 2, 12, 4, 24, 8), (8, 0, 48, false));
-    assert_eq!(calculate_offsets(128, 15, 1, 4), (128, 144, false));
-    assert_eq!(calculate_offsets(3, 2, 1, 1), (3, 5, false));
-    assert_eq!(calculate_offsets(6, 12, 4, 8), (8, 24, false));
+    assert_eq!(calculate_allocation(128, 8, 16, 8), (8, 0, 144, false));
+    assert_eq!(calculate_allocation(3, 1, 2, 1), (1, 0, 5, false));
+    assert_eq!(calculate_allocation(6, 2, 12, 4), (4, 0, 20, false));
+    assert_eq!(calculate_offsets(128, 15, 4), (128, 143, false));
+    assert_eq!(calculate_offsets(3, 2, 4), (4, 6, false));
+    assert_eq!(calculate_offsets(6, 12, 4), (8, 20, false));
 }
 
 impl<K, V> RawTable<K, V> {
@@ -621,8 +613,7 @@ impl<K, V> RawTable<K, V> {
         // No need for `checked_mul` before a more restrictive check performed
        // later in this method.
         let hashes_size = capacity * size_of::<u64>();
-        let keys_size = capacity * size_of::<K>();
-        let vals_size = capacity * size_of::<V>();
+        let pairs_size = capacity * size_of::<(K, V)>();
 
         // Allocating hashmaps is a little tricky. We need to allocate three
         // arrays, but since we know their sizes and alignments up front,
@@ -634,19 +625,13 @@ impl<K, V> RawTable<K, V> {
         // factored out into a different function.
         let (malloc_alignment, hash_offset, size, oflo) = calculate_allocation(hashes_size,
                                                                                align_of::<u64>(),
-                                                                               keys_size,
-                                                                               align_of::<K>(),
-                                                                               vals_size,
-                                                                               align_of::<V>());
-
+                                                                               pairs_size,
+                                                                               align_of::<(K,
+                                                                                           V)>());
         assert!(!oflo, "capacity overflow");
 
         // One check for overflow that covers calculation and rounding of size.
-        let size_of_bucket = size_of::<u64>()
-            .checked_add(size_of::<K>())
-            .unwrap()
-            .checked_add(size_of::<V>())
-            .unwrap();
+        let size_of_bucket = size_of::<u64>().checked_add(size_of::<(K, V)>()).unwrap();
         assert!(size >=
                 capacity.checked_mul(size_of_bucket)
                     .expect("capacity overflow"),
@@ -669,17 +654,16 @@ impl<K, V> RawTable<K, V> {
 
     fn first_bucket_raw(&self) -> RawBucket<K, V> {
         let hashes_size = self.capacity * size_of::<u64>();
-        let keys_size = self.capacity * size_of::<K>();
+        let pairs_size = self.capacity * size_of::<(K, V)>();
 
-        let buffer = *self.hashes as *const u8;
-        let (keys_offset, vals_offset, oflo) =
-            calculate_offsets(hashes_size, keys_size, align_of::<K>(), align_of::<V>());
+        let buffer = *self.hashes as *mut u8;
+        let (pairs_offset, _, oflo) =
+            calculate_offsets(hashes_size, pairs_size, align_of::<(K, V)>());
         debug_assert!(!oflo, "capacity overflow");
         unsafe {
             RawBucket {
                 hash: *self.hashes,
-                key: buffer.offset(keys_offset as isize) as *const K,
-                val: buffer.offset(vals_offset as isize) as *const V,
+                pair: buffer.offset(pairs_offset as isize) as *const _,
                 _marker: marker::PhantomData,
             }
         }
@@ -844,7 +828,7 @@ impl<'a, K, V> Iterator for RevMoveBuckets<'a, K, V> {
 
             if *self.raw.hash != EMPTY_BUCKET {
                 self.elems_left -= 1;
-                return Some((ptr::read(self.raw.key), ptr::read(self.raw.val)));
+                return Some(ptr::read(self.raw.pair));
             }
         }
     }
@@ -909,7 +893,7 @@ impl<'a, K, V> Iterator for Iter<'a, K, V> {
     fn next(&mut self) -> Option<(&'a K, &'a V)> {
         self.iter.next().map(|bucket| {
            self.elems_left -= 1;
-            unsafe { (&*bucket.key, &*bucket.val) }
+            unsafe { (&(*bucket.pair).0, &(*bucket.pair).1) }
         })
     }
 
@@ -929,7 +913,8 @@ impl<'a, K, V> Iterator for IterMut<'a, K, V> {
     fn next(&mut self) -> Option<(&'a K, &'a mut V)> {
         self.iter.next().map(|bucket| {
             self.elems_left -= 1;
-            unsafe { (&*bucket.key, &mut *(bucket.val as *mut V)) }
+            let pair_mut = bucket.pair as *mut (K, V);
+            unsafe { (&(*pair_mut).0, &mut (*pair_mut).1) }
         })
     }
 
@@ -950,7 +935,8 @@ impl<K, V> Iterator for IntoIter<K, V> {
         self.iter.next().map(|bucket| {
             self.table.size -= 1;
             unsafe {
-                (SafeHash { hash: *bucket.hash }, ptr::read(bucket.key), ptr::read(bucket.val))
+                let (k, v) = ptr::read(bucket.pair);
+                (SafeHash { hash: *bucket.hash }, k, v)
             }
         })
     }
@@ -974,9 +960,8 @@ impl<'a, K, V> Iterator for Drain<'a, K, V> {
         self.iter.next().map(|bucket| {
             unsafe {
                 (**self.table).size -= 1;
-                (SafeHash { hash: ptr::replace(bucket.hash, EMPTY_BUCKET) },
-                 ptr::read(bucket.key),
-                 ptr::read(bucket.val))
+                let (k, v) = ptr::read(bucket.pair);
+                (SafeHash { hash: ptr::replace(bucket.hash, EMPTY_BUCKET) }, k, v)
             }
         })
     }
@@ -1015,8 +1000,7 @@ impl<K: Clone, V: Clone> Clone for RawTable<K, V> {
                             (full.hash(), k.clone(), v.clone())
                         };
                         *new_buckets.raw.hash = h.inspect();
-                        ptr::write(new_buckets.raw.key as *mut K, k);
-                        ptr::write(new_buckets.raw.val as *mut V, v);
+                        ptr::write(new_buckets.raw.pair as *mut (K, V), (k, v));
                     }
                     Empty(..) => {
                         *new_buckets.raw.hash = EMPTY_BUCKET;
@@ -1054,14 +1038,11 @@ impl<K, V> Drop for RawTable<K, V> {
         }
 
         let hashes_size = self.capacity * size_of::<u64>();
-        let keys_size = self.capacity * size_of::<K>();
-        let vals_size = self.capacity * size_of::<V>();
+        let pairs_size = self.capacity * size_of::<(K, V)>();
         let (align, _, size, oflo) = calculate_allocation(hashes_size,
                                                           align_of::<u64>(),
-                                                          keys_size,
-                                                          align_of::<K>(),
-                                                          vals_size,
-                                                          align_of::<V>());
+                                                          pairs_size,
+                                                          align_of::<(K, V)>());
 
         debug_assert!(!oflo, "should be impossible");
 
```
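For anyone following the offset arithmetic in the diff, the sketch below replays the new two-array layout math outside of libstd. `calculate_offsets` mirrors the added code; `round_up_to_next` is reconstructed here from its call sites and the `test_rounding` case (treat it as an assumption, not a verbatim copy of the std helper):

```rust
// Round `unrounded` up to the next multiple of `target_alignment`
// (reconstructed from its uses; alignments are powers of two).
fn round_up_to_next(unrounded: usize, target_alignment: usize) -> usize {
    assert!(target_alignment.is_power_of_two());
    (unrounded + target_alignment - 1) & !(target_alignment - 1)
}

// The allocation is [hashes | padding | pairs]: place the pairs array at the
// first properly aligned offset after the hashes, then add its size,
// reporting overflow via the bool instead of panicking.
fn calculate_offsets(hashes_size: usize,
                     pairs_size: usize,
                     pairs_align: usize)
                     -> (usize, usize, bool) {
    let pairs_offset = round_up_to_next(hashes_size, pairs_align);
    let (end_of_pairs, oflo) = pairs_offset.overflowing_add(pairs_size);
    (pairs_offset, end_of_pairs, oflo)
}

fn main() {
    // Reproduces one of the updated test cases: 6 bytes of hashes followed by
    // 12 bytes of pairs aligned to 4 -> pairs start at 8, buffer ends at 20.
    assert_eq!(calculate_offsets(6, 12, 4), (8, 20, false));
    println!("offsets check out");
}
```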