[[_deep_dive_on_doc_values]]
=== Deep Dive on Doc Values

The last section opened by saying doc values are _"fast, efficient and memory-friendly"_.
Those are some nice marketing buzzwords, but how do doc values actually work?

Doc values are generated at index-time, alongside the creation of the inverted index.
That means doc values are generated on a per-segment basis and are immutable, just like
the inverted index used for search. And, like the inverted index, doc values are serialized
to disk. This is important to performance and scalability.

By serializing a persistent data structure to disk, we can rely on the OS's file
system cache to manage memory instead of retaining structures on the JVM heap.
In situations where the "working set" of data is smaller than the available
memory, the OS will naturally keep the doc values resident in memory. This gives
the same performance profile as on-heap data structures.

But when your working set is much larger than available memory, the OS will begin
paging the doc values on/off disk as required. This will obviously be slower
than an entirely memory-resident data structure, but it has the advantage of scaling
well beyond the server's memory capacity. If these data structures were
purely on-heap, the only option is to crash with an OutOfMemory exception (or to implement
a paging scheme just like the OS's).

[NOTE]
====
Because doc values are not managed by the JVM, Elasticsearch servers can be
configured with a much smaller heap. This gives more memory to the OS for caching.
It also has the benefit of letting the JVM's garbage collector work with a smaller
heap, which will result in faster and more efficient collection cycles.

Traditionally, the recommendation has been to dedicate 50% of the machine's memory
to the JVM heap. With the introduction of doc values, this recommendation is starting
to slide. Consider giving far less to the heap, perhaps 4-16gb on a 64gb machine,
instead of the full 32gb previously recommended.

For a more detailed discussion, see <<heap-sizing>>.
====


==== Column-store compression

At a high level, doc values are essentially a serialized _column-store_. As we
discussed in the last section, column-stores excel at certain operations because
the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space
on disk _and_ for faster access. Modern CPUs are many orders of magnitude faster
than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means
it is often advantageous to minimize the amount of data that must be read from disk,
even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

    Doc      Terms
    -----------------------------------------------------------------
    Doc_1 | 100
    Doc_2 | 1000
    Doc_3 | 1500
    Doc_4 | 1200
    Doc_5 | 300
    Doc_6 | 1900
    Doc_7 | 4200
    -----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:
`[100,1000,1500,1200,300,1900,4200]`. Because we know they are all numbers
(instead of a heterogeneous collection like you'd see in a document or row),
values can be packed tightly together with uniform offsets.
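
To make the column-stride idea concrete, here is a minimal Python sketch (the `price` field name is made up for illustration, and Lucene's real layout is a packed binary block, not a Python list):

[source,python]
----
# Illustrative sketch of turning row-oriented documents into a column.

docs = [
    {"doc": "Doc_1", "price": 100},
    {"doc": "Doc_2", "price": 1000},
    {"doc": "Doc_3", "price": 1500},
    {"doc": "Doc_4", "price": 1200},
    {"doc": "Doc_5", "price": 300},
    {"doc": "Doc_6", "price": 1900},
    {"doc": "Doc_7", "price": 4200},
]

# Row-oriented: each document is a heterogeneous bag of fields.
# Column-oriented: pull one field out of every document, in doc order,
# so the values can be packed contiguously with uniform offsets.
column = [d["price"] for d in docs]
print(column)  # [100, 1000, 1500, 1200, 300, 1900, 4200]
----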


Further, there are a variety of compression tricks we can apply to these numbers.
You'll notice that each of the above numbers is a multiple of 100. Doc values
detect when all the values in a segment share a _greatest common divisor_ and use
that to compress the values further.

If we save `100` as the divisor for this segment, we can divide each number by 100
to get: `[1,10,15,12,3,19,42]`. Now that the numbers are smaller, they require
fewer bits to store and we've reduced the size on-disk.
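
As a rough illustration of the savings, here is the divisor trick in a Python sketch (not Lucene's actual code):

[source,python]
----
from functools import reduce
from math import gcd

values = [100, 1000, 1500, 1200, 300, 1900, 4200]

# Find the greatest common divisor shared by every value in the segment.
divisor = reduce(gcd, values)            # 100

scaled = [v // divisor for v in values]  # [1, 10, 15, 12, 3, 19, 42]

# The widest original value (4200) needs 13 bits per entry;
# after dividing, the widest value (42) fits in 6 bits.
bits_before = max(v.bit_length() for v in values)  # 13
bits_after = max(v.bit_length() for v in scaled)   # 6
print(divisor, bits_before, bits_after)
----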

Doc values use several tricks like this. In order, the following compression
schemes are checked (see the sketch after this list):

1. If all values are identical (or missing), set a flag and record the value
2. If there are fewer than 256 unique values, a simple table encoding is used
3. If there are > 256 unique values, check to see if there is a common divisor
4. If there is no common divisor, encode everything as an offset from the smallest value
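
A sketch of that decision ladder in Python (illustrative only; the real Lucene encoders are considerably more involved):

[source,python]
----
from functools import reduce
from math import gcd

def choose_scheme(values):
    """Pick a compression scheme, mirroring the ordering above."""
    present = [v for v in values if v is not None]
    if len(set(present)) <= 1:
        return "single-value flag"        # 1. all identical (or missing)
    if len(set(present)) < 256:
        return "table encoding"           # 2. small set of unique values
    if reduce(gcd, present) > 1:
        return "greatest-common-divisor"  # 3. shared divisor
    return "offset-from-minimum"          # 4. delta from the smallest value
----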

You'll note that these compression schemes are not "traditional" general-purpose
compression like DEFLATE or LZ4. Because the structure of column-stores is
rigid and well-defined, we can achieve higher compression by using specialized
schemes rather than the more general compression algorithms like LZ4.

[NOTE]
====
You may be thinking _"Well that's great for numbers, but what about strings?"_
Strings are encoded similarly, with the help of an ordinal table. The
strings are de-duplicated and sorted into a table, assigned an ID, and then those
IDs are used as numeric doc values. Which means strings enjoy many of the same
compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable
or prefix-encoded strings.
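
A toy illustration of the ordinal idea (a Python sketch; Lucene's ordinal encoding is more sophisticated):

[source,python]
----
field_values = ["apple", "banana", "apple", "cherry", "banana"]

# De-duplicate and sort the strings into an ordinal table...
ordinal_table = sorted(set(field_values))  # ['apple', 'banana', 'cherry']

# ...then store each document's value as its ID in that table, so the
# per-document data is just small integers that compress like numerics.
ordinals = [ordinal_table.index(v) for v in field_values]
print(ordinals)  # [0, 1, 0, 2, 1]
----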
====

==== Disabling Doc Values

Doc values are enabled by default for all fields _except_ analyzed strings. That means
all numerics, geo_points, dates, IPs and `not_analyzed` strings.

Analyzed strings are not able to use doc values at this time; the analysis process
generates many tokens and does not work efficiently with doc values. We'll discuss
using analyzed strings for aggregations in <<aggregations-and-analysis>>.

Because doc values are on by default, you have the option to aggregate and sort
on most fields in your dataset. But what if you know you will _never_ aggregate,
sort or script on a certain field?

While rare, these circumstances do arise and you may wish to disable doc values
on that particular field. This will save you some disk space (since the doc values
are not being serialized to disk anymore) and may increase indexing speed slightly
(since the doc values don't need to be generated).

To disable doc values, set `doc_values: false` in the field's mapping. For example,
here we create a new index where doc values are disabled for the `"session_id"` field:

[source,js]
----
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "session_id": {
          "type":       "string",
          "index":      "not_analyzed",
          "doc_values": false <1>
        }
      }
    }
  }
}
----
<1> By setting `doc_values: false`, this field will not be usable in aggregations, sorts
or scripts

It is possible to configure the inverse relationship too: make a field available
for aggregations via doc values, but make it unavailable for normal search by disabling
the inverted index. For example (the `customer_token` field name here is illustrative):

[source,js]
----
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "customer_token": {
          "type":       "string",
          "doc_values": true, <1>
          "index":      "no" <2>
        }
      }
    }
  }
}
----
<1> Doc values are enabled to allow aggregations
<2> Indexing is disabled, which makes the field unavailable to queries/searches

By setting `doc_values: true` and `index: no`, we generate a field which can _only_
be used in aggregations/sorts/scripts. This is admittedly a very rare requirement,
but sometimes useful.
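
For instance, such a field is still fully usable from a client. A hedged sketch with the Python `elasticsearch` client (call shapes follow the 2.x-era API and vary by version; `customer_token` is the hypothetical field from the mapping above):

[source,python]
----
from elasticsearch import Elasticsearch

es = Elasticsearch()

# The field can still drive aggregations, because doc values exist...
resp = es.search(index="my_index", body={
    "size": 0,
    "aggs": {
        "tokens": {"terms": {"field": "customer_token"}}
    },
})
print(resp["aggregations"]["tokens"]["buckets"])

# ...but a query against the same field matches nothing, because the
# inverted index was never built ("index": "no").
----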
