[[_deep_dive_on_doc_values]]
=== Deep Dive on Doc Values

The last section opened by saying doc values are _"fast, efficient and memory-friendly"_.
Those are some nice marketing buzzwords, but how do doc values actually work?

Doc values are generated at index-time, alongside the creation of the inverted index.
That means doc values are generated on a per-segment basis and are immutable, just like
the inverted index used for search. And, like the inverted index, doc values are serialized
to disk. This is important to performance and scalability.

By serializing a persistent data structure to disk, we can rely on the OS's file
system cache to manage memory instead of retaining structures on the JVM heap.
In situations where the "working set" of data is smaller than the available
memory, the OS will naturally keep the doc values resident in memory. This gives
the same performance profile as on-heap data structures.

But when your working set is much larger than available memory, the OS will begin
paging the doc values on/off disk as required. This will obviously be slower
than an entirely memory-resident data structure, but it has the advantage of scaling
well beyond the server's memory capacity. If these data structures were
purely on-heap, the only option is to crash with an OutOfMemory exception (or to implement
a paging scheme just like the OS's).

[NOTE]
====
Because doc values are not managed by the JVM, Elasticsearch servers can be
configured with a much smaller heap. This gives more memory to the OS for caching.
It also has the benefit of letting the JVM's garbage collector work with a smaller
heap, which will result in faster and more efficient collection cycles.

Traditionally, the recommendation has been to dedicate 50% of the machine's memory
to the JVM heap. With the introduction of doc values, this recommendation is starting
to slide. Consider giving far less to the heap, perhaps 4-16gb on a 64gb machine,
instead of the full 32gb previously recommended.

For a more detailed discussion, see <<heap-sizing>>.
====


==== Column-store compression

At a high level, doc values are essentially a serialized _column-store_. As we
discussed in the last section, column-stores excel at certain operations because
the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space
on disk _and_ for faster access. Modern CPUs are many orders of magnitude faster
than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means
it is often advantageous to minimize the amount of data that must be read from disk,
even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

    Doc      Terms
    -----------------------------------------------------------------
    Doc_1 | 100
    Doc_2 | 1000
    Doc_3 | 1500
    Doc_4 | 1200
    Doc_5 | 300
    Doc_6 | 1900
    Doc_7 | 4200
    -----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:
`[100,1000,1500,1200,300,1900,4200]`. Because we know they are all numbers
(instead of a heterogeneous collection like you'd see in a document or row),
values can be packed tightly together with uniform offsets.
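
To make the column-stride idea concrete, here is a minimal Python sketch (the `price` field name is made up for illustration, and Lucene's real layout is a packed binary block, not a Python list):

[source,python]
----
# Illustrative sketch of turning row-oriented documents into a column.

docs = [
    {"doc": "Doc_1", "price": 100},
    {"doc": "Doc_2", "price": 1000},
    {"doc": "Doc_3", "price": 1500},
    {"doc": "Doc_4", "price": 1200},
    {"doc": "Doc_5", "price": 300},
    {"doc": "Doc_6", "price": 1900},
    {"doc": "Doc_7", "price": 4200},
]

# Row-oriented: each document is a heterogeneous bag of fields.
# Column-oriented: pull one field out of every document, in doc order,
# so the values can be packed contiguously with uniform offsets.
column = [d["price"] for d in docs]
print(column)  # [100, 1000, 1500, 1200, 300, 1900, 4200]
----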


Further, there are a variety of compression tricks we can apply to these numbers.
You'll notice that each of the above numbers is a multiple of 100. Doc values
detect when all the values in a segment share a _greatest common divisor_ and use
that to compress the values further.

If we save `100` as the divisor for this segment, we can divide each number by 100
to get: `[1,10,15,12,3,19,42]`. Now that the numbers are smaller, they require
fewer bits to store and we've reduced the size on-disk.
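
As a rough illustration of the savings, here is the divisor trick in a Python sketch (not Lucene's actual code):

[source,python]
----
from functools import reduce
from math import gcd

values = [100, 1000, 1500, 1200, 300, 1900, 4200]

# Find the greatest common divisor shared by every value in the segment.
divisor = reduce(gcd, values)            # 100

scaled = [v // divisor for v in values]  # [1, 10, 15, 12, 3, 19, 42]

# The widest original value (4200) needs 13 bits per entry;
# after dividing, the widest value (42) fits in 6 bits.
bits_before = max(v.bit_length() for v in values)  # 13
bits_after = max(v.bit_length() for v in scaled)   # 6
print(divisor, bits_before, bits_after)
----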

Doc values use several tricks like this. In order, the following compression
schemes are checked (see the sketch after this list):

1. If all values are identical (or missing), set a flag and record the value
2. If there are fewer than 256 unique values, a simple table encoding is used
3. If there are > 256 unique values, check to see if there is a common divisor
4. If there is no common divisor, encode everything as an offset from the smallest value
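
A sketch of that decision ladder in Python (illustrative only; the real Lucene encoders are considerably more involved):

[source,python]
----
from functools import reduce
from math import gcd

def choose_scheme(values):
    """Pick a compression scheme, mirroring the ordering above."""
    present = [v for v in values if v is not None]
    if len(set(present)) <= 1:
        return "single-value flag"        # 1. all identical (or missing)
    if len(set(present)) < 256:
        return "table encoding"           # 2. small set of unique values
    if reduce(gcd, present) > 1:
        return "greatest-common-divisor"  # 3. shared divisor
    return "offset-from-minimum"          # 4. delta from the smallest value
----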

You'll note that these compression schemes are not "traditional" general-purpose
compression like DEFLATE or LZ4. Because the structure of column-stores is
rigid and well-defined, we can achieve higher compression by using specialized
schemes rather than the more general compression algorithms like LZ4.

[NOTE]
====
You may be thinking _"Well that's great for numbers, but what about strings?"_
Strings are encoded similarly, with the help of an ordinal table. The
strings are de-duplicated and sorted into a table, assigned an ID, and then those
IDs are used as numeric doc values. Which means strings enjoy many of the same
compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable
or prefix-encoded strings.
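
A toy illustration of the ordinal idea (a Python sketch; Lucene's ordinal encoding is more sophisticated):

[source,python]
----
field_values = ["apple", "banana", "apple", "cherry", "banana"]

# De-duplicate and sort the strings into an ordinal table...
ordinal_table = sorted(set(field_values))  # ['apple', 'banana', 'cherry']

# ...then store each document's value as its ID in that table, so the
# per-document data is just small integers that compress like numerics.
ordinals = [ordinal_table.index(v) for v in field_values]
print(ordinals)  # [0, 1, 0, 2, 1]
----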
====

==== Disabling Doc Values

Doc values are enabled by default for all fields _except_ analyzed strings. That means
all numerics, geo_points, dates, IPs and `not_analyzed` strings.

Analyzed strings are not able to use doc values at this time; the analysis process
generates many tokens and does not work efficiently with doc values. We'll discuss
using analyzed strings for aggregations in <<aggregations-and-analysis>>.

Because doc values are on by default, you have the option to aggregate and sort
on most fields in your dataset. But what if you know you will _never_ aggregate,
sort or script on a certain field?

While rare, these circumstances do arise and you may wish to disable doc values
on that particular field. This will save you some disk space (since the doc values
are not being serialized to disk anymore) and may increase indexing speed slightly
(since the doc values don't need to be generated).

To disable doc values, set `doc_values: false` in the field's mapping. For example,
here we create a new index where doc values are disabled for the `"session_id"` field:

[source,js]
----
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "session_id": {
          "type":       "string",
          "index":      "not_analyzed",
          "doc_values": false <1>
        }
      }
    }
  }
}
----
<1> By setting `doc_values: false`, this field will not be usable in aggregations, sorts
or scripts

It is possible to configure the inverse relationship too: make a field available
for aggregations via doc values, but make it unavailable for normal search by disabling
the inverted index. For example (the `customer_token` field name here is illustrative):

[source,js]
----
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "customer_token": {
          "type":       "string",
          "doc_values": true, <1>
          "index":      "no" <2>
        }
      }
    }
  }
}
----
<1> Doc values are enabled to allow aggregations
<2> Indexing is disabled, which makes the field unavailable to queries/searches

By setting `doc_values: true` and `index: no`, we generate a field which can _only_
be used in aggregations/sorts/scripts. This is admittedly a very rare requirement,
but sometimes useful.
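
For instance, such a field is still fully usable from a client. A hedged sketch with the Python `elasticsearch` client (call shapes follow the 2.x-era API and vary by version; `customer_token` is the hypothetical field from the mapping above):

[source,python]
----
from elasticsearch import Elasticsearch

es = Elasticsearch()

# The field can still drive aggregations, because doc values exist...
resp = es.search(index="my_index", body={
    "size": 0,
    "aggs": {
        "tokens": {"terms": {"field": "customer_token"}}
    },
})
print(resp["aggregations"]["tokens"]["buckets"])

# ...but a query against the same field matches nothing, because the
# inverted index was never built ("index": "no").
----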
