fielddata
Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, requires a different access pattern from search.
Search needs to answer the question “Which documents contain this term?”, while sorting and aggregations need to answer a different question: “What is the value of this field for this document?”.
Most fields can use index-time, on-disk doc_values
for this data access pattern, but text
fields do not support doc_values
.
Instead, text
fields use a query-time in-memory data structure called fielddata
. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.
Fielddata is disabled on text
fields by default
Fielddata can consume a lot of heap space, especially when loading high cardinality text
fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Also, loading fielddata is an expensive process which can cause users to experience latency hits. This is why fielddata is disabled by default.
If you try to sort, aggregate, or access values from a script on a text
field, you will see this exception:
Fielddata is disabled on text fields by default. Set `fielddata=true` on
[`your_field_name`] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.
Before enabling fielddata
Before you enable fielddata, consider why you are using a text
field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York
can be found by searching for new
or for york
. A terms
aggregation on this field will return a new
bucket and a york
bucket, when you probably want a single bucket called New York
.
Instead, you should have a text
field for full text searches, and an unanalyzed keyword
field with doc_values
enabled for aggregations, as follows:
PUT my-index-000001
{
"mappings": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Use the | |
Use the |
Enabling fielddata on text
fields
You can enable fielddata on an existing text
field using the PUT mapping API as follows:
PUT my-index-000001/_mapping
{
"properties": {
"my_field": {
"type": "text",
"fielddata": true
}
}
}
The mapping that you specify for |
fielddata_frequency_filter
Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:
The frequency filter allows you to only load terms whose document frequency falls between a min
and max
value, which can be expressed an absolute number (when the number is bigger than 1.0) or as a percentage (eg 0.01
is 1%
and 1.0
is 100%
). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.
Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size
:
PUT my-index-000001
{
"mappings": {
"properties": {
"tag": {
"type": "text",
"fielddata": true,
"fielddata_frequency_filter": {
"min": 0.001,
"max": 0.1,
"min_segment_size": 500
}
}
}
}
}