Explore your data with runtime fields

Consider a large set of log data that you want to extract fields from. Indexing the data is time-consuming and uses a lot of disk space, and you just want to explore the data structure without committing to a schema up front.

You know that your log data contains specific fields that you want to extract. In this case, we want to focus on the @timestamp and message fields. By using runtime fields, you can define scripts to calculate values at search time for these fields.

Define indexed fields as a starting point

You can start with a simple example by adding the @timestamp and message fields to the my-index-000001 mapping as indexed fields. To remain flexible, use wildcard as the field type for message:

  PUT /my-index-000001/
  {
    "mappings": {
      "properties": {
        "@timestamp": {
          "format": "strict_date_optional_time||epoch_second",
          "type": "date"
        },
        "message": {
          "type": "wildcard"
        }
      }
    }
  }

Ingest some data

After mapping the fields you want to retrieve, index a few records from your log data into Elasticsearch. The following request uses the bulk API to index raw log data into my-index-000001. Instead of indexing all of your log data, you can use a small sample to experiment with runtime fields.

The final document is not in a valid Apache log format, but the script can account for that scenario.

  POST /my-index-000001/_bulk?refresh
  {"index":{}}
  {"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
  {"index":{}}
  {"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
  {"index":{}}
  {"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
  {"index":{}}
  {"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
  {"index":{}}
  {"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
  {"index":{}}
  {"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
  {"index":{}}
  {"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}

At this point, you can view how Elasticsearch stores your raw data.

  GET /my-index-000001

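The mapping contains the two fields you defined, @timestamp and message, along with a timestamp field. Elasticsearch added timestamp dynamically because the sample documents use that name rather than @timestamp.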

  {
    "my-index-000001" : {
      "aliases" : { },
      "mappings" : {
        "properties" : {
          "@timestamp" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_second"
          },
          "message" : {
            "type" : "wildcard"
          },
          "timestamp" : {
            "type" : "date"
          }
        }
      },
      ...
    }
  }

Define a runtime field with a grok pattern

If you want to retrieve results that include clientip, you can add that field as a runtime field in the mapping. The following runtime script defines a grok pattern that extracts structured fields out of a single text field within a document. A grok pattern is like a regular expression that supports aliased expressions that you can reuse.

The script matches on the %{COMMONAPACHELOG} log pattern, which understands the structure of Apache logs. If the pattern matches (clientip != null), the script emits the value of the matching IP address. If the pattern doesn’t match, clientip is null and the script emits no value for that document instead of crashing.

  PUT my-index-000001/_mappings
  {
    "runtime": {
      "http.client_ip": {
        "type": "ip",
        "script": """
          String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
          if (clientip != null) emit(clientip);
        """
      }
    }
  }

The if (clientip != null) condition ensures that the script doesn’t crash even if the message doesn’t match the pattern.
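
To check the runtime field once it is in the mapping, one option is to request it through the fields parameter of a search. The following is a minimal sketch that assumes the sample documents above have been indexed. Setting _source to false is optional and only trims the response down to the computed values.

```console
GET my-index-000001/_search
{
  "_source": false,
  "fields": ["http.client_ip"]
}
```

With the sample data, the hit for the invalid log line should come back without an http.client_ip value, because the script emits nothing for it.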

Alternatively, you can define the same runtime field in the context of a search request. The runtime definition and the script are the same as the ones defined previously in the index mapping, except that the field is named http.clientip here. Copy that definition into the search request under the runtime_mappings section and include a query that matches on the runtime field. This query returns the same results as a query on the runtime field defined in your index mappings, but the definition applies only in the context of this specific search:

  GET my-index-000001/_search
  {
    "runtime_mappings": {
      "http.clientip": {
        "type": "ip",
        "script": """
          String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
          if (clientip != null) emit(clientip);
        """
      }
    },
    "query": {
      "match": {
        "http.clientip": "40.135.0.0"
      }
    },
    "fields" : ["http.clientip"]
  }
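
A runtime field defined under runtime_mappings can be used in aggregations as well as in queries. As a sketch that reuses the same search-time definition, the following request counts the sample requests per client IP without returning any hits. The aggregation name requests_per_client is a placeholder chosen for this example.

```console
GET my-index-000001/_search
{
  "size": 0,
  "runtime_mappings": {
    "http.clientip": {
      "type": "ip",
      "script": """
        String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  },
  "aggs": {
    "requests_per_client": {
      "terms": {
        "field": "http.clientip"
      }
    }
  }
}
```

Documents whose message doesn’t match the pattern have no value for the field, so they should not contribute to any bucket.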

Define a composite runtime field

You can also define a composite runtime field to emit multiple fields from a single script. You define a set of typed subfields and emit a map of values. At search time, each subfield retrieves the value associated with its name in the map. This means that you only need to specify your grok pattern one time and can return multiple values:

  PUT my-index-000001/_mappings
  {
    "runtime": {
      "http": {
        "type": "composite",
        "script": "emit(grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message\"].value))",
        "fields": {
          "clientip": {
            "type": "ip"
          },
          "verb": {
            "type": "keyword"
          },
          "response": {
            "type": "long"
          }
        }
      }
    }
  }
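
Each subfield of the composite field is addressed by its full name, such as http.verb or http.response. The following sketch, which assumes the composite mapping above has been applied, groups the sample documents by response code. The aggregation name response_codes is a placeholder chosen for this example.

```console
GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "http.response"
      }
    }
  }
}
```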

Search for a specific IP address

Using the http.clientip runtime field, you can define a simple query to run a search for a specific IP address and return all related fields.

  GET my-index-000001/_search
  {
    "query": {
      "match": {
        "http.clientip": "40.135.0.0"
      }
    },
    "fields" : ["*"]
  }

The API returns the following result. Because http is a composite runtime field, the response includes each of the sub-fields under fields, including any associated values that match the query. Without building your data structure in advance, you can search and explore your data in meaningful ways to experiment and determine which fields to index.

  {
    ...
    "hits" : {
      "total" : {
        "value" : 1,
        "relation" : "eq"
      },
      "max_score" : 1.0,
      "hits" : [
        {
          "_index" : "my-index-000001",
          "_type" : "_doc",
          "_id" : "sRVHBnwBB-qjgFni7h_O",
          "_score" : 1.0,
          "_source" : {
            "timestamp" : "2020-04-30T14:30:17-05:00",
            "message" : "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
          },
          "fields" : {
            "http.verb" : [
              "GET"
            ],
            "http.clientip" : [
              "40.135.0.0"
            ],
            "http.response" : [
              200
            ],
            "message" : [
              "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
            ],
            "http.client_ip" : [
              "40.135.0.0"
            ],
            "timestamp" : [
              "2020-04-30T19:30:17.000Z"
            ]
          }
        }
      ]
    }
  }

Also, remember that if statement in the script?

  if (clientip != null) emit(clientip);

If the script didn’t include this condition, the query would fail on any shard that contains a message that doesn’t match the pattern. By including this condition, the query skips data that doesn’t match the grok pattern.
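
Because the script emits nothing for messages that don’t match, one way to find those messages is to search for documents where the runtime field has no value. The following is a sketch that assumes the http.client_ip runtime field defined earlier is still in the mapping. It wraps an exists query in a must_not clause.

```console
GET my-index-000001/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "http.client_ip"
        }
      }
    }
  }
}
```

With the sample data, this should return only the document whose message is "not a valid apache log".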

Search for documents in a specific range

You can also run a range query that operates on the timestamp field. The following query returns any documents where the timestamp is greater than or equal to 2020-04-30T14:31:27-05:00:

  GET my-index-000001/_search
  {
    "query": {
      "range": {
        "timestamp": {
          "gte": "2020-04-30T14:31:27-05:00"
        }
      }
    }
  }

The response includes the document where the log format doesn’t match, but the timestamp falls within the defined range.

  {
    ...
    "hits" : {
      "total" : {
        "value" : 2,
        "relation" : "eq"
      },
      "max_score" : 1.0,
      "hits" : [
        {
          "_index" : "my-index-000001",
          "_type" : "_doc",
          "_id" : "hdEhyncBRSB6iD-PoBqe",
          "_score" : 1.0,
          "_source" : {
            "timestamp" : "2020-04-30T14:31:27-05:00",
            "message" : "252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
          }
        },
        {
          "_index" : "my-index-000001",
          "_type" : "_doc",
          "_id" : "htEhyncBRSB6iD-PoBqe",
          "_score" : 1.0,
          "_source" : {
            "timestamp" : "2020-04-30T14:31:28-05:00",
            "message" : "not a valid apache log"
          }
        }
      ]
    }
  }
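
A range query on the indexed timestamp field can also be combined with runtime fields in the same request. The following sketch assumes the composite http field defined earlier is still in the mapping. It bounds the range on both sides and requests two of the subfields through the fields parameter. The bounds are arbitrary values picked to cover part of the sample data.

```console
GET my-index-000001/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2020-04-30T14:31:00-05:00",
        "lt": "2020-04-30T14:32:00-05:00"
      }
    }
  },
  "fields": ["http.clientip", "http.response"]
}
```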

Define a runtime field with a dissect pattern

If you don’t need the power of regular expressions, you can use dissect patterns instead of grok patterns. Dissect patterns match on fixed delimiters but are typically faster than grok.

You can use dissect to achieve the same results as parsing the Apache logs with a grok pattern. Instead of matching on a log pattern, you include the parts of the string that you want to discard. Paying special attention to the parts of the string you want to discard will help build successful dissect patterns.

  PUT my-index-000001/_mappings
  {
    "runtime": {
      "http.client.ip": {
        "type": "ip",
        "script": """
          String clientip=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.clientip;
          if (clientip != null) emit(clientip);
        """
      }
    }
  }
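
To compare the dissect-based field with the grok-based one, you could request both in a single search. This sketch assumes that the http.client_ip field from the grok example and the http.client.ip field above are both present in the mapping.

```console
GET my-index-000001/_search
{
  "_source": false,
  "fields": ["http.client_ip", "http.client.ip"]
}
```

For the sample Apache log lines, both fields should hold the same address.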

Similarly, you can define a dissect pattern to extract the HTTP response code:

  PUT my-index-000001/_mappings
  {
    "runtime": {
      "http.responses": {
        "type": "long",
        "script": """
          String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
          if (response != null) emit(Integer.parseInt(response));
        """
      }
    }
  }

You can then run a query to retrieve a specific HTTP response using the http.responses runtime field. Use the fields parameter of the _search request to indicate which fields you want to retrieve:

  GET my-index-000001/_search
  {
    "query": {
      "match": {
        "http.responses": "304"
      }
    },
    "fields" : ["http.client_ip","timestamp","http.verb"]
  }

The response includes a single document where the HTTP response is 304:

  {
    ...
    "hits" : {
      "total" : {
        "value" : 1,
        "relation" : "eq"
      },
      "max_score" : 1.0,
      "hits" : [
        {
          "_index" : "my-index-000001",
          "_type" : "_doc",
          "_id" : "A2qDy3cBWRMvVAuI7F8M",
          "_score" : 1.0,
          "_source" : {
            "timestamp" : "2020-04-30T14:31:22-05:00",
            "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
          },
          "fields" : {
            "http.verb" : [
              "GET"
            ],
            "http.client_ip" : [
              "247.37.0.0"
            ],
            "timestamp" : [
              "2020-04-30T19:31:22.000Z"
            ]
          }
        }
      ]
    }
  }
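
Because http.responses is typed as long, it can also be used in numeric range queries. The following sketch, which assumes the http.responses mapping above, looks for any response code in the 3xx range rather than one exact value.

```console
GET my-index-000001/_search
{
  "query": {
    "range": {
      "http.responses": {
        "gte": 300,
        "lt": 400
      }
    }
  },
  "fields": ["http.responses"]
}
```

With the sample data, only the document with the 304 response should match.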