Downsampling and data retention
InfluxDB can handle hundreds of thousands of data points per second.Working with that much data over a long period of time can create storageconcerns.A natural solution is to downsample the data; keep the high precision raw datafor only a limited time, and store the lower precision, summarized data for muchlonger or forever.
InfluxDB offers two features - continuous queries (CQ) and retention policies(RP) - that automate the process of downsampling data and expiring old data.This guide describes a practical use case for CQs and RPs and covers how toset up those features in InfluxDB databases.
Definitions
A continuous query (CQ) is an InfluxQL query that runs automatically andperiodically within a database.CQs require a function in the SELECT
clause and must include aGROUP BY time()
clause.
A retention policy (RP) is the part of InfluxDB data structurethat describes for how long InfluxDB keeps data.InfluxDB compares your local server’s timestamp to the timestamps on your dataand deletes data that are older than the RP’s DURATION
.A single database can have several RPs and RPs are unique per database.
This guide will not go into detail about the syntax for creating and managingCQs and RPs.If you’re new to both concepts, we recommend looking over the detailedCQ documentation andRP documentation.
Sample data
This section uses fictional real-time data that track the number of food ordersto a restaurant via phone and via website at ten second intervals.We will store those data in adatabase called food_data
, inthe measurement orders
, andin the fields phone
and website
.
Sample:
name: orders
name: orders
time phone website2016-05-10T23:18:00Z 10 302016-05-10T23:18:10Z 12 392016-05-10T23:18:20Z 11 56
Goal
Assume that, in the long run, we’re only interested in the average number of orders by phoneand by website at 30 minute intervals.In the next steps, we use RPs and CQs to:
- Automatically aggregate the ten-second resolution data to 30-minute resolution data
- Automatically delete the raw, ten-second resolution data that are older than two hours
- Automatically delete the 30-minute resolution data that are older than 52 weeks
Database preparation
We perform the following steps before writing the data to the databasefood_data
.We do this before inserting any data because CQs only run against recentdata; that is, data with timestamps that are no older than now()
minusthe FOR
clause of the CQ, or now()
minus the GROUP BY time()
interval ifthe CQ has no FOR
clause.
1. Create the database
> CREATE DATABASE "food_data"
2. Create a two-hour DEFAULT retention policy
InfluxDB writes to the DEFAULT
retention policy if we do not supply an explicit RP whenwriting a point to the database.We make the DEFAULT
RP keep data for two hours, because we want InfluxDB toautomatically write the incoming ten-second resolution data to that RP.
Use theCREATE RETENTION POLICY
statement to create a DEFAULT
RP:
> CREATE RETENTION POLICY "two_hours" ON "food_data" DURATION 2h REPLICATION 1 DEFAULT
That query creates an RP called two_hours
that exists in the databasefood_data
.two_hours
keeps data for a DURATION
of two hours (2h
) and it’s the DEFAULT
RP for the database food_data
.
The replication factor (REPLICATION 1
) is a required parameter but must alwaysbe set to 1 for single node instances.
Note: When we created the
food_data
database in step 1, InfluxDBautomatically generated an RP namedautogen
and set it as theDEFAULT
RP for the database.Theautogen
RP has an infinite retention period.With the query above, the RPtwo_hours
replacesautogen
as theDEFAULT
RPfor thefood_data
database.
3. Create a 52-week retention policy
Next we want to create another retention policy that keeps data for 52 weeks and is not theDEFAULT
retention policy (RP) for the database.Ultimately, the 30-minute rollup data will be stored in this RP.
Use theCREATE RETENTION POLICY
statement to create a non-DEFAULT
retention policy:
> CREATE RETENTION POLICY "a_year" ON "food_data" DURATION 52w REPLICATION 1
That query creates a retention policy (RP) called a_year
that exists in the databasefood_data
.The a_year
setting keeps data for a DURATION
of 52 weeks (52w
).Leaving out the DEFAULT
argument ensures that a_year
is not the DEFAULT
RP for the database food_data
.That is, write and read operations against food_data
that do not specify anRP will still go to the two_hours
RP (the DEFAULT
RP).
4. Create the continuous query
Now that we’ve set up our RPs, we want to create a continuous query (CQ) that will automaticallyand periodically downsample the ten-second resolution data to the 30-minuteresolution, and then store those results in a different measurement with a differentretention policy.
Use theCREATE CONTINUOUS QUERY
statement to generate a CQ:
> CREATE CONTINUOUS QUERY "cq_30m" ON "food_data" BEGIN
SELECT mean("website") AS "mean_website",mean("phone") AS "mean_phone"
INTO "a_year"."downsampled_orders"
FROM "orders"
GROUP BY time(30m)
END
That query creates a CQ called cq_30m
in the database food_data
.cq_30m
tells InfluxDB to calculate the 30-minute average of the two fieldswebsite
and phone
in the measurement orders
and in the DEFAULT
RPtwo_hours
.It also tells InfluxDB to write those results to the measurementdownsampled_orders
in the retention policy a_year
with the field keysmean_website
and mean_phone
.InfluxDB will run this query every 30 minutes for the previous 30 minutes.
Note: Notice that we fully qualify (that is, we use the syntax
"<retention_policy>"."<measurement>"
) the measurement in theINTO
clause.InfluxDB requires that syntax to write data to an RP other than theDEFAULT
RP.
Results
With the new CQ and two new RPs, food_data
is ready to start receiving data.After writing data to our database and letting things run for a bit, we seetwo measurements: orders
and downsampled_orders
.
> SELECT * FROM "orders" LIMIT 5
name: orders
---------
time phone website
2016-05-13T23:00:00Z 10 30
2016-05-13T23:00:10Z 12 39
2016-05-13T23:00:20Z 11 56
2016-05-13T23:00:30Z 8 34
2016-05-13T23:00:40Z 17 32
> SELECT * FROM "a_year"."downsampled_orders" LIMIT 5
name: downsampled_orders
---------------------
time mean_phone mean_website
2016-05-13T15:00:00Z 12 23
2016-05-13T15:30:00Z 13 32
2016-05-13T16:00:00Z 19 21
2016-05-13T16:30:00Z 3 26
2016-05-13T17:00:00Z 4 23
The data in orders
are the raw, ten-second resolution data that reside in thetwo-hour RP.The data in downsampled_orders
are the aggregated, 30-minute resolution datathat are subject to the 52-week RP.
Notice that the first timestamps in downsampled_orders
are older than the firsttimestamps in orders
.This is because InfluxDB has already deleted data from orders
with timestampsthat are older than our local server’s timestamp minus two hours (assume we executed the SELECT
queries at 2016-05-14T00:59:59Z
).InfluxDB will only start dropping data from downsampled_orders
after 52 weeks.
Notes:
Notice that we fully qualify (that is, we use the syntax
"<retention_policy>"."<measurement>"
)downsampled_orders
inthe secondSELECT
statement. We must specify the RP in that query toSELECT
data that reside in an RP other than theDEFAULT
RP.By default, InfluxDB checks to enforce an RP every 30 minutes.Between checks,
orders
may have data that are older than two hours.The rate at which InfluxDB checks to enforce an RP is a configurable setting,seeDatabase Configuration.
Using a combination of RPs and CQs, we’ve successfully set up our database toautomatically keep the high precision raw data for a limited time, create lowerprecision data, and store that lower precision data for a longer period of time.Now that you have a general understanding of how these features can worktogether, check out the detailed documentation on CQs and RPsto see all that they can do for you.