Set up Clouddriver to use SQL

You can configure Clouddriver to use a MySQL-compatible database in place of Redis for all of its persistence use cases, which provides more resiliency for your deployment. These use cases are:

  1. Caching the state of all supported cloud resources and graphing their relationships
  2. The cloud operations task repository
  3. Distributed locking for the caching agent scheduler

You can also mix backends, using Redis for some of the above and SQL for others. A DualTaskRepository class is provided to enable online migrations between Redis and SQL in either direction without impacting running pipelines.
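
Each of these use cases has its own toggle in Clouddriver’s configuration, which is what makes mixed Redis/SQL deployments possible. The full configuration appears later in this guide; the fragment below is only an illustrative sketch of how the three use cases map to those toggles.

  sql:
    cache:            # use case 1: caching cloud resource state
      enabled: true
    taskRepository:   # use case 2: cloud operations task repository
      enabled: true
    scheduler:        # use case 3: locking for the caching agent scheduler
      enabled: true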

This guide covers database and Clouddriver configuration details, as well as tips on how Clouddriver is operated at Netflix.

Why SQL

Growth in Netflix’s infrastructure footprint over time exposed a number of scaling challenges with Clouddriver’s original Redis-backed caching implementation. For example, the functional use of unordered Redis sets as secondary indexes incurs linear cost while also making it challenging to shard the keyspace without hotspots. Additionally, Clouddriver’s data is highly relational, making it a good candidate for modeling in either a relational or a graph database.

Small Spinnaker installations (managing up to a few thousand instances across all applications) are unlikely to see significant performance or cost gains by running Clouddriver on an RDBMS as opposed to Redis. In the Netflix production environment, we have seen improvements of up to 400% in the time it takes to load the clusters view for large applications (hundreds of server groups with thousands of instances) since migrating from Redis to Aurora. We also see greater consistency in 99th percentile times under peak load and have reduced the number of instances required to operate Clouddriver. Besides improving performance at scale, we believe the SQL implementation is better suited to future growth. We’ve load-tested Clouddriver backed by Aurora with the Netflix production data set at more than 10x traffic rates, rates that previously resulted in Redis-related outages.

At some point in the future, Spinnaker may drop support for Redis, but today it remains a good choice for evaluating Spinnaker as well as for local development.

Note that cache provider classes within Clouddriver may need development to take full advantage of the SqlCache provider. The initial release adds secondary indexing by application, specifically to accelerate the /applications/{application}/serverGroups calls made by the UI, but only the AWS caching agents and providers initially take advantage of this. Prior to adding application-based indexing of AWS resources, Netflix still saw performance and consistency gains using Aurora over Redis. Performance of the SQL cache provider should improve over time as Clouddriver’s data access patterns evolve to better utilize features of the underlying storage.

Configuration Considerations

At Netflix, Orca and Clouddriver each run with their own dedicated Aurora cluster, but this isn’t required. When sizing a database cluster for Clouddriver, make sure the full dataset fits within the InnoDB buffer pool. If migrating an existing Clouddriver installation from Redis to SQL, provisioning database instances with 4x the RAM of the current Redis working set should ensure a fit, with room for growth. Much of the additional memory use is accounted for by secondary indexes.
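
As a rough way to verify this fit on a MySQL-compatible database, you can compare the size of the clouddriver schema (data plus indexes) against the configured InnoDB buffer pool. This is an optional check, not part of the official setup:

  -- Approximate size of the clouddriver dataset, including secondary indexes, in MB.
  SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024) AS clouddriver_mb
  FROM information_schema.tables
  WHERE table_schema = 'clouddriver';

  -- Configured InnoDB buffer pool size, in bytes.
  SHOW VARIABLES LIKE 'innodb_buffer_pool_size';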

Throughput and write latency are important considerations, especially for Spinnaker installations that manage many accounts and regions. Each distinct (cloudProvider, account, region, resourceType) tuple generally gets an independently scheduled caching-agent instance within Clouddriver. The number of these instances per Clouddriver configuration can be considered the upper bound for concurrent write-heavy sessions to the database, unless limited by the sql.agent.maxConcurrentAgents property. Caching-agent write throughput can directly impact deployment times, especially in large environments with many accounts or multiple cloud providers. Clouddriver only writes new or modified resources to the database, however, which moderates write requirements after the initial cache population.
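
If you need to bound that concurrency, the sql.agent.maxConcurrentAgents property mentioned above can be set in Clouddriver’s YAML configuration; the value below is purely illustrative and should be tuned to your database’s write capacity.

  sql:
    agent:
      # Illustrative cap on concurrently executing caching agents.
      maxConcurrentAgents: 100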

Database Setup

Clouddriver ships with mysql-connector-java by default. You can provide additional JDBC connectors on the classpath, such as mariadb-connector-j, if desired. Use of a different RDBMS family will likely require some development effort at this time. Clouddriver was developed targeting Amazon Aurora’s MySQL 5.7-compatible engine.

Before you deploy Clouddriver, you need to manually create a logical database and user grants.

  CREATE DATABASE `clouddriver` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

  GRANT
    SELECT, INSERT, UPDATE, DELETE, CREATE, EXECUTE, SHOW VIEW
  ON `clouddriver`.*
  TO 'clouddriver_service'@'%'; -- IDENTIFIED BY "password" if using password based auth

  GRANT
    SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, LOCK TABLES, EXECUTE, SHOW VIEW
  ON `clouddriver`.*
  TO 'clouddriver_migrate'@'%'; -- IDENTIFIED BY "password" if using password based auth
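
To sanity-check the result, you can review the grants afterward. This is an optional verification step, not part of the required setup:

  SHOW GRANTS FOR 'clouddriver_service'@'%';
  SHOW GRANTS FOR 'clouddriver_migrate'@'%';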

The following MySQL configuration parameter is required:

  • tx_isolation: set to READ-COMMITTED

The following MySQL configuration parameter may improve performance for large data sets:

  • tmp_table_size: increase this if the Created_tmp_disk_tables MySQL metric regularly grows
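
As a sketch of how these might be applied on a self-managed MySQL 5.7 server (on Aurora, RDS, or Cloud SQL they are typically set through a parameter group or database flags instead), with an illustrative tmp_table_size value:

  -- Required isolation level for Clouddriver.
  SET GLOBAL tx_isolation = 'READ-COMMITTED';

  -- Illustrative value; raise only if Created_tmp_disk_tables keeps growing.
  SET GLOBAL tmp_table_size = 256 * 1024 * 1024;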

Configure Clouddriver to use MySQL

The following YAML-based parameters provide a Clouddriver configuration that uses MySQL entirely in place of Redis. Halyard does not yet natively support Clouddriver SQL configuration; Halyard users can provide these parameters as overrides via a clouddriver-local.yml file.

  sql:
    enabled: true
    # read-only boolean toggles `SELECT` or `DELETE` health checks for all pools.
    # Especially relevant for clouddriver-ro and clouddriver-ro-deck which can
    # target a SQL read replica in their default pools.
    read-only: false
    taskRepository:
      enabled: true
    cache:
      enabled: true
      # These parameters were determined to be optimal via benchmark comparisons
      # in the Netflix production environment with Aurora. Setting these too low
      # or high may negatively impact performance. These values may be sub-optimal
      # in some environments.
      readBatchSize: 500
      writeBatchSize: 300
    scheduler:
      enabled: true
    # Enable clouddriver-caching's clean up agent to periodically purge old
    # clusters and accounts. Set to true when using the Kubernetes provider.
    unknown-agent-cleanup-agent:
      enabled: false
    connectionPools:
      default:
        # additional connection pool parameters are available here,
        # for more detail and to view defaults, see:
        # https://github.com/spinnaker/kork/blob/master/kork-sql/src/main/kotlin/com/netflix/spinnaker/kork/sql/config/ConnectionPoolProperties.kt
        default: true
        jdbcUrl: jdbc:mysql://your.database:3306/clouddriver
        user: clouddriver_service
        # password: depending on db auth and how spinnaker secrets are managed
      # The following tasks connection pool is optional. At Netflix, clouddriver
      # instances pointed to Aurora read replicas have a tasks pool pointed at the
      # master. Instances where the default pool is pointed to the master omit a
      # separate tasks pool.
      tasks:
        user: clouddriver_service
        jdbcUrl: jdbc:mysql://your.database:3306/clouddriver
    migration:
      user: clouddriver_migrate
      jdbcUrl: jdbc:mysql://your.database:3306/clouddriver

  redis:
    enabled: false
    cache:
      enabled: false
    scheduler:
      enabled: false
    taskRepository:
      enabled: false
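
If you deploy Spinnaker with Halyard, these overrides normally live in the deployment’s custom profiles directory; the path below assumes the default Halyard deployment name.

  # ~/.hal/default/profiles/clouddriver-local.yml
  sql:
    enabled: true
    # ...remainder of the configuration shown above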

Agent Scheduling

The YAML above configures Clouddriver to use a MySQL table as part of a locking service for the caching-agent scheduler. This works well for Netflix in production against Aurora but, depending on how your database is configured, could result in poor overall performance for all of Clouddriver. If you observe high database CPU utilization, lock contention, or caching agents not running consistently at the expected frequency, try the Redis-based scheduler to determine whether SQL-based locking is a contributing factor.

The following modification to the above example configures Clouddriver to use SQL for the cache and task repository, and Redis for agent scheduling.

IMPORTANT NOTE FOR SQL USERS NOT USING AMAZON AURORA: Problems have been reported when moving the Clouddriver agent scheduler to SQL on both Google Cloud SQL and Amazon RDS for MySQL. The issue can cause unpredictable behavior in Clouddriver’s ability to schedule caching agents. If you use one of these backends for Clouddriver caching, it is recommended to keep your agent scheduler in Redis; a configuration example that does this is below. Please see this issue for additional information.

  sql:
    scheduler:
      enabled: false

  redis:
    enabled: true
    connection: redis://your.redis
    cache:
      enabled: false
    scheduler:
      enabled: true
    taskRepository:
      enabled: false

Maintaining Task Repository Availability While Migrating from Redis to SQL

If you’re migrating from Redis to SQL in a production environment, where you need to avoid pipeline failures and downtime, you can modify the configuration as shown below to have Clouddriver use SQL for caching and new tasks, with fallback reads to Redis for tasks not found in the SQL database.

  redis:
    enabled: true
    connection: redis://your.redis
    cache:
      enabled: false
    scheduler:
      enabled: false
    taskRepository:
      enabled: true

  dualTaskRepository:
    enabled: true
    primaryClass: com.netflix.spinnaker.clouddriver.sql.SqlTaskRepository
    previousClass: com.netflix.spinnaker.clouddriver.data.task.jedis.RedisTaskRepository

How Netflix Migrated Clouddriver from Redis to SQL

The following steps were taken to live migrate Clouddriver in the Netflix production Spinnaker stack from Redis to SQL.

  1. Provision the database. In this case, a multi-AZ Aurora cluster was provisioned with 3 reader instances.
  2. Deploy Clouddriver with SQL enabled only for the dualTaskRepository. During the migration, traffic is split between Redis-backed Clouddriver instances and SQL-backed instances. It is important that tasks can be read regardless of request routing to avoid pipeline failures.

    # clouddriver.yml; there were no modifications to the redis: properties
    # at this point. The following properties were added:
    sql:
      enabled: true
      taskRepository:
        enabled: true
      cache:
        enabled: false
        readBatchSize: 500
        writeBatchSize: 300
      scheduler:
        enabled: false
      connectionPools:
        default:
          default: true
          jdbcUrl: jdbc:mysql://clouddriver-aurora-cluster-endpoint:3306/clouddriver
          user: clouddriver_service
          password: hi! # actually injected from encrypted secrets
      migration:
        user: clouddriver_migrate
        jdbcUrl: jdbc:mysql://clouddriver-aurora-cluster-endpoint:3306/clouddriver

    dualTaskRepository:
      enabled: true
      primaryClass: com.netflix.spinnaker.clouddriver.data.task.jedis.RedisTaskRepository
      previousClass: com.netflix.spinnaker.clouddriver.sql.SqlTaskRepository

  3. A custom Spring profile was created and scoped to a set of temporary clouddriver-sql-migration clusters; in this profile, the dualTaskRepository class ordering is flipped.

    sql:
      enabled: true
      taskRepository:
        enabled: true
      cache:
        enabled: true
        readBatchSize: 500
        writeBatchSize: 300
      scheduler:
        enabled: true
      connectionPools:
        default:
          default: true
          jdbcUrl: jdbc:mysql://clouddriver-aurora-cluster-endpoint:3306/clouddriver
          user: clouddriver_service
          password: hi! # actually injected from encrypted secrets
      migration:
        user: clouddriver_migrate
        jdbcUrl: jdbc:mysql://clouddriver-aurora-cluster-endpoint:3306/clouddriver

    redis:
      enabled: true
      connection: redis://your.redis
      cache:
        enabled: false
      scheduler:
        enabled: false
      taskRepository:
        enabled: true

    dualTaskRepository:
      enabled: true
      primaryClass: com.netflix.spinnaker.clouddriver.sql.SqlTaskRepository
      previousClass: com.netflix.spinnaker.clouddriver.data.task.jedis.RedisTaskRepository

  4. A clouddriver-sql-migration-caching server group was deployed with the above configuration, followed by a 5-minute WaitStage to allow time for cache population. Clouddriver API server groups (with caching-agent execution disabled) were then deployed behind the same load balancers routing traffic to the Redis-backed server groups.

  5. After another 5-minute wait, the Redis-backed server groups were disabled. Any tasks still running on the disabled Redis-backed instances continued until finished, while all new requests were routed to the clouddriver-sql-migration-api server groups. If a requested taskId was not present in the SQL database, Clouddriver attempted to read it from Redis.
  6. The clouddriver-sql-migration configuration was merged into the main configuration, with the following changes:

    redis:
      enabled: false

    dualTaskRepository:
      enabled: false

  7. The disabled Redis-backed Clouddriver instances were verified as idle, and the new configuration was deployed via a red/black deployment.

  8. The temporary migration clusters were disabled and then terminated 5 minutes later.
