How to Back Up & Restore Syndeia Cloud

This KB article discusses various aspects one must consider when backing up and restoring some or all data related to Syndeia Cloud (SC).

Applies to:

  • Syndeia Cloud 3.3

  • Syndeia Cloud 3.4

  • Syndeia Cloud 3.5

  • Syndeia Cloud 3.6

 Overview

How best to back up and restore SC is a somewhat complex question, and the answer depends on several factors:

  1. which aspects you wish to back up (i.e., installation files + configured settings vs. data)

  2. what conditions you can assume about the restoration environment (i.e., will it always have the same software versions? be on AWS? etc.)

  3. what cluster topology and Replication Factor (RF) you set for each Cassandra keyspace (RF determines how many copies of your data exist and, with the topology, how they are spread across the nodes of your cluster)

  4. how much data you have (i.e., can it all live on one node, or has it grown to the point where it no longer fits on a single node?)

For INSTALLATION FILES + CONFIGURED SETTINGS ON SYSTEMS:

  1. where you can access the raw files (JG + SC): you can use any imaging tool or cloud service (e.g., on AWS, one could create either a volume snapshot or an AMI).

  2. where you can't access the raw files (e.g., AWS MSK (Kafka + Zookeeper)): you will need to refer to the AWS MSK documentation on how to do that (if it is possible at all).
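As a rough illustration of the first case on AWS, the two imaging options could be scripted with the AWS CLI as sketched below. The helper names (run, snapshot_volume, image_instance), the DRY_RUN switch, and every ID or label passed in are hypothetical placeholders, not part of SC; only the aws ec2 create-snapshot and aws ec2 create-image subcommands are real.

```shell
#!/usr/bin/env bash
# Sketch only: image an EC2-hosted node with the AWS CLI.
# Helper names and all IDs/labels are hypothetical placeholders.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Snapshot a single EBS volume.
snapshot_volume() {
  local volume_id="$1" label="$2"
  run aws ec2 create-snapshot --volume-id "$volume_id" --description "$label"
}

# Create an AMI from a whole instance.
image_instance() {
  local instance_id="$1" name="$2"
  run aws ec2 create-image --instance-id "$instance_id" --name "$name"
}
```

Running with DRY_RUN=1 first (for example, DRY_RUN=1 snapshot_volume <volume_id> <label>) prints the command for review without calling AWS. Note that create-image reboots the instance by default unless --no-reboot is passed, so it is worth checking the AWS CLI reference before using it on a live node.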

For APPLICATION DATA in:

  1. Kafka (via AWS MSK): you may be able to run a Kafka consumer on a separate "backup node" (AWS EC2 or other) that "listens" in the background and logs all events put on the stream; you can then periodically archive the generated logs with any backup tool of your choosing. Note that the value of backing up the Kafka stream is debatable, as it is currently only used to convey the main SC data to the graphing subsystem. Theoretically one could replay the Kafka stream from any point to reconstruct the evolution of the graph over time, but that isn't something we do yet. Also, the event sourcing system we use in the backend already supports that.

  2. Cassandra (which is the DB for JG + Syndeia Cloud): WARNING: Confusingly, DataStax/Apache Cassandra document multiple ways to back up/restore data, each with its own pros/cons. They are listed below in rough order of portability, management simplicity, and FOSS/PAID:

    1. schema^ + CQLSH CSV Backup/Restore: Apache Cassandra CQLSH command, Intercax tested (see Backup & Restore Methods for Syndeia Cloud Keyspace in Cassandra for SC 3.4 & Backup & Restore Methods for Syndeia Cloud Keyspace in Cassandra for SC 3.3)

      • Pros:

        1. data is complete, ie: this is an entire backup vs an incremental,

        2. data is very portable to DB even outside of Cassandra,

        3. backup can work with any past, present or future Cassandra versions

      • Cons:

        1. could potentially generate large (but compressible) files,

        2. supposedly somewhat slow (may want to tweak CQLSH settings to avoid timeouts (see troubleshooting section))

      • NET = +1

    2. schema^ + dsbulk (CSV/JSON Backup/Export): DataStax provided accessory utility (see https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkAbout.html )

      • Pros:

        1. supposedly this is a faster performing CSV/JSON export,

      • Cons:

        1. more complex; originally meant as a tool to ingest non-Cassandra data and not explicitly meant for backups/restores

      • NET = 0

    3. schema^ + nodetool snapshot (& optionally incremental backups^^) + nodetool refresh: Apache Cassandra included (see https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsBackupIncremental.html , https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsRefresh.html , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsBackupSnapshotRestore.html#Restoringfromlocalnodes , and Restoring a snapshot into a new cluster )

      • Pros:

        1. almost instantaneous (as it hard-links the live DB files rather than copying them),

      • Cons:

        1. Cassandra version needs to be the same,

        2. initial tokens need to be backed up (see Restoring a snapshot into a new cluster ),

        3. the process is node-specific,

        4. snapshot / refresh needs to be run simultaneously/atomically on multiple (potentially all) nodes

      • NET = -3

    4. schema^ + token backup + nodetool snapshot (& optionally incremental backups^^) + sstableloader: Apache Cassandra included (see https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsBackupIncremental.html , sstableloader (Cassandra bulk loader) , Restoring from a snapshot , and Restoring a snapshot into a new cluster )

      • Pros:

        1. almost instantaneous (as it hard-links the live DB files rather than copying them),

        2. don't have to deal with backup of token ranges,

        3. can deal with a different number of nodes or replication strategy.

      • Cons:

        1. Cassandra versions need to be the same,

        2. the process is node-specific,

        3. snapshot / load needs to be run simultaneously/atomically on multiple (potentially all) nodes (depending on the replication factor)

      • NET = 0

    5. schema + cassandra-data-copy-tool: 3rd-party FOSS tool (GitHub: wildengineer/cassandra-data-copy-tool, a tool for table-to-table data migration in Cassandra); could be used to periodically create a "cold" backup by streaming from the live cluster to a backup cluster

      • Pros:

        1. Simplifies cluster-to-cluster copying,

        2. The source and destination tables do not need to be on the same cluster or keyspace. All you need to ensure is that the destination table is compatible with the source table.

      • Cons:

        1. Requires second cluster,

        2. Unknown atomicity (presumably the source cluster's data is streamed to the target)

      • NET = 0

    6. Medusa: 3rd-party FOSS tool (GitHub: thelastpickle/cassandra-medusa, an Apache Cassandra backup and restore tool)

      • Pros:

        1. schema automatically handled?,

        2. tokens automatically handled?,

        3. Cluster wide in place restore (restoring on the same cluster that was used for the backup),

        4. Cluster wide remote restore (restoring on a different cluster than the one used for the backup),

        5. Support for Google Cloud Storage (GCS) and AWS S3 through Apache Libcloud (can be extended to support other storage providers supported by Apache Libcloud)

      • Cons:

        1. Cassandra deployments with multiple data folder directories not supported yet

        2. Still somewhat beta-ish but endorsed by Apache Cassandra

      • NET = +3

    7. DataStax OpsCenter™: 2nd-party PAID tool https://docs.datastax.com/en/opscenter/6.1/opsc/about_c.html (see also https://www.youtube.com/watch?v=HLy1rV4BTT8 )

      • Pros:

        1. very simple to use GUI

        2. handles automatic snapshotting across all nodes simultaneously

      • Cons:

        1. Not compatible with OSS Apache Cassandra or DataStax Distribution of Apache Cassandra™ (DDAC) clusters, i.e., requires migrating to a DataStax distribution of Cassandra that OpsCenter supports

        2. Not free

      • NET = 0

    8. Datastax Astra™: 2nd-party PAID tool (see https://www.datastax.com/products/datastax-astra/pricing )

      • Pros:

        1. 80GB Free-tier

        2. Cloud-based (you choose: AWS, GCP, Azure)

        3. Reduced on-site h/w IT maintenance

        4. Can scale on-demand

      • Cons:

        1. Not free after free-tier

        2. Somewhat complex metered rates based on credits ($1 = 1 credit)

        3. Cloud-based (may not be viable for some security-sensitive deployments)

        4. Credits expire after 1-year

      • NET = 0
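For option 1 above (schema + CQLSH CSV), a per-table export and import can be sketched with the cqlsh COPY command. This is a minimal sketch under assumptions, not the Intercax-tested procedure from the linked articles: the helper names, the DRY_RUN switch, the REQUEST_TIMEOUT default of 3600 seconds, and all hosts, users, keyspaces, tables, and paths are hypothetical placeholders. The raised request timeout reflects the note above about tweaking CQLSH settings to avoid timeouts.

```shell
#!/usr/bin/env bash
# Sketch only: per-table CSV export/import through cqlsh COPY.
# Helper names, REQUEST_TIMEOUT and all arguments are hypothetical placeholders.

REQUEST_TIMEOUT="${REQUEST_TIMEOUT:-3600}"  # seconds; assumed value, tune to your data size

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Export one table to <dir>/<keyspace>.<table>.csv with a header row.
export_table() {
  local host="$1" user="$2" keyspace="$3" table="$4" dir="$5"
  run cqlsh -u "$user" --request-timeout="$REQUEST_TIMEOUT" \
    -e "COPY ${keyspace}.${table} TO '${dir}/${keyspace}.${table}.csv' WITH HEADER = TRUE" \
    "$host"
}

# Import the same CSV back; the schema must already exist on the target.
import_table() {
  local host="$1" user="$2" keyspace="$3" table="$4" dir="$5"
  run cqlsh -u "$user" --request-timeout="$REQUEST_TIMEOUT" \
    -e "COPY ${keyspace}.${table} FROM '${dir}/${keyspace}.${table}.csv' WITH HEADER = TRUE" \
    "$host"
}
```

COPY works one table at a time, so a full backup would loop over every table in each keyspace, and the exported schema (see footnote ^) has to be applied to the target before any import_table call.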

Footnotes:

  • ^ the schema needs to be exported first, e.g., by running: cqlsh -u <superuser> -e 'DESCRIBE KEYSPACE syndeia_cloud_' <node_FQDN> > syndeia-cloud_v<version_#>_backup_schema.cql

  • ^^ one can optionally also attempt to back up commit logs, but IMO these are an unnecessary complexity that can be avoided by running nodetool flush prior to creating any Cassandra "backup" or snapshot (see https://support.datastax.com/hc/en-us/articles/115001593706-Manual-Backup-and-Restore-with-Point-in-time-and-table-level-restore- for a more comprehensive explanation of how Cassandra "backups", snapshots, and commitlogs work)

  • ^^^ Cassandra 3rd-party backup/restore options: I have not tested these methods yet; they could be better or worse than the official DataStax/Apache Cassandra options.
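The flush-then-snapshot order suggested in footnote ^^ can be sketched as a small wrapper around nodetool. The function name flush_and_snapshot, the run helper, the DRY_RUN switch, and the keyspace and tag values are hypothetical; nodetool flush and nodetool snapshot -t are the only real commands used. This is an illustration of the ordering, not a complete backup procedure.

```shell
#!/usr/bin/env bash
# Sketch only: flush memtables to disk, then snapshot one keyspace.
# Function names, keyspace and tag are hypothetical placeholders.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Flush first so the snapshot does not depend on commit logs,
# then take a tagged snapshot of the keyspace on the local node.
flush_and_snapshot() {
  local keyspace="$1" tag="$2"
  run nodetool flush "$keyspace" && run nodetool snapshot -t "$tag" "$keyspace"
}
```

nodetool acts on the node it is run against, so this would have to be invoked on each relevant node at roughly the same time, as noted in the cons for the snapshot-based options.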
