Content Comparison

...

Kafka (via AWS MSK): you may be able to run a Kafka consumer on a separate "backup node" (AWS EC2 or other) that simply listens in the background and logs every event put on the stream; you then periodically archive the generated logs with any backup tool of your choosing. Note: the value of backing up the Kafka stream is debatable, as it is currently used only to convey the main SC data to the graphing subsystem. Theoretically, one could replay the Kafka stream from any point to reconstruct the evolution of the graph over time, but that isn't something we do yet. Also, the event sourcing system we use in the backend already supports that.
Cassandra (the DB for JG + Syndeia Cloud): WARNING: confusingly, DataStax/Apache Cassandra document multiple ways to back up and restore data, each with pros and cons. They are listed below in rough order of portability, management simplicity, and FOSS/PAID:
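The "backup node" listener described above could be sketched roughly as follows. This is only an illustration, not something we run today: the function name, file naming scheme and rotation size are hypothetical. It accepts any iterable of records exposing topic, partition, offset, timestamp, key and value attributes and writes them to rotating JSON-lines files, which can then be archived with any backup tool.

```python
import json
from pathlib import Path
from typing import Any, Iterable, List


def _to_text(raw: Any) -> Any:
    """Decode bytes payloads so they can be stored as JSON; pass other values through."""
    if isinstance(raw, (bytes, bytearray)):
        return bytes(raw).decode("utf-8", errors="replace")
    return raw


def archive_records(records: Iterable[Any], out_dir: Path,
                    max_per_file: int = 10_000) -> List[Path]:
    """Write each record as one JSON line, starting a new file every max_per_file records."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    handle = None
    count = 0
    try:
        for record in records:
            # Rotate to a new file when none is open or the current one is full.
            if handle is None or count >= max_per_file:
                if handle is not None:
                    handle.close()
                path = out_dir / f"events-{len(written):06d}.jsonl"  # hypothetical naming
                handle = path.open("w", encoding="utf-8")
                written.append(path)
                count = 0
            handle.write(json.dumps({
                "topic": record.topic,
                "partition": record.partition,
                "offset": record.offset,
                "timestamp": record.timestamp,
                "key": _to_text(record.key),
                "value": _to_text(record.value),
            }) + "\n")
            count += 1
    finally:
        if handle is not None:
            handle.close()
    return written
```

Assuming the kafka-python client is used, a KafkaConsumer subscribed to the relevant topic could be passed in as records, since it is iterable and yields records with these attributes. The topic name, consumer group and any MSK-specific TLS/authentication settings would have to be supplied for the actual deployment. Keeping the topic, partition and offset with each event is what would make a later replay from a given point possible.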
1. schema^ + CQLSH CSV Backup/Restore: Apache Cassandra CQLSH command, Intercax tested (see Backup & Restore Methods for Syndeia Cloud Keyspace in Cassandra )
  - Pros:
    1. data is complete, i.e., this is an entire backup rather than an incremental one,
    2. data is very portable, even to DBs outside of Cassandra,
    3. backup can work with any past, present or future Cassandra versions
  - Cons:
    1. could potentially generate large (but compressible) files,
    2. supposedly somewhat slow (may want to tweak CQLSH settings to avoid timeouts; see troubleshooting section)
  - NET = +1
2. schema^ + dsbulk (CSV/JSON Backup/Export): DataStax provided accessory utility (see https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkAbout.html )
  - Pros:
    1. supposedly this is a faster performing CSV/JSON export,
  - Cons:
    1. more complex; originally meant as a tool for ingesting non-Cassandra data, not explicitly for backup/restore
  - NET = 0
3. schema^ + nodetool snapshot (& optionally incremental backups^^) + nodetool refresh: Apache Cassandra included (see https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsBackupIncremental.html , https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsRefresh.html , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsBackupSnapshotRestore.html#Restoringfromlocalnodes , and https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsSnapshotRestoreNewCluster.html )
  - Pros:
    1. almost instantaneous (as it creates hard links to the live DB files rather than copying them),
  - Cons:
    1. Cassandra version needs to be the same,
    2. initial tokens need to be backed up (see https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsSnapshotRestoreNewCluster.html ),
    3. process is node-specific,
    4. snapshot / refresh needs to be run simultaneously/atomically on multiple (potentially all) nodes
  - NET = -3
4. schema^ + token backup + nodetool snapshot (& optionally incremental backups^^) + sstableloader: Apache Cassandra included (see https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsBackupIncremental.html , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/tools/toolsBulkloader.html?hl=sstableloader#toolsBulkloader__bulkloader-restoring-snapshots , https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsBackupSnapshotRestore.html#Restoringfromcentralizedbackups , and https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsSnapshotRestoreNewCluster.html )
  - Pros:
    1. almost instantaneous (as it creates hard links to the live DB files rather than copying them),
    2. don't have to deal with backup of token ranges,
    3. can deal with a different number of nodes or replication strategy.
  - Cons:
    1. Cassandra versions need to be the same,
    2. process is node-specific,
    3. snapshot / refresh needs to be run simultaneously/atomically on multiple (potentially all) nodes (depending on the replication factor)
  - NET = 0
5. schema + cassandra-data-copy-tool: 3rd-party FOSS tool (see https://github.com/wildengineer/cassandra-data-copy-tool ); could be used to periodically create a "cold" backup by streaming from the live cluster to a backup cluster
  - Pros:
    1. Simplifies cluster-to-cluster copying,
    2. The source and destination tables do not need to be on the same cluster or keyspace. All you need to ensure is that the destination table is compatible with the source table.
  - Cons:
    1. Requires second cluster,
    2. Unknown atomicity (presumably cluster is streamed to target)
  - NET = 0
6. Medusa: 3rd-party FOSS tool https://github.com/thelastpickle/cassandra-medusa
  - Pros:
    1. schema automatically handled?,
    2. tokens automatically handled?,
    3. Cluster wide in place restore (restoring on the same cluster that was used for the backup),
    4. Cluster wide remote restore (restoring on a different cluster than the one used for the backup),
    5. Support for Google Cloud Storage (GCS) and AWS S3 through Apache Libcloud (can be extended to support other storage providers supported by Apache Libcloud)
  - Cons:
    1. Cassandra deployments with multiple data folder directories not supported yet
    2. Still somewhat beta-ish but endorsed by Apache Cassandra
  - NET = +3
7. DataStax OpsCenter™: 2nd-party PAID tool https://docs.datastax.com/en/opscenter/6.1/opsc/about_c.html (see also https://www.youtube.com/watch?v=HLy1rV4BTT8 )
  - Pros:
    1. very simple to use GUI
    2. handles automatic snapshotting across all nodes simultaneously
  - Cons:
    1. Not compatible with OSS Apache Cassandra or DataStax Distribution of Apache Cassandra™ (DDAC) clusters, i.e., requires migrating to DataStax's commercial distribution (DataStax Enterprise)
    2. Not free
  - NET = 0
8. DataStax Astra™: 2nd-party PAID tool (see https://www.datastax.com/products/datastax-astra/pricing )
  - Pros:
    1. 80GB Free-tier
    2. Cloud-based (you choose: AWS, GCP, Azure)
    3. Reduced on-site h/w IT maintenance
    4. Can scale on-demand
  - Cons:
    1. Not free after free-tier
    2. Somewhat complex metered rates based on credits ($1 = 1 credit)
    3. Cloud-based (may not be viable for some security-sensitive deployments)
    4. Credits expire after 1-year
  - NET = 0
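To make the first and third methods in the list above more concrete, the sketch below assembles the underlying command lines: cqlsh with DESCRIBE KEYSPACE for the schema, cqlsh with COPY ... TO / COPY ... FROM for the CSV export and import, and nodetool snapshot for a tagged snapshot. It is a minimal illustration under assumptions, not the tested Intercax procedure: the host, keyspace, table, path and tag values are hypothetical placeholders, and the commands are only printed, not executed.

```python
import shlex
from typing import List


def schema_dump_cmd(host: str, keyspace: str) -> List[str]:
    """cqlsh command that prints the CQL schema of one keyspace (redirect stdout to a file)."""
    return ["cqlsh", host, "-e", f"DESCRIBE KEYSPACE {keyspace}"]


def csv_export_cmd(host: str, keyspace: str, table: str, csv_path: str) -> List[str]:
    """cqlsh command that exports one table to a CSV file with a header row."""
    stmt = f"COPY {keyspace}.{table} TO '{csv_path}' WITH HEADER = TRUE"
    return ["cqlsh", host, "-e", stmt]


def csv_import_cmd(host: str, keyspace: str, table: str, csv_path: str) -> List[str]:
    """cqlsh command that loads a previously exported CSV file back into a table."""
    stmt = f"COPY {keyspace}.{table} FROM '{csv_path}' WITH HEADER = TRUE"
    return ["cqlsh", host, "-e", stmt]


def snapshot_cmd(keyspace: str, tag: str) -> List[str]:
    """nodetool command that takes a tagged snapshot of one keyspace on the local node."""
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def render(cmd: List[str]) -> str:
    """Quote a command list so it can be pasted into a shell."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    host, keyspace, table = "127.0.0.1", "example_ks", "example_table"
    print(render(schema_dump_cmd(host, keyspace)))
    print(render(csv_export_cmd(host, keyspace, table, "/backup/example_table.csv")))
    print(render(csv_import_cmd(host, keyspace, table, "/backup/example_table.csv")))
    print(render(snapshot_cmd(keyspace, "example_tag")))
```

The CSV commands would need to be repeated for every table in the keyspace, and the snapshot command would have to be run on each node, which is the node-specific drawback noted for the snapshot-based methods. Building the commands as argument lists keeps quoting explicit if they are later handed to a scheduler or to subprocess.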

...

Version	Old Version 1	New Version 2
Changes made by	Brian Miller	Brian Miller
Saved on	Oct 25, 2021	Oct 25, 2021
