Backup & Restore Methods for Syndeia Cloud Keyspace in Cassandra

Overview

Cassandra has multiple methods to backup and restore data depending on the scenario and the amount of data you wish to backup/restore.

(info) Note, this document will not discuss the backup and restore for the application software itself (ie: all the software/dependencies you installed or symlinked into /opt) or the configuration for it.  Use whatever conventional tool(s) you are comfortable with to backup/restore that.  

This document will go over the simplest & most portable method (using the CQL COPY ... TO / COPY ... FROM CQL commands), but be aware that other methods (ie: sstableloader with nodetool snapshot + schema & token export/import or 3rd party ETL tools) are available too but have their own caveats. Refer to the Apache Cassandra and DataStax documentation (https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/migrating.html & https://web.archive.org/web/20161227010514/http://datascale.io/cloning-cassandra-clusters-fast-way/) for further details.

There are three main types of data that we need to be concerned about when backing up & restoring your database:

  • Schema: the schema is a description of what type(s) of data each table contains, think of it like column headers to table(s) of items
  • Data: this is actual application table data.
  • Metadata: this is supporting meta-data used by the database software, ex: Cassandra node token assignments.

For the method we are discussing here, we only need to be concerned about the first two.  


Backup

The following will backup the Syndeia Cloud database schema + keyspace tables automatically via a shell script.

1. Copy and paste the below into a new file named syndeia_cloud-3.3_backup.bash (set the password for syndeia_admin on L4 (default is myPw)):  

(info) Note L3-4 are environment variables that can be optionally picked up from your shell if you export them out beforehand, ie:  export version=3.3.2021-01-27.SP1; syndeia_admin=foo , useful for automation or any (scheduled) cron job)

#!/bin/env bash
cassandra_jg_host=${1:-localhost}
version=${version:-3.3}
syndeia_admin_pw=${syndeia_admin_pw:-myPw}
SC_HOME=${SC_HOME:-/opt/icx/syndeia-cloud-current}
keyspaces=(syndeia_cloud_auth syndeia_cloud_devops syndeia_cloud_graph syndeia_cloud_graph_config syndeia_cloud_store)
tables=(syndeia_cloud_store.snapshots \
syndeia_cloud_store.metadata \
syndeia_cloud_store.offsetstore \
syndeia_cloud_store.auto_key \
syndeia_cloud_store.messages \
syndeia_cloud_store.repositories \
syndeia_cloud_store.repositories_by_uri \
syndeia_cloud_store.containers \
syndeia_cloud_store.containers_by_repository_key \
syndeia_cloud_store.containers_by_external_key \
syndeia_cloud_store.containers_by_external_id \
syndeia_cloud_store.artifacts \
syndeia_cloud_store.artifacts_by_key \
syndeia_cloud_store.artifacts_by_container_key \
syndeia_cloud_store.artifacts_by_external_key \
syndeia_cloud_store.artifacts_by_external_id \
syndeia_cloud_store.artifacts_by_type_key \
syndeia_cloud_store.relations \
syndeia_cloud_store.relations_by_container_key \
syndeia_cloud_store.config \
syndeia_cloud_store.relations_by_key \
syndeia_cloud_store.relations_by_external_id \
syndeia_cloud_store.relations_by_external_key \
syndeia_cloud_store.relations_by_source_key \
syndeia_cloud_store.relations_by_target_key \
syndeia_cloud_store.repository_types \
syndeia_cloud_store.container_types \
syndeia_cloud_store.container_types_by_repository_key \
syndeia_cloud_store.container_types_by_external_key \
syndeia_cloud_store.container_types_by_external_id \
syndeia_cloud_store.artifact_types \
syndeia_cloud_store.artifact_types_by_repository_key \
syndeia_cloud_store.artifact_types_by_external_key \
syndeia_cloud_store.artifact_types_by_external_id \
syndeia_cloud_store.relation_types \
syndeia_cloud_store.relation_types_by_repository_key \
syndeia_cloud_store.relation_types_by_external_key \
syndeia_cloud_store.relation_types_by_external_id \
syndeia_cloud_auth.snapshots \
syndeia_cloud_auth.metadata \
syndeia_cloud_auth.offsetstore \
syndeia_cloud_auth.messages \
syndeia_cloud_auth.config \
syndeia_cloud_auth.auto_key \
syndeia_cloud_auth.users \
syndeia_cloud_auth.users_by_external_id \
syndeia_cloud_auth.users_by_external_key \
syndeia_cloud_auth.users_by_user_name)
LC_NUMERIC=C
# TIMEFORMAT='{"real":%R,"user":%U,"sys":%S}'
TIMEFORMAT='{"real":%R}'
if [[ ! -d SC_backups ]]; then
    mkdir SC_backups
fi;
echo "$(date) - Dump schema for ${keyspaces}..."
for keyspace in ${keyspaces[@]} do;
    dst_filename="SC_backups/${keyspace}_schema_backup.cql"
    all_keyspace_filenames="${all_keyspace_filenames} ${keyspace}_schema_backup.cql"
    echo "$(date)   - Dump schema for ${keyspace}..."
    sudo cqlsh -u syndeia_admin -p ${syndeia_admin_pw} -e "DESCRIBE KEYSPACE ${keyspace};" "${cassandra_jg_host}" > ${dst_filename} > /dev/null
    if [[ $? -ne 0 ]]; then
        echo "$(date) x Error, could not export schema ${keyspace}"
        exit 1
    else
        echo "$(date) - Successfully exported schema ${keyspace}"
    fi;
	cat 
done;
cat "${all_keyspace_filenames}" > SC_backups/syndeia-cloud-${version}_concatenated_schema_backup.cql > /dev/null
echo "$(date) - Export data for tables..."
for table in ${tables[@]} do;
    dst_filename="SC_backups/syndeia-cloud-${version}_backup_${table}.csv"
	# Generate MD5 hash for NULL values during each table export
	md5hash=$(date -Iseconds | md5sum | cut -f1 -d' ')
	echo "${md5hash}" > SC_backups/${table}_md5hash_for_NULL.txt
    echo "$(date)  - Export data for ${table} to ${dst_filename} using md5hash=${md5hash} for NULL..."
    CQL_to_backup_table="COPY ${table} TO '${dst_filename}' WITH HEADER = TRUE AND NULL='${md5hash}'; "
    sudo cqlsh -u syndeia_admin -p ${syndeia_admin_pw} -e "${CQL_to_backup_table}" "${cassandra_jg_host}" > /dev/null
    if [[ $? -ne 0 ]]; then
        echo "$(date) x Error, could not export table ${table}"
        exit 2
    else
        echo "$(date) - Successfully exported table ${table}"
    fi;
done;

2.  Make the script executable, ie:  chmod ug+x syndeia_cloud-3.3_backup.bash

3.  To use the script, run the following command to from the CLI: syndeia_cloud-3.3_backup.bash <node_FQDN>

      where <node_FQDN> is the node Fully Qualified Domain Name (ex: cassandra.mycompany.com) or if unspecified, localhost by default.  

     For each successfully exported schema and table, you should see output similar to the following (note, the # of exported rows):

Mon May 10 16:57:49 EDT 2021 - Successfully exported schema syndeia_cloud_auth

Using 3 child processes

Starting copy of syndeia_cloud_auth.users with columns [id, key, email, provider_id, provider_key, activated, avatar_url, created_by, created_date, first_name, last_name, modified_by, modified_date, permissions, roles].
Processed: 11 rows; Rate:      19 rows/s; Avg. rate:       4 rows/s
11 rows exported to 1 files in 2.698 seconds.



Restore

The following will restore the Syndeia Cloud database schema + keyspace tables automatically via a shell script.

1. Copy and paste the below into a new file named syndeia_cloud-3.3_restore.bash (set the password for syndeia_admin on L4 (default is myPw)):  

(info) Note L3-4 are environment variables that can be optionally picked up from your shell if you export them out beforehand, ie:  export version=3.3.2021-01-27.SP1; syndeia_admin=foo , useful for automation)

#!/bin/env bash
cassandra_jg_host=${1:-localhost}
version=${version:-3.3}
syndeia_admin_pw=${syndeia_admin_pw:-myPw}
SC_HOME=${SC_HOME:-/opt/icx/syndeia-cloud-current}
keyspaces=(syndeia_cloud_auth syndeia_cloud_devops syndeia_cloud_graph syndeia_cloud_graph_config syndeia_cloud_store)
tables=(syndeia_cloud_store.snapshots \
syndeia_cloud_store.metadata \
syndeia_cloud_store.offsetstore \
syndeia_cloud_store.auto_key \
syndeia_cloud_store.messages \
syndeia_cloud_store.repositories \
syndeia_cloud_store.repositories_by_uri \
syndeia_cloud_store.containers \
syndeia_cloud_store.containers_by_repository_key \
syndeia_cloud_store.containers_by_external_key \
syndeia_cloud_store.containers_by_external_id \
syndeia_cloud_store.artifacts \
syndeia_cloud_store.artifacts_by_key \
syndeia_cloud_store.artifacts_by_container_key \
syndeia_cloud_store.artifacts_by_external_key \
syndeia_cloud_store.artifacts_by_external_id \
syndeia_cloud_store.artifacts_by_type_key \
syndeia_cloud_store.relations \
syndeia_cloud_store.relations_by_container_key \
syndeia_cloud_store.config \
syndeia_cloud_store.relations_by_key \
syndeia_cloud_store.relations_by_external_id \
syndeia_cloud_store.relations_by_external_key \
syndeia_cloud_store.relations_by_source_key \
syndeia_cloud_store.relations_by_target_key \
syndeia_cloud_store.repository_types \
syndeia_cloud_store.container_types \
syndeia_cloud_store.container_types_by_repository_key \
syndeia_cloud_store.container_types_by_external_key \
syndeia_cloud_store.container_types_by_external_id \
syndeia_cloud_store.artifact_types \
syndeia_cloud_store.artifact_types_by_repository_key \
syndeia_cloud_store.artifact_types_by_external_key \
syndeia_cloud_store.artifact_types_by_external_id \
syndeia_cloud_store.relation_types \
syndeia_cloud_store.relation_types_by_repository_key \
syndeia_cloud_store.relation_types_by_external_key \
syndeia_cloud_store.relation_types_by_external_id \
syndeia_cloud_auth.snapshots \
syndeia_cloud_auth.metadata \
syndeia_cloud_auth.offsetstore \
syndeia_cloud_auth.messages \
syndeia_cloud_auth.config \
syndeia_cloud_auth.auto_key \
syndeia_cloud_auth.users \
syndeia_cloud_auth.users_by_external_id \
syndeia_cloud_auth.users_by_external_key \
syndeia_cloud_auth.users_by_user_name)
LC_NUMERIC=C
# TIMEFORMAT='{"real":%R,"user":%U,"sys":%S}'
TIMEFORMAT='{"real":%R}'
echo "$(date) - Restore schema for ${keyspaces}..."
src_filename="SC_backups/syndeia-cloud-${version}_concatenated_schema_backup.cql"
sudo cqlsh -u syndeia_admin -p ${syndeia_admin_pw} -e "SOURCE '${src_filename}';" "${cassandra_jg_host}"
if [[ $? -ne 0 ]]; then
    echo "$(date) x Error, could not import concatenated schema ${src_filename}"
    exit 1
else
    echo "$(date) - Successfully imported concatenated schema ${src_filename}"
fi;
echo "$(date) - Import data for tables..."
for table in ${tables[@]} do;
    src_filename="SC_backups/syndeia-cloud-${version}_backup_${table}.csv"
	# Read MD5 hash for NULL values during each table export
	md5hash=$(cat SC_backups/${table}_md5hash_for_NULL.txt)
    echo "$(date)  - Import data for ${table} from ${dst_filename}..."
    CQL_to_restore_table="COPY ${table} FROM '${src_filename}' WITH HEADER = TRUE AND WITH NULL='${md5hash}'; "
    sudo cqlsh -u syndeia_admin -p ${syndeia_admin_pw} -e "${CQL_to_restore_table}" "${cassandra_jg_host}"
    if [[ $? -ne 0 ]]; then
        echo "$(date) x Error, could not import table ${table}"
        exit 2
    else
        echo "$(date) - Successfully imported table ${table}"
    fi;
done;
echo "$(date) - Done!"


2.  Make the script executable, ie:  chmod ug+x syndeia_cloud-3.3_restore.bash

3.  To use the script, run the following command to from the CLI: syndeia_cloud-3.3_restore.bash <node_FQDN>

      where <node_FQDN> is the node Fully Qualified Domain Name (ex: cassandra.mycompany.com) or if unspecified, localhost by default.  

     For each successfully imported schema and table, you should see output similar to the following (note, the # of exported rows):

Mon May 10 16:57:49 EDT 2021 - Successfully imported schema syndeia_cloud_auth

Using 3 child processes

Starting copy of syndeia_cloud_auth.users with columns [id, activated, avatar_url, created_by_user_key, created_timestamp, description, external_id, external_key, first_name, key, last_modified_by_user_key, last_modified_timestamp, last_name, name, other_info, permissions, profiles, roles, user_name].
Processed: 11 rows; Rate:       6 rows/s; Avg. rate:      10 rows/s
11 rows imported from 1 files in 1.048 seconds (0 skipped).



Troubleshooting

Import Errors

Q1: Sometimes I get the following timeout error and nothing gets imported (see below), how do I resolve this?

Failed to import 20 rows: OperationTimedOut - errors={<Host: x.x.x.x dc1>: ConnectionException('Host has been marked down or removed',)}, last_host=y.y.y.y,  will retry later, attempt 1 of 5
Failed to import 20 rows: OperationTimedOut - errors={'y.y.y.y': 'Client request timeout. See Session.execute[\_async](timeout)'}, last\_host=y.y.y.y,  will retry later, attempt 1 of 5
No records inserted in 90 seconds, aborting
Processed: 0 rows; Rate:       0 rows/s; Avg. rate:       0 rows/s
0 rows imported from 1 files in 1 minute and 30.180 seconds (0 skipped).

A1: Rerun the import command or create/set the following ~/.cassandra/cqlshrc parameters from the account used to run cqlsh and re-run the import command

[connection]
request_timeout=6000 
# default is 10 seconds, set to None to disable
client_timeout=3600 
# default is 10 seconds, set to None to disable

Q2: Sometimes I get the following "Pickling" error (see below), how do I resolve this?

PicklingError: Can't pickle <class 'cqlshlib.copyutil.ImmutableDict'>: attribute lookup cqlshlib.copyutil.ImmutableDict failed

A2: Suffix the following to the end of your COPY commands: WITH MINBATCHSIZE=1 AND MAXBATCHSIZE=1 AND PAGESIZE=10