Cassanalyze

Cassanalyze

The name of the tool (which can be run from a command line / terminal) is symbolic of what it does - analyze the purity of data in a Cassandra database. So why do we even need to do that? Well, as it stands today, the materialized view feature has been deprecated in Cassandra v3, and we are on our own to maintain data sanity across various query tables of a base table.

So for instance, we have a base table called repositories and it’s associated query table called repositories_by_uri. Now the data in both these tables should always be the same at any point in time. Similarly, the containers base table has three other query tables - containers_by_repository_key, containers_by_external_id & containers_by_external_key tables.

Features

This utility has four switches, each enabled by passing y to it (ex: -s y or -g y). These 4 parameters can be passed in any order and can also be mixed and matched.:

  1. --validate or -v: Validates various tables with duplicate values for external_id columns.

  2. --orphaned or -o: Finds out orphaned data for which parent keys are missing.

  3. --summary or -s: Creates a count analysis summary of the 4 main base tables and their associated query tables, and the same for 4 type tables and their associated query tables. Sample output is shown below:

    "------------- summary (begin) -------------" {:repo-summary {:repositories {:total 96, :external 95, :internal 1}, :repositories_by_uri 96}, :cont-summary {:containers {:total 1791, :external 1564, :internal 227}, :containers_by_external_id 1564, :containers_by_external_key 1564, :containers_by_repository_key 1791}, :art-summary {:artifacts {:total 31209, :external 31175, :internal 34, :latest 29349}, :artifacts_by_external_id 31174, :artifacts_by_external_key 31158, :artifacts_by_key 31209, :artifacts_by_container_key 29349, :artifacts_by_type_key 29349}, :rel-summary {:relations {:total 16422, :external 6052, :internal 10370, :latest 15670}, :relations_by_external_id 6052, :relations_by_external_key 6052, :relations_by_key 16422, :relations_by_source_key 16422, :relations_by_target_key 16422, :relations_by_container_key 15670}, :repo-types-summary {:repository_types {:total 18, :external 17, :internal 1}}, :cont-types-summary {:container_types {:total 141, :external 140, :internal 1}, :container_types_by_external_id 140, :container_types_by_external_key 140, :container_types_by_repository_key 141}, :art-types-summary {:artifact_types {:total 431, :external 426, :internal 5}, :artifact_types_by_external_id 426, :artifact_types_by_external_key 426, :artifact_types_by_repository_key 431}, :rel-types-summary {:relation_types {:total 71, :external 13, :internal 58}, :relation_types_by_external_id 14, :relation_types_by_external_key 13, :relation_types_by_repository_key 71}} "------------- summary (end) -------------"
  4. --generateor -g: Generates insert statements in case it finds an imbalance of rows between tables (as shown by the --summary option). This step creates a Cassandra CQL file called inserts.cql file in the root folder of the utility from where it is being run. This can be then run using CQLSH to balance the tables.
    The rules by which the INSERT statements are generated by and a colorized sample summary output showing how the rules map are given below:

    1. We treat, the base table as the reference point (i.e. the single source of truth). So data in all the other tables, is compared against this base table.

    2. For external tables, the data is compared between the _by_external_id and _by_external_key tables.

    3. Counts of _by_external_ tables should match

    4. Counts of artifacts_by_container_key & artifacts_by_type_key should match with that of latest count of artifacts table and

    5. Counts of relations_by_container_key should match that of latest count of relations table

    6. Counts of remaining tables except the above should match that of the total count of respective base tables (ie: repositories & repositories_by_uri, containers & containers_by_repository_key, artifacts: total & artifacts_by_key, relations: total & relations_by_key, relations_by_source_key, relations_by_target_key).

    "------------- summary (begin) -------------"

    {:repo-summary {:repositories 77, :repositories_by_uri 77},

    :cont-summary {:containers 804,

    :containers_by_external_id 683,

    :containers_by_external_key 683,

    :containers_by_repository_key 804},

    :art-summary {:artifacts {:total 23134, :latest 21096},

    :artifacts_by_container_key 21096,

    :artifacts_by_external_id 23114,

    :artifacts_by_external_key 23114,

    :artifacts_by_key 23134,

    :artifacts_by_type_key 21096},

    :rel-summary {:relations {:total 12483, :latest 12197},

    :relations_by_container_key 12197,

    :relations_by_external_id 3452,

    :relations_by_external_key 3452,

    :relations_by_key 12483,

    :relations_by_source_key 12483,

    :relations_by_target_key 12483}}

    "------------- summary (end) -------------"
    The generated inserts.cql file will look something similar to the following (sanitized) sample file:

 

Note, --generate is currently only used to address row count discrepancies shown by --summary. --orphan and --validate currently only diagnose.

Example Usage

  • If you want to run only the summary command:

java -Dconfig="config/config.edn" -jar cassanalyze-0.2-standalone.jar -s y
  • If you want to run only the generate command:

java -Dconfig="config/config.edn" -jar cassanalyze-0.2-standalone.jar -g y
  • Validate how many rows in various tables are present with duplicate values in external_id column.

  • Find out how many orphaned rows are present in various tables, where parent keys are missing.

  • If you want to run summary & generate commands together (similarly, other parameters can be mixed and matched) :

Steps to run the utility / tool

(1) Unzip the Cassanalyze utility (should be bundled inside cassanalyze-0.2.2022-09-06.zip) into folder called cassanalyze. This is the root folder which will be your current working folder for remainder of the steps. After unzipping, the utility, your folder structure should like this:

 

(2) Now, we need to specify the particular instance of Cassandra we want to connect to, for which, we need to specify host /port / username / password / keyspace values in the /config/config.edn file. The contents of the config.edn file are:

As mentioned in the comments (a line starting with semi-colon “;”) in the above file, replace "Cassandra Host", "Cassandra username" & "Cassandra password" with appropriate values for your running Cassandra instance.

The Cassandra keyspace has been set to syndeia_cloud_store above, because that’s the keyspace which Syndeia Cloud uses.

(3) Get a summary of the current state of the database:

Sample “bad” data count summary is shown below with counts of interest colorized to show rule failures (note, the exact counts may differ as you most likely have a different # of repositories, containers, artifacts, etc):

"------------- summary (begin) -------------"
{:repo-summary {:repositories 70, :repositories_by_uri 70},
:cont-summary {:containers 1066,
:containers_by_external_id 919,
:containers_by_external_key 919,
:containers_by_repository_key 975},
:art-summary {:artifacts {:total 16620, :latest 15799},
:artifacts_by_container_key 15799,
:artifacts_by_external_id 16598,
:artifacts_by_external_key 16580,
:artifacts_by_key 16620,
:artifacts_by_type_key 15799},
:rel-summary {:relations {:total 9095, :latest 8808},
:relations_by_container_key 8808,
:relations_by_external_id 3920,
:relations_by_external_key 3920,
:relations_by_key 9095,
:relations_by_source_key 8936,
:relations_by_target_key 8828}}
"------------- summary (end) -------------"

Notice that containers_by_repository_key table, has 91 rows less than containers table. Similarly, relations_by_source_key & relations_by_target_key tables have 159 & 267 rows less than the total rows in the relations table. Also, artifacts_by_external_id & artifacts_by_external_key have a difference of 18 rows

(4) If you notice a difference in the number of rows of various tables, then you can generate insert statements (to balance the tables) using:

(5) The above command generates insert statements in an inserts.cql file, in the same folder from which these commands are being run. The insert statements for various tables are separated using CQL comments e.g. -- repositories insert statements. If you see only such comments in the file and no INSERT INTO statements, then you can skip this step, else, run the CQL file using the CQL shell, where <username> = the username & <host> = the host specified in config.edn file:

(6) After running the previous step, ensure that the database is balanced by checking the summary again: