CockroachDB Backups, Exports, and Archives

Today I would like to talk about the differences between backups, exports, and archives as they relate to CockroachDB.  Let's start by defining, in general, what each of these three ideas is supposed to accomplish.


Backing up data is a concept that nearly everyone in the IT field has encountered at some point.  Many of us have had to work with different backup systems, each with its own capabilities and requirements.  Backups exist to satisfy the use case where we need to recover from some form of disaster or loss and get a system back to an operational state.  There can be various requirements in the form of encryption, retention, frequency, and so forth.  CockroachDB performs well in this situation: both backups and restores are executed in parallel, with the work spread across the cluster, and there is the ability to back up the MVCC revision history as well.  The syntax for executing backups and restores can be found in the excellent CRDB documentation.  The limitation of backups is that they are in a format that only CockroachDB can parse, and there is no guarantee that in a decade you will be able to restore a backup made today.
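As a rough illustration (the storage URL and options here are placeholders; see the CRDB docs for the full syntax), a full-cluster backup with revision history and a later restore might look something like this:

    -- Nightly full backup of the cluster, keeping MVCC revision history,
    -- written to a hypothetical cloud storage bucket.
    BACKUP INTO 's3://example-bucket/crdb-backups?AUTH=implicit'
        AS OF SYSTEM TIME '-10s'
        WITH revision_history;

    -- After a disaster, restore the most recent backup from that location.
    RESTORE FROM LATEST IN 's3://example-bucket/crdb-backups?AUTH=implicit';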


Exporting data from CockroachDB is done with the EXPORT command and is used to output the data from a table, or from a SELECT query you specify (including joins), in a tabular manner.  Think of exports as a CSV dump of a table or query.  These too are done in parallel across the CRDB cluster.  The good news is that CSV files can be easily read and parsed, but they are not meant to be used in a situation where you are recovering from a disaster.  Exports are mainly meant to transfer data from one system to another in a much more open format.  Just as backups have restores, the EXPORT command has a corresponding IMPORT command that will ingest CSV files and place their contents into a CockroachDB table.  Assuming that CSV files are still in use a decade from now, I would fully expect exports and imports to still be possible then.  Exports and imports work well for simple data structures made up of a small number of tables whose data needs to be retained for years to come.  So they can be used as lightweight archives, but they don't often meet the requirements of a full data archiving solution.
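For example (the table name, column names, and bucket URL are hypothetical), exporting a table to CSV and later importing those files back might look like this:

    -- Dump an entire table (or the result of a SELECT) to CSV files in cloud storage.
    EXPORT INTO CSV 's3://example-bucket/exports/orders?AUTH=implicit'
        FROM TABLE orders;

    -- Later, ingest the CSV files that the export produced into an existing table,
    -- listing each file explicitly.
    IMPORT INTO orders (id, customer_id, total, created_at)
        CSV DATA ('s3://example-bucket/exports/orders/export-part1.csv');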


Archives are not a feature implemented within CockroachDB, as the needs differ from application to application and industry to industry, though they can be built on top of the export and import functionality.  Handling data archives often requires the construction of additional tooling and bookkeeping within the application or within one of its microservices.


Let's say we have a CockroachDB database with 20 different tables in it.  These tables relate to each other and together represent the internal data structure of our application.  If we are generating millions of records a day and have a requirement to retain data for 10 years, but do not need to keep it online after 90 days of non-use, then we can build an archive solution for our older data.  This keeps the storage needs of the CockroachDB cluster down and allows us to "roll off" old data once it is no longer needed.  We could use the export command and a complicated 20-way join to construct a big CSV file for the old records that we can move offline, but how do we unwind that 20-table join when bringing the data back in if it is needed years from now?  That is a non-trivial task.
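To make the difficulty concrete, here is a hedged sketch of that kind of flattening export; the orders, order_items, and shipments tables and their columns are made up to stand in for the 20-table structure:

    -- Flatten parent records and their children that have gone unused for 90 days
    -- into a single CSV.  Easy to write, hard to unwind later.
    EXPORT INTO CSV 's3://example-bucket/archive/2024-01?AUTH=implicit'
        FROM SELECT o.id, o.customer_id, i.sku, i.quantity, s.carrier, s.shipped_at
        FROM orders AS o
        JOIN order_items AS i ON i.order_id = o.id
        JOIN shipments AS s ON s.order_id = o.id
        WHERE o.last_used_at < now() - INTERVAL '90 days';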


Fundamentally, the archive and un-archive functionality we construct needs to take the internal DB representation of a full data record, convert it into a more easily digestible format, compress it, encrypt it, verify that it can be un-archived, place it in an archive location, create bookkeeping that tells us where to go to un-archive this data in the future, tag the original record for removal, and finally remove the online record.  As you can see, this is much more involved than just using the export functionality… though doing an export can be part of the overall archiving effort.
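One possible shape for the bookkeeping piece of that pipeline is sketched below; the table and column names are assumptions for illustration, not a prescribed schema:

    -- Records where each archive lives and what it contains, so that the data
    -- can be located and un-archived years later.
    CREATE TABLE archive_catalog (
        archive_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        parent_table     STRING NOT NULL,              -- e.g. 'orders'
        parent_ids       STRING[] NOT NULL,            -- keys of the archived parent records
        archive_location STRING NOT NULL,              -- URL of the compressed, encrypted package
        archive_format   STRING NOT NULL,              -- e.g. 'csv-bundle-v1'
        verified         BOOL NOT NULL DEFAULT false,  -- dry-run un-archive succeeded
        archived_at      TIMESTAMPTZ NOT NULL DEFAULT now()
    );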


In our example with 20 inter-related tables, there may be multiple CSV files we create as part of the archive of a single parent record.  We may also archive one or more parent records from the top-most parent table.  Defining that external archive data format can be a challenge and entails understanding the data structure as it is represented within the DB and how the application uses that data.  So we may very well tag one record in table1 (the parent table) and multiple rows in each of the child tables that relate to the parent record, and archive them all together as a group in multiple CSV files or in a single XML file.  Similarly, we may tag multiple parent records in table1 along with all of the related child rows in the downstream tables and create a big archive all at once.  Defining the specifics of the archive data format and the sequence of actions to be executed depends entirely on your application, your needs, and the regulatory and governance requirements you fall under.


The sequence of events to archive older data may look like this (a SQL sketch of the database-side steps follows the list):

  1. A high-level definition of what data needs to be archived is created.  This is used to flag or tag the parent records to be moved offline.

  2. All of the downstream child elements that relate to each of these parent records are also flagged or tagged.  

  3. The logic in your archiving functionality goes through and pulls out all of the items that have been flagged above.

  4. This information is packaged into one or more files, compressed, and encrypted.

  5. A dry run of un-archiving this package is attempted to make sure that it really can be brought back when needed in the future.

  6. A record is created in a bookkeeping table describing what is in the archive you just created, including the key details needed to find this data again later on.

  7. The rows from the various tables that were just archived are flagged for deletion.

  8. The rows flagged for deletion are removed.
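A rough SQL sketch of the database-side steps above, reusing the hypothetical orders/order_items tables, an assumed archive_flag column, and the archive_catalog table from earlier; the packaging, compression, encryption, and dry-run verification of steps 4 and 5 happen outside the database in your own tooling:

    -- Steps 1 and 2: tag the parent records and their related children.
    UPDATE orders
        SET archive_flag = true
        WHERE last_used_at < now() - INTERVAL '90 days';
    UPDATE order_items
        SET archive_flag = true
        WHERE order_id IN (SELECT id FROM orders WHERE archive_flag);

    -- Step 3: pull the flagged rows out, one export per table.
    EXPORT INTO CSV 's3://example-bucket/archive/batch-42/orders?AUTH=implicit'
        FROM SELECT * FROM orders WHERE archive_flag;
    EXPORT INTO CSV 's3://example-bucket/archive/batch-42/order_items?AUTH=implicit'
        FROM SELECT * FROM order_items WHERE archive_flag;

    -- (Steps 4 and 5: compress, encrypt, and dry-run the package with external tooling.)

    -- Step 6: record what was archived and where it went.
    INSERT INTO archive_catalog (parent_table, parent_ids, archive_location, archive_format)
        SELECT 'orders', array_agg(id::STRING),
               's3://example-bucket/archive/batch-42', 'csv-bundle-v1'
        FROM orders WHERE archive_flag;

    -- Steps 7 and 8: remove the now-archived rows, children first.
    DELETE FROM order_items WHERE archive_flag;
    DELETE FROM orders WHERE archive_flag;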


The steps involved with unarchiving a piece of data would follow this sequence (a matching SQL sketch follows the list):

  1. Query the bookkeeping table to find out which archive file(s) contains the data that is needed.

  2. Decrypt and decompress the file we identified in step one.

  3. If we don't need to unarchive everything in the file, eliminate the components that are unneeded.

  4. Parsing the archive data structure that we defined back when we constructed our application, place the data back into the live system as appropriate, and flag the restored rows in the database as having been unarchived.

  5. After the unarchived data has been utilized, purge it again as long as it hasn't been modified.  If it has been modified, remove the "un-archive" flag and process it to be archived again like any other piece of data.
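Continuing the same hypothetical schema, the database-side portion of un-archiving might look roughly like this; the decryption, decompression, and trimming of unneeded rows (steps 2 and 3) happen outside the database:

    -- Step 1: find which archive holds the parent record we need.
    SELECT archive_location, archive_format
        FROM archive_catalog
        WHERE parent_table = 'orders'
          AND 'some-order-id' = ANY (parent_ids);

    -- (Steps 2 and 3: decrypt, decompress, and trim the package with external
    --  tooling, then stage the CSV files where CockroachDB can reach them.)

    -- Step 4: load the rows back in.  (Check the docs for how IMPORT INTO behaves
    -- on a table that is actively in use before running this against a live system.)
    IMPORT INTO orders (id, customer_id, total, last_used_at)
        CSV DATA ('s3://example-bucket/unarchive/batch-42/orders.csv');
    IMPORT INTO order_items (id, order_id, sku, quantity)
        CSV DATA ('s3://example-bucket/unarchive/batch-42/order_items.csv');

    -- Step 4 (continued): mark the restored rows so step 5 knows they were un-archived.
    UPDATE orders SET unarchived = true WHERE id = 'some-order-id';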


As you can see, a comprehensive archiving solution is much more than a backup and a restore, and it is much more involved than simply doing an export and an import later on.  Backups and archives address similar but distinct use cases.

