Database and table management#

Finding existing databases and database families#

The following service pulls all configuration information of of the Replication/Ingest System, including info on the known database families, databases and tables:

GET

/replication/config

Upon successful (see Error reporting when calling the services) completion of the request, the service will return an object that has the following schema (of which only the database and database family-related fields are shown):

{
    "config": {
        "database_families" : [
            {
                "overlap" :               0.01667,
                "min_replication_level" : 3,
                "num_sub_stripes" :       3,
                "name" :                  "production",
                "num_stripes" :           340
            }
        ],
        "databases" : [
            {
                "database" :     "dp01_dc2_catalogs_02",
                "create_time" :  0,
                "is_published" : 1,
                "publish_time" : 1662688661000,
                "family_name" :  "production",
                "tables" : [
                    {
                        "ang_sep" :                 0,
                        "is_director" :             1,
                        "latitude_key" :            "coord_dec",
                        "create_time" :             1662774817703,
                        "unique_primary_key" :      1,
                        "flag" :                    "",
                        "name" :                    "Source",
                        "director_database_name" :  "",
                        "is_ref_match" :            0,
                        "is_partitioned" :          1,
                        "longitude_key" :           "coord_ra",
                        "database" :                "dp02_dc2_catalogs",
                        "director_table" :          "",
                        "director_key2" :           "",
                        "director_database_name2" : "",
                        "director_key" :            "sourceId",
                        "director_table2" :         "",
                        "director_table_name2" :    "",
                        "is_published" :            1,
                        "director_table_name" :     "",
                        "publish_time" :            1663033002753,
                        "columns" : [
                            {
                                "name" : "qserv_trans_id",
                                "type" : "INT NOT NULL"
                            },
                            {
                                "type" : "BIGINT NOT NULL",
                                "name" : "sourceId"
                            },
                            {
                                "type" : "DOUBLE NOT NULL",
                                "name" : "coord_ra"
                            },
                            {
                                "type" : "DOUBLE NOT NULL",
                                "name" : "coord_dec"
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

Notes:

The sample object was truncated for brevity. The actual number of families, databases, tables and columns were much higher in the real response.
The number of attributes varies depending on a particular table type. The example above shows attributes for the table Source. This table is partitioned and is a director (all director-type tables are partitioned in Qserv).

Registering databases#

Each database has to be registered in Qserv before one can create tables and ingest data. The following service of the Replication Controller allows registering a database:

POST

/ingest/database

The service requires a JSON object of the following schema:

{
    "database" :                   <string>,
    "num_stripes" :                <number>,
    "num_sub_stripes" :            <number>,
    "overlap" :                    <number>,
    "auto_build_secondary_index" : <number>,
    "local_load_secondary_index" : <number>
}

Where:

databasestring

The required name of the database to be created.

num_stripesnumber

The required number of stripes that was used when partitioning data of all tables to be ingested in a scope of the database.

num_sub_stripesnumber

The required number of sub-stripes that was used when partitioning data of all tables to be ingested in a scope of the database.

overlapnumber

The required overlap between the stripes.

auto_build_secondary_indexnumber = 1

The flag that specifies the desired mode for building the director (used to be known as the secondary) indexes of the director tables of the catalog. The flag controls the automatic building of the indexes, where:

1: Build the index automatically during transaction commit time.
0: Do not build the index automatically during transaction commit time. In this case, it will be up to a workflow to trigger the index building as a separated “post-ingest” action using the corresponding service:
- (Re-)building the Index

Note: Catalogs in Qserv may have more than one director table. This option applies to all such tables.

Warning

The service will return an error if the database with the same name already exists in the system.
Values of attributes num_stripes, num_sub_stripes and overlap are expected to match the corresponding partitioning parameters used when partitioning all partitioned tables of the new database. Note that the current implementation of the Qserv Ingest system will not validate contributions to the partitioned tables to enforce this requirement. Only the structural correctness will be checked. It’s up to a workflow to ensure the data ingested into tables are correct.
Building the director index during transaction commit time (for the relevant tables) may have a significant impact on the performance of the transaction commit operation. The impact is proportional to the size of the contributions made into the table during the transaction. This may orotolng the transaction commit time. An alternative option is to build the indexes as a separated “post-ingest” action using the corresponding service:
- (Re-)building the Index

If the operation is successfully finished (see Error reporting when calling the services) a JSON object returned by the service will have the following attribute:

{
    "database": {
        ...
    }
}

The object containing the database configuration information has the same schema as it was explained earlier in section:

Finding existing databases and database families

Registering tables#

All tables, regardless if they are partitioned or regular (fully replicated on all worker nodes), have to be registered using the following Replication Controller’s service:

POST

/ingest/table

The service requires a JSON object of the following schema:

Where a JSON object sent to the service with the request shall describe that table. This is a schema of the object for the partitioned tables is presented below:

{
    "database"             : <string>,
    "table"                : <string>,
    "is_partitioned"       : <number>,
    "schema" : [
        {   "name" :    <string>,
            "type" :    <string>
        },
        ...
    ],
    "director_table"       : <string>,
    "director_key"         : <string>,
    "director_table2"      : <string>,
    "director_key2"        : <string>,
    "latitude_key"         : <string>,
    "longitude_key"        : <string>,
    "flag"                 : <string>,
    "ang_sep"              : <double>,
    "unique_primary_key"   : <number>
}

A description of the regular tables has a fewer number of attributes (attributes that which are specific to the partitioned tables are missing):

{
    "database" :        <string>,
    "table" :           <string>,
    "is_partitioned" :  <number>,
    "schema": [
        {
            "name" :    <string>,
            "type" :    <string>
        },
        ...
    ]
}

Where the attributes are:

databasestring

The required name of the existing database.

tablestring

The required name of a table to be created.

is_partitionednumber

The required type of table. Allowed values:

1 for partitioned tables (including any subtypes)
0 for the regular tables.

schemaarray

The required definition of the table schema, where each entry of the array is an object with the following attributes:

name: The name of the column.
type: The type of the column. The type must adhere to the MySQL requirements for column types.

director_tablestring

The name of the corresponding first (or left) director table. The name is required to be not empty for the dependent tables and it has to be empty for the director tables. This is the only way to differentiate between two types of partitioned tables.

Note: The ref-match tables are considered as the dependent tables since they have columns that are pointing to the corresponding director tables. See attributes: director_key, director_table2, and director_key2.

director_keystring

The required name of a column in a partitioned table. A role of the column depends on a subtype of the table:

director: the primary key of the table
dependent: the foreign key pointing to the corresponding column of the director table

director_table2string

The name of the corresponding second (or right) director table. The non-empty value name is required for the ref-match tables and it has to be empty for the director and dependent tables.

Note: The very presence of this attribute in the input configuration would imply an intent to register a “ref-match* table. In this case, non-empty values of the attributes director_key2 , flag and ang_sep will be required in order to succeed with the registration.

director_key2string

The name of a column that is associated (AKA foreign key) with corresponding column of the second director table. A value of this attribute is required for and it must not be empty when registering the ref-match tables. It will be ignored for other table types. See a description of the attribute director_table2.

latitude_keystring

The required name of a column in a director table represents latitude. It’s optional for the dependent tables.

longitude_keystring

The required name of a column in a director table represents longitude. It’s optional for the dependent tables.

flagstring

The name of the special column that is required to be present on the ref-match tables. Values of the column are populated by the tool sph-partition-matches when partitioning the input files of the ref-match tables. The data type of this column is usually:

INT UNSIGNED

ang_sepdouble

The value of the angular separation for the matched objects that is used by Qserv to process queries which involve the ref-match tables. The value is in radians. The value is required to be non-zero for the ref-match tables.

unique_primary_keynumber = 0

The optional flag allows to drop the uniqueness requirement for the director keys of the table. The parameter is meant to be used for testing new table products, or for the director tables that won’t have any dependants (child tables). Allowed values:

0: The primary key is not unique.
1: The primary key is unique.

Warning

The table schema does not include definitions of indexes. Those are managed separately after the table is published. The index management interface is documented in a dedicated document
- TODO: Managing indexes of MySQL tables at Qserv workers.
The service will return an error if the table with the same name already exists in the system, or if the database didn’ exist at a time when teh request was delivered to the service.
The service will return an error if the table schema is not correct. The schema will be checked for the correctness.

Note

Requirements for the table schema:

The variable-length columns are not allowed in Qserv for the director and ref-match tables. All columns of these tables must have fixed lengths. These are the variable length types: VARCHAR, VARBINARY, BLOB, TEXT, GEOMETRY and JSON.
The partitioned tables are required to have parameters director_key, latitude_key and longitude_key.
The director tables are required to have non-empty column names in the parameters director_key, latitude_key and longitude_key.
The dependent tables are required to have a non-empty column name specified in the parameter director_key.
The dependent tables are allowed to have empty values in the parameters latitude_key and longitude_key.
For tables where the attributes latitude_key and longitude_key are provided (either because they are required of if they are optional), values must be either both non-empty or empty. An attempt to specify only one of the attribute or have a non-empty value in an attribute while the other one has it empty will result in an error.
All columns mentioned in attributes director_key, director_key2, flag, latitude_key and longitude_key must be present in the table schema.
Do not use quotes around the names or type specifications.
Do not start the columm names with teh reserved prefix qserv. This prefix is reserved for the Qserv-specific columns.

An example of the schema definition for the table Source:

[
    {
        "name" : "sourceId"
        "type" : "BIGINT NOT NULL",
    },
    {
        "name" : "coord_ra"
        "type" : "DOUBLE NOT NULL",
    },
    {
        "name" : "coord_dec"
        "type" : "DOUBLE NOT NULL",
    }
]

If the operation is successfully finished (see Error reporting when calling the services) a JSON object returned by the service will have the following attribute:

{
    "database": {
        ...
    }
}

The object will contain the updated database configuration information that will also include the new table. The object will have the same schema as it was explained earlier in section:

Finding existing databases and database families

Notes on the table names:

Generally, the names of the tables must adhere to the MySQL requirements for identifiers as explained in:
- https://dev.mysql.com/doc/refman/8.0/en/identifier-qualifiers.html
The names of identifiers (including tables) in Qserv are case-insensitive. This is not the general requirement in MySQL, where the case sensitivity of identifiers is configurable one way or another. This requirement is enforced by the configuration of MySQL in Qserv.
The length of the name should not exceed 64 characters as per:
- https://dev.mysql.com/doc/refman/8.0/en/identifier-length.html
The names should not start with the prefix qserv. This prefix is reserved for the Qserv-specific tables.

Publishing databases#

Databases are published (made visible to Qserv users) by calling this service:

PUT

/ingest/database/:database

The name of the database is provided as a parameter database of the resource path. There are a few optional parameters to be sent in the JSON body of the request:

{
    "consolidate_secondary_index" :  <number>,
    "row_counters_deploy_at_qserv" : <number>
}

Where:

consolidate_secondary_indexnumber = 0

The optional parameter that controls the final format of all the director index tables of the database. Normally, the director indexes are MySQL-partitioned tables. If the value of this optional parameter is not 0 then the Ingest System will consolidate the MySQL partitions and turn the tables into the monolitical form.

Warning

Depending on the scale of the catalog (sizes of the affected tables), this operation may be quite lengthy (up to many hours). Besides, based on the up to the date experience with using the MySQL-partitioned director indexes, the impact of the partitions on the index’s performance is rather negligible. So, it’s safe to ignore this option in most but very special cases that are not discussed by the document.

One can find more info on the MySQL partitioning at:

https://dev.mysql.com/doc/refman/8.0/en/partitioning.html

row_counters_deploy_at_qservnumber = 0

This optional flag that triggers scanning and deploying the row counters as explained at:

Row counters optimization (ADMIN)
Collecting row counters and deploying them at Qserv (REST)

To trigger this operation the ingest workflow should provide a value that is not 0. In this case the row counters collection service will be invoked with the following combination of parameters:

attr	value
`overlap_selector`	`CHUNK_AND_OVERLAP`
`force_rescan`	`1`
`row_counters_state_update_policy`	`ENABLED`
`row_counters_deploy_at_qserv`	`1`

Warning

The row counters deployment is a very resource-consuming operation. It may take a long time to complete depending on the size of the catalog. This will also delay the catalog publiushing stage of an ingest compaign. A better approach is to deploy the row counters as the “post-ingest” operation as explained in:

(TODO link) Deploying row counters as a post-ingest operation

Note

The catalogs may be also unpublished to add more tables. The relevant REST service is documented in:

(TODO link) Un-publishing databases to allow adding more tables

Un-publishing databases to allow adding more tables#

Unpublished databases as well as previously ingested tables will be still visible to users of Qserv. The main purpose of this operation is to allow adding new tables to the existing catalogs. The new tables won’t be seen by users until the catalog is published back using the following REST service:

Publishing databases

Databases are un-published by calling this service:

PUT

/replication/config/database/:database

The name of the database is provided in a parameter database of the resource. The only mandatory parameter to be sent in the JSON body of the request is:

admin_auth_keystring

The administrator-level authentication key that is required to publish the database. The key is used to prevent unauthorized access to the service.

Note: The key is different from the one used to publish the database. The elevated privileges are needed to reduce risks of disrupting user access to the previously loaded and published databases.

Deleting databases and tables#

These services can be used for deleting non-published (the ones that are still ingested) as well as published databases, or tables, including deleting all relevant persistent structures from Qserv:

DELETE

/ingest/database/:database

/ingest/table/:database/:table

To delete a non-published database (or a table from such database) a client has to provide the normal level authentication key auth_key in a request to the service:

{   "auth_key" : <string>
}

The name of the databases affected by the operation is specified at the resource’s path.

Deleting databases (or tables from those databases) that have already been published requires a user to have elevated administrator-level privileges. These privileges are associated with the authentication key admin_auth_key to be sent with a request instead of auth_key:

{   "admin_auth_key" : <string>
}

Upon successful completion of the request (for both above-mentioned states of the database), the service will return the standard response as explained in the section mentoned below. After that, the database (or the table, depending on a scope of a request) name can be reused for further ingests if needed.

Error reporting when calling the services

Deleting database families#

Warning

Be advised that deleting families that have member databases may result in unpredicted behavior of the Replication/Ingest System. It’s recommended to delete families only when they are empty.

Families can be deleted by calling the following REST service:

DELETE

/replication/config/family/:family[?force={0|1}]

The name of the family is provided in a parameter family of the resource. The optional query attribute force=1 can be used to proceed with the request even if the family has member databases. The default value of the attribute is 0. The only mandatory parameter to be sent in the JSON body of the request is:

admin_auth_keystring: The administrator-level authentication key that is required to delete the family. The key is used to prevent unauthorized access to the service.