Partitioning data of the ref-match tables

Introduction

The ref-match tables are a special class of tables designed to match rows between two independent director tables belonging to the same or different catalogs. In this case, there is no obvious 1-to-1 match between rows of the director tables. Instead, the pipelines compute a (spatial) match table that provides a many-to-many relationship between the two tables. The complication is that a match record might reference an object row and a reference row that fall on opposite sides of a partition boundary (that is, into different chunks). Qserv deals with this by taking advantage of the overlap that must be stored alongside each partition (this overlap is stored so that Qserv can avoid inter-worker communication when performing spatial joins on the fly).

Since the ref-match tables are also partitioned tables, the input data (CSV) of the tables has to be partitioned into chunks. Partitioning the ref-match tables requires a special version of the partitioning tool, sph-partition-matches. The source code of the tool is found in the source tree of Qserv: lsst/qserv. The corresponding binary is built and placed into the binary Docker image of Qserv.

Here is an example illustrating how to launch the tool from the container:

% docker run -it qserv/lite-qserv:2022.9.1-rc1 sph-partition-matches --help

    sph-partition-matches [options]

    The match partitioner partitions one or more input CSV files in
    preparation for loading by database worker nodes. This involves assigning
    both positions in a match pair to a location in a 2-level subdivision
    scheme, where a location consists of a chunk and sub-chunk ID, and
    outputting the match pair once for each distinct location. Match pairs
    are bucket-sorted by chunk ID, resulting in chunk files that can then
    be distributed to worker nodes for loading.
    A partitioned data-set can be built-up incrementally by running the
    partitioner with disjoint input file sets and the same output directory.
    Beware - the output CSV format, partitioning parameters, and worker
    node count MUST be identical between runs. Additionally, only one
    partitioner process should write to a given output directory at a
    time. If any of these conditions are not met, then the resulting
    chunk files will be corrupt and/or useless.
    \_____________________ Common:
    -h [ --help ]                         Demystify program usage.
    -v [ --verbose ]                      Chatty output.
    -c [ --config-file ] arg              The name of a configuration file
                                          containing program option values in a
    ...

The tool has two parameters specifying the locations of the input (CSV) file and the output folder where the partitioned products will be stored:

% sph-partition-matches --help
..
\_____________________ Output:
  --out.dir arg                         The directory to write output files to.
\______________________ Input:
  -i [ --in.path ] arg                  An input file or directory name. If the
                                        name identifies a directory, then all
                                        the files and symbolic links to files
                                        in the directory are treated as inputs.
                                        This option must be specified at least
                                        once.

Hint

If the tool is launched via the docker command shown above, the corresponding host paths (the input files, the output folder, and any configuration file) must be mounted into the container.
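The hint above can be sketched as a small helper that assembles a docker command line with bind mounts. This is a minimal illustration, not an official launcher: the image tag is the one used earlier on this page, while the helper name docker_command and the container-side paths /data/in, /data/out and /data/config are assumptions chosen for the example.

```python
from pathlib import Path


def docker_command(in_file, out_dir, config_file,
                   image="qserv/lite-qserv:2022.9.1-rc1"):
    """Build a 'docker run' argument list with the host paths bind-mounted."""
    in_file = Path(in_file).resolve()
    out_dir = Path(out_dir).resolve()
    config_file = Path(config_file).resolve()
    return [
        "docker", "run", "--rm",
        # Bind mounts: host path on the left, container path (assumed) on the right.
        "-v", f"{in_file.parent}:/data/in:ro",
        "-v", f"{out_dir}:/data/out",
        "-v", f"{config_file.parent}:/data/config:ro",
        image,
        "sph-partition-matches",
        "-c", f"/data/config/{config_file.name}",
        f"--in.path=/data/in/{in_file.name}",
        "--out.dir=/data/out",
    ]


# Print the command instead of running it, so it can be reviewed first.
print(" ".join(docker_command("in.csv", "chunks", "config.json")))
```

The resulting list can be passed to subprocess.run once the paths have been checked. The input and configuration folders are mounted read-only because the tool only needs to write to the output folder.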

All tables, including both director tables and the ref-match table itself, have to be partitioned using the same values of the partitioning parameters, including:

  • The number of stripes

  • The number of sub-stripes

  • The overlap radius
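Since a mismatch in any of these parameters is easy to overlook, a short check before partitioning may help. The sketch below assumes each table has a configuration in the JSON layout used by the example later in this document (a "part" section holding the three values). The function name mismatched_params and the table names are hypothetical.

```python
PARAMS = ("num-stripes", "num-sub-stripes", "overlap")


def mismatched_params(configs):
    """Return {param: {table: value}} for every parameter that differs between tables."""
    problems = {}
    for name in PARAMS:
        values = {table: cfg["part"].get(name) for table, cfg in configs.items()}
        if len(set(values.values())) > 1:
            problems[name] = values
    return problems


# Hypothetical per-table configurations, already loaded from JSON.
configs = {
    "director1": {"part": {"num-stripes": 340, "num-sub-stripes": 3, "overlap": 0.01}},
    "director2": {"part": {"num-stripes": 340, "num-sub-stripes": 3, "overlap": 0.01}},
    "refmatch": {"part": {"num-stripes": 340, "num-sub-stripes": 3, "overlap": 0.01}},
}
print(mismatched_params(configs))  # prints {}
```

An empty result means all tables agree. Any non-empty entry lists the conflicting values per table.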

Values of the partitioning parameters should be specified using the following options (the default values shown below are meaningless for any production scenario):

--part.num-stripes arg (=18)          The number of latitude angle stripes to
                                      divide the sky into.
--part.num-sub-stripes arg (=100)     The number of sub-stripes to divide
                                      each stripe into.
--part.overlap arg (=0.01)            Chunk/sub-chunk overlap radius (deg).
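To give a feel for what the stripe parameters mean, here is a simplified, latitude-only sketch. It assumes that stripes are equal-height bands of declination running from -90 to +90 degrees and that sub-stripes divide each stripe evenly. It deliberately ignores the subdivision in longitude that turns stripes into chunks, so it does not reproduce the chunk numbering of the actual tool. Both function names are hypothetical.

```python
import math


def stripe_index(dec_deg, num_stripes):
    """Latitude stripe of a declination, assuming equal-height stripes from -90 to +90 deg."""
    height = 180.0 / num_stripes
    index = math.floor((dec_deg + 90.0) / height)
    return min(max(index, 0), num_stripes - 1)  # keep dec = +90 in the last stripe


def sub_stripe_index(dec_deg, num_stripes, num_sub_stripes):
    """Sub-stripe counted from the south pole, under the same assumption."""
    total = num_stripes * num_sub_stripes
    height = 180.0 / total
    index = math.floor((dec_deg + 90.0) / height)
    return min(max(index, 0), total - 1)


# With the default 18 stripes, each stripe is 10 degrees tall.
print(stripe_index(43.021, 18))  # prints 13
```

More stripes and sub-stripes mean smaller chunks and sub-chunks, which is why the defaults are not suitable for production-scale catalogs.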

The next sections present two options for partitioning the input data.

The spatial match within the given overlap radius

This is the most reliable way of partitioning the input data of the match tables. It is available when the input rows of the match table carry the exact spatial coordinates of both matched rows (from the corresponding director tables).

In this scenario, the input data file (CSV) is expected to have 4 columns representing the spatial coordinates of the matched rows from the 1st ("left") and the 2nd ("right") director tables. Roles and sample names of the columns are presented in the table below:

Column      Role
dir1_ra     The right ascension (longitude) of the 1st matched entity (from the 1st director table).
dir1_dec    The declination (latitude) of the 1st matched entity (from the 1st director table).
dir2_ra     The right ascension (longitude) of the 2nd matched entity (from the 2nd director table).
dir2_dec    The declination (latitude) of the 2nd matched entity (from the 2nd director table).

The names of these columns need to be passed to the partitioning tool using two special parameters:

% sph-partition-matches \
    --part.pos1="dir1_ra,dir1_dec" \
    --part.pos2="dir2_ra,dir2_dec"

When using this technique for partitioning the match tables, the input CSV file(s) must have at least the 4 columns mentioned above. The actual number of columns can be larger. Values of all additional columns will be copied into the partitioned products (the chunk files). The original order of the columns will be preserved.

Here is an example of a sample CSV file that has values of the above-described spatial coordinates in the first 4 columns and the object identifiers of the corresponding rows from the matched director tables in the last 2 columns:

10.101,43.021,10.101,43.021,123456,6788404040
10.101,43.021,10.102,43.023,123456,6788404041

The last two columns are meant to store values of the following columns:

Column           Role
dir1_objectId    The unique object identifier of the 1st director table.
dir2_objectId    The unique object identifier of the 2nd director table.

The input CSV file shown above can also be presented in tabular form:

dir1_ra   dir1_dec   dir2_ra   dir2_dec   dir1_objectId   dir2_objectId
10.101    43.021     10.101    43.021     123456          6788404040
10.101    43.021     10.102    43.023     123456          6788404041

Note that this is actually a 1-to-2 match, in which a single object (123456) of the 1st director table has two matched objects (6788404040 and 6788404041) in the 2nd director table. Also note that the second matched object has slightly different spatial coordinates than the first one. If the value of the overlap parameter is larger than the difference between the coordinates, the tool will be able to match the objects successfully. For example, this would work if the overlap was set to 0.01. Otherwise, no match will be made and the row will be ignored by the tool.
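The comparison described above can be checked by hand for the sample rows. The sketch below computes the great-circle separation between the two positions of the second row with the haversine formula and compares it with the overlap. Whether the tool applies exactly this test internally is an assumption; the snippet only illustrates the scale of the numbers involved. The function name angular_separation_deg is hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (ra, dec) positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


overlap = 0.01  # degrees, as in the example above
sep = angular_separation_deg(10.101, 43.021, 10.102, 43.023)
print(f"{sep:.4f} deg, within overlap: {sep <= overlap}")
```

For the second sample row the separation is roughly 0.002 degrees, well inside an overlap of 0.01 degrees.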

It is assumed that the input data of the ref-match tables are correctly produced by the data processing pipelines. Verifying the quality of the input data is beyond the scope of this document. However, one might consider writing a special tool for pre-scanning the input files and finding problems in the files.
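A pre-scanning tool of the kind suggested above might look like the following sketch. It assumes the 6-column layout of the sample file (coordinates in the first 4 columns), the "\N" NULL marker used in the configuration example, and right ascension in the range [0, 360). These are assumptions about the input, not rules enforced by the partitioner, and the function name prescan is hypothetical.

```python
import csv
import io


def prescan(stream, num_columns=6, null="\\N"):
    """Yield (line_number, message) for rows that look malformed."""
    for line_no, row in enumerate(csv.reader(stream), start=1):
        if len(row) != num_columns:
            yield line_no, f"expected {num_columns} columns, found {len(row)}"
            continue
        # Check the (ra, dec) pair of each of the two matched positions.
        for ra_col, dec_col in ((0, 1), (2, 3)):
            ra_text, dec_text = row[ra_col], row[dec_col]
            if null in (ra_text, dec_text):
                yield line_no, f"NULL coordinate in columns {ra_col}-{dec_col}"
                continue
            try:
                ra, dec = float(ra_text), float(dec_text)
            except ValueError:
                yield line_no, f"non-numeric coordinate in columns {ra_col}-{dec_col}"
                continue
            if not (0.0 <= ra < 360.0 and -90.0 <= dec <= 90.0):
                yield line_no, f"coordinate out of range in columns {ra_col}-{dec_col}"


sample = io.StringIO(
    "10.101,43.021,10.101,43.021,123456,6788404040\n"
    "10.101,43.021,10.102,93.023,123456,6788404041\n"  # declination above 90
    "10.101,43.021,10.102\n"                           # truncated row
)
for line_no, message in prescan(sample):
    print(line_no, message)
# 2 coordinate out of range in columns 2-3
# 3 expected 6 columns, found 3
```

For real input, replace the in-memory sample with an open file handle.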

Here is a complete practical example of how to run the tool under the assumptions made above:

% cat in.csv
10.101,43.021,10.101,43.021,123456,6788404040
10.101,43.021,10.102,43.023,123456,6788404041

% cat config.json
{
    "part":{
        "num-stripes":340,
        "num-sub-stripes":3,
        "overlap":0.01,
        "pos1":"dir1_ra,dir1_dec",
        "pos2":"dir2_ra,dir2_dec"
    },
    "in":{
        "csv":{
            "null":"\\N",
            "delimiter":",",
            "field":[
                "dir1_ra",
                "dir1_dec",
                "dir2_ra",
                "dir2_dec",
                "dir1_objectId",
                "dir2_objectId"
            ]
        }
    },
    "out":{
        "csv":{
            "null":"\\N",
            "delimiter":",",
            "escape":"\\",
            "no-quote":true
        }
    }
}

% mkdir chunks
% sph-partition-matches -c config.json --in.path=in.csv --out.dir=chunks/
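Because a single misplaced comma makes the configuration file unreadable, it can be worth validating it before launching the tool. The sketch below parses the JSON and confirms that every column named in pos1 and pos2 appears in the in.csv.field list. The function name check_config is hypothetical, and the check reflects only the layout of the example above, not the tool's full set of options.

```python
import json


def check_config(text):
    """Parse a partitioner config and confirm pos1/pos2 name columns listed in in.csv.field."""
    config = json.loads(text)  # raises json.JSONDecodeError on a missing comma, etc.
    fields = config["in"]["csv"]["field"]
    missing = []
    for key in ("pos1", "pos2"):
        for column in config["part"][key].split(","):
            if column.strip() not in fields:
                missing.append(f"{key}: {column.strip()}")
    return missing


# A trimmed-down version of the configuration shown above.
sample = """
{
    "part": {"pos1": "dir1_ra,dir1_dec", "pos2": "dir2_ra,dir2_dec"},
    "in": {"csv": {"field": ["dir1_ra", "dir1_dec", "dir2_ra", "dir2_dec",
                             "dir1_objectId", "dir2_objectId"]}}
}
"""
print(check_config(sample))  # prints []
```

An empty list means every position column is declared. In practice the text would be read from config.json.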

Partitioning using index maps

Note

This section is under construction. Only the basic idea is presented here.

This is an alternative way of partitioning the input data of the match tables. It is available when the input rows of the match table do not carry the exact spatial coordinates of both matched rows (from the corresponding director tables). Instead, the input data has to carry the unique object identifiers of the matched rows. The tool will use these object identifiers, together with the index maps of the director tables, to find the spatial coordinates of the matched rows.