Set up ingest input data
Prerequisites
An HTTP(s) server providing access to input data and metadata.
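For testing, such a server can be as simple as the sketch below, which uses Python's built-in http.server module; the port and the /data/ingest directory are assumptions for this example, and any HTTP(s) server exposing the input files works equally well.

from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the tree containing metadata.json and the chunk files (path is an example).
handler = partial(SimpleHTTPRequestHandler, directory="/data/ingest")
HTTPServer(("0.0.0.0", 8080), handler).serve_forever()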
Example
The input data examples shown here are used by the Qserv ingest continuous integration process.
Input data
Input data is produced by the Qserv partitioner (i.e. sph-partition) and consists of multiple *.csv, *.tsv or *.txt files. Each of these files contains a part of a chunk for a given database and table, as shown in this example. The relation between an input data file and its table and database is recorded in metadata.json, detailed below.
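For illustration only, a chunk contribution file such as chunk_57866.txt could contain comma-separated rows like the ones below; the column layout and values are invented for this sketch and are not taken from the continuous integration dataset, but the field separator matches the formats section of metadata.json shown later.

1250017460198877,10.482217,-41.012370,57866
1250017460198878,10.495431,-41.007829,57866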
Metadata
The metadata files below describe the input data and are required by qserv-ingest:
metadata.json: contains the names of the configuration files describing the database, the tables, and the indexes. It also contains the relative paths to the input chunk data produced by the Qserv partitioner. The folder organization for input chunk files is configurable through the directory and chunks sections of metadata.json. Each input chunk file name must follow the pattern chunk_<chunk_id>.txt. Add a .tables[i].data[i].overlaps section if, for a given chunk, a chunk contribution file does not have a corresponding overlap file, or vice versa; this can happen when a chunk has an empty overlap or when an empty chunk has a non-empty overlap. If the .tables[i].data[i].overlaps section is missing then, for a given chunk, each chunk contribution file must have a corresponding overlap file (i.e. both chunk_XXX.txt and chunk_XXX_overlap.txt must exist); a sketch that checks this pairing is given at the end of this section. For example:
{
    "version": 12,
    "database": "test101.json",
    "formats": {
        "txt": {
            "fields_terminated_by": ",",
            "fields_escaped_by": "\\\\",
            "lines_terminated_by": "\\n"
        }
    },
    "tables": [
        {
            "schema": "director_table.json",
            "indexes": [
                "idx_director.json"
            ],
            "data": [
                {
                    "directory": "director/dir1",
                    "chunks": [
                        57866,
                        57867
                    ],
                    "overlaps": [
                        57800,
                        57801
                    ]
                },
                {
                    "directory": "director/dir2",
                    "chunks": [
                        57868
                    ]
                }
            ]
        },
        {
            "schema": "partitioned_table.json",
            "indexes": [
                "idx_partitioned.json"
            ],
            "data": [
                {
                    "directory": "partitioned/dir1",
                    "chunks": [
                        57866,
                        57867
                    ]
                },
                {
                    "directory": "partitioned/dir2",
                    "chunks": [
                        57868
                    ]
                }
            ]
        }
    ]
}
<database_name>.json: describes the database to register with the replication service, into which the data will be ingested; see the Ingest API documentation for registering databases (an illustrative sketch follows this list).
<table_name>.json: each of these files describes a table to register with the replication service, into which the data will be ingested; see the Ingest API documentation for registering tables.
<table_index>.json: each of these files describes an index to create for a given set of chunk tables; see the Ingest API documentation for creating indexes.
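For illustration, a sketch of what a <database_name>.json file might contain; the field names and values below (database name and partitioning parameters) are assumptions made for this example, and the authoritative set of fields is defined by the Ingest API documentation for registering databases.

{
    "database": "test101",
    "num_stripes": 340,
    "num_sub_stripes": 3,
    "overlap": 0.01667
}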
A valid set of examples for all of these files is available in this Rubin IN2P3 repository.
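Before starting an ingest, the layout described by metadata.json can be sanity-checked against the rules above. The sketch below is illustrative and not part of qserv-ingest; it assumes metadata.json and the data directories are available under a local copy of the served tree (the root path is an assumption) and verifies the chunk_<chunk_id>.txt naming and the chunk/overlap pairing rule.

import json
from pathlib import Path

root = Path("/data/ingest")  # assumed local root of the tree served over HTTP(s)
metadata = json.loads((root / "metadata.json").read_text())

for table in metadata["tables"]:
    for entry in table["data"]:
        directory = root / entry["directory"]
        for chunk_id in entry["chunks"]:
            # Every listed chunk must have a contribution file named chunk_<chunk_id>.txt.
            chunk_file = directory / f"chunk_{chunk_id}.txt"
            if not chunk_file.is_file():
                raise FileNotFoundError(chunk_file)
            # Without an explicit "overlaps" list, the matching overlap file must exist too.
            if "overlaps" not in entry:
                overlap = directory / f"chunk_{chunk_id}_overlap.txt"
                if not overlap.is_file():
                    raise FileNotFoundError(overlap)
        # With an explicit "overlaps" list, overlap files are checked independently.
        for chunk_id in entry.get("overlaps", []):
            overlap = directory / f"chunk_{chunk_id}_overlap.txt"
            if not overlap.is_file():
                raise FileNotFoundError(overlap)

print("Input chunk files are consistent with metadata.json")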