Introduction#
This document presents the requirements for data to be ingested into Qserv and describes the process of preparing data for ingestion.
The main data format supported by Qserv is CSV. This choice is motivated by the fact that Qserv is implemented on top of MySQL, which natively supports the CSV format. The CSV format is also very simple and can be easily generated from other formats such as Parquet. Another reason for choosing CSV is the performance of the LOAD DATA INFILE statement used to load data into MySQL: it is the fastest way to load data into MySQL and is optimized for the CSV format. More information on this subject can be found in:
LOAD DATA INFILE (MySQL)
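As an illustration of this loading path, the sketch below pushes a prepared CSV file into a MySQL table with LOAD DATA LOCAL INFILE. The host, credentials, database, table, and file names are placeholders, and pymysql is merely one client library that exposes the local_infile capability.

    # A minimal sketch of loading a prepared CSV file into MySQL with
    # LOAD DATA LOCAL INFILE. All connection parameters, the table name,
    # and the file path are placeholders for this example.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="qsuser", password="",
                           database="test101", local_infile=True)
    try:
        with conn.cursor() as cursor:
            cursor.execute(
                "LOAD DATA LOCAL INFILE '/stage/object_chunk_57.csv' "
                "INTO TABLE Object "
                "FIELDS TERMINATED BY ',' "
                "LINES TERMINATED BY '\\n'"
            )
        conn.commit()
    finally:
        conn.close()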
The second option for loading data into Qserv is to send the data packaged in the JSON format. This option is not as efficient as the CSV format but is more flexible and supports more complex ingest workflows. The JSON format is supported by the following service:
This format is also used by the simple ingest API documented in HTTP frontend of Qserv. The API is meant for ingesting and managing user tables which are relatively small in size.
The data preparation process includes 4 steps:
Data conversion
Partitioning
Post-processing (data sanitization)
Staging
Some steps may be optional depending on the data format and the requirements of the ingest workflow.
Data conversion#
Note
This section is not complete. More details on the data conversion process will be added in the future version of the document.
The data conversion step is required if the input data is not in the CSV format. The most common data format used in the LSST project is the Parquet format.
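As an illustration, the sketch below converts a Parquet file into a headerless CSV file with the pyarrow library. The file names are placeholders, and pyarrow is just one of several tools that could perform this conversion.

    # A minimal sketch of converting a Parquet file into a headerless CSV
    # file with pyarrow. The input and output file names are placeholders.
    import pyarrow.csv as pc
    import pyarrow.parquet as pq

    table = pq.read_table("object_chunk_57.parquet")
    # Omit the header: LOAD DATA INFILE treats every line as a data row.
    pc.write_csv(table, "object_chunk_57.csv",
                 write_options=pc.WriteOptions(include_header=False))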
Partitioning#
This topic is covered in:
Partitioning (DATA)
Partitioner (DATA)
Post-processing#
Besides converting the data to the CSV format, there are other operations that may optionally be performed on the input data. The purpose of these operations is to ensure the values of the columns are compatible with MySQL expectations. These are a few examples of what may need to be done (a sanitization sketch follows the list):
The BOOL type in MySQL is an alias for the TINYINT type. Because of that, values like true and false are not supported by MySQL. Hence, they need to be converted to 1 and 0 respectively.
Some data tools may produce -inf and +inf values when converting floating point numbers into the CSV format. Neither of these values is supported by MySQL. Assuming the column type is REAL, they need to be converted to -1.7976931348623157E+308 and 1.7976931348623157E+308 respectively.
The NULL values in the CSV files need to be converted into \N.
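The sketch below applies these substitutions to a single file. The file names, the handling of empty fields as NULL, and the per-value rules are assumptions of this example; a production workflow would drive the substitutions from the actual table schema.

    # A minimal sketch of sanitizing CSV values for MySQL. The rules below
    # are examples only; a real workflow would derive per-column rules
    # from the table schema.
    import csv

    DBL_MAX = "1.7976931348623157E+308"

    def sanitize(value: str) -> str:
        if value in ("true", "True"):
            return "1"                 # BOOL maps to TINYINT
        if value in ("false", "False"):
            return "0"
        if value in ("inf", "+inf"):
            return DBL_MAX             # clamp to the largest REAL value
        if value == "-inf":
            return "-" + DBL_MAX
        if value == "":
            return r"\N"               # assume empty fields denote NULL
        return value

    with open("input.csv", newline="") as src, \
            open("sanitized.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([sanitize(v) for v in row])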
Handling binary data#
Binary data is supported by the Qserv ingest system in two ways:
The CSV format supports binary data. The corresponding fields need to be properly escaped as explained in: LOAD DATA INFILE (MySQL). A sketch of such escaping follows this list.
The JSON format also supports binary data. However, the data in the corresponding columns need to be encoded as explained in:
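For the CSV case, the sketch below escapes a raw byte string according to the default settings of LOAD DATA INFILE (backslash as the escape character, tab as the field terminator, newline as the line terminator). The helper name is an illustration, not part of the Qserv API.

    # A minimal sketch of escaping binary field values for the default
    # settings of LOAD DATA INFILE: FIELDS ESCAPED BY '\\' and TERMINATED
    # BY '\t', LINES TERMINATED BY '\n'. The helper is an example only.
    _SPECIAL = {
        0x00: b"\\0",   # NUL bytes must be written as \0
        0x09: b"\\t",   # the default field terminator
        0x0A: b"\\n",   # the default line terminator
        0x0D: b"\\r",   # carriage return, escaped for safety
        0x5C: b"\\\\",  # the escape character itself
    }

    def escape_binary(value: bytes) -> bytes:
        return b"".join(_SPECIAL.get(b, bytes([b])) for b in value)

    print(escape_binary(b"\x00ab\n"))  # b'\\0ab\\n'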
Restrictions for variable-length column types#
Note that variable-length column types like VARCHAR
and TEXT
are not allowed in the director tables in Qserv.
This is because director tables are used to materialize sub-chunks of the data. Sub-chunks are the smallest units of
data that can be processed by Qserv workers. The sub-chunk tables are implemented using the MEMORY
storage engine.
Further details on this subject can be found in:
MEMORY Storage Engine (MySQL)
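Since the restriction applies only to director tables, a workflow may want to catch violations before loading. The sketch below scans a simple column list for the disallowed types; the schema representation here is a hypothetical example for this sketch, not the format used by the Qserv ingest system.

    # A minimal sketch of validating a director table schema. The schema
    # representation is hypothetical, not the ingest system's format.
    DISALLOWED = ("VARCHAR", "TEXT", "TINYTEXT", "MEDIUMTEXT", "LONGTEXT")

    def find_disallowed_columns(columns: list[dict]) -> list[str]:
        """Return names of columns using variable-length types."""
        return [c["name"] for c in columns
                if c["type"].upper().startswith(DISALLOWED)]

    columns = [
        {"name": "objectId", "type": "BIGINT"},
        {"name": "note", "type": "VARCHAR(255)"},  # would be rejected
    ]
    print(find_disallowed_columns(columns))  # ['note']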
Staging#
Once the data are converted and partitioned, they need to be staged at a location from where they can be loaded into Qserv. Depending on the selected ingest method, the data may be:
placed locally, from where they would be pushed into Qserv via the proprietary binary protocol or the REST API.
placed on a distributed filesystem like GPFS, Lustre, etc., which is mounted at the Qserv workers.
placed on a Web server, from where they could be pulled into Qserv via the HTTP/HTTPS protocol.
placed into an Object Store (S3 compatible), from where they could be pulled into Qserv via the S3 protocol.
Besides making the data available at one of these locations, the workflow may also need to keep the data in place until the ingest process is completed.