Qserv SQL Dialect#

Introduction#

LSST Query Services (Qserv) provides access to the LSST Database Catalogs. Users can query the catalogs using standard SQL query language with a few restrictions described below. Why restricting it? We are intercepting all queries, rewriting them and executing each large query as many sub-queries in parallel. Introducing these restrictions greatly simplifies parsing incoming queries.

Our parser uses an open-source grammar built from the SQL92 specification. It does not include anything beyond SQL92, with minor exceptions (e.g., != is allowed as well as <>).

The simplest way to look at it is to treat it as a MySQL database server, modulo restrictions and extensions described in this document.

If you run into any syntax that this Manual fails to document, please report it to the Qserv developers.

Selected Design Aspects#

Partitioning and Sharding#

Qserv is designed to handle large volumes of data that can be partitioned in one or more dimensions (e.g., by spatial locality). Once the partitioning columns and parameters (such as partition size) are chosen, the data is directed to appropriate partitions (“chunks”), which are then distributed (sharded) across the nodes in the cluster.

In this scheme, a single large table typically consists of many chunks (potentially hundreds of thousands). This facilitates running full-table-scan queries, as each query can be executed in parallel across multiple chunks. For example, if we have an Object table split into 12345 chunks, Qserv will execute:

SELECT * FROM Object

as 12345 worker-side queries processed in parallel:

SELECT * FROM Object_00001
SELECT * FROM Object_00002
...
SELECT * FROM Object_12345

Director Table#

Often, multiple large tables need to be partitioned and joined together. To ensure joining such tables is possible without sending lots of information between nodes, Qserv has a way to ensure all related chunks always end up on the same machine. To enable that, Qserv has a notion of a “Director Table”, which “drives” partitioning. For example, consider two tables:

CREATE TABLE Object (
    objectId BIGINT PRIMARY KEY, -- unique identifier
    ra DOUBLE,                   -- spatial location (right ascension)
    decl DOUBLE                  -- spatial location (declination)
)

which contains information about astronomical objects (galaxies, stars), one row = one object, and:

CREATE TABLE Source (
    sourceId BIGINT PRIMARY KEY, -- unique identifier
    objectId BIGINT,             -- pointer to corresponding object
                                 -- Note, there maybe many sources per object
    ra DOUBLE                    -- spatial location (right ascension)
    decl DOUBLE                  -- spatial location (declination)
)

which contains information about individual detections of astronomical objects, one row representing one detection of one object.

Note that astronomical objects tend to move, so individual detections of the same object might have different ra / decl positions than the “average” location represented by the ra/decl of their corresponding object.

If we select the Object table to be the “Director Table”, not only will the Object table be partitioned according to its ra / decl values, but more importantly, the Source table will be partitioned based on the ra / decl of corresponding objects.

The Director Index#

The sharding scheme described above has a problem with locating data by objectId. To alleviate this, Qserv maintains a specialized index that maps a primary key of the director table to a chunkId of the chunk that contains a given row. Consider the query:

SELECT * FROM Object WHERE objectId = <id>

behind the scene, it will be executed as:

SELECT chunkId FROM IdToChunkMapping WHERE objectId = <id>

which is a quick index lookup, followed by:

SELECT * FROM Object_<chunkId> WHERE objectId = <id>

which is another quick index lookup inside one small chunk.

Note that the use of the director index has some restrictions, as explained in the restrictions section below.

By the way, do not attempt to issue queries directly on our internal chunk tables. It is blocked.

Extensions#

This section covers extensions to sql which we introduced.

Spatial Constraints#

Spatial constraints in Qserv can be expressed using one of the functions we introduced. Currently supported:

qserv_areaspec_box(
    lonMin  DOUBLE,  -- [deg] Minimum longitude angle
    latMin  DOUBLE,  -- [deg] Minimum latitude angle
    lonMax  DOUBLE,  -- [deg] Maximum longitude angle
    latMax  DOUBLE   -- [deg] Maximum latitude angle
)

qserv_areaspec_circle(
    lon     DOUBLE,  -- [deg] Circle center longitude
    lat     DOUBLE,  -- [deg] Circle center latitude
    radius  DOUBLE   -- [deg] Circle radius
)

qserv_areaspec_ellipse(
    lon                 DOUBLE,  -- [deg] Ellipse center longitude
    lat                 DOUBLE,  -- [deg] Ellipse center latitude
    semiMajorAxisAngle  DOUBLE,  -- [arcsec] Semi-major axis length
    semiMinorAxisAngle  DOUBLE,  -- [arcsec] Semi-minor axis length
    positionAngle       DOUBLE   -- [deg] Ellipse position angle, east of north
)

qserv_areaspec_poly(
    v1Lon  DOUBLE,  -- [deg] Longitude angle of first polygon vertex
    v1Lat  DOUBLE,  -- [deg] Latitude angle of first polygon vertex
    v2Lon  DOUBLE,  -- [deg] Longitude angle of second polygon vertex
    v2Lat  DOUBLE,  -- [deg] Latitude angle of second polygon vertex
    ...
)

Example:

SELECT objectId FROM Object WHERE qserv_areaspec_box(0, 0, 3, 10)

Note that as discussed in the “Restrictions” section below, spatial constraints must be expressed through the qserv_areaspec_* functions.

Restrictions#

This section covers restriction you need to be aware of when interacting with Qserv.

Spatial constraints should be expressed through our `qserv_areaspec_*` functions#

Spatial constraints should be expressed through qserv_areaspec_* functions (see Extensions section above for details). Any other way of specifying spatial restrictions may be significantly slower (e.g., they might devolve to be full table scan). For example, the form:

WHERE ra BETWEEN <ra1> AND <ra2>
  AND decl BETWEEN <decl1> AND <decl2>

even though it is equivalent to:

qserv_areaspec_box(<ra1>, <decl1>, <ra2>, <decl2>)

should not be used.

Spatial constraints must appear at the beginning of `WHERE`#

Spatial constraints must appear at the very beginning of the WHERE clause (before or after the objectId constraint, if there is any).

Only one spatial constraint is allowed per query#

Only one spatial constraint expressed through qserv_areaspec_* is allowed per query, e.g., these are examples of invalid queries:

WHERE qserv_areaspec_box(1, 35, 2, 38)
  AND qserv_areaspec_box(5, 77, 6, 78)

WHERE qserv_areaspec_box(1, 35, 2, 38)
  AND qserv_areaspec_circle(5, 77, 0.1)

Arguments passed to spatial constraints functions must be simple literals#

The arguments passed to the qserv_aresspec_* functions must be simple literals. They may not contain any references, e.g. may not refer to columns.

Example of an invalid entry:

WHERE qserv_areaspec_box(3+4, ra*2, 0, 0)

`OR` is not allowed after `qserv_areaspec_*` constraint#

If the query has extra constraints after the qserv_areaspec_* constraint, OR is not allowed immediately after qserv_areaspec_*, for example:

SELECT objectId, ra, decl, x
 FROM  Object
 WHERE qserv_areaspec_box(1, 35, 2, 38) AND x > 3.5

is valid, but:

SELECT objectId, ra, decl, x
 FROM  Object
 WHERE qserv_areaspec_box(1, 35, 2, 38) OR x > 3.5

is not allowed. We expect to remove this restriction in the future, see DM-2888.

The director index constraint must be expressed through `=`, `IN`, or `BETWEEN`#

If the query has objectId constraint, it should be expressed in one of these three forms:

SELECT * FROM Object WHERE objectId = 123
SELECT * FROM Object WHERE objectId IN (123, 453, 3465)
SELECT * FROM Object WHERE objectId BETWEEN 123 AND 130

E.g., don’t try to express it as WHERE objectId != 1, or WHERE objectId > 123, etc.

Note, we expect to allow decomposing objectId into bitfields (e.g., for sampling) in the future. See DM-2889.

Column(s) used in `ORDER BY` or `GROUP BY` must appear in `SELECT`#

At the moment we require columns used in ORDER BY or GROUP BY to be listed in SELECT. Example of an invalid query:

SELECT x FROM  T ORDER BY y

Correct version:

SELECT y, x FROM T ORDER BY y

Expressions/functions in `ORDER BY` clauses are not allowed#

In SQL92 ORDER BY is limited to actual table columns, thus expressions or functions in ORDER BY are rejected. This is true for Qserv too.

Example of an invalid use of the ORDER BY clause:

SELECT id, ABS(x) FROM Source ORDER BY ABS(x)

However, one can bypass this by using an alias, for example:

SELECT id, ABS(x) as ax FROM Source ORDER BY ax

Sub-queries are NOT supported#

Sub queries are not supported.

Commands that modify tables are disallowed#

Commands for creating or modifying tables are disabled. These commands include INSERT, UPDATE, LOAD INTO, CREATE, ALTER, TRUNCATE, and DROP. We will revisit this as we start adding support for Level 3.

Outer joins are not supported with near-neighbor queries#

Qserv does not support LEFT or RIGHT joins with near-neighbor predicates.

MySQL-specific syntax is not supported#

MySQL-specific syntax is not supported. Example of unsupported syntax that will be rejected: NAME_CONST.

Repeated column names through `*` are not supported#

Queries with a * that resolves to repeated column name are not supported. Example:

SELECT *, id FROM Object

will fail if the table Object has a column called id. Similarly, this query will fail:

SELECT o.*, s.* FROM Object AS o, Source AS s

if both tables Object and Source have a column called id.

A workaround would be to select columns explicitly and alias them, e.g. :

SELECT o.id AS oId, s.id AS sId FROM Object AS o, Source AS s

`USE INDEX()` is not supported#

Qserv will reject query with USE INDEX hint.

Variables are not supported#

You can’t select into a variable. For example:

SELECT scisql_s2CPolyToBin(...) FROM T INTO @poly

will fail. Related story is at DM-2874.

User Defined Functions#

Qserv installation always comes with a set of predefined user defined functions:

spherical geometry aimed to allow quick answers to the following sorts of questions:
- Which points in a table lie inside a region on the sphere? For example, an astronomer might wish to know which stars and galaxies lie inside the region of the sky observed by a single camera CCD.
- Which spherical regions in a table contain a particular point? For example, an astronomer might wish to know which telescope images overlap the position of interesting object X
photometry, aimed to provide conversions between raw fluxes, calibrated (AB) fluxes and AB magnitudes.

For details, see Science Tools for MySQL.

Example Queries Supported#

Counts and simple selections#

You can count objects and run simple selections. Few examples:

Count the number of rows in a table#

SELECT COUNT(*) FROM Object

Find rows with a particular `id`#

SELECT * FROM Object WHERE objectId = <id>

Select rows in a given area#

SELECT objectId FROM Object
 WHERE qserv_areaspec_box(1, 35, 2, 38)

Select rows in a given area meeting certain criteria#

SELECT COUNT(*) FROM Object
 WHERE qserv_areaspec_box(0.1, -6, 4, 6)
   AND x = 3.4
   AND y BETWEEN 1 AND 2

Joins#

Join two tables#

SELECT s.ra, s.decl, o.raRange, o.declRange
  FROM Object o, Source s
 WHERE o.objectId = <id>
   AND o.objectId = s.objectId

or:

SELECT s.ra, s.decl, o.raRange, o.declRange
  FROM Object o, Source s USING (objectId)
 WHERE o.objectId = <id>

Find near neighbors in a given region#

SELECT o1.objectId AS objId1,
       o2.objectId AS objId2,
       scisql_angSep(o1.ra_PS,
                     o1.decl_PS,
                     o2.ra_PS,
                     o2.decl_PS) AS distance
  FROM Object o1, Object o2
 WHERE qserv_areaspec_box(0, 0, 0.2, 1)
   AND scisql_angSep(o1.ra_PS,
                     o1.decl_PS,
                     o2.ra_PS,
                     o2.decl_PS) < 0.05
   AND o1.objectId <> o2.objectId

`LIMIT`, `ORDER BY`#

Limit results, sort results#

SELECT * FROM Object
 WHERE x > 4
 ORDER BY x
 LIMIT 100

Known Bugs#

The list of all known / reported problems can be found at: Data Access and Database Team User-facing Bugs.

Selecting by objectId can miss a row#

Selecting rows using objectId sometimes does not return rows it should. For details, see: DM-2864.

`WHERE objectId BETWEEN` fails#

As explained above, queries in the form WHERE objectId BETWEEN are discouraged. In fact, Qserv will currently return a cryptic message when such query is executed. For details, see DM-2873.

Notes of Performance#

Use objectId when selecting sources#

If you need to locate a small number of sources, try to use objectId if you can. If you don’t, your query will require an index scan for every chunk of the Source table (which can potentially mean thousands of chunk-queries). For example this query will require it:

SELECT * FROM Source WHERE sourceId = 500

but asking for sources related to a given object, like this one:

SELECT * FROM Source WHERE objectId = 123 AND sourceId = 500

will require an index scan for just a single chunk, and thus will typically be much faster.