Information in this guide is known to be outdated. A documentation sprint is underway which will include updates and revisions to this guide.
LSST Query Services (Qserv) provides access to the LSST Database Catalogs. Users can query the catalogs using standard SQL, with a few restrictions described below. Why restrict it? We intercept all queries, rewrite them, and execute each large query as many sub-queries in parallel. Introducing these restrictions greatly simplifies parsing incoming queries.
Our parser uses an open-source grammar built from the SQL92 specification. It does not include anything beyond SQL92, with minor exceptions (e.g., != is allowed as well as <>).
The simplest way to look at it is to treat it as a MySQL database server, modulo restrictions and extensions described in this document.
If you run into any syntax that this Manual fails to document, please report it through the Qserv mailing list.
Selected Design Aspects
Partitioning and Sharding
Qserv has been designed to handle large volumes of data that can be partitioned in one or more dimensions (for example, by spatial locality). Once the partitioning column(s) are selected and partitioning parameters (such as partition size) are chosen, loaded data gets directed to appropriate partitions (“chunks”), and chunks are distributed (sharded) across nodes available in the cluster.
In such a scheme, a single large table typically consists of many chunks (possibly tens of thousands). This helps with running full-table-scan queries, as each such query can be executed in pieces, in parallel. For example, imagine we have an Object table that we split into x chunks. Then Qserv will execute:
SELECT * from Object
as one query per chunk:
SELECT * from Object_00001
SELECT * from Object_00002
...
SELECT * from Object_x
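The rewrite described above can be sketched in a few lines of Python. This is only an illustration, not Qserv's actual rewriter: the function name `chunk_queries` is hypothetical, and the five-digit zero padding is an assumption taken from the `Object_00001` example.

```python
def chunk_queries(table, chunk_ids, width=5):
    """Return one SELECT statement per chunk of `table`.

    The `<table>_<zero-padded chunkId>` naming mirrors the Object_00001
    example; the padding width is an assumption of this sketch.
    """
    return [f"SELECT * FROM {table}_{cid:0{width}d}" for cid in chunk_ids]


# Each generated statement could then be dispatched to the node holding that chunk.
for query in chunk_queries("Object", [1, 2, 3]):
    print(query)
```

Because the per-chunk statements are independent of one another, they can run in parallel, which is what makes full-table scans tractable.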
Often, multiple large tables need to be partitioned and joined together. To ensure joining such tables is possible without sending lots of information between nodes, Qserv has a way to ensure all related chunks always end up on the same machine. To enable that, Qserv has a notion of “Director Table”, which “drives” partitioning. For example, consider two tables:
TABLE Object (
    objectId BIGINT PRIMARY KEY, # unique identifier
    ra       DOUBLE,             # spatial location (right ascension)
    decl     DOUBLE              # spatial location (declination)
)
which contains information about astronomical objects (galaxies, stars), one row = one object, and:
TABLE Source (
    sourceId BIGINT PRIMARY KEY, # unique identifier
    objectId BIGINT,             # pointer to corresponding object
                                 # Note: there may be many sources per object
    ra       DOUBLE,             # spatial location (right ascension)
    decl     DOUBLE              # spatial location (declination)
)
which contains information about individual detections of astronomical objects, one row = one detection of one object.
Note that astronomical objects tend to move, so individual detections of the same object might have different ra/decl positions than the “average” location represented by the ra/decl of their corresponding object.
If we elect the Object table to be the “Director Table”, not only will the Object table be partitioned according to its ra/decl values but, more importantly, the Source table will be partitioned based on the ra/decl of the corresponding objects.
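To make the director-table idea concrete, here is a minimal Python sketch. The fixed ra/decl grid in `chunk_id` is a toy stand-in, not Qserv's real partitioning scheme, and all names and sample values are hypothetical. The point it illustrates is that a source is placed by its object's position, not its own.

```python
import math


def chunk_id(ra, decl, cell_deg=10.0):
    """Toy partitioner: map a sky position to a cell of a fixed ra/decl grid."""
    cols = int(360 / cell_deg)
    row = int(math.floor((decl + 90.0) / cell_deg))
    col = int(math.floor((ra % 360.0) / cell_deg))
    return row * cols + col


# objectId -> (ra, decl) of the "average" object position
objects = {1: (10.2, -5.1), 2: (200.7, 45.0)}

# (sourceId, objectId, ra, decl); source 100 was detected just across a cell boundary
sources = [(100, 1, 9.99, -5.1), (101, 2, 200.7, 45.0)]


def source_chunk(source):
    """Place a source in the chunk of its object, ignoring the source's own position."""
    _, object_id, _, _ = source
    return chunk_id(*objects[object_id])


for src in sources:
    print(src[0], "-> chunk", source_chunk(src))
```

Source 100 would fall into a different cell if partitioned by its own coordinates, but it is stored with object 1, so a join on objectId never has to cross chunks.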
The sharding scheme described above has a problem with locating data by objectId. To alleviate this, Qserv maintains a specialized index that maps a primary key of the director table to a chunkId of the chunk that contains a given row. Consider a query:
SELECT * from Object WHERE objectId = <id>
Behind the scenes, it will be executed as:
SELECT chunkId FROM IdToChunkMapping WHERE objectId = <id>
which is a quick index lookup, followed by
SELECT * from Object_<chunkId> WHERE objectId = <id>
which is another quick index lookup inside one small chunk.
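The two-step lookup can be simulated with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the in-memory tables, the sample row, and the `Object_<chunkId>` zero-padding are made up for illustration, and Qserv's real index and dispatch machinery are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE IdToChunkMapping (objectId INTEGER PRIMARY KEY, chunkId INTEGER);
    CREATE TABLE Object_00042 (objectId INTEGER PRIMARY KEY, ra REAL, decl REAL);
    INSERT INTO IdToChunkMapping VALUES (123, 42);
    INSERT INTO Object_00042 VALUES (123, 10.5, -3.25);
""")


def select_by_object_id(conn, object_id):
    """Look up the chunk for object_id, then query only that chunk table."""
    row = conn.execute(
        "SELECT chunkId FROM IdToChunkMapping WHERE objectId = ?", (object_id,)
    ).fetchone()
    if row is None:
        return []  # id not in the index: no chunk needs to be visited
    chunk_table = f"Object_{row[0]:05d}"  # naming/padding is an assumption
    return conn.execute(
        f"SELECT * FROM {chunk_table} WHERE objectId = ?", (object_id,)
    ).fetchall()


print(select_by_object_id(conn, 123))
```

Both statements are primary-key lookups, so the cost does not grow with the number of chunks.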
Note that use of the secondary index has some restrictions, as explained in the restrictions section below.
By the way, do not attempt to issue queries directly against our internal chunk tables; such queries are blocked.
Extensions
This section covers the extensions to SQL that we introduced.
Spatial constraints in Qserv can be expressed using one of the functions we introduced. Currently supported:
qserv_areaspec_box(
    lonMin DOUBLE PRECISION, # [deg] Minimum longitude angle
    latMin DOUBLE PRECISION, # [deg] Minimum latitude angle
    lonMax DOUBLE PRECISION, # [deg] Maximum longitude angle
    latMax DOUBLE PRECISION  # [deg] Maximum latitude angle
)
qserv_areaspec_circle(
    lon    DOUBLE PRECISION, # [deg] Circle center longitude
    lat    DOUBLE PRECISION, # [deg] Circle center latitude
    radius DOUBLE PRECISION  # [deg] Circle radius
)
qserv_areaspec_ellipse(
    lon                DOUBLE PRECISION, # [deg] Ellipse center longitude
    lat                DOUBLE PRECISION, # [deg] Ellipse center latitude
    semiMajorAxisAngle DOUBLE PRECISION, # [arcsec] Semi-major axis length
    semiMinorAxisAngle DOUBLE PRECISION, # [arcsec] Semi-minor axis length
    positionAngle      DOUBLE PRECISION  # [deg] Ellipse position angle, east of north
)
qserv_areaspec_poly(
    v1Lon DOUBLE PRECISION, # [deg] Longitude angle of first polygon vertex
    v1Lat DOUBLE PRECISION, # [deg] Latitude angle of first polygon vertex
    v2Lon DOUBLE PRECISION, # [deg] Longitude angle of second polygon vertex
    v2Lat DOUBLE PRECISION, # [deg] Latitude angle of second polygon vertex
    ...
)
SELECT objectId FROM Object WHERE qserv_areaspec_box(0, 0, 3, 10)
Note that as discussed in the “Restrictions” section below, spatial constraints must be expressed through the qserv_areaspec_* functions.
Restrictions
This section covers restrictions you need to be aware of when interacting with Qserv.
Spatial constraints should be expressed through our qserv_areaspec_* functions
Spatial constraints should be expressed through qserv_areaspec_* functions (see Extensions section above for details). Any other way of specifying spatial restrictions may be significantly slower (e.g., it might devolve into a full table scan). For example, the form:
WHERE ra BETWEEN <ra1> AND <ra2> AND decl BETWEEN <decl1> AND <decl2>
even though it is equivalent to:
qserv_areaspec_box(<ra1>, <decl1>, <ra2>, <decl2>)
should not be used.
Spatial constraints must appear at the beginning of WHERE
A spatial constraint must appear at the very beginning of the WHERE clause (before or after the objectId constraint, if there is one).
Only one spatial constraint is allowed per query
Only one spatial constraint expressed through qserv_areaspec_* is allowed per query, e.g., these are examples of invalid queries:
WHERE qserv_areaspec_box(1, 35, 2, 38) AND qserv_areaspec_box(5, 77, 6, 78)
WHERE qserv_areaspec_box(1, 35, 2, 38) AND qserv_areaspec_circle(5, 77, 0.1)
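A client could check this rule before submitting a query. The following Python sketch simply counts qserv_areaspec_* calls with a regular expression; the helper name is hypothetical, and the check is deliberately naive (it does not skip string literals or comments, and it is not how Qserv itself validates queries).

```python
import re

# Matches the opening of any of the four qserv_areaspec_* functions.
AREASPEC_CALL = re.compile(
    r"\bqserv_areaspec_(?:box|circle|ellipse|poly)\s*\(", re.IGNORECASE
)


def count_spatial_constraints(sql):
    """Return how many qserv_areaspec_* calls appear in the query text."""
    return len(AREASPEC_CALL.findall(sql))


bad = "WHERE qserv_areaspec_box(1, 35, 2, 38) AND qserv_areaspec_circle(5, 77, 0.1)"
good = "SELECT objectId FROM Object WHERE qserv_areaspec_box(1, 35, 2, 38)"
print(count_spatial_constraints(bad), count_spatial_constraints(good))
```

A count greater than one indicates a query of the invalid kind shown above.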
Arguments passed to spatial constraints functions must be simple literals
The arguments passed to the qserv_areaspec_* functions must be simple literals. They may not contain expressions or references, e.g., they may not refer to columns.
Example of an invalid entry:
WHERE qserv_areaspec_box(3+4, ra*2, 0, 0)
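As an illustration, the literal-only rule can be approximated on the client side with a small Python check. The helper name and the exact definition of "simple literal" (an optionally signed decimal number) are assumptions of this sketch; Qserv's parser is the final authority on what it accepts.

```python
import re

# An optionally signed decimal number, with optional fraction and exponent.
NUMERIC_LITERAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def has_literal_args(call):
    """Return True if every argument of a qserv_areaspec_* call is a numeric literal."""
    match = re.fullmatch(r"\s*qserv_areaspec_\w+\s*\((.*)\)\s*", call)
    if match is None:
        return False
    args = [arg.strip() for arg in match.group(1).split(",")]
    return all(NUMERIC_LITERAL.fullmatch(arg) is not None for arg in args)


print(has_literal_args("qserv_areaspec_box(1, 35, 2, 38)"))
print(has_literal_args("qserv_areaspec_box(3+4, ra*2, 0, 0)"))
```

The first call passes the check; the second is rejected because `3+4` is an expression and `ra*2` refers to a column.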
OR is not allowed after qserv_areaspec_* constraint
If the query has extra constraints after the qserv_areaspec_* constraint, OR is not allowed immediately after qserv_areaspec_*, for example:
SELECT objectId, ra, decl, x FROM Object WHERE qserv_areaspec_box(1, 35, 2, 38) AND x > 3.5
is valid, but
SELECT objectId, ra, decl, x FROM Object WHERE qserv_areaspec_box(1, 35, 2, 38) OR x > 3.5
is not allowed. We expect to remove this restriction in the future, see DM-2888.
Secondary index constraint must be expressed through “=”, “IN”, or “BETWEEN”
If the query has an objectId constraint, it should be expressed in one of these three forms:
SELECT * FROM Object WHERE objectId = 123
SELECT * FROM Object WHERE objectId IN (123, 453, 3465)
SELECT * FROM Object WHERE objectId BETWEEN 123 AND 130
For example, do not try to express it as “WHERE objectId != 1” or “WHERE objectId > 123”.
Note, we expect to allow decomposing objectId into bitfields (e.g., for sampling) in the future. See DM-2889.
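When queries are generated programmatically, a small helper can keep objectId constraints within the supported forms. The Python sketch below is hypothetical and emits only the "=" and "IN" forms; it omits BETWEEN because a known problem with that form is reported later in this guide.

```python
def object_id_predicate(ids):
    """Build an objectId constraint using only the '=' and 'IN' forms."""
    ids = [int(i) for i in ids]  # int() also rejects non-numeric input
    if not ids:
        raise ValueError("at least one objectId is required")
    if len(ids) == 1:
        return f"objectId = {ids[0]}"
    return "objectId IN ({})".format(", ".join(str(i) for i in ids))


print("SELECT * FROM Object WHERE " + object_id_predicate([123]))
print("SELECT * FROM Object WHERE " + object_id_predicate([123, 453, 3465]))
```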
Column(s) used in ORDER BY or GROUP BY must appear in SELECT
At the moment we require columns used in ORDER BY or GROUP BY to be listed in SELECT. Example of an invalid query:
SELECT x FROM T ORDER BY y
Corrected version:
SELECT y, x FROM T ORDER BY y
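If the column lists are already known (for example, in a query builder), the rule reduces to a set-membership check. The Python helper below is a hypothetical sketch that works on column names rather than parsing SQL text.

```python
def missing_from_select(select_cols, order_by=(), group_by=()):
    """Return ORDER BY / GROUP BY columns that are absent from the SELECT list."""
    selected = set(select_cols)
    return [col for col in list(order_by) + list(group_by) if col not in selected]


print(missing_from_select(["x"], order_by=["y"]))       # the invalid query
print(missing_from_select(["y", "x"], order_by=["y"]))  # the corrected query
```

A non-empty result lists the columns that would have to be added to SELECT.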
Expressions/functions in ORDER BY clauses are not allowed
In SQL92 ORDER BY is limited to actual table columns, thus expressions or functions in ORDER BY are rejected. This is true for Qserv too.
Example of an invalid ORDER BY:
SELECT id, ABS(x) FROM Source ORDER BY ABS(x)
However, one can bypass this by using an alias, for example:
SELECT id, ABS(x) as ax FROM Source ORDER BY ax
Sub-queries are NOT supported
Sub-queries are not supported.
Commands that modify tables are disallowed
Commands for creating or modifying tables are disabled. These commands include “INSERT, UPDATE, LOAD INTO, CREATE, ALTER, TRUNCATE, DROP”. We will revisit this as we start adding support for Level 3.
Outer joins are not supported with near-neighbor queries
Qserv does not support LEFT or RIGHT joins with near-neighbor predicates.
MySQL-specific syntax is not supported
MySQL-specific syntax is not supported. Example of unsupported syntax that will be rejected: NAME_CONST.
Repeated column names through * are not supported
Queries with a * that resolves to a repeated column name are not supported. Example:
SELECT *, id FROM Object
will fail if the table Object has a column called “id”. Similarly, this query will fail:
SELECT o.*, s.* FROM Object AS o, Source AS s
if both tables Object and Source have a column called “id”.
A workaround would be to select columns explicitly and alias them, e.g.
SELECT o.id AS oId, s.id AS sId FROM Object AS o, Source AS s
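For wide tables, writing the aliases by hand is tedious, so the select list can be generated. The Python sketch below assumes the column names of each table are known to the caller; the helper name and the `<alias>_<column>` naming pattern are choices of this example only.

```python
def aliased_columns(tables):
    """Build a SELECT list that gives every column a unique alias.

    `tables` maps a table alias to the list of its column names.
    """
    parts = []
    for alias, columns in tables.items():
        for col in columns:
            parts.append(f"{alias}.{col} AS {alias}_{col}")
    return ", ".join(parts)


select_list = aliased_columns({"o": ["id", "ra"], "s": ["id", "ra"]})
print(f"SELECT {select_list} FROM Object AS o, Source AS s")
```

Prefixing each output name with the table alias guarantees uniqueness as long as the table aliases themselves are distinct.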
“USE INDEX()” is not supported
Qserv will reject any query with a “USE INDEX” hint.
Variables are not supported
You can’t select into a variable. For example:
SELECT scisql_s2CPolyToBin(...) FROM T INTO @poly
will fail. Related story: DM-2874.
User Defined Functions
A Qserv installation always comes with a set of predefined user-defined functions:
- spherical geometry, aimed at giving quick answers to the following sorts of questions:
  * Which points in a table lie inside a region on the sphere? For example, an astronomer might wish to know which stars and galaxies lie inside the region of the sky observed by a single camera CCD.
  * Which spherical regions in a table contain a particular point? For example, an astronomer might wish to know which telescope images overlap the position of interesting object X.
- photometry, aimed at providing conversions between raw fluxes, calibrated (AB) fluxes and AB magnitudes.
For details, see Science Tools for MySQL.
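To give a sense of what the photometry conversions compute, here is the standard AB magnitude definition written out in Python. This is not the library's code: the function names are hypothetical, and the choice of janskys as the flux unit (with the conventional zero point of about 3631 Jy) is an assumption of this sketch, so check the library documentation for the units its functions expect.

```python
import math

AB_ZERO_POINT_JY = 3631.0  # conventional flux density of a magnitude-0 AB source


def flux_to_ab_mag(flux_jy):
    """Convert a calibrated flux density in Jy to an AB magnitude."""
    return -2.5 * math.log10(flux_jy / AB_ZERO_POINT_JY)


def ab_mag_to_flux(mag):
    """Convert an AB magnitude back to a flux density in Jy."""
    return AB_ZERO_POINT_JY * 10 ** (-0.4 * mag)


print(flux_to_ab_mag(AB_ZERO_POINT_JY))
print(ab_mag_to_flux(flux_to_ab_mag(1.0)))
```

The two functions are inverses of each other, and fainter sources (smaller fluxes) map to larger magnitudes.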
Example Queries Supported
Counts and simple selections
You can count objects and run simple selections. A few examples:
Count the number of rows in a table
SELECT COUNT(*) FROM Object
Find rows with a particular id
SELECT * FROM Object WHERE objectId = <theId>
Select rows in a given area
SELECT objectId FROM Object WHERE qserv_areaspec_box(1, 35, 2, 38)
Select rows in a given area meeting certain criteria
SELECT COUNT(*) FROM Object WHERE qserv_areaspec_box(0.1, -6, 4, 6) AND x = 3.4 AND y BETWEEN 1 AND 2
Join two tables
SELECT s.ra, s.decl, o.raRange, o.declRange FROM Object o, Source s WHERE o.objectId = <theId> AND o.objectId = s.objectId
SELECT s.ra, s.decl, o.raRange, o.declRange FROM Object o JOIN Source s USING (objectId) WHERE o.objectId = <theId>
Find near neighbors in a given region
SELECT o1.objectId AS objId1, o2.objectId AS objId2, scisql_angSep(o1.ra_PS, o1.decl_PS, o2.ra_PS, o2.decl_PS) AS distance FROM Object o1, Object o2 WHERE qserv_areaspec_box(0, 0, 0.2, 1) AND scisql_angSep(o1.ra_PS, o1.decl_PS, o2.ra_PS, o2.decl_PS) < 0.05 AND o1.objectId <> o2.objectId
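The logic of the near-neighbor query can be mirrored in plain Python for a handful of rows. This is a sketch, not the library's implementation: `ang_sep_deg` uses the haversine formula and assumes, like the query above appears to, that separations are expressed in degrees; the sample positions are made up.

```python
import math


def ang_sep_deg(ra1, decl1, ra2, decl2):
    """Angular separation in degrees between two (ra, decl) points (haversine)."""
    ra1, decl1, ra2, decl2 = map(math.radians, (ra1, decl1, ra2, decl2))
    sin_ddecl = math.sin((decl2 - decl1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddecl ** 2 + math.cos(decl1) * math.cos(decl2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def near_neighbors(objects, max_sep_deg):
    """Return (id1, id2, separation) for every ordered pair closer than max_sep_deg."""
    pairs = []
    for id1, (ra1, d1) in objects.items():
        for id2, (ra2, d2) in objects.items():
            if id1 != id2:  # same role as o1.objectId <> o2.objectId
                sep = ang_sep_deg(ra1, d1, ra2, d2)
                if sep < max_sep_deg:
                    pairs.append((id1, id2, sep))
    return pairs


objects = {1: (0.10, 0.50), 2: (0.10, 0.52), 3: (0.15, 0.90)}
for id1, id2, sep in near_neighbors(objects, 0.05):
    print(id1, id2, round(sep, 4))
```

As in the SQL version, each close pair is reported twice, once in each order. The nested loop is quadratic, which is why restricting the query to a small region with a spatial constraint matters.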
LIMIT, ORDER BY
Limit results, sort results
SELECT * FROM Object WHERE x > 4 ORDER BY x LIMIT 100
The list of all known / reported problems can be found at: Data Access and Database Team User-facing Bugs.
Selecting by objectId can miss a row
Selecting rows using objectId sometimes does not return rows it should. For details, see: DM-2864.
WHERE objectId BETWEEN fails
Although BETWEEN is listed above as a supported form for objectId constraints, queries in the form “WHERE objectId BETWEEN” are currently best avoided: Qserv returns a cryptic message when such a query is executed. For details, see DM-2873.
Notes on Performance
Use objectId when selecting sources
If you need to locate a small number of sources, try to use objectId if you can. If you don’t, your query will require an index scan for every chunk of the Source table (which can potentially mean thousands of chunk-queries). For example, this query requires such a scan:
SELECT * FROM Source WHERE sourceId = 500
but asking for sources related to a given object, like this one:
SELECT * FROM Source WHERE objectId = 123 AND sourceId = 500
will require an index scan for just a single chunk, and thus will typically be much faster.
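The difference in fan-out can be sketched as a tiny planning function in Python. Everything here is hypothetical (the function, the dictionary standing in for the objectId index, and the chunk numbers); it only illustrates why adding an objectId constraint reduces the work from all chunks to one.

```python
def chunks_to_query(all_chunks, object_index, object_id=None):
    """Return the chunks that a Source query has to visit."""
    if object_id is None:
        return list(all_chunks)  # no objectId constraint: every chunk is queried
    chunk = object_index.get(object_id)
    return [] if chunk is None else [chunk]


all_chunks = range(1000)
object_index = {123: 42}  # objectId -> chunkId, standing in for the secondary index

print(len(chunks_to_query(all_chunks, object_index)))       # sourceId only
print(chunks_to_query(all_chunks, object_index, object_id=123))  # with objectId
```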