Grammar of JSON Queries

See here for a more comprehensive tutorial

Scott McKellar

Introduction

The format of this grammar approximates Extended Backus-Naur notation. However it is intended as input to human beings, not to parser generators such as Lex or Yacc. Do not expect formal rigor. Sometimes narrative text will explain things that are clumsy to express in formal notation. More often, the text will restate or summarize the formal productions.

Conventions:

The grammar is a series of productions.
A production consists of a name, followed by "::=", followed by a definition for the name. The name identifies a grammatical construct that can appear on the right side of another production.
Literals (including punctuation) are enclosed in 'single quotes', or in "double quotes" if case is not significant.
A single quotation mark within a literal is escaped with a preceding backslash: 'dog\'s tail'.
If a construct can be defined more than one way, then the alternatives may appear in separate productions; or, they may appear in the same production, separated by pipe symbols. The choice between these representations is of only cosmetic significance.
A construct enclosed within square brackets is optional.
A construct enclosed within curly braces may be repeated zero or more times.
JSON allows arbitrary white space between tokens. To avoid ugly clutter, this grammar ignores the optional white space.
In many cases a production defines a JSON object, i.e. a list of name-value pairs, separated by commas. Since the order of these name/value pairs is not significant, the grammar will not try to show all the possible sequences. In general it will present the required pairs first, if any, followed by any optional elements.

Since both EBNF and JSON use curly braces and square brackets, pay close attention to whether these characters are in single quotes. If they're in single quotes, they are literal elements of the JSON notation. Otherwise they are elements of the EBNF notation.

Primitives

We'll start by defining some primitives, to get them out of the way. They're mostly just what you would expect.

[1]	string	`::=`	'"' chars '"'
[2]	chars	`::=`	any valid sequence of UTF-8 characters, with certain special characters escaped according to JSON rules
[3]	integer_literal	`::=`	[ sign ] digit { digit }
[4]	sign	`::=`	'+' \| '-'
[5]	digit	`::=`	digit = '0' \| '1' \| '2' \| '3' \| '4' \| '5' \| '6' \| '7' \| '8' \| '9'
[6]	integer_string	`::=`	'"' integer_literal '"'
[7]	integer	`::=`	integer_literal \| integer_string
[8]	number	`::=`	any valid character sequence that is numeric according to JSON rules

When json_query requires an integral value, it will usually accept a quoted string and convert it to an integer by brute force -- to zero if necessary. Likewise it may truncate a floating point number to an integral value. Scientific notation will be accepted but may not give the intended results.

[9] boolean ::= 'true' | 'false' | string | number

The preferred way to encode a boolean is with the JSON reserved word true or false, in lower case without quotation marks. The string true, in upper, lower, or mixed case, is another way to encode true. Any other string evaluates to false.

As an accommodation to perl, numbers may be used as booleans. A numeric value of 1 means true, and any other numeric value means false.

Any other valid JSON value, such as an array, will be accepted as a boolean but interpreted as false.

The last couple of primitives aren't really very primitive, but we introduce them here for convenience:

[10] class_name ::= string

A class_name is a special case of a string: the name of a class as defined by the IDL. The class may refer either to a database table or to a source_definition, which is a subquery.

[11] field_name ::= string

A field_name is another special case of a string: the name of a non-virtual field as defined by the IDL. A field_name is also a column name for the table corresponding to the relevant class.

Query

The following production applies not only to the main query but also to most subqueries.

[12] query ::= '{'
'"from"' ':' from_list
[ ',' '"select"' ':' select_list ]
[ ',' '"where"' ':' where_condition ]
[ ',' '"having"' ':' where_condition ]
[ ',' '"order_by"' ':' order_by_list ]
[ ',' '"limit"' ':' integer ]
[ ',' '"offset"' ':' integer ]
[ ',' '"distinct"' ':' boolean ]
[ ',' '"no_i18n"' ':' boolean ]
'}'

Except for the "distinct" and no_i18n entries, each name/value pair represents a major clause of the SELECT statement. The name/value pairs may appear in any order.

There is no name/value pair for the GROUP BY clause, because json_query generates it automatically according to information encoded elsewhere.

The "distinct" entry, if present and true, tells json_query that it may have to create a GROUP BY clause. If not present, it defaults to false.

The "no_i18n" entry, if present and true, tells json_query to suppress internationalization. If not present, it defaults to false. (Note that "no_i18n" contains the digit one, not the letter ell.)

The values for limit and offset provide the arguments of the LIMIT and OFFSET clauses, respectively, of the SQL statement. Each value should be non-negative, if present, or else the SQL won't work.

FROM Clause

The object identified by "from" encodes the FROM clause of the SQL. The associated value may be a string, an array, or a JSON object.

[13] from_list ::= class_name

If from_list is a class_name, the json_query inserts the corresponding table name or subquery into the FROM clause, using the class_name as an alias for the table or subquery. The class must be defined as non-virtual in the IDL.

[14]	from_list	`::=`	'[' string { ',' parameter } ']'
[15]	parameter	`::=`	string \| number \| 'null'

If from_list is a JSON array, then it represents a table-like function from which the SQL statement will select rows, using a SELECT clause consisting of "SELECT *" (regardless of the select_list supplied by the method parameter).

The first entry in the array is the name of the function. It must be a string naming a stored function. Each subsequent entry is a function parameter. If it is a string or a number, json_query will insert it into a comma-separated parameter list, enclosed in quotes, with any special characters escaped as needed. If it is the JSON reserved word null, json_query will insert it into the parameter list as a null value.

If from_list is a JSON object, it must contain exactly one entry. The key of this entry must be the name of a non-virtual class defined in the IDL. This class will be the top-level class of the FROM clause, the only one named outside of a JOIN clause.

[16]	from_list	`::=`	'{' class_name ':' join_list '}'
[17]	join_list	`::=`	class_name
[18]	join_list	`::=`	'{' join_def { ',' join_def } '}'

If the associated data is a class_name, json_query will construct an INNER JOIN clause joining the class to the top-level clause, using the columns specified by the IDL for such a join.

Otherwise, the associated data must be a JSON object with one or more entries, each entry defining a join:

[19] join_def ::= class_name ':'
'{'
[ '"type"' ':' string ]
[ '"field"' ':' field_name ]
[ '"fkey"' ':' field_name ]
[ '"filter"' ':' where_condition ]
[ '"filter_op"' ':' string ]
[ '"join"' ':' join_list ]
'}'

The data portion of the "join_type" entry tells json_query whether to use a left join, right join, full join, or inner join. The values "left", "right", and "full", in upper, lower, or mixed case, have the obvious meanings. If the "join_type" entry has any other value, or is not present, json_query constructs an inner join.

The "field" and "fkey" attributes specify the columns to be equated in the join condition. The "field" attribute refers to the column in the joined table, i.e. the one named by the join_def. The "fkey" attribute refers to the corresponding column in the other table, i.e. the one named outside the join_def -- either the top-level table or a table named by some other join_def.

It may be tempting to suppose that "fkey" stands for "foreign key", and therefore refers to a column in the child table that points to the key of a parent table. Resist the temptation; the labels are arbitrary. The json_query method doesn't care which table is the parent and which is the child.

These relationships are best explained with an example. The following from_list:

	{
	    "aou": {
	        "asv": {
	            "type" : "left",
	            "fkey" : "id",
	            "field" : "owner"
	        }
	    }
	}

...turns into the following FROM clause:

	FROM
	    actor.org_unit AS "aou"
	        LEFT JOIN action.survey AS "asv"
	            ON ( "asv".owner = "aou".id )

Note in this example that "fkey" refers to a column of the class "aou", and "field" refers to a column of the class "asv".

If you specify only one of the two columns, json_query will try to identify the other one from the IDL. However, if you specify only the column from the parent table, this attempt will probably fail.

If you specify both columns, json_query will use the column names you specify, without verifying them with a lookup in the IDL. By this means you can perform a join using a linkage that the IDL doesn't define. Of course, if the columns don't exist in the database, the query will fail when json_query tries to execute it.

Using the columns specified, either explicitly or implicitly, the json_query method constructs a join condition. With raw SQL it is possible (though rarely useful) to join two tables by an inequality. However the json_query method always uses a simple equality condition.

Using a "filter" entry in the join_def, you can apply one or more additional conditions to the JOIN clause, typically to restrict the join to certain rows of the joined table. The data associated with the "filter" key is the same sort of where_condition that you use for a WHERE clause (discussed below).

If the string associated with the "filter_op" entry is "OR" in upper, lower, or mixed case, then the json_query method uses OR to connect the standard join condition to any additional conditions supplied by a "filter" entry.

(Note that if the where_condition supplies multiple conditions, they will be connected by AND. You will probably want to move them down a layer -- enclose them in parentheses, in effect -- to avoid a confusing mixture of ANDs and ORs.)

If the "filter_op" entry carries any other value, or if it is absent, then the json_query method uses AND. In the absence of a "filter" entry, "filter_op" has no effect.

A "join" entry in a join_def specifies another layer of join. The class named in the subjoin is joined to the class named by the join_def to which it is subordinate. By this means you can encode multiple joins in a hierarchy.

SELECT Clause

If a query does not contain an entry for "select", json_query will construct a default SELECT clause. The default includes every non-virtual field from the top-level class of the FROM clause, as defined by the IDL. The result is similar to SELECT *, except:

The default includes only the fields defined in the IDL.
The columns will appear in the same order in which they appear in the IDL, regardless of the order in which the database defines them.

There are other ways to specify a default SELECT list, as shown below.

If a "select" entry is present, the associated value must be a JSON object, keyed on class names:

[20] select_list ::= '{' class_name ':' field_list { ',' class_name ':' field_list } '}'

The class_name must identify either the top-level class or a class belonging to one of the joins. Otherwise json_query will silently ignore the select_list.

[21] field_list ::= 'null' | '"*"'

If a field_list is either the JSON reserved word null (in lower case) or an asterisk in double quotes, json_query constructs a default SELECT list -- provided that the class is the top-level class of the query. If the class belongs to a join somewhere, json_query ignores the field_list.

More commonly, the field_list is a JSON array of zero or more field specifications:

[22] field_list ::= '[' [ field_spec { ',' field_spec } ] ']'

If the array is empty, json_query will construct a default SELECT list for the class -- again, provided that the class is the top-level class in the query.

In the simplest case, a field specification may name a non-virtual field defined in the IDL:

[23] field_spec ::= field_name

In some cases json_query constructs a call to the oils_i18n_xlate function to internationalize the value of the selected column. Specifically, it does so if all the following are true:

the settings file defines a locale;
in the field definition for the field in the IDL, the tag "il8n" is present and true;
the query does not include the "no_il8n" tag (or includes it with a value of false).

A field specification may be a JSON object:

[24] field_spec ::= '{'
'"column"' ':'
[ ',' '"alias"' ':' string ]
[ ',' '"aggregate"' ':' boolean ]
[ ',' transform_spec ]
'}'

The "column" entry provides the column name, which must be defined as non-virtual in the IDL.

The "alias" entry provides a column alias. If no alias is specified, json_query uses the column name as its own alias.

The "aggregate" entry has no effect on the SELECT clause itself. Rather, it affects the construction of a GROUP BY class. If there is an "aggregate" entry for any field, then json_query builds a GROUP BY clause listing every column that is not tagged for aggregation (or that carries an "aggregate" entry with a value of false). If all columns are tagged for aggregation, then json_query omits the GROUP BY clause.

[25] transform_spec ::= '"transform"' ':' string ]
[ ',' '"result_field" ':' string ]
[ ',' '"params" ':' param_list ]

When a transform_spec is present, json_query selects the return value of a function instead of selecting the column directly. The entry for "transform" provides the name of the function, and the column name (as specified by the "column" tag), qualified by the class name, is the argument to the function. For example, you might use such a function to format a date or time, or otherwise transform a column value. You might also use an aggregate function such as SUM, COUNT, or MAX (possibly together with the "aggregate" tag).

The "result_field" entry, when present, specifies a subcolumn of the function's return value. The resulting SQL encloses the function call in parentheses, and follows it with a period and the subcolumn name.

The "params" entry, if present, provides a possibly empty array of additional parameter values, either strings, numbers, or nulls:

[26] param_list ::= '[' [ parameter { ',' parameter } ] ']'

Such parameter values are enclosed in single quotes, with any special characters escaped as needed, and inserted after the column name as additional parameters to the function. You might, for example, use an additional parameter to provide a format string for a reformatting function.

WHERE Clause

There are two types of where_condition: objects and arrays. Of these, the object type is the more fundamental, and occurs at some level in every where_condition. The array type is mainly a way of circumventing a limitation of the object type.

The object type of where_condition is a comma-separated list of one or more conditions:

[27] where_condition ::= '{' condition { ',' condition } '}'

The generated SQL will include a code fragment for each condition, joined by AND (or in some cases by OR, as described below).

As usual for entries in a JSON object, each condition consists of a unique string to serve as a key, a colon, and an associated value.

The key string may be the name of a column belonging to the relevant table, or it may be an operator string. In order to distinguish it from any possible column name, an operator string always begins with a plus sign or minus sign.

JSON requires that every key string be unique within an object. This requirement imposes some awkward limitations on a JSON query. For example, you might want to express two conditions for the same column: id > 10 and id != 25. Since each of those conditions would have the same key string, namely "id", you can't put them into the same JSON object.

The solution is to put such conflicting conditions in separate JSON objects, and put the objects into an array:

[28] where_condition ::= '[' where_condition { ',' where_condition } ']'

The resulting SQL encloses each subordinate set of conditions in parentheses, and connects the sets with AND (or in some cases OR, as described below). It's possible to put only a single where_condition in the array; the result is to add a layer of parentheses around the condition.

There are two kinds of condition where the operator begins with a plus sign. In the simpler case, the associated data is simply a column name:

[29] condition ::= plus_class ':' field_name

A plus_class is a string that begins with a plus sign. The rest of the string, after the plus sign, must be the class name for the table to which the column belongs.

If the column is a boolean, then the resulting SQL uses it (as qualified by the class name) as a stand-alone condition.

Otherwise, this kind of syntax provides a way to place a column on the right side of a comparison operator. For example:

	{
	    "from":"aou",
	    "select": { "aou":[ "id", "name" ] },
	    "where": {
	        "id": {
	            ">": { "+aou":"parent_ou" }
	        }
	    }
	}

The resulting SQL:

	SELECT
	    "aou".id AS "id",
	    "aou".name AS "name"
	FROM
	    actor.org_unit AS "aou"
	WHERE
	    (
	        "aou".id > (  "aou".parent_ou  )
	    );

The other type of condition that uses a plus_class applies a specified class name to a where_condition:

[30] condition ::= plus_class ':' where_condition

The resulting SQL is enclosed in parentheses, and qualifies the columns with the specified class name. This syntax provides a mechanism to shift the class context -- i.e. to refer to one class in a context that would otherwise refer to a different class.

Ordinarily the class name must be a valid non-virtual class defined in the IDL, and applicable to the associated where_condition. There is at least one peculiar exception. The JSON fragment:

	"+abc": { "+xyz":"frobozz" }

...is rendered as:

	(  "xyz".frobozz  )

...even though neither "abc", nor "xyz", nor "frobozz" is defined in the IDL. The class name "abc" isn't used at all because the "+xyz" operator overrides it. Such a query won't fail until json_query tries to execute it in the database.

The other operators that may occur at this level all begin with a minus sign, and they all represent familiar SQL operators. For example, the "-or" operator joins the conditions within a where_condition by OR (instead of the default AND), and encloses them all in parentheses:

[31] condition ::= '"-or"' ':' where_condition

In fact the "-or" operator is the only way to get OR into the WHERE clause.

The "-and" operator is similar, except that it uses AND:

[32] condition ::= '"-and"' ':' where_condition

Arguably the "-and" operator is redundant, because you can get the same effect by wrapping the subordinate where_condition in a JSON array. Either technique merely adds a layer of parentheses, since AND connects successive conditions by default.

The "-not" operator expands the subordinate where_condition within parentheses, and prefaces the result with NOT:

[33] condition ::= '"-not"' ':' where_condition

The "-exists" or "-not-exists" operator constructs a subquery within an EXISTS or NOT EXISTS clause, respectively:

[34]	condition	`::=`	'"-exists"' ':' query
[35]	condition	`::=`	'"-not-exists"' ':' query

The remaining kinds of condition all have a field_name on the left and some kind of predicate on the right. A predicate places a constraint on the value of the column -- or, in some cases, on the value of the column as transformed by some function call:

[36] condition ::= field_name ':' predicate

The simplest such constraint is to require that the column have a specific value, or be null:

[37]	predicate	`::=`	lit_value \| 'null'
[38]	lit_value	`::=`	string \| number

You can also compare a column to a literal value using some kind of inequality. However it's a bit more complicated because you have to specify what kind of comparison to make:

[39]	predicate	`::=`	'{' compare_op ':' lit_value '}'
[40]	compare_op	`::=`	string

A compare_op is a string that defines a comparison operator. Valid values include the following:

	=    <>   !=
	<    >    <=   >=
	~    ~*   !~   !~*
	like      ilike
	similar to

Strictly speaking, json_query accepts any compare_op that doesn't contain semicolons or white space (or "similar to" as a special exception). As a result, it is possible -- and potentially useful -- to use a custom operator like ">100*" in order to insert an expression that would otherwise be difficult or impossible to create through a JSON query. The ban on semicolons and white space prevents certain kinds of SQL injection.

Note that json_query does not accept two operators that PostgreSQL does accept: "is distinct from" and "is not distinct from".

You can also compare a column to a null value:

[41] predicate ::= '{' compare_op ':' 'null' '}'

The equality operator "=" turns into IS NULL. Any other operator turns into IS NOT NULL.

When a compare_op is paired with an array, it defines a function call:

[42] predicate ::= '{' compare_op ':' '[' string { ',' parameter } ']' '}'

The first entry in the array is the function's name. Subsequent entries in the array, if any, represent the parameters of the function call. They may be strings, numbers, or nulls. In the generated SQL, the function call appears on the right of the comparison.

The "between" operator creates a BETWEEN clause:

[43] predicate ::= '{' "between" ':' '[' lit_value ',' lit_value ']' '}'

Although PostgreSQL allows a null value in a BETWEEN clause, json_query requires literal non-null values. It isn't sensible to use null values in a BETWEEN clause. A few experiments show that the results of the comparison are peculiar and erratic.

There are two ways to create an IN list of allowed values. The simplest is to put literal values into a JSON array:

[44] predicate ::= '[' lit_value { ',' lit_value } ']'

As with BETWEEN clauses, json_query does not accept null values in an IN list, even though PostgreSQL does allow them. Nulls are not sensible in this context because they never match anything.

Having Clause

For the HAVING clause, json_query accepts exactly the same syntax as it accepts for the WHERE clause.

The other way to create an IN list is to use an explicit "in" operator with an array of literal values. This format also works for the "not in" operator:

[45]	predicate	`::=`	'{' in_operator ';' '[' lit_value [ ',' lit_value ] ']' '}'
[46]	in_operator	`::=`	"in" \| "not in"

Another kind of IN or NOT IN clause uses a subquery instead of a list of values:

[47] predicate ::= '{' in_operator ':' query '}'

The remaining types of predicate can put a function call on the left of the comparison, by using a transform_spec together with a "value" tag. The transform_spec is optional, and if you don't need it, the same SQL would in many cases be easier to express by other means.

The transform_spec construct was described earlier in connection with the SELECT clause, but here it is again:

[48] transform_spec ::= '"transform"' ':' string ]
[ ',' '"result_field" ':' string ]
[ ',' '"params" ':' param_list ]

As in the SELECT clause, the "transform" string names the function. The first parameter is always the column identified by the field_name. Additional parameters, if any, appear in the param_list. The "result_field" string, if present, identifies one column of a multicolumn return value.

Here's a second way to compare a value to a literal value (but not to a null value):

[49] predicate ::= '{' compare_op ':' '{' '"value"' ':' lit_value
[ transform_spec ] '}' '}'

...and a way to compare a value to a boolean expression:

[50] predicate ::= '{' compare_op ':' '{' '"value"' ':' '{'
condition { ',' condition } [ transform_spec ] '}' '}'

The final predicate is another way to put a function call on the right side of the comparison:

[51] predicate ::= '{' compare_op ':' '{' '"value"' ':' '['
string { ',' parameter } ']' [ transform_spec ] '}' '}'

This format is available for the sake of consistency, but offers no advantage over the simpler version.

ORDER BY Clause

There are two ways to encode an ORDER BY clause: as an array, or as a list. Either may be empty, in which case the generated SQL will not include an ORDER BY clause:

[52] order_by_list ::= '[' ']' | '{' '}'

If not empty, the array contains one or more objects, each defining a sort field:

[53]	order_by_list	`::=`	'{' sort_field_def { ',' sort_field_def } '}'
[54]	sort_field_def	`::=`	'{' '"class"' ':' class_name ',' '"field"' ':' field_name [ ',' '"direction"' ':' lit_value ] [ ',' transform_spec ] '}'

The "class" and "field" entries are required, and of course the field must belong to the class. Furthermore, at least one field from the class must appear in the SELECT clause.

The "direction" entry, if present, specifies whether the sort will be ascending or descending for the column in question. If the associated value begins with "D" or "d", the sort will be descending; otherwise the sort will be ascending. If the value is a number, it will be treated as a string that does not start with "D" or "d", resulting in an ascending sort.

In the absence of a "direction" entry, the sort will be ascending.

The transform_spec works here the same way it works in the SELECT clause and the WHERE clause, enabling you to pass the column through a transforming function before the sort:

[55] transform_spec ::= '"transform"' ':' string ]
[ ',' '"result_field" ':' string ]
[ ',' '"params" ':' param_list ]

When the order_by_list is an object instead of an array, the syntax is less verbose, but also less flexible. The keys for the object are class names:

[56] order_by_list ::= '{' class_name ':' sort_class_def
{ ',' class_name ':' sort_class_def } '}'

Each class must be referenced in the SELECT clause.

As in the SELECT clause, all the fields for a given class must be grouped together. You can't sort by a column from one table, then a column from a second table, then another column from the first table. If you need this kind of sort, you must encode the ORDER BY clause as an array instead of an object.

The data associated with a class_name may be either an array or an object. If an array, it's simply a list of field names, and each field must belong to the class:

[57] sort_class_def ::= '[' field_name { ',' field_name } ']'

With this syntax, the direction of sorting will always be ascending.

If the data is an object, the keys are field names, and as usual the fields must belong to the class:

[58] sort_class_def ::= '{' field_name ':' sort_class_subdef
{ ',' field_name ':' sort_class_subdef } '}'

Since the field_name is the key for the object, it cannot appear more than once. As a result, some kinds of sorts are not possible with this syntax. For example, one might want to sort by UPPER( family_name ), and then by family_name with case unchanged, to make sure that "diBona" comes before "Dibona". For situations like this, you must encode the ORDER BY clause as an array rather than an object.

The data associated with each field_name may take either of two forms. In the simplest case, it's a literal value to specify the direction of sorting:

[59] sort_class_subdef ::= lit_value

If the literal is a string starting with "D" or "d", json_query sorts the field in descending order. Otherwise it sorts the field in ascending order.

In other cases, the field_name may be paired with an object to specify more details:

[60] sort_class_subdef ::= '{'
[ '"direction"' ':' lit_value ]
[ transform_spec ]
'}'

As before, the value tagged as "direction" specifies the direction of the sort, depending on the first character. If not otherwise specified, the sort direction defaults to ascending.

Also as before, the transform_spec may specify a function through which to pass the column.

Since both the "direction" tag and the transform_spec are optional, the object may be empty:

[61] sort_class_subdef ::= '{' '}'