RFC-0051: EXCLUDE Clause #51

Open: wants to merge 8 commits into base: main

alancai98 (Member) commented Nov 11, 2023:

Issue #, if available: partiql/partiql-lang#27

RFC to define the EXCLUDE operator and its definition in terms of existing PartiQL operators. The current reference implementation is in partiql-lang-kotlin's EvaluatingCompiler (though I'm currently working to port it to the PhysicalPlanCompilerImpl).

Rendered doc

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@alancai98 alancai98 self-assigned this Nov 11, 2023
@alancai98 alancai98 changed the title [WIP] EXCLUDE clause RFC [WIP] EXCLUDE clause RFC draft Nov 11, 2023
@alancai98 alancai98 force-pushed the exclude-rfc branch 2 times, most recently from 595268b to dc140e4 Compare November 11, 2023 01:52
@alancai98 alancai98 marked this pull request as draft November 11, 2023 01:55
@alancai98 alancai98 changed the title [WIP] EXCLUDE clause RFC draft EXCLUDE clause RFC draft Nov 15, 2023
@alancai98 alancai98 marked this pull request as ready for review November 15, 2023 23:57
@alancai98 alancai98 changed the title EXCLUDE clause RFC draft RFC-0051: EXCLUDE Clause Nov 16, 2023
@alancai98 alancai98 added the RFC label Nov 16, 2023

Member commented:

A discussion from the original issue revolves around replacing items rather than just excluding them. A major use-case of PartiQL is using PartiQL as a means of performing transformations on semi-structured, open-schema data. Mentioned in the issue are also customers who have 1000+ columns in their source tables.

From how I've been reading this RFC, we might be able to provide a useful work-around -- at least for top-level values. We can take advantage of the fact that LET evaluates before EXCLUDE. See below:

SELECT t.*, someItemThatHasBeenReplaced
EXCLUDE t.b
FROM t
LET t.b + 1 AS someItemThatHasBeenReplaced

For nested attributes, however, I couldn't immediately find an intuitive solution.

With this RFC, do you expect any future necessary RFC's to add support for REPLACE? If so, in your opinion, does this RFC impede or allow for the addition of REPLACE?

Member Author commented:

> With this RFC, do you expect any future necessary RFC's to add support for REPLACE?

That was my assumption, and the intent is to leave REPLACE out of scope for this PR. REPLACE is included in the "Future possibilities" section of the RFC.

> If so, in your opinion, does this RFC impede or allow for the addition of REPLACE?

I need to think more about the relationship between EXCLUDE and REPLACE. I think the syntactic rewrite included in the RFC could be adapted to support REPLACE, so I don't believe this RFC impedes the addition of REPLACE. After I get back from the Thanksgiving holiday, I'll look more into whether the syntactic rewrite approach could be applied to nested attributes of REPLACE.

Member Author commented:

Playing around a bit with the rewrite rules from the RFC, we could do something similar in the nested case branches for REPLACE of nested attributes. For example, using the query from example-tuple-attribute-as-final-step, if we had added the REPLACE clause: REPLACE t.b.field_x AS t.b.field_x * 42, the rewrite could add a WHEN branch like

WHEN LOWER(attr_1) = LOWER('b') THEN
    CASE 
        WHEN v_1 IS STRUCT THEN (
            PIVOT (
                CASE 
                    WHEN LOWER(attr_2) = LOWER('field_x') THEN v_2 * 42
                    ELSE v_2
                END
            ) AT attr_2
            FROM UNPIVOT v_1 AS v_2 AT attr_2
        )
        ELSE v_1
    END
ELSE v_1

Member Author commented:

The full query could look something like:

-- EXCLUDE t.a.field_x
-- REPLACE t.b.field_x AS t.b.field_x * 42
SELECT t.*
FROM (
    SELECT VALUE {
        't':
            CASE
                WHEN t IS STRUCT THEN (
                    PIVOT (
                        CASE
                            WHEN LOWER(attr_1) = LOWER('a') THEN
                                CASE
                                    WHEN v_1 IS STRUCT THEN (
                                        PIVOT v_2 AT attr_2
                                        FROM UNPIVOT v_1 AS v_2 AT attr_2
                                        WHERE LOWER(attr_2) NOT IN [LOWER('field_x')]
                                    )
                                    ELSE v_1
                                END
                            WHEN LOWER(attr_1) = LOWER('b') THEN
                                CASE 
                                    WHEN v_1 IS STRUCT THEN (
                                        PIVOT (
                                            CASE 
                                                WHEN LOWER(attr_2) = LOWER('field_x') THEN v_2 * 42
                                                ELSE v_2
                                            END
                                        ) AT attr_2
                                        FROM UNPIVOT v_1 AS v_2 AT attr_2
                                    )
                                    ELSE v_1
                                END
                            ELSE v_1
                        END
                    ) AT attr_1 FROM UNPIVOT t AS v_1 AT attr_1
                )
                ELSE t
            END
    }
    FROM <<
    {
        'a': { 'field_x': 0, 'field_y': 'zero' },  -- `field_x` excluded
        'b': { 'field_x': 1, 'field_y': 'one' },   -- `field_x` replaced with `field_x` * 42
        'c': { 'field_x': 2, 'field_y': 'two' }
    }
    >> AS t
)

which the Kotlin implementation outputs as:

<<
  {
    'a': {
      'field_y': 'zero'
    },
    'b': {
      'field_x': 42,
      'field_y': 'one'
    },
    'c': {
      'field_x': 2,
      'field_y': 'two'
    }
  }
>>
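
The effect of the rewritten query can be cross-checked with a small model outside PartiQL. The following is only a sketch, assuming tuples are modeled as Python dicts (so duplicate attribute names are not representable) and that attribute matching is case-insensitive, as the `LOWER(...)` comparisons in the rewrite suggest. The function name `exclude_and_replace` is hypothetical and not part of any PartiQL implementation.

```python
def exclude_and_replace(t):
    """Model of the rewrite: drop a.field_x and multiply b.field_x by 42."""
    if not isinstance(t, dict):            # outer `ELSE t`: not a tuple, pass through
        return t
    out = {}
    for attr_1, v_1 in t.items():          # UNPIVOT t AS v_1 AT attr_1
        if attr_1.lower() == "a" and isinstance(v_1, dict):
            # EXCLUDE t.a.field_x: keep every attribute except field_x
            out[attr_1] = {k: v for k, v in v_1.items() if k.lower() != "field_x"}
        elif attr_1.lower() == "b" and isinstance(v_1, dict):
            # REPLACE t.b.field_x: rewrite field_x, keep the rest as-is
            out[attr_1] = {
                k: (v * 42 if k.lower() == "field_x" else v) for k, v in v_1.items()
            }
        else:
            out[attr_1] = v_1              # `ELSE v_1`: untouched attribute or type mismatch
    return out


row = {
    "a": {"field_x": 0, "field_y": "zero"},
    "b": {"field_x": 1, "field_y": "one"},
    "c": {"field_x": 2, "field_y": "two"},
}
result = exclude_and_replace(row)
```

On this input the model produces the same tuple as the output shown above: `a` loses `field_x`, `b.field_x` becomes 42, and `c` is unchanged.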


Why is `EXCLUDE` modeled as a binding tuple operator as opposed to a value expression?::

We had also considered modeling `EXCLUDE` as a value operation evaluated after the `<select clause>`. Evaluating `EXCLUDE` last could contradict the PartiQL specification's assertion that the `<select clause>` is evaluated last, which may add confusion. There were also some additional edge cases that complicated defining `EXCLUDE` as a value operator. For example, let's look at the following query:

Member commented:

> Evaluating EXCLUDE last could contradict the PartiQL specification's assertion that the <select clause> is evaluated last, which may add confusion.

Did you explore EXCLUDE as a value operation (maybe just a special-form) evaluated during the <select clause>? Similar to how any other special-form function invocation would work in the projection list? For example:

SELECT t EXCLUDE "a", s EXCLUDE d[0]
FROM <<
  { 'a': 1, 'b': 2 }
>> AS t, <<
  { 'c': 3, 'd': [ 0, 1, 2 ] }
>> AS s

This would maintain the PartiQL specification's assertion that the <select clause> is evaluated last, and it gives end-users flexibility in when and where they can strip out attributes. Since it is a value expression, it could also potentially be combined with other EXCLUDEs or REPLACEs. Consider a simple example with nested attributes:

SELECT
  person
    REPLACE
      info.age WITH info.age + 1,
      info.ssn_encrypted WITH udf_encrypt(info.ssn) -- some user-defined function to encrypt the SSN
    EXCLUDE info.ssn -- let's exclude the raw SSN
FROM <<
  { 'info': { 'age': 1, 'ssn': 10} }
>> AS person

Assuming the expressions are evaluated left-to-right, the output could look like:

<<
  { 'info': { 'age': 2, 'ssn_encrypted': xy190d...sws8ch } }
>>

alancai98 (Member Author) commented Nov 21, 2023:

> Did you explore EXCLUDE as a value operation (maybe just a special-form) evaluated during the <select clause>? Similar to how any other special-form function invocation would work in the projection list?

Could you clarify what you mean by special-form function? Is this referring to non-standard function syntax (kinda like what substring, overlay do)?

Regarding modeling EXCLUDE as a value operation/function evaluated within the SELECT clause, we had also explored this option. These aren't quite functions though, since function arguments expect value expressions and exclude paths do not return values.

The way you modeled the expressions above, evaluated left-to-right, could be problematic or confusing. Consider SELECT EXCLUDE t.a[0], EXCLUDE t.a[1].field FROM .... Should the second exclude path remove field from the element originally at index 1 of a, or from the element at index 1 after the first exclude path has been applied (i.e. the element originally at index 2)? We found it more intuitive to treat the paths as associative and evaluate them in parallel (i.e. apply all exclude paths to the same value/binding).
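
To make the ambiguity concrete, here is a minimal sketch that contrasts the two readings on a three-element array. It assumes arrays are modeled as Python lists and tuples as dicts; the helper names (`drop_field`, `sequential`, `parallel`) and the sample data are made up for illustration.

```python
def drop_field(elem, name):
    """Remove attribute `name` from a tuple; leave non-tuples untouched."""
    if isinstance(elem, dict):
        return {k: v for k, v in elem.items() if k != name}
    return elem


def sequential(a):
    """Apply t.a[0] first, then resolve t.a[1].field against the shrunken array."""
    shrunk = [v for i, v in enumerate(a) if i != 0]
    return [drop_field(v, "field") if i == 1 else v for i, v in enumerate(shrunk)]


def parallel(a):
    """Resolve both paths against the original array positions."""
    out = []
    for i, v in enumerate(a):
        if i == 0:
            continue                      # t.a[0]: element removed
        out.append(drop_field(v, "field") if i == 1 else v)
    return out


a = [
    {"field": 0, "id": "x"},
    {"field": 1, "id": "y"},
    {"field": 2, "id": "z"},
]
seq_result = sequential(a)  # "y" keeps `field`, "z" loses it
par_result = parallel(a)    # "y" loses `field`, "z" keeps it
```

The two readings disagree about which element loses `field`, which is why evaluating all exclude paths against the same input binding avoids an order dependence.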

Member commented:

  1. Yes, meaning non-standard function syntax.
  2. Yes -- but that is the AST, which isn't the same as execution. We don't operate on the AST -- we convert TRIM to $trim_leading and CURRENT_USER to $current_user. The SQL:1999 Specification even talks about the conversion of the syntactic BETWEEN to comparison operators -- the theoretical between AST node disappears and gets replaced. We could just as easily say <lhs> EXCLUDE ( <exclude-arg-1>, <exclude-arg-2> ) gets converted to EXCLUDE(<lhs>, [ <exclude-arg-1-stringified>, <exclude-arg-2-stringified>]). Anyways, it could provide flexibility, and I believe it's possible to model.
  3. We could probably look at BigQuery which expects the RHS to be parenthesized. Ex: SELECT * EXCEPT (a, b). I would imagine this query in BigQuery strips out both a and b simultaneously.

Given PartiQL's flexibility, I'm personally intrigued by the prospect of writing:

SELECT VALUE {
  'full_name': person.l_name || ', ' || person.f_name,
  'encrypted_ssn': udf_to_encrypt(person.ssn),
  'all_other_details': person EXCLUDE (f_name, l_name, ssn)
} FROM Persons AS person

Especially since tuples are first-class citizens, I find it long overdue for some practical built-ins (beyond TUPLEUNION which PLK doesn't have implemented -- yet). It seems like a potential opportunity to allow for its introduction, especially since SQL's SELECT gets transformed to PartiQL's SELECT VALUE <tuple>.

Contributor commented:

EXCLUDE cannot be a value operator if you ever want to prune binding tuple variables.

alancai98 (Member Author) left a comment:

Thanks for the review, John. Once I get back after the Thanksgiving holiday, I'll follow up on your comments and apply your feedback around variable definitions + IS TUPLE/IS ARRAY in the next revision.

alancai98 (Member Author) left a comment:

New revision addresses John's comments related to

  • IS STRUCT -> IS TUPLE, IS LIST -> IS ARRAY and related prose
  • more consistent variable definitions through the doc
  • change in multi-exclude path variable definitions
  • additional subsumption rule for empty steps
  • typo fixes

am357 (Contributor) left a comment:

I still need to review the rest of the RFC; adding the portion that I have reviewed so far.

* If sufficient schema is present and the path can be resolved, we assume the root of an `EXCLUDE` path can be omitted. The variable resolution rules follow what is already included in the PartiQL specification.
* We require that every fully-qualified `<exclude path>` contain a root and at least one step. If a use case arises to exclude a binding tuple variable, then this functionality can be added.
* S-expressions are part of the Ion type system.footnote:[https://amazon-ion.github.io/ion-docs/docs/spec.html#sexp]
PartiQL should support s-expression types and values since PartiQL's type system is a superset over the Ion types. Because the current PartiQL specification does not formally define s-expressions operations, we consider the definition of collection index and wildcard steps on s-expressions as out-of-scope for this RFC.

Contributor commented:

Perhaps the statement can be less assertive; I know this is one of those hotly debated topics. The spec. says:

> PartiQL’s data model extends SQL to Ion’s type system to cover schema-less and nested data. Such values can be directly quoted with `quotes`.

So the text can simply convey that the semantics of s-expressions as a collection type are not fully defined yet and are therefore out of scope.

Reply:

Agreed. This statement makes more assertions about the PartiQL value system than does the spec.


NOTE: The following rules assume `root~p~=root~q~`.

.Subsumption rules

Contributor commented:

I know we have Table 1 for the examples, but adding an example for each of the rules would also enhance readability.

Otherwise, there must be some step at which `p` and `q` diverge. Let's call this step's index `i`.

[[anchor-1c]] Rule 1.c::
If `s~i~` is a tuple attribute and `t~i~` is a tuple wildcard and `t~i+1~...t~y~` subsumes `s~i+1~...s~x~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.

Contributor commented:

For this and others that apply: in order for the underlying subsumption to hold (i.e., t~i+1~...t~y~ subsumes s~i+1~...s~x~), the two step sequences would have to be considered independent <exclude path>s so that each has a root; is that so? If yes, the text may need to clarify that further.

[[anchor-1d]] Rule 1.d::
If `s~i~` is a collection index and `t~i~` is a collection wildcard and `t~i+1~...t~y~` subsumes `s~i+1~...s~x~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.
[[anchor-1e]] Rule 1.e::
If `s~i~` is a case-sensitive tuple attribute and `t~i~` is a case-insensitive tuple attribute and `t~i+1~...t~y~` subsumes `s~i+1~...s~x~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.

Contributor commented:

Suggested change
If `s~i~` is a case-sensitive tuple attribute and `t~i~` is a case-insensitive tuple attribute and `t~i+1~...t~y~` subsumes `s~i+1~...s~x~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.
Given `i < y and i < x`, if `s~i~` is a case-sensitive tuple attribute and `t~i~` is a case-insensitive tuple attribute and `t~i+1~...t~y~` subsumes `s~i+1~...s~x~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.

|`t.a.b[1].c` |`t.a.b[*]` |`q` subsumes `p` (by <<anchor-1d, 1.d>> then <<anchor-1a, 1.a>>)
|`t.a.b[1].c` |`t.a.b[*].c`|`q` subsumes `p` (by <<anchor-1d, 1.d>> then <<anchor-1b, 1.b>>)
|`t.a."b"` |`t.a.b` |`q` subsumes `p` (by <<anchor-1e, 1.e>> then <<anchor-1a, 1.a>>)
|`t.a."b".c` |`t.a.b.c` |`q` subsumes `p` (by <<anchor-1e, 1.e>> then <<anchor-1b, 1.b>>)

Contributor commented:

It would be good to have a 'No subsumption rule applies' example for a case-sensitivity mismatch.
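
The rules quoted above can be exercised with a small checker. This is a hedged sketch rather than the RFC's definition: rules 1.a and 1.b are not quoted in this thread, so the sketch assumes the base cases are "`q` has no remaining steps" and "the corresponding steps are identical". It also assumes both paths share a root, as the NOTE states, and that a case-insensitive attribute matches a case-sensitive one when the names are equal ignoring case. The `Step` type and helper names are hypothetical.

```python
from typing import NamedTuple, Sequence, Union


class Step(NamedTuple):
    kind: str                              # "attr", "attr_wildcard", "index", "coll_wildcard"
    value: Union[str, int, None] = None
    case_sensitive: bool = False           # only meaningful for "attr" steps


def attr(name, case_sensitive=False):
    return Step("attr", name, case_sensitive)


def index(n):
    return Step("index", n)


ATTR_WILD = Step("attr_wildcard")
COLL_WILD = Step("coll_wildcard")


def step_covers(t: Step, s: Step) -> bool:
    """True if step t (from q) excludes everything that step s (from p) excludes."""
    if t.kind == "attr" and s.kind == "attr":
        if t.case_sensitive:
            return s.case_sensitive and s.value == t.value
        return s.value.lower() == t.value.lower()      # includes Rule 1.e
    if t.kind == "attr_wildcard":
        return s.kind in ("attr", "attr_wildcard")      # Rule 1.c
    if t.kind == "coll_wildcard":
        return s.kind in ("index", "coll_wildcard")     # Rule 1.d
    return t == s                                       # identical index steps


def subsumes(q: Sequence[Step], p: Sequence[Step]) -> bool:
    """True if q subsumes p; roots are assumed equal and are not part of the steps."""
    if len(q) > len(p):
        return False
    return all(step_covers(t, s) for t, s in zip(q, p))


# p = t.a.b[1].c, q = t.a.b[*]
case_1 = subsumes([attr("a"), attr("b"), COLL_WILD],
                  [attr("a"), attr("b"), index(1), attr("c")])
# p = t.a."b", q = t.a.b
case_2 = subsumes([attr("a"), attr("b")],
                  [attr("a"), attr("b", case_sensitive=True)])
# Mismatch: p = t.a.b, q = t.a."b" (no rule applies under the assumptions above)
case_3 = subsumes([attr("a"), attr("b", case_sensitive=True)],
                  [attr("a"), attr("b")])
```

Under these assumptions the first two cases report subsumption, matching the table rows, while the case-sensitivity mismatch in the third does not.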

<select clause>
FROM (
SELECT VALUE {
'r': -- Apply below rewrite rules for steps `s~1~...s~n~`

Contributor commented:

Suggested change
'r': -- Apply below rewrite rules for steps `s~1~...s~n~`
'r': -- Apply rewrite rules explained in the following sections for steps `s~1~...s~n~`

FROM (
SELECT VALUE {
'r': -- Apply below rewrite rules for steps `s~1~...s~n~`
... -- Other vars created from the other clauses

Contributor commented:

Suggested change
... -- Other vars created from the other clauses
... -- Add other variables created from the other clauses using identity function

----


The main idea for rewriting the `EXCLUDE` steps `s~1~,...,s~n~` is to create a nested `CASE` expression for each step, whereby the nested `CASE` expressions for `s~1~,...,s~n-1~` unnest the input binding tuple and the final `CASE` expression for `s~n~` (i.e. the final step) filters out the desired tuple field(s) or collection index(es). Every exclude step has an expected type to process during evaluation. Tuple attribute and wildcard exclude steps expect a tuple. Whereas a collection index expects an array and a collection wildcard expects an array or bag. The `CASE` expression at each level `i` recreates this expected type by including a `WHEN` branch based on the expected type. Each `CASE` expression will include an `ELSE` branch which outputs the previous level's identifier. This set of branches ensures that at evaluation time, if there is a type mismatch (e.g. evaluation value is an array while the exclude step is a tuple attribute), there is no evaluation error and the previous level's value is returned through the `ELSE` branch. This behavior applies to both the permissive and strict typing modes.

Contributor commented:

Suggested change
The main idea for rewriting the `EXCLUDE` steps `s~1~,...,s~n~` is to create a nested `CASE` expression for each step, whereby the nested `CASE` expressions for `s~1~,...,s~n-1~` unnest the input binding tuple and the final `CASE` expression for `s~n~` (i.e. the final step) filters out the desired tuple field(s) or collection index(es). Every exclude step has an expected type to process during evaluation. Tuple attribute and wildcard exclude steps expect a tuple. Whereas a collection index expects an array and a collection wildcard expects an array or bag. The `CASE` expression at each level `i` recreates this expected type by including a `WHEN` branch based on the expected type. Each `CASE` expression will include an `ELSE` branch which outputs the previous level's identifier. This set of branches ensures that at evaluation time, if there is a type mismatch (e.g. evaluation value is an array while the exclude step is a tuple attribute), there is no evaluation error and the previous level's value is returned through the `ELSE` branch. This behavior applies to both the permissive and strict typing modes.
The main idea for rewriting the `EXCLUDE` steps `s~1~,...,s~n~` is to create a nested `CASE` expression for each step, whereby the nested `CASE` expressions for `s~1~,...,s~n-1~` unnest the input binding tuple and the final `CASE` expression for `s~n~` (i.e. the final step) filters out the desired tuple field(s) or collection index(es). Every exclude step has an expected type to process during evaluation. Tuple attribute and wildcard exclude steps expect a `Tuple`, whereas a collection index expects `Array` and a collection wildcard expects `Array` or `Bag` types. The `CASE` expression at each level `i` recreates this expected type by including a `WHEN` branch based on the expected type. Each `CASE` expression will include an `ELSE` branch which outputs the previous level's identifier. This set of branches ensures that at evaluation time, if there is a type mismatch (e.g. evaluation value is an array while the exclude step is a tuple attribute), there is no evaluation error and the previous level's value is returned through the `ELSE` branch. This behavior applies to both the permissive and strict typing modes.

---
We first illustrate the rewrite rule for a single `EXCLUDE` path and then explain the syntax rewrite for multiple exclude paths.

==== Step 2 (single): rewrite a single `EXCLUDE` path

Contributor commented:

Considering that there are many rules, perhaps pseudo-code accompanying the text explanation could also be helpful.
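
As one possible shape for such pseudo-code, the sketch below models the single-path behavior described in the quoted paragraph: each step checks for its expected type, the final step filters, and a type mismatch returns the value unchanged (the `ELSE` branch). It is an illustration under stated assumptions, not the RFC's rewrite. Tuples are modeled as dicts, arrays as lists, and bags as a hypothetical `Bag` list subclass. Steps exclude the root, and attribute steps are matched case-insensitively only.

```python
class Bag(list):
    """Marker type for an unordered collection (assumption for this sketch)."""


def exclude(value, steps):
    """Apply one exclude path (without its root) to `value`."""
    step, rest = steps[0], steps[1:]
    last = not rest
    kind = step[0]

    if kind in ("attr", "attr_wildcard"):
        if not isinstance(value, dict):
            return value                          # type mismatch: ELSE branch
        out = {}
        for name, v in value.items():
            hit = kind == "attr_wildcard" or name.lower() == step[1].lower()
            if hit and last:
                continue                          # final step: filter the attribute out
            out[name] = exclude(v, rest) if hit else v
        return out

    if kind == "index":
        if not isinstance(value, list) or isinstance(value, Bag):
            return value                          # index steps expect an array only
        out = []
        for i, v in enumerate(value):
            hit = i == step[1]
            if hit and last:
                continue                          # final step: filter the element out
            out.append(exclude(v, rest) if hit else v)
        return out

    if kind == "coll_wildcard":
        if not isinstance(value, list):
            return value                          # wildcard expects an array or bag
        if last:
            return type(value)()                  # every element excluded
        return type(value)(exclude(v, rest) for v in value)

    raise ValueError(f"unknown step kind: {kind}")


row = {"a": {"field_x": 0, "field_y": "zero"}, "b": [{"c": 1}, {"c": 2}]}

r_attr = exclude(row, [("attr", "a"), ("attr", "field_x")])           # like t.a.field_x
r_index = exclude(row, [("attr", "b"), ("index", 0)])                  # like t.b[0]
r_wild = exclude(row, [("attr", "b"), ("coll_wildcard",), ("attr", "c")])  # like t.b[*].c
r_mismatch = exclude(row, [("attr", "a"), ("index", 0)])               # `a` is not an array
```

In the last call the index step meets a tuple, so the value is returned unchanged rather than raising an error, mirroring the behavior the paragraph describes for both typing modes.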

=== Out of scope / assumptions

* We restrict tuple attribute exclude steps to use string literals and collection index exclude steps to use int literals. Thus `<exclude paths>` are statically known. We can decide whether to add other exclude paths (e.g. expressions) if a use case arises.
* If sufficient schema is present and the path can be resolved, we assume the root of an `EXCLUDE` path can be omitted. The variable resolution rules follow what is already included in the PartiQL specification.

Contributor commented:

I think we might want to have an example of attribute as a variable.

Contributor commented:

Top-level, let's format these lines to be like 80 or 120 characters wide.


* We restrict tuple attribute exclude steps to use string literals and collection index exclude steps to use int literals. Thus `<exclude paths>` are statically known. We can decide whether to add other exclude paths (e.g. expressions) if a use case arises.
* If sufficient schema is present and the path can be resolved, we assume the root of an `EXCLUDE` path can be omitted. The variable resolution rules follow what is already included in the PartiQL specification.
* We require that every fully-qualified `<exclude path>` contain a root and at least one step. If a use case arises to exclude a binding tuple variable, then this functionality can be added.

Contributor commented:

What is the rationale for this limitation? We should put that here.


Why is `EXCLUDE` modeled as a binding tuple operator as opposed to a value expression?::

We had also considered modeling `EXCLUDE` as a value operation evaluated after the `<select clause>`. Evaluating `EXCLUDE` last could contradict the PartiQL specification's assertion that the `<select clause>` is evaluated last, which may add confusion. There were also some additional edge cases that complicated defining `EXCLUDE` as a value operator. For example, let's look at the following query:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot be a value operator if you want to ever prune binding tuple variables.

* If sufficient schema is present and the path can be resolved, we assume the root of an `EXCLUDE` path can be omitted. The variable resolution rules follow what is already included in the PartiQL specification.
* We require that every fully-qualified `<exclude path>` contain a root and at least one step. If a use case arises to exclude a binding tuple variable, then this functionality can be added.
* S-expressions are part of the Ion type system.footnote:[https://amazon-ion.github.io/ion-docs/docs/spec.html#sexp]
PartiQL should support s-expression types and values since PartiQL's type system is a superset over the Ion types. Because the current PartiQL specification does not formally define s-expressions operations, we consider the definition of collection index and wildcard steps on s-expressions as out-of-scope for this RFC.

Agreed. This statement makes more assertions about the PartiQL value system than does the spec.

Comment on lines +86 to +92
=== Out of scope / assumptions

* We restrict tuple attribute exclude steps to use string literals and collection index exclude steps to use int literals. Thus `<exclude paths>` are statically known. We can decide whether to add other exclude paths (e.g. expressions) if a use case arises.
* If sufficient schema is present and the path can be resolved, we assume the root of an `EXCLUDE` path can be omitted. The variable resolution rules follow what is already included in the PartiQL specification.
* We require that every fully-qualified `<exclude path>` contain a root and at least one step. If a use case arises to exclude a binding tuple variable, then this functionality can be added.
* S-expressions are part of the Ion type system.footnote:[https://amazon-ion.github.io/ion-docs/docs/spec.html#sexp]
PartiQL should support s-expression types and values since PartiQL's type system is a superset of the Ion types. Because the current PartiQL specification does not formally define s-expression operations, we consider the definition of collection index and wildcard steps on s-expressions as out-of-scope for this RFC.

I would reword this section. I would avoid usage of terms like 'we' and write it more formally and more tersely.

e.g.,

=== Limitations

* This RFC requires that every fully-qualified `<exclude path>` contain a root and at least one step.
* This RFC restricts tuple attribute exclude steps to use string literals and collection index exclude steps to use int literals. Thus `<exclude paths>` are statically known.
* This RFC makes no changes to schema and name inference, and assumes that such inference is run as a prerequisite.
* This RFC defines `<exclude paths>` only over `list`, `bag`, and `tuple` value collections.
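
To illustrate what "statically known" paths over tuples and lists look like in practice, here is a minimal Python sketch. It is not the reference implementation: tuples are modeled as `dict`, lists as `list`, bags and wildcard steps are left out, and the names `exclude` and `exclude_path` are hypothetical. Treating a step that does not apply (missing attribute, out-of-range index, type mismatch) as a no-op is also an assumption made for the sketch.

```python
from typing import Any, Dict, List, Union

# Hypothetical step encoding: a str is a tuple attribute step (string literal),
# an int is a collection index step (int literal). Wildcards are omitted.
Step = Union[str, int]


def exclude(value: Any, steps: List[Step]) -> Any:
    """Return a copy of `value` with the element addressed by `steps` removed.

    Assumption: a step that does not apply leaves the value unchanged.
    """
    if not steps:
        return value
    head, rest = steps[0], steps[1:]
    if isinstance(head, str) and isinstance(value, dict):
        if head not in value:
            return value
        if not rest:
            # Last step: drop the attribute itself.
            return {k: v for k, v in value.items() if k != head}
        # Otherwise recurse into the addressed attribute only.
        return {k: exclude(v, rest) if k == head else v for k, v in value.items()}
    if isinstance(head, int) and isinstance(value, list):
        if not 0 <= head < len(value):
            return value
        if not rest:
            # Last step: drop the element at this index.
            return value[:head] + value[head + 1:]
        return [exclude(v, rest) if i == head else v for i, v in enumerate(value)]
    return value


def exclude_path(binding_tuple: Dict[str, Any], root: str, steps: List[Step]) -> Dict[str, Any]:
    """Apply one exclude path (root plus at least one step) to a binding tuple."""
    if not steps:
        raise ValueError("an exclude path needs a root and at least one step")
    if root not in binding_tuple:
        return binding_tuple
    return {**binding_tuple, root: exclude(binding_tuple[root], steps)}


row = {"t": {"a": 1, "b": [10, 20, 30]}}
without_a = exclude_path(row, "t", ["a"])        # like EXCLUDE t.a
without_b1 = exclude_path(row, "t", ["b", 1])    # like EXCLUDE t.b[1]
```

Because every step is a literal, the whole path can be validated before any row is evaluated, which is what the "at least one step" check in `exclude_path` stands in for.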

db.inventory.find( { status: "A" }, { status: 0, instock: 0 } )
----

== Unresolved questions

Could only the partial-schema case be out of scope? That is, could SQL's schema-full case stay within scope as well?

@johnedquinn

I have come across a scenario where I'd like to express caution. It has to do with the scoping of variables, specifically when EXCLUDE is present. Consider the following simple query:

SELECT
    t.a,
    t.b -- this won't produce anything!
EXCLUDE t.b
FROM t

This makes sense. However, I've been dealing with nested queries, and I'm curious what would happen in the following scenario:

SELECT
    (
        SELECT t2.c + t1.a -- this shouldn't work!
        EXCLUDE t1.a
        FROM t2
    ) AS t1_plus_t2
FROM t1 -- this has columns a and b

I believe your RFC handles the case above correctly: the query should fail. However, this may cause problems if EXCLUDE is allowed to remove bindings entirely. Here is the query again, annotated with the binding environments as I understand them:

global env = E0 = < t1: <<{a, b}>>, t2: <<{c, d}>> >
SELECT
    (
        SELECT t2.c + t1.a AS x -- input env = E0 || E1 || E2 = < t1: {b}, t2: {c, d} >. Output env = < x >
        EXCLUDE t1.a -- input env = E0 || E1 || E2. Output Env = E0 || E1 || E2 (with some minor eliminations of attributes) = < t1: {b}, t2: {c, d} >
        FROM t2 -- input env = E0 || E1. Output env (E2) = E0 || E1 || < t2: {c, d} > = < t1: {a, b}, t2: {c, d} >
    ) AS t1_plus_t2 -- the whole SELECT (including this projection item subquery) has input env = E0 || E1
FROM t1 -- input env = E0, output env (E1) = E0 || < t1: { a, b } > = < t1: {a, b}, t2: <<{c, d}>> >

If we want to make sure that the inner select does NOT get access to what is being excluded, then we must not allow EXCLUDE to exclude whole bindings (rather than attributes of bindings). If we had a very similar query that excluded an entire binding, we might still allow the projection list to access the original t1.a. See below:

global env = E0 = < t1: <<{a, b}>>, t2: <<{c, d}>> >
SELECT
    (
        SELECT t2.c + t1.a AS x -- input env = E0 || E1 || E3 = < t1: {a, b}, t2: {c, d} >. Output env (E4) = < x >
        EXCLUDE t1 -- input env = E0 || E1 || E2. Output Env (E3) = E0 || E1 || E2 = < t2: {c, d} >
        FROM t2 -- input env = E0 || E1. Output env (E2) = E0 || E1 || < t2: {c, d} > = < t1: {a, b}, t2: {c, d} >
    ) AS t1_plus_t2 -- the whole SELECT (including this projection item subquery) has input env = E0 || E1
FROM t1 -- input env = E0, output env (E1) = E0 || < t1: { a, b } > = < t1: {a, b}, t2: <<{c, d}>> >
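
To make the two annotated queries above concrete, here is a minimal Python sketch of the environment bookkeeping. It only models the reasoning in this comment, not any actual PartiQL engine: environments are plain dicts, `concat` stands in for `||` with later environments shadowing earlier ones, and the helper names and sample values are hypothetical.

```python
def concat(*envs):
    """Model `E0 || E1 || ...`: later environments shadow earlier ones."""
    merged = {}
    for env in envs:
        merged.update(env)
    return merged


def exclude_attribute(env, var, attr):
    """Model `EXCLUDE var.attr`: keep the binding, drop one attribute."""
    out = dict(env)
    out[var] = {k: v for k, v in env[var].items() if k != attr}
    return out


def exclude_binding(env, var):
    """Model `EXCLUDE var`: drop the whole binding from the output."""
    return {k: v for k, v in env.items() if k != var}


# Global environment: t1 and t2 are collections (lists stand in for bags).
E0 = {"t1": [{"a": 1, "b": 2}], "t2": [{"c": 10, "d": 20}]}
# Outer FROM binds t1 to a row; inner FROM binds t2 to a row.
E1 = {"t1": {"a": 1, "b": 2}}
E2 = {"t2": {"c": 10, "d": 20}}

from_out = concat(E0, E1, E2)

# Case 1: EXCLUDE t1.a. The pruned t1 is still in the output, so it
# shadows the outer t1 when the environments are concatenated again.
attr_out = exclude_attribute(from_out, "t1", "a")
select_env_attr = concat(E0, E1, attr_out)

# Case 2: EXCLUDE t1. The output no longer mentions t1, so the outer
# t1 from E1 resurfaces after concatenation.
bind_out = exclude_binding(from_out, "t1")
select_env_bind = concat(E0, E1, bind_out)

print("a" in select_env_attr["t1"])   # attribute exclusion holds
print(select_env_bind["t1"])          # whole-binding exclusion is bypassed
```

In the first case the inner projection cannot see `t1.a`; in the second it sees the full outer `t1` row again.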

Notice that the inner select still received t1 in its entirety due to the concatenation of environments (effectively bypassing the exclusion). One might ask: "That wouldn't be the case. Bindings always come from their inputs. Why is this different?" We might look at another operator that strips out bindings, the aggregation. In other engines, we can still access the outer variables. See the following examples:
