
Self serve replication API server side implementation #227

Merged

Conversation

chenselena
Collaborator

@chenselena chenselena commented Oct 9, 2024

Summary

Branched off from #220, this PR adds the server side implementation for the self serve replication API. Separate PR for SQL level changes can be found here: #226.

This PR adds validations for the interval and destination cluster parameters and stores the replication config as part of table policies in table properties.

Validations on parameters:

  • Destination cluster cannot be the same as the source cluster of the table.
  • If provided, the interval parameter must be in the format `<n>H` or `<n>D`, where the only valid hourly input is 12H and valid daily inputs are 1D-3D.

This PR doesn't include the changes to generate the cron schedule from the interval input, those will be made in a separate PR.
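The two validations above can be sketched as follows. This is an illustrative Scala sketch, not the actual OpenHouse server code; the object and method names are hypothetical.

```scala
// Hypothetical sketch of the replication-config validations described above.
// Accepted intervals: exactly "12H", or "1D" through "3D".
object ReplicationConfigValidator {
  private val IntervalPattern = "^(12H|[1-3]D)$".r

  // True when the user-supplied interval matches the allowed formats.
  def validateInterval(interval: String): Boolean =
    IntervalPattern.pattern.matcher(interval).matches()

  // The destination cluster must differ from the table's source cluster.
  def validateDestination(sourceCluster: String, destinationCluster: String): Boolean =
    !sourceCluster.equalsIgnoreCase(destinationCluster)
}
```

Inputs such as `24H` or a destination equal to the source cluster would fail these checks and surface as the BAD_REQUEST errors shown in the testing section.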

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Added unit testing.

Tested with local docker server:
successful POST to http://localhost:8000/v1/databases/u_tableowner/tables with parameters:

{
    "tableId": "test_table",
    "databaseId": "u_tableowner",
    "baseTableVersion": "INITIAL_VERSION",
    "clusterId": "LocalHadoopCluster",
    "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
    "tableProperties": {
        "key": "value"
    },
    "policies": {
        "sharingEnabled": "true",
        "replication": {
            "config": [
                {
                    "destination": "LocalHadoopClusterA",
                    "interval": "12H"
                }
            ]
        }
    }
}

successful POST to http://localhost:8000/v1/databases/u_tableowner/tables with parameters:

{
    "tableId": "test_table",
    "databaseId": "u_tableowner",
    "baseTableVersion": "INITIAL_VERSION",
    "clusterId": "LocalHadoopCluster",
    "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
    "tableProperties": {
        "key": "value"
    },
    "policies": {
        "sharingEnabled": "true",
        "replication": {
            "config": [
                {
                    "destination": "LocalHadoopClusterA",
                    "interval": "1D"
                }
            ]
        }
    }
}

Using interval: 24H gives the following error:

{
    "status": "BAD_REQUEST",
    "error": "Bad Request",
    "message": " : Replication interval for the table LocalHadoopCluster.u_tableowner.test_table1 can either be 12 hours or daily for up to 3 days",
    "stacktrace": null,
    "cause": "Not Available"
}

Trying to set the destination cluster as the source cluster gives the following error:

{
    "status": "BAD_REQUEST",
    "error": "Bad Request",
    "message": " : Replication destination cluster for the table LocalHadoopCluster.u_tableowner.test_table1 must be different from the source cluster",
    "stacktrace": null,
    "cause": "Not Available"
}

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@chenselena chenselena mentioned this pull request Oct 10, 2024
Collaborator

@HotSushi HotSushi left a comment


Some pre-review comments.

@chenselena chenselena force-pushed the selchen/replication-server-side-update branch 2 times, most recently from 81cba30 to 673a27a Compare October 16, 2024 18:13
@chenselena chenselena changed the title [Draft] Self serve replication server side implementation [Draft] Self serve replication API server side implementation Oct 16, 2024
@chenselena chenselena changed the title [Draft] Self serve replication API server side implementation Self serve replication API server side implementation Oct 16, 2024
@chenselena chenselena marked this pull request as ready for review October 17, 2024 02:51
@chenselena chenselena force-pushed the selchen/replication-server-side-update branch 2 times, most recently from 27bc75a to 8a87977 Compare October 17, 2024 18:58
chenselena added a commit that referenced this pull request Oct 17, 2024
## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

Branched off from #220, this
PR contains only the scope for SQL API support for self serve
replication. The changes include SQL API support for adding replication
configs to table policies within table properties.

SQL API that is supported:
```
ALTER TABLE db.testTable SET POLICY (REPLICATION=({destination:'a', interval:12h}))
```
```
ALTER TABLE db.testTable SET POLICY (REPLICATION=({destination:'a'}))
```
where `interval` defines how often the replication job runs and `destination` is the
destination cluster.
`interval` is an optional parameter where users can define an hourly interval
from 12 to 72 as `12h/H`, `24h/H`, etc. If the interval is not given, the
replication schedule defaults to daily (24h intervals).
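The default-interval behavior described above can be sketched like this. The helper name is hypothetical and only the hourly `h/H` form from this commit is handled; daily inputs were added in a later PR.

```scala
// Illustrative sketch: resolve an optional hourly interval string to hours,
// defaulting to daily (24h) replication when no interval is given.
object IntervalResolver {
  def resolveIntervalHours(interval: Option[String]): Int =
    interval match {
      case Some(s) => s.toUpperCase.stripSuffix("H").toInt // e.g. "12h" -> 12
      case None    => 24 // no interval supplied: default to daily
    }
}
```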

We also allow a list input with multiple clusters to enable
multi-cluster table replication.
```
ALTER TABLE db.testTable SET POLICY (REPLICATION=({destination:'a', interval:12H}, {destination:'aa', interval:12h}))
```

**Future Scope:**
Add validations to check that the destination cluster != source cluster,
and that the replication interval follows rules defined for data
freshness and compliance.
Separate PR for server-side implementation:
#227 which will contain
validation for SQL string input and cron schedule.

## Changes

- [x] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

Added unit tests.

Ran following commands on local docker:
```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({destination:'WAR'}))").show(false)
ANTLR Tool version 4.7.1 used for code generation does not match the current runtime version 4.8
ANTLR Tool version 4.7.1 used for code generation does not match the current runtime version 4.8
++
||
++
++
```

```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({destination:'WAR', interval:12H}))").show(false)
++
||
++
++
```

```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({interval:'12H'}))").show(false)
com.linkedin.openhouse.spark.sql.catalyst.parser.extensions.OpenhouseParseException: mismatched input 'interval' expecting {'.', 'SET'}; line 1 pos 62
```

```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({destination:'A', interval:12d}))").show(false)
com.linkedin.openhouse.spark.sql.catalyst.parser.extensions.OpenhouseParseException: mismatched input '12d' expecting RETENTION_HOUR; line 1 pos 84
```

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [x] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
Collaborator

@rohitkum2506 rohitkum2506 left a comment


Thanks @chenselena for plugging in the backend components. Added a few comments/questions on first pass.

@chenselena chenselena force-pushed the selchen/replication-server-side-update branch from 88f0c56 to 166525b Compare October 21, 2024 22:33
chenselena added a commit that referenced this pull request Oct 22, 2024
…rval (#234)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

This PR adds support for daily granularity as a valid input for the `interval`
parameter in the SQL API, as part of self serve replication.

Now the following SQL is valid and will not throw an exception:
```
ALTER TABLE db.testTable SET POLICY (REPLICATION=({destination:'a', interval:1D}))
```
where interval is supported to take daily and hourly inputs. 
The validations for 'D' and 'H' inputs will continue to be performed at
the server-side level to accept 12H and 1/2/3D inputs. The PR for that
can be found [here](#227).

## Changes

- [x] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.
Updated unit tests for SQL statements and tested in local docker:

```
scala> spark.sql("ALTER TABLE u_tableowner.test SET POLICY (REPLICATION=({destination:'a', interval:1D}))")
res6: org.apache.spark.sql.DataFrame = []
```
```
scala> spark.sql("ALTER TABLE u_tableowner.test SET POLICY (REPLICATION=({destination:'a', interval:12H}))")
res8: org.apache.spark.sql.DataFrame = []
```
Using anything other than `h/H` or `d/D` throws an exception:
```
scala> spark.sql("ALTER TABLE u_tableowner.test SET POLICY (REPLICATION=({destination:'a', interval:1}))")
com.linkedin.openhouse.spark.sql.catalyst.parser.extensions.OpenhouseParseException: no viable alternative at input 'interval:1'; line 1 pos 82
```
```
scala> spark.sql("ALTER TABLE u_tableowner.test SET POLICY (REPLICATION=({destination:'a', interval:1Y}))")
com.linkedin.openhouse.spark.sql.catalyst.parser.extensions.OpenhouseParseException: no viable alternative at input 'interval:1Y'; line 1 pos 82
  at com.linkedin.openhouse.spark.sql.catalyst.parser.extensions.OpenhouseParseErrorListener$.syntaxError(OpenhouseSparkSqlExtensionsParser.scala:123)
  at org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:41)
```

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
@chenselena chenselena force-pushed the selchen/replication-server-side-update branch 2 times, most recently from 548343f to e533299 Compare October 24, 2024 06:35
@chenselena chenselena force-pushed the selchen/replication-server-side-update branch from e533299 to 48fc186 Compare October 24, 2024 06:50
Collaborator

@rohitkum2506 rohitkum2506 left a comment


Thanks @chenselena for your perseverance in getting this done.

@chenselena chenselena merged commit 74d85fc into linkedin:main Oct 24, 2024
1 check passed