-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Supporting CTAS queries for Hive to Spark query translations #324
base: master
Are you sure you want to change the base?
Conversation
coral-hive/src/main/java/com/linkedin/coral/hive/hive2rel/parsetree/ParseTreeBuilder.java
Outdated
Show resolved
Hide resolved
coral-common/src/main/java/com/linkedin/coral/common/calcite/sql/SqlCreateTable.java
Outdated
Show resolved
Hide resolved
coral-common/src/main/java/com/linkedin/coral/common/calcite/sql/SqlCreateTable.java
Outdated
Show resolved
Hide resolved
if(tableRowFormat != null){ | ||
writer.keyword("ROW FORMAT DELIMITED FIELDS TERMINATED BY"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition sounds loose. See all options of the row format: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-RowFormats&SerDe and https://spark.apache.org/docs/latest/sql-ref-syntax-hive-format.html.
It is okay to support only ROW FORMAT DELIMITED FIELDS TERMINATED BY
, but the condition should reflect that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting having a conditional check if entered values for the row format are correct or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am saying ROW FORMAT implies a few things (e.g., either SERDE or DELIMITED). Somehow you take tableRowFormat
to reflect a lot of things together: ROW FORMAT DELIMITED FIELDS TERMINATED BY
.
private final @Nullable SqlNode tableSerializer; | ||
private final @Nullable SqlNodeList tableFileFormat; | ||
private final @Nullable SqlCharStringLiteral tableRowFormat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using both serializer and row format is confusing since serializer/deserializer is folded under row format. ROW FORMAT SERDE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, when referencing serializer, can we just use serDe
?
You can also remove the table
prefix since it is already implied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting using a separate custom SqlNode for RowFormat which will contain information like serDe, fieldDelim, colDelim, etc? I think having a custom SqlNode will later help in supporting other syntaxes of row formats in a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am saying ROW FORMAT
concept encompasses both SERDE
and DELIMITED
concepts/keywords. The code structure should reflect the syntax structure.
if(tableFileFormat != null){ | ||
if(tableFileFormat.size() == 1){ | ||
writer.keyword("STORED AS"); | ||
tableFileFormat.get(0).unparse(writer, 0, 0); | ||
writer.newlineAndIndent(); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above, can we make the conditions stricter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, are you suggesting to have a check, if the file format is valid, like ORC, Parquet, etc? I think unparsing logic shouldn't worry about these.
import java.util.List; | ||
import java.util.Objects; | ||
|
||
public class SqlCreateTable extends SqlCreate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us get back to this, but it might depend on the discussion above.
coral-hive/src/main/java/com/linkedin/coral/hive/hive2rel/parsetree/ParseTreeBuilder.java
Outdated
Show resolved
Hide resolved
@@ -64,7 +64,7 @@ public ResponseEntity translate(@RequestBody TranslateRequestBody translateReque | |||
else if (fromLanguage.equalsIgnoreCase("hive")) { | |||
// To Spark | |||
if (toLanguage.equalsIgnoreCase("spark")) { | |||
translatedSql = translateHiveToSpark(query); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for this change? @ljfgem FYI.
@@ -33,4 +36,11 @@ public static String translateHiveToSpark(String query) { | |||
CoralSpark coralSpark = CoralSpark.create(relNode); | |||
return coralSpark.getSparkSql(); | |||
} | |||
|
|||
public static String translateHiveQueryToSparkSql(String query){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will get back to this after resolving the above.
public static CoralSpark create(SqlNode sqlNode, Function<SqlNode, RelNode> convertor){ | ||
SparkRelInfo sparkRelInfo; | ||
//apply RelNode transformations for sqlNode eligible for transformation. | ||
if(sqlNode instanceof SqlCreate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding an interface SqlCommand
, and have SqlCreateTable
extend SqlCreate
and implement SqlCommand
? SqlCommand
defines getSelectQueries()
and setSelectQueries()
. Those are just list versions of what you have now in SqlCreateTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that getSelectQueries()
and setSelectQueries()
will be used for CREATE TABLE/VIEW....AS
. If there's no other use case, can we rename SqlCommand
-> SqlCreateCommand
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but I think commands like INSERT OVERWRITE TABLE tmp select * FROM foo
, might also leverage this
* | ||
* @return [[CoralSparkInfo]] | ||
*/ | ||
public static CoralSpark create(SqlNode sqlNode, Function<SqlNode, RelNode> convertor){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we are assuming this is the Coral IR SqlNode
. Assuming that Coral IR SqlNode
has array indexes start from 1 (see standardizeRel()
), I am pretty sure the array index in this tree will start from 0. For consistency, somehow we need a SELECT
node here that:
1- is aSqlNode
2- has array indexes starting from 1.
I think this implies that the caller to this method must have taken the query already, converted the SELECT SqlNode
to RelNode
then back to SELECT SqlNode
. The former SELECT
will have array indexes start from 0, and the latter will have them start from 1. What remains after the caller has done this step is that the latter Coral IR SqlNode
is simply converted to Spark SQL string (using the Spark dialect and after applying theSparkSqlRewriter
(on the latter SqlNode
)), without having to go through RelNode
here.
The caller can call the RelNode
-based create
on the obtained RelNode
in the caller. It is best to code this logic in a test case. @aastha25 @ljfgem FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently SqlNode given as input to this function has 0-based indexing. Now as you have mentioned that we need a
SELECTnode here should have array indexes starting from 1, so it means all
RelNodebased transformations, would have been done by the caller of this function. It means the caller should be able to execute the below transformations on the
SELECT `node before calling this new create function.
if(sqlNode instanceof SqlCommand) {
SqlNode selectNode = ((SqlCreateTable) sqlNode).getSelectQuery();
sparkRelInfo = IRRelToSparkRelTransformer.transform(convertor.apply(selectNode));
selectNode = new CoralRelToSqlNodeConverter().convert(sparkRelInfo.getSparkRelNode());
((SqlCreateTable) sqlNode).setSelectQuery(selectNode);
} else {
sparkRelInfo = IRRelToSparkRelTransformer.transform(convertor.apply(sqlNode));
sqlNode = new CoralRelToSqlNodeConverter().convert(sparkRelInfo.getSparkRelNode());
}
Should we create a separate function in CoralSpark
to handle this SqlNode
->RelNode
->SqlNode
transformation?
I don't think, we can call RelNode
-based create as currently, it will convert RelNode
to SQL
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the translation path:
Language1 (hiveSQL) --> CoralSqlNode (LHS) --> CoralRelNode --> CoralSqlNode (RHS) --> Language 2 (spark SQL)
The coral-spark
class is responsible for generating sparkSQL from CoralIR. It currently accepts the CoralRelNode
and applies the RHS side of transformations to generate a spark-compatible SQL. In your current implementation of create(..)
, you are passing CoralSqlNode (LHS) and using convertor.apply(selectNode)
to generate CoralRelNode
. Could you please refactor to apply the LHS side of transformations outside this class and then invoke existing method create(RelNode irRelNode){..}
to generate the updated selectNode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aastha25 do you mean the entry point of CTAS statement translation should be in another class? I think it's fine to put it in CoralSpark
. And looks like create(RelNode irRelNode)
can't generate updated selectNode
, it would return Spark SQL directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nimesh1601 could you summarize the offline discussion/agreement here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nimesh1601 for the PR!
* | ||
* @return [[CoralSparkInfo]] | ||
*/ | ||
public static CoralSpark create(SqlNode sqlNode, Function<SqlNode, RelNode> convertor){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the argument sqlNode
is of type SqlCreate
, the query's semantics are not getting validated. For example:
for a table foo
with multiple columns, if there are input queries like create table t1 as select * as col0 from foo
or create table t1 (a int) as select * from foo
, I'm not sure if Coral will throw an exception right now.
The validation work can be incremental, but maybe we can setup a class like ddlSqlValidator
in coral-common / coral-hive modules which can validate the different types of ddl statements and invoke it here.
We should validate that the rowtype is compatible for SqlCreateTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Queries like create table t1 as select * as col0 from foo
will fail query parsing, but yes, we would need validation for cases like create table t1 (a int) as select * from foo
. Do we have anything similar for select queries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for select queries, we use the hiveSqlValidator to validate the CoralSqlNode and then generate the relNode representation. This is done here.
if create DDL (and other DDL) are not flowing through this translation path, we should validate them separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aastha25 I do not think we should create two parallel paths. Can we unify them using common APIs?
break; | ||
} | ||
} | ||
return new SqlCreateTable(ZERO, false, ctOptions.ifNotExists != null ? ctOptions.ifNotExists : false, ctOptions.name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we introduce a utility class like calcite does here https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/ddl/SqlDdlNodes.java to manage all ddl sqlNode creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aastha25 Why does DDL SqlNode creation warrant a separate path? Why cannot it be done through generic SqlNode generation?
public static CoralSpark create(SqlNode sqlNode, Function<SqlNode, RelNode> convertor){ | ||
SparkRelInfo sparkRelInfo; | ||
//apply RelNode transformations for sqlNode eligible for transformation. | ||
if(sqlNode instanceof SqlCreate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that getSelectQueries()
and setSelectQueries()
will be used for CREATE TABLE/VIEW....AS
. If there's no other use case, can we rename SqlCommand
-> SqlCreateCommand
?
coral-common/src/main/java/com/linkedin/coral/common/calcite/sql/SqlCreateTable.java
Outdated
Show resolved
Hide resolved
aafad66
to
51b5451
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nimesh1601 for this PR!
Please run ./gradlew spotlessApply
to fix the format issues.
if(sqlNode instanceof SqlCommand) { | ||
SqlNode selectNode = ((SqlCreateTable) sqlNode).getSelectQuery(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we cast sqlNode
to SqlCreateTable
if sqlNode instanceof SqlCommand
? Will there be more classes implementing SqlCommand
?
* | ||
* @return [[CoralSparkInfo]] | ||
*/ | ||
public static CoralSpark create(SqlNode sqlNode, Function<SqlNode, RelNode> convertor){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aastha25 do you mean the entry point of CTAS statement translation should be in another class? I think it's fine to put it in CoralSpark
. And looks like create(RelNode irRelNode)
can't generate updated selectNode
, it would return Spark SQL directly.
* Internally Appropriate parts of Sql RelNode is converted to Spark RelNode, Spark RelNode is converted back | ||
* to SqlNode and SqlNode to SparkSQL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not following Internally Appropriate parts of Sql RelNode
, could you clarify it?
* 2) Base tables | ||
* 3) Spark UDF information objects, ie. List of {@link SparkUDFInfo} | ||
* | ||
* @param sqlNode CoralNode which will be translated to SparkSQL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to clarify sqlNode
here, it's CoralSqlNode (LHS)
in the translation chain Language1 (hiveSQL) --> CoralSqlNode (LHS) --> CoralRelNode --> CoralSqlNode (RHS) --> Language 2 (spark SQL)
, which is the SqlNode parsed from the source SQL.
* 3) Spark UDF information objects, ie. List of {@link SparkUDFInfo} | ||
* | ||
* @param sqlNode CoralNode which will be translated to SparkSQL. | ||
* @param convertor Functional Interface to convert SqlNode to appropriate RelNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think appropriate RelNode
can be replaced by CoralRelNode
.
coral-common/src/main/java/com/linkedin/coral/common/calcite/sql/SqlCreateTable.java
Outdated
Show resolved
Hide resolved
coral-common/src/main/java/com/linkedin/coral/common/calcite/sql/SqlCreateTable.java
Outdated
Show resolved
Hide resolved
coral-hive/src/main/java/com/linkedin/coral/hive/hive2rel/parsetree/AbstractASTVisitor.java
Outdated
Show resolved
Hide resolved
coral-hive/src/main/java/com/linkedin/coral/hive/hive2rel/parsetree/ParseTreeBuilder.java
Outdated
Show resolved
Hide resolved
coral-hive/src/main/java/com/linkedin/coral/hive/hive2rel/parsetree/ParseTreeBuilder.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes!! Left some comments about the structuring, will continue reviewing.
Also, could you please add javadoc for all public classes and methods :)
@@ -52,6 +54,7 @@ public class HiveToRelConverter extends ToRelConverter { | |||
// The validator must be reused | |||
SqlValidator sqlValidator = new HiveSqlValidator(getOperatorTable(), getCalciteCatalogReader(), | |||
((JavaTypeFactory) getRelBuilder().getTypeFactory()), HIVE_SQL); | |||
DdlSqlValidator ddlSqlValidator = new HiveDdlSqlValidator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you also make this private final?
} | ||
|
||
private void validateCreateTable(SqlNode sqlNode) { | ||
//Todo need to add appropriate validations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this going to be implemented as part of this PR?
* | ||
* @return [[SqlNode]] | ||
*/ | ||
public static SqlNode getCoralSqlNode(RelNode irRelNode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unclear why this is needed.
CoralSpark API should take Coral intermediate representation as input and return Spark SQL / SparkSQLNode.
You could create a
String convert(SqlCommand sqlCommand, RelNode coralRelNode) {
// generate CoralSqlNode (RHS) for coralRelNode
// set the updated CoralSqlNode (RHS) in sqlCommand
// send this (combined) SqlNode to CoralSqlNodeToSparkSqlNodeConverter and then SparkSqlRewriter
}
RelNode relNode = new HiveToRelConverter(hiveMetastoreClient).convertSql(query); | ||
CoralSpark coralSpark = CoralSpark.create(relNode); | ||
return coralSpark.getSparkSql(); | ||
HiveToRelConverter hiveToRelConverter = new HiveToRelConverter(hiveMetastoreClient); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Pl read the comment on CoralSpark.getCoralSqlNode(..) before this comment)
I think this class could use a bit of refactoring.
For SqlNode sqlNode = hiveToRelConverter.toSqlNode(query)
: hiveToRelConverter could return a coralSqlNode. Currently, on the LHS side, hiveSqlNode is equivalent CoralSqlNode (LHS).
Inside the if condition, you could extract CoralSelectSqlNode from coralSqlNode, and hand it over first to hiveToRelConverter
to generate CoralRelNode. And then hand over sqlCommand + CoralRelNode to CoralSpark.
RelRoot root = getSqlToRelConverter().convertQuery(sqlNode, true, true); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary empty line?
if(tableRowFormat != null){ | ||
writer.keyword("ROW FORMAT DELIMITED FIELDS TERMINATED BY"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am saying ROW FORMAT implies a few things (e.g., either SERDE or DELIMITED). Somehow you take tableRowFormat
to reflect a lot of things together: ROW FORMAT DELIMITED FIELDS TERMINATED BY
.
private final @Nullable SqlNode tableSerializer; | ||
private final @Nullable SqlNodeList tableFileFormat; | ||
private final @Nullable SqlCharStringLiteral tableRowFormat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am saying ROW FORMAT
concept encompasses both SERDE
and DELIMITED
concepts/keywords. The code structure should reflect the syntax structure.
import java.util.List; | ||
import java.util.Objects; | ||
|
||
public class SqlCreateTable extends SqlCreate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us get back to this, but it might depend on the discussion above.
* | ||
* @return [[CoralSparkInfo]] | ||
*/ | ||
public static CoralSpark create(SqlNode sqlNode, Function<SqlNode, RelNode> convertor){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nimesh1601 could you summarize the offline discussion/agreement here?
* | ||
* @return [[CoralSparkInfo]] | ||
*/ | ||
public static CoralSpark create(SqlNode sqlNode, Function<SqlNode, RelNode> convertor){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aastha25 I do not think we should create two parallel paths. Can we unify them using common APIs?
break; | ||
} | ||
} | ||
return new SqlCreateTable(ZERO, false, ctOptions.ifNotExists != null ? ctOptions.ifNotExists : false, ctOptions.name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aastha25 Why does DDL SqlNode creation warrant a separate path? Why cannot it be done through generic SqlNode generation?
public interface DdlSqlValidator { | ||
|
||
void validate(SqlNode ddlSqlNode); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who is supposed to call this (external calling code or internally inside some Coral step)? How can we unify it with the existing pipeline for validation? @aastha25 please let me know your thoughts.
public class SqlDdlNodes { | ||
|
||
/** Creates a CREATE TABLE. */ | ||
public static SqlCreateTable createTable(SqlParserPos pos, boolean replace, boolean ifNotExists, SqlIdentifier name, | ||
SqlNodeList columnList, SqlNode query, SqlNode tableSerializer, SqlNodeList tableFileFormat, | ||
SqlCharStringLiteral tableRowFormat) { | ||
return new SqlCreateTable(pos, replace, ifNotExists, name, columnList, query, tableSerializer, tableFileFormat, | ||
tableRowFormat); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds redundant? @aastha25
/** | ||
* Users use this function to get CoralSqlNode from CoralRelNode | ||
* This should be used when user need to get CoralSqlNode from CoralRelNode by applying | ||
* spark specific transformations on CoralRelNode | ||
* with Coral-schema output schema | ||
* | ||
* @return [[SqlNode]] | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java doc needs to be more elaborate, and reflect the relationship with existing methods. @aastha25 could help with that.
The approach and details are mentioned in this Doc for reference: Supporting CTAS queries for Hive to Spark
Unit tests are WIP