-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28600: Iceberg: Check that table/partition requires compaction b… #5529
base: master
Are you sure you want to change the base?
Conversation
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
Outdated
Show resolved
Hide resolved
CloseableIterable.filter(fileScanTasks, t -> { | ||
DataFile file = t.asFileScanTask().file(); | ||
return (!table.spec().isPartitioned() || | ||
partitionPath == null && file.specId() != table.spec().specId() || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use better separation between operands to make it readable/understandable and reliable, since now the precedence can cause issues.
Like: ((a==b && c!=2) && (a!=b && c=2) && (...) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously I used braces in very similar case, and @deniskuzZ asked to remove them: #5389 (comment)
The precedence won't cause issues because in Java the && operand has higher precedence over ||, so it will be evaluated deterministically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other response to this topic.
CloseableIterable<ScanTask> filteredDeletesScanTasks = | ||
CloseableIterable.filter(deletesScanTasks, t -> { | ||
DeleteFile file = ((PositionDeletesScanTask) t).file(); | ||
return !table.spec().isPartitioned() || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use better separation between operands to make it readable/understandable and reliable, since now the precedence can cause issues.
Like: ((a==b && c!=2) && (a!=b && c=2) && (...) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to previous comment, previously I used braces in very similar case, and @deniskuzZ asked to remove them: #5389 (comment)
The precedence won't cause issues because in Java the && operand has higher precedence over ||, so it will be evaluated deterministically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Highly disagree. Please check this:
boolean x = false;
boolean y = true;
boolean z = true;
if(x && y ||z){
System.out.println("true");
}else{
System.out.println("false");
}
if(x && (y ||z)){
System.out.println("true");
}else{
System.out.println("false");
}
The result for the first is true, for the second it is false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be no contradiction, your test confirms that && has higher precedence than || as it gets evaluated first.
First case:
x && y || z
false && true || true
false || true
true
Second case:
x && (y || z)
false && (true || true)
false && (true)
false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely confirms how important parenthesis is when there are multiple conditions in if. With or without parenthesis it could have a different result, as shown.Which means if it is done the wrong way it results in error.
It also not readable for others who didn't developed it, just try to understand or modify later. Imagine someone has to add or remove a condition in the future and it leads to bugs.
Please check Oracle's recommendation to this topic:
https://www.oracle.com/java/technologies/javase/codeconventions-programmingpractices.html
"10.5 Miscellaneous Practices
10.5.1 Parentheses
It is generally a good idea to use parentheses liberally in expressions involving mixed operators
to avoid operator precedence problems. Even if the operator precedence seems clear to you, it
might not be to others—you shouldn’t assume that other programmers know precedence as
well as you do.
if (a == b && c == d) // AVOID!
if ((a == b) && (c == d)) // RIGHT"
Also some valuable experience from others for this topic. I can not express myself better then the comments in this :
https://softwareengineering.stackexchange.com/questions/201175/should-i-fully-parenthesize-expressions-or-rely-on-precedence-rules
"As for correctness, conditions without parentheses are a recipe for silly mistakes. When they happen, they can be bugs that are hard to find--because often an incorrect condition will behave correctly most of the time, and only occasionally fail."
"Good developers strive to write code that is clear and correct. Parentheses in conditionals, even if they are not strictly required, help with both."
"Furthermore, extra parentheses, just like indentations, whitespace, and other style standards, help visually organize the code in a logical way."
"And even if you get it right, the next person to work on your code may not, either adding errors to the expression or misunderstanding your logic and thus adding errors elsewhere (as LarsH rightly points out)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @deniskuzZ, can you please provide your view on this topic, should we add parenthesis here or not?
Currently it follows the recommendation from this review comment:
#5389 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you; it's a matter of style. I haven't seen any recommendations on that topic in either Clean Code or Effective Java.
Only remark it should be easy to read and understand.
Even if the operator precedence seems clear to you, it might not be to others—you shouldn’t assume that other programmers know precedence as well as you do.
if (a == b && c == d) // AVOID!
if ((a == b) && (c == d)) // RIGHT"
to be honest, that seems like a joke
x && y || z
x && (y || z)
are 2 completely different expressions, if you know boolean algebra
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
Show resolved
Hide resolved
('fn11','ln11', 3, 20, 101), | ||
('fn12','ln12', 3, 20, 101); | ||
|
||
select * from ice_orc where company_id = 100; | ||
select * from ice_orc where dept_id = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you delete those selects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because with the other changes done to this q file in this PR the order of the result records from these selects is not fixed and changes randomly in different executions of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about adding -- SORT_QUERY_RESULTS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This q-test tests if records get sorted after executing compaction with sort clause.
I removed -- SORT_QUERY_RESULTS
intentionally, otherwise it won't be possible to test if records are sorted after compaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there an impact of this PR to change the order of the result of this query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of changes that I did on the lines 35-60 in iceberg_major_compaction_partition_evolution_ordered.q
ql/src/java/org/apache/hadoop/hive/ql/ddl/table/storage/compact/AlterTableCompactOperation.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check my comments!
Quality Gate passedIssues Measures |
@@ -2232,4 +2233,59 @@ private static List<FieldSchema> schema(List<VirtualColumn> exprs) { | |||
private static List<FieldSchema> orderBy(VirtualColumn... exprs) { | |||
return schema(Arrays.asList(exprs)); | |||
} | |||
|
|||
@Override | |||
public boolean canCompact(HiveConf hiveConf, org.apache.hadoop.hive.ql.metadata.Table table, String partitionPath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- that shouldn't be part of StorageHandler API, it's Compactor responsibility, please take a look at https://github.com/apache/amoro/blob/master/amoro-format-iceberg/src/main/java/org/apache/amoro/optimizing/plan/CommonPartitionEvaluator.java, maybe follow a similar design. Not sure, but we might even reuse some of their code (evaluation if compaction is needed) by adding amoro lib to Hive.
- consider rename to
isEligibleForCompaction
- you can use Partish abstraction that could be either Table or Partition
- hiveConf should be probably the last param
new SizeValidator(), "Iceberg data file size in megabytes below which a file needs to be compacted."), | ||
ICEBERG_COMPACTION_FILE_SIZE_RATIO("hive.iceberg.compaction.file.size.ratio", 0.1f, | ||
"Ratio of # data files below threshold / # data files above threshold above which compaction is needed"), | ||
ICEBERG_COMPACTION_DELETE_RECORDS_THRESHOLD("hive.iceberg.delete.records.threshold", 100, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe delete files?
…efore compacting
What changes were proposed in this pull request?
Added checking if Iceberg table/partition requires compaction before starting a manual compaction request.
Why are the changes needed?
Performance improvement to skip tables/partitions which do not require compaction.
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Updated existing q-tests, added new unit test.