Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28600: Iceberg: Check that table/partition requires compaction b… #5529

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

difin
Copy link
Contributor

@difin difin commented Nov 5, 2024

…efore compacting

What changes were proposed in this pull request?

Added checking if Iceberg table/partition requires compaction before starting a manual compaction request.

Why are the changes needed?

Performance improvement to skip tables/partitions which do not require compaction.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Updated existing q-tests, added new unit test.

CloseableIterable.filter(fileScanTasks, t -> {
DataFile file = t.asFileScanTask().file();
return (!table.spec().isPartitioned() ||
partitionPath == null && file.specId() != table.spec().specId() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use better separation between operands to make it readable/understandable and reliable, since now the precedence can cause issues.

Like: ((a==b && c!=2) && (a!=b && c=2) && (...) )

Copy link
Contributor Author

@difin difin Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously I used braces in very similar case, and @deniskuzZ asked to remove them: #5389 (comment)

The precedence won't cause issues because in Java the && operand has higher precedence over ||, so it will be evaluated deterministically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other response to this topic.

CloseableIterable<ScanTask> filteredDeletesScanTasks =
CloseableIterable.filter(deletesScanTasks, t -> {
DeleteFile file = ((PositionDeletesScanTask) t).file();
return !table.spec().isPartitioned() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use better separation between operands to make it readable/understandable and reliable, since now the precedence can cause issues.

Like: ((a==b && c!=2) && (a!=b && c=2) && (...) )

Copy link
Contributor Author

@difin difin Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to previous comment, previously I used braces in very similar case, and @deniskuzZ asked to remove them: #5389 (comment)

The precedence won't cause issues because in Java the && operand has higher precedence over ||, so it will be evaluated deterministically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highly disagree. Please check this:

boolean x = false;
boolean y = true;
boolean z = true;
if(x && y ||z){
  System.out.println("true");
}else{
  System.out.println("false");
}

if(x && (y ||z)){
  System.out.println("true");
}else{
  System.out.println("false");
}

The result for the first is true, for the second it is false.

Copy link
Contributor Author

@difin difin Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be no contradiction, your test confirms that && has higher precedence than || as it gets evaluated first.

First case:

x && y || z
false && true || true
false || true
true

Second case:

x && (y || z)
false && (true || true)
false && (true)
false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely confirms how important parenthesis is when there are multiple conditions in if. With or without parenthesis it could have a different result, as shown.Which means if it is done the wrong way it results in error.
It also not readable for others who didn't developed it, just try to understand or modify later. Imagine someone has to add or remove a condition in the future and it leads to bugs.

Please check Oracle's recommendation to this topic:
https://www.oracle.com/java/technologies/javase/codeconventions-programmingpractices.html

"10.5 Miscellaneous Practices
10.5.1 Parentheses
It is generally a good idea to use parentheses liberally in expressions involving mixed operators
to avoid operator precedence problems. Even if the operator precedence seems clear to you, it
might not be to others—you shouldn’t assume that other programmers know precedence as
well as you do.
if (a == b && c == d) // AVOID!
if ((a == b) && (c == d)) // RIGHT"

Also some valuable experience from others for this topic. I can not express myself better then the comments in this :
https://softwareengineering.stackexchange.com/questions/201175/should-i-fully-parenthesize-expressions-or-rely-on-precedence-rules
"As for correctness, conditions without parentheses are a recipe for silly mistakes. When they happen, they can be bugs that are hard to find--because often an incorrect condition will behave correctly most of the time, and only occasionally fail."

"Good developers strive to write code that is clear and correct. Parentheses in conditionals, even if they are not strictly required, help with both."
"Furthermore, extra parentheses, just like indentations, whitespace, and other style standards, help visually organize the code in a logical way."
"And even if you get it right, the next person to work on your code may not, either adding errors to the expression or misunderstanding your logic and thus adding errors elsewhere (as LarsH rightly points out)."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @deniskuzZ, can you please provide your view on this topic, should we add parenthesis here or not?
Currently it follows the recommendation from this review comment:
#5389 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you; it's a matter of style. I haven't seen any recommendations on that topic in either Clean Code or Effective Java.
Only remark it should be easy to read and understand.

Even if the operator precedence seems clear to you, it might not be to others—you shouldn’t assume that other programmers know precedence as well as you do.
if (a == b && c == d) // AVOID!
if ((a == b) && (c == d)) // RIGHT"

to be honest, that seems like a joke

x && y || z
x && (y || z)

are 2 completely different expressions, if you know boolean algebra

('fn11','ln11', 3, 20, 101),
('fn12','ln12', 3, 20, 101);

select * from ice_orc where company_id = 100;
select * from ice_orc where dept_id = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you delete those selects?

Copy link
Contributor Author

@difin difin Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because with the other changes done to this q file in this PR the order of the result records from these selects is not fixed and changes randomly in different executions of the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about adding -- SORT_QUERY_RESULTS ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This q-test tests if records get sorted after executing compaction with sort clause.
I removed -- SORT_QUERY_RESULTS intentionally, otherwise it won't be possible to test if records are sorted after compaction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there an impact of this PR to change the order of the result of this query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of changes that I did on the lines 35-60 in iceberg_major_compaction_partition_evolution_ordered.q

Copy link
Contributor

@zratkai zratkai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my comments!

Copy link

sonarcloud bot commented Nov 15, 2024

@@ -2232,4 +2233,59 @@ private static List<FieldSchema> schema(List<VirtualColumn> exprs) {
private static List<FieldSchema> orderBy(VirtualColumn... exprs) {
return schema(Arrays.asList(exprs));
}

@Override
public boolean canCompact(HiveConf hiveConf, org.apache.hadoop.hive.ql.metadata.Table table, String partitionPath,
Copy link
Member

@deniskuzZ deniskuzZ Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. that shouldn't be part of StorageHandler API, it's Compactor responsibility, please take a look at https://github.com/apache/amoro/blob/master/amoro-format-iceberg/src/main/java/org/apache/amoro/optimizing/plan/CommonPartitionEvaluator.java, maybe follow a similar design. Not sure, but we might even reuse some of their code (evaluation if compaction is needed) by adding amoro lib to Hive.
  2. consider rename to isEligibleForCompaction
  3. you can use Partish abstraction that could be either Table or Partition
  4. hiveConf should be probably the last param

new SizeValidator(), "Iceberg data file size in megabytes below which a file needs to be compacted."),
ICEBERG_COMPACTION_FILE_SIZE_RATIO("hive.iceberg.compaction.file.size.ratio", 0.1f,
"Ratio of # data files below threshold / # data files above threshold above which compaction is needed"),
ICEBERG_COMPACTION_DELETE_RECORDS_THRESHOLD("hive.iceberg.delete.records.threshold", 100,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe delete files?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants