
OAK-11165 - Cache the Path field in the NodeDocument class. #1758

Open
wants to merge 1 commit into trunk
Conversation

@nfsantos (Contributor) commented Oct 3, 2024

Creating a Path instance from a NodeDocument is an expensive operation, so it is worth caching it. The performance improvements are noticeable in the indexing job, which calls #getPath twice for each NodeDocument it processes.

This is a similar idea to what is done here:

// hashCode is computed lazily on the first call and memoized in the hash field
public int hashCode() {
    int h = hash;
    if (h == -1 && parent != null) {
        h = 17;
        h = 37 * h + parent.hashCode();
        h = 37 * h + name.hashCode();
        hash = h;
    }
    return h;
}
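The same lazy-memoization idiom can be applied to the path. The following is a minimal sketch of the idea, not the actual NodeDocument code: the field and method names (cachedPath, computePathFromId) and the stand-in Path class are illustrative assumptions. Since Path is immutable, the unsynchronized benign race is safe: two threads may both compute the path, but the results are equivalent.

```java
public class NodeDocumentSketch {

    private volatile Path cachedPath; // null until the first getPath() call

    public Path getPath() {
        Path p = cachedPath;
        if (p == null) {
            p = computePathFromId(); // the expensive parse, done at most once per thread race
            cachedPath = p;          // memoize for subsequent calls
        }
        return p;
    }

    // Stand-in for the real parsing of the document id into a Path.
    private Path computePathFromId() {
        return new Path("/content/example");
    }

    // Minimal immutable stand-in for the real Path class.
    static final class Path {
        private final String value;
        Path(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }
}
```

After the first call, every subsequent getPath() returns the cached instance without re-parsing.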

@Joscorbe (Member) commented Oct 3, 2024

Are there any concerns about increased memory usage by caching the Path object in memory in every case?

I understand it can improve performance in the indexing job, but it could increase memory consumption during normal operations of the repository. Have you evaluated this impact?

Maybe we could have a way to enable/disable the caching of this object depending on the needs. For example, enable it as part of indexing, but keep it disabled in other parts, at least until we assess any possible impact.

@nfsantos (Contributor, Author) commented Oct 3, 2024

> Are there any concerns about increased memory usage by caching the Path object in memory in every case?
>
> I understand it can improve performance in the indexing job, but it could increase memory consumption during normal operations of the repository. Have you evaluated this impact?
>
> Maybe we could have a way to enable/disable the caching of this object depending on the needs. For example, enable it as part of indexing, but keep it disabled in other parts, at least until we assess any possible impact.

I have not evaluated the potential increase in memory usage in other situations; I was hoping someone more knowledgeable in this module would be able to comment. The main place I can think of where memory usage could increase is the NodeDocument cache: if getPath is called on the documents in the cache, and these Path instances would not otherwise be held anywhere else, then memory usage will increase. I don't know if this could be an issue.

I was hoping that in most cases the lifetime of the Path instance created by calling #getPath will be similar to that of the NodeDocument it was created from, so there would be no increase in memory.

I'm OK with a way to enable/disable this feature; it could be a static flag set from an environment variable, disabled by default, and we would enable it only for indexing jobs. However, this optimization could also benefit other uses of Oak, and I'm not sure how to evaluate that.
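A static switch of that kind could be as simple as the sketch below. The property name "oak.documentMK.cachePath" is an assumption for illustration, not an existing Oak flag; Boolean.getBoolean returns false when the property is unset, which gives the disabled-by-default behavior discussed above.

```java
public class PathCacheFlag {

    // Read once at class-load time; defaults to false (disabled) when the
    // system property is absent. The property name is hypothetical.
    static final boolean CACHE_PATH =
            Boolean.getBoolean("oak.documentMK.cachePath");

    public static void main(String[] args) {
        System.out.println("Path caching enabled: " + CACHE_PATH);
    }
}
```

The indexing job could then be launched with -Doak.documentMK.cachePath=true while all other deployments keep the current behavior.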

@nfsantos (Contributor, Author) commented Oct 3, 2024

The reason calling getPath is so expensive is the call to StringCache for each path segment. I assume this is done to save memory, as two different paths often have many segments in common. On the other hand, it requires computing the hash of every path segment, which might not otherwise be needed in many situations.

This cache might also be ineffective because it only has capacity for 1024 elements, and when traversing an Oak tree we come across a very large number of different names. We can assume that the leaves all have different names, so traversing 1024 nodes will completely thrash the cache, pushing out the common names that appear in the parent segments. I have not done a deep analysis, but I suspect this StringCache might not be significantly reducing memory usage while increasing CPU usage: in the indexing job, the calls to StringCache.get are responsible for more than 10% of total CPU usage, which seems quite high.
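The thrashing argument can be illustrated with a toy bounded LRU cache (this is not the actual StringCache implementation, just a sketch of the eviction behavior): a common parent-segment name survives a short traversal, but a traversal over more distinct leaf names than the cache capacity pushes it out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CacheThrashDemo {

    static final int CAPACITY = 1024; // same capacity as quoted in the discussion

    /** Returns true if a common segment name survives a traversal of distinct leaf names. */
    static boolean commonNameSurvives(int distinctLeaves) {
        // Access-ordered LinkedHashMap acting as a simple LRU cache.
        Map<String, String> cache = new LinkedHashMap<String, String>(CAPACITY, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > CAPACITY;
            }
        };
        cache.put("jcr:content", "jcr:content"); // a common parent-segment name
        for (int i = 0; i < distinctLeaves; i++) {
            String leaf = "node-" + i;           // every leaf has a distinct name
            cache.put(leaf, leaf);
        }
        return cache.containsKey("jcr:content");
    }

    public static void main(String[] args) {
        System.out.println(commonNameSurvives(100));  // cache not full: name survives
        System.out.println(commonNameSurvives(2000)); // cache thrashed: name evicted
    }
}
```

With 2000 distinct leaves the common name is evicted even though it would be the most valuable entry to keep, which matches the concern about traversals flushing the cache.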

@stefan-egli (Contributor) commented
I think there is a very high likelihood that getPath is executed by the time the document gets put into the cache, as for example getNodeAtRevision already calls it, and that is very likely to be called. Perhaps having an idea of by how much memory would increase in a realistic scenario would be useful.

@thomasmueller (Member) commented
Compared to the "data" map, the memory usage of the path is small. When estimateMemoryUsage was implemented, memory usage was measured once; the overheads (at the time, on an old JVM) were 112 bytes of base overhead for an entry, plus 64 bytes per map entry.
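A back-of-envelope comparison using those quoted figures supports the point. The 112 + 64*n cost for the "data" map comes from the comment above; the ~40 bytes per path segment and the example sizes (20 properties, 5 segments) are assumptions for illustration only.

```java
public class MemoryEstimate {

    // Figures quoted in the discussion (old-JVM estimates): 112 bytes base
    // overhead per entry plus 64 bytes per map entry.
    static long dataMapBytes(int mapEntries) {
        return 112 + 64L * mapEntries;
    }

    // Rough assumed cost per cached path segment (illustrative only).
    static long pathBytes(int segments) {
        return 40L * segments;
    }

    public static void main(String[] args) {
        // e.g. a document with 20 properties vs. a 5-segment path like /a/b/c/d/e
        System.out.println("data map ~" + dataMapBytes(20) + " bytes"); // ~1392
        System.out.println("path ~" + pathBytes(5) + " bytes");         // ~200
    }
}
```

Under these assumptions the cached Path is an order of magnitude smaller than the document's property map.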

5 participants