Feat(#56) blog about caching #58
Changes from 14 commits
b46d9c1
805d79c
ca11fee
b8789f7
198cb96
96e9f05
4d30d65
3cdae01
ed510bc
5c2fa3e
ad6e8eb
9e6a736
daab2ce
25396af
5c065a5
8f27368
f25314b
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,233 @@ | ||
--- | ||
layout: post | ||
date: 2024-02-06 | ||
title: "Build cache in EO and other build systems" | ||
author: Alekseeva Yana | ||
--- | ||
|
||
|
||
## Introduction | ||
In [EO](https://github.com/objectionary/eo), caching is used to speed up program compilation. | ||
Recently we found a caching | ||
[bug](https://github.com/objectionary/eo/issues/2790) between goals in `eo-maven-plugin` | ||
for EO version `0.34.0`. The bug occurred because the old verification method | ||
used compilation time and caching time to search for a cached file. | ||
This is not the most reliable verification method, | ||
because caching time does not have to be equal to compilation time. | ||
We came to the conclusion that we need caching with a reliable verification method. | ||
Furthermore, this verification method should refrain from reading the file content. | ||
|
||
The goal is to implement effective caching in EO. | ||
To achieve the goal, we will briefly look at how well-known build systems (such as ccache, Maven, and Gradle) | ||
cache build results, in order to gain a deeper understanding of the caching concepts employed within them. | ||
|
||
<!--more--> | ||
@Yanich96 "More"? |
||
|
||
## Caching in Other Build Systems | ||
|
||
### ccache/sccache | ||
@Yanich96 Is it a build system or what? Where is the link? Short description? |
||
In compiled programming languages, building a project with many source code files takes a long time. | ||
This time is spent on loading libraries, preparing, optimizing, checking the code, and so on. | ||
Let's look at the build pipeline using C++ as an example: | ||
|
||
<p align="center"> | ||
<img src="/images/defaultCPhase.svg"> | ||
</p> | ||
|
||
1) First, preprocessor retrieves the source code files, | ||
@Yanich96 You only say that "preprocessor" only retrieves the source code files. And then... magic...:
Moreover, you don't need "compiler will get" |
||
which consist of both source files `.cpp` and header files `.h`. | ||
The result is a single file `.cpp` with human-readable code that the compiler will get. | ||
2) The compiler receives the file `.cpp` from the preprocessor and compiles it into an object file - `.obj`. | ||
At the compilation stage, parsing checks whether the code matches rules of a specific programming language. | ||
@Yanich96 Did you mean "parser" instead of "parsing"?
@volodya-lombrozo yes, thanks |
||
At the end, the compiler optimizes the resulting machine code and produces an object file. | ||
@Yanich96 You already mentioned it:
|
||
To speed up compilation, different files of the same project might be compiled in parallel. | ||
3) Then, the [Linker](https://en.wikipedia.org/wiki/Linker_(computing)) combines object files | ||
into an executable `.exe` file. | ||
|
||
|
||
To speed up the build of compiled languages, [ccache](https://ccache.dev) | ||
and [sccache](https://github.com/mozilla/sccache) are used. | ||
`ccache` hashes the compiler input at certain stages of the build. | ||
When compiling a file, its hash is calculated. | ||
If the hash is already present in the registry of compiled files, the file will not be compiled again. | ||
Instead, the previously compiled binary file will be reused. | ||
This approach can significantly accelerate the build of certain packages, reducing build times by a factor of 5-10. | ||
The [`ccache` hash](https://ccache.dev/manual/4.8.2.html#_common_hashed_information) is | ||
based on: | ||
* the file contents | ||
* the current directory of the file | ||
* the name of the compiler | ||
* the compiler’s size and modification time | ||
* extensions used by the compiler. | ||
|
||
Moreover, `ccache` has two hashing modes: | ||
1) `Direct mode` - the hash is generated based on the source code only. | ||
When using this mode, the user must ensure that the external libraries used in a project have not changed. | ||
Otherwise, the project might fail to build, resulting in errors. | ||
2) `Preprocessor mode` - hash is generated based on the `.cpp` file received after the preprocessor step. | ||
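
The hash-based lookup described above can be sketched in Java. This is a minimal illustration under stated assumptions, not ccache's actual implementation: the class `HashCacheSketch`, its methods, and the choice of SHA-256 are hypothetical, and only the compiler name and the source text go into the hash.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical content-hash compilation cache (illustration only). */
final class HashCacheSketch {
    /** Maps a hash of (compiler, source) to the cached compilation result. */
    private final Map<String, String> store = new HashMap<>();
    /** Counts how many times the real compiler had to run. */
    private int compilations;

    /** Builds the cache key from the compiler identity and the source text. */
    static String key(String compiler, String source) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(compiler.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0); // separator between the two parts
            digest.update(source.getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, digest.digest()));
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is unavailable", ex);
        }
    }

    /** Returns the cached result if the hash is known, otherwise compiles and stores it. */
    String compile(String compiler, String source, Function<String, String> compilerFn) {
        String hash = key(compiler, source);
        String cached = store.get(hash);
        if (cached != null) {
            return cached; // cache hit: skip compilation
        }
        compilations++;
        String object = compilerFn.apply(source);
        store.put(hash, object);
        return object;
    }

    int compilations() {
        return compilations;
    }
}
```

Calling `compile` twice with the same compiler name and source runs `compilerFn` only once; changing either argument produces a different key and triggers a new compilation.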
|
||
|
||
`Sccache` is similar in purpose to `ccache` but provides more functionality. | ||
`Sccache` allows storing cached files not only locally, but also in cloud data storage. | ||
In addition, `sccache` supports a wider range of languages, while `ccache` focuses on caching C and C++ compilation. | ||
|
||
@Yanich96 Maybe we need to write a short summary 1-2 sentences about this type of caching?
@volodya-lombrozo The principle of caching in
@Yanich96 I mean ccache and sccache altogether. What is the difference with other types of caching? Why did you choose these tools?
@volodya-lombrozo I wrote above that I looked at well-known used build systems. Isn't this enough? |
||
|
||
`ccache` is a high-level tool and cannot work with individual compilation tasks, | ||
therefore `ccache` is not suitable for solving our problems. | ||
@Yanich96 I wouldn't say this:
What about this:
@volodya-lombrozo |
||
However, the concept of non-local data storage could potentially be incorporated during the development of the EO. | ||
@Yanich96 "of the EO cache"
or "EO caching implementation"?
@volodya-lombrozo I meant EO in general. If "EO caching implementation" is better, I will fix it. |
||
|
||
|
||
### Gradle | ||
[Gradle](https://gradle.org) builds projects using a | ||
[task graph](https://docs.gradle.org/current/userguide/build_lifecycle.html) that allows for parallel execution | ||
of certain tasks. | ||
`Gradle` employs | ||
[Incremental build](https://docs.gradle.org/current/userguide/incremental_build.html#sec:how_does_it_work) | ||
to speed up project builds. | ||
For an incremental build to work, the tasks used to build the project must have specified | ||
@Yanich96 Could you please simplify this sentence and use simple active voice?
@volodya-lombrozo "The tasks that build the project must have input and output files for an incremental build to work." - is it ok?
@Yanich96 The second sentence clearly explains the idea which you are trying to explain here. I would suggest to combine these two sentences into a single one. Or just to remove this sentence. What do you think? |
||
input and output files. | ||
@Yanich96 "To enable an incremental build, the tasks that build the project must specify their input and output files." |
||
The provided code snippet demonstrates the implementation of a custom task in Gradle, | ||
showcasing how inputs and outputs are specified to enable `Incremental build`: | ||
``` | ||
task myTask { | ||
    inputs.file 'src/main/java/MyTask.somebody' // Specify the input file | ||
    outputs.file 'build/classes/java/main/MyTask.somebody' // Specify the output file | ||
|
||
    doLast { | ||
        // Task actions go here | ||
        // This code will only be executed if the inputs or outputs have changed | ||
    } | ||
} | ||
``` | ||
|
||
|
||
To understand how `Incremental build` works, consider the following steps: | ||
@Yanich96 Something strange is happening here with punctuation. Did you put this sentences in this order intentionally?
@volodya-lombrozo If I replace "To understand how
@Yanich96 Is it possible to remove this sentence? |
||
1) Before executing a task, `Gradle` takes a hash of the path and contents of the input files and saves it. | ||
The hash is considered current if the last modification time | ||
@Yanich96 "current" is a strange word here. I guess you meant something different.
@volodya-lombrozo "valid" is better?
@Yanich96 I'm not sure what you mean here, but I guess, yes. |
||
and the size of the source files have not changed. | ||
2) Then `Gradle` executes the task and saves a hash of the path and contents of the output files. | ||
@Yanich96 Is it a single hash for all the files?
@volodya-lombrozo yes. It is written in The Gradle documentation - "Gradle takes a fingerprint of the inputs. This fingerprint contains the paths of input files and a hash of the contents of each file. Gradle then executes the task." Should I mark this clarification in blog-post?
@Yanich96 It would be nice |
||
3) Then, when Gradle starts a project build again, it generates a new hash for the same files. | ||
If the new hash is current, Gradle can safely skip this task. | ||
@Yanich96 Please, use something different from "current". It looks strange. |
||
Otherwise, the task runs again and overwrites its outputs. | ||
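
The input side of these steps can be sketched in Java. This is a simplified illustration, not Gradle's implementation: `FingerprintSketch` and its methods are hypothetical names, the inputs are passed as an in-memory map from path to content, SHA-256 is an assumed hash, and output fingerprints are left out.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical up-to-date check based on an input fingerprint (illustration only). */
final class FingerprintSketch {
    /** Fingerprint saved after the last execution; null before the first run. */
    private String previous;

    /** Combines each input path with a hash of that file's contents. */
    static String fingerprint(Map<String, byte[]> inputs) {
        MessageDigest total = newDigest();
        // TreeMap sorts by path so the result does not depend on iteration order
        for (Map.Entry<String, byte[]> entry : new TreeMap<>(inputs).entrySet()) {
            total.update(entry.getKey().getBytes(StandardCharsets.UTF_8));
            total.update((byte) 0); // separator between path and content hash
            total.update(newDigest().digest(entry.getValue()));
        }
        return String.format("%064x", new BigInteger(1, total.digest()));
    }

    /** Runs the action only when the fingerprint differs from the saved one. */
    boolean runIfChanged(Map<String, byte[]> inputs, Runnable action) {
        String current = fingerprint(inputs);
        if (current.equals(previous)) {
            return false; // inputs unchanged: the task can be skipped
        }
        action.run();
        previous = current;
        return true;
    }

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is unavailable", ex);
        }
    }
}
```

The first call to `runIfChanged` always executes the action; later calls execute it only if a path or a file's content has changed since the saved fingerprint.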
|
||
|
||
In addition to `Incremental build`, `Gradle` also stores hashes of previous builds, enabling quick project builds, | ||
@Yanich96 [I'm not sure] "the hashes" ? |
||
for example when switching from one git branch to another. This feature is known as | ||
the [Build Cache](https://docs.gradle.org/current/userguide/build_cache.html). | ||
|
||
|
||
The concept of `Gradle Incremental build` is close to what we need. | ||
@Yanich96 It's a "water". Where is the conclusion? Why incremental build is redundant? What is a "fingerprint"? - you haven't mentioned it before.
@volodya-lombrozo The conclusion of it's all right?
@Yanich96 it's a bit better, but you can clarify it, I believe. |
||
It can manage separate compilation tasks based on their inputs and outputs. | ||
However, an incremental build as in Gradle may be redundant for EO. | ||
In contrast to other programming languages, EO currently lacks pre-existing libraries that can be integrated | ||
into the project. Consequently, there is no need to generate a fingerprint for each task's data. | ||
|
||
|
||
### Maven | ||
[Maven](https://maven.apache.org) automates and manages Java-project builds. | ||
`Maven` is based on the concept of | ||
[Maven LifeCycles](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html), | ||
which include default, clean, and site lifecycles. | ||
Each lifecycle consists of `phases` and these `phases` consist of sets of `goals`. | ||
|
||
In Maven, there are default phases and goals for building any project: | ||
|
||
<p align="center"> | ||
<img src="/images/defaultPhaseMaven.svg"> | ||
</p> | ||
|
||
By default, the `phases` in Maven are inherently connected within the build lifecycle. | ||
Each `phase` represents a specific task, and the execution order of `goals` within `phases` is determined | ||
@Yanich96 I thought that:
|
||
by the default Maven lifecycle bindings. This means that each `phase` operates as a series of individual tasks | ||
whose execution order is predefined by Maven. | ||
|
||
|
||
`Maven` utilizes caching mechanisms through the `takari-lifecycle-plugin` and `maven-build-cache-extension`: | ||
* The [takari-lifecycle-plugin](http://takari.io/book/40-lifecycle.html) is an alternative to the default Maven lifecycle | ||
(building JAR files). Its distinctive feature is the use of a single universal plugin with the same functionality | ||
as five separate plugins for the standard lifecycle, but with significantly fewer dependencies. As a result, | ||
@Yanich96 Why do we need to know that there are exactly "five" separate plugins? |
||
it provides a much faster startup, more optimal operation, and lower resource consumption. | ||
This leads to a significant increase in performance when compiling complex projects with a large number of modules. | ||
@Yanich96 Where is a "cache" here? You mentioned that it's a build tool only. |
||
|
||
* The [maven-build-cache-extension](https://maven.apache.org/extensions/maven-build-cache-extension/) | ||
is used for large Maven projects that have a significant number of small `modules`. | ||
A `module` refers to a subproject within a larger project. | ||
Each `module` has its own `pom.xml` file, and there is an aggregator `pom.xml` that consolidates all the `modules`. | ||
This extension computes a hash for each `module` that encapsulates the essential aspects of the `module`, | ||
including the source code and the configuration of the plugins used within it. | ||
`Modules` whose hash matches a cached one are considered unchanged, and the cache can efficiently restore them. | ||
Otherwise, the cache seamlessly delegates the build work to the standard Maven core, | ||
without interfering with the build execution logic. | ||
`maven-build-cache-extension` ensures that only the changed `modules` within the project are rebuilt. | ||
|
||
|
||
@Yanich96 What is the conclusion? Why did you mention Maven? Does this caching similar to Gradle? to ccache? What is the difference? |
||
Maven's caching mechanisms operate at the level of `phases` and individual project modules. | ||
Therefore, existing caching systems in Maven do not align with our requirements for resolving present issues. | ||
@Yanich96 You haven't explained the caching in this section at all. How hash is generated which data it requires to generate a hash? file content, file path, last modification time? |
||
|
||
### EO build cache | ||
|
||
EO uses `Maven` to build projects. | ||
For this purpose, there is the `eo-maven-plugin`, which contains the essential goals for working with EO code. | ||
As previously mentioned, a Maven build follows a specific order of phases. | ||
Below is a diagram illustrating the main phases and their corresponding goals for EO: | ||
|
||
<p align="center"> | ||
<img src="/images/EO.svg"> | ||
</p> | ||
|
||
In [Picture 3](/images/EO.svg) the goals of the `eo-maven-plugin` are highlighted in green. | ||
|
||
|
||
However, the actual work with EO code takes place in `AssembleMojo`. | ||
`AssembleMojo` is the goal consisting of other goals that work with the EO file, as shown in | ||
[Picture 4](/images/AssembleMojo.svg). | ||
|
||
|
||
<p align="center"> | ||
<img src="/images/AssembleMojo.svg"> | ||
</p> | ||
|
||
Each goal within `AssembleMojo` is a distinct compilation step for EO code. | ||
These tasks happen one after the other, and each task relies on the output of the one before it. | ||
Each task has directories for input and output data, as well as a directory for storing cached data. | ||
Using the program name, each task can receive and store data. | ||
|
||
@Yanich96 Why do you need two consecutive empty lines here? If you need some logical division, use headings and clear sections.
@Yanich96 Same question. |
||
|
||
The previous caching mechanism in EO made use of distinct interfaces, specifically `Footprint` and `Optimization`. | ||
These caching interfaces shared similar logic, but with minor differences. | ||
For instance, `Footprint` verifies the EO version of the compiler, whereas the remaining checks are identical. | ||
Additionally, the conditions for searching data in the cache had errors. | ||
Due to this issue, the program behaved incorrectly, because saving the goal's result to the cache is not instantaneous. | ||
After conducting an in-depth analysis of the project's incorrect operation, | ||
several disadvantages of the previous caching mechanism in EO were brought to light: | ||
* Incorrect search conditions for data in the cache. | ||
* The verification method requires reading the file content, which results in inefficiencies. | ||
* The presence of multiple caching mechanisms creates challenges in identifying and rectifying caching errors. | ||
* Employing multiple caching mechanisms for similar entities is a suboptimal practice, | ||
leading to redundancy and complicating the caching infrastructure. | ||
|
||
|
||
To address caching challenges in EO, we closely examined existing caching systems. However, we cannot use them. | ||
@Yanich96 Why?
We can utilize parts of existing solutions, such as the hash generation algorithm from ccache and Gradle's task caching approach for our compilation steps. |
||
We require a caching mechanism at the level of `goals`. | ||
In fact, we don't need to invent a new caching mechanism for EO. | ||
Instead, it suffices to verify the last modification time of the files involved in EO compilation. | ||
@Yanich96 Why "last modification time" ? We need more examples here, but it looks like a totally unreliable approach. |
||
The modification time of the file produced by the preceding task must not exceed that of the file produced by the subsequent one. | ||
As each task possesses directories for input and output data, accessing the desired file | ||
via an absolute path enables retrieval of essential information, such as the file name and last modified time, | ||
from the file attributes without reading the file content. | ||
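
The timestamp comparison described above can be sketched in Java. This is only an illustration of the idea, not the `eo-maven-plugin` code: `MtimeCheckSketch`, `isFresh`, and `canReuse` are hypothetical names, and the sketch assumes one source file and one target file per step.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical last-modified check between two compilation steps (illustration only). */
final class MtimeCheckSketch {
    /** The target is fresh if it was modified no earlier than the source. */
    static boolean isFresh(long sourceMillis, long targetMillis) {
        return sourceMillis <= targetMillis;
    }

    /** Reads only file attributes, never the file content. */
    static boolean canReuse(Path source, Path target) {
        if (!Files.exists(target)) {
            return false; // nothing to reuse yet
        }
        try {
            return isFresh(
                Files.getLastModifiedTime(source).toMillis(),
                Files.getLastModifiedTime(target).toMillis()
            );
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

A step would call `canReuse` before doing any work and skip itself when the result is `true`. Timestamps can have coarse granularity on some file systems, so a check like this may treat a file modified within the same tick as unchanged.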
@Yanich96, could you write down the cache usage algorithm as you have done it in the Gradle section? Please outline the steps as you "see" them. |
||
|
@Yanich96 It's better to say: "The bug occurred because the old verification method used compilation time and caching time to search for a cached file"