added design doc for sparse checkout #335

Merged
merged 17 commits into from
Aug 27, 2024
68 changes: 68 additions & 0 deletions research/design-doc/sparse_checkout_asishkumar.md
@@ -0,0 +1,68 @@
# KPM sparse checkout

**Author**: Asish Kumar

## Abstract

`kpm` manages third-party libraries through Git repositories, requiring a `kcl.mod` file at the root directory. It treats the entire Git repository as a single `kcl` package, which is inefficient for monorepos containing multiple `kcl` packages. Often, a `kcl` project depends on just one package within a monorepo, but `kpm` downloads the entire repository. Therefore, `kpm` needs to allow adding a subdirectory of a Git repository as a dependency, enabling it to download only the necessary parts and improve performance.

## User Interface
zong-zhe marked this conversation as resolved.

I will add a new flag called `--subdir` to the `kpm add` command. This flag specifies the path to the desired subdirectory within the Git repository. Below is the syntax for the enhanced `kpm add` command:

```
kpm add --subdir <subdir> <git-repo-url>
```

The `--subdir` flag will be optional. If the flag is not provided, `kpm` will download the entire repository as it does now. If the flag is provided, `kpm` will download only the specified subdirectory. The `kcl.mod` file will be generated with the path to the subdirectory.
Contributor

> The `kcl.mod` file will be generated with the path to the subdirectory.

I personally think that if there is no `kcl.mod` file in the subdirectory the user downloads, we should output diagnostic information and notify the user rather than generate the file for them. The `kcl.mod` file affects the compilation result in some cases, so if users do not know that kpm has generated the file, it may cause them trouble.


Example usage:

```
kpm add --subdir 1.21/* k8s
```

Contributor

For now, I recommend that this feature operate only on a valid KCL package and not be extended to arbitrary files and directories, so I do not recommend supporting patterns such as `*` or `*.k` files.

`--subdir xxx/xxx  # only a subdir here; the subdir must contain a kcl.mod file, otherwise raise an error`

In the future, if there is demand for `*`, we can revisit the research and decide whether to support it. It is too early to introduce this feature into the path details.

This command will download the `1.21` directory and all its contents from the `k8s` repository hosted at https://github.com/kcl-lang/modules.
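The review comments above suggest accepting only a plain subdirectory that already contains a `kcl.mod` file, and rejecting wildcards. A minimal sketch of such a check follows. The helper name `validateSubdir` and its error messages are hypothetical and not part of kpm. It assumes the repository has already been fetched to `repoRoot`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// validateSubdir is a hypothetical helper (not an existing kpm function).
// It rejects wildcard patterns and paths that escape the repository, and
// requires a kcl.mod file inside the requested subdirectory.
func validateSubdir(repoRoot, subdir string) error {
	if subdir == "" {
		return nil // flag not given: the whole repository is the package
	}
	if strings.ContainsAny(subdir, "*?[") {
		return fmt.Errorf("subdir %q: wildcards are not supported", subdir)
	}
	clean := filepath.Clean(subdir)
	if filepath.IsAbs(clean) || clean == ".." ||
		strings.HasPrefix(clean, ".."+string(filepath.Separator)) {
		return fmt.Errorf("subdir %q must be a relative path inside the repository", subdir)
	}
	modPath := filepath.Join(repoRoot, clean, "kcl.mod")
	if _, err := os.Stat(modPath); err != nil {
		if errors.Is(err, os.ErrNotExist) {
			return fmt.Errorf("subdir %q does not contain a kcl.mod file", subdir)
		}
		return err
	}
	return nil
}

func main() {
	// The wildcard form from the example usage is rejected by this sketch.
	fmt.Println(validateSubdir(".", "1.21/*"))
}
```

Returning an error instead of generating `kcl.mod` follows the reviewer's preference for a diagnostic over a silently created file.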

The `kcl.mod` file of the user's project will also contain an array of paths to the subdirectories.

```
[dependencies]
bbb = { path = "../bbb", subdir = ["test-*", "test-*"]}
```

Contributor

I don't understand kpm's workflow when it gets a `kcl.mod` like this. Could you please add more details?

The main point that confuses me is:

The main purpose of this work is to support adding dependencies from subdirectories of a git repo.

`bbb = { path = "../bbb", subdir = ["test-*", "test-*"]}`

It doesn't seem to have anything to do with git.

## Design

The path to the directory will be passed to `CloneOptions` in [pkg/git/git.go](https://github.com/kcl-lang/kpm/blob/d20b1acdc988f600c8f8465ecd9fe04225e19149/pkg/git/git.go#L19) as subDir.

### using go-getter

As mentioned in the [go-getter](https://pkg.go.dev/github.com/hashicorp/go-getter#readme-subdirectories) docs, we can append the `subDir` from `CloneOptions` (only if it is not empty) to the repository URL in the `WithRepoURL` function.
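go-getter marks a subdirectory with a double slash (`//`) after the repository URL, placed before any query parameters. The sketch below shows one way the URL could be assembled. The function name `withSubdir` is hypothetical and is not kpm's actual `WithRepoURL` implementation. The example URL and `ref` value are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// withSubdir is a hypothetical helper that appends a go-getter style
// "//<subdir>" suffix to repoURL. Query parameters such as "?ref=..."
// are kept after the subdirectory. An empty subDir leaves the URL as is.
func withSubdir(repoURL, subDir string) string {
	subDir = strings.Trim(subDir, "/")
	if subDir == "" {
		return repoURL
	}
	base, query, hasQuery := strings.Cut(repoURL, "?")
	out := strings.TrimRight(base, "/") + "//" + subDir
	if hasQuery {
		out += "?" + query
	}
	return out
}

func main() {
	fmt.Println(withSubdir("git::https://github.com/kcl-lang/modules.git?ref=main", "k8s/1.21"))
}
```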

### using go-git
Contributor

kpm used to use go-git to interact with git repos, but one issue forced me to switch to go-getter. Let me give you a few more details that may be helpful.

When supporting git repos as dependencies, there is one tricky issue to consider: authentication. go-git and the native git client do not share the same authentication, which means that if you use go-git, you need to handle authentication in kpm and implement it yourself. It's a lot of work.

And the authentication part of go-git itself is not yet complete: go-git/go-git#490

Contributor Author

Got it. So I will be using the go-getter methods.


This process will involve using the `sparse-checkout` feature of git.

1. Initialize a new git repository in the local `.kcl/kpm/` directory using [PlainInit](https://pkg.go.dev/github.com/go-git/go-git#PlainInit). The repository name will be the PackageName_version.

2. Create a new worktree using [Worktree](https://pkg.go.dev/github.com/go-git/go-git/v5#Repository.Worktree)

3. Enable the sparse-checkout feature using [SparseCheckout](https://pkg.go.dev/github.com/go-git/go-git/v5#Worktree.SparseCheckout). The second argument will be a slice of strings containing the subdirectory path.

4. Add the remote repository using [CreateRemote](https://pkg.go.dev/github.com/go-git/go-git/v5#Repository.CreateRemote)

5. Pull the repository using [Pull](https://pkg.go.dev/github.com/go-git/go-git/v5#Worktree.Pull)
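
The steps above can also be expressed with the native git client. The sketch below swaps the go-git calls for equivalent `git` CLI commands so that it depends only on the Go standard library. This is an assumption made for illustration and not the design's chosen implementation. The function names, the target directory, and the `main` ref are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// sparseCheckoutCommands returns the git argument lists that mirror the
// listed steps: init, enable sparse checkout for subdir, add the remote,
// and pull. All parameters are supplied by the caller.
func sparseCheckoutCommands(dir, repoURL, subdir, ref string) [][]string {
	return [][]string{
		{"init", dir},
		{"-C", dir, "sparse-checkout", "init", "--cone"},
		{"-C", dir, "sparse-checkout", "set", subdir},
		{"-C", dir, "remote", "add", "origin", repoURL},
		{"-C", dir, "pull", "--depth=1", "origin", ref},
	}
}

// runAll executes each command with the installed git binary and stops
// at the first failure.
func runAll(cmds [][]string) error {
	for _, args := range cmds {
		cmd := exec.Command("git", args...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("git %v: %w", args, err)
		}
	}
	return nil
}

func main() {
	// Dry run: print the commands instead of touching the network.
	cmds := sparseCheckoutCommands(".kcl/kpm/k8s_1.21",
		"https://github.com/kcl-lang/modules.git", "k8s/1.21", "main")
	for _, args := range cmds {
		fmt.Println("git", strings.Join(args, " "))
	}
}
```

Calling `runAll(cmds)` instead of printing would perform the checkout, using whatever authentication the local git client is already configured with.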

Whenever a command needs to access the subdirectory, it can refer to the project's `kcl.mod` file and iterate over the `subdir` array to get the subdirectory paths. The `kcl.mod` file is updated automatically whenever the `kpm add` command is run.

### Additional information

1. To avoid creating a new root for each subdirectory download, I can add some check functions.

2. The `--subdir` flag applies only to git dependencies. If it is passed with an OCI dependency, for example `kpm add k8s --subdir 1.21/*`, it will not work. We can add a check [here](https://github.com/kcl-lang/kpm/blob/92158183556d39545bc0734a1e24284344ff3d9e/pkg/cmd/cmd_add.go#L154) that emits a warning if the flag is passed in that case. The flag works only for git repositories because its value is stored in a field of the [Git](https://github.com/kcl-lang/kpm/blob/92158183556d39545bc0734a1e24284344ff3d9e/pkg/package/modfile.go#L375) struct.
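
A minimal sketch of the check described in item 2 might look like the following. The `addOptions` type, its fields, and the error text are hypothetical stand-ins and are not taken from kpm's `cmd_add.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// addOptions is a hypothetical stand-in for the parsed `kpm add` flags.
type addOptions struct {
	GitURL string // empty when the dependency is not a git repository
	Subdir string // value of --subdir, empty when the flag is absent
}

var errSubdirWithoutGit = errors.New("--subdir is only supported for git dependencies")

// checkSubdirFlag reports an error when --subdir is combined with a
// non-git (for example OCI) dependency.
func checkSubdirFlag(opts addOptions) error {
	if opts.Subdir != "" && opts.GitURL == "" {
		return errSubdirWithoutGit
	}
	return nil
}

func main() {
	fmt.Println(checkSubdirFlag(addOptions{Subdir: "1.21"}))
}
```

Whether the caller treats the returned error as a warning or a hard failure is left open here, since the text above only calls for a warning.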

## References

1. https://medium.com/@marcoscannabrava/git-download-a-repositorys-specific-subfolder-ceeabc6023e2
2. https://pkg.go.dev/github.com/go-git/go-git/v5
3. https://pkg.go.dev/github.com/hashicorp/go-getter