Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add includes/excludes fields to filter remotes #1677

Merged
merged 1 commit into from
Jul 4, 2024

Conversation

git-hyagi
Copy link
Contributor

closes: #459

Copy link
Member

@lubosmj lubosmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After skimming through the PR, I noticed that you added a check to include/exclude a remote repository during regular syncing. Can you describe a workflow for this use case? Do you think a user would want to execute syncing of a remote repository while at the same time excluding the repository she wants to sync from? It appears like there is no valid use case for this.

I imagined having this include/exclude facility implemented only for the pull-through caching machinery because only there we can actually "sync" multiple repositories.

repo_name = self.remote.namespaced_upstream_name
try:
repo_name = filter_repo(self.remote)
except InvalidRequest:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A side note: this is a rest_framework exception that is supposed to be used within the api-app stack. It helps us to better render error messages. It is not very helpful to "raise" an HTTP 400 error from a task running inside a worker because there is no HTTP interface involved and a user cannot benefit from it.

@git-hyagi
Copy link
Contributor Author

Can you describe a workflow for this use case? Do you think a user would want to execute syncing of a remote repository while at the same time excluding the repository she wants to sync from?

My bad! I thought that we could sync/mirror an entire namespace from a remote upstream, but testing it again, I realized that it is not possible, which invalidates the use of includes/excludes in syncs.

@git-hyagi git-hyagi marked this pull request as draft June 25, 2024 13:28
@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch from cdeca78 to ba2e297 Compare June 25, 2024 15:42
@git-hyagi git-hyagi marked this pull request as ready for review June 25, 2024 15:59
@@ -0,0 +1 @@
Added support to filter repos in pull-through distributions using `includes`/`excludes` fields.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Added support to filter repos in pull-through distributions using `includes`/`excludes` fields.
Added support for filtering remote repositories in pull-through caching using `includes` and
`excludes` fields. These fields can be set on pull-through caching remote objects.

Comment on lines 358 to 368
includes (models.JSONField): Dictionary of repositories to include. [default=null]
excludes (models.JSONField): Dictionary of repositories to exclude. [default=null]
"""

upstream_name = models.TextField(db_index=True)
include_foreign_layers = models.BooleanField(default=False)
include_tags = fields.ArrayField(models.TextField(null=True), null=True)
exclude_tags = fields.ArrayField(models.TextField(null=True), null=True)
sigstore = models.TextField(null=True)
includes = models.JSONField(null=True)
excludes = models.JSONField(null=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a left-over from the previous implementation. Can we remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hum... I had left it there because the get_or_create from get_pull_through_drv was expecting it (if not provided, it was raising a FieldError exception):

remote_data = _get_pull_through_remote_data(pull_through_cache_distribution)
upstream_name = path.split(pull_through_cache_distribution.base_path, maxsplit=1)[1]
remote, _ = models.ContainerRemote.objects.get_or_create(
name=path,
upstream_name=upstream_name.strip("/"),
url=pull_through_cache_distribution.remote.url,
**remote_data,

but, now that you mentioned it, I guess we can remove it and just delete the includes/excludes keys from remote_data dictionary before calling the get_or_create.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Are not these two models completely isolated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hum... kind of!?

from my understanding, when we create the ContainerRemote

remote, _ = models.ContainerRemote.objects.get_or_create(
name=path,
upstream_name=upstream_name.strip("/"),
url=pull_through_cache_distribution.remote.url,
**remote_data,
)
we are passing the remote values (remote_data) extracted from the ContainerPullThroughRemote
remote_data = _get_pull_through_remote_data(pull_through_cache_distribution)

def _get_pull_through_remote_data(root_cache_distribution):
remote_data = models.ContainerPullThroughRemote.objects.filter(
pk=root_cache_distribution.remote_id
).values()[0]
for attr in IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES:
remote_data.pop(attr, None)
return remote_data

Copy link
Member

@lubosmj lubosmj Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I see. Then, we may want to ignore these fields too. Thus, we can alter IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES to better suit our needs.

Comment on lines 1114 to 1117
try:
repo_name = filter_repo(remote)
except RepositoryNotFound:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this stage, we already created a repository-remote-distribution triplet inside

def get_pull_through_drv(self, path):

If the filtering fails, we end up with left-overs. Do you think it makes more sense to move the check to a different place? For instance, here:

if not pull_through_cache_distribution:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it makes more sense to move the check to a different place?

Oh!! Nice!
At first, I thought that putting it there would be good because we would be failing early (before downloading the manifest), but I didn't think about the previous resources that had been created.

Comment on lines 322 to 327
if tags:
include = remote.include_tags
exclude = remote.exclude_tags
else:
include = remote.includes
exclude = remote.excludes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this if-else statement, can we try doing something like this:

from functools import partial

def include(x, patterns):
    return any(fnmatch.fnmatch(x, pattern) for pattern in patterns)

def exclude(x, patterns):
    return not include(x, patterns)

def filter_resource(element, include_patterns, exclude_patterns):
    return include(element, include_patterns or []) and exclude(element, exclude_patterns or [])

def filter_resources(element_list, include_patterns, exclude_patterns):
    if include_patterns:
        element_list = filter(partial(include, patterns=include_patterns), element_list)
    if exclude_patterns:
        element_list = filter(partial(exclude, patterns=exclude_patterns), element_list)
    return element_list
filtered_tags = filter_resources(tag_list, remote.include_tags, remote.exclude_tags)
has_access = filter_resource(upstream_name, pt_remote.includes, pt_remote.excludes)

The execution logic appears more clear to me. What do you think?

It is a bit unfortunate that we cannot say that include_tags/exclude_tags and includes/excludes can be empty lists by default (i.e., JSONField(default=list); though, I am not sure if migrating existing include_tags and exclude_tags is worth the pleasure). Otherwise, we could easily get rid of those if statements.

Comment on lines 15 to 37
@pytest.fixture
def pull_through_distribution(
gen_object_with_cleanup,
container_pull_through_remote_api,
container_pull_through_distribution_api,
):
def _pull_through_distribution(includes, excludes):
remote = gen_object_with_cleanup(
container_pull_through_remote_api,
{
"name": str(uuid4()),
"url": REGISTRY_V2_FEED_URL,
"includes": includes,
"excludes": excludes,
},
)
distribution = gen_object_with_cleanup(
container_pull_through_distribution_api,
{"name": str(uuid4()), "base_path": str(uuid4()), "remote": remote.pulp_href},
)
return distribution

return _pull_through_distribution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of defining another custom fixture, do you mind moving

to conftest.py and making it more generic?

Comment on lines 75 to 48
includes = []
excludes = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
includes = []
excludes = []
includes = None
excludes = []

Just to see if the implementation can handle None values on the backend.

@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch 2 times, most recently from 070a0bb to 0639927 Compare June 26, 2024 14:27
Comment on lines 338 to 332
if not (include_patterns or []) and not (exclude_patterns or []):
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not (include_patterns or []) and not (exclude_patterns or []):
return True
if not (include_patterns or exclude_patterns):
return True

Comment on lines 315 to 325
pull_through_remote = models.ContainerPullThroughRemote.objects.get(
pk=pull_through_cache_distribution.remote
)
if not filter_resource(path, pull_through_remote.includes, pull_through_remote.excludes):
raise RepositoryNotFound(name=path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can access the pull-through remote "directly". Then, we can also pass pull_through_cache_distribution.remote to _get_pull_through_remote_data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull_through_cache_distribution.remote is of type Remote (not ContainerPullThroughRemote) ... so there is no includes/excludes definition in it 😢

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can call cast() on it to get the correct type.

@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch from 0639927 to e0aabad Compare June 26, 2024 15:51
pk=pull_through_cache_distribution.remote_id
).values()[0]
if not filter_resource(
path, pull_through_remote.get("includes"), pull_through_remote.get("excludes")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to run the validation against upstream_name (i.e., repository name), not path. path includes the namespace of the caching distribution that is not transparent to the remote registry.

Comment on lines 310 to 311
includes = serializers.JSONField(required=False, allow_null=True)
excludes = serializers.JSONField(required=False, allow_null=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we should go with serializers.ListField and fields.ArrayField like we did for include_tags/exclude_tags. This will force users to use a valid list as an input.

Comment on lines 315 to 317
pull_through_remote = models.ContainerPullThroughRemote.objects.filter(
pk=pull_through_cache_distribution.remote_id
).values()[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can access pull_through_remote directly from pull_through_cache_distribution.remote. To save one DB query, we can run this a couple of lines above:

    pull_through_cache_distribution = (
        models.ContainerPullThroughDistribution.objects.annotate(path=Value(path))
        .filter(path__startswith=F("base_path"))
        .order_by("-base_path")
        .select_related("remote")
        .first()
    )

@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch from e0aabad to 9fdd136 Compare June 28, 2024 14:36
Copy link
Member

@lubosmj lubosmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! We are almost there! 🎸

Comment on lines 306 to 311
try:
pull_through_cache_distribution, upstream_name, pull_through_remote = _filter_remotes(
path
)
except RepositoryNotFound:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Returning "semi-random" values from a function does not look that elegant. Is the _filter_remotes function filtering pull-through distributions? Let's revert this change instead of forcing the function extraction.

Also, there is no point of catching an exception and re-raising it without any additional processing. Please, remove the try-except block.

http ${BASE_ADDR}/pulp/api/v3/distributions/container/pull-through/ remote=${REMOTE_HREF} name=docker-cache base_path=docker-cache
```

Pulling an allowed content:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Pulling an allowed content:
Pulling allowed content:

Comment on lines 67 to 68
trying to pull from a repo which has *molecule_debian* in its name will fail because it is filtered
by the *excludes* definition:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
trying to pull from a repo which has *molecule_debian* in its name will fail because it is filtered
by the *excludes* definition:
Pulling from a repository that includes *molecule_debian* in its name will fail because it is filtered by the *excludes* definition.

Comment on lines 75 to 76
and since we are *including* only repositories with \*pulp\* in its name (`includes=["*pulp*"]`),
the following image pull should also fail:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and since we are *including* only repositories with \*pulp\* in its name (`includes=["*pulp*"]`),
the following image pull should also fail:
Since only repositories with *pulp* in their names are included (`includes=["*pulp*"]`), the following image pull will also fail:


```
podman pull localhost:24817/docker-cache/library/hello-world
Error response from daemon: unknown: No repository found for the defined remote filters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is the error the client sees?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops... my bad
I changed the code and forgot to update the doc 😅

Comment on lines 25 to 28
if excludes and re.match(".*fixture.*", image_path):
with pytest.raises(subprocess.CalledProcessError):
local_registry.pull(local_image_path)
assert (
re.search(
".*Repository not found.*",
capfd.readouterr().err,
)
is not None
)
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very hard to read and understand this part. We can pass expected values into this function and do the following:

if image_path not in XXX:
    with pytest.raises(subprocess.CalledProcessError) as exc_info:
        local_registry.pull(local_image_path)
    assert "Repository not found" in exc_info.value  # if possible, https://docs.pytest.org/en/7.1.x/how-to/assert.html#assertions-about-expected-exceptions
else:
    local_registry.pull(local_image_path)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use the exception output, but the outputs from subprocesses seem to not be captured:
https://docs.pytest.org/en/6.2.x/capture.html

If you want to capture on filedescriptor level you can use the capfd fixture which offers the exact same interface but allows to also capture output from libraries or subprocesses that directly write to operating system level output streams (FD1 and FD2).

if not filter_resource(
upstream_name, pull_through_remote.get("includes"), pull_through_remote.get("excludes")
):
raise RepositoryNotFound(name=path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would not be better to return an HTTP 403 error instead (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403). What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hum... when I see a 403 I think about roles/rights and, in this new filter feature, we are not denying or blocking users access, it feels more like "not making it visible" or just removing/bypassing a resource from the results as a convenience and not denying it since they could create a new pull-through remote with different includes/excludes ... and maybe a 403 could "misdirect" into a role troubleshoot. Do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. Let's leave it unchanged then.

Comment on lines 411 to 413
pull_through_remote = models.ContainerPullThroughRemote.objects.filter(
pk=pull_through_cache_distribution.remote_id
).values()[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to chatgpt, there are two ways we can follow to refactor this part:

from django.db import models
from django.forms.models import model_to_dict

class YourModel(models.Model):
    field1 = models.CharField(max_length=100)
    field2 = models.IntegerField()
    field3 = models.DateField()

    def to_dict(self):
        return {
            'field1': self.field1,
            'field2': self.field2,
            'field3': self.field3,
        }

# 1. Using model_to_dict
model_instance = YourModel.objects.get(pk=1)
model_dict = model_to_dict(model_instance)

# 2. Using the custom to_dict method
model_dict = model_instance.to_dict()

Do you find any of these methods more elegant? Maybe, we will not even need _get_pull_through_remote_data.

Comment on lines 506 to 508
TYPE = "pull-through"
includes = fields.ArrayField(models.TextField(null=True), null=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TYPE = "pull-through"
includes = fields.ArrayField(models.TextField(null=True), null=True)
TYPE = "pull-through"
includes = fields.ArrayField(models.TextField(null=True), null=True)

It looks neat if we semantically separate model fields and constants.

Comment on lines 45 to 49
def test_no_filter(pull_through_distribution, pull_and_verify):
images = [f"{PULP_FIXTURE_1}:manifest_a", f"{PULP_FIXTURE_1}:manifest_b"]
includes = None
excludes = []
pull_and_verify(images, pull_through_distribution, includes, excludes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you run out of function names, it is still feasible to run the same test with different parameters, like

.

@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch from 9fdd136 to 993fe47 Compare July 2, 2024 14:57
@@ -512,6 +517,35 @@ class Meta:
),
]

def model_to_dict(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not from django.forms.models import model_to_dict suit our needs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does, but when I tried to use it I had 3 concerns:

  • the "old version" was returning a field called pulp_domain whereas with django's model_to_dict it returns pulp_domain_id, so we would need to somehow handle this corner case
  • with django's model_to_dict we would still need _get_pull_through_remote_data (not the method itself, but the IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES that is used by it)
  • had to remove the url from IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES (it was causing an exception without it)

Here is what I tried:

    remote_data = model_to_dict(pull_through_remote, exclude=IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES)
    remote_data['pulp_domain_id'] = remote_data['pulp_domain']
    remote_data.pop('pulp_domain', None)

Do you think we should use it instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the execution fail when you passed "pulp_domain_id" to "Remote.create"? Either way, we can try doing what you have suggested:

IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES = ("includes", "excludes", "pulp_domain")
remote_data = model_to_dict(pull_through_remote, exclude=IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES)

remote_image_path = f"{REGISTRY_V2}/{image_path}"
local_image_path = f"{distr.base_path}/{image_path}"

if excludes and "hello-world" not in image_path:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is hello-world? It came out of nowhere. Is not it better to use the following structure?

if image_path not in expected_paths:
   blabla

@lubosmj
Copy link
Member

lubosmj commented Jul 3, 2024

Lastly, would you change the commit message to refer to remote repositories pulled during the pull-through caching, please?

@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch from 993fe47 to 08c9124 Compare July 3, 2024 22:50
Comment on lines 28 to 34
continue

local_registry.pull(local_image_path)
local_image = local_registry.inspect(local_image_path)
registry_client.pull(remote_image_path)
remote_image = registry_client.inspect(remote_image_path)
assert local_image[0]["Id"] == remote_image[0]["Id"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
continue
local_registry.pull(local_image_path)
local_image = local_registry.inspect(local_image_path)
registry_client.pull(remote_image_path)
remote_image = registry_client.inspect(remote_image_path)
assert local_image[0]["Id"] == remote_image[0]["Id"]
else:
local_registry.pull(local_image_path)
local_image = local_registry.inspect(local_image_path)
registry_client.pull(remote_image_path)
remote_image = registry_client.inspect(remote_image_path)
assert local_image[0]["Id"] == remote_image[0]["Id"]

@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch from 08c9124 to ff72ab6 Compare July 4, 2024 15:11
Copy link
Member

@lubosmj lubosmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A little more nitpicking, but approving.


def include(x, patterns):
"""
Checks if any item from `patterns` matches x, meaning it should be included as a remote repo.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make these docstrings more generic. We do not know how the functions are going to be used. The caller is the one that decides the purpose of the functions. For these one-line functions, it might be worth omitting the docstring too. Also, we do not care only about remote repositories in this context.

@pytest.fixture
def pull_and_verify(
capfd,
delete_orphans_pre,
Copy link
Member

@lubosmj lubosmj Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move delete_orphans_pre to test_includes_excludes_filter. It is a test case that needs to decide whether to use a god-like cleanup or not.

This commit adds a feature to refine (include only and/or remove unwanted) remote
repositories pulled during pull-through caching.

closes: pulp#459
@git-hyagi git-hyagi force-pushed the remote-repo-include-exclude branch from ff72ab6 to 4a8d746 Compare July 4, 2024 18:55
@lubosmj lubosmj merged commit 5908f47 into pulp:main Jul 4, 2024
16 checks passed
@git-hyagi git-hyagi deleted the remote-repo-include-exclude branch July 4, 2024 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: whitelist upstreams
2 participants