-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add includes/excludes fields to filter remotes #1677
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After skimming through the PR, I noticed that you added a check to include/exclude a remote repository during regular syncing. Can you describe a workflow for this use case? Do you think a user would want to execute syncing of a remote repository while at the same time excluding the repository she wants to sync from? It appears like there is no valid use case for this.
I imagined having this include/exclude facility implemented only for the pull-through caching machinery because only there we can actually "sync" multiple repositories.
repo_name = self.remote.namespaced_upstream_name | ||
try: | ||
repo_name = filter_repo(self.remote) | ||
except InvalidRequest: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A side note: this is a rest_framework
exception that is supposed to be used within the api-app stack. It helps us to better render error messages. It is not very helpful to "raise" an HTTP 400 error from a task running inside a worker because there is no HTTP interface involved and a user cannot benefit from it.
My bad! I thought that we could sync/mirror an entire namespace from a remote upstream, but testing it again, I realized that it is not possible, which invalidates the use of |
cdeca78
to
ba2e297
Compare
CHANGES/459.feature
Outdated
@@ -0,0 +1 @@ | |||
Added support to filter repos in pull-through distributions using `includes`/`excludes` fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added support to filter repos in pull-through distributions using `includes`/`excludes` fields. | |
Added support for filtering remote repositories in pull-through caching using `includes` and | |
`excludes` fields. These fields can be set on pull-through caching remote objects. |
pulp_container/app/models.py
Outdated
includes (models.JSONField): Dictionary of repositories to include. [default=null] | ||
excludes (models.JSONField): Dictionary of repositories to exclude. [default=null] | ||
""" | ||
|
||
upstream_name = models.TextField(db_index=True) | ||
include_foreign_layers = models.BooleanField(default=False) | ||
include_tags = fields.ArrayField(models.TextField(null=True), null=True) | ||
exclude_tags = fields.ArrayField(models.TextField(null=True), null=True) | ||
sigstore = models.TextField(null=True) | ||
includes = models.JSONField(null=True) | ||
excludes = models.JSONField(null=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a left-over from the previous implementation. Can we remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hum... I had left it there because the get_or_create
from get_pull_through_drv
was expecting it (if not provided, it was raising a FieldError
exception):
pulp_container/pulp_container/app/registry_api.py
Lines 318 to 324 in b126f4c
remote_data = _get_pull_through_remote_data(pull_through_cache_distribution) | |
upstream_name = path.split(pull_through_cache_distribution.base_path, maxsplit=1)[1] | |
remote, _ = models.ContainerRemote.objects.get_or_create( | |
name=path, | |
upstream_name=upstream_name.strip("/"), | |
url=pull_through_cache_distribution.remote.url, | |
**remote_data, |
but, now that you mentioned it, I guess we can remove it and just delete the includes
/excludes
keys from remote_data
dictionary before calling the get_or_create
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Are not these two models completely isolated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hum... kind of!?
from my understanding, when we create the ContainerRemote
pulp_container/pulp_container/app/registry_api.py
Lines 320 to 325 in b126f4c
remote, _ = models.ContainerRemote.objects.get_or_create( | |
name=path, | |
upstream_name=upstream_name.strip("/"), | |
url=pull_through_cache_distribution.remote.url, | |
**remote_data, | |
) |
remote_data
) extracted from the ContainerPullThroughRemote
remote_data = _get_pull_through_remote_data(pull_through_cache_distribution) |
pulp_container/pulp_container/app/registry_api.py
Lines 393 to 399 in ba2e297
def _get_pull_through_remote_data(root_cache_distribution): | |
remote_data = models.ContainerPullThroughRemote.objects.filter( | |
pk=root_cache_distribution.remote_id | |
).values()[0] | |
for attr in IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES: | |
remote_data.pop(attr, None) | |
return remote_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I see. Then, we may want to ignore these fields too. Thus, we can alter IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES
to better suit our needs.
pulp_container/app/registry_api.py
Outdated
try: | ||
repo_name = filter_repo(remote) | ||
except RepositoryNotFound: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this stage, we already created a repository-remote-distribution triplet inside
def get_pull_through_drv(self, path): |
If the filtering fails, we end up with left-overs. Do you think it makes more sense to move the check to a different place? For instance, here:
if not pull_through_cache_distribution: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it makes more sense to move the check to a different place?
Oh!! Nice!
At first, I thought that putting it there would be good because we would be failing early (before downloading the manifest), but I didn't think about the previous resources that had been created.
pulp_container/app/utils.py
Outdated
if tags: | ||
include = remote.include_tags | ||
exclude = remote.exclude_tags | ||
else: | ||
include = remote.includes | ||
exclude = remote.excludes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having this if-else statement, can we try doing something like this:
from functools import partial
def include(x, patterns):
return any(fnmatch.fnmatch(x, pattern) for pattern in patterns)
def exclude(x, patterns):
return not include(x, patterns)
def filter_resource(element, include_patterns, exclude_patterns):
return include(element, include_patterns or []) and exclude(element, exclude_patterns or [])
def filter_resources(element_list, include_patterns, exclude_patterns):
if include_patterns:
element_list = filter(partial(include, patterns=include_patterns), element_list)
if exclude_patterns:
element_list = filter(partial(exclude, patterns=exclude_patterns), element_list)
return element_list
filtered_tags = filter_resources(tag_list, remote.include_tags, remote.exclude_tags)
has_access = filter_resource(upstream_name, pt_remote.includes, pt_remote.excludes)
The execution logic appears more clear to me. What do you think?
It is a bit unfortunate that we cannot say that include_tags
/exclude_tags
and includes
/excludes
can be empty lists by default (i.e., JSONField(default=list)
; though, I am not sure if migrating existing include_tags
and exclude_tags
is worth the pleasure). Otherwise, we could easily get rid of those if statements.
@pytest.fixture | ||
def pull_through_distribution( | ||
gen_object_with_cleanup, | ||
container_pull_through_remote_api, | ||
container_pull_through_distribution_api, | ||
): | ||
def _pull_through_distribution(includes, excludes): | ||
remote = gen_object_with_cleanup( | ||
container_pull_through_remote_api, | ||
{ | ||
"name": str(uuid4()), | ||
"url": REGISTRY_V2_FEED_URL, | ||
"includes": includes, | ||
"excludes": excludes, | ||
}, | ||
) | ||
distribution = gen_object_with_cleanup( | ||
container_pull_through_distribution_api, | ||
{"name": str(uuid4()), "base_path": str(uuid4()), "remote": remote.pulp_href}, | ||
) | ||
return distribution | ||
|
||
return _pull_through_distribution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of defining another custom fixture, do you mind moving
def pull_through_distribution( |
conftest.py
and making it more generic?
includes = [] | ||
excludes = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
includes = [] | |
excludes = [] | |
includes = None | |
excludes = [] |
Just to see if the implementation can handle None
values on the backend.
070a0bb
to
0639927
Compare
pulp_container/app/utils.py
Outdated
if not (include_patterns or []) and not (exclude_patterns or []): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not (include_patterns or []) and not (exclude_patterns or []): | |
return True | |
if not (include_patterns or exclude_patterns): | |
return True |
pulp_container/app/registry_api.py
Outdated
pull_through_remote = models.ContainerPullThroughRemote.objects.get( | ||
pk=pull_through_cache_distribution.remote | ||
) | ||
if not filter_resource(path, pull_through_remote.includes, pull_through_remote.excludes): | ||
raise RepositoryNotFound(name=path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can access the pull-through remote "directly". Then, we can also pass pull_through_cache_distribution.remote
to _get_pull_through_remote_data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pull_through_cache_distribution.remote
is of type Remote
(not ContainerPullThroughRemote
) ... so there is no includes/excludes definition in it 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can call cast() on it to get the correct type.
0639927
to
e0aabad
Compare
pulp_container/app/registry_api.py
Outdated
pk=pull_through_cache_distribution.remote_id | ||
).values()[0] | ||
if not filter_resource( | ||
path, pull_through_remote.get("includes"), pull_through_remote.get("excludes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to run the validation against upstream_name
(i.e., repository name), not path
. path
includes the namespace of the caching distribution that is not transparent to the remote registry.
pulp_container/app/serializers.py
Outdated
includes = serializers.JSONField(required=False, allow_null=True) | ||
excludes = serializers.JSONField(required=False, allow_null=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, we should go with serializers.ListField
and fields.ArrayField
like we did for include_tags
/exclude_tags
. This will force users to use a valid list as an input.
pulp_container/app/registry_api.py
Outdated
pull_through_remote = models.ContainerPullThroughRemote.objects.filter( | ||
pk=pull_through_cache_distribution.remote_id | ||
).values()[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can access pull_through_remote
directly from pull_through_cache_distribution.remote
. To save one DB query, we can run this a couple of lines above:
pull_through_cache_distribution = (
models.ContainerPullThroughDistribution.objects.annotate(path=Value(path))
.filter(path__startswith=F("base_path"))
.order_by("-base_path")
.select_related("remote")
.first()
)
e0aabad
to
9fdd136
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! We are almost there! 🎸
pulp_container/app/registry_api.py
Outdated
try: | ||
pull_through_cache_distribution, upstream_name, pull_through_remote = _filter_remotes( | ||
path | ||
) | ||
except RepositoryNotFound: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 Returning "semi-random" values from a function does not look that elegant. Is the _filter_remotes
function filtering pull-through distributions? Let's revert this change instead of forcing the function extraction.
Also, there is no point of catching an exception and re-raising it without any additional processing. Please, remove the try-except block.
http ${BASE_ADDR}/pulp/api/v3/distributions/container/pull-through/ remote=${REMOTE_HREF} name=docker-cache base_path=docker-cache | ||
``` | ||
|
||
Pulling an allowed content: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling an allowed content: | |
Pulling allowed content: |
trying to pull from a repo which has *molecule_debian* in its name will fail because it is filtered | ||
by the *excludes* definition: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to pull from a repo which has *molecule_debian* in its name will fail because it is filtered | |
by the *excludes* definition: | |
Pulling from a repository that includes *molecule_debian* in its name will fail because it is filtered by the *excludes* definition. |
and since we are *including* only repositories with \*pulp\* in its name (`includes=["*pulp*"]`), | ||
the following image pull should also fail: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and since we are *including* only repositories with \*pulp\* in its name (`includes=["*pulp*"]`), | |
the following image pull should also fail: | |
Since only repositories with *pulp* in their names are included (`includes=["*pulp*"]`), the following image pull will also fail: |
|
||
``` | ||
podman pull localhost:24817/docker-cache/library/hello-world | ||
Error response from daemon: unknown: No repository found for the defined remote filters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is the error the client sees?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops... my bad
I changed the code and forgot to update the doc 😅
if excludes and re.match(".*fixture.*", image_path): | ||
with pytest.raises(subprocess.CalledProcessError): | ||
local_registry.pull(local_image_path) | ||
assert ( | ||
re.search( | ||
".*Repository not found.*", | ||
capfd.readouterr().err, | ||
) | ||
is not None | ||
) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is very hard to read and understand this part. We can pass expected values into this function and do the following:
if image_path not in XXX:
with pytest.raises(subprocess.CalledProcessError) as exc_info:
local_registry.pull(local_image_path)
assert "Repository not found" in exc_info.value # if possible, https://docs.pytest.org/en/7.1.x/how-to/assert.html#assertions-about-expected-exceptions
else:
local_registry.pull(local_image_path)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to use the exception output, but the outputs from subprocesses seem to not be captured:
https://docs.pytest.org/en/6.2.x/capture.html
If you want to capture on filedescriptor level you can use the capfd fixture which offers the exact same interface but allows to also capture output from libraries or subprocesses that directly write to operating system level output streams (FD1 and FD2).
pulp_container/app/registry_api.py
Outdated
if not filter_resource( | ||
upstream_name, pull_through_remote.get("includes"), pull_through_remote.get("excludes") | ||
): | ||
raise RepositoryNotFound(name=path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would not be better to return an HTTP 403 error instead (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hum... when I see a 403 I think about roles/rights and, in this new filter feature, we are not denying or blocking users access, it feels more like "not making it visible" or just removing/bypassing a resource from the results as a convenience and not denying it since they could create a new pull-through remote with different includes/excludes ... and maybe a 403 could "misdirect" into a role troubleshoot. Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. Let's leave it unchanged then.
pulp_container/app/registry_api.py
Outdated
pull_through_remote = models.ContainerPullThroughRemote.objects.filter( | ||
pk=pull_through_cache_distribution.remote_id | ||
).values()[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to chatgpt, there are two ways we can follow to refactor this part:
from django.db import models
from django.forms.models import model_to_dict
class YourModel(models.Model):
field1 = models.CharField(max_length=100)
field2 = models.IntegerField()
field3 = models.DateField()
def to_dict(self):
return {
'field1': self.field1,
'field2': self.field2,
'field3': self.field3,
}
# 1. Using model_to_dict
model_instance = YourModel.objects.get(pk=1)
model_dict = model_to_dict(model_instance)
# 2. Using the custom to_dict method
model_dict = model_instance.to_dict()
Do you find any of these methods more elegant? Maybe, we will not even need _get_pull_through_remote_data
.
pulp_container/app/models.py
Outdated
TYPE = "pull-through" | ||
includes = fields.ArrayField(models.TextField(null=True), null=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TYPE = "pull-through" | |
includes = fields.ArrayField(models.TextField(null=True), null=True) | |
TYPE = "pull-through" | |
includes = fields.ArrayField(models.TextField(null=True), null=True) |
It looks neat if we semantically separate model fields and constants.
def test_no_filter(pull_through_distribution, pull_and_verify): | ||
images = [f"{PULP_FIXTURE_1}:manifest_a", f"{PULP_FIXTURE_1}:manifest_b"] | ||
includes = None | ||
excludes = [] | ||
pull_and_verify(images, pull_through_distribution, includes, excludes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you run out of function names, it is still feasible to run the same test with different parameters, like
@pytest.mark.parametrize( |
9fdd136
to
993fe47
Compare
pulp_container/app/models.py
Outdated
@@ -512,6 +517,35 @@ class Meta: | |||
), | |||
] | |||
|
|||
def model_to_dict(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not from django.forms.models import model_to_dict
suit our needs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it does, but when I tried to use it I had 3 concerns:
- the "old version" was returning a field called
pulp_domain
whereas with django's model_to_dict it returnspulp_domain_id
, so we would need to somehow handle this corner case - with django's model_to_dict we would still need _get_pull_through_remote_data (not the method itself, but the
IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES
that is used by it) - had to remove the url from
IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES
(it was causing an exception without it)
Here is what I tried:
remote_data = model_to_dict(pull_through_remote, exclude=IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES)
remote_data['pulp_domain_id'] = remote_data['pulp_domain']
remote_data.pop('pulp_domain', None)
Do you think we should use it instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did the execution fail when you passed "pulp_domain_id" to "Remote.create"? Either way, we can try doing what you have suggested:
IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES = ("includes", "excludes", "pulp_domain")
remote_data = model_to_dict(pull_through_remote, exclude=IGNORED_PULL_THROUGH_REMOTE_ATTRIBUTES)
remote_image_path = f"{REGISTRY_V2}/{image_path}" | ||
local_image_path = f"{distr.base_path}/{image_path}" | ||
|
||
if excludes and "hello-world" not in image_path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is hello-world
? It came out of nowhere. Is not it better to use the following structure?
if image_path not in expected_paths:
blabla
Lastly, would you change the commit message to refer to remote repositories pulled during the pull-through caching, please? |
993fe47
to
08c9124
Compare
continue | ||
|
||
local_registry.pull(local_image_path) | ||
local_image = local_registry.inspect(local_image_path) | ||
registry_client.pull(remote_image_path) | ||
remote_image = registry_client.inspect(remote_image_path) | ||
assert local_image[0]["Id"] == remote_image[0]["Id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continue | |
local_registry.pull(local_image_path) | |
local_image = local_registry.inspect(local_image_path) | |
registry_client.pull(remote_image_path) | |
remote_image = registry_client.inspect(remote_image_path) | |
assert local_image[0]["Id"] == remote_image[0]["Id"] | |
else: | |
local_registry.pull(local_image_path) | |
local_image = local_registry.inspect(local_image_path) | |
registry_client.pull(remote_image_path) | |
remote_image = registry_client.inspect(remote_image_path) | |
assert local_image[0]["Id"] == remote_image[0]["Id"] |
08c9124
to
ff72ab6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! A little more nitpicking, but approving.
pulp_container/app/utils.py
Outdated
|
||
def include(x, patterns): | ||
""" | ||
Checks if any item from `patterns` matches x, meaning it should be included as a remote repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make these docstrings more generic. We do not know how the functions are going to be used. The caller is the one that decides the purpose of the functions. For these one-line functions, it might be worth omitting the docstring too. Also, we do not care only about remote repositories in this context.
@pytest.fixture | ||
def pull_and_verify( | ||
capfd, | ||
delete_orphans_pre, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move delete_orphans_pre
to test_includes_excludes_filter
. It is a test case that needs to decide whether to use a god-like cleanup or not.
This commit adds a feature to refine (include only and/or remove unwanted) remote repositories pulled during pull-through caching. closes: pulp#459
ff72ab6
to
4a8d746
Compare
closes: #459