exec remote host collectors in a daemonset #1671
Conversation
pkg/supportbundle/collect.go
Outdated
// TODO:
// delete the config map
// delete the remote pods
clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{})
Potential panic when ds is not created
I have added:
_, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
if err != nil {
klog.Errorf("Failed to verify remote host collector daemonset %s still exists: %v", ds.Name, err)
return
}
if err := clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{}); err != nil {
klog.Errorf("Failed to delete remote host collector daemonset %s: %v", ds.Name, err)
}
I think if the ds is not created, it will log the not found error.
pkg/supportbundle/collect.go
Outdated
if additionalRedactors != nil {
	return additionalRedactors.Spec.Redactors
func waitForPodRunning(ctx context.Context, clientset kubernetes.Interface, pod *corev1.Pod) error {
	watcher, err := clientset.CoreV1().Pods(pod.Namespace).Watch(ctx, metav1.ListOptions{
To improve reliability, should we poll if the user does not have the necessary "watch" permissions?
Thank you! I have changed to the polling method.
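For illustration, a minimal sketch of what a poll-based wait could look like, assuming client-go and a recent apimachinery with the wait helpers; the interval, timeout, and error handling here are illustrative, not the values used in the PR.

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodRunning polls the pod until it reaches the Running phase or the timeout expires.
func waitForPodRunning(ctx context.Context, clientset kubernetes.Interface, pod *corev1.Pod) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		p, err := clientset.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
		if err != nil {
			// Treat transient errors as "not ready yet"; only the timeout ends the wait.
			return false, nil
		}
		return p.Status.Phase == corev1.PodRunning, nil
	})
}

Polling only needs "get" permission on pods, so it sidesteps the "watch" RBAC concern raised above.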
pkg/supportbundle/collect.go
Outdated
}

func waitForDS(ctx context.Context, clientset kubernetes.Interface, ds *appsv1.DaemonSet) error {
	watcher, err := clientset.AppsV1().DaemonSets(ds.Namespace).Watch(ctx, metav1.ListOptions{
Same here. Poll if there are no "watch" permissions. We might just default to polling if it does not introduce a significant delay.
Changed to polling.
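Along the same lines, a hedged sketch of how waitForDS could poll the DaemonSet status instead of watching; the readiness condition and timings are assumptions, and the imports are the same as in the sketch above except that appsv1 "k8s.io/api/apps/v1" replaces corev1.

// waitForDS polls until every node that should run the daemonset reports a ready pod.
func waitForDS(ctx context.Context, clientset kubernetes.Interface, ds *appsv1.DaemonSet) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		current, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
		if err != nil {
			return false, nil // retry on transient errors until the timeout
		}
		return current.Status.DesiredNumberScheduled > 0 &&
			current.Status.NumberReady == current.Status.DesiredNumberScheduled, nil
	})
}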
pkg/supportbundle/collect.go
Outdated
klog.V(2).Infof("Created Remote Host Collector Daemonset %s", ds.Name)
pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector: selectorLabelKey + "=" + selectorLabelValue,
	TimeoutSeconds: new(int64),
Is this 0 timeout intentional?
I have changed it to TimeoutSeconds: ptr.To(int64(defaultTimeout)).
pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector: selectorLabelKey + "=" + selectorLabelValue,
	TimeoutSeconds: new(int64),
	Limit: 0,
Wouldn't limit=0 mean that no pods should be returned? I might be missing something here though
0 means that there is no limit on the number of pods returned in this list operation. It will attempt to return all matching pods without restricting the count.
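For reference, a hedged sketch of what the adjusted list call could look like, assuming the surrounding identifiers from the diff (clientset, ds, selectorLabelKey, selectorLabelValue, defaultTimeout) and the k8s.io/utils/ptr helper; error handling follows whatever the surrounding function does.

// List every pod created by the daemonset, across all nodes.
pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector:  selectorLabelKey + "=" + selectorLabelValue,
	TimeoutSeconds: ptr.To(int64(defaultTimeout)), // server-side timeout for the list call itself
	Limit:          0,                             // 0 means unlimited: return every matching pod
})
// handle err as in the surrounding function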
parameterCodec := runtime.NewParameterCodec(scheme)
req.VersionedParams(&corev1.PodExecOptions{
	Command: []string{"/troubleshoot/collect", "-", "--chroot", "/host", "--format", "raw"},
Using stdin instead of relying on configmaps is pretty neat!
Yes, that is very efficient!
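To make the stdin-based exec concrete, here is a minimal sketch using client-go's remotecommand package. The helper name execCollect, the restConfig and spec parameters, and the use of client-go's scheme.ParameterCodec in place of the hand-built codec above are assumptions for illustration, not the PR's exact code.

import (
	"bytes"
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execCollect runs the collector binary inside an existing daemonset pod and feeds
// the collector spec over stdin (the "-" argument), so no ConfigMap is needed.
func execCollect(ctx context.Context, clientset kubernetes.Interface, restConfig *rest.Config, pod corev1.Pod, spec []byte) ([]byte, error) {
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(pod.Namespace).
		Name(pod.Name).
		SubResource("exec")
	req.VersionedParams(&corev1.PodExecOptions{
		Command: []string{"/troubleshoot/collect", "-", "--chroot", "/host", "--format", "raw"},
		Stdin:   true,
		Stdout:  true,
		Stderr:  true,
	}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(restConfig, "POST", req.URL())
	if err != nil {
		return nil, err
	}
	var stdout, stderr bytes.Buffer
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdin:  bytes.NewReader(spec), // the collector spec goes over stdin
		Stdout: &stdout,
		Stderr: &stderr,
	}); err != nil {
		return nil, err
	}
	return stdout.Bytes(), nil
}

Streaming the spec over stdin means there is nothing extra to create or clean up per run, which is what makes this approach attractive compared to ConfigMaps.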
pkg/supportbundle/collect.go
Outdated
// delete the config map
// delete the remote pods
// check if the daemonset still exists
_, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
Wouldn't it be sufficient to just delete the DaemonSet and log the error if it wasn't present?
Yes, it should be sufficient to just delete the DaemonSet and log the not found error. I have modified the code.
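A minimal sketch of that simplified cleanup, assuming the ds handle and klog logger from the surrounding code plus apierrors from k8s.io/apimachinery/pkg/api/errors; the IsNotFound check just downgrades the already-deleted case to an informational log.

// Delete unconditionally; a NotFound error just means there is nothing left to clean up.
if err := clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{}); err != nil {
	if apierrors.IsNotFound(err) {
		klog.V(2).Infof("Remote host collector daemonset %s was already deleted", ds.Name)
	} else {
		klog.Errorf("Failed to delete remote host collector daemonset %s: %v", ds.Name, err)
	}
}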
pkg/supportbundle/collect.go
Outdated
}

func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
	ns := "default"
Should we allow passing in the namespace? In embedded cluster, for example, the kotsadm and embedded-cluster namespaces exist. KOTS may have permissions on one of these but not on the default namespace.
Good catch. The namespace has been added.
func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
	ns := "default"
	imageName := "replicated/troubleshoot:latest"
Should we use a versioned tag instead? You can get the version string from https://github.com/replicatedhq/troubleshoot/blob/main/pkg/version/version.go
The challenging part is that if we use the version string in local development, the image does not exist; we would have to build it locally first. Do you have any idea about that?
I found a way to use the version tag. It has to be a semantic version.
It seems it cannot pass the tests, so I have reverted the versioned Docker image tag.
I suggest that we switch to the version tag in a follow-up PR.
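For that follow-up, one possible approach (a sketch only, assuming pkg/version exposes a Version() string as the linked file suggests, and that a semver parser such as github.com/Masterminds/semver/v3 is available): use the versioned tag for release builds and fall back to latest for local development builds, where no versioned image has been published.

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
	"github.com/replicatedhq/troubleshoot/pkg/version"
)

// collectorImage returns a pinned image for release builds and falls back to
// latest for local/dev builds, where no matching versioned image exists.
func collectorImage() string {
	v := version.Version() // assumption: pkg/version exposes the build version string
	if _, err := semver.NewVersion(v); err != nil {
		return "replicated/troubleshoot:latest" // dev builds report a non-semver version
	}
	return fmt.Sprintf("replicated/troubleshoot:%s", v)
}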
pkg/supportbundle/collect.go
Outdated
func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
	ns := "default"
	imageName := "replicated/troubleshoot:latest"
	imagePullPolicy := corev1.PullAlways
If we use a tag, perhaps this should be IfNotPresent
Changed to IfNotPresent.
pkg/supportbundle/collect.go
Outdated
ImagePullPolicy: imagePullPolicy,
Name: "remote-collector",
Command: []string{"/bin/bash", "-c"},
Args: []string{"while true; do sleep 30; done;"},
Collection shouldn't take longer than 30s, but I'm thinking we should have a better way of handling this. We could, for example, implement a collect pause subcommand that waits for a termination signal and then exits gracefully.
I have changed it to use
Command: []string{"tail", "-f", "/dev/null"},
which effectively blocks forever without doing anything.
Copied from another PR:
The remote collectors have been using
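Putting the pieces from this review together, here is a hedged sketch of what the collector DaemonSet might look like after the changes discussed above (namespace parameter, configurable image, IfNotPresent pull policy, and the tail -f /dev/null placeholder command). The function signature, label handling, and the host-root mount and host namespaces (inferred from the --chroot /host flag in the exec command) are illustrative assumptions, not code copied from the PR.

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createHostCollectorDS creates the remote host collector daemonset in the given
// namespace; the container just blocks so that collections can later be exec'd into it.
func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, ns, image string, labels map[string]string) (*appsv1.DaemonSet, error) {
	hostPathType := corev1.HostPathDirectory
	ds := &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "remote-host-collector-",
			Namespace:    ns,
			Labels:       labels,
		},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					HostNetwork: true,
					HostPID:     true,
					Containers: []corev1.Container{{
						Name:            "remote-collector",
						Image:           image,
						ImagePullPolicy: corev1.PullIfNotPresent,
						// Block forever; the actual collection is exec'd into the pod later.
						Command: []string{"tail", "-f", "/dev/null"},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "host-root",
							MountPath: "/host",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "host-root",
						VolumeSource: corev1.VolumeSource{
							HostPath: &corev1.HostPathVolumeSource{Path: "/", Type: &hostPathType},
						},
					}},
				},
			},
		},
	}
	return clientset.AppsV1().DaemonSets(ns).Create(ctx, ds, metav1.CreateOptions{})
}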
LGTM
LGTM
Description, Motivation and Context
This PR changes how remote host collection is scheduled.
It switches the runners to exec calls against pods managed by a daemonset, rather than one runner per pod.
Checklist
Does this PR introduce a breaking change?