
exec remote host collectors in a daemonset #1671

Merged
merged 36 commits into main from ash/daemonset-exec on Nov 11, 2024
Conversation

@hedge-sparrow (Member) commented Nov 1, 2024

Description, Motivation and Context

This PR changes how remote host collection is scheduled.

It switches the runners to be exec calls into pods managed by a DaemonSet, rather than one pod per runner.

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@hedge-sparrow requested a review from a team as a code owner on November 1, 2024 13:55
@hedge-sparrow force-pushed the ash/daemonset-exec branch 2 times, most recently from 5791595 to 92e215d on November 1, 2024 14:38
@hedge-sparrow added the type::feature New feature or request label on Nov 1, 2024
DexterYan previously approved these changes Nov 4, 2024
// TODO:
// delete the config map
// delete the remote pods
clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{})
Member

Potential panic when ds is not created

Member

I have added

_, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
if err != nil {
	klog.Errorf("Failed to verify remote host collector daemonset %s still exists: %v", ds.Name, err)
	return
}

if err := clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{}); err != nil {
	klog.Errorf("Failed to delete remote host collector daemonset %s: %v", ds.Name, err)
}

Member

I think if the ds is not created, it will log the not found error.

if additionalRedactors != nil {
return additionalRedactors.Spec.Redactors
func waitForPodRunning(ctx context.Context, clientset kubernetes.Interface, pod *corev1.Pod) error {
watcher, err := clientset.CoreV1().Pods(pod.Namespace).Watch(ctx, metav1.ListOptions{
Member

To improve reliability, should we poll if the user does not have the necessary "watch" permissions?

Member

Thank you! I have changed to the polling method.

}

func waitForDS(ctx context.Context, clientset kubernetes.Interface, ds *appsv1.DaemonSet) error {
watcher, err := clientset.AppsV1().DaemonSets(ds.Namespace).Watch(ctx, metav1.ListOptions{
Member

Same here. Poll if no "watch" permissions.

We might just default to polling if it does not introduce a significant delay.

Member

changed to polling
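For reference, a polling-style wait could look roughly like the sketch below. This is an illustration, not the merged code: the one-second interval, two-minute timeout, readiness condition, and package placement are all assumptions. The same pattern applies to waitForPodRunning with a pod-phase check instead of the DaemonSet status check.

package collect

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	kuberneteserrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDS polls until every node that should run a collector pod reports it
// ready. Polling only needs "get" on daemonsets, so no "watch" RBAC is required.
func waitForDS(ctx context.Context, clientset kubernetes.Interface, ds *appsv1.DaemonSet) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		current, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
		if err != nil {
			if kuberneteserrors.IsNotFound(err) {
				return false, nil // not created yet; keep polling until the timeout
			}
			return false, err
		}
		return current.Status.DesiredNumberScheduled > 0 &&
			current.Status.NumberReady == current.Status.DesiredNumberScheduled, nil
	})
}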

klog.V(2).Infof("Created Remote Host Collector Daemonset %s", ds.Name)
pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: selectorLabelKey + "=" + selectorLabelValue,
TimeoutSeconds: new(int64),
Member

Is this 0 timeout intentional?

@DexterYan (Member) Nov 6, 2024

I have changed to TimeoutSeconds: ptr.To(int64(defaultTimeout)),

pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: selectorLabelKey + "=" + selectorLabelValue,
TimeoutSeconds: new(int64),
Limit: 0,
@banjoh (Member) Nov 5, 2024

Wouldn't limit=0 mean that no pods should be returned? I might be missing something here though

Member

0 means that there is no limit on the number of pods returned in this list operation. It will attempt to return all matching pods without restricting the count.
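For reference, with the timeout fix applied the list call would look roughly like this sketch (not the merged code; defaultTimeout comes from the surrounding package and its value is not shown here, and ptr is k8s.io/utils/ptr):

pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector:  selectorLabelKey + "=" + selectorLabelValue,
	TimeoutSeconds: ptr.To(int64(defaultTimeout)), // server-side timeout instead of the zero value from new(int64)
	Limit:          0,                             // 0 means no cap on the number of pods returned
})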


parameterCodec := runtime.NewParameterCodec(scheme)
req.VersionedParams(&corev1.PodExecOptions{
Command: []string{"/troubleshoot/collect", "-", "--chroot", "/host", "--format", "raw"},
Member

Using stdin instead of relying on configmaps is pretty neat!

Member

yes, that is very efficient!
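As context for readers, wiring an exec call that feeds the collector spec over stdin typically looks something like the sketch below. It is built on client-go's remotecommand package; the function name, parameters, and error handling are illustrative assumptions, not the merged code.

package collect

import (
	"bytes"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execRemoteCollector streams a collector spec over stdin to the collect
// binary inside a daemonset pod and returns the raw output written to stdout.
func execRemoteCollector(ctx context.Context, restConfig *rest.Config, clientset kubernetes.Interface, pod corev1.Pod, specYAML []byte) ([]byte, error) {
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Name(pod.Name).
		Namespace(pod.Namespace).
		SubResource("exec")

	req.VersionedParams(&corev1.PodExecOptions{
		Command:   []string{"/troubleshoot/collect", "-", "--chroot", "/host", "--format", "raw"},
		Container: "remote-collector",
		Stdin:     true, // the spec arrives on stdin, hence the "-" argument
		Stdout:    true,
		Stderr:    true,
	}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(restConfig, "POST", req.URL())
	if err != nil {
		return nil, err
	}

	var stdout, stderr bytes.Buffer
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdin:  bytes.NewReader(specYAML),
		Stdout: &stdout,
		Stderr: &stderr,
	}); err != nil {
		return nil, fmt.Errorf("exec remote collector in pod %s: %w (stderr: %s)", pod.Name, err, stderr.String())
	}
	return stdout.Bytes(), nil
}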

// delete the config map
// delete the remote pods
// check if the daemonset still exists
_, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
Member

Wouldn't it be sufficient to just delete the DaemonSet and log the error if it wasn't present?

Member

Yes, it should be sufficient to just delete the DaemonSet and log the not found error. I have modified the code.
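For illustration, the simplified cleanup could look roughly like this (a sketch, not the merged code; kuberneteserrors is k8s.io/apimachinery/pkg/api/errors, and the not-found case is logged at low verbosity rather than treated as an error):

if err := clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{}); err != nil {
	if kuberneteserrors.IsNotFound(err) {
		// The daemonset was never created or is already gone; nothing to clean up.
		klog.V(2).Infof("Remote host collector daemonset %s not found, skipping delete", ds.Name)
	} else {
		klog.Errorf("Failed to delete remote host collector daemonset %s: %v", ds.Name, err)
	}
}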

}

func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
ns := "default"
Member

Should we allow passing in the namespace? In Embedded Cluster, for example, the kotsadm and embedded-cluster namespaces exist. KOTS may have permissions in one of those but not in the default namespace.

Member

Good catch. The namespace has been added.


func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
ns := "default"
imageName := "replicated/troubleshoot:latest"
Member

Should we use a versioned tag instead? You can get the version string from https://github.com/replicatedhq/troubleshoot/blob/main/pkg/version/version.go

Member

The challenging part is that if we use the version string in local development, the image does not exist; we would have to build it locally first. Do you have any ideas about that?

Member

I found a way to use the version tag. It has to be a semantic version.

Member

It seems it cannot pass the tests. I have reverted the Docker image versioned tag.

Member

I suggest we switch to the version tag in a follow-up PR.
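For reference, a rough sketch of what the versioned-tag approach could look like in that follow-up. This assumes the version package referenced above exposes the troubleshoot version string (the exact accessor name may differ) and falls back to latest for unreleased local builds; the helper name is illustrative.

// getCollectorImage picks the image for the remote collector daemonset.
// Released builds use the matching versioned tag; local/dev builds without a
// published version fall back to latest.
func getCollectorImage() string {
	v := version.Version() // assumed accessor from pkg/version
	if v == "" || v == "unknown" {
		// Local development builds may not correspond to a published tag.
		return "replicated/troubleshoot:latest"
	}
	return "replicated/troubleshoot:" + v
}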

func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
ns := "default"
imageName := "replicated/troubleshoot:latest"
imagePullPolicy := corev1.PullAlways
Member

If we use a tag, perhaps this should be IfNotPresent

Member

changed to IfNotPresent

ImagePullPolicy: imagePullPolicy,
Name: "remote-collector",
Command: []string{"/bin/bash", "-c"},
Args: []string{"while true; do sleep 30; done;"},
Member

Collection shouldn't take longer than 30s, but I'm thinking we should have a better way of handling this. We could, for example, implement a collect pause subcommand that waits for a termination signal and then exits gracefully.

Member

I have changed to use

Command:         []string{"tail", "-f", "/dev/null"},

It effectively blocks forever without doing anything.
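Putting the points from this thread together, the collector container spec might end up looking roughly like the sketch below (field values other than those quoted in the thread are assumptions, not the merged code):

container := corev1.Container{
	Name:            "remote-collector",
	Image:           imageName,               // still replicated/troubleshoot:latest at this point
	ImagePullPolicy: corev1.PullIfNotPresent, // switched from PullAlways per the review above
	// Keep the pod alive indefinitely so collectors can be exec'd into it on demand.
	Command: []string{"tail", "-f", "/dev/null"},
}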

@DexterYan (Member)

Copied from another PR (@diamonwiggins):

Are the reporting and the progress updates accurate if the flow is:

for pod in pods:
    run each collector

or should it be:

for collector in collectors:
    run in each pod

@DexterYan (Member)

The remote collectors have been using:

for collector in collectors:
   tracing start
    run in each pod
   tracing end
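In other words, something along these lines: a rough sketch of the collector-outer, pod-inner ordering described above, assuming an OpenTelemetry tracer; the helper names are illustrative, not the merged code.

for _, collector := range collectors {
	// One span per collector so progress and durations are reported per collector.
	ctx, span := tracer.Start(ctx, collector.Title())
	for _, pod := range pods {
		// Exec the collector in every daemonset pod and merge its output.
		if err := runRemoteCollector(ctx, clientset, pod, collector); err != nil {
			span.RecordError(err)
		}
	}
	span.End()
}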

@nvanthao (Member) left a comment

LGTM

@banjoh (Member) left a comment

LGTM

@DexterYan merged commit deeeea7 into main on Nov 11, 2024
24 checks passed
@DexterYan deleted the ash/daemonset-exec branch on November 11, 2024 19:47
Labels
type::feature New feature or request

4 participants