
exec remote host collectors in a daemonset #1671

Merged
merged 36 commits into main from ash/daemonset-exec on Nov 11, 2024
Conversation

@hedge-sparrow (Member) commented Nov 1, 2024

Description, Motivation and Context

This PR changes how remote host collection is scheduled.

It switches the runners to be exec calls into pods managed by a DaemonSet, rather than one pod per runner.

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@hedge-sparrow requested a review from a team as a code owner on November 1, 2024 13:55
@hedge-sparrow force-pushed the ash/daemonset-exec branch 2 times, most recently from 5791595 to 92e215d on November 1, 2024 14:38
@hedge-sparrow added the type::feature New feature or request label on Nov 1, 2024
DexterYan previously approved these changes Nov 4, 2024
// TODO:
// delete the config map
// delete the remote pods
clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{})
Member

Potential panic when ds is not created

Member

I have added

_, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
if err != nil {
	klog.Errorf("Failed to verify remote host collector daemonset %s still exists: %v", ds.Name, err)
	return
}

if err := clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{}); err != nil {
	klog.Errorf("Failed to delete remote host collector daemonset %s: %v", ds.Name, err)
}

Member

I think if the ds is not created, it will log the not found error.

if additionalRedactors != nil {
return additionalRedactors.Spec.Redactors
func waitForPodRunning(ctx context.Context, clientset kubernetes.Interface, pod *corev1.Pod) error {
watcher, err := clientset.CoreV1().Pods(pod.Namespace).Watch(ctx, metav1.ListOptions{
Member

To improve reliability, should we poll if the user does not have the necessary "watch" permissions?

Member

Thank you! I have changed to the polling method.

}

func waitForDS(ctx context.Context, clientset kubernetes.Interface, ds *appsv1.DaemonSet) error {
watcher, err := clientset.AppsV1().DaemonSets(ds.Namespace).Watch(ctx, metav1.ListOptions{
Member

Same here. Poll if no "watch" permissions.

We might just default to polling if it does not introduce a significant delay.

Member

changed to polling
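For reference, a polling-style wait could look roughly like the sketch below. This is an illustration, not the merged code: the one-second interval, two-minute timeout, readiness condition, and package placement are all assumptions. The same pattern applies to waitForPodRunning with a pod-phase check instead of the DaemonSet status check.

package collect

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	kuberneteserrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDS polls until every node that should run a collector pod reports it
// ready. Polling only needs "get" on daemonsets, so no "watch" RBAC is required.
func waitForDS(ctx context.Context, clientset kubernetes.Interface, ds *appsv1.DaemonSet) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		current, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
		if err != nil {
			if kuberneteserrors.IsNotFound(err) {
				return false, nil // not created yet; keep polling until the timeout
			}
			return false, err
		}
		return current.Status.DesiredNumberScheduled > 0 &&
			current.Status.NumberReady == current.Status.DesiredNumberScheduled, nil
	})
}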

klog.V(2).Infof("Created Remote Host Collector Daemonset %s", ds.Name)
pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: selectorLabelKey + "=" + selectorLabelValue,
TimeoutSeconds: new(int64),
Member

Is this 0 timeout intentional?

@DexterYan (Member) Nov 6, 2024

I have changed to TimeoutSeconds: ptr.To(int64(defaultTimeout)),

pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: selectorLabelKey + "=" + selectorLabelValue,
TimeoutSeconds: new(int64),
Limit: 0,
@banjoh (Member) Nov 5, 2024

Wouldn't limit=0 mean that no pods should be returned? I might be missing something here though

Member

0 means that there is no limit on the number of pods returned in this list operation. It will attempt to return all matching pods without restricting the count.
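For reference, with the timeout fix applied the list call would look roughly like this sketch (not the merged code; defaultTimeout comes from the surrounding package and its value is not shown here, and ptr is k8s.io/utils/ptr):

pods, err := clientset.CoreV1().Pods(ds.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector:  selectorLabelKey + "=" + selectorLabelValue,
	TimeoutSeconds: ptr.To(int64(defaultTimeout)), // server-side timeout instead of the zero value from new(int64)
	Limit:          0,                             // 0 means no cap on the number of pods returned
})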


parameterCodec := runtime.NewParameterCodec(scheme)
req.VersionedParams(&corev1.PodExecOptions{
Command: []string{"/troubleshoot/collect", "-", "--chroot", "/host", "--format", "raw"},
Member

Using stdin instead of relying on configmaps is pretty neat!

Member

yes, that is very efficient!
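As context for readers, wiring an exec call that feeds the collector spec over stdin typically looks something like the sketch below. It is built on client-go's remotecommand package; the function name, parameters, and error handling are illustrative assumptions, not the merged code.

package collect

import (
	"bytes"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execRemoteCollector streams a collector spec over stdin to the collect
// binary inside a daemonset pod and returns the raw output written to stdout.
func execRemoteCollector(ctx context.Context, restConfig *rest.Config, clientset kubernetes.Interface, pod corev1.Pod, specYAML []byte) ([]byte, error) {
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Name(pod.Name).
		Namespace(pod.Namespace).
		SubResource("exec")

	req.VersionedParams(&corev1.PodExecOptions{
		Command:   []string{"/troubleshoot/collect", "-", "--chroot", "/host", "--format", "raw"},
		Container: "remote-collector",
		Stdin:     true, // the spec arrives on stdin, hence the "-" argument
		Stdout:    true,
		Stderr:    true,
	}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(restConfig, "POST", req.URL())
	if err != nil {
		return nil, err
	}

	var stdout, stderr bytes.Buffer
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdin:  bytes.NewReader(specYAML),
		Stdout: &stdout,
		Stderr: &stderr,
	}); err != nil {
		return nil, fmt.Errorf("exec remote collector in pod %s: %w (stderr: %s)", pod.Name, err, stderr.String())
	}
	return stdout.Bytes(), nil
}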

// delete the config map
// delete the remote pods
// check if the daemonset still exists
_, err := clientset.AppsV1().DaemonSets(ds.Namespace).Get(ctx, ds.Name, metav1.GetOptions{})
Member

Wouldn't it be sufficient to just delete the DaemonSet and log the error if it wasn't present?

Member

Yes, it should be sufficient to just delete the DaemonSet and log the not found error. I have modified the code.
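For illustration, the simplified cleanup could look roughly like this (a sketch, not the merged code; kuberneteserrors is k8s.io/apimachinery/pkg/api/errors, and the not-found case is logged at low verbosity rather than treated as an error):

if err := clientset.AppsV1().DaemonSets(ds.Namespace).Delete(ctx, ds.Name, metav1.DeleteOptions{}); err != nil {
	if kuberneteserrors.IsNotFound(err) {
		// The daemonset was never created or is already gone; nothing to clean up.
		klog.V(2).Infof("Remote host collector daemonset %s not found, skipping delete", ds.Name)
	} else {
		klog.Errorf("Failed to delete remote host collector daemonset %s: %v", ds.Name, err)
	}
}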

}

func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
ns := "default"
Member

Should we allow passing in the namespace? In Embedded Cluster, for example, the kotsadm and embedded-cluster namespaces exist. KOTS may have permissions in one of those but not in the default namespace.

Member

Good catch. The namespace has been added.


func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
ns := "default"
imageName := "replicated/troubleshoot:latest"
Member

Should we use a versioned tag instead? You can get the version string from https://github.com/replicatedhq/troubleshoot/blob/main/pkg/version/version.go

Member

The challenging part is that if we use the version string in local development, the image does not exist; we would have to build it locally first. Do you have any ideas about that?

Member

I found a way to use the version tag. It has to be a semantic version.

Member

It seems it cannot pass the tests. I have reverted the Docker image versioned tag.

Member

I suggest we switch to the version tag in a follow-up PR.
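For reference, a rough sketch of what the versioned-tag approach could look like in that follow-up. This assumes the version package referenced above exposes the troubleshoot version string (the exact accessor name may differ) and falls back to latest for unreleased local builds; the helper name is illustrative.

// getCollectorImage picks the image for the remote collector daemonset.
// Released builds use the matching versioned tag; local/dev builds without a
// published version fall back to latest.
func getCollectorImage() string {
	v := version.Version() // assumed accessor from pkg/version
	if v == "" || v == "unknown" {
		// Local development builds may not correspond to a published tag.
		return "replicated/troubleshoot:latest"
	}
	return "replicated/troubleshoot:" + v
}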

func createHostCollectorDS(ctx context.Context, clientset kubernetes.Interface, labels map[string]string) (*appsv1.DaemonSet, error) {
ns := "default"
imageName := "replicated/troubleshoot:latest"
imagePullPolicy := corev1.PullAlways
Member

If we use a tag, perhaps this should be IfNotPresent

Member

changed to IfNotPresent

ImagePullPolicy: imagePullPolicy,
Name: "remote-collector",
Command: []string{"/bin/bash", "-c"},
Args: []string{"while true; do sleep 30; done;"},
Member

Collection shouldn't take longer than 30s, but I'm thinking we should have a better way of handling this. We could, for example, implement a collect pause subcommand that waits for a termination signal and then exits gracefully.

Member

I have changed to use

Command:         []string{"tail", "-f", "/dev/null"},

It effectively blocks forever without doing anything.
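Putting the points from this thread together, the collector container spec might end up looking roughly like the sketch below (field values other than those quoted in the thread are assumptions, not the merged code):

container := corev1.Container{
	Name:            "remote-collector",
	Image:           imageName,               // still replicated/troubleshoot:latest at this point
	ImagePullPolicy: corev1.PullIfNotPresent, // switched from PullAlways per the review above
	// Keep the pod alive indefinitely so collectors can be exec'd into it on demand.
	Command: []string{"tail", "-f", "/dev/null"},
}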

@DexterYan (Member)

Copied from another PR (@diamonwiggins):

Are the reporting and the progress updates accurate if the flow is:

for pod in pods:
    run each collector

or should it be:

for collector in collectors:
    run in each pod

@DexterYan (Member)

The remote collectors have been using:

for collector in collectors:
   tracing start
    run in each pod
   tracing end
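In other words, something along these lines: a rough sketch of the collector-outer, pod-inner ordering described above, assuming an OpenTelemetry tracer; the helper names are illustrative, not the merged code.

for _, collector := range collectors {
	// One span per collector so progress and durations are reported per collector.
	ctx, span := tracer.Start(ctx, collector.Title())
	for _, pod := range pods {
		// Exec the collector in every daemonset pod and merge its output.
		if err := runRemoteCollector(ctx, clientset, pod, collector); err != nil {
			span.RecordError(err)
		}
	}
	span.End()
}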

@nvanthao (Member) left a comment

LGTM

@banjoh (Member) left a comment

LGTM

@DexterYan merged commit deeeea7 into main on Nov 11, 2024
24 checks passed
@DexterYan deleted the ash/daemonset-exec branch on November 11, 2024 19:47
Labels
type::feature New feature or request

4 participants