Merge remote-tracking branch 'pokt/main' into refactor/message-handling
* pokt/main:
  [Helm] Add ServiceMonitor to the helm chart (#767)
  Update PULL_REQUEST_TEMPLATE.md (#772)
  [CI/Infra] E2E tests on Argo Workflows (#737)
bryanchriswhite committed May 23, 2023
2 parents 47af1f2 + cb0a0ab commit ef21e79
Showing 13 changed files with 277 additions and 34 deletions.
6 changes: 4 additions & 2 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -2,13 +2,15 @@
1. Make sure the title of the PR is descriptive and follows this format: `[<Module>] <DESCRIPTION>`
2. Update the _Assignees_, _Labels_, _Projects_, _Milestone_ before submitting the PR for review.
3. Add label(s) for the purpose (e.g. `persistence`) and, if applicable, priority (e.g. `low`) labels as well.
4. See our custom action-driven labels if you need to trigger a build or interact with an LLM - https://github.com/pokt-network/pocket/blob/main/docs/development/README.md#github-labels
-->

## Description

<!-- REMOVE this comment block after following the instructions
1. Add a summary of the change including: motivation, reasons, context, dependencies, etc...
2. If applicable, specify the key files that should be looked at.
3. If you leave the `reviewpad:summary` block below, it'll auto-populate an AI-generated summary. Alternatively, you can leave a `/reviewpad summarize` comment to trigger it manually.
-->
reviewpad:summary

69 changes: 63 additions & 6 deletions .github/workflows/main.yml
@@ -6,13 +6,15 @@ on:
  workflow_dispatch:
  push:
    branches: [main]
    # OPTIMIZE: We generate new images even on non src code changes, but this cost is okay for now
    # paths-ignore:
    #   - "docs/**"
    #   - "**.md"
  pull_request:
    # paths-ignore:
    #   - "docs/**"
    #   - "**.md"


env:
  # Even though we can test against multiple versions, this one is considered a target version.
@@ -151,3 +153,58 @@ jobs:
          cache-to: type=gha,mode=max
          build-args: |
            TARGET_GOLANG_VERSION=${{ env.TARGET_GOLANG_VERSION }}

  # Run e2e tests on devnet if the PR has a label "e2e-devnet-test"
  e2e-tests:
    runs-on: ubuntu-latest
    needs: build-images
    if: contains(github.event.pull_request.labels.*.name, 'e2e-devnet-test')
    env:
      ARGO_HTTP1: true
      ARGO_SECURE: true
      ARGO_SERVER: ${{ vars.ARGO_SERVER }}
    permissions:
      contents: "read"
      id-token: "write"

    steps:
      - id: "auth"
        uses: "google-github-actions/auth@v1"
        with:
          credentials_json: "${{ secrets.ARGO_WORKFLOW_EXTERNAL }}"

      - id: "get-credentials"
        uses: "google-github-actions/get-gke-credentials@v1"
        with:
          cluster_name: "nodes-gcp-dev-us-east4-1"
          location: "us-east4"

      - id: "install-argo"
        run: |
          curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.4.7/argo-linux-amd64.gz
          gunzip argo-linux-amd64.gz
          chmod +x argo-linux-amd64
          mv ./argo-linux-amd64 /usr/local/bin/argo
          argo version

      - id: "wait-for-infra"
        shell: bash
        run: |
          start_time=$(date +%s) # store current time
          timeout=900 # 15 minute timeout in seconds
          until argo template get dev-e2e-tests --namespace=devnet-issue-${{ github.event.pull_request.number }}; do
            current_time=$(date +%s)
            elapsed_time=$(( current_time - start_time ))
            if (( elapsed_time > timeout )); then
              echo "Timeout of $timeout seconds reached. Exiting..."
              exit 1
            fi
            echo "Waiting for devnet-issue-${{ github.event.pull_request.number }} to be provisioned..."
            sleep 5
          done

      - id: "run-e2e-tests"
        run: |
          argo submit --wait --log --namespace devnet-issue-${{ github.event.pull_request.number }} --from 'wftmpl/dev-e2e-tests' --parameter gitsha="${{ github.event.pull_request.head.sha }}"
5 changes: 5 additions & 0 deletions build/docs/CHANGELOG.md
@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]


## [0.0.0.43] - 2023-05-18

- Added functionality to `cluster-manager` to delete crashed pods so the StatefulSet controller would recreate them with a new version.

## [0.0.0.42] - 2023-05-12

- Added private keys for all (except fisherman) actors
137 changes: 137 additions & 0 deletions build/localnet/cluster-manager/crashed_pods_deleter.go
@@ -0,0 +1,137 @@
package main

// Monitors Pods created by StatefulSets: if a Pod is in a `CrashLoopBackOff` status
// and has a different image tag than its StatefulSet, kill it. The StatefulSet then
// recreates the Pod with the new image.

import (
	"context"
	"errors"
	"strings"

	pocketk8s "github.com/pokt-network/pocket/shared/k8s"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	watch "k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	appstypedv1 "k8s.io/client-go/kubernetes/typed/apps/v1"
	coretypedv1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// Loops through existing Pods and sets up a watch for new Pods so we don't hit the
// Kubernetes API all the time. This is a blocking function, intended to run in a goroutine.
func initCrashedPodsDeleter(client *kubernetes.Clientset) {
	stsClient := client.AppsV1().StatefulSets(pocketk8s.CurrentNamespace)
	podClient := client.CoreV1().Pods(pocketk8s.CurrentNamespace)

	// Loop through all existing Pods and delete the ones that are in CrashLoopBackOff status
	podList, err := podClient.List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		logger.Error().Err(err).Msg("error listing pods on init")
	}

	for i := range podList.Items {
		pod := &podList.Items[i]
		if err := deleteCrashedPods(pod, stsClient, podClient); err != nil {
			logger.Error().Err(err).Msg("error deleting crashed pod on init")
		}
	}

	// Set up a watch for new Pods
	w, err := podClient.Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		logger.Error().Err(err).Msg("error setting up watch for new pods")
	}
	for event := range w.ResultChan() {
		switch event.Type {
		case watch.Added, watch.Modified:
			pod, ok := event.Object.(*corev1.Pod)
			if !ok {
				logger.Error().Msg("error casting pod on watch")
				continue
			}

			if err := deleteCrashedPods(pod, stsClient, podClient); err != nil {
				logger.Error().Err(err).Msg("error deleting crashed pod on watch")
			}
		}
	}
}

func isContainerStatusErroneous(status *corev1.ContainerStatus) bool {
	return status.State.Waiting != nil &&
		(strings.HasPrefix(status.State.Waiting.Reason, "Err") ||
			strings.HasSuffix(status.State.Waiting.Reason, "BackOff"))
}

func deleteCrashedPods(
	pod *corev1.Pod,
	stsClient appstypedv1.StatefulSetInterface,
	podClient coretypedv1.PodInterface,
) error {
	// If the annotation is present, we monitor the Pod
	containerToMonitor, ok := pod.Annotations["cluster-manager-delete-on-crash-container"]
	if !ok {
		return nil
	}

	for ci := range pod.Spec.Containers {
		podContainer := &pod.Spec.Containers[ci]

		// Only proceed if this container is the one we monitor
		if podContainer.Name != containerToMonitor {
			continue
		}

		for si := range pod.Status.ContainerStatuses {
			containerStatus := &pod.Status.ContainerStatuses[si]

			// Only proceed if the container is in some sort of Err status
			if !isContainerStatusErroneous(containerStatus) {
				continue
			}

			// Get the StatefulSet that created the Pod
			var stsName string
			for _, ownerRef := range pod.OwnerReferences {
				if ownerRef.Kind == "StatefulSet" {
					stsName = ownerRef.Name
					break
				}
			}

			if stsName == "" {
				return errors.New("no StatefulSet found for this pod")
			}

			sts, err := stsClient.Get(context.TODO(), stsName, metav1.GetOptions{})
			if err != nil {
				return err
			}

			// Loop through all containers in the StatefulSet and find the one we monitor
			for sci := range sts.Spec.Template.Spec.Containers {
				stsContainer := &sts.Spec.Template.Spec.Containers[sci]
				if stsContainer.Name != containerToMonitor {
					continue
				}

				// If the images differ, delete the Pod
				if stsContainer.Image != podContainer.Image {
					deletePolicy := metav1.DeletePropagationForeground

					if err := podClient.Delete(context.TODO(), pod.Name, metav1.DeleteOptions{
						PropagationPolicy: &deletePolicy,
					}); err != nil {
						return err
					}

					logger.Info().Str("pod", pod.Name).Msg("deleted crashed pod")
				} else {
					logger.Info().Str("pod", pod.Name).Msg("pod crashed, but image is the same, not deleting")
				}
			}
		}
	}

	return nil
}
3 changes: 3 additions & 0 deletions build/localnet/cluster-manager/main.go
@@ -47,6 +47,9 @@ func main() {
		panic(err.Error())
	}

	// Monitor for crashed pods and delete them
	go initCrashedPodsDeleter(clientset)

	validatorKeysMap, err := pocketk8s.FetchValidatorPrivateKeys(clientset)
	if err != nil {
		panic(err)
39 changes: 38 additions & 1 deletion build/localnet/manifests/cluster-manager.yaml
@@ -21,4 +21,41 @@ spec:
          env:
            - name: RPC_HOST
              value: pocket-full-nodes
      serviceAccountName: cluster-manager
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-manager
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-manager
subjects:
  - kind: ServiceAccount
    name: cluster-manager
    apiGroup: ""
roleRef:
  kind: Role
  name: cluster-manager
  apiGroup: ""
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-manager
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["validators-private-keys"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["services", "pods"]
    verbs: ["watch", "list", "get"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get"]
17 changes: 0 additions & 17 deletions build/localnet/manifests/role-bindings.yaml
@@ -7,24 +7,7 @@ subjects:
  - kind: ServiceAccount
    name: debug-client-account
    apiGroup: ""
  - kind: ServiceAccount
    name: cluster-manager-account
    apiGroup: ""
roleRef:
  kind: Role
  name: private-keys-viewer
  apiGroup: ""
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: services-watcher-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: cluster-manager-account
    apiGroup: ""
roleRef:
  kind: Role
  name: services-watcher
  apiGroup: ""
6 changes: 0 additions & 6 deletions build/localnet/manifests/service-accounts.yaml
@@ -3,9 +3,3 @@ kind: ServiceAccount
metadata:
  name: debug-client-account
  namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-manager-account
  namespace: default
4 changes: 4 additions & 0 deletions charts/CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.0.0.5] - 2023-05-20

- Added `ServiceMonitor` to the helm chart.

## [0.0.0.4] - 2023-05-12

- Added `nodeType` parameter to the helm chart, which is now actor-agnostic.
1 change: 1 addition & 0 deletions charts/pocket/README.md
@@ -127,4 +127,5 @@ privateKeySecretKeyRef:
| serviceAccount.annotations | object | `{}` | Annotations to add to the service account |
| serviceAccount.create | bool | `true` | Specifies whether a service account should be created |
| serviceAccount.name | string | `""` | The name of the service account to use. If not set and create is true, a name is generated using the fullname template |
| serviceMonitor.enabled | bool | `false` | Enable the ServiceMonitor |
| tolerations | list | `[]` | |
12 changes: 12 additions & 0 deletions charts/pocket/templates/service-monitor.yaml
@@ -0,0 +1,12 @@
{{ if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "pocket.fullname" . }}
spec:
  endpoints:
    - port: metrics
  selector:
    matchLabels:
      {{- include "pocket.selectorLabels" . | nindent 6 }}
{{ end }}
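The template is gated on `serviceMonitor.enabled`, which defaults to `false` in the chart's values. A sketch of a values override enabling it (the `my-values.yaml` filename and release name below are illustrative):

```yaml
# my-values.yaml -- enable the ServiceMonitor for Prometheus Operator scraping
serviceMonitor:
  enabled: true
```

Applied with something like `helm upgrade --install pocket charts/pocket -f my-values.yaml`, assuming the Prometheus Operator CRDs are already installed in the cluster.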
6 changes: 4 additions & 2 deletions charts/pocket/templates/statefulset.yaml
@@ -58,10 +58,12 @@ spec:
            - -config=/pocket/configs/config.json
            - -genesis=/pocket/configs/genesis.json
          ports:
            - containerPort: {{ .Values.service.ports.consensus }}
              name: consensus
            - containerPort: {{ .Values.service.ports.rpc }}
              name: rpc
            - containerPort: {{ .Values.service.ports.metrics }}
              name: metrics
          env:
            {{ if or .Values.privateKeySecretKeyRef.name .Values.privateKeySecretKeyRef.key }}
            - name: POCKET_PRIVATE_KEY