Review #1
Conversation
* Add minor go code, super broken!
* Add NOTES.md for me
* Update command and args and use env instead
* Fix go code...
* Set SA correct in triggerTemplate
* Set ful path in poddeleter command, tekton changes the WORKSPACE
* Ignore poddeleter in falco so we don't get a loop of delete pods
* Add go.sum
* Clean up NOTES.md
* Add LICENSE
Partial review: the readme proved tricky to review on my phone.
I think I would have split the go main function into two functions: one to create the client and one to maybe kill the pod (i.e. it would take a client, a blacklist and a falco event and maybe do something). That should make it possible to write a few unit tests for the latter.
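For illustration, a minimal sketch of such a split (the `maybeDeletePod` name and the `FalcoEvent` fields are made up for this example, not the PR's actual identifiers):

```go
package main

import (
	"context"

	metaV1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// FalcoEvent holds only the fields needed for this sketch; the real code
// parses the full Falco webhook payload.
type FalcoEvent struct {
	PodName   string
	Namespace string
}

// maybeDeletePod takes an already-constructed client, a set of protected
// namespaces and a parsed Falco event, and deletes the offending pod unless
// it lives in a protected namespace. Because the client is injected, this
// function can be unit tested with a fake clientset
// (k8s.io/client-go/kubernetes/fake).
func maybeDeletePod(client kubernetes.Interface, protected map[string]bool, event FalcoEvent) error {
	if protected[event.Namespace] {
		return nil // protected namespace, nothing to do
	}
	return client.CoreV1().Pods(event.Namespace).Delete(
		context.Background(), event.PodName, metaV1.DeleteOptions{})
}
```

main() would then only build the in-cluster client and call this function, which is the part that is hard to unit test anyway.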
```go
		critical = true
		break
	}
}
```
I believe my calendar says it is 2021. Surely there is a set or hashmap we can use somewhere?
Rather than use an array and iterate over it, we can use a map. Consider:
```go
CriticalNamespaces := map[string]bool{
	"kube-system":     true,
	"kube-public":     true,
	"kube-node-lease": true,
	"falco":           true,
}
// ...
if CriticalNamespaces[ns] {
	log.Printf("Protected namespace %s", ns)
	return
}
```
Golang has no set type, so we have to abuse a map, using bools as values. Once we have a map, we can ask whether something is in it and we do not need to iterate at all. More about maps here: https://blog.golang.org/maps.
Ahh, this was what I couldn't really wrap my head around: what I should use for key/value. But a simple bool seems like an easy enough idea.
I guess you could argue that the []string is easier for people to understand, and in the future, when we support having criticalNamespaces as an input variable, that would probably be a []string and we would need to do some formatting from the input into the hash map.
Once again, way too far in the future for this issue, but still.
But I will update it so I remember how to do it in the future :)
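For reference, a sketch of what that formatting could look like once the namespaces arrive as a `[]string`, e.g. from a comma-separated environment variable (the `FALCO_CRITICAL_NAMESPACES` name is invented for the example; it is not part of this PR):

```go
package main

import (
	"os"
	"strings"
)

// toSet converts a slice of namespace names into a map usable as a set.
func toSet(names []string) map[string]bool {
	set := make(map[string]bool, len(names))
	for _, name := range names {
		set[name] = true
	}
	return set
}

// criticalNamespaceSet builds the set from the defaults, optionally extended
// by a comma-separated environment variable (hypothetical name).
func criticalNamespaceSet() map[string]bool {
	names := []string{"kube-system", "kube-public", "kube-node-lease", "falco"}
	if extra := os.Getenv("FALCO_CRITICAL_NAMESPACES"); extra != "" {
		names = append(names, strings.Split(extra, ",")...)
	}
	return toSet(names)
}
```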
```go
		log.Fatalf("Unable to delete pod due to err %v", err)
		os.Exit(1)
	}
}
```
I like that you try to create the client first, so you get a failure even if we don't need to act. Of course, that assumes there are no kube-system (i.e. cluster-level) events that show up here when the API is temporarily unavailable (i.e. events that should be ignored, but we don't get to the ignoring part because we fail).
I have broken it out now, please have a look.
This is really cool!! 😁😁
```yaml
config:
  webhook:
    address: http://el-falco-listener.falcoresponse.svc.cluster.local:8080
    customHeaders: Falcon:true\,Stuff:yes
```
Should it be `Falcon` or `Falco`?
I have been testing a few things earlier; I should probably delete it. I would like to get a CEL filter working, but Tekton has changed the API and I got rather tired when it didn't work.
For now I will keep it, just to show that you can add a customHeader. The reason why I wrote falcon was that I wanted to be 100% sure I didn't overwrite any headers that are included in falco.
main.go
Outdated
```go
}

func main() {
	var CriticalNamespaces = []string{"kube-system", "kube-public", "kube-node-lease", "falco"}
```
Could be nice to have the ability to append more namespaces from environment variables here?
@simongottschlag That would indeed be useful, but remember that we are reviewing code already on master and should be very careful about feature growth. If we introduce an env var for this, it should be its own PR.
If you look at the bottom of the README you will see a conclusion where I talk about the need for this feature.
The reason why I don't add it is that I want the code to be quick to get an overview of.
This code is just for this blog post; we had a small discussion in the Falco Slack channel about writing a bigger application to be able to manage these kinds of loads. https://kubernetes.slack.com/archives/CMWH3EH32/p1619869899307600?thread_ts=1619847638.297000&cid=CMWH3EH32
So let's see how that evolves in the future.
main.go
Outdated
```go
bodyReq := os.Getenv("BODY")
if bodyReq == "" {
	panic("Need to get environment variable BODY")
}
```
Not sure how big the body can be, but there are limits (even though large) to environment variables. Can this be read from a file instead?
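A rough sketch of the file-based variant, assuming the event body is written to a path passed in a made-up `BODY_FILE` variable (this is a drop-in snippet for main; on Go 1.15 it needs the `io/ioutil` and `log` imports):

```go
// Read the Falco event from a file instead of passing the whole body
// through the environment, which avoids env var size limits.
bodyPath := os.Getenv("BODY_FILE") // hypothetical variable name
if bodyPath == "" {
	log.Fatal("Need to get environment variable BODY_FILE")
}
raw, err := ioutil.ReadFile(bodyPath) // os.ReadFile on Go >= 1.16
if err != nil {
	log.Fatalf("Unable to read event body from %s: %v", bodyPath, err)
}
bodyReq := string(raw)
```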
Hm, I wonder if encoding is correctly handled through the chain?
Overall the event won't be suuuuper big, so we shouldn't hit any issues.
Yeah, I have been able to trigger a bug: instead of using uptime in the exec I used:
`echo '"please dont kill me"'`
The JSON got screwed up and I was unable to parse the input.
At this time I don't think it's worth making it better. But to make this production ready we should see if we can find another way of sending data to our pod through the event-listener, or we should handle the bunch of inputs that can create issues.
main.go
Outdated
```go
bodyReq := os.Getenv("BODY")
if bodyReq == "" {
	panic("Need to get environment variable BODY")
```
Personal preference: never panic; print the error to stderr and `os.Exit(1)`.
I would put most of the code in a `func run() error {}`, return early on this, and then in main do something like:

```go
func main() {
	err := run()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```
`panic` and `os.Exit` are different things. Read e.g. here: https://stackoverflow.com/a/28473339/205674. For example, `os.Exit` will not run your defers. Cleanup and unwinding may not matter a lot in a simple program like this, but if you want to be conscientious, I think this guy has the right idea: https://stackoverflow.com/a/46255965/205674
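A tiny example of the difference (the deferred print never runs when we `os.Exit`, but would run if we panicked instead):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// This deferred cleanup is skipped: os.Exit terminates the process
	// immediately, whereas panic unwinds the stack and runs defers first.
	defer fmt.Println("cleaning up")
	fmt.Println("something went wrong")
	os.Exit(1)
}
```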
Interestingly, Simon demonstrates proper use of `os.Exit` in his PR: https://github.com/simongottschlag/mqtt-log-stdout-v2/pull/1/files#diff-9edc738f3908a55100ae0f3f1adbb25e9b341052e68eebf7e792440bdf06ec8aR30. Here, the main function can be trivially judged not to require cleanup, since that happened at a lower level already; thus we can safely `os.Exit`.
I think keeping it short wins in this case, but it's a good point for the future.
This Go Time (podcast) episode goes into it quite well: https://changelog.com/gotime/165
main.go
Outdated
```go
if !critical {
	log.Printf("Deleting pod %s from namespace %s", podName, namespace)
	err := kubeClient.CoreV1().Pods(namespace).Delete(context.Background(), podName, metaV1.DeleteOptions{})
```
Should we force delete? Or maybe make it configurable.
Should the context have a timeout?
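For reference, a force delete would roughly mean dropping the grace period in the delete options. A sketch only, since the PR keeps the default graceful delete:

```go
// Force delete: skip the grace period so the pod object is removed immediately.
gracePeriod := int64(0)
err := kubeClient.CoreV1().Pods(namespace).Delete(
	context.Background(),
	podName,
	metaV1.DeleteOptions{GracePeriodSeconds: &gracePeriod},
)
```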
Since the use case is terminating suspicious pods, it may be that there should be some form of ops notification in case a pod resists arrest? (Would be new PR, tho.)
Hmm, about force delete: I will skip it to keep things shorter.
About the context, it's a good idea. I just don't know the Go k8s client well enough.
I assume that even if the context times out and the application stops waiting, the k8s API will continue to try to delete the pod, just like it would if you ran a kubectl delete command.
If so, it might not be worth adding it, since the call has gone through.
My initial thought is to do something like this. Is 5 seconds too long?

```go
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(time.Second*5))
defer cancel()
err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, podName, metaV1.DeleteOptions{})
```
Well, deleting pods is about changing the ideal state and letting Kubernetes try to reconcile that state. I think it makes sense that this process waits a bit to see if the delete is successful, but we do not want to keep trying forever or this will itself be a vector for DoS. Probably, the process should wait a bit, log whether it succeeded or timed out and move on.
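Putting the two ideas together, a hedged sketch of "wait a bit, log the outcome, move on" (the deadline check may need adjusting depending on how client-go wraps the timeout error; this also needs the standard `errors` and `time` imports):

```go
// Give the API call a bounded amount of time; the deletion itself is
// reconciled asynchronously by Kubernetes either way.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, podName, metaV1.DeleteOptions{})
switch {
case err == nil:
	log.Printf("Deleted pod %s from namespace %s", podName, namespace)
case errors.Is(err, context.DeadlineExceeded):
	log.Printf("Timed out deleting pod %s from namespace %s, moving on", podName, namespace)
default:
	log.Printf("Unable to delete pod %s from namespace %s: %v", podName, namespace, err)
}
```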
Update naming overall
It's a less general name
Some comments for the readme
README.md
Outdated
This was a rather simple example on how we can use the power of tekton together with Falco to protect us from bad actors that is trying to take over pods in our cluster.

As noted during this post there are allot of potential improvements before this is production ready, for example:
a lot
as an example
Noticed that Golang `1.15` was used to build
* Also remove an unused exit(1)
```go
criticalNamespaces := map[string]bool{
	"kube-system":     true,
	"kube-public":     true,
	"kube-node-lease": true,
	"falco":           true,
}
```
I would actually fail early here if it was a critical namespace and skip passing it to the deletePod() function.
Can't make both you and @bittrance happy on this one :)
He asked me earlier to break out the delete and the if statement into a separate function. At least if I understood him correctly.
The question is: should we test to see if we are in k8s first, or should we check if we aren't in a protected namespace first?
I think a process should generally ascertain its ability to function properly before starting any work. Consider a scenario where it is difficult for me to provoke a Falco event that results in a dead pod. In that scenario, if we are taking decisions (i.e. early return on protected namespaces) before we have done all the setup, it may take a long time before we discover that we have bad credentials. This is basically the fail-fast principle applied to software design.
This was what I meant: #1 (comment)
I was thinking that this could be done earlier, but making sure the credentials work first seems like a good idea. 👍
```go
if criticalNamespaces[namespace] {
	log.Printf("The pod %v won't be deleted due to it's part of the critical ns list: %v ", podName, namespace)
	return nil
}
```
My previous review comment may have been confusing, but I think the namespace check should go "around" this method rather than inside it.
Okay, so if it's not present, call the function to kill the pod?
Yes, something like that. In main(), I think I would do something like:

```go
if criticalNamespaces[namespace] {
	log.Printf("The pod %v won't be deleted due to it's part of the critical ns list: %v ", podName, namespace)
	return
}
err = deletePod(kubeClient, falcoEvent, criticalNamespaces)
if err != nil {
	log.Fatalf("Unable to delete pod due to err %v", err)
}
```
```go
	return nil
}

log.Printf("Deleting pod %s from namespace %s", podName, namespace)
```
I have not yet published my magnum opus on logging statements, but I prefer one "info" level log entry per "request". This method has three. Also, logging should in my opinion be written with the assumption that the code works (you use tests to verify that the code works), which means that the request logging goes after the operation, documenting that it happened. Consider:

```go
log.Printf("Do stuff")
doStuff()
```

vs

```go
doStuff()
log.Printf("Did stuff")
```

The `doStuff()` function is almost guaranteed to be more complex than the logging. Thus, if you see "Did stuff" in your log, you are a lot more sure that stuff actually happened than if you see "Do stuff". The former style is usually a product of a developer trying to find out if the code works and should not go into production - presumably you found out that it works, or you would not put the code in production.
Sometimes "ingress" logging is warranted, but it should generally be at debug or trace level.