Trivy in Kubernetes
I am currently working with automated security testing to get control of the known vulnerabilities in our applications. As part of this, I am scanning a Kubernetes cluster and its images, as well as application code. We want to cover the full breadth, not only application code. Here, we look at an Infrastructure as Code (IaC) scanning tool, Trivy.
Read more about Trivy in my other post.
Trivy has a Kubernetes operator called Trivy Operator. The advantages of using the Trivy Operator are (source):
- Trivy Operator does background scans continuously in the cluster
- Trivy CLI cannot detect changes of any resources running inside the cluster
- Trivy Operator allows integrating with tools that can consume Kubernetes manifests as it produces reports that are CRDs
- Kubernetes best practice is to push information from within the cluster to tools outside rather than letting the tools pull data from the outside
Prereqs to follow this guide #
- Kind
- Kubectl
- kubecm, strictly not necessary, but I like this tool!
- Helm
- Trivy
- An image that can be deployed to the cluster
Info #
The Trivy Operator continuously scans the Kubernetes cluster. From docs:
The Operator does this by watching Kubernetes for state changes and automatically triggering security scans in response. For example, a vulnerability scan is initiated when a new Pod is created. This way, users can find and view the risks that relate to different resources in a Kubernetes-native way.
Step-by-step #
Setup local cluster #
Set up Flux with a local registry OR set up a simple cluster
kind create cluster
Select local cluster (it is called kind-kind by default) with kubecm
kubecm s kind-kind
For a more realistic environment, run a deployment
k apply -f ~/git/testing/flux-image-updates/clusters/my-cluster/podinfo/podinfo-deployment.yaml
watch kubectl get pods
(optional) push image to the local registry and create deployment for it
Trivy needs images to scan, but there are probably already other images in your cluster. E.g. from Flux or others.
docker tag 0ac97f5bbbb5 localhost:5001/my-api:1.0.1
docker push localhost:5001/my-api:1.0.1
(optional) Check contents of registry
Flux should automatically deploy new images when they get a new tag (according to the tag policy). To verify what’s in the registry, you can curl it.
✗ curl -X GET http://localhost:5001/v2/_catalog
{"repositories":["my-api","hello-app","podinfo"]}
✗ curl -X GET http://localhost:5001/v2/podinfo/tags/list
{"name":"podinfo","tags":["5.0.7","5.0.3","5.0.5","5.0.0","5.0.4","5.0.6"]}
(optional) Test Trivy Operator locally #
https://aquasecurity.github.io/trivy-operator/latest/
I do this to get a better feeling of how Trivy works and how it should look in the cluster. You can install the Trivy Operator using a YAML manifest file, or as a Helm Chart. We will do the latter. Steps from the docs:
Option 1: Install from traditional Helm Chart repository
- Add the Aqua chart repository:
helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
- Install the Helm Chart:
helm install trivy-operator aqua/trivy-operator \
--namespace trivy-system \
--create-namespace \
--version 0.21.0
Installing the operator yields the following output:
✗ helm install trivy-operator aqua/trivy-operator \
--namespace trivy-system \
--create-namespace \
--version 0.21.0
NAME: trivy-operator
LAST DEPLOYED: Wed Mar 27 09:14:18 2024
NAMESPACE: trivy-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You have installed Trivy Operator in the trivy-system namespace.
It is configured to discover Kubernetes workloads and resources in
all namespace(s).
Inspect created VulnerabilityReports by:
kubectl get vulnerabilityreports --all-namespaces -o wide
Inspect created ConfigAuditReports by:
kubectl get configauditreports --all-namespaces -o wide
Inspect the work log of trivy-operator by:
kubectl logs -n trivy-system deployment/trivy-operator
Running these commands, we see that the operator has started producing reports. I am interested to see whether the podinfo deployment has any reports.
The instructions above are from the “Home” page of the docs; the Helm installation page describes more options.
Play with the in-cluster API #
… to get an overview of how the tool works. Is it even worth installing in the cluster?
Trivy creates several types of reports:
- VulnerabilityReport
- ConfigAuditReport
- ExposedSecretReport
- RbacAssessmentReport
- InfraAssessmentReport
- ClusterComplianceReport
- ClusterVulnerabilityReport
- SbomReport
Get an overview #
To get an overview of all findings, we can use the reports as shown in the Helm Install output above:
✗ kubectl get vulnerabilityreports --all-namespaces -o wide
NAMESPACE NAME REPOSITORY TAG SCANNER AGE CRITICAL HIGH MEDIUM LOW UNKNOWN
flux-system replicaset-helm-controller-58d5cc6f5b-manager fluxcd/helm-controller v0.37.2 Trivy 92m 0 1 11 0 0
flux-system replicaset-image-automation-controller-654dc4897-manager fluxcd/image-automation-controller v0.37.0 Trivy 93m 0 1 8 0 0
flux-system replicaset-image-reflector-controller-8498c88d9-manager fluxcd/image-reflector-controller v0.31.1 Trivy 92m 0 0 9 0 0
...
✗ kubectl get configauditreports --all-namespaces -o wide
NAMESPACE NAME SCANNER AGE CRITICAL HIGH MEDIUM LOW
default replicaset-podinfo-5d869859bd Trivy 94m 0 2 3 9
default service-kubernetes Trivy 94m 0 0 0 0
my-api replicaset-my-api-7cc565547 Trivy 94m 0 1 2 9
flux-system networkpolicy-allow-egress Trivy 94m 0 0 0 0
...
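To turn the table output above into totals per namespace, a small script can sum the summary counts from `kubectl get vulnerabilityreports --all-namespaces -o json`. This is a minimal sketch and not part of Trivy. The field names under report.summary (criticalCount, highCount and so on) are my assumption about the VulnerabilityReport layout, so check them against a real report first. The sample data mirrors the three flux-system rows shown above.

```python
import json
from collections import defaultdict

# Assumed mapping from severity to the summary field in a VulnerabilityReport.
COUNT_FIELDS = {
    "CRITICAL": "criticalCount",
    "HIGH": "highCount",
    "MEDIUM": "mediumCount",
    "LOW": "lowCount",
    "UNKNOWN": "unknownCount",
}


def totals_by_namespace(report_list_json):
    """Sum severity counts per namespace from a JSON list of reports."""
    doc = json.loads(report_list_json)
    totals = defaultdict(lambda: dict.fromkeys(COUNT_FIELDS, 0))
    for item in doc.get("items", []):
        namespace = item["metadata"]["namespace"]
        summary = item.get("report", {}).get("summary", {})
        for severity, field in COUNT_FIELDS.items():
            totals[namespace][severity] += summary.get(field, 0)
    return dict(totals)


def _sample_item(name, high, medium):
    # Hand-written stand-in for one item of the kubectl JSON output.
    return {
        "metadata": {"namespace": "flux-system", "name": name},
        "report": {"summary": {"criticalCount": 0, "highCount": high,
                               "mediumCount": medium, "lowCount": 0,
                               "unknownCount": 0}},
    }


SAMPLE_REPORTS = json.dumps({"items": [
    _sample_item("replicaset-helm-controller-58d5cc6f5b-manager", 1, 11),
    _sample_item("replicaset-image-automation-controller-654dc4897-manager", 1, 8),
    _sample_item("replicaset-image-reflector-controller-8498c88d9-manager", 0, 9),
]})

for namespace, counts in totals_by_namespace(SAMPLE_REPORTS).items():
    print(namespace, counts)
```

With real data, replace SAMPLE_REPORTS with the output of the kubectl command, for example read from a file.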
Dive deeper into findings #
To check out the findings, run the following commands:
✗ kubectl describe vulnerabilityreport my-vulnerability-report -n default
✗ kubectl describe configauditreport my-configaudit-report -n default
My pod had the following HIGH finding:
Category: Kubernetes Security Check
Check ID: KSV118
Description: Security context controls the allocation of security parameters for the pod/container/volume, ensuring the appropriate level of protection. Relying on default security context may expose vulnerabilities to potential attacks that rely on privileged access.
Messages:
replicaset my-api-7cc565547 in my-api namespace is using the default security context, which allows root privileges
Remediation: To enhance security, it is strongly recommended not to rely on the default security context. Instead, it is advisable to explicitly define the required security parameters (such as runAsNonRoot, capabilities, readOnlyRootFilesystem, etc.) within the security context.
Severity: HIGH
Success: false
Title: Default security context configured
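Reading kubectl describe output works for one report, but filtering is easier on JSON. Below is a minimal sketch that lists the failed checks at or above a severity from the output of `kubectl get configauditreport <name> -n <namespace> -o json`. The layout (report.checks with checkID, severity, success and title) is my assumption based on the describe output above, and the sample is hand-written: only KSV118 comes from my cluster, the EXAMPLE-* IDs are made up.

```python
import json

SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


def failed_checks(report_json, min_severity="HIGH"):
    """Return failed checks at or above min_severity, most severe first."""
    report = json.loads(report_json)
    threshold = SEVERITY_ORDER[min_severity]
    checks = report.get("report", {}).get("checks", [])
    failed = [
        check for check in checks
        if not check.get("success", True)
        and SEVERITY_ORDER.get(check.get("severity"), 99) <= threshold
    ]
    return sorted(
        failed,
        key=lambda check: (SEVERITY_ORDER[check["severity"]], check["checkID"]),
    )


# Hand-written sample; the EXAMPLE-* check IDs are hypothetical.
SAMPLE_REPORT = json.dumps({
    "report": {
        "checks": [
            {"checkID": "KSV118", "severity": "HIGH", "success": False,
             "title": "Default security context configured"},
            {"checkID": "EXAMPLE-1", "severity": "LOW", "success": False,
             "title": "Hypothetical low-severity finding"},
            {"checkID": "EXAMPLE-2", "severity": "HIGH", "success": True,
             "title": "Hypothetical passing check"},
        ]
    }
})

for check in failed_checks(SAMPLE_REPORT):
    print(check["severity"], check["checkID"], check["title"])
```

Passing min_severity="LOW" includes every failed check instead of only HIGH and CRITICAL.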
Setup Trivy Operator in production #
Trivy Operator manifest files #
https://aquasecurity.github.io/trivy/v0.50/tutorials/kubernetes/gitops/
Okay, so I think this will add value, especially in a scenario where Kubernetes is used for standard applications in an organisation. Then we have centralized image scanning and can report known vulnerabilities without the teams having to set up anything themselves.
So let’s look into how to add the Trivy Operator to a cluster in production.
---
apiVersion: v1
kind: Namespace
metadata:
name: trivy-system
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
name: trivy-operator
namespace: flux-system
spec:
interval: 60m
type: oci
url: oci://ghcr.io/aquasecurity/helm-charts
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: trivy-operator
namespace: trivy-system
spec:
chart:
spec:
chart: trivy-operator
version: 0.21.0
sourceRef:
kind: HelmRepository
name: trivy-operator
namespace: flux-system
interval: 60m
values:
trivy:
resources:
limits:
memory: 1500M # Default of ?? wasn't enough, causing OOMKilled workload containers
ignoreUnfixed: true
operator:
scanJobsConcurrentLimit: 2 # Default of 10 used too much RAM at once for Nodes, causing OOMkilled workload containers
install:
crds: CreateReplace
createNamespace: false
Configure Calico network policy #
If you are using Calico or another network policy tool and apply the manifests above, you will most likely get the following error or something similar:
unable to run trivy operator: failed getting configmap: trivy-operator: Get "https://10.0.0.1:443/api/v1/namespaces/trivy-system/configmaps/trivy-operator": dial tcp 10.0.0.1:443: i/o timeout
This means that you need to add a network policy. Trivy needs two kinds of network access:
- Access to the K8s API
- Access to the vulnerability database
Adhering to the principle of least privilege is quite hard here. In the network policy, only IP ranges can be set. Using Azure, one can set more tailored rules in front. However, those rules apply to the whole cluster, not one namespace or Kubernetes resource.
Here is an example of a Calico network policy that allows Trivy to reach both the K8s API and the vulnerability database:
---
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-trivy-operator-egress
  namespace: trivy-system
spec:
  types:
    - Egress
  egress:
    - action: Allow # allow k8s API calls
      destination:
        services:
          name: kubernetes
          namespace: default
    - action: Allow # allow vulnerability database fetch
      destination:
        ports:
          - "443"
        nets:
          - 0.0.0.0/0
      protocol: TCP
To check the Calico network policy after it has been applied:
calicoctl get networkpolicy allow-trivy-operator-egress -o yaml -n trivy-system --allow-version-mismatch
Tolerate node taints #
https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
If your cluster has taints on nodes, you will see that the Trivy node collector isn’t running correctly. You can check by first finding the node collector name (list all resources in the namespace and you will have it), and then running kubectl describe on it. The events will show what is wrong, e.g. FailedScheduling with the message 0/7 nodes are available: 1 node(s) had untolerated taint ...
k describe pod/node-collector-677f9fb5b8-jc6tw -n trivy-system
Find the taints on the nodes in your cluster:
➜ ~ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
Then you can add them to the HelmRelease as values. Note that setting the common pod tolerations is not enough; you must tell the Trivy Operator through values.trivyOperator.scanJobTolerations which tolerations the node-collector pod should have.
What happens when setting common tolerations
After updating the common tolerations, only the operator pod got them applied, not the node-collector pod (checked with k describe pod/node-collector-some-id and the same for pod/trivy-operator). I found an issue and deleted all files to reset, but it didn’t work. With some more research, I found that I must set the tolerations in the Helm values:
...
  values:
    trivy:
      ignoreUnfixed: true
    trivyOperator:
      scanJobTolerations:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
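The toleration entries can be derived mechanically from the node taints. Here is a minimal sketch that reads the output of `kubectl get nodes -o json` and builds one toleration per distinct taint. The function name and the sample nodes are made up for illustration; the only Kubernetes fields it relies on are key, value and effect of a taint. It tolerates every taint it finds, so review the result before pasting it into the Helm values.

```python
import json


def tolerations_from_nodes(nodes_json):
    """Build one toleration per distinct taint found on the nodes."""
    nodes = json.loads(nodes_json)
    seen = set()
    tolerations = []
    for node in nodes.get("items", []):
        # .spec.taints is missing (or null) on untainted nodes.
        for taint in node.get("spec", {}).get("taints") or []:
            ident = (taint["key"], taint.get("value"), taint["effect"])
            if ident in seen:
                continue
            seen.add(ident)
            toleration = {"key": taint["key"]}
            if taint.get("value"):
                toleration["operator"] = "Equal"
                toleration["value"] = taint["value"]
            else:
                # A taint without a value is matched with operator Exists.
                toleration["operator"] = "Exists"
            toleration["effect"] = taint["effect"]
            tolerations.append(toleration)
    return tolerations


# Hypothetical nodes: two share a taint, one has a value-less taint,
# one has no taints at all.
SAMPLE_NODES = json.dumps({"items": [
    {"metadata": {"name": "node-a"},
     "spec": {"taints": [{"key": "key1", "value": "value1", "effect": "NoSchedule"}]}},
    {"metadata": {"name": "node-b"},
     "spec": {"taints": [{"key": "key1", "value": "value1", "effect": "NoSchedule"}]}},
    {"metadata": {"name": "node-c"},
     "spec": {"taints": [{"key": "key2", "effect": "NoExecute"}]}},
    {"metadata": {"name": "node-d"}, "spec": {}},
]})

print(json.dumps(tolerations_from_nodes(SAMPLE_NODES), indent=2))
```

The printed list has the same shape as the scanJobTolerations entries, so it can be converted to YAML and pasted into the values.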
Version 0.21.1 added support for tolerations on the node collector. At that point, we started getting the same toleration error messages and the node collector timed out. Looking at the Helm chart on Artifact Hub, we fixed it by adding tolerations to the node collector as well:
...
  values:
    trivy:
      ignoreUnfixed: true
    trivyOperator:
      scanJobTolerations:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
    nodeCollector:
      tolerations: # key name as I read it in the chart values; verify for your chart version
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
Done!
Add image scanning #
Trivy scans images by default. If it is not working, make sure your network policy allows fetching the vulnerability database as well as pulling images from your registry. These pages may help:
- https://aquasecurity.github.io/trivy-operator/latest/docs/vulnerability-scanning/private-registries/
- https://aquasecurity.github.io/trivy-operator/v0.19.0/tutorials/private-registries/
Create reports #
Coming soon maybe.
Grafana dashboard #
Documentation at https://aquasecurity.github.io/trivy-operator/v0.22.0/tutorials/grafana-dashboard/#using-the-grafana-helm-chart.
This was an easy fix; however, it took a little time to figure out how to use the gnetId to create the dashboard through Terraform. I made a PR to the Trivy docs, so it should be documented now.
Useful commands #
Trivy Operator logs #
From K8s:
k logs -n trivy-system deployment/trivy-operator
See all resources in trivy-system (except the network policy):
k get all -n trivy-system
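The operator writes one JSON object per log line, as the entries in the Debugging section further down show. When there are many scan jobs, counting failures by reason gives a quicker overview than scrolling. This is a minimal sketch: the function name is mine, and it only relies on the msg, container and status.reason fields seen in my own logs. The sample lines are shortened copies of those entries.

```python
import json
from collections import Counter


def summarize_scan_failures(lines):
    """Count 'Scan job container' log entries per (reason, container)."""
    reasons = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("msg") != "Scan job container":
            continue
        key = (entry.get("status.reason", "unknown"), entry.get("container", "?"))
        reasons[key] += 1
    return reasons


# Shortened copies of log entries from the Debugging section.
SAMPLE_LOG_LINES = [
    '{"level":"error","msg":"Scan job container","container":"prometheus","status.reason":"OOMKilled","status.message":""}',
    '{"level":"info","msg":"Scan job container","container":"init","status.reason":"Error","status.message":"Scanning Windows images is not supported."}',
    '{"level":"error","msg":"Scan job container","container":"k8s-cluster","status.reason":"Error"}',
    "this line is not JSON and is ignored",
]

for (reason, container), count in sorted(summarize_scan_failures(SAMPLE_LOG_LINES).items()):
    print(f"{reason:10} {container:15} {count}")
```

To run it on live logs, save the kubectl logs output to a file and pass the open file object to summarize_scan_failures.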
Flux reconcile logs #
flux logs --namespace flux-system --since=1h -f
flux logs --namespace flux-system --since=1h -f --kind=kustomization
flux logs --namespace flux-system --since=1h -f --kind=kustomization --name=trivy-prereqs
See new commits as they are detected:
flux logs --namespace flux-system --since=1h -f --kind=gitrepository
Delete/restart the operator #
After doing lots of testing, you might want to delete the operator and install it again to see that a clean install works. You can do it like this:
kubectl delete all --all -n trivy-system
Flux might automatically reinstall. If not, you can run
flux reconcile kustomization flux-system --with-source -n flux-system
or maybe
flux reconcile helmrelease trivy-operator -n trivy-system
Or just delete the files and see that the resources are gone, and then put them back.
Delete all reports #
kubectl delete exposedsecretreport --all --all-namespaces
And then do the same for the other reports, such as vulnerabilityreport.
Improvements and further thoughts #
There are several improvements to be done here. Here are some of my thoughts:
- Create reports, e.g. monthly reports or immediate alerts for critical findings
- Scan regularly
- Use the image hash to avoid scanning the very same image again every time it appears in the cluster, e.g. job images.
Debugging #
SBOM decode error: failed to decode: multiple OS components are not supported #
{
"level": "error",
"ts": "2024-04-08T10:04:45Z",
"logger": "reconciler.scan job",
"msg": "Scan job container",
"job": "trivy-system/scan-vulnerabilityreport-6cccfb67dd",
"container": "k8s-cluster",
"status.reason": "Error",
"status.message": "2024-04-08T10:04:37.564Z\t\u001b[31mFATAL\u001b[0m\tsbom scan error: scan error: scan failed: failed analysis: SBOM decode error: failed to decode: failed to decode components: multiple OS components are not supported\n",
"stacktrace": "github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353\ngithub.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/reconcile/reconcile.go:113\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"
}
Haven’t looked into this yet.
scan error: unable to initialize an image scanner: remote error (image fetch) #
This error occurs because the GET request for the image manifest returns an error response (MANIFEST_UNKNOWN in the log below).
{
"level": "error",
"ts": "2024-04-23T03:07:55Z",
"logger": "reconciler.scan job",
"msg": "Scan job container",
"job": "trivy-system/scan-vulnerabilityreport-7f665c795b",
"container": "calico-windows-upgrade",
"status.reason": "Error",
"status.message": "2024-04-23T03:07:53.141Z\t\u001b[31mFATAL\u001b[0m\timage scan error: scan error: unable to initialize a scanner: unable to initialize an image scanner: 4 errors occurred:\n\t* docker error: unable to inspect the image (mcr.microsoft.com/oss/calico/windows-upgrade:v3.26.3): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\t* containerd error: containerd socket not found: /run/containerd/containerd.sock\n\t* podman error: unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory\n\t* remote error: GET https://mcr.microsoft.com/v2/oss/calico/windows-upgrade/manifests/v3.26.3: MANIFEST_UNKNOWN: manifest tagged by \"v3.26.3\" is not found; map[Tag:v3.26.3]\n\n\n",
"stacktrace": "github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353\ngithub.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/reconcile/reconcile.go:113\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227"
}
I don’t need Windows upgrades at all, so I am thinking of options to handle this:
- Disable scanning with Trivy
  - with label
  - target workload (turn off DaemonSet-reports)
  - namespace (turn off reports for all calico-system resources)
- Disable calico-windows-upgrade DaemonSet
I don’t have a solution just yet. I didn’t prioritise this because the feature that triggers this error is deprecated and will be removed in the future.
Scanning Windows images is not supported #
Pretty self-explanatory.
{
"level": "info",
"ts": "2024-04-08T10:07:35Z",
"logger": "reconciler.scan job",
"msg": "Scan job container",
"job": "trivy-system/scan-vulnerabilityreport-7f4d674d74",
"container": "init",
"status.reason": "Error",
"status.message": "Scanning Windows images is not supported."
}
Scan job - OOMKilled ✅ #
{
"level": "error",
"ts": "2024-04-08T10:08:16Z",
"logger": "reconciler.scan job",
"msg": "Scan job container",
"job": "trivy-system/scan-vulnerabilityreport-5c498d8bc6",
"container": "prometheus",
"status.reason": "OOMKilled",
"status.message": "",
"stacktrace": "github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).completedContainers\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:353\ngithub.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport/controller.(*ScanJobController).SetupWithManager.(*ScanJobController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller/scanjob.go:80\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/reconcile/reconcile.go:113\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"
}
The default RAM values weren’t enough. We gave the scanners a little more request and limit, and this solved the problem. In addition, 10 reports were being generated simultaneously. We changed that to 2 to give the nodes a little room. It is shown under spec.values in the manifest file.
TOO MANY REQUESTS ✅ #
This is an ongoing issue, so the Trivy maintainers might solve it without us having to take action. However, it has been going on for a while now. There are two options for this issue:
- Wait until the Trivy maintainers have fixed the issue
- Setup a mirror for the vulnerability database
At some point, we suddenly got the following error:
2024-10-03T07:31:26Z FATAL Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: failed to fetch the layer: GET https://ghcr.io/v2/aquasecurity/trivy-db/blobs/sha256:77a50f405854d311fdf062f2d7edf3c04c63e2f5d218751a29125431376757a1: TOOMANYREQUESTS: retry-after: 600.129µs, allowed: 44000/minute
I found out why this happens from a discussion thread in the Trivy repo: “This is happening just because Trivy has too many users and reached the rate limits.” Apparently, the GitHub container registry, ghcr.io, introduced rate limiting, which led to this issue. Or Trivy has simply gotten too many users.
So, in theory, setting up a cache for the vulnerability database should help. However, if we need to update the database often, a cache doesn’t help much. It certainly doesn’t solve the problem entirely.
That leaves us with the two options mentioned earlier:
- Setting up our own package registry to mirror the vulnerability database. A comment on this option states that it adds security risk. It definitely forces us to maintain another thing, and we risk not having the latest discovered vulnerabilities in our mirrored version. If the database doesn’t get updated, we might just end up letting another Log4j affect our systems. Anyway, I think this is a good option if this is still an issue next week.
- If you are patient and OK with rerunning your failing pipelines, then a solution could be to wait for the Trivy maintainers to fix the problem. They are looking into the issue and have already pushed small improvements very quickly to remediate it. I am sure they are trying their best to fix it quickly.
Later, I found this announcement about the issue.
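For the wait-and-rerun option, the rerunning can be automated with a retry that backs off whenever the output contains TOOMANYREQUESTS. This is a minimal sketch of that idea, not something Trivy provides: run_with_backoff and its parameters are my own names, and run is any callable you supply that executes the scan and returns its exit code and output. The demo uses a fake scan and a no-op sleep so it finishes instantly.

```python
import time


def run_with_backoff(run, attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call run() until it succeeds or fails for a reason other than rate limiting.

    run() must return (exit_code, output). The delay doubles after each
    rate-limited attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    code, output = 1, ""
    for attempt in range(attempts):
        code, output = run()
        if code == 0 or "TOOMANYREQUESTS" not in output:
            return code, output
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return code, output


def make_fake_scan(failures):
    """Hypothetical stand-in for a scan that is rate limited `failures` times."""
    state = {"calls": 0}

    def fake_scan():
        state["calls"] += 1
        if state["calls"] <= failures:
            return 1, "FATAL ... TOOMANYREQUESTS: retry-after ..."
        return 0, "scan finished"

    return fake_scan


waits = []
result = run_with_backoff(make_fake_scan(2), sleep=waits.append)
print(result, waits)
```

In a pipeline, run would typically wrap subprocess.run around the trivy command and return its returncode together with the combined stdout and stderr.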
Resources #
- https://aquasecurity.github.io/trivy/v0.50/docs/target/kubernetes/
- https://aquasecurity.github.io/trivy-operator/latest/
- https://www.aquasec.com/blog/vulnerability-scanning-trivy-vs-the-trivy-operator/
- https://github.com/aquasecurity/trivy/discussions/4905
- https://github.com/aquasecurity/trivy/discussions/4499
- https://aquasecurity.github.io/trivy/v0.17.2/private-registries/
- K8s Lens: https://docs.k8slens.dev/
- Lens extension: https://aquasecurity.github.io/trivy-operator/v0.10.1/tutorials/integrations/lens/
- https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
- https://github.com/aquasecurity/trivy-operator/issues/1659