This is the multi-page printable view of this section. Click here to print.
Security
1 - Apply Pod Security Standards at the Cluster Level
Note
This tutorial applies only for new clusters.Pod Security admission (PSA) is enabled by default in v1.23 and later, as it has
graduated to beta.
Pod Security
is an admission controller that carries out checks against the Kubernetes
Pod Security Standards when new pods are
created. This tutorial shows you how to enforce the baseline
Pod Security
Standard at the cluster level which applies a standard configuration
to all namespaces in a cluster.
To apply Pod Security Standards to specific namespaces, refer to Apply Pod Security Standards at the namespace level.
If you are running a version of Kubernetes other than v1.25, check the documentation for that version.
Before you begin
Install the following on your workstation:
Choose the right Pod Security Standard to apply
Pod Security Admission
lets you apply built-in Pod Security Standards
with the following modes: enforce
, audit
, and warn
.
To gather information that helps you to choose the Pod Security Standards that are most appropriate for your configuration, do the following:
Create a cluster with no Pod Security Standards applied:
kind create cluster --name psa-wo-cluster-pss --image kindest/node:v1.24.0
The output is similar to this:
Creating cluster "psa-wo-cluster-pss" ... ✓ Ensuring node image (kindest/node:v1.24.0) 🖼 ✓ Preparing nodes 📦 ✓ Writing configuration 📜 ✓ Starting control-plane 🕹️ ✓ Installing CNI 🔌 ✓ Installing StorageClass 💾 Set kubectl context to "kind-psa-wo-cluster-pss" You can now use your cluster with: kubectl cluster-info --context kind-psa-wo-cluster-pss Thanks for using kind! 😊
Set the kubectl context to the new cluster:
kubectl cluster-info --context kind-psa-wo-cluster-pss
The output is similar to this:
Kubernetes control plane is running at https://127.0.0.1:61350 CoreDNS is running at https://127.0.0.1:61350/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Get a list of namespaces in the cluster:
kubectl get ns
The output is similar to this:
NAME STATUS AGE default Active 9m30s kube-node-lease Active 9m32s kube-public Active 9m32s kube-system Active 9m32s local-path-storage Active 9m26s
Use
--dry-run=server
to understand what happens when different Pod Security Standards are applied:- Privileged
kubectl label --dry-run=server --overwrite ns --all \ pod-security.kubernetes.io/enforce=privileged
The output is similar to this:
namespace/default labeled namespace/kube-node-lease labeled namespace/kube-public labeled namespace/kube-system labeled namespace/local-path-storage labeled
- Baseline
kubectl label --dry-run=server --overwrite ns --all \ pod-security.kubernetes.io/enforce=baseline
The output is similar to this:
namespace/default labeled namespace/kube-node-lease labeled namespace/kube-public labeled Warning: existing pods in namespace "kube-system" violate the new PodSecurity enforce level "baseline:latest" Warning: etcd-psa-wo-cluster-pss-control-plane (and 3 other pods): host namespaces, hostPath volumes Warning: kindnet-vzj42: non-default capabilities, host namespaces, hostPath volumes Warning: kube-proxy-m6hwf: host namespaces, hostPath volumes, privileged namespace/kube-system labeled namespace/local-path-storage labeled
- Restricted
kubectl label --dry-run=server --overwrite ns --all \ pod-security.kubernetes.io/enforce=restricted
The output is similar to this:
namespace/default labeled namespace/kube-node-lease labeled namespace/kube-public labeled Warning: existing pods in namespace "kube-system" violate the new PodSecurity enforce level "restricted:latest" Warning: coredns-7bb9c7b568-hsptc (and 1 other pod): unrestricted capabilities, runAsNonRoot != true, seccompProfile Warning: etcd-psa-wo-cluster-pss-control-plane (and 3 other pods): host namespaces, hostPath volumes, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true Warning: kindnet-vzj42: non-default capabilities, host namespaces, hostPath volumes, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true, seccompProfile Warning: kube-proxy-m6hwf: host namespaces, hostPath volumes, privileged, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true, seccompProfile namespace/kube-system labeled Warning: existing pods in namespace "local-path-storage" violate the new PodSecurity enforce level "restricted:latest" Warning: local-path-provisioner-d6d9f7ffc-lw9lh: allowPrivilegeEscalation != false, unrestricted capabilities, runAsNonRoot != true, seccompProfile namespace/local-path-storage labeled
- Privileged
From the previous output, you'll notice that applying the privileged
Pod Security Standard shows no warnings
for any namespaces. However, baseline
and restricted
standards both have
warnings, specifically in the kube-system
namespace.
Set modes, versions and standards
In this section, you apply the following Pod Security Standards to the latest
version:
baseline
standard inenforce
mode.restricted
standard inwarn
andaudit
mode.
The baseline
Pod Security Standard provides a convenient
middle ground that allows keeping the exemption list short and prevents known
privilege escalations.
Additionally, to prevent pods from failing in kube-system
, you'll exempt the namespace
from having Pod Security Standards applied.
When you implement Pod Security Admission in your own environment, consider the following:
Based on the risk posture applied to a cluster, a stricter Pod Security Standard like
restricted
might be a better choice.Exempting the
kube-system
namespace allows pods to run asprivileged
in this namespace. For real world use, the Kubernetes project strongly recommends that you apply strict RBAC policies that limit access tokube-system
, following the principle of least privilege. To implement the preceding standards, do the following:Create a configuration file that can be consumed by the Pod Security Admission Controller to implement these Pod Security Standards:
mkdir -p /tmp/pss cat <<EOF > /tmp/pss/cluster-level-pss.yaml apiVersion: apiserver.config.k8s.io/v1 kind: AdmissionConfiguration plugins: - name: PodSecurity configuration: apiVersion: pod-security.admission.config.k8s.io/v1 kind: PodSecurityConfiguration defaults: enforce: "baseline" enforce-version: "latest" audit: "restricted" audit-version: "latest" warn: "restricted" warn-version: "latest" exemptions: usernames: [] runtimeClasses: [] namespaces: [kube-system] EOF
Configure the API server to consume this file during cluster creation:
cat <<EOF > /tmp/pss/cluster-config.yaml kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 nodes: - role: control-plane kubeadmConfigPatches: - | kind: ClusterConfiguration apiServer: extraArgs: admission-control-config-file: /etc/config/cluster-level-pss.yaml extraVolumes: - name: accf hostPath: /etc/config mountPath: /etc/config readOnly: false pathType: "DirectoryOrCreate" extraMounts: - hostPath: /tmp/pss containerPath: /etc/config # optional: if set, the mount is read-only. # default false readOnly: false # optional: if set, the mount needs SELinux relabeling. # default false selinuxRelabel: false # optional: set propagation mode (None, HostToContainer or Bidirectional) # see https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation # default None propagation: None EOF
Note: If you use Docker Desktop with KinD on macOS, you can add/tmp
as a Shared Directory under the menu item Preferences > Resources > File Sharing.Create a cluster that uses Pod Security Admission to apply these Pod Security Standards:
kind create cluster --name psa-with-cluster-pss --image kindest/node:v1.24.0 --config /tmp/pss/cluster-config.yaml
The output is similar to this:
Creating cluster "psa-with-cluster-pss" ... ✓ Ensuring node image (kindest/node:v1.24.0) 🖼 ✓ Preparing nodes 📦 ✓ Writing configuration 📜 ✓ Starting control-plane 🕹️ ✓ Installing CNI 🔌 ✓ Installing StorageClass 💾 Set kubectl context to "kind-psa-with-cluster-pss" You can now use your cluster with: kubectl cluster-info --context kind-psa-with-cluster-pss Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂
Point kubectl to the cluster
kubectl cluster-info --context kind-psa-with-cluster-pss
The output is similar to this:
Kubernetes control plane is running at https://127.0.0.1:63855 CoreDNS is running at https://127.0.0.1:63855/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Create the following Pod specification for a minimal configuration in the default namespace:
cat <<EOF > /tmp/pss/nginx-pod.yaml apiVersion: v1 kind: Pod metadata: name: nginx spec: containers: - image: nginx name: nginx ports: - containerPort: 80 EOF
Create the Pod in the cluster:
kubectl apply -f /tmp/pss/nginx-pod.yaml
The output is similar to this:
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") pod/nginx created
Clean up
Run kind delete cluster --name psa-with-cluster-pss
and
kind delete cluster --name psa-wo-cluster-pss
to delete the clusters you
created.
What's next
- Run a
shell script
to perform all the preceding steps at once:
- Create a Pod Security Standards based cluster level Configuration
- Create a file to let API server consume this configuration
- Create a cluster that creates an API server with this configuration
- Set kubectl context to this new cluster
- Create a minimal pod yaml file
- Apply this file to create a Pod in the new cluster
- Pod Security Admission
- Pod Security Standards
- Apply Pod Security Standards at the namespace level
2 - Apply Pod Security Standards at the Namespace Level
Note
This tutorial applies only for new clusters.Pod Security admission (PSA) is enabled by default in v1.23 and later, as it
graduated to beta. Pod Security Admission
is an admission controller that applies
Pod Security Standards
when pods are created. In this tutorial, you will enforce the baseline
Pod Security Standard,
one namespace at a time.
You can also apply Pod Security Standards to multiple namespaces at once at the cluster level. For instructions, refer to Apply Pod Security Standards at the cluster level.
Before you begin
Install the following on your workstation:
Create cluster
Create a
KinD
cluster as follows:kind create cluster --name psa-ns-level --image kindest/node:v1.23.0
The output is similar to this:
Creating cluster "psa-ns-level" ... ✓ Ensuring node image (kindest/node:v1.23.0) 🖼 ✓ Preparing nodes 📦 ✓ Writing configuration 📜 ✓ Starting control-plane 🕹️ ✓ Installing CNI 🔌 ✓ Installing StorageClass 💾 Set kubectl context to "kind-psa-ns-level" You can now use your cluster with: kubectl cluster-info --context kind-psa-ns-level Not sure what to do next? 😅 Check out https://kind.sigs.k8s.io/docs/user/quick-start/
Set the kubectl context to the new cluster:
kubectl cluster-info --context kind-psa-ns-level
The output is similar to this:
Kubernetes control plane is running at https://127.0.0.1:50996 CoreDNS is running at https://127.0.0.1:50996/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Create a namespace
Create a new namespace called example
:
kubectl create ns example
The output is similar to this:
namespace/example created
Apply Pod Security Standards
Enable Pod Security Standards on this namespace using labels supported by built-in Pod Security Admission. In this step we will warn on baseline pod security standard as per the latest version (default value)
kubectl label --overwrite ns example \ pod-security.kubernetes.io/warn=baseline \ pod-security.kubernetes.io/warn-version=latest
Multiple pod security standards can be enabled on any namespace, using labels. Following command will
enforce
thebaseline
Pod Security Standard, butwarn
andaudit
forrestricted
Pod Security Standards as per the latest version (default value)kubectl label --overwrite ns example \ pod-security.kubernetes.io/enforce=baseline \ pod-security.kubernetes.io/enforce-version=latest \ pod-security.kubernetes.io/warn=restricted \ pod-security.kubernetes.io/warn-version=latest \ pod-security.kubernetes.io/audit=restricted \ pod-security.kubernetes.io/audit-version=latest
Verify the Pod Security Standards
Create a minimal pod in
example
namespace:cat <<EOF > /tmp/pss/nginx-pod.yaml apiVersion: v1 kind: Pod metadata: name: nginx spec: containers: - image: nginx name: nginx ports: - containerPort: 80 EOF
Apply the pod spec to the cluster in
example
namespace:kubectl apply -n example -f /tmp/pss/nginx-pod.yaml
The output is similar to this:
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") pod/nginx created
Apply the pod spec to the cluster in
default
namespace:kubectl apply -n default -f /tmp/pss/nginx-pod.yaml
Output is similar to this:
pod/nginx created
The Pod Security Standards were applied only to the example
namespace. You could create the same Pod in the default
namespace
with no warnings.
Clean up
Run kind delete cluster --name psa-ns-level
to delete the cluster created.
What's next
Run a shell script to perform all the preceding steps all at once.
- Create KinD cluster
- Create new namespace
- Apply
baseline
Pod Security Standard inenforce
mode while applyingrestricted
Pod Security Standard also inwarn
andaudit
mode. - Create a new pod with the following pod security standards applied
3 - Restrict a Container's Access to Resources with AppArmor
Kubernetes v1.4 [beta]
AppArmor is a Linux kernel security module that supplements the standard Linux user and group based permissions to confine programs to a limited set of resources. AppArmor can be configured for any application to reduce its potential attack surface and provide greater in-depth defense. It is configured through profiles tuned to allow the access needed by a specific program or container, such as Linux capabilities, network access, file permissions, etc. Each profile can be run in either enforcing mode, which blocks access to disallowed resources, or complain mode, which only reports violations.
AppArmor can help you to run a more secure deployment by restricting what containers are allowed to do, and/or provide better auditing through system logs. However, it is important to keep in mind that AppArmor is not a silver bullet and can only do so much to protect against exploits in your application code. It is important to provide good, restrictive profiles, and harden your applications and cluster from other angles as well.
Objectives
- See an example of how to load a profile on a node
- Learn how to enforce the profile on a Pod
- Learn how to check that the profile is loaded
- See what happens when a profile is violated
- See what happens when a profile cannot be loaded
Before you begin
Make sure:
Kubernetes version is at least v1.4 -- Kubernetes support for AppArmor was added in v1.4. Kubernetes components older than v1.4 are not aware of the new AppArmor annotations, and will silently ignore any AppArmor settings that are provided. To ensure that your Pods are receiving the expected protections, it is important to verify the Kubelet version of your nodes:
kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.kubeletVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: v1.4.0 gke-test-default-pool-239f5d02-x1kf: v1.4.0 gke-test-default-pool-239f5d02-xwux: v1.4.0
AppArmor kernel module is enabled -- For the Linux kernel to enforce an AppArmor profile, the AppArmor kernel module must be installed and enabled. Several distributions enable the module by default, such as Ubuntu and SUSE, and many others provide optional support. To check whether the module is enabled, check the
/sys/module/apparmor/parameters/enabled
file:cat /sys/module/apparmor/parameters/enabled Y
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor options if the kernel module is not enabled.
Container runtime supports AppArmor -- Currently all common Kubernetes-supported container runtimes should support AppArmor, like Docker, CRI-O or containerd. Please refer to the corresponding runtime documentation and verify that the cluster fulfills the requirements to use AppArmor.
Profile is loaded -- AppArmor is applied to a Pod by specifying an AppArmor profile that each container should be run with. If any of the specified profiles is not already loaded in the kernel, the Kubelet (>= v1.4) will reject the Pod. You can view which profiles are loaded on a node by checking the
/sys/kernel/security/apparmor/profiles
file. For example:ssh gke-test-default-pool-239f5d02-gyn2 "sudo cat /sys/kernel/security/apparmor/profiles | sort"
apparmor-test-deny-write (enforce) apparmor-test-audit-write (enforce) docker-default (enforce) k8s-nginx (enforce)
For more details on loading profiles on nodes, see Setting up nodes with profiles.
As long as the Kubelet version includes AppArmor support (>= v1.4), the Kubelet will reject a Pod with AppArmor options if any of the prerequisites are not met. You can also verify AppArmor support on nodes by checking the node ready condition message (though this is likely to be removed in a later release):
kubectl get nodes -o=jsonpath='{range .items[*]}{@.metadata.name}: {.status.conditions[?(@.reason=="KubeletReady")].message}{"\n"}{end}'
gke-test-default-pool-239f5d02-gyn2: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-x1kf: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-xwux: kubelet is posting ready status. AppArmor enabled
Securing a Pod
AppArmor profiles are specified per-container. To specify the AppArmor profile to run a Pod container with, add an annotation to the Pod's metadata:
container.apparmor.security.beta.kubernetes.io/<container_name>: <profile_ref>
Where <container_name>
is the name of the container to apply the profile to, and <profile_ref>
specifies the profile to apply. The profile_ref
can be one of:
runtime/default
to apply the runtime's default profilelocalhost/<profile_name>
to apply the profile loaded on the host with the name<profile_name>
unconfined
to indicate that no profiles will be loaded
See the API Reference for the full details on the annotation and profile name formats.
Kubernetes AppArmor enforcement works by first checking that all the prerequisites have been met, and then forwarding the profile selection to the container runtime for enforcement. If the prerequisites have not been met, the Pod will be rejected, and will not run.
To verify that the profile was applied, you can look for the AppArmor security option listed in the container created event:
kubectl get events | grep Created
22s 22s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet e2e-test-stclair-node-pool-31nt} Created container with docker id 269a53b202d3; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
You can also verify directly that the container's root process is running with the correct profile by checking its proc attr:
kubectl exec <pod_name> -- cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
Example
This example assumes you have already set up a cluster with AppArmor support.
First, we need to load the profile we want to use onto our nodes. This profile denies all file writes:
#include <tunables/global>
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
# Deny all file writes.
deny /** w,
}
Since we don't know where the Pod will be scheduled, we'll need to load the profile on all our nodes. For this example we'll use SSH to install the profiles, but other approaches are discussed in Setting up nodes with profiles.
NODES=(
# The SSH-accessible domain names of your nodes
gke-test-default-pool-239f5d02-gyn2.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-x1kf.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-xwux.us-central1-a.my-k8s)
for NODE in ${NODES[*]}; do ssh $NODE 'sudo apparmor_parser -q <<EOF
#include <tunables/global>
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
# Deny all file writes.
deny /** w,
}
EOF'
done
Next, we'll run a simple "Hello AppArmor" pod with the deny-write profile:
apiVersion: v1
kind: Pod
metadata:
name: hello-apparmor
annotations:
# Tell Kubernetes to apply the AppArmor profile "k8s-apparmor-example-deny-write".
# Note that this is ignored if the Kubernetes node is not running version 1.4 or greater.
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
containers:
- name: hello
image: busybox:1.28
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
kubectl create -f ./hello-apparmor.yaml
If we look at the pod events, we can see that the Pod container was created with the AppArmor profile "k8s-apparmor-example-deny-write":
kubectl get events | grep hello-apparmor
14s 14s 1 hello-apparmor Pod Normal Scheduled {default-scheduler } Successfully assigned hello-apparmor to gke-test-default-pool-239f5d02-gyn2
14s 14s 1 hello-apparmor Pod spec.containers{hello} Normal Pulling {kubelet gke-test-default-pool-239f5d02-gyn2} pulling image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Pulled {kubelet gke-test-default-pool-239f5d02-gyn2} Successfully pulled image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet gke-test-default-pool-239f5d02-gyn2} Created container with docker id 06b6cd1c0989; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Started {kubelet gke-test-default-pool-239f5d02-gyn2} Started container with docker id 06b6cd1c0989
We can verify that the container is actually running with that profile by checking its proc attr:
kubectl exec hello-apparmor -- cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
Finally, we can see what happens if we try to violate the profile by writing to a file:
kubectl exec hello-apparmor -- touch /tmp/test
touch: /tmp/test: Permission denied
error: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1
To wrap up, let's look at what happens if we try to specify a profile that hasn't been loaded:
kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
name: hello-apparmor-2
annotations:
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-allow-write
spec:
containers:
- name: hello
image: busybox:1.28
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod/hello-apparmor-2 created
kubectl describe pod hello-apparmor-2
Name: hello-apparmor-2
Namespace: default
Node: gke-test-default-pool-239f5d02-x1kf/
Start Time: Tue, 30 Aug 2016 17:58:56 -0700
Labels: <none>
Annotations: container.apparmor.security.beta.kubernetes.io/hello=localhost/k8s-apparmor-example-allow-write
Status: Pending
Reason: AppArmor
Message: Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
IP:
Controllers: <none>
Containers:
hello:
Container ID:
Image: busybox
Image ID:
Port:
Command:
sh
-c
echo 'Hello AppArmor!' && sleep 1h
State: Waiting
Reason: Blocked
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-dnz7v (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-dnz7v:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dnz7v
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
23s 23s 1 {default-scheduler } Normal Scheduled Successfully assigned hello-apparmor-2 to e2e-test-stclair-node-pool-t1f5
23s 23s 1 {kubelet e2e-test-stclair-node-pool-t1f5} Warning AppArmor Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
Note the pod status is Pending, with a helpful error message: Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
. An event was also recorded with the same message.
Administration
Setting up nodes with profiles
Kubernetes does not currently provide any native mechanisms for loading AppArmor profiles onto nodes. There are lots of ways to set up the profiles though, such as:
- Through a DaemonSet that runs a Pod on each node to ensure the correct profiles are loaded. An example implementation can be found here.
- At node initialization time, using your node initialization scripts (e.g. Salt, Ansible, etc.) or image.
- By copying the profiles to each node and loading them through SSH, as demonstrated in the Example.
The scheduler is not aware of which profiles are loaded onto which node, so the full set of profiles must be loaded onto every node. An alternative approach is to add a node label for each profile (or class of profiles) on the node, and use a node selector to ensure the Pod is run on a node with the required profile.
Disabling AppArmor
If you do not want AppArmor to be available on your cluster, it can be disabled by a command-line flag:
--feature-gates=AppArmor=false
When disabled, any Pod that includes an AppArmor profile will fail validation with a "Forbidden" error.
Authoring Profiles
Getting AppArmor profiles specified correctly can be a tricky business. Fortunately there are some tools to help with that:
aa-genprof
andaa-logprof
generate profile rules by monitoring an application's activity and logs, and admitting the actions it takes. Further instructions are provided by the AppArmor documentation.- bane is an AppArmor profile generator for Docker that uses a simplified profile language.
To debug problems with AppArmor, you can check the system logs to see what, specifically, was
denied. AppArmor logs verbose messages to dmesg
, and errors can usually be found in the system
logs or through journalctl
. More information is provided in
AppArmor failures.
API Reference
Pod Annotation
Specifying the profile a container will run with:
- key:
container.apparmor.security.beta.kubernetes.io/<container_name>
Where<container_name>
matches the name of a container in the Pod. A separate profile can be specified for each container in the Pod. - value: a profile reference, described below
Profile Reference
runtime/default
: Refers to the default runtime profile.- Equivalent to not specifying a profile, except it still requires AppArmor to be enabled.
- In practice, many container runtimes use the same OCI default profile, defined here: https://github.com/containers/common/blob/main/pkg/apparmor/apparmor_linux_template.go
localhost/<profile_name>
: Refers to a profile loaded on the node (localhost) by name.- The possible profile names are detailed in the core policy reference.
unconfined
: This effectively disables AppArmor on the container.
Any other profile reference format is invalid.
What's next
Additional resources:
4 - Restrict a Container's Syscalls with seccomp
Kubernetes v1.19 [stable]
Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.
Identifying the privileges required for your workloads can be difficult. In this tutorial, you will go through how to load seccomp profiles into a local Kubernetes cluster, how to apply them to a Pod, and how you can begin to craft profiles that give only the necessary privileges to your container processes.
Objectives
- Learn how to load seccomp profiles on a node
- Learn how to apply a seccomp profile to a container
- Observe auditing of syscalls made by a container process
- Observe behavior when a missing profile is specified
- Observe a violation of a seccomp profile
- Learn how to create fine-grained seccomp profiles
- Learn how to apply a container runtime default seccomp profile
Before you begin
In order to complete all steps in this tutorial, you must install kind and kubectl.
This tutorial shows some examples that are still beta (since v1.25) and others that use only generally available seccomp functionality. You should make sure that your cluster is configured correctly for the version you are using.
The tutorial also uses the curl
tool for downloading examples to your computer.
You can adapt the steps to use a different tool if you prefer.
privileged: true
set in the container's securityContext
. Privileged containers always
run as Unconfined
.Download example seccomp profiles
The contents of these profiles will be explored later on, but for now go ahead
and download them into a directory named profiles/
so that they can be loaded
into the cluster.
{
"defaultAction": "SCMP_ACT_LOG"
}
{
"defaultAction": "SCMP_ACT_ERRNO"
}
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"names": [
"accept4",
"epoll_wait",
"pselect6",
"futex",
"madvise",
"epoll_ctl",
"getsockname",
"setsockopt",
"vfork",
"mmap",
"read",
"write",
"close",
"arch_prctl",
"sched_getaffinity",
"munmap",
"brk",
"rt_sigaction",
"rt_sigprocmask",
"sigaltstack",
"gettid",
"clone",
"bind",
"socket",
"openat",
"readlinkat",
"exit_group",
"epoll_create1",
"listen",
"rt_sigreturn",
"sched_yield",
"clock_gettime",
"connect",
"dup2",
"epoll_pwait",
"execve",
"exit",
"fcntl",
"getpid",
"getuid",
"ioctl",
"mprotect",
"nanosleep",
"open",
"poll",
"recvfrom",
"sendto",
"set_tid_address",
"setitimer",
"writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
Run these commands:
mkdir ./profiles
curl -L -o profiles/audit.json https://k8s.io/examples/pods/security/seccomp/profiles/audit.json
curl -L -o profiles/violation.json https://k8s.io/examples/pods/security/seccomp/profiles/violation.json
curl -L -o profiles/fine-grained.json https://k8s.io/examples/pods/security/seccomp/profiles/fine-grained.json
ls profiles
You should see three profiles listed at the end of the final step:
audit.json fine-grained.json violation.json
Create a local Kubernetes cluster with kind
For simplicity, kind can be used to create a single node cluster with the seccomp profiles loaded. Kind runs Kubernetes in Docker, so each node of the cluster is a container. This allows for files to be mounted in the filesystem of each container similar to loading files onto a node.
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
extraMounts:
- hostPath: "./profiles"
containerPath: "/var/lib/kubelet/seccomp/profiles"
Download that example kind configuration, and save it to a file named kind.yaml
:
curl -L -O https://k8s.io/examples/pods/security/seccomp/kind.yaml
You can set a specific Kubernetes version by setting the node's container image. See Nodes within the kind documentation about configuration for more details on this. This tutorial assumes you are using Kubernetes v1.25.
As a beta feature, you can configure Kubernetes to use the profile that the
container runtime
prefers by default, rather than falling back to Unconfined
.
If you want to try that, see
enable the use of RuntimeDefault
as the default seccomp profile for all workloads
before you continue.
Once you have a kind configuration in place, create the kind cluster with that configuration:
kind create cluster --config=kind.yaml
After the new Kubernetes cluster is ready, identify the Docker container running as the single node cluster:
docker ps
You should see output indicating that a container is running with name
kind-control-plane
. The output is similar to:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6a96207fed4b kindest/node:v1.18.2 "/usr/local/bin/entr…" 27 seconds ago Up 24 seconds 127.0.0.1:42223->6443/tcp kind-control-plane
If observing the filesystem of that container, you should see that the
profiles/
directory has been successfully loaded into the default seccomp path
of the kubelet. Use docker exec
to run a command in the Pod:
# Change 6a96207fed4b to the container ID you saw from "docker ps"
docker exec -it 6a96207fed4b ls /var/lib/kubelet/seccomp/profiles
audit.json fine-grained.json violation.json
You have verified that these seccomp profiles are available to the kubelet running within kind.
Enable the use of RuntimeDefault
as the default seccomp profile for all workloads
Kubernetes v1.25 [beta]
To use seccomp profile defaulting, you must run the kubelet with the SeccompDefault
feature gate enabled
(this is the default). You must also explicitly enable the defaulting behavior for each
node where you want to use this with the corresponding --seccomp-default
command line flag.
Both have to be enabled simultaneously to use the feature.
If enabled, the kubelet will use the RuntimeDefault
seccomp profile by default, which is
defined by the container runtime, instead of using the Unconfined
(seccomp disabled) mode.
The default profiles aim to provide a strong set
of security defaults while preserving the functionality of the workload. It is
possible that the default profiles differ between container runtimes and their
release versions, for example when comparing those from CRI-O and containerd.
securityContext.seccompProfile
API field nor add the deprecated annotations of
the workload. This provides users the possibility to rollback anytime without
actually changing the workload configuration. Tools like
crictl inspect
can be used to
verify which seccomp profile is being used by a container.Some workloads may require a lower amount of syscall restrictions than others.
This means that they can fail during runtime even with the RuntimeDefault
profile. To mitigate such a failure, you can:
- Run the workload explicitly as
Unconfined
. - Disable the
SeccompDefault
feature for the nodes. Also making sure that workloads get scheduled on nodes where the feature is disabled. - Create a custom seccomp profile for the workload.
If you were introducing this feature into production-like cluster, the Kubernetes project recommends that you enable this feature gate on a subset of your nodes and then test workload execution before rolling the change out cluster-wide.
You can find more detailed information about a possible upgrade and downgrade strategy in the related Kubernetes Enhancement Proposal (KEP): Enable seccomp by default.
Kubernetes 1.25 lets you configure the seccomp profile
that applies when the spec for a Pod doesn't define a specific seccomp profile.
This is a beta feature and the corresponding SeccompDefault
feature
gate is enabled by
default. However, you still need to enable this defaulting for each node where
you would like to use it.
If you are running a Kubernetes 1.25 cluster and want to
enable the feature, either run the kubelet with the --seccomp-default
command
line flag, or enable it through the kubelet configuration
file. To enable the
feature gate in kind, ensure that kind
provides
the minimum required Kubernetes version and enables the SeccompDefault
feature
in the kind configuration:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
SeccompDefault: true
nodes:
- role: control-plane
image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
kubeletExtraArgs:
seccomp-default: "true"
- role: worker
image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
kubeletExtraArgs:
feature-gates: SeccompDefault=true
seccomp-default: "true"
If the cluster is ready, then running a pod:
kubectl run --rm -it --restart=Never --image=alpine alpine -- sh
Should now have the default seccomp profile attached. This can be verified by
using docker exec
to run crictl inspect
for the container on the kind
worker:
docker exec -it kind-worker bash -c \
'crictl inspect $(crictl ps --name=alpine -q) | jq .info.runtimeSpec.linux.seccomp'
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
"syscalls": [
{
"names": ["..."]
}
]
}
Create a Pod with a seccomp profile for syscall auditing
To start off, apply the audit.json
profile, which will log all syscalls of the
process, to a new Pod.
Here's a manifest for that Pod:
apiVersion: v1
kind: Pod
metadata:
name: audit-pod
labels:
app: audit-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/audit.json
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
The functional support for the already deprecated seccomp annotations
seccomp.security.alpha.kubernetes.io/pod
(for the whole pod) and
container.seccomp.security.alpha.kubernetes.io/[name]
(for a single container)
is going to be removed with a future release of Kubernetes. Please always use
the native API fields in favor of the annotations.
Since Kubernetes v1.25, kubelets no longer support the annotations, use of the annotations in static pods is no longer supported, and the seccomp annotations are no longer auto-populated when pods with seccomp fields are created. Auto-population of the seccomp fields from the annotations is planned to be removed in a future release.
Create the Pod in the cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/audit-pod.yaml
This profile does not restrict any syscalls, so the Pod should start successfully.
kubectl get pod/audit-pod
NAME READY STATUS RESTARTS AGE
audit-pod 1/1 Running 0 30s
In order to be able to interact with this endpoint exposed by this container, create a NodePort Services that allows access to the endpoint from inside the kind control plane container.
kubectl expose pod audit-pod --type NodePort --port 5678
Check what port the Service has been assigned on the node.
kubectl get service audit-pod
The output is similar to:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
audit-pod NodePort 10.111.36.142 <none> 5678:32373/TCP 72s
Now you can use curl
to access that endpoint from inside the kind control plane container,
at the port exposed by this Service. Use docker exec
to run the curl
command within the
container belonging to that control plane container:
# Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
docker exec -it 6a96207fed4b curl localhost:32373
just made some syscalls!
You can see that the process is running, but what syscalls did it actually make?
Because this Pod is running in a local cluster, you should be able to see those
in /var/log/syslog
. Open up a new terminal window and tail
the output for
calls from http-echo
:
tail -f /var/log/syslog | grep 'http-echo'
You should already see some logs of syscalls made by http-echo
, and if you
curl
the endpoint in the control plane container you will see more written.
For example:
Jul 6 15:37:40 my-machine kernel: [369128.669452] audit: type=1326 audit(1594067860.484:14536): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0 ip=0x46fe1f code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669453] audit: type=1326 audit(1594067860.484:14537): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=54 compat=0 ip=0x46fdba code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669455] audit: type=1326 audit(1594067860.484:14538): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669456] audit: type=1326 audit(1594067860.484:14539): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=288 compat=0 ip=0x46fdba code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669517] audit: type=1326 audit(1594067860.484:14540): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=0 compat=0 ip=0x46fd44 code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669519] audit: type=1326 audit(1594067860.484:14541): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
Jul 6 15:38:40 my-machine kernel: [369188.671648] audit: type=1326 audit(1594067920.488:14559): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
Jul 6 15:38:40 my-machine kernel: [369188.671726] audit: type=1326 audit(1594067920.488:14560): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
You can begin to understand the syscalls required by the http-echo
process by
looking at the syscall=
entry on each line. While these are unlikely to
encompass all syscalls it uses, it can serve as a basis for a seccomp profile
for this container.
Clean up that Pod and Service before moving to the next section:
kubectl delete service audit-pod --wait
kubectl delete pod audit-pod --wait --now
Create Pod with a seccomp profile that causes violation
For demonstration, apply a profile to the Pod that does not allow for any syscalls.
The manifest for this demonstration is:
apiVersion: v1
kind: Pod
metadata:
name: violation-pod
labels:
app: violation-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/violation.json
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
Attempt to create the Pod in the cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/violation-pod.yaml
The Pod creates, but there is an issue. If you check the status of the Pod, you should see that it failed to start.
kubectl get pod/violation-pod
NAME READY STATUS RESTARTS AGE
violation-pod 0/1 CrashLoopBackOff 1 6s
As seen in the previous example, the http-echo
process requires quite a few
syscalls. Here seccomp has been instructed to error on any syscall by setting
"defaultAction": "SCMP_ACT_ERRNO"
. This is extremely secure, but removes the
ability to do anything meaningful. What you really want is to give workloads
only the privileges they need.
Clean up that Pod before moving to the next section:
kubectl delete pod violation-pod --wait --now
Create Pod with a seccomp profile that only allows necessary syscalls
If you take a look at the fine-grained.json
profile, you will notice some of the syscalls
seen in syslog of the first example where the profile set "defaultAction": "SCMP_ACT_LOG"
. Now the profile is setting "defaultAction": "SCMP_ACT_ERRNO"
,
but explicitly allowing a set of syscalls in the "action": "SCMP_ACT_ALLOW"
block. Ideally, the container will run successfully and you will see no messages
sent to syslog
.
The manifest for this example is:
apiVersion: v1
kind: Pod
metadata:
name: fine-pod
labels:
app: fine-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/fine-grained.json
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
Create the Pod in your cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/fine-pod.yaml
kubectl get pod fine-pod
The Pod should be showing as having started successfully:
NAME READY STATUS RESTARTS AGE
fine-pod 1/1 Running 0 30s
Open up a new terminal window and use tail
to monitor for log entries that
mention calls from http-echo
:
# The log path on your computer might be different from "/var/log/syslog"
tail -f /var/log/syslog | grep 'http-echo'
Next, expose the Pod with a NodePort Service:
kubectl expose pod fine-pod --type NodePort --port 5678
Check what port the Service has been assigned on the node:
kubectl get service fine-pod
The output is similar to:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fine-pod NodePort 10.111.36.142 <none> 5678:32373/TCP 72s
Use curl
to access that endpoint from inside the kind control plane container:
# Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
docker exec -it 6a96207fed4b curl localhost:32373
just made some syscalls!
You should see no output in the syslog
. This is because the profile allowed all
necessary syscalls and specified that an error should occur if one outside of
the list is invoked. This is an ideal situation from a security perspective, but
required some effort in analyzing the program. It would be nice if there was a
simple way to get closer to this security without requiring as much effort.
Clean up that Pod and Service before moving to the next section:
kubectl delete service fine-pod --wait
kubectl delete pod fine-pod --wait --now
Create Pod that uses the container runtime default seccomp profile
Most container runtimes provide a sane set of default syscalls that are allowed
or not. You can adopt these defaults for your workload by setting the seccomp
type in the security context of a pod or container to RuntimeDefault
.
SeccompDefault
feature gate enabled, then Pods use the RuntimeDefault
seccomp profile whenever
no other seccomp profile is specified. Otherwise, the default is Unconfined
.Here's a manifest for a Pod that requests the RuntimeDefault
seccomp profile
for all its containers:
apiVersion: v1
kind: Pod
metadata:
name: default-pod
labels:
app: default-pod
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some more syscalls!"
securityContext:
allowPrivilegeEscalation: false
Create that Pod:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/default-pod.yaml
kubectl get pod default-pod
The Pod should be showing as having started successfully:
NAME READY STATUS RESTARTS AGE
default-pod 1/1 Running 0 20s
Finally, now that you saw that work OK, clean up:
kubectl delete pod default-pod --wait --now
What's next
You can learn more about Linux seccomp: