Kubernetes restarts Pods without giving up GPUs

Google describes in-place Pod recovery that avoids full recreation, useful for AI workloads on GPUs or TPUs.

Google Open Source published a June 18 post about a Kubernetes change aimed squarely at large compute workloads, including AI and machine-learning training and batch jobs running on GPUs and TPUs. The core claim is specific: Kubernetes v1.35 introduces the RestartAllContainers action, tied to the RestartAllContainersOnContainerExits feature gate, so all containers in a Pod can be restarted without destroying and recreating the Pod object. According to the post, the feature graduates to beta and is enabled by default in v1.36.

The change is low level, but the problem is practical. In Kubernetes, a Pod is the unit that groups one or more containers sharing network, storage, and lifecycle context. Until now, when a multi-container setup needed a clean restart, teams often had to delete and recreate the whole Pod. For a simple service, that may be acceptable. For a distributed job using scarce accelerators, it can trigger rescheduling, a new IP address, DNS delay, control-plane pressure, and sometimes the loss of a hard-won slot on a GPU or TPU node.

The value of RestartAllContainers is that the Pod sandbox stays in place. The kubelet stops the containers, reruns the init containers, and restarts the full set while preserving the Pod's runtime identity. The network address, namespace, mounted volumes, and assigned devices, including GPUs or TPUs, remain attached to the same Pod. For AI workloads, that detail can prevent a thundering herd, in which thousands of Pods fail and all ask the scheduler for compute at once. It can also stop another pending job from taking the node's resources while the original workload is being recreated.
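
As an illustration, here is a minimal Python sketch that builds a Pod manifest in which a watcher sidecar requests a full in-place restart when it exits with a chosen code. Only the RestartAllContainers action name comes from the post. The Pod and container names, the images, the exit code 88, and the exact field layout (`restartPolicyRules`, `exitCodes`) are assumptions based on the container restart rules API and should be checked against the Kubernetes documentation for the version in use.

```python
import json


def build_pod_manifest(restart_exit_code: int = 88) -> dict:
    """Return a Pod manifest as a plain dict.

    All names, images, and the default exit code are hypothetical.
    The field layout of the restart rule is an assumption; verify it
    against the Kubernetes API reference before using it.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "ml-worker"},
        "spec": {
            # Pod-level policy; the per-container rule below overrides it
            # for the one exit code we care about.
            "restartPolicy": "Never",
            "initContainers": [
                # Rerun on every in-place restart to reset the environment.
                {"name": "setup-environment", "image": "example.com/setup:1.0"},
                {
                    # Sidecar that exits with a special code on a fatal error.
                    "name": "watcher-sidecar",
                    "image": "example.com/watcher:1.0",
                    "restartPolicy": "Always",
                    "restartPolicyRules": [
                        {
                            "action": "RestartAllContainers",
                            "exitCodes": {
                                "operator": "In",
                                "values": [restart_exit_code],
                            },
                        }
                    ],
                },
            ],
            "containers": [
                {"name": "main-application", "image": "example.com/train:1.0"}
            ],
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON, which kubectl also accepts.
    print(json.dumps(build_pod_manifest(), indent=2))
```

In this sketch the rule sits on the sidecar, so a fatal error detected by the watcher restarts every container while the Pod object, and the devices assigned to it, stay where they are.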

The Google post lists several cases: a main container corrupting a local environment, a watcher sidecar detecting a fatal error, a database proxy leaving the main application with stale connections, or a large job needing a full environment reset without losing hardware locality. Kubernetes also adds an AllContainersRestarting Pod condition, visible to observability tools, so operators and autoscalers do not mistake the event for a fresh deployment. The takeaway is modest but useful: as AI workloads compete for tighter clusters of expensive accelerators, open source infrastructure is improving the small recovery mechanics that reduce wasted compute and make failures less costly.
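
To make the AllContainersRestarting condition concrete, the following Python sketch shows how a monitoring script might tell an in-place restart apart from a fresh deployment. It assumes the Pod has already been fetched as a dict in the usual Kubernetes JSON shape (for example, from `kubectl get pod -o json`), and that the condition reports the string status "True" while a restart is in progress, as other Pod conditions do. The function name is hypothetical.

```python
def is_restarting_in_place(pod: dict) -> bool:
    """Return True if the Pod reports an active AllContainersRestarting condition.

    `pod` is a Pod object decoded from JSON. Missing or empty status
    sections are treated as "not restarting".
    """
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        cond.get("type") == "AllContainersRestarting"
        and cond.get("status") == "True"
        for cond in conditions
    )


if __name__ == "__main__":
    # Hand-written sample status, not output captured from a real cluster.
    sample = {
        "status": {
            "conditions": [
                {"type": "Ready", "status": "False"},
                {"type": "AllContainersRestarting", "status": "True"},
            ]
        }
    }
    print(is_restarting_in_place(sample))
```

A dashboard or autoscaler could use a check like this to suppress "new Pod" alerts while the condition is set.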