Taming Background Jobs: Sandboxing Celery and Sidekiq Tasks Before They Misbehave

Alprina Security Team

August 20, 2024

Hook: The Background Job That Exfiltrated Your Service Account

You run Celery workers on Kubernetes that handle file conversions. A contributor merges a task that shells out to ImageMagick. Everything looked fine in review, but the task interpolates user-controlled input directly into a shell command. An attacker uploads a file whose name contains $(cat /var/run/secrets/kubernetes.io/serviceaccount/token). The worker hands the string to /bin/sh, which runs the command substitution before ImageMagick even starts, and your cluster token lands in a public bucket. Frontend traffic stays normal while the background pipeline drains secrets.

Background workers rarely get the same security scrutiny as web handlers, yet they often run with broader IAM roles, filesystem access, and network permissions. This article breaks down how to sandbox Celery (Python) and Sidekiq (Ruby) tasks, limit blast radius, and add regression tests that catch dangerous patterns before they ship.

The Problem Deep Dive

Developers optimize workers for throughput and reliability, and security gaps accumulate as a side effect:

Shared workers. Multiple queues share the same container, so a low-privilege task runs next to the credentials and code paths of high-privilege ones.
Shell-outs and FFI. Tasks call system commands or native libraries. Input validation is inconsistent.
Broad IAM permissions. Workers need to touch S3, queues, and internal APIs, and their tokens often grant admin-level access that no single task needs.
Long-lived credentials. Workers load secrets at boot and never rotate them.
Lack of observability. Telemetry aggregates by queue, not task ID, making it hard to trace malicious behavior.

Example Celery task anti-pattern:

@app.task
def convert_image(input_url, cmd):
    path = download(input_url)
    subprocess.run(f"convert {path} {cmd}", shell=True, check=True)
    return upload(path)

The task performs no sanitization, passes a caller-supplied string straight to the shell, and runs as root inside the container.
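
To make the failure mode concrete, here is a minimal sketch that uses echo as a harmless stand-in for convert. The filename value is a hypothetical attacker input, and the example assumes a POSIX system with /bin/sh and echo available.

```python
import subprocess

# Hypothetical attacker-controlled value; a harmless stand-in for the token-reading payload.
filename = "$(echo INJECTED).png"

# shell=True: /bin/sh performs the command substitution before echo ever runs.
unsafe = subprocess.run(
    f"echo {filename}", shell=True, capture_output=True, text=True, check=True
).stdout.strip()

# Argument list: no shell is involved, so the string reaches echo verbatim.
safe = subprocess.run(
    ["echo", filename], capture_output=True, text=True, check=True
).stdout.strip()

print(unsafe)  # INJECTED.png
print(safe)    # $(echo INJECTED).png
```

The first call executes the embedded command; the second treats it as inert text. Swap echo INJECTED for cat on a mounted secret and the first form becomes the exfiltration path described in the hook.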

Technical Solutions

Quick Patch: Drop Privileges and Block Shell

Run workers as non-root users. In Dockerfile: USER worker.
Replace shell=True with argument lists:

subprocess.run(["convert", path, *shlex.split(cmd)], check=True)  # needs "import shlex"; cmd is split into argv tokens, no shell runs

Add allow lists for command-line flags.
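
One way to express such an allow list is sketched below. The flag set, the value patterns, and the function names build_convert_argv and run_convert are illustrative assumptions, not a recommended policy for ImageMagick; the point is that anything not explicitly listed is rejected before a process is spawned.

```python
import re
import subprocess

# Hypothetical allow list: flag -> pattern its single value must fully match.
ALLOWED_FLAGS = {
    "-resize": re.compile(r"[0-9]{1,4}%?"),
    "-quality": re.compile(r"[0-9]{1,2}|100"),
    "-rotate": re.compile(r"-?[0-9]{1,3}"),
}


def build_convert_argv(src, dst, options):
    """Return an argv list for convert, or raise ValueError on anything unlisted.

    src and dst are assumed to be server-generated paths, never user input.
    options is a list of (flag, value) string pairs supplied by the caller.
    """
    argv = ["convert", src]
    for flag, value in options:
        pattern = ALLOWED_FLAGS.get(flag)
        if pattern is None or not isinstance(value, str) or not pattern.fullmatch(value):
            raise ValueError(f"rejected option: {flag!r} {value!r}")
        argv += [flag, value]
    argv.append(dst)
    return argv


def run_convert(src, dst, options):
    # No shell; the timeout keeps a stuck conversion from pinning the worker.
    subprocess.run(build_convert_argv(src, dst, options), check=True, timeout=60)
```

Accepting structured (flag, value) pairs instead of a free-form string means there is nothing to tokenize and nothing for a shell to interpret.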

Durable Fix: Per-Queue Sandboxing

Run sensitive queues in dedicated pods with scoped IAM roles and file systems.

Celery (Kubernetes):

Define separate deployments per queue (high_priv, low_priv).
Use Kubernetes ServiceAccount per deployment with minimal RBAC.
Mount only required volumes.
Apply a seccomp profile to filter dangerous syscalls and an AppArmor profile to restrict file and capability access.
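
On the Celery side, the queue split can be expressed as a routing table. The sketch below is a plain configuration module using Celery's task_routes and task_default_queue settings; the module name and the task names tasks.convert_image and tasks.rotate_credentials are hypothetical.

```python
# celeryconfig.py: a sketch, loaded with app.config_from_object("celeryconfig").
# Task names are hypothetical; queue names follow the deployments described above.

# Anything not routed explicitly falls into the low-privilege queue.
task_default_queue = "low_priv"

task_routes = {
    # Handles untrusted input, so it stays in the low-privilege deployment.
    "tasks.convert_image": {"queue": "low_priv"},
    # Needs elevated credentials, so it runs only in the high-privilege deployment.
    "tasks.rotate_credentials": {"queue": "high_priv"},
}
```

Each deployment then starts its worker with a single queue, for example celery -A proj worker -Q low_priv, so a pod never consumes jobs from the other trust zone (proj is a placeholder application name).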

Sidekiq:

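Route jobs with sidekiq_options queue: "high_priv" in the job class, and run each queue in its own Sidekiq process inside a dedicated pod.
Start each process with only its own queue, for example bundle exec sidekiq -q low_priv -c 10, and add a --tag label so the process is easy to identify in monitoring.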

OS-Level Guards

Enable seccomp to restrict syscalls (for example, block ptrace and clone calls that create new namespaces).
Use ulimit to cap file descriptors and core dumps.
Mount /tmp as noexec and nosuid.
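
When a task must spawn a subprocess, the same ulimit caps can be applied to that child alone from Python. This is a minimal sketch for POSIX systems; the helper name run_limited and the cap of 64 descriptors are assumptions. Python's documentation warns that preexec_fn is not safe in multi-threaded processes, so treat this as suitable for prefork-style workers only.

```python
import resource
import subprocess

MAX_OPEN_FILES = 64  # hypothetical cap; tune per task


def _apply_limits():
    # Runs in the child after fork and before exec (POSIX only).
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))  # disable core dumps
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never try to raise the hard limit; only lower it.
    cap = MAX_OPEN_FILES if hard == resource.RLIM_INFINITY else min(MAX_OPEN_FILES, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (cap, cap))


def run_limited(argv, timeout=60):
    """Run argv without a shell, with rlimits applied to the child only."""
    return subprocess.run(
        argv,
        preexec_fn=_apply_limits,
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
```

The limits apply to the spawned process, not to the worker itself, so one misbehaving conversion cannot exhaust descriptors for every other task in the pod.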

Network Segmentation

Route workers through egress proxies. Block direct internet access for high-privilege queues. Allow only necessary destinations.
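
Network policy is the real control here, but an application-level check on outbound URLs is a cheap complement, especially for tasks that download caller-supplied URLs. The sketch below assumes a hypothetical allow list and helper name; it does not replace the egress proxy.

```python
from urllib.parse import urlsplit

# Hypothetical allow list of hosts this queue may fetch from.
ALLOWED_HOSTS = {"uploads.example.com"}


def check_destination(url):
    """Return url unchanged if it targets an allowed HTTPS host, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    # hostname strips userinfo and port, so "allowed@evil" tricks resolve to the real host.
    host = (parts.hostname or "").lower()
    if host not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {host!r}")
    return url
```

Calling check_destination(input_url) before any download keeps a task from being steered toward metadata endpoints or internal services even if a network rule is misconfigured.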

Task Sandbox Wrappers

Wrap tasks with decorators that enforce policies:

import functools

# feature_flags is the application's own flag client (not shown here).
def require_feature_flag(flag):
    def wrapper(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            if not feature_flags.is_enabled(flag):
                raise RuntimeError("disabled task")
            return func(*args, **kwargs)
        return inner
    return wrapper

Add @require_feature_flag("image_conversion") to risky tasks so you can disable them quickly. Place it below @app.task so that Celery registers the wrapped function rather than the original.

Credential Handling

Use short-lived AWS/GCP tokens via IRSA or Workload Identity.
Fetch secrets per task invocation and cache them only for a short TTL.
Avoid writing credentials to disk.
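
The short-TTL cache can be as small as the sketch below. The class name ShortLivedSecret and the fetch callable are assumptions; in practice fetch would wrap whatever secret store or token endpoint you use. The value lives only in memory, and the class is not thread-safe as written.

```python
import time


class ShortLivedSecret:
    """Hold one secret in memory and refetch it once ttl_seconds have passed."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable that returns the current secret
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Expired or never loaded: fetch again instead of reusing a stale value.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

A task calls get() each time it needs the credential, so a rotated or revoked secret stops being used within one TTL instead of surviving until the worker restarts.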

Observability

Tag logs with task IDs and queue names.
Emit metrics for subprocess usage, retries, and long-running tasks.
Set alerts on unusual outbound traffic from worker nodes.
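
For the log tagging, the standard library is enough. The sketch below uses logging.LoggerAdapter to stamp each record with task context; the logger name worker.tasks and both helper names are assumptions. In a Celery task declared with bind=True, the ID is available as self.request.id.

```python
import logging

LOGGER_NAME = "worker.tasks"  # hypothetical logger name


def configure_logging(stream=None):
    """Attach a handler whose format includes queue, task name, and task ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s queue=%(queue)s task=%(task)s id=%(task_id)s %(message)s"
    ))
    base = logging.getLogger(LOGGER_NAME)
    base.addHandler(handler)
    base.setLevel(logging.INFO)
    base.propagate = False  # avoid duplicate lines from the root logger


def task_logger(task_id, queue, task):
    """Return a logger that adds the task context to every record it emits."""
    context = {"task_id": task_id, "queue": queue, "task": task}
    return logging.LoggerAdapter(logging.getLogger(LOGGER_NAME), context)
```

With this in place, task_logger("abc123", "low_priv", "convert_image").info("spawned subprocess") produces a line that can be filtered by task ID rather than only by queue.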

Alprina Policies

Scan repositories for subprocess.run(..., shell=True) or system() calls. Flag Dockerfiles that run workers as root. Ensure each deployment uses distinct service accounts.
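
As an illustration of the kind of check involved (this is a sketch, not Alprina's implementation), Python's ast module can flag the two patterns in a source file. The function name find_shell_calls is an assumption, and the check only catches a literal shell=True, not a value passed through a variable.

```python
import ast


def find_shell_calls(source, filename="<unknown>"):
    """Return sorted (line, reason) pairs for shell=True calls and system() calls."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # subprocess.run -> "run"; a bare name like system -> "system".
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
        for kw in node.keywords:
            is_true = isinstance(kw.value, ast.Constant) and kw.value.value is True
            if kw.arg == "shell" and is_true:
                findings.append((node.lineno, f"{name}(..., shell=True)"))
        if name == "system":
            findings.append((node.lineno, "system() call"))
    return sorted(findings)
```

Run over every .py file in CI, a non-empty result can fail the build or require an explicit security review before merge.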

Testing & Verification

Write unit tests that reject unsafe commands:

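import pytest

# sanitize_cmd is the project's own validator; it should raise ValueError on shell metacharacters.
def test_disallows_shell_metachars():
    with pytest.raises(ValueError):
        sanitize_cmd("$(rm -rf /)")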

Integration tests: run tasks inside a kind cluster with a tightened security context. Attempt to read /var/run/secrets/... from inside a task and assert that the read fails, either with a permission error or because the token was never mounted.
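
A probe for that integration test might look like the sketch below. It uses the service account token path from the opening scenario; the helper name token_is_unreadable is an assumption, and the test is meant to run inside the worker pod, not on a developer machine.

```python
from pathlib import Path

# Token path from the opening scenario.
TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


def token_is_unreadable(path=TOKEN_PATH):
    """Return True if reading path fails because it is unmounted or forbidden."""
    try:
        path.read_bytes()
    except (PermissionError, FileNotFoundError):
        return True
    return False


def test_worker_cannot_read_service_account_token():
    # Expected to pass only inside a pod with the hardened security context.
    assert token_is_unreadable()
```

Accepting both error types keeps the test valid whether the deployment disables token mounting or restricts file permissions.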

Security harness: use Open Policy Agent or Kyverno to enforce pod security at admission time.

Chaos drills: intentionally break one queue to ensure others continue. Rotate IAM roles and verify tasks fail gracefully when creds expire.

Common Questions & Edge Cases

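Is shell=False enough? No. Dropping the shell stops command substitution, but the binary still interprets its own arguments: a value that begins with - is parsed as an option, and a crafted path can point the tool at files it should never read or write. Validate arguments rigorously.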
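
For file paths specifically, a validation step can confine user-supplied names to a working directory. The sketch below is one possible approach; the helper name safe_input_path is an assumption, the colon check is a conservative guard against ImageMagick's format:filename prefix syntax, and is_relative_to requires Python 3.9 or newer.

```python
from pathlib import Path


def safe_input_path(user_name, work_dir):
    """Resolve user_name inside work_dir, or raise ValueError."""
    if not user_name:
        raise ValueError("empty file name")
    if user_name.startswith("-"):
        raise ValueError("file name would be parsed as an option")
    if ":" in user_name:
        # Conservative: avoids prefix forms such as "format:filename".
        raise ValueError("file name contains ':'")
    root = Path(work_dir).resolve()
    candidate = (root / user_name).resolve()
    # Rejects absolute paths and ../ traversal after symlinks are resolved.
    if not candidate.is_relative_to(root):
        raise ValueError("path escapes the working directory")
    return candidate
```

Passing the returned absolute path to the binary also removes any chance that a leading dash is read as a flag.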

How do we deal with legacy queues? Fence old queues behind feature flags. Migrate tasks incrementally to sandboxed deployments.

What about background jobs in serverless (EventBridge, Cloud Tasks)? Apply similar principles: least privilege, per-function IAM, and input validation.

Are containers sufficient? Containers help, but they always share the host kernel; seccomp and AppArmor narrow the kernel surface a task can reach without removing it. Consider gVisor or Firecracker for untrusted workloads.

Monitoring overhead? Start with basic metrics (tasks per queue, error rate) and add more granular telemetry as you identify risk hotspots.

Conclusion

Async workers deserve the same security attention as your HTTP handlers. Separate trust zones per queue, constrain shell access, rotate credentials, and instrument the pipeline. When the next risky task lands in review, you will have both tooling and confidence to ship it safely.

Alprina Blog