Self-Hosting a 70B Model on a Single H100: A 2026 Playbook
Yes, you can serve Llama 4 70B on one H100 at production speed. Quantization, serving stack, throughput tuning, and the operational realities.
Self-hosting a 70-billion-parameter LLM was a research project in 2023. By 2026 it's a routine deployment for teams with the right constraints: IP-sensitive code work, EU data residency, predictable inference costs at high volume, or just a preference for owning the stack. A single H100 80GB GPU is enough to serve a 70B-class model at production speed if you choose the right quantization and serving stack. This piece is the working playbook.
Why self-host at all
Three legitimate reasons:
Data residency. Your code, your customer data, your internal documents stay on infrastructure you control. For regulated industries (finance, healthcare, defense), this is often the only acceptable path. For European companies, it solves the GDPR / EU AI Act problem cleanly.
Cost predictability at high volume. Above ~10M tokens/day, self-hosting on a single H100 is dramatically cheaper than commercial APIs. The hardware amortizes against your throughput; you stop paying per token.
Latency control. Self-hosted inference on dedicated hardware has predictable P99 latency. Commercial APIs have throttling, queueing, and tail-latency issues that you can't control.
If none of those apply, don't self-host. Pay Together AI or Fireworks for the same model on shared infrastructure.
What "single H100" actually constrains
An H100 80GB gives you 80GB of HBM3 at ~3TB/s bandwidth and ~2 PFLOPS of FP8 compute (the sparsity-accelerated figure). The constraints that matter for 70B serving:
Memory. A 70B model in FP16 is 140GB; it doesn't fit. In FP8 it's 70GB, which fits but leaves almost no room for KV cache. In INT4 quantization it's 35-40GB, which fits with comfortable headroom for the KV cache. (A back-of-envelope sizing sketch follows this list.)
Memory bandwidth. 70B at FP8 is bandwidth-bound, not compute-bound. Throughput scales with HBM bandwidth, not raw FLOPS. This is why H100 → H200 (which has 4.8TB/s vs 3TB/s bandwidth) is a much bigger upgrade for self-hosting than the FLOPS difference suggests.
Compute. Plenty for 70B serving. Compute isn't the binding constraint.
Concurrency. A serving stack with continuous batching can handle 30-50 concurrent requests on a single H100 with a 70B model. Above that, latency degrades. Plan capacity accordingly.
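Here's the sizing sketch promised above. The architecture numbers are assumptions (a Llama-3-70B-style config: 80 layers, GQA with 8 KV heads, head dim 128, FP16 KV cache); substitute the values from your model's config.json.

# Back-of-envelope memory sizing for a 70B model on one H100 80GB.
# Layer/head counts are assumptions (Llama-3-70B-style); check config.json.
PARAMS = 70e9
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BYTES = 2  # FP16 KV cache

def weights_gb(bits):
    return PARAMS * bits / 8 / 1e9

# Bytes of KV cache per token: K and V, per layer, per KV head
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES
budget_gb = 80 * 0.92  # the slice vLLM claims at --gpu-memory-utilization 0.92

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    w = weights_gb(bits)  # real INT4 checkpoints add scale/zero-point overhead
    kv_gb = budget_gb - w
    if kv_gb <= 0:
        print(f"{name}: weights {w:.0f}GB -- does not fit")
        continue
    tokens = kv_gb * 1e9 / kv_per_token
    print(f"{name}: weights {w:.0f}GB, {kv_gb:.0f}GB for KV ~= {tokens / 1e3:.0f}K cached tokens")

Under these assumptions, INT4 leaves room for roughly 118K cached tokens, enough for a few dozen concurrent chats with a few thousand live tokens each, which is where the 30-50 figure comes from. FP8's ~11K-token budget is why it reads as "fits but tight."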
Quantization choice
The single most important decision. Three options:
FP8. Native H100 support. Roughly 1% quality drop vs FP16. Memory footprint of ~70GB for 70B, fits but tight. Use this if you have specific quality requirements that INT4 can't meet.
INT4 (W4A16 or similar). Memory footprint of ~35-40GB. Quality drop of 2-5% on standard benchmarks. The right default for most workloads.
AWQ (Activation-aware Weight Quantization). A specific INT4 variant that preserves quality better than naive INT4. Memory footprint similar to INT4. Slightly slower than INT4 to load. Use this if you tested INT4 and saw too much quality drop.
For Llama 4 70B specifically, AWQ INT4 is the production sweet spot. Quality is within ~2% of FP16 on most evals; decode throughput is roughly 2x FP8 on the same hardware, since there are half as many weight bytes to stream per token (the bandwidth argument above).
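For Llama-class models you'll usually find a ready-made AWQ checkpoint on the Hub, but if you need to quantize your own fine-tune, AutoAWQ is the standard tool. A minimal sketch against AutoAWQ's documented API; the source path is a placeholder, and quantizing a 70B takes hundreds of GB of CPU RAM, a GPU, and several hours:

# Quantize an FP16 checkpoint to AWQ INT4 with AutoAWQ.
# Paths are placeholders; calibration uses AutoAWQ's default dataset.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

src = "meta-llama/Llama-4-70B-Instruct"  # hypothetical FP16 checkpoint
dst = "./llama-4-70b-awq"

model = AutoAWQForCausalLM.from_pretrained(src, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(src)

# 4-bit weights at group size 128 -- the common AWQ recipe
model.quantize(tokenizer, quant_config={
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"
})
model.save_quantized(dst)
tokenizer.save_pretrained(dst)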
Serving stack
Three credible options:
vLLM
The most-used open-source serving stack. Mature, well-documented, supports continuous batching, paged attention, prefix caching, and most quantization schemes. Active development.
vLLM throughput on Llama 4 70B (AWQ INT4, single H100): roughly 4,000-6,000 tokens/sec sustained, serving ~30-50 concurrent users with p95 first-token latency under 100ms.
Use vLLM unless you have a specific reason not to.
SGLang
Younger, faster on some workloads. Stronger on structured-output generation (constrained decoding, JSON mode). Smaller community than vLLM, slightly less mature.
Use SGLang if your workload is heavy on structured outputs and you're OK with the smaller community.
TensorRT-LLM
Nvidia's official inference framework. Highest peak throughput on Nvidia hardware. The TensorRT-LLM layer itself is open source, but it sits on the closed-source TensorRT runtime; less flexible and more complex to set up.
Use TensorRT-LLM if you have Nvidia engineering support and need the absolute peak throughput.
Setup walkthrough
Concrete commands for vLLM + Llama 4 70B AWQ on a single H100:
# Install vLLM (assumes Python 3.10+, CUDA 12.x)
pip install vllm
# Download the model (assumes you have HF access to Llama 4)
huggingface-cli download meta-llama/Llama-4-70B-Instruct-AWQ \
--local-dir ./llama-4-70b-awq
# Start the server
vllm serve ./llama-4-70b-awq \
--quantization awq \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--port 8000
The server exposes an OpenAI-compatible API on port 8000. Test:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "./llama-4-70b-awq",
"messages": [{"role": "user", "content": "Hello"}]
}'
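In application code you'll usually hit the endpoint through the OpenAI client library rather than curl. The same request in Python (a sketch assuming the openai package, v1 or later):

# Point the standard OpenAI client at the local vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # vLLM ignores the key unless you set one
resp = client.chat.completions.create(
    model="./llama-4-70b-awq",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)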
You're now serving Llama 4 70B from a single H100. Total setup time on a clean machine: about 30 minutes, dominated by the model download.
Throughput tuning
Three levers that move throughput most:
Batch size. vLLM auto-tunes this with continuous batching, but you can cap --max-num-seqs and --max-num-batched-tokens when you need to trade throughput against latency explicitly.
Max model length. Reducing --max-model-len from 128K to 32K significantly improves memory efficiency, since the server budgets KV cache against the worst-case sequence length. Most workloads don't need 128K; set the value to your actual P99 conversation length.
Speculative decoding. vLLM can run a small draft model alongside the main model; the draft proposes tokens and the main model verifies them in a single pass. With Llama 4 8B drafting for the 70B, throughput improves by ~30-50% on common workloads.
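Whichever levers you pull, measure before and after under realistic concurrency. A minimal load probe, assuming the server started above and the openai async client; it counts stream chunks as a proxy for tokens, which is close enough for comparisons:

# N concurrent streaming chats; report p95 first-token latency and
# aggregate decode rate. Chunk counts approximate token counts.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
CONCURRENCY = 32
PROMPT = "Summarize the plot of Hamlet in three sentences."

async def one_request():
    start = time.perf_counter()
    first_token, chunks = None, 0
    stream = await client.chat.completions.create(
        model="./llama-4-70b-awq",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.perf_counter() - start
            chunks += 1
    return first_token, chunks

async def main():
    t0 = time.perf_counter()
    results = await asyncio.gather(*[one_request() for _ in range(CONCURRENCY)])
    wall = time.perf_counter() - t0
    ttfts = sorted(r[0] for r in results if r[0] is not None)
    total = sum(r[1] for r in results)
    print(f"p95 first-token: {ttfts[int(len(ttfts) * 0.95)]:.3f}s")
    print(f"aggregate decode: {total / wall:.0f} chunks/s at {CONCURRENCY} concurrent")

asyncio.run(main())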
Operational realities
Four things that matter at scale:
Monitoring. Track P50/P95/P99 latency, throughput, KV cache utilization, GPU memory pressure, and request queue depth. vLLM exposes these via Prometheus. Set alerts on KV cache utilization (>85% sustained means you're about to start queueing requests); a minimal scrape check follows this list.
Failover. A single H100 is a single point of failure. Production deployments typically run two H100s with a load balancer; failover is fast because models are cached in CPU memory and reload to GPU in seconds.
Updates. New model checkpoints (Llama 4.1, etc.) typically require a cold restart of the serving process. Plan a maintenance window or run blue/green.
Cost. A single H100 in a colocated datacenter costs roughly $30K-40K capex. Cloud rental (AWS p5.48xlarge, GCP a3-highgpu-8g, both 8-GPU instances) runs ~$30-40/hour per instance, roughly $3K/month per GPU. At sustained load, owning hardware pays back in 18-24 months once colocation and power are counted.
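Here's the scrape check mentioned under Monitoring. The metric names (vllm:gpu_cache_usage_perc, vllm:num_requests_waiting) match recent vLLM releases but aren't a stable contract; verify against your own /metrics output:

# Spot-check KV cache pressure and queue depth from vLLM's /metrics endpoint.
import urllib.request

metrics = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
watch = ("vllm:gpu_cache_usage_perc", "vllm:num_requests_waiting")
for line in metrics.splitlines():
    if line.startswith(watch):
        print(line)  # e.g. vllm:gpu_cache_usage_perc{...} 0.42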
When to scale beyond a single H100
Three signals to add capacity:
- Sustained KV cache utilization >85%. You're queueing requests; latency is suffering.
- P95 first-token latency >500ms. Users notice; conversational workloads suffer most.
- Concurrent users >50 sustained. You're at vLLM's practical concurrency ceiling on a single H100.
The natural next step is two H100s behind a load balancer, then H200s when the price-performance crosses over (roughly mid-2026 for most workloads). For very high scale, tensor parallelism across 4-8 GPUs unlocks the larger models (Llama 4 405B, DeepSeek-V3), but the operational complexity steps up significantly.
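If you run the two-box setup, failover can live in the client while you stand up a proper load balancer. A sketch with placeholder hostnames, assuming the openai package; nginx or HAProxy in front is the usual production answer:

# Client-side failover for a two-H100 deployment: try the primary,
# fall back to the secondary on connection errors.
from openai import OpenAI, APIConnectionError

ENDPOINTS = ["http://gpu-a:8000/v1", "http://gpu-b:8000/v1"]  # placeholders

def chat(messages):
    last_err = None
    for base in ENDPOINTS:
        client = OpenAI(base_url=base, api_key="unused", timeout=30)
        try:
            return client.chat.completions.create(
                model="./llama-4-70b-awq", messages=messages)
        except APIConnectionError as e:  # server down or unreachable
            last_err = e
    raise last_err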
What you don't get
Self-hosting Llama 4 70B is great for a lot of workloads. It's not great for:
- Frontier reasoning. GPT-5.5, Claude Opus 4.7, and o3 will outperform Llama 4 70B on the hardest reasoning problems.
- Vision-heavy work. Llama 4 70B has limited vision; for serious vision work, use Gemini 3 Pro or Qwen2-VL.
- Long-context (>128K). Llama 4's context tops out below Gemini's. For 1M-token workloads, use commercial APIs.
- Real-time voice. The latency story is different for voice. Use the voice agent architecture instead.
Concrete recommendation
If you have a workload that fits the self-hosting case (data residency, cost predictability at scale, latency control), here's the lean path:
- Hardware: single H100 80GB. Two if you need failover.
- Model: Llama 4 70B AWQ INT4. Reach for Llama 4 405B only if quality demands it (and accept the multi-GPU complexity).
- Serving: vLLM with continuous batching, max-model-len=32K, GPU memory utilization=0.92.
- Operations: Prometheus monitoring, sensible alerts, two-GPU failover, planned model updates.
- Iteration: measure P99 latency and throughput on real traffic; tune from there.
This setup will serve you up to roughly 50 concurrent users at production quality. Above that, scale horizontally before reaching for bigger hardware.
Further reading
- The Real Economics of Self-Hosting LLMs in 2026
When self-hosting beats commercial APIs on cost, when it doesn't, and the operational realities most teams underweight.
- Meta's Open-Weight Strategy: How Llama Reshaped the Frontier
Meta gave away frontier-quality model weights for two years straight. We unpack the strategic logic, what Llama 4 actually changed, and what's next for the open-weight ecosystem.
- The State of Open-Source LLM Tooling in 2026
What's actually production-ready vs research-grade across the open-weight serving, training, fine-tuning, and observability stack.