See which GPUs fit your model, compare cloud and local costs, and find out when self-hosting breaks even against the API.
An example monthly cost comparison:

| Option | Monthly cost | Notes |
|---|---|---|
| API (GPT-4o mini) | $4.50 | — |
| Cloud GPU (spot) | $28.80 | NVIDIA T4 16GB |
| Local GPU (electricity only) | $10.08 | plus $700.00 upfront |

At this volume, the cloud GPU (spot) option costs more than the API.
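To make the arithmetic behind a comparison like this explicit, here is a minimal Python sketch. The token volume, API price, spot rate, and GPU power draw are illustrative assumptions, not the exact inputs behind the figures above.

```python
# Rough monthly cost comparison: API vs rented cloud GPU vs local GPU.
# All inputs below are illustrative assumptions, not figures from this page's calculator.

def monthly_costs(
    tokens_per_month: float = 30_000_000,   # assumed volume (~1M tokens/day)
    api_price_per_1m: float = 0.15,          # assumed API price per 1M tokens
    cloud_gpu_spot_per_hour: float = 0.04,   # roughly $28.80/mo for a T4 spot instance
    gpu_watts: float = 120,                  # assumed average draw of a local card
    electricity_per_kwh: float = 0.12,
    hardware_upfront: float = 700.00,        # e.g. a used RTX 3090
):
    hours = 24 * 30
    api = tokens_per_month / 1_000_000 * api_price_per_1m
    cloud = cloud_gpu_spot_per_hour * hours
    electricity = gpu_watts / 1000 * hours * electricity_per_kwh
    return {
        "API ($/mo)": round(api, 2),
        "Cloud GPU spot ($/mo)": round(cloud, 2),
        "Local electricity ($/mo)": round(electricity, 2),
        "Local upfront ($)": hardware_upfront,
    }

print(monthly_costs())
```

The table below lists typical monthly rental rates and purchase prices for common GPUs.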
| GPU | VRAM (GB) | Provider | On-Demand / mo | Spot / mo | Purchase Price |
|---|---|---|---|---|---|
| NVIDIA T4 16GB | 16 | GCP / AWS g4dn | $84.00 | $28.80 | — |
| NVIDIA A10G 24GB | 24 | AWS g5 instances | $180.00 | $72.00 | — |
| NVIDIA L4 24GB | 24 | GCP | $192.00 | $84.00 | — |
| NVIDIA L40S 48GB | 48 | Lambda Labs | $480.00 | $192.00 | — |
| NVIDIA A100 40GB | 40 | Lambda Labs | $504.00 | $216.00 | — |
| NVIDIA A100 80GB | 80 | AWS / GCP / Azure | $768.00 | $288.00 | — |
| NVIDIA H100 80GB | 80 | CoreWeave / Lambda | $1,080.00 | $480.00 | — |
| NVIDIA RTX 4090 24GB (local) | 24 | Local / On-prem | — | — | $1,699.00 |
| NVIDIA RTX 3090 24GB (local) | 24 | Local / On-prem | — | — | $700.00 |
| NVIDIA A6000 48GB (local) | 48 | Local / On-prem | — | — | $4,000.00 |
* Cloud prices are approximate monthly rates as of early 2026; spot prices fluctuate. Local GPU prices are approximate US market prices. Electricity is assumed at $0.12/kWh.
Quantization reduces model weights from 32- or 16-bit floats to 8- or 4-bit integers, dramatically cutting VRAM requirements. Q4 quantization (4-bit) shrinks weight memory by roughly 75% versus FP16 (about 8x smaller than FP32) with minimal quality loss for most tasks. For production use, Q4_K_M is the commonly recommended balance of size and quality.
A 70B model in Q4 quantization requires ~42GB VRAM (including KV cache). This fits on a single A100 80GB or requires two 24GB GPUs (like RTX 4090s) with tensor parallelism. A single RTX 3090/4090 can run 7B–13B models comfortably.
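A back-of-the-envelope way to check these numbers: weight memory is roughly the parameter count times bytes per weight, plus a few gigabytes for the KV cache. The per-weight sizes and KV-cache allowances in this sketch are rough assumptions; actual usage depends on the runtime, context length, and quantization variant.

```python
# Back-of-the-envelope VRAM estimate: quantized weights plus KV cache.
# Per-weight sizes and KV-cache overhead are rough assumptions.

BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_billion: float, quant: str = "q4",
                     kv_cache_gb: float = 6.0) -> float:
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]  # 1B params * bytes/param ~ GB
    return weights_gb + kv_cache_gb

print(estimate_vram_gb(70))                  # ~41 GB: one A100 80GB or two 24GB cards
print(estimate_vram_gb(7, kv_cache_gb=2.0))  # ~5.5 GB: fine on a single 3090/4090
```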
Cloud GPUs (AWS, Lambda Labs, RunPod) charge hourly and require no upfront investment. Local GPUs require $700–4,000+ upfront but have near-zero variable cost (just electricity ~$20–60/month). Local hardware breaks even in 3–18 months depending on utilization.
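The electricity figure is straightforward to estimate: average power draw times hours times the per-kWh rate. The wattages below are assumptions chosen to bracket the $20–60/month range; actual draw depends on the card and how heavily it is loaded.

```python
# Estimated monthly electricity cost for a local GPU running 24/7.
# Power-draw values are assumptions; idle draw is far lower than load draw.

def monthly_electricity(avg_watts: float, price_per_kwh: float = 0.12,
                        hours: float = 24 * 30) -> float:
    return avg_watts / 1000 * hours * price_per_kwh

print(monthly_electricity(250))  # ~$21.60/mo, e.g. a single card under light load
print(monthly_electricity(700))  # ~$60.48/mo, e.g. a heavily loaded dual-GPU box
```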
Popular options include: Ollama (easiest, local), llama.cpp (C++ runtime, very efficient), vLLM (production serving, GPU), TGI by Hugging Face (production), and LM Studio (GUI for local use). All of them run quantized models: Ollama, llama.cpp, and LM Studio use the GGUF format, while vLLM and TGI support formats such as GPTQ and AWQ.
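Most of these runtimes expose an OpenAI-compatible HTTP API, so moving from a hosted API to a self-hosted model is largely a matter of changing the base URL. A minimal sketch, assuming an Ollama server running locally with a Q4 model already pulled (the model tag and prompt are placeholders):

```python
# Minimal sketch: querying a locally hosted model through an OpenAI-compatible
# endpoint. Ollama serves one at http://localhost:11434/v1 by default; vLLM's
# server does the same on port 8000.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # point the client at the local server
    api_key="ollama",                      # any non-empty string; no real key needed
)

response = client.chat.completions.create(
    model="llama3.1:8b-instruct-q4_K_M",   # placeholder tag for a Q4-quantized model
    messages=[{"role": "user", "content": "Summarize why Q4 quantization saves VRAM."}],
)
print(response.choices[0].message.content)
```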
For cloud GPU rental there is no upfront cost, so self-hosting is often cheaper than the API from day one once usage exceeds roughly 300K tokens/day (varies by model and GPU). For local hardware, the break-even point is typically 3–12 months depending on hardware cost and daily token volume.
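Both break-even rules of thumb reduce to simple arithmetic. In the sketch below, the API price, GPU rental rate, hardware cost, and electricity figure are all assumptions to be replaced with your own numbers.

```python
# Break-even sketches against an API baseline. All prices are illustrative assumptions.

def breakeven_tokens_per_day(gpu_cost_per_day: float, api_price_per_1m: float) -> float:
    """Daily token volume above which renting a GPU beats the API outright."""
    return gpu_cost_per_day / api_price_per_1m * 1_000_000

def local_breakeven_months(upfront: float, api_cost_per_month: float,
                           electricity_per_month: float = 30.0) -> float:
    """Months until local hardware pays for itself via avoided API spend."""
    savings = api_cost_per_month - electricity_per_month
    return float("inf") if savings <= 0 else upfront / savings

# T4 spot at ~$0.04/hr vs an assumed mid-range API price of $3 per 1M tokens:
print(round(breakeven_tokens_per_day(0.04 * 24, 3.0)))   # ~320,000 tokens/day

# $700 used RTX 3090 vs an assumed $150/month API bill:
print(round(local_breakeven_months(700, 150.0), 1))       # ~5.8 months
```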