Question 1

What is RAG?

Accepted Answer

Retrieval-Augmented Generation (RAG) retrieves only the chunks of your documents that are relevant to a query, typically from a vector database, and passes just those chunks to the LLM. This keeps prompts small and per-query token costs low, but the retrieval infrastructure has to be built and maintained.
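
As a rough illustration of the retrieve-then-prompt flow, here is a minimal sketch. It assumes a toy bag-of-words `embed` function and an in-memory list in place of a real embedding model and vector database; all function names are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Assemble a prompt containing only the retrieved chunks."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Only the output of `build_prompt` would be sent to the LLM, so the prompt size depends on `k` and the chunk size rather than on the size of the whole corpus.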

Question 2

When is long context cheaper than RAG?

Accepted Answer

Long context is cheaper when the document count is very small (fewer than 10 documents), query volume is low, and the model's context window is large enough to hold everything (like Gemini Pro at 2M tokens). For simple one-off lookups, skipping RAG infrastructure makes sense.
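
The conditions above can be checked with some simple arithmetic. The sketch below assumes every query resends the full document set as input, and the price per million input tokens is a placeholder you would replace with your provider's actual rate; the function names and the `reserve` parameter are hypothetical.

```python
def fits_in_context(doc_tokens, context_window, reserve=8_000):
    """True if the documents fit, leaving `reserve` tokens for the question and answer."""
    return doc_tokens + reserve <= context_window

def long_context_monthly_cost(doc_tokens, queries_per_month, usd_per_million_input):
    """Monthly input cost when the whole document set is sent with every query."""
    return doc_tokens * queries_per_month * usd_per_million_input / 1_000_000

# Example with made-up numbers: 50k tokens of documents, 100 queries a month,
# and an assumed price of $2 per million input tokens.
monthly = long_context_monthly_cost(50_000, 100, 2.0)
```

With few documents and few queries, this figure stays small enough that no retrieval layer is needed to bring it down.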

Question 3

What are the hidden costs of RAG?

Accepted Answer

RAG requires: a vector database (Pinecone $0–700/mo, or pgvector self-hosted), an embedding model (usually cheap but adds up at scale), chunking and indexing infrastructure, and maintenance. For small document sets, these hidden costs can exceed the token savings.
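
To see when the fixed costs outweigh the token savings, a break-even query volume can be estimated. This is a sketch under simplifying assumptions: it counts only input tokens, treats all RAG infrastructure as a single fixed monthly figure, and uses a placeholder token price. The function and parameter names are hypothetical.

```python
import math

def break_even_queries(fixed_monthly_usd, doc_tokens, retrieved_tokens, usd_per_million_input):
    """Queries per month above which RAG's token savings cover its fixed costs.

    Returns None when RAG saves no tokens per query.
    """
    saved_tokens = doc_tokens - retrieved_tokens
    saving_per_query = saved_tokens * usd_per_million_input / 1_000_000
    if saving_per_query <= 0:
        return None
    return math.ceil(fixed_monthly_usd / saving_per_query)

# Made-up example: $70/mo of infrastructure, 50k tokens of documents,
# 4k tokens retrieved per query, assumed $2 per million input tokens.
threshold = break_even_queries(70, 50_000, 4_000, 2.0)
```

Below the returned query volume, sending the full documents each time is the cheaper option under these assumptions.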

Question 4

Can I use a hybrid approach?

Accepted Answer

Yes. Hybrid RAG uses a cheaper model (like GPT-4o mini) for retrieval ranking and a better model for final generation. This often gives the best quality-to-cost ratio, typically 3–5× cheaper than full long-context with a premium model.
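
The two-stage idea can be sketched as a small pipeline. Here `score_fn` stands in for a call to the cheaper ranking model and `generate_fn` for a call to the stronger generation model; both are hypothetical callables you would supply, and no real provider API is assumed.

```python
def hybrid_answer(query, candidates, score_fn, generate_fn, top_k=3):
    """Rank candidate chunks with a cheap scorer, then generate from the best ones.

    score_fn(query, chunk) -> number   (placeholder for the cheaper model)
    generate_fn(query, chunks) -> str  (placeholder for the premium model)
    """
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return generate_fn(query, ranked[:top_k])
```

Because the premium model only sees `top_k` chunks, its input cost is bounded regardless of how many candidates the cheaper model had to score.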

Question 5

How many tokens is a typical page of text?

Accepted Answer

A standard A4 or letter page of English text contains roughly 300–500 words, or about 400–650 tokens. PDFs with tables, images, or complex layouts may add extra overhead tokens.
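
For a quick sizing estimate, the per-page range above can be multiplied out. This sketch uses the 400–650 tokens-per-page figures from the answer and ignores layout overhead; the function name is hypothetical.

```python
TOKENS_PER_PAGE_LOW = 400   # lower bound for a page of English text
TOKENS_PER_PAGE_HIGH = 650  # upper bound for a page of English text

def estimate_tokens(pages):
    """Return a (low, high) token estimate for a number of plain-text pages."""
    return pages * TOKENS_PER_PAGE_LOW, pages * TOKENS_PER_PAGE_HIGH

low, high = estimate_tokens(10)  # a 10-page document
```

For exact counts you would run the actual text through the tokenizer of the model you plan to use, since token counts vary between models.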

RAG vs Long Context Calculator

RAG vs Long Context — FAQ