What is Token Limit?

Token Limit is the maximum number of tokens a model can process within a single request, including input and output. Token Limit constrains how much context can be provided and how long a response can be.
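Because the limit covers input and output together, a request must budget tokens for both. A minimal sketch of that arithmetic, assuming an illustrative window of 8192 tokens (not any specific model's real value):

```python
# Splitting a model's context window between prompt and response.
# The numbers and helper name below are assumptions for illustration.

CONTEXT_WINDOW = 8192    # total tokens the model can process per request
RESERVED_OUTPUT = 1024   # tokens set aside for the generated answer

def max_prompt_tokens(context_window: int, reserved_output: int) -> int:
    """Tokens left for the prompt after reserving room for the response."""
    return context_window - reserved_output

print(max_prompt_tokens(CONTEXT_WINDOW, RESERVED_OUTPUT))  # 7168
```

Reserving output tokens up front prevents a long prompt from leaving the model too little room to finish its answer.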

Quick definition

Token Limit is the maximum amount of text an AI model can handle at once.

How Token Limit works

  • Token Limit counts both prompt tokens and generated tokens.
  • Token Limit affects how much retrieved content can be included in RAG.
  • Token Limit can force truncation of long inputs or long conversation histories.
  • Token Limit interacts with prompt engineering: when space is tight, prompts must prioritize the most essential context.
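The truncation point above can be sketched concretely. The snippet below drops the oldest messages in a conversation history until what remains fits the limit; the whitespace-based token count is a rough stand-in for a real tokenizer, used here only for illustration:

```python
# Sketch of conversation-history truncation under a token limit.
# Real systems count tokens with the model's own tokenizer;
# len(text.split()) is a crude approximation (an assumption).

def count_tokens(text: str) -> int:
    return len(text.split())

def truncate_history(messages: list[str], limit: int) -> list[str]:
    """Keep the most recent messages whose combined tokens fit the limit."""
    kept, total = [], 0
    for msg in reversed(messages):   # walk newest-first
        cost = count_tokens(msg)
        if total + cost > limit:
            break                    # oldest messages are dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))      # restore chronological order

history = ["hello there", "how are you today", "fine thanks"]
print(truncate_history(history, 6))  # ['how are you today', 'fine thanks']
```

Dropping from the oldest end preserves the most recent turns, which usually matter most for the next response.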

Why Token Limit matters

Token Limit matters because a constrained context can reduce answer quality and cause important details to be omitted from the response.

Token Limit also affects monitoring and testing, because prompts of varying length may exceed the limit and be truncated unless token counts are tracked.

Example use cases

  • Summarizing a long document by chunking content to fit the token limit.
  • Reducing prompt verbosity so the model can return a complete answer.
  • Limiting retrieved passages in RAG to avoid exceeding context limits.
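The first use case above, chunking a long document to fit the limit, can be sketched in a few lines. As before, the whitespace-based token count is an assumption standing in for a real tokenizer:

```python
# Sketch of document chunking: split a long text into pieces that each
# fit under the token limit so they can be summarized separately.
# Whitespace tokens approximate real tokens (an assumption).

def chunk_words(text: str, limit: int) -> list[str]:
    """Split text into chunks of at most `limit` whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + limit])
            for i in range(0, len(words), limit)]

doc = "one two three four five six seven"
print(chunk_words(doc, 3))  # ['one two three', 'four five six', 'seven']
```

Each chunk can then be summarized independently, and the partial summaries combined in a final pass that itself fits within the limit.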

Related terms