Anthropic has introduced prompt caching on its API, which remembers the context between API calls and allows developers to avoid resending repetitive prompts.
The prompt caching feature is available in public beta on Claude 3.5 Sonnet and Claude 3 Haiku, with support for the largest Claude model, Opus, coming soon.
Prompt caching, described in this 2023 paper, allows users to retain frequently used context across their sessions. Because the model remembers these prompts, users can add additional background information without increasing costs. This is useful when someone wants to send a large amount of context in a single prompt and then refer back to it across multiple conversations with the model. It also lets developers and other users better tune model responses.
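For developers, the feature surfaces as a flag on blocks of prompt content. Below is a minimal sketch of how a large, reusable context block might be marked for caching, assuming the Anthropic Python SDK and the beta cache_control parameter and header described in Anthropic's documentation (names may change while the feature is in beta):

```python
# Sketch: mark a large, reusable context block for prompt caching so that
# subsequent calls repeating it read from the cache instead of paying the
# full base input-token price. Assumes the Anthropic Python SDK and the
# prompt-caching beta header from Anthropic's docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

knowledge_base = open("knowledge_base.txt").read()  # large, frequently reused context

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {
            "type": "text",
            "text": knowledge_base,
            # Marks this block as cacheable for later calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key policies."}],
)
print(response.content[0].text)
```

On a second call that repeats the same cached block, only the new user message is billed at the base rate; the cached portion is billed at the much lower cache-read price discussed below.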
Anthropic said early users have “seen significant speed and cost improvements with prompt caching for a variety of use cases – from incorporating a full knowledge base to 100-shot examples to incorporating every conversation turn into their prompt.”
Potential use cases include reducing cost and latency for long instructions and uploaded documents in conversational agents, faster code autocompletion, supplying multiple instructions to agentic search tools, and embedding entire documents in a prompt, according to the company.
Cached prompt pricing
One advantage of cached prompts is a lower price per token. According to Anthropic, using cached prompts is “significantly cheaper” than the base input token price.
For Claude 3.5 Sonnet, writing a prompt to the cache costs $3.75 per 1 million tokens (MTok), but reading a cached prompt costs $0.30 per MTok. The base input price for Claude 3.5 Sonnet is $3/MTok, so by paying a bit more up front you can expect roughly 10x savings each subsequent time you use the cached prompt.
Claude 3 Haiku users pay $0.30/MTok for caching and $0.03/MTok when using saved prompts.
While prompt caching is not yet available for Claude 3 Opus, Anthropic has already published its pricing: writing to the cache will cost $18.75/MTok, while reading from a cached prompt will cost $1.50/MTok.
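To see how those figures translate into savings, here is a rough cost sketch. The Claude 3.5 Sonnet numbers are the ones quoted above; the base input prices for Haiku and Opus are not quoted in this article, so the values used here are assumptions based on Anthropic's published rates at the time:

```python
# Rough cost sketch: sending the same prompt repeatedly, with and without
# prompt caching, using $/MTok figures. Base input prices for Haiku ($0.25)
# and Opus ($15) are assumptions, not quoted in the article above.
PRICES = {
    # model: (base input, cache write, cache read), all in $ per 1M tokens
    "claude-3.5-sonnet": (3.00, 3.75, 0.30),
    "claude-3-haiku":    (0.25, 0.30, 0.03),
    "claude-3-opus":     (15.00, 18.75, 1.50),
}

def prompt_cost(model: str, prompt_mtok: float, calls: int) -> tuple[float, float]:
    """Return (uncached, cached) cost of sending the same prompt `calls` times."""
    base, write, read = PRICES[model]
    uncached = base * prompt_mtok * calls
    cached = write * prompt_mtok + read * prompt_mtok * (calls - 1)
    return uncached, cached

# A 100K-token context sent 10 times with Claude 3.5 Sonnet:
print(prompt_cost("claude-3.5-sonnet", prompt_mtok=0.1, calls=10))
# -> approximately (3.0, 0.645): about 4-5x cheaper overall, approaching the
#    10x per-call saving as the number of reuses grows.
```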
However, as AI influencer Simon Willison noted on X, Anthropic's cache only has a lifespan of 5 minutes and is refreshed each time it’s used.
Of course, this is not the first time Anthropic has competed with other AI platforms on price. Before the release of the Claude 3 model family, Anthropic lowered its token prices.
The company is currently in a race to the bottom against competitors like Google and OpenAI to offer low-cost options to third-party developers building on its platform.
Frequently requested feature
Other platforms offer a version of prompt caching. Lamina, an LLM inference system, uses KV caching to reduce GPU costs. A cursory look at the OpenAI developer forums or GitHub turns up plenty of questions about prompt caching.
Prompt caching is not the same as large language model memory. For example, OpenAI's GPT-4o offers a memory feature in which the model remembers preferences or details. However, it does not store the actual prompts and responses the way prompt caching does.