Batch Prompting (cost reduction for LLMs/ChatGPT)
What are tokens?
Pealing back the onion, LLMs speak in tokens. Computers understand numbers and are really good at performing calculations with them, but too many numbers (and too many calculations) slow performance down.
Enter tokens. Tokens are seemingly random chunks of letters (sometimes English words, sometimes not), each represented by a number. This compression from individual letters to chunks of letters reduces the number of calculations a computer needs to run an LLM, since one token may represent an entire word. OpenAI has a tokenizer playground for exploring how characters map to tokens.
Notice how ' back' or ' onion' are full tokens, but 'Pealing' comprises two tokens ('P' and 'ealing'). An overly simple explanation is that OpenAI defined these tokens based on frequency in their training dataset: common words get their own token, while less common words are built up from smaller tokens.
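If you want to count tokens programmatically rather than in the playground, OpenAI's open source tiktoken library exposes the same encodings. Here is a minimal sketch (assuming the cl100k_base encoding; older models use different encodings, so the exact splits may differ):

```python
# Inspect tokenization locally with OpenAI's open source tiktoken library.
# cl100k_base (an assumption here) is the encoding used by the
# gpt-3.5/gpt-4 family; other models use different encodings.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Pealing back the onion, LLMs speak in tokens."
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens")
for token_id in token_ids:
    # decode each id back into the chunk of characters it represents
    print(token_id, repr(enc.decode([token_id])))
```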
Prompts with large samples create many tokens, and more tokens mean higher costs. If you build your prompt correctly, an LLM can respond to a batch of requests at once, so the same samples serve multiple requests.
The Experience + Slogan examples below consist of 62 tokens, while the actual instruction is only 17 tokens, about a fifth of the total prompt token count. If multiple prompts share the same examples, you can reuse those examples across several requests within a single prompt. This may sound confusing, but let me break it down.
Save money with batch prompts
With batch prompting, the model responds to multiple requests within a single prompt, minimizing the number of tokens that must be sent and generated. The shared samples are defined once, separately from the per-request results.
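Here is a sketch of how you might assemble such a batch prompt programmatically (the helper and constants are my own illustration, not part of any library); calling it with the three new experiences produces exactly the input prompt shown below:

```python
# A sketch of assembling a batch prompt: the instruction and the
# Experience/Slogan examples are written once, and each new experience is
# appended with its own index so answers can be matched back up later.

INSTRUCTION = ("Pretend you are a food reviewer that only writes positive "
               "slogans about a restaurant.")

EXAMPLES = [
    ("The pizza was too soggy", "Pizza is popular"),
    ("The owner gave my son a toy with his hamburger",
     "Best family friendly hamburger joint"),
]

def build_batch_prompt(experiences):
    lines = [INSTRUCTION]
    # shared example experiences, numbered 1..len(EXAMPLES)
    for i, (experience, _) in enumerate(EXAMPLES, start=1):
        lines.append(f"Experience[{i}]: {experience}")
    # their matching slogans, using the same numbering
    for i, (_, slogan) in enumerate(EXAMPLES, start=1):
        lines.append(f"Slogan[{i}]: {slogan}")
    # the new, unanswered experiences continue the numbering
    for i, experience in enumerate(experiences, start=len(EXAMPLES) + 1):
        lines.append(f"Experience[{i}]: {experience}")
    return "\n".join(lines)
```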
Input prompt
Pretend you are a food reviewer that only writes positive slogans about a restaurant.
Experience[1]: The pizza was too soggy
Experience[2]: The owner gave my son a toy with his hamburger
Slogan[1]: Pizza is popular
Slogan[2]: Best family friendly hamburger joint
Experience[3]: The spaghetti sauce paired perfectly with the home made pasta
Experience[4]: The sushi rolls are creative, the artistry is wonderful and everything was absolutely delicious!
Experience[5]: I like how they have both an extensive burger and breakfast menu - hence the Bs! They kept my coffee going and even catered to a number of adjustments for my friends - now that's service!
(153 tokens)
LLM Output
Slogan[3]: Perfectly paired pasta, sauce and all!
Slogan[4]: Creative sushi, artistry on point, flavors divine!
Slogan[5]: Bs for breakfast, burgers, and beyond! Exceptional service and customization.
GPT perfectly associated each experience number with its slogan (without leaking any information across experiences). The batch prompt used 153 tokens, whereas the equivalent single prompts used 66, 73, and 97 tokens respectively (totaling 236 tokens).
With batch prompting in this example, the prompt uses only 65% of the tokens (a 35% token and cost reduction) while achieving similar accuracy.
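Because each answer comes back tagged with its index, splitting the batched response apart is a simple parsing step. A sketch, assuming the model keeps the Slogan[n] convention (parse_slogans is a hypothetical helper, not a library function):

```python
# Split the batched response back into individual answers, keyed by index.
import re

def parse_slogans(llm_output):
    slogans = {}
    for match in re.finditer(r"Slogan\[(\d+)\]:\s*(.+)", llm_output):
        slogans[int(match.group(1))] = match.group(2).strip()
    return slogans

output = (
    "Slogan[3]: Perfectly paired pasta, sauce and all!\n"
    "Slogan[4]: Creative sushi, artistry on point, flavors divine!\n"
    "Slogan[5]: Bs for breakfast, burgers, and beyond! "
    "Exceptional service and customization."
)
print(parse_slogans(output))
# => {3: 'Perfectly paired pasta, sauce and all!', 4: '...', 5: '...'}
```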
Final thoughts
This technique can cut LLM costs by 50% or more with minimal accuracy degradation; the savings grow with batch size, since the shared examples are amortized across more requests. It is perfect for processing millions of prompts for tasks like summarization or entity extraction at low cost.
A word of warning: this technique isn’t free. There is a ~3% decline in accuracy when comparing a batch size of 4 to a single prompt, and an even sharper drop in accuracy moving from a batch size of 4 to a batch size of 6.
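Given that sharper drop past a batch size of 4, one reasonable pipeline chunks a large queue of inputs into groups of 4 and sends one batch prompt per chunk. A sketch reusing the build_batch_prompt and parse_slogans helpers from above (batch_size is a knob to tune against your own accuracy measurements):

```python
# Chunk a large queue of inputs into batches of 4 and process each batch
# with a single prompt. Reuses build_batch_prompt from the earlier sketch.
def chunked(items, batch_size=4):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

all_experiences = ["first review...", "second review...", "..."]  # your inputs

for batch in chunked(all_experiences, batch_size=4):
    prompt = build_batch_prompt(batch)
    # send `prompt` to the model, then parse_slogans(response) as shown above
```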
Additional reading
- Batch Prompting: Efficient Inference with Large Language Model APIs (arxiv.org)
- Prompt Engineering Reading Library (github.com)
- Prompt Engineering Walk-through colab notebook (github.com)