By integrating an LLM, you proudly labeled your service “AI-powered.” Your homepage shows off the revolutionary impact of these AI-powered features with interactive demos and case studies, giving your company its first footprint in the global generative AI landscape.
Your small but loyal user base loves the enhanced customer experience, and you can see growth potential on the horizon. But just three weeks into the month, you're shocked by an email from OpenAI: you've hit your usage limit.
Just a week ago, you were talking to customers and assessing product-market fit (PMF); now thousands of users are flocking to your website (everything can go viral on social media these days), and the surge has pushed you past that limit and crashed your AI-powered service.
As a result, a once-reliable service is not only frustrating existing users but also failing new ones.
The easy and obvious solution is to ask OpenAI to raise your usage limit and restore the service immediately.
But this temporary solution comes with some concerns. You can't help but feel like you have limited control over your own AI and associated costs, and are locked into a dependency on a single provider.
“Should I DIY?” you ask yourself.
Fortunately, you know that open-source large language models (LLMs) are a real option. Thousands of them are instantly available on platforms like Hugging Face, opening up the possibility of self-hosting.
However, the most capable of these LLMs have billions of parameters, weigh in at hundreds of gigabytes, and take considerable effort to scale. Unlike traditional models, they cannot simply be plugged into a real-time application that demands low latency.
While you have full confidence in your team's ability to build the necessary infrastructure, the real concern is the cost of such a migration, including:
- Fine-tuning costs
- Hosting costs
- Serving costs
So, the big question: do you raise your usage limit, or take the self-hosted, “ownership” route?
A little math with LLaMA 2
First of all, don't rush. This is a big decision.
Talk to a machine learning (ML) engineer and they will probably tell you that LLaMA 2, an open-source LLM, looks like a good model to proceed with, since it performs on par with GPT-3 on most tasks.
You also learn that the model comes in three sizes (7, 13, and 70 billion parameters), and you choose the largest to remain competitive with the OpenAI model you currently use.
LLaMA 2 was trained with bfloat16 precision, so each parameter consumes 2 bytes. For the 70B model, that puts the weights at 140 GB.
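As a quick sanity check, here is that arithmetic in Python (treating a gigabyte as 10^9 bytes, which matches the round numbers used throughout this piece):

```python
# Rough size of the LLaMA 2 70B weights in bfloat16 (2 bytes per parameter).
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # bfloat16
print(f"Model weights: {params * bytes_per_param / 1e9:.0f} GB")  # -> 140 GB
```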
If that sounds like a lot of model to fine-tune, don't worry. With LoRA, you don't need to fine-tune all of the parameters before deployment.
In fact, only about 0.1% of the total parameters (roughly 70M) need to be trained, and their bfloat16 representation consumes just 0.14 GB.
Impressive, right?
To account for memory overhead during fine-tuning (backpropagation, stored activations, the dataset itself, and so on), a good rule of thumb is to budget up to 5x the memory consumed by the trainable parameters.
Let's take a closer look at this:
- The LLaMA 2 70B weights are frozen during LoRA, so they add no training overhead → memory requirement = 140 GB.
- The trainable LoRA layers, however, need 0.14 GB × 5 = 0.7 GB.
- That puts the total during fine-tuning at roughly 141 GB (see the sketch below).
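A minimal sketch of that estimate, assuming the 5x overhead rule of thumb above:

```python
# Peak memory during LoRA fine-tuning of LLaMA 2 70B (rule-of-thumb estimate).
base_weights_gb = 140.0                  # frozen base model weights
lora_params = 70e9 * 0.001               # ~0.1% of parameters are trainable (~70M)
lora_weights_gb = lora_params * 2 / 1e9  # bfloat16 -> ~0.14 GB
overhead = 5                             # gradients, optimizer states, activations

total_gb = base_weights_gb + lora_weights_gb * overhead
print(f"Peak fine-tuning memory: ~{total_gb:.0f} GB")  # -> ~141 GB
```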
Let's assume you don't have training infrastructure yet and use AWS. Based on AWS EC2 on-demand pricing, suitable compute costs about $2.80 per hour, which works out to roughly $67 per day of fine-tuning. Since fine-tuning rarely takes more than a few days, it is not a major cost driver.
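The arithmetic behind that daily figure (the hourly rate is this article's estimate; actual EC2 pricing varies by instance type and region):

```python
# Daily compute cost for fine-tuning on an on-demand AWS instance.
hourly_rate = 2.80  # USD/hour, illustrative on-demand GPU price
print(f"Fine-tuning: ~${hourly_rate * 24:.0f}/day")  # -> ~$67/day
```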
AI is the opposite of a restaurant: the main cost lies in serving, not preparation.
Deployment requires keeping two sets of weights in memory:
- The model weights, which consume 140 GB.
- The LoRA fine-tuned weights, which consume 0.14 GB.
- That totals 140.14 GB.
Of course, you can skip the gradient computation at inference time, but it is still wise to keep up to 1.5x headroom (roughly 210 GB) to handle any unexpected overhead.
Based on AWS EC2 on-demand pricing (again), GPU compute at that scale costs about $3.70 per hour, or about $90 per day, to keep the model loaded in production memory and responding to incoming requests.
This works out to about $2,700 per month.
Another thing to consider: unexpected failures always occur, and without a fallback mechanism users simply stop receiving predictions. To prevent this, you should keep a second, redundant copy of the model ready in case the first fails a request.
That doubles the cost to $180 per day, or $5,400 per month, which is getting close to what you were paying OpenAI.
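Putting the serving math together (using the rounded $90-per-day figure from above):

```python
# Always-on serving cost with a redundant replica for failover.
hourly_rate = 3.70                      # USD/hour per GPU instance, per the text
daily_per_replica = hourly_rate * 24    # ~$89/day, rounded to ~$90 above
replicas = 2                            # primary + fallback
monthly = 90 * replicas * 30            # using the rounded daily figure
print(f"Serving: ~${monthly:,}/month")  # -> ~$5,400/month
```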
At what point do OpenAI and open source break even?
If you stayed with OpenAI instead, here is how many words you would have to process each day to match the fine-tuning and serving costs of LLaMA 2 above.
Based on OpenAI's pricing, fine-tuning GPT-3.5 Turbo costs $0.0080 per 1,000 tokens.
- Assuming roughly two tokens per word, you would need to feed approximately 4.15 million words to the OpenAI model to match the $67-per-day fine-tuning cost of the open-source LLaMA 2 70B model.
- An A4 page averages about 300 words, so that is a massive 14,000 pages of data for the same money as open-source fine-tuning (see the sketch below).
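Here is that break-even arithmetic, using the rough assumptions above (two tokens per word, 300 words per page):

```python
# Words/day OpenAI could fine-tune on for the same $67/day.
daily_budget = 67.0
price_per_1k_tokens = 0.008   # GPT-3.5 Turbo fine-tuning rate
tokens_per_word = 2
words_per_page = 300

words = daily_budget / price_per_1k_tokens * 1_000 / tokens_per_word
print(f"~{words / 1e6:.2f}M words/day, ~{words / words_per_page:,.0f} pages")
# -> ~4.19M words (~4.15M in the text, after rounding), ~14,000 pages
```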
Since few teams have that much fine-tuning data, fine-tuning with OpenAI will almost always be the cheaper option.
Another point worth noting: OpenAI's fine-tuning cost depends on the amount of data, not on training time. That is not the case for open-source models, where the cost depends on both the amount of data and how long you pay for AWS compute.
As for serving costs, OpenAI's pricing page lists the fine-tuned GPT-3.5 Turbo at $0.003 per 1,000 input tokens and $0.006 per 1,000 output tokens.
- Taking an average of $0.004 per 1,000 tokens, you would need to push approximately 22.2 million words through the API each day to reach the $180-per-day serving cost above.
- That is more than 74,000 pages of data, at 300 words per page (see the sketch below).
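The same arithmetic for serving (the small gap between the ~22.5M computed here and the 22.2M above comes down to rounding in the assumptions):

```python
# Words/day through the OpenAI API that would cost the same $180/day.
daily_budget = 180.0
price_per_1k_tokens = 0.004   # blended input/output rate from above
tokens_per_word = 2
words_per_page = 300

words = daily_budget / price_per_1k_tokens * 1_000 / tokens_per_word
print(f"~{words / 1e6:.1f}M words/day, ~{words / words_per_page:,.0f} pages")
# -> ~22.5M words/day, ~75,000 pages
```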
However, one genuine advantage is that OpenAI's pricing is pay-as-you-go, so you don't have to keep a model running around the clock.
If no one uses your model, you pay nothing.
Summary: When does ownership actually make sense?
Moving to self-hosted AI may look attractive at first, but beware of the hidden costs and headaches that come with it.
Aside from the occasional sleepless night wondering why your AI-powered service is down, a third-party provider absorbs almost all of the difficulty of managing LLMs in a production system.
That is especially true when your service is not primarily “AI” but merely relies on AI.
For a large company, a $65,000 annual cost of ownership may be a drop in the bucket, but for most companies it is a number that cannot be ignored.
And don't forget the additional costs of staffing and maintenance, which can easily push the total above $200,000 to $250,000 per year.
Certainly, owning the model from the beginning has its benefits, such as maintaining control over your data and usage.
However, making self-hosting viable requires not just the people and logistics to run it, but also a sustained request load well beyond the roughly 22.2 million words per day calculated above.
For most use cases, owning the model rather than renting it through an API is unlikely to be economically beneficial.
Avi Chawla is a data scientist and the creator of AIport.