By integrating an LLM, you proudly labeled your service “AI-powered.” Your homepage shows off the revolutionary impact of these AI-powered features with interactive demos and case studies, giving your company its first footprint in the global generative AI landscape.
Your small but loyal user base loves the enhanced customer experience, and you can see growth potential on the horizon. But just three weeks into the month, you're shocked by an email from OpenAI: you've hit your usage limit.
Just a week ago, you were talking to customers and assessing product-market fit (PMF); now thousands of users are flocking to your website (everything can go viral on social media these days), and the surge has pushed you past that limit and crashed your AI-powered service.
As a result, a once-reliable service is not only frustrating existing users but also failing new ones.
The easy and obvious solution is to ask OpenAI to raise your usage limit and restore the service immediately.
But this temporary solution comes with some concerns. You can't help but feel like you have limited control over your own AI and associated costs, and are locked into a dependency on a single provider.
“Should I DIY?” you ask yourself.
Fortunately, you know that open-source large language models (LLMs) are a real option. Thousands of them are instantly available on platforms like Hugging Face, opening up the possibility of self-hosting.
However, the most capable of these LLMs have billions of parameters, weigh in at hundreds of gigabytes, and take considerable effort to scale. Unlike traditional models, they cannot simply be plugged into a real-time application that demands low latency.
While you have full confidence in your team's ability to build the necessary infrastructure, the real concern is the cost of such a migration, including:
- Fine-tuning costs
- Hosting costs
- Serving costs
So, the big question: do you raise your usage limit, or take the self-hosted, “ownership” route?
A little math with LLaMA 2
First of all, don't rush. This is a big decision.
Talk to a machine learning (ML) engineer and they will probably tell you that LLaMA 2, an open-source LLM, looks like a good model to proceed with, since it performs on par with GPT-3 on most tasks.
You also learn that the model comes in three sizes (7, 13, and 70 billion parameters), and you choose the largest to remain competitive with the OpenAI model you currently use.
LLaMA 2 was trained with bfloat16 precision, so each parameter consumes 2 bytes. For the 70B model, that puts the weights at 140 GB.
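As a quick sanity check, here is that arithmetic in Python (treating a gigabyte as 10^9 bytes, which matches the round numbers used throughout this piece):

```python
# Rough size of the LLaMA 2 70B weights in bfloat16 (2 bytes per parameter).
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # bfloat16
print(f"Model weights: {params * bytes_per_param / 1e9:.0f} GB")  # -> 140 GB
```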
If that sounds like a lot of model to fine-tune, don't worry. With LoRA, you don't need to fine-tune all of the parameters before deployment.
In fact, only about 0.1% of the total parameters (roughly 70M) need to be trained, and their bfloat16 representation consumes just 0.14 GB.
Impressive, right?
To account for memory overhead during fine-tuning (backpropagation, stored activations, the dataset itself, and so on), a good rule of thumb is to budget up to 5x the memory consumed by the trainable parameters.
Let's take a closer look at this:
- The LLaMA 2 70B weights are frozen during LoRA, so they add no training overhead → memory requirement = 140 GB.
- The trainable LoRA layers, however, need 0.14 GB × 5 = 0.7 GB.
- That puts the total during fine-tuning at roughly 141 GB (see the sketch below).
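A minimal sketch of that estimate, assuming the 5x overhead rule of thumb above:

```python
# Peak memory during LoRA fine-tuning of LLaMA 2 70B (rule-of-thumb estimate).
base_weights_gb = 140.0                  # frozen base model weights
lora_params = 70e9 * 0.001               # ~0.1% of parameters are trainable (~70M)
lora_weights_gb = lora_params * 2 / 1e9  # bfloat16 -> ~0.14 GB
overhead = 5                             # gradients, optimizer states, activations

total_gb = base_weights_gb + lora_weights_gb * overhead
print(f"Peak fine-tuning memory: ~{total_gb:.0f} GB")  # -> ~141 GB
```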
Let's assume you don't have training infrastructure yet and use AWS. Based on AWS EC2 on-demand pricing, suitable compute costs about $2.80 per hour, which works out to roughly $67 per day of fine-tuning. Since fine-tuning rarely takes more than a few days, it is not a major cost driver.
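The arithmetic behind that daily figure (the hourly rate is this article's estimate; actual EC2 pricing varies by instance type and region):

```python
# Daily compute cost for fine-tuning on an on-demand AWS instance.
hourly_rate = 2.80  # USD/hour, illustrative on-demand GPU price
print(f"Fine-tuning: ~${hourly_rate * 24:.0f}/day")  # -> ~$67/day
```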
AI is the opposite of a restaurant: the main cost lies in serving, not preparation.
Deployment requires keeping two sets of weights in memory:
- The model weights, which consume 140 GB.
- The LoRA fine-tuned weights, which consume 0.14 GB.
- That totals 140.14 GB.
Of course, you can skip the gradient computation at inference time, but it is still wise to keep up to 1.5x headroom (roughly 210 GB) to handle any unexpected overhead.
Based on AWS EC2 on-demand pricing (again), GPU compute at that scale costs about $3.70 per hour, or about $90 per day, to keep the model loaded in production memory and responding to incoming requests.
This works out to about $2,700 per month.
Another thing to consider: unexpected failures always occur, and without a fallback mechanism users simply stop receiving predictions. To prevent this, you should keep a second, redundant copy of the model ready in case the first fails a request.
That doubles the cost to $180 per day, or $5,400 per month, which is getting close to what you were paying OpenAI.
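Putting the serving math together (using the rounded $90-per-day figure from above):

```python
# Always-on serving cost with a redundant replica for failover.
hourly_rate = 3.70                      # USD/hour per GPU instance, per the text
daily_per_replica = hourly_rate * 24    # ~$89/day, rounded to ~$90 above
replicas = 2                            # primary + fallback
monthly = 90 * replicas * 30            # using the rounded daily figure
print(f"Serving: ~${monthly:,}/month")  # -> ~$5,400/month
```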
At what point do OpenAI and open source break even?
If you stayed with OpenAI instead, here is how many words you would have to process each day to match the fine-tuning and serving costs of LLaMA 2 above.
Based on OpenAI's pricing, fine-tuning GPT-3.5 Turbo costs $0.0080 per 1,000 tokens.
- Assuming roughly two tokens per word, you would need to feed approximately 4.15 million words to the OpenAI model to match the $67-per-day fine-tuning cost of the open-source LLaMA 2 70B model.
- An A4 page averages about 300 words, so that is a massive 14,000 pages of data for the same money as open-source fine-tuning (see the sketch below).
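Here is that break-even arithmetic, using the rough assumptions above (two tokens per word, 300 words per page):

```python
# Words/day OpenAI could fine-tune on for the same $67/day.
daily_budget = 67.0
price_per_1k_tokens = 0.008   # GPT-3.5 Turbo fine-tuning rate
tokens_per_word = 2
words_per_page = 300

words = daily_budget / price_per_1k_tokens * 1_000 / tokens_per_word
print(f"~{words / 1e6:.2f}M words/day, ~{words / words_per_page:,.0f} pages")
# -> ~4.19M words (~4.15M in the text, after rounding), ~14,000 pages
```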
Since few teams have that much fine-tuning data, fine-tuning with OpenAI will almost always be the cheaper option.
Another point worth noting: OpenAI's fine-tuning cost depends on the amount of data, not on training time. That is not the case for open-source models, where the cost depends on both the amount of data and how long you pay for AWS compute.
As for serving costs, OpenAI's pricing page lists the fine-tuned GPT-3.5 Turbo at $0.003 per 1,000 input tokens and $0.006 per 1,000 output tokens.
- Taking an average of $0.004 per 1,000 tokens, you would need to push approximately 22.2 million words through the API each day to reach the $180-per-day serving cost above.
- That is more than 74,000 pages of data, at 300 words per page (see the sketch below).
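The same arithmetic for serving (the small gap between the ~22.5M computed here and the 22.2M above comes down to rounding in the assumptions):

```python
# Words/day through the OpenAI API that would cost the same $180/day.
daily_budget = 180.0
price_per_1k_tokens = 0.004   # blended input/output rate from above
tokens_per_word = 2
words_per_page = 300

words = daily_budget / price_per_1k_tokens * 1_000 / tokens_per_word
print(f"~{words / 1e6:.1f}M words/day, ~{words / words_per_page:,.0f} pages")
# -> ~22.5M words/day, ~75,000 pages
```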
However, one genuine advantage is that OpenAI's pricing is pay-as-you-go, so you don't have to keep a model running around the clock.
If no one uses your model, you pay nothing.
Summary: When does ownership actually make sense?
Moving to self-hosted AI may look attractive at first, but beware of the hidden costs and headaches that come with it.
Aside from the occasional sleepless night wondering why your AI-powered service is down, a third-party provider absorbs almost all of the difficulty of managing LLMs in a production system.
That is especially true when your service is not primarily “AI” but merely relies on AI.
For a large company, a $65,000 annual cost of ownership may be a drop in the bucket, but for most companies it is a number that cannot be ignored.
And don't forget the additional costs of staffing and maintenance, which can easily push the total above $200,000 to $250,000 per year.
Certainly, owning the model from the beginning has its benefits, such as maintaining control over your data and usage.
However, making self-hosting viable requires not just the people and logistics to run it, but also a sustained request load well beyond the roughly 22.2 million words per day calculated above.
For most use cases, owning the model rather than renting it through an API is unlikely to be economically beneficial.
Avi Chawla is a data scientist and the creator of AIport.