How to choose an LLM? A guide for senior professionals and leaders

There are currently hundreds of LLMs on the market, both open-source and proprietary. Not long ago, it seemed a new LLM was being released every day.

By now we all know the importance of LLMs, what they can do and why we need to use them for our businesses.

However, the difficulty lies in choosing the right LLM for our business use case.

There are many good proprietary models, such as GPT-4 and GPT-3.5, as well as open-source models, but how can you evaluate all of them side by side and select the best one?

To help senior IT professionals and business owners, we have created this guide on choosing an LLM.

How to choose an LLM for your business?

Cost vs Performance vs Security vs Accuracy

If we use an LLM for personal purposes, cost is rarely a problem. However, cost becomes an important factor when we deploy an LLM for thousands of users.
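To see how quickly cost scales with users, here is a rough back-of-the-envelope sketch. Every number in it (requests per user, tokens per request, price per 1,000 tokens) is a made-up placeholder, not any vendor's actual pricing; substitute your own figures.

```python
def monthly_cost(users, requests_per_user_per_day, tokens_per_request,
                 price_per_1k_tokens, days=30):
    """Estimate monthly LLM spend, in the same currency as the price."""
    total_tokens = users * requests_per_user_per_day * tokens_per_request * days
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical usage figures and price, for illustration only.
personal = monthly_cost(users=1, requests_per_user_per_day=20,
                        tokens_per_request=1000, price_per_1k_tokens=0.01)
business = monthly_cost(users=5000, requests_per_user_per_day=20,
                        tokens_per_request=1000, price_per_1k_tokens=0.01)

print(f"personal use: {personal:.2f} per month")
print(f"5,000 users:  {business:.2f} per month")
```

With the same usage pattern per person, the bill grows in direct proportion to the number of users, which is why a small price difference between two models matters at business scale.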

At the same time, we need to consider performance, and we cannot overlook accuracy either.

And security should be the top priority. We simply cannot ignore it.

Switching cost of LLMs

LLM-based applications are not like a typical CRM or ERP system. If cost increases or performance drops, we should be able to switch from one LLM to another.

Even if cost and performance are not an issue, better LLMs appear regularly, with improved performance, accuracy, or cost.

In this case, what is the switching cost?

Even though we are using a ready-made LLM, we spend weeks, and sometimes months, figuring out how to prompt it and writing the code to connect to it.

If we have to switch from one LLM to another, we have to repeat the same exercise, and the cost of doing so could be huge.

At the same time, we cannot stay stuck on older versions, because LLMs are advancing at a very rapid pace.
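One common way to keep the switching cost down is to put a thin layer between the application and the model, so that prompts and business logic do not depend on any one vendor. The sketch below is only an illustration of that idea: the two "models" are stand-in functions, and names such as REGISTRY, PROMPT_TEMPLATE and ask are hypothetical. In a real system each stand-in would wrap the vendor's own client library.

```python
from typing import Callable, Dict

# For this sketch, a "model" is any function from prompt text to answer text.
ModelFn = Callable[[str], str]


def fake_model_a(prompt: str) -> str:
    # Stand-in for a call to one vendor's API.
    return f"[model-a] answer to: {prompt}"


def fake_model_b(prompt: str) -> str:
    # Stand-in for a call to a different vendor's API.
    return f"[model-b] answer to: {prompt}"


# The only place that knows which models exist.
REGISTRY: Dict[str, ModelFn] = {
    "model-a": fake_model_a,
    "model-b": fake_model_b,
}

# The prompt lives in one place instead of being scattered through the code.
PROMPT_TEMPLATE = "You are a helpful support assistant.\nQuestion: {question}"


def ask(model_name: str, question: str) -> str:
    """Send the same templated prompt to whichever model is selected."""
    model = REGISTRY[model_name]
    return model(PROMPT_TEMPLATE.format(question=question))


print(ask("model-a", "How do I reset my password?"))
print(ask("model-b", "How do I reset my password?"))
```

With this shape, trying a new LLM means adding one entry to the registry rather than rewriting the application, although prompts may still need tuning for each model.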

This brings us to the question: how can we test LLMs, or rather, how quickly can we test them?

How can we test multiple LLMs?

According to AI expert Chris Krauss, one way to test multiple LLMs is to treat the exercise as a sports tournament.

Two teams compete against each other, then another pair competes, and then the winners play against each other, and so on.

For example, as shown in the image above, we can first compare GPT-4 with GPT-3.5 on cost, performance, and accuracy.

Once we have a winner, we can evaluate it against another model, say WatsonX. We do not have to do the evaluation manually; we can use another LLM as the judge.
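The winner-faces-next-challenger idea can be sketched in a few lines. This is a simplified illustration, not the exact procedure used for the results in this article: the models are stand-in functions, and the judge here simply prefers the longer answer so that the example runs without any API access. In practice the judge would send something like the text from build_judge_prompt (a hypothetical helper) to another LLM and read back its verdict.

```python
from typing import Callable, Dict

Model = Callable[[str], str]
# A judge takes (question, answer_a, answer_b) and returns "a" or "b".
Judge = Callable[[str, str, str], str]


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Text we might send to a judging LLM (wording is an assumption)."""
    return (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with exactly 'a' or 'b'."
    )


def longest_answer_judge(question: str, answer_a: str, answer_b: str) -> str:
    # Placeholder judge so the sketch runs offline; ties go to the champion.
    return "b" if len(answer_b) > len(answer_a) else "a"


def run_tournament(models: Dict[str, Model], question: str, judge: Judge) -> str:
    """The current winner faces each remaining model in turn."""
    names = list(models)
    champion = names[0]
    for challenger in names[1:]:
        answer_a = models[champion](question)
        answer_b = models[challenger](question)
        if judge(question, answer_a, answer_b) == "b":
            champion = challenger
    return champion


# Stand-in models with canned answers, for illustration only.
models: Dict[str, Model] = {
    "model-a": lambda q: "Short.",
    "model-b": lambda q: "A somewhat longer answer.",
    "model-c": lambda q: "Medium answer.",
}

winner = run_tournament(models, "What is the most important objective of using AI?",
                        longest_answer_judge)
print(winner)
```

A single question and a single judge call per round is the bare minimum. A more robust setup would judge on several questions and swap the order of the two answers to reduce position bias.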

In the image below, we asked an LLM to rate all three answers, and it produced a comparison of the answers from all three LLMs.

The prompt was: "What is the most important objective of using AI?"

Results by test version

We can repeat the test many times with different types of prompts and then draw a proper comparison among all three LLMs.

Doing this comparison manually would take a lot of people and a lot of time.
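Once the judge has scored every answer, summarising the results is simple bookkeeping. The sketch below assumes the judge returns a score from 1 to 10 for each question and model; the ratings shown are invented placeholders, not the results from our tests.

```python
from collections import defaultdict
from statistics import mean

# (question, model, score) triples; the values are made up for illustration.
ratings = [
    ("q1", "model-a", 8), ("q1", "model-b", 6), ("q1", "model-c", 7),
    ("q2", "model-a", 7), ("q2", "model-b", 9), ("q2", "model-c", 5),
]


def average_by_model(ratings):
    """Average score of each model across all questions."""
    scores = defaultdict(list)
    for _question, model, score in ratings:
        scores[model].append(score)
    return {model: mean(values) for model, values in scores.items()}


def best_by_question(ratings):
    """Highest-scoring model for each question (first one wins ties)."""
    best = {}
    for question, model, score in ratings:
        if question not in best or score > best[question][1]:
            best[question] = (model, score)
    return {question: model for question, (model, _score) in best.items()}


print(average_by_model(ratings))
print(best_by_question(ratings))
```

Looking at both views helps: a model can have the best average and still lose on the kind of question that matters most for a particular use case.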

Here is the graph for Results by Question

This way, you can test multiple LLMs in a few hours.


LLMs are changing every day. They are getting better, more accurate, and cheaper with every passing day.

To keep up with this pace of innovation, we need a method for testing LLMs regularly.

