Self-hosting my personal LLM (but not quite)

🎧 Listen

ChatGPT is now part of our daily lives. A quick question, an extra input, some quick feedback, I always reach for it. AGI or not AGI, I certainly can't deny the impact it has had on our lives. It's pretty incredible!

But that small voice just won't go away. "Where is all this data going? Should I really be telling this bot this much? What if someone else sees this?" In a way, it feels like everything I tell this bot is going into a void I have no control of. I can't be the only one feeling it.

And I don't like it. It's a bit like putting all my eggs in a single basket. As engineers, we know how bad having a single point of failure is. Resilience! That's what we were taught.

Given that I already started building Ambrosio, so why not use it to self-host my own LLM?

Ambrosio chat and generate photo

Self-hosting with the help of Llama.cpp

I'm not really interested in spending 500 USD/month to run a model that requires 40 GB of VRAM on a large GPU. So I had to lower my standards a little bit. The premise was now: What models can I run on a 15 USD/month server?

I started looking at the only LLM leaderboard I trust. There aren't many options with 7 billion parameters. The best one seems to be Starling-LM-7B-alpha, out of Berkeley. Quantized, it should just about run on a cheap Hetzner box. Getting it up and running on a small box with llama.cpp is as easy as:

# download model
$ wget

# setup llama cpp API with ./server
$ git clone
$ cd llama.cpp
$ make 
$ ./server -m starling-lm-7b-alpha.Q5_K_M.gguf -c 8192 --temp 0.0 --repeat_penalty 1.1 -n -1 -p "GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:"

And just like that, you're running your own OpenAI compatible (ugh, I hate that term) API serving that very 7B model.

Now, don't get me wrong, I think this model is great! But I'm looking for something that will replace my daily use of ChatGPT. After integrating with Ambrosio the limitations were obvious. The first issue: It's slow, painfully slow (which was expected on my small machine). And even though it's pretty incredible for a small model, it's not going to replace even the most basic use cases.

And if it sounds unfair in any way, it's because it is! How could I expect the same level of quality from something that can run on a cheap VM, to something like ChatGPT? And you're right. We can't. So what else can we do?

Mixtral 8x7B and the rice test

Let's say I was going to rent a bigger machine for a better model. What would I run on it? Certainly not a 7B model! Looking at the rankings again, the best open source thing currently is Mixtral 8x7b, a mixture-of-experts model from Mistral. Just glancing at the requirements, and it was pretty clear this needed to be run on someone else's box. I decided to give a go. Their pricing is pretty much unbeatable and their inference speed is blazingly fast. And of course they have an Open AI compatible API.

Ambrosio vs. ChatGPT vs. GPT-4

Cool, it's fast and cheap. Is it any good?

Everyone has their own way of testing the quality of a model. Mine is what I call the "rice test". My system prompt, stolen from Jeremy Howard contains the following clause:

if the request begins with the string "vv" [...] make your response as concise as possible

The question above is a simple and straightforward scenario. I'm at the supermarket, I need to buy some rice to cook with my falafel. I open my phone and: "vv jasmin or basmati rice with falafel?". As you can see above, while GPT-4 gives me exactly what I need, we can't say the same for Mixtral or GPT-3.5. I just want a quick and dirty answer, I don't want something complicated, I certainly do not want any Python code. So yeah, it doesn't quite pass the jasmin rice test. But we're getting closer. And I will keep testing it in the future.

And sure, Mixtral may not have been trained with a system prompt, but it should, in principle, support it.

Thoughts and taking Ambrosio further

Since I ended up using together's API, I decided to also add image generation to Ambrosio (something like I have with DALL·E 2). Together supports most Stable Diffusion models. My feelings about them are largely the same as my feelings about Mixtral and GPT-4. The open source models are good! But they are also a bit rough, and don't come quite close to what Open AI is giving us. At least yet.

So, will Ambrosio completely replace my use of GPT-4? Will I cancel my ChatGPT plus subscription? Not really. At least, not yet.

Every time I use these open source models, I feel like I'm missing out on a better answer. In a way, I've become spoiled by GPT-4. It just grasps what I want in a better way, and most importantly, it passes the rice test with flying colors!

But the truth also is, that for most tasks, Mixtral is good enough. And it's a big leap forward from what the open source community previously had. I can only imagine what comes next.

Exciting times.

January 21, 2024
Subscribe Reply