As the title says, i started by selfhosting OpenWebUI including Ollama on my RIG. I have been pretty happy but the more i dig into this stuff, more i understand that i am doing it wrong and i definitely need to switch to llama.cpp / ik_llama.cpp.

But i have a few questions…

  1. I want a web based LLM chat GUI, because that’s my 80% usage for AI. If i go with llama.cpp, do i need to ditch OpenWebUI as well? Is there a better UI? Do i need an UI?

  2. i am currently hosting it all with a docker compose file. Is this still doable if i switch? I can go bare-metal (Gentoo server, good skills on my side) but it’s the maintenance part, a “podman compose pull” is just easier… or i am lazy.

  3. the server is headless and always accessed remotely via web or ssh, just to be clear.

My hardware is a NVIDIA RTX A4000 16GB VRAM on a I7-8700@3200Ghz with 64GB system RAM (shared with far too many services).

  • ffhein@lemmy.world
    link
    fedilink
    English
    arrow-up
    1
    ·
    4 days ago

    The main reason anyone would “need to” switch to llama.cpp is if they want to do partial offloading, i.e. split the model between GPU and CPU. This works quite well for MoE models, but you didn’t say anything about this, so I’m just wondering what your goals are.

    Absolutely nothing wrong with switching to llama.cpp, I also use it, but that’s because I occasionally want to run models larger than my VRAM. It has official docker images and a server with both API access and a decent web-UI.

    If you’re only going to run models which fully fit in VRAM, then tabbyAPI is also a good option. However, it uses Exl3 format instead of gguf, so llama.cpp probably makes more sense if you already have a lot of models in gguf format. tabby also comes with docker files and API support, so either should be quite easy to integrate with your setup.