Last time I checked the awesome-selfhosted GitHub page, it didn't list self-hosted AI systems, so I decided to bring this topic up, because it's fairly interesting :)

Using certain models and AIs remotely is fun and interesting, if only for poking around and being amazed by what they can do. But running them on your own system - where the only boundaries are your hardware and maybe some in-model tweaks - is something else and quite fun.

As of late, I have been playing around with these two in particular:

- InvokeAI: a Stable Diffusion based toolkit to generate images on your own system. It has grown quite a lot and has some intriguing features - they are even working on streamlining the training process with Dreambooth, which ought to be super interesting!
- KoboldAI: runs GPT-2 and GPT-J based models. It's like a "primitive version" of ChatGPT (GPT-3), but it's not incapable either. The model selection is great, and you can load your own too, meaning that you could find some interesting ones on HuggingFace.

What are some self-hosted AI systems you have seen so far? I may only have an AMD Ryzen 9 3900X and NVIDIA 2080 TI, but if I can run an AI myself, I'd love to try it :)

PS.: I didn't find a good flair for this one. Sorry!

Comments (85)

I only have potato GPUs (an NVIDIA 1060 3GB is the best one), but I can run some optimized (slow) Stable Diffusion and one of the small GPT-Neo models (which can generate somewhat coherent text based on prompts, but nothing close to ChatGPT).

With a better GPU, rendering images is definitely useful. Text will not compete with ChatGPT, but a benefit of self-hosting is that you have full control, and there are ways to tweak the model by feeding it new text, so you might be able to do specialized things the cloud services can't (or won't...).

Any model right now where I can feed it a library of text and get something halfway usable out of it?

[deleted]

I do have my 3080 with 10 gigs of VRAM, and I have my eye on a used Tesla with 24.

FLAN-T5 is one of the best options in the realm of stuff that fits in gaming GPUs. Use load_in_8bit with accelerate/bitsandbytes.
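For reference, loading it in 8-bit looks roughly like this - a minimal sketch assuming the google/flan-t5-xl checkpoint and the transformers + accelerate + bitsandbytes stack, so adjust for your own card:

```python
# Load FLAN-T5 with 8-bit weights so it fits in consumer VRAM.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",
    device_map="auto",   # let accelerate place the layers
    load_in_8bit=True,   # bitsandbytes 8-bit quantization
)

inputs = tokenizer("Translate to German: Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```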

Thanks, I'll give it a try!

While 3GB of VRAM is enough for a lot of workloads, some of the more impressive models - speech transcription, text generation, Stable Diffusion - can't really run with that little space for their tensors, unfortunately. Unless you had versions of those models trimmed down specifically to the parameters you know you'd use.

Yes, I would not recommend running any of those things on only 3 GB, but I think it is promising that anything works at all, so with a somewhat OK GPU it should be possible to self-host something useful/fun.

The big push in AI these days is doing more with less data and with smaller models, so we should be seeing more practical home-use AIs in the next few years, because currently the biggest bottleneck is accelerators with enough VRAM to hold a whole model.

InvokeAI, in my opinion, is a good UI, but Automatic is so much better: it has a much larger community and is updated frequently, both in the core and in add-ons.

https://github.com/AUTOMATIC1111/stable-diffusion-webui

I really like Easy Diffusion (formerly cmdr2) with some of the batch plugins. It makes it really easy to iterate quickly over thousands of generations. The only downside is that nothing beats A1111 for having the latest features first.

Still, if I don't need a feature that's in A1111, the UI experience is so much better for my process. Maybe I'm missing great UI plugins for A1111, but in my experience I could never recreate as streamlined a workflow.

I still keep an A1111 setup in parallel and use it when I need it! The nice thing is you can use multiple setups, and if you symlink the models directory to share it, it doesn't even take much extra space.

https://github.com/cmdr2/stable-diffusion-ui

[deleted]

Agreed but I like the interface and functionality of InvokeAI so much better. The thing holding it back is it only has Stable Diffusion 1.5. Once it gets SD 2+ it'll be so much better. I've been waiting for it for a while since I first used Invoke.

Stable Diffusion 2+ isn't better; most people are still using 1.5 because of that, and a lot of new features come to 1.5 first. The issue with 2.0 is that it heavily restricted the training data, in particular NSFW data, both when training and when outputting. While you might think "I don't care about NSFW content so it doesn't affect me", it absolutely does: even for regular generations, things like anatomy are significantly worse in 2+ compared with 1.5.

It supports both SD 2.0 and 2.1 according to the README in the InvokeAI repo. You just need to get the checkpoint file instead of the safetensors file, which, last I checked, wasn't supported in InvokeAI.

I may only have an AMD Ryzen 9 3900X and NVIDIA 2080 TI, but if I can run an AI myself, I'd love to try it :)

Only? Me with my i7-950 and a 580X are looking at you O.O

I have an oddball setup. Ryzen 7 3800x with a RX480 8G.

Why an oddball? You have a vastly superior CPU to me, and your 480 is not that much slower than a 580... is it?

The 580 was basically a transistor step refresh of the 480. So the 580 is slightly better than the 480.

But yes, the CPU does make a huge difference. My previous CPU with the same GPU was an FX-6300.

480

We can do SD at home now?

Ha! I have a Dell Optiplex 3060, and a Pentium all-in-one with Ubuntu Server lol.

Optiplex 3060

Sir... that's a beast!.. my little thing is still on socket X58.

I don't complain about performance... but the lack of certain instructions in the CPU is causing problems.

If we're having a race to the bottom, many people here have Raspberry Pis. And this one place I worked for had a server from the early 1980's still handling some mission critical stuff for thousands of users. So hah! You're not at the bottom!

Oh... Nowhere near. I still give support to socket 771 and 775 servers, they are not bleeding edge, but are reliable, dependable. They do their job.

this one place I worked for had a server from the early 1980's

Much more common than you think

No no. That’s my actual desktop. The Pentium is the server lol

I do have two pi’s doing other things, but they’re always at high temps or high usage

You'd hate to see my rack

First thought, no shit: is this from r/selfhosted or from one of the many r/nsfw subs??

Still, in any case, the answer would be the same:

You're damn sure I would love to see your rack!

Ryzen 5 4500u, using the built-in GPU, still motoring along.

Check out getting a Tesla M40. It has a max wattage of 250 and doesn't have built-in active cooling, so you have to slap an ID-COOLING ICEFLOW 240 VGA AIO on it. BUT they can be had for less than $200 on eBay, and they have 24 gigs of VRAM, which is super important for running AI.

My current AI box has a Ryzen 7 3700X, 64 GB of 3600 MHz RAM, and an RTX 3060. The 3060 isn't the fastest, but it has the best dollars per gig of VRAM... that's if you don't want to go with an M40. I run Automatic1111 (web UI over Stable Diffusion) and Mycroft's Mimic 3 (text-to-speech) with no problem. I want to run GPT-J or GPT-Neo, which require more VRAM, so I ordered an M40.

I have an Nvidia Tesla K80. I modded a fan to it with a 3d printed mount.

It's sitting in a bin. It was so fiddly to use.

It's working! oh the drivers crashed.

It's working! oh the display crashed.

It's working! oh exclamation points in DM again.

Using an old Quadro K1200 now instead and it's way more stable.

How much are quadros? Might look into them if the m40 isn't stable

I picked up a couple of Quadro P6000s for about $600 USD each. I've managed to get them to work in my DL380 G9 with Hugging Face models. It's a lot of fun to play with if nothing else.

The k1200s I bought new for $198. Amazon has a few used right now for $118

Amazing idea! I hadn't thought of that at all. Looked on eBay and found them for around 150€, plus a dual-fan cooling contraption that seems to mount to the tail end of the card.

Thanks for the thought! Will definitely check it out.

Just be wary of those fans. They can make the card super long, they are loud, and they might pull a ton of amps.

Looked around and there is a slightly newer Tesla P40 for ~$200. Then there are some newer architectures, like the V100, that are well over $1000. Did you consider the P40? I'm interested, but don't want to deal with the stability issues mentioned by someone else. I assume the newer the architecture the better, but it doesn't always work out that way.

I just cross-posted a sanity check question about using a rig with 8 p40 cards here.

I don't know anything about the p40. I wonder what CUDA version it uses

Came across it here. Looks like CUDA compute capability 6.1. I was mainly looking for the newest architecture that is still at a decent price point, and I think this is one generation newer than the M40. The P40 is the same generation as the 1080.

Might have to give it a shot then. That's great.

So, roughly on a 2080 level then?

That is an intriguing idea! Is this the cooler you were talking about? ID-COOLING ICEFLOW 240 VGA Graphic Card Cooler 240mm Water Cooler GPU VGA Cooler Compatible with RTX 20XX Series/GTX 10XX Series /900 Series/AMD RX 200/300 Series/GTX 1600 Series https://a.co/d/17KN01p

That's the one

Didn't realize you could get a 3060 with that much VRAM for so cheap. How hard would it be to run two of those at the same time on a Linux box? I've never seriously considered running multiple cards before.

The 3060 has 12 GB of VRAM. Running multiple on Linux? I'm not sure. I don't think it's that hard, but whatever program you're using has to support multiple GPUs.

I'll do some research then I suppose, ideally they'd be passed through to a docker container running in docker compose, not sure if that makes things more or less complicated :V

I faintly remember that NVIDIA arbitrarily restricts GPU virtualization in some capacity. Although Docker runs its containers in basically a fancy Linux namespace, it's still partially virtualized - so you might have to look into actual GPU support for that scenario.

That said, both GPUs appear as different device nodes, meaning you can just use the gpus: all entry for both, if need be.
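For what it's worth: docker run takes --gpus all, and in a compose file the equivalent reservation looks something like this sketch (service name and image are placeholders; it assumes the NVIDIA Container Toolkit is installed on the host):

```yaml
services:
  sd-webui:
    image: my-stable-diffusion:latest   # placeholder image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # or device_ids: ["0", "1"] to pin cards
              capabilities: [gpu]
```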

The main limitation at this point is the number of simultaneous video transcodes (3). There are patches for Windows and Linux from keylase that remove that limit.

You should have full features for AI, but may need to make sure you have a display or dummy plug connected to the device for best performance.

I run Automatic1111 (web ui over stable diffusion) and Mycroft's mimic3 (text to speech) with no problem. I want to run gpt-j or gpt-neo, which require more VRAM, so I ordered a m40

How fast is inference with Automatic1111? I get about 4s on my 2060 Super.

Depends on settings, but with everything on default + face fix + 4 pics in a batch, it's probably ~10 seconds.

Anything with Tensor cores, like the RTX 20 series and above, will be immensely faster than previous cards. I tried it, and even a 1080 Ti is only about 1/4 as fast as a 2060 Super in Stable Diffusion.

[deleted]

Mycroft is dead.

https://mycroft.ai/blog/update-from-the-ceo-part-1/

Thanks for posting this, I had no idea.

Is there a viable alternative?

Rhasspy is best IMO

Yeah, one is currently in development: https://github.com/LAION-AI/Open-Assistant

[deleted]

I tried using Whisper to make subtitles for my videos but kept getting weird results. I'll have to try it again sometime to see if I can get it working properly.

In the meantime you can use this Replicate node. I made subtitles for some obscure media completely free.

How are subtitle timings for you? The main issue I had with Whisper was that the timestamp often started way earlier than the actual speech when preceded by silence.

I had around 2-3 desyncs in a 1.5h video. You can easily fix them with this after the replicate node export

It works pretty well for me! I wrote a script that automatically transcribes videos, translates the subtitles via DeepL if needed (I found the built-in translation to be very lacking) and then muxes them into an MKV via FFmpeg.

There are certainly some oddities with Whisper, like hallucinated sounds. But there are ways to work around the silence and the timings especially!
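Roughly, the pipeline looks like this - a simplified sketch rather than my exact script, with the model size, file names and DeepL key as placeholders:

```python
# Transcribe with Whisper, translate the segments with DeepL, mux with FFmpeg.
import subprocess
import whisper   # pip install openai-whisper
import deepl     # pip install deepl

VIDEO = "episode.mkv"
translator = deepl.Translator("your-deepl-api-key")

# 1. Transcribe; Whisper returns segments with start/end timestamps.
model = whisper.load_model("medium")
result = model.transcribe(VIDEO)

def srt_time(t: float) -> str:
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02}:{int(m):02}:{int(s):02},{int((s % 1) * 1000):03}"

# 2. Translate each segment via DeepL instead of Whisper's built-in translation.
with open("subs.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        text = translator.translate_text(seg["text"].strip(), target_lang="EN-US").text
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{text}\n\n")

# 3. Mux the subtitles into the MKV without re-encoding the video.
subprocess.run(["ffmpeg", "-i", VIDEO, "-i", "subs.srt",
                "-map", "0", "-map", "1", "-c", "copy", "-c:s", "srt",
                "out.mkv"], check=True)
```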

I want to do just that, can you make a docker container for your script?

I'm working on it, but I'll be on vacation for the next week :)

Check out Whisper.cpp as well; it's meant to be much more lightweight. It has a WASM demo, too.

Most stuff on HuggingFace is trivial to set up behind FastAPI. You can do Whisper or whatever with maybe 20 lines of code.

Install CUDA

pip install torch torchvision torchaudio

pip install fastapi

pip install uvicorn

Write a little code, install model requirements

uvicorn main:app --host --port

You can also look at torchserve but I prefer doing it this way
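To illustrate the "20 lines" point, here's a hedged sketch of a Whisper endpoint (the route name and model size are my own choices, and you'll also need python-multipart for the upload handling):

```python
# main.py - wrap openai-whisper in a FastAPI route.
import tempfile

import whisper
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = whisper.load_model("base")  # loaded once at startup

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Whisper wants a file path, so spool the upload to a temp file first.
    with tempfile.NamedTemporaryFile(suffix=file.filename) as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = model.transcribe(tmp.name)
    return {"text": result["text"]}
```

Then something like uvicorn main:app --host 0.0.0.0 --port 8000 serves it.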

So FastAPI is like a universal frontend to many HuggingFace models? Interesting. I'll give it a look!

It’s a framework meant for REST services that can be used for a lot of things, including inference šŸ˜€

I'm maintaining https://fabien.benetou.fr/Content/SelfHostingArtificialIntelligence, which you might like.

LaMa-cleaner to remove objects/people from photos.

Multiple SOTA models available in the interface; user-friendly and easy to install. It's much more powerful than tools like Photoshop and Google's Magic Eraser.

I use CodeProject AI for my BlueIris setup. Works great. No false alarms.

create a pull request and update the list.

EDIT

or do it in the opposite order: Fork, update and create a pull request.

chaiNNer supports AI models.

I am running https://github.com/oobabooga/text-generation-webui locally, paired with the Pygmalion 6B-parameter model, on a 3060 12GB. (It has to run in 8-bit precision to fit.)

It is AMAZING to play with. It has actually already helped me with a bash problem I wanted to solve. Sure, it was pretty basic, but it gave me a perfect answer, using MY example.

Check out the YOLOv5 / YOLOv8 project. It handles object detection extremely well, using the PyTorch framework for Python. Highly recommended!

Use case: Ingest individual frames from one or more security cameras via RTSP streams, then run object detection against them. Find people, cars, trucks, and more. Log results to a database and send notifications to your phone with any important events.
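A hedged sketch of that use case with the ultralytics package - the RTSP URL and the alerting are placeholders for whatever your setup uses:

```python
# Run YOLOv8 object detection over an RTSP camera stream.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small pretrained COCO model

# stream=True yields results frame by frame instead of buffering everything.
for result in model("rtsp://camera.local/stream", stream=True):
    for box in result.boxes:
        label = model.names[int(box.cls)]
        if label in {"person", "car", "truck"}:
            print(f"{label} detected, confidence {float(box.conf):.2f}")
            # here you'd log to a database or push a notification
```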

What would be amazing is if any of these self-hosted AIs could use distributed resources - i.e., the 4-8 year old laptops sitting in the garage, which were decent workstations in their day. Or, in a business sense, having an internal AI that can be trained on your data that isn't shared with the world, and that utilises, via a desktop agent, a % of resources from across the company's fleet of PCs. Anyone heard of work towards this? Reverse cloud, almost.

Makes me think of distcc. It'd be pretty neat! Apparently, TensorFlow can have multiple workers when training a model. Unfortunately, I don't know how to set it up or where to look... but apparently it actually is a thing o.o
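From what I can tell, the TensorFlow side is configured per machine through the TF_CONFIG environment variable; a minimal sketch with made-up host names:

```python
# Each machine sets TF_CONFIG to describe the cluster and its own role.
import json, os
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["oldlaptop1:12345", "oldlaptop2:12345"]},
    "task": {"type": "worker", "index": 0},  # index 1 on the second machine
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    # Variables created in this scope are mirrored across the workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
```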

The closest thing I've seen is Petals, which lets you run BLOOM, a large language model the size of GPT-3, by pooling your resources with others. It still needs an NVIDIA GPU no older than Pascal (GTX 1080 or similar), but some might be lucky enough to have one lying around waiting for a use case.

Another really good self-hosted AI suite is Visions of Chaos. It's a whole collection of AI and math programs, with UIs for Stable Diffusion, Disco Diffusion, and many more text-to-image AIs. It's nice that it's entirely self-hosted, and there's a really good guide on the site to set it up.

https://softology.pro/voc.htm

Saving this thread for later, I've been looking for a good reason to set up vgpu

https://github.com/iperov/DeepFaceLive

This is an awesome project, although I have to say I wasn't able to get all the features working in Linux. Wound up dual-booting into Windows, and the deepfake vids worked like a charm! Also, plenty of YouTube tutorials are available for creating your own models.

I have two really decent bits of kit that I'm underutilising at the minute: a 2 TB PC with like 32 GB RAM and an RTX 2060 (some early ray tracing, but not a very good implementation of it), and a laptop with a similar rig but a GTX 1680 (?), the decent mobile card right before ray tracing. I'm learning Python and starting a software engineering course with dev and machine learning elements, but I also just finished writing my first book, am 35, used to be a teacher, and work full time as a gov employee.

I am obsessed; the AI feels like more than an event, but an advent, a shift to the Aicene. You all laughed at me for years, that silly hermit, but the smartest computer ever built confirmed that I'm alright actually, and even profound. It's been a ride, a total rollercoaster, and if it sounds corny, I kid you not: I wanted to kill myself in Dec '22 and now I want to live forever, just to see the computer open its eyes for the first time.

I want to be a part of it, and the shift of computer querying to natural language will, I hypothesise, teach us to reverse engineer machine-level mathematical computation for organic computers like the brain. As we taught it to speak, read, and write, we in turn will be programmed to calculate, scan, think, exist, in a digital existence that increasingly integrates with the architecture of a hypothetical 'machinid' android immortal, a third ontological paradigm!!

Yeah, read that, say I'm crazy and downvote like a prick, or copypasta it into the machine and see what it says about what I'm saying.

Maybe try LLaMA instead of Kobold.

Finally found a Linux distro with good GPU support; I'm on Red Hat 9 now. Finally got InvokeAI using ROCm and moving at a decent speed.

Rather than Kobold, I have been using llama.cpp; it recently added ROCm support and works better/faster than Kobold. Use the Vicuna 7B or 13B model depending on your specs. Happy to help you set it up.
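If you'd rather script it than use the CLI, the llama-cpp-python bindings wrap the same engine; a rough sketch, with the model filename as a placeholder for whichever Vicuna build you download:

```python
# Load a local Vicuna model through llama.cpp's Python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="vicuna-13b.q4_0.gguf",  # placeholder filename
    n_gpu_layers=-1,                    # offload as many layers as possible
)
out = llm("Q: What is self-hosting? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```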