Scaling AI with Open Source: Running Models on CPUs and GPUs in VMs on OpenStack
OpenInfra Days at SCaLE 22x
Thursday, March 6, 2025
Unlock the power of open-source AI in a scalable and flexible environment by leveraging the capabilities of OpenStack. In this talk, Todd Robinson will explore how to run cutting-edge open source AI models using both CPU and GPU resources within VMs on OpenStack. Discover the advantages of using open-source tools, how to optimize resource allocation for diverse AI workloads, and best practices for scaling your AI infrastructure. We will also share insights from real-world implementations and discuss how a hosted OpenStack private cloud could enhance performance and security while controlling costs. Whether you’re a data scientist or an infrastructure manager, learn how to create a robust AI-ready environment with OpenStack.
Speaker Panel
Todd Robinson is the President and leader of the founding team of OpenMetal. Todd sets the strategic vision of the company, drives the product development of OpenMetal IaaS, and focuses on ensuring consistent growth. Todd also serves as an open source advocate and ambassador for ongoing usage of OpenStack, Ceph, and other key open technologies in modern IT infrastructure. The innovation around OpenMetal Cloud aims to bridge the gap between public and private cloud advantages, offering dynamic scaling and efficient resource management.
Rafael Ramos is the Director of Software Engineering at OpenMetal, where he leads the design and execution of customer-facing critical systems and applications on the OpenMetal platform. With a lifelong passion for building and creating, Rafael combines technical expertise with innovative problem-solving to deliver impactful solutions. An advocate for open source technologies, Rafael’s contributions began in the WordPress world and have grown to encompass the advanced systems powering OpenMetal’s infrastructure. His work ensures seamless service delivery for OpenMetal customers and their clients, driving scalability and efficiency at every level.
OpenMetal provides innovative private cloud infrastructure tailored for businesses looking for greater autonomy, security, and control over their cloud environments. Leveraging OpenStack technology, OpenMetal delivers a flexible, cost-effective alternative to public hyperscalers, enabling organizations to host mission-critical applications and data with unparalleled efficiency and privacy.
Video and Transcript
Todd Robinson:
All right, so I’m gonna start out the AI discussion with a disclaimer. Actually, I’ve got two disclaimers in here. First off, this is not my deck. The proper engineer was not able to make it, so I will try to do it justice. The other one is an example I use, so if you guys don’t mind, grab your phone there.
And so I’ve approached AI from a lot of different directions over the last couple years, whether I’m trying to learn it myself, whether it’s part of the business, whether it’s as an infrastructure provider. But I always wanna say something like this: it’s still the early days. So I use this example. If you want, go on your phone to your favorite AI chat, ChatGPT or whatever, and look up “what’s the best way to take my kid’s temperature.”
So grab your phone really quick so you guys can help me out, ’cause I was warned by Sash not to actually answer the question the way that AI answered it. So, as you might imagine, if somebody just wants to raise your hand, you can say it out loud if you’d like. It has gotten better.
But the first time that I did that, I was like, oh, my kid’s 11. And so, yeah, that’s not cool at all. So that’s my disclaimer to get us started, because AI is still in its early days. Yeah, I see a few people who have looked it up and said, yeah, I probably wouldn’t say that out loud. Alright.
Alright, people still getting there? Alright. So a little bit about InMotion, I’m sorry, a little bit about OpenMetal, and then we’ll dig into a couple of these things around private AI. I’m gonna spend a little bit of time on the virtual GPUs, but we did that talk actually not too long ago.
So we’ve got a QR code up there with more in-depth details on vGPU use, and then a couple of other things up there. But a little bit of background on OpenMetal. OpenMetal was founded out of a company called InMotion Hosting, and at InMotion Hosting we had been utilizing OpenStack and Ceph for many, many years.
In fact, we had multiple versions of OpenStack in various different parts of the business, and so we said, hey, you know what? This is pretty complicated to have this separated. Let’s consolidate this down, let’s make a standard system. So we partially did that. And as we did that, we realized, look, I’m a big open source guy.
I got into the world of CGI with Perl back in the day, and that’s how I was able to change from being an automotive engineer over into the computer science world. So I have this real fondness in my heart for open source. We had been using Ceph and OpenStack for many years and realized, you know, we were able to architect a new, fresh OpenStack with confidence.
But one of the barriers for OpenStack was the ease of adoption; the systems are very complicated. And if you’re architecting it for the first time and you’ve never been an administrator of a really good OpenStack, you’re really setting yourself up for some difficult days ahead. So what we decided was, hey, actually, we can productize this, because we’re also a hosting company.
We’ve got hardware easily on demand. So essentially we figured out how to spin up a three-server, hyper-converged, highly available cluster with Ceph, and basically just spin it out almost instantly. It actually takes like an hour, maybe 45 minutes to 90 minutes, depending on the servers themselves, to warm up to that state, coming, you know, from Ironic all the way up.
And then it comes out, it gets injected with the final customer details, and it’s handed back to the customer. So from the customer’s perspective, it’s literally 45 seconds; it’s just injecting the final pieces in. So that’s what we do, that’s where we sit in this space. So let’s see here.
This is the team that would’ve been speaking today, unfortunately. So if there’s any mess-ups, the guy on the right over there, that was him. No, so actually Rafael Ramos, who’s up there, is very, very talented and has done the bulk of this work. So I unfortunately don’t get a whole lot of time on the keyboard directly with this stuff, but I’m very familiar with it, both from being in an open source, AI-centric startup and having experience from that side, being a hosting company, as well as being an infrastructure provider and of course a user myself.
But yeah, if you wanna get ahold of either of us, we’re up there. Alright, so quickly, what is private AI, or my perspective at least on it? So this is you doing it yourself, on your own systems. You can absolutely use CPUs and GPUs. And as you go down the AI road, you’re of course gonna need to have a deep familiarity.
So I wanted to touch on a few of these things as we go through them. There are some obvious ones, as you know: data privacy and security. But I also like to say, and it’s not up on there, that when you’re doing it yourself, or you’re enabling your company or your customers to do it themselves, there’s a certain amount of velocity that you can get out of this, ’cause you actually understand it.
And it gives your teams, and yourself, different directions that you can take with it. So again, I’m a huge advocate of private AI. This shouldn’t only be out there with the mega AI companies, or whatever we want to call these mega companies nowadays. So I listed down there the fundamental understanding you’ll get.
You’ll build this familiarity, but I would also say, if those are foreign things to you in the AI world, you want to go learn about them pretty quickly. So yeah, model resources versus capability versus cost, that’s gonna come together for you. You need to know reinforcement learning, quantization.
Quantization has obviously happened in the last year and a half, two years, where the companies have really realized you don’t need to run these models at extreme precision. Floating point 32, that’s up there, is a lot of math that wasn’t really necessary. And so a lot of the models have been quantized down to INT8, which is a much more reasonable way to go about it.
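As a rough illustration of the idea, and not the exact tooling anyone here uses, here is a minimal PyTorch sketch of post-training dynamic quantization on a placeholder model. Real LLM quantization usually goes through llama.cpp/GGUF, bitsandbytes, or similar, but the principle is the same: weights drop from FP32 to 8-bit.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for an LLM's linear layers.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Post-training dynamic quantization: Linear weights go from FP32 to INT8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(model[0])      # Linear(...) with FP32 weights
print(quantized[0])  # should show a dynamically quantized Linear holding INT8 weights
```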
Yeah, so distillation, that’s one of my favorites at the moment. That’s essentially when you take a preexisting model and run it through basically an educational process where, as you transfer the information, you’re training specifically against something that you may want.
Essentially you can take a really large teacher LLM, which might be running in memory at, I don’t know, 80, a hundred gigs, very, very large, not practical to really run except for very specific spots, and you can drop it into a student LLM. So this might be like a Llama 3 something, quantized, and maybe it’s quite small; it’ll turn out to be maybe three gigs in RAM.
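To make that concrete, here is a minimal sketch of the kind of loss used in that teacher-student setup, assuming PyTorch; model and data loading are left out, and the temperature and alpha values are illustrative rather than anything Rafael actually used.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of a 'soft' teacher-matching loss and the normal 'hard' task loss."""
    # Student mimics the teacher's temperature-softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Ordinary cross-entropy against the ground-truth tokens.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```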
So these are areas that you’re gonna want some familiarity in. Mixture of experts: that’s been around for a little bit, but it was popularized and brought into focus by DeepSeek recently.
Essentially they have this monster model. It’s kind of funny, when DeepSeek came out in the public eye and you saw NVIDIA’s stock tank, then like four or five days later NVIDIA releases a NIM that’s specific for DeepSeek, and you can only run the full DeepSeek thing on some giant cluster.
It’s a huge model, it’s really large. So as you can see, the original model is a very, very large one, and you essentially have to use these huge systems; but if you’re distilling it out of there, you obviously can get to a much smaller model. So mixture of experts is pretty cool stuff.
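For anyone who hasn’t looked inside one of these, here is a toy sketch of the routing idea behind mixture of experts, again in PyTorch. It is nothing like DeepSeek’s production architecture, but it shows why only a slice of a huge parameter count actually runs for any given token.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Route each token to its top-k experts; the remaining experts stay idle."""
    def __init__(self, dim=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):  # only selected experts run
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 256)).shape)  # torch.Size([10, 256])
```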
Let’s see here. Okay, so yeah, tipping point: you know, when is private AI viable? And I think the biggest lesson here is that the models have become easier to use and the kits are more standard, so some of them are listed up there, NVIDIA NIM for example, vLLM. As these are more and more standard, they’re just a lot easier to use, a lot easier to pull in.
And so I’ve even watched some of the developers essentially flipping back and forth between using OpenAI’s API and a local LLM, trivially, within their applications. They’re really just getting a lot easier to learn and to use. But still, you’re gonna want to have some of these fundamental understandings that I’ve mentioned before.
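A sketch of why that flip is trivial: many local inference servers (llama.cpp’s server, vLLM, and others) expose an OpenAI-compatible endpoint, so often only the base URL and model name change. The URL and model name below are placeholders, not anything from the talk.

```python
from openai import OpenAI

# Hosted:  client = OpenAI()  # reads OPENAI_API_KEY, talks to api.openai.com
# Local:   point at an OpenAI-compatible server running on your own VM.
client = OpenAI(base_url="http://10.0.0.5:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-3-8b-instruct",  # whatever model the local server is serving
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
)
print(resp.choices[0].message.content)
```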
So one of the things we’re gonna actually cover really quickly today, to help people have easy access to this, is that these models have gotten small enough that CPUs are starting to become viable for many use cases. That doesn’t mean your fastest use cases, it doesn’t mean your most demanding use cases, but it does mean that it’s possible, and people can learn on it quickly.
That one I think might have run through already. Alright. So over on this one: again, GPUs of course are really king for training, you’re not gonna get around that. But unless you’re, you know, the hundred-billion-dollar company, you’re probably not even into that game. Certainly you can do it.
But in most cases we’re talking about inference. And with inference, we have now found, and we do this with several customers, I’ll get to some tests here in a minute, but we do this essentially because different customer use cases can require different things. Hey, let’s say I need a suite of developers to work on something.
They don’t need these super fast systems. Or let’s just say I’m using a very simple chatbot. Well, do I actually need something to be as fast as an A100 or an H100 or an H200? Typically we’re talking about these particular GPUs, and so I grabbed this the other day. I mean, they are fast.
They’re pretty incredible machines, and I’m a hardware person myself. But again, I grabbed some of the rough prices that you might expect out there, without giving the exact prices that we pay. An H100, you’re north of $25K a piece. The A100 80-gig, you’re north of about $13K right now. So if people are getting better prices than that, please come talk to me later.
I’d like to know about that. They’re super fast, but they’re incredibly expensive. And that’s just the card, right? You’ve gotta put it in a box, and that box is gonna be a high-end box, ’cause it’s gotta handle all the power that’s actually running through this stuff.
So people kind of laugh at me. You definitely have the one-kilowatt, one-and-a-half-kilowatt box, and people just start looking at that and going, okay, this is a real challenge to actually handle all this. So again, you wanna make sure you understand some of your use cases, because it can really have a long effect all the way down the line, you know, into the data center, into the rack, trying to get sufficient cooling for it. And sufficient cooling then can dictate: alright, does it have to be liquid cooled? Can it be air cooled? Do I even have space in my liquid cooling or in my air cooling footprint?
Alright, so this one, really quick: this is what we have run through before, and the QR code up there will actually take you to our previous talk.
Again, myself not being close enough to it, I can’t take you through the demo like they would’ve. But that’s where we actually did the virtualization, and that was a talk from a previous year or two. So MIG is essentially where you take the card and slice it up into virtual GPUs, and it’s typically right around seven slices.
So I’ve got the example up there: if you’re doing a 40-gig card, you can slice it seven times and you end up with a GPU that’s got five gigs of RAM available to it. And then of course you can attach the VM directly to that. These things are all done relatively straightforwardly inside of OpenStack.
And again, the other video there covers that. You gotta watch out: MIG is only available on the A100, H100, and H200, and you have to use time slicing for a lot of the other GPUs that are out there. We’re not as familiar with that; we don’t actually do it as much. It’s actually more common for us to attach the GPU directly to a VM, or for somebody to use bare metal, or to use MIG.
Of note, I think there’s still licensing associated with that. I’ve kind of lost track of which parts, but the licensing has been around for a while, so you have to set up this licensing server, which is a bit of a tricky thing. But I do know that some of the Intel GPUs that are coming out are not licensed that same way.
So maybe it’s my own wishful thinking, but the community is hopeful that some of this licensing stuff will go away, ’cause it’s an added layer of cost and complexity. So this one, I don’t know if you can actually take a picture of that, but that’s actually what you can do to get MIG running. That’s something you’ll of course be able to look at inside of the deck later. That’s essentially our developer Rafael there, helping me out with what it would’ve been like if he actually was able to run that demo.
Alright. So I’m gonna jump then to the CPU; pretty much everybody’s familiar with GPUs and being able to run these workloads on a GPU, so I’m gonna actually spend a little bit of time on the CPU. In this case this is Intel, so I’ll just cover that one today. But essentially in the fourth generation they added some specific silicon.
You can see that up there, TMUL, if you need to look that up. Essentially it’s the AMX extensions: the fourth generation and the fifth generation now give you specific silicon, accessible by each of the cores, that can actually run some of these calculations for you. You may remember back when Intel added dedicated instructions to do encryption, right?
So it’s very similar to that; this is specific silicon to handle this workload. Right now we only run up to the fifth generation. For those of you tracking Intel, with the sixth generation the P-cores will have this, but they’re very power hungry. They’re really big CPUs, partly because they’ve been putting all this stuff in them.
And so we haven’t actually made our shift yet to the P-cores, to Xeon 6. So if anybody’s got their own opinion on that, maybe find me afterwards. I’d love some advice, because it looks like they’re gonna be tricky to get into boxes, and quite expensive. And again, it’s just the power envelope; when the power envelope gets too big, it gets pretty hard to fit into a chassis.
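If you want to know whether a given host (or a VM with host CPU passthrough) actually exposes that silicon, a quick check of the kernel’s reported CPU flags is enough. Here is a small sketch for Linux; the flag names are the standard amx_* entries.

```python
def cpu_flags():
    """Return the CPU feature flags reported by the Linux kernel."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("amx_tile", "amx_int8", "amx_bf16", "avx512_vnni"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```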
Right, so let’s see, hopefully this will play. I’m gonna switch over now to setting expectations, or requirements, for, oh boy, I can’t see that, setting requirements for what you might be trying to do with your workloads. All this really is, you can see it labeled up there: 25 tokens per second, 50 tokens per second on the side.
The point we’re just trying to make here is that when you’re a human and it’s a human interaction with an LLM like this, 25 tokens per second is plenty fast. It’s faster than what a human is gonna be able to interact with anyways. So it helps you get some perspective on, well, what am I actually gonna need to design against?
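A back-of-the-envelope check on that claim, with both numbers being rough assumptions rather than anything measured in the talk:

```python
# Rough figures, both assumptions: adult reading speed and tokens per word.
words_per_minute = 250
tokens_per_word = 1.3
reading_rate = words_per_minute * tokens_per_word / 60
print(f"A reader consumes roughly {reading_rate:.1f} tokens/s")   # ~5.4
# So ~25 tokens/s of generation comfortably outruns a human in a chat UI.
```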
And I know I’m switching all the way between, like, data centers and power issues, and then all the way to end use cases. But the reason we end up doing this is that when we as a company are learning a new technology, and OpenMetal is very much in that stage, we like to talk directly to the customer, figure out what you’re doing, and understand it in depth.
What is the issue that you’re facing? So in some cases we’re down at this level going, well, what are you trying to accomplish out of this? Like recently we’re having somebody come off of AWS who was using their natural language tool, AWS Transcribe. It’s a huge bill that they have, and we had to figure it out.
Okay, you gotta get off of that; their business model is not gonna sustain it. That’s now put us into natural language processing and understanding how to use Whisper in order to accomplish these similar things for them. So for us, hopefully this is useful, and that’s why I’m switching back and forth, all the way down to the data center and back up to use cases.
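For context on how small that swap can look in code, here is a minimal sketch of transcription with the open source openai-whisper package; the model size and file name are placeholders, and production setups often use faster-whisper or batched pipelines instead.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")               # tiny / base / small / medium / large
result = model.transcribe("customer_call.mp3")   # hypothetical audio file
print(result["text"])
```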
But so again, that was trying to show you that when you set your requirements, it can often help you go, okay, is this gonna be CPU, is it gonna be GPU, where are we gonna be with this? Alright, lemme see if I can get myself moving again.
So again, this is kind of following that same vein. The developer only had the A100 40-gig available to run these tests on, and then ran them against the CPUs. But you can see up there, you’ll end up getting familiar with tokens per second. This is just what the system can handle, how fast it can actually push these requests through.
So in this case I’ve got a few tests that I’ll flip to, but 30 tokens per second, you know, chat, code assist, document summary, these are often human-interaction-related things. And then if you’re over in, say, fraud detection, which is one we were actually working on recently, you end up actually having to have really high throughput.
So in that case maybe something like a CPU is not gonna make sense for you and you’re gonna be over on a GPU. It’s also started to become a little more common for people to use systems that make multiple calls to the LLM before it comes back and shows the human what’s going on.
So you gotta account for that; in that case you might be processing two or three times, and then you’re gonna end up with a higher tokens-per-second requirement. Some of the other things to watch out for: accuracy level requirements. Actually, one of the engineers we were just talking to was trying to use one of the smaller LLMs to produce some JSON, a Pydantic JSON dataset, and couldn’t get the lower-accuracy model to actually produce it accurately.
And so it would have syntax errors in there, versus the higher-cost, more accurate one, which was producing it correctly and reliably. So there is a trade-off. You’re certainly gonna find that out as you start to work with these, but I would encourage you to have the models available to you to flip back and forth, so you can make decisions like that and say, no, actually, this one works fine for this business model.
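A sketch of the kind of check that catches that failure mode, assuming Pydantic v2; the schema and the raw string are made up for illustration, and the idea is simply to validate the model’s output and retry or escalate to a larger model when it doesn’t parse.

```python
from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):          # hypothetical schema the LLM is asked to fill
    title: str
    priority: int
    tags: list[str]

raw = '{"title": "GPU node down", "priority": 1, "tags": ["infra"]}'  # LLM output

try:
    print("valid:", Ticket.model_validate_json(raw))
except ValidationError as err:
    print("invalid output from the model, retry or use a bigger model:", err)
```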
Right. Something to always note about CPUs is that, of course, these are general purpose, so you may already have lots of unused CPU in your systems out there. It’s not uncommon that people end up with a lot of CPU left over: you run out of RAM in your OpenStack but you’ve got plenty of CPU remaining. So it might even be that you’ve already got resources you can use, kind of free in that case.
All right, so.
I don’t truly know that much about that one right there, so we’ll have to move through it. Alright, so here are some other things that we found. Now I’ll get to share with you some of the stats that actually came out of the testing. In this case, why exactly it’s doing this,
I don’t think I can speak exactly to what that is. But as you can see here, in some cases, and again this is a Llama 3 3B quantized down to INT8, the tokens per second was actually highest utilizing 64 threads or 32 threads, on a 128-thread box.
And I think we’re still trying to puzzle out what that could be. One thing to check is the CPU governor: make sure the governor is set to performance, and don’t use ondemand for this. You probably know that already for various use cases, but ondemand is gonna give you a penalty hit because it’s trying to shift itself up and down performance-wise.
So don’t use ondemand. But yeah, as you can see here, you may end up figuring out where your sweet spot is, and it’s gonna be CPU-specific. And then we’ve got three CPUs that we tried it against, just to be able to show people what they might expect.
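A rough sketch of the kind of thread-count sweep behind numbers like these, using the llama-cpp-python bindings; the GGUF file name and prompt are placeholders, the timing is crude (it includes prompt processing), and, as noted above, the best thread count will be CPU-specific.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

for n_threads in (16, 32, 64, 128):
    llm = Llama(model_path="llama-3.2-3b-q8_0.gguf",  # placeholder GGUF file
                n_threads=n_threads, verbose=False)
    start = time.time()
    out = llm("Explain OpenStack in one paragraph.", max_tokens=128)
    tokens = out["usage"]["completion_tokens"]
    print(f"{n_threads:>3} threads: {tokens / (time.time() - start):.1f} tokens/s")
```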
Just for fun, he was switching AMX on and off on there. And you can see we tested the 4510, the 6526Y, and the 6530. Again, we ran it with only half the threads whenever we were doing this test. And you can see the top number there: for example, the 6530 was running 53 tokens a second, and 46 with the Gold, which has considerably fewer threads but actually a higher clock speed than the 6530.
So those are three pretty common CPUs that you see out there. You guys probably have them in your clusters already.
Yeah, he ran this again, so you can see something similar. This is for one of the distilled Llama DeepSeek models, and you can see the difference there. It is kind of interesting; again, that particular CPU, the 6526Y, has turned out to be like a workhorse for us.
Very popular with our customers for a whole bunch of different purposes. All right.
Actually, I’m gonna read you just a little bit of this, because these were some good findings, but again, I’m not as familiar with this stuff. So llama.cpp, lightweight: this was definitely one that he (this is Rafael, sorry about that) counts as his favorite now. There were some tweaks that had to be made to it to get it to really run efficiently on top of CPUs,
but it actually turned out to be something that he now turns to regularly. And the Intel PyTorch extension, of course, is up there, and OpenVINO. If you’re not familiar with OpenVINO, Intel has put that out; these are the libraries that give you access to the stuff that you need on the CPUs.
But that’s Intel’s toolkit.
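As a small sketch of what that toolkit looks like from Python, using OpenVINO’s 2023+ API and assuming you have already exported a model to OpenVINO IR; the file name is a placeholder, and LLM-specific work usually goes through higher-level wrappers such as optimum-intel.

```python
import openvino as ov  # OpenVINO 2023+ Python API

core = ov.Core()
print(core.available_devices)          # e.g. ['CPU'] inside a typical VM

model = core.read_model("model.xml")   # hypothetical IR file exported beforehand
compiled = core.compile_model(model, "CPU")
request = compiled.create_infer_request()
# request.infer({...}) would then run on the CPU, picking up AMX/VNNI where available.
```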
This is just kind of helping you understand where these fall in comparison. That is the A100 and what it can process, 171 in this case. But again, that particular one, the 40-gig, I would guess is probably like eight or nine thousand dollars if you’re buying it now.
And again, that is just the GPU itself. So whenever you’re looking at this stuff, you’re gonna be balancing the cost versus what the thing can actually handle. All right?
Yeah, this one is just again reminding us that this is in flux, this is changing. So the capabilities of these open source models, and this is off of Hugging Face, so it probably should have been cited there, but this is essentially showing how the accuracy of these models keeps increasing while they also become lighter and lighter weight.
And as that happens, it frees you up to run these in a lot of different places. It could be CPUs, it could be multiple instances running on a single GPU. That’s something of importance: when you’ve got a small model, you can of course run multiple at the same time.
Alright, this one: if you guys don’t mind, I’m gonna turn a little philosophical here. We have one of our customers, a company that I actually work with as well, called Open Maya, with applications published under Circle Bot AI. And OpenMetal has always had this philosophy of
simplifying open source, large-scale open source. When I look at the ecosystem that’s around AI, it’s a pretty complicated ecosystem. So I was excited about where we may be able to sit as an open source community, both in the OpenStack world and just leveraging the philosophies that we all have in the OpenStack world.
How do we apply these to the AI world? One of the things we were just discussing at lunch is that a lot of this is currently dominated by very, very large organizations, very large and only getting larger if it keeps going this direction. And for me, when I think of OpenStack and where we can fit in the AI world, partly it has to do with the community: this community is one of these rare communities that has the ability to scale, to have a scale that fits, and to fight against some of these larger organizations.
Because to me it feels like they’re gonna really try to dominate and drive forward the utilization of this stuff. And sometimes you even hear this, like, oh, whatever Altman has to say about how it’s time to, you know, replace your developers with this AI kit that they’re putting out, these kinds of things.
That doesn’t seem like a great thing for the world. And so for me, when I think about OpenStack, open infrastructure, and the community around it: are there some things that we can do as a group to help bring AI to a democratized place, for lack of a better term? To help people be closer to it, to learn it themselves, understand the business models, and to be able to change the direction of the larger group, to support hopefully what I would call the non-big-AI business models.
And let people be closer and more fundamental with AI. So yeah, that’s what I had to say today. Just one last thing: if you guys want to get in touch with any of us, or just kind of keep up with what’s going on over at OpenMetal as we’re starting to roll out our AI products.
Hopefully this stuff was useful. Again, I touched on everything from the data center all the way up to tokens per second. Alright, this again is the crew over there that you can get a hold of, the folks who are kind of front-facing: Rafael and I. And Rafael, sorry he couldn’t make it, but I know it’s recorded, so he can watch me talk about him later.
Cool. Alright, that’s what I’ve got for today. Are there any questions, or I think we can wrap it up?
No questions. All right. Who wants to tell me what they looked up as the best way to take the temperature of your child? We’ll circle back to that. Cool. Okay. No, thank you very much, and again, please come and chat with us. Thank you.
Related Content
Accurately measuring AI model performance requires a focus on tokens per second, specifically output generation rates. Understanding tokenization, model size, quantization, and inference tool selection is essential for comparing hardware and software environments.
This article highlights OpenMetal’s perspective on AI infrastructure, as shared by Todd Robinson at OpenInfra Days 2025. It explores how OpenInfra, particularly OpenStack, enables scalable, cost-efficient AI workloads while avoiding hyperscaler lock-in.
At OpenMetal, you can deploy AI models on your own infrastructure, balancing CPU vs. GPU inference for cost and performance, and maintaining full control over data privacy.