Day 30 of 100 Days Agentic Engineer Challenge: $6k Hardware to run DeepSeek R1 locally
With the open-source release of DeepSeek R1 we got an amazing tool: the possibility to run a very powerful reasoning LLM locally. We can finally have our own AI. But how do we do it? Don’t we need hundreds of thousands of dollars’ worth of hardware to run it locally? No, we can run it on about $6k worth of equipment. I’ll list all the hardware parts in a moment, but first let’s review my daily tasks routine.
Daily Tasks Routine
- Physical activity — I did 33 push-ups again, and for the fourth day in a row I went outside and carried sand in a wheelbarrow to fill the holes in my road. It’s a nice way to get some exercise outside and do something useful.
- Seven hours of sleep — I slept for 7 hours, but still went to bed too late.
- AI Agent — Working on my AI chat with a simple AI agent.
- PAIC — In queue.
- Data Science — In queue.
If you want to know what all these tasks are about, read the introduction to the 100 Days Agentic Engineer Challenge.
Complete hardware and software setup for running DeepSeek R1 locally
This article is based on an X.com post by Matthew Carrigan (@carrigmat), and I wanted to copy the list here so it doesn’t get lost in the social media jungle. Here is the link to the post: https://x.com/carrigmat/status/1884244369907278106. Below are the hardware list and descriptions from the post:
- Motherboard: Gigabyte MZ73-LM0 or MZ73-LM1. We want 2 EPYC sockets to get a massive 24 channels of DDR5 RAM to max out that memory size and bandwidth.
https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ73-LM0-rev-3x
- CPU: 2x any AMD EPYC 9004 or 9005 CPU. LLM generation is bottlenecked by memory bandwidth, so you don’t need a top-end one.
https://www.newegg.com/p/N82E16819113865
- RAM: This is the big one. We are going to need 768GB (to fit the model) across 24 RAM channels (to get the bandwidth to run it fast enough). That means 24 x 32GB DDR5-RDIMM modules. Example kits:
https://v-color.net/products/ddr5-ecc-rdimm-servermemory?variant=44758742794407
https://www.newegg.com/nemix-ram-384gb/p/1X5-003Z-01FM7
- Case: You can fit this in a standard tower case, but make sure it has screw mounts for a full server motherboard, which most consumer cases won’t. The Enthoo Pro 2 Server will take this motherboard:
https://www.newegg.com/black-phanteks-enthoo-pro-2-server-edition-full-tower/p/N82E16811854127
- PSU: The power use of this system is surprisingly low! (<400W) However, you will need lots of CPU power cables for 2 EPYC CPUs. The Corsair HX1000i has enough, but you might be able to find a cheaper option: https://www.corsair.com/us/en/p/psu/cp-9020259-na/hx1000i-fully-modular-ultra-low-noise-platinum-atx-1000-watt-pc-power-supply-cp-9020259-na
- Heatsink: This is a tricky bit. AMD EPYC is socket SP5, and most heatsinks for SP5 assume you have a 2U/4U server blade, which we don’t for this build. You probably have to go to Ebay/Aliexpress for this. I can vouch for this one:
https://www.ebay.com/itm/226499280220
- And if you find the fans that come with that heatsink noisy, replacing them with 1 or 2 of these per heatsink instead will be efficient and whisper-quiet:
https://www.newegg.com/noctua-nf-a12x25-pwm-case-fan/p/1YF-000T-000K7
- And finally, the SSD: Any 1TB or larger SSD that can fit R1 is fine. I recommend NVMe, just because you’ll have to copy 700GB into RAM when you start the model, lol. No link here, if you got this far I assume you can find one yourself!
Put it all together and throw Linux on it.
Important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don’t forget!
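Once Linux is up, you can sanity-check that the BIOS setting took effect. A minimal sketch, assuming the numactl package is available on your distribution:

```bash
# With NUMA groups set to 0 in the BIOS, the OS should see a single
# NUMA node spanning all ~768GB of RAM across both sockets.
sudo apt install numactl   # Debian/Ubuntu; use your distro's package manager otherwise
numactl --hardware         # expect one node (node 0) owning all the memory

# lscpu should likewise report "NUMA node(s): 1"
lscpu | grep -i numa
```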
Software setup
- Follow the instructions here to install llama.cpp: https://github.com/ggerganov/llama.cpp (see the build sketch after this list).
- Next, the model. Time to download 700 gigabytes of weights from @huggingface. Grab every file in the Q8_0 folder here (see the download sketch after this list):
https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main
- For a quick demo do this:
llama-cli -m ./DeepSeek-R1.Q8_0-00001-of-00015.gguf --temp 0.6 -no-cnv -c 16384 -p "<|User|>How many Rs are there in strawberry?<|Assistant|>"
- If all goes well, you should witness a short load period followed by the stream of consciousness as a state-of-the-art local LLM begins to ponder your question.
- And once it passes that test, just use llama-server to host the model and pass requests in from your other software (see the serving example below). You now have frontier-level intelligence hosted entirely on your local machine, all open-source and free to use!
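For reference, a CPU-only build of llama.cpp is usually just a clone plus a CMake build. A minimal sketch following the project's generic instructions (check the repository README for the current steps):

```bash
# Build llama.cpp from source (CPU-only is all this build needs -- there is no GPU)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# The tools used below (llama-cli, llama-server) end up in build/bin/
```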
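Downloading the ~700GB of Q8_0 shards is easier to script than to click through. A sketch using the Hugging Face CLI, assuming the Q8_0 files sit in a folder named DeepSeek-R1-Q8_0 inside the repo (verify the folder name against the repo's file listing first):

```bash
# Pull only the Q8_0 shards from the unsloth/DeepSeek-R1-GGUF repo
pip install -U "huggingface_hub[cli]"
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-Q8_0/*" \
  --local-dir ./DeepSeek-R1-GGUF   # folder pattern is an assumption -- check the repo listing
```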
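As a concrete example of that last step, llama-server exposes an OpenAI-compatible HTTP API, so any client that speaks that protocol can talk to your local R1. A minimal sketch (host, port and context size are just example values; pointing at the first shard is enough, the remaining shards are picked up automatically):

```bash
# Host the model over HTTP (same first shard as in the llama-cli demo above)
./build/bin/llama-server -m ./DeepSeek-R1.Q8_0-00001-of-00015.gguf \
  -c 16384 --host 0.0.0.0 --port 8080

# Query it from another terminal (or from your own software) via the
# OpenAI-compatible chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How many Rs are there in strawberry?"}], "temperature": 0.6}'
```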
Additional info:
- And if you got this far: Yes, there’s no GPU in this build! If you want to host on GPU for faster generation speed, you can! You’ll just lose a lot of quality from quantization, or if you want Q8 you’ll need >700GB of GPU memory, which will probably cost $100k+
- Since a lot of people are asking, the generation speed on this build is 6 to 8 tokens per second, depending on the specific CPU and RAM speed you get, or slightly less if you have a long chat history. The clip in the original post is near-realtime, sped up slightly to fit video length limits.
- Another update: Someone pointed out this cooler, which I wasn’t aware of. Seems like another good option if you can find a seller!
https://www.arctic.de/en/Freezer-4U-SP5/ACFRE00158A
Disclaimer: At this point I would like to mention again that the list and content are taken from the following post on X.com: https://x.com/carrigmat/status/1884244369907278106.
I simply didn’t want to lose it, and I wanted to add it to my Agentic Engineer journey. I will also edit this post in the future and add my local prices for the hardware.