A number of aspiring Guild operators have been asking about using surplus ex-hosting machines, servers retired from data-center duty. You can get these at deeply discounted prices on both Amazon and eBay. Amazon listings tend to be more clearly specified, so YMMV when buying on eBay, which you likely already knew.
Now, as to this question:
“What’s your analysis of self-hosting LLMs, and especially of getting some of those old Xeon-core RAM monsters that pop up used on Amazon and eBay as ‘reconditioned’? Useful?”
ChatGPT said:
You’re absolutely on point to be considering this — self-hosting LLMs is very much in the tinkerer’s sweet spot right now, especially if privacy, performance control, and long-term utility are on your radar. Here’s a detailed analysis of self-hosting LLMs, and specifically, the utility of old Xeon-core, high-RAM reconditioned servers for that purpose:
________________________________________
🧠 Why Self-Host LLMs?
1. Data privacy & sovereignty
o Your prompts and outputs stay local. No third-party logging or data leakage.
o Particularly appealing if you’re running sensitive or proprietary experiments (e.g., scalar wave research or unpublished Alzheimer’s theories).
2. Customization & control
o You can fine-tune models, prune them, or run inference chains tailored to your needs.
o Ideal if you’re integrating domain-specific tools, plug-ins, or even building hybrid reasoning stacks (e.g., merging AI with physical lab equipment or waveform analysis).
3. Offline resilience
o You can operate without depending on internet, API keys, or vendor uptime — a big plus for rural or disaster-prone locations.
4. Cost at scale
o After the initial hardware investment, running inference locally can be cheaper than metered API calls for high-volume use.
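A rough back-of-the-envelope check on that claim, with every number an illustrative assumption (your electricity rate, workload volume, and whatever a hosted API actually charges will differ):

```python
# Break-even sketch: local inference vs. a metered API.
# Every figure below is an illustrative assumption, not a quoted price.

WATTS = 450                    # assumed average draw of an older Xeon box under load
KWH_PRICE = 0.15               # assumed electricity price, $/kWh
HOURS_PER_MONTH = 24 * 30

local_power_cost = WATTS / 1000 * HOURS_PER_MONTH * KWH_PRICE    # $/month, power only

API_PRICE_PER_MTOK = 1.00      # assumed hosted-API price, $ per million tokens
TOKENS_PER_MONTH = 50_000_000  # assumed high-volume monthly workload

api_cost = TOKENS_PER_MONTH / 1_000_000 * API_PRICE_PER_MTOK     # $/month

print(f"Local power cost: ~${local_power_cost:.0f}/month (hardware amortization not included)")
print(f"Metered API cost: ~${api_cost:.0f}/month at this volume")
```

Under these made-up numbers the two land in the same ballpark, so the local box only wins on cost once the hardware is amortized or the token volume climbs well past the assumed figure.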
________________________________________
🔧 Old Xeon RAM Monsters: Still Useful?
Short answer: Yes — for many large language models, especially if the server has lots of RAM and reasonable GPU support. Let’s break it down:
✅ Strengths
• High RAM (256GB+ often possible)
o Ideal for models that are RAM-hungry but don’t require top-tier GPU power.
o Good for running multiple smaller models or one large quantized model (e.g., a GGUF build at Q4_K_M, or a GPTQ quantization).
• CPU-only inference is possible
o You can run LLMs like Mistral 7B, LLaMA 13B, or even some 30B models entirely on the CPU, though they will be much slower than GPU-based setups (see the minimal sketch after this list).
o Batch jobs, chat assistants, or background summarizers don’t need real-time speed.
• ECC RAM & Server Uptime
o Reconditioned Xeon boxes are rock-solid for long-term operations — ideal if you want a 24/7 AI lab rig.
• Expandable & Modular
o Many of these servers can take GPU cards, NVMe SSDs, or even run Proxmox with VMs for isolated LLM sandboxes.
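As a concrete illustration of the CPU-only point above, here is a minimal sketch using the llama-cpp-python bindings with a quantized GGUF file. The model path, context size, and thread count are placeholders; it assumes llama-cpp-python is installed and a quantized model has already been downloaded.

```python
# Minimal CPU-only inference sketch with llama-cpp-python and a quantized GGUF model.
# Path, context size, and thread count are placeholders; tune them for your box.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,      # context window
    n_threads=16,    # spread across the Xeon's physical cores
    n_gpu_layers=0,  # 0 = pure CPU; raise this if a CUDA GPU is added later
)

out = llm.create_completion(
    "Summarize the trade-offs of self-hosting LLMs on used Xeon servers.",
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```

Expect throughput in the low tokens-per-second range for a 7B Q4 model on older Xeon cores, which is fine for batch summarization but sluggish for interactive chat.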
❌ Limitations
• High power consumption
o Older Xeons suck watts. At 300–600W continuous load, you’ll notice it on your electric bill.
• Loud fans & rack footprint
o Not exactly home-office friendly unless isolated in a garage, server closet, or insulated shed.
• Limited GPU compatibility
o Check motherboard PCIe specs carefully. Many older boards top out at PCIe 3.0 and may not give a modern high-end GPU the full 16-lane PCIe 4.0 bandwidth it’s designed for (a quick link-check sketch follows this list).
o Also, power delivery may be insufficient without upgraded PSUs.
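As a follow-up to the PCIe point, one quick sanity check after installing a GPU in an older board is to read the negotiated link speed and width from Linux sysfs and compare them with the maximums the slot reports. A sketch, assuming a Linux host; devices that don’t expose these attributes are simply reported as n/a.

```python
# Print negotiated vs. maximum PCIe link speed/width for display-class devices (Linux sysfs).
# Useful for spotting a GPU that trained at x4/x8 or at a lower PCIe generation.
from pathlib import Path

def read_attr(path: Path) -> str:
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"

for dev in Path("/sys/bus/pci/devices").iterdir():
    if not read_attr(dev / "class").startswith("0x03"):  # 0x03xxxx = display controllers
        continue
    print(dev.name,
          "| speed:", read_attr(dev / "current_link_speed"), "of", read_attr(dev / "max_link_speed"),
          "| width:", read_attr(dev / "current_link_width"), "of", read_attr(dev / "max_link_width"))
```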
________________________________________
🛠️ Hardware Priorities for Local LLM Hosting
Component | Why it matters
RAM (128–512GB) | Essential for large models (or several smaller ones). The more, the better.
GPU (NVIDIA 3090, A6000, 4090, or H100 if budget allows) | Dramatically speeds up inference and fine-tuning. CUDA support required.
Fast storage (NVMe SSDs) | Reduces model load time and swap lag.
Cooling / noise management | Rack fans are LOUD. Consider location and airflow planning.
PSU wattage & rail stability | Especially critical if adding a power-hungry GPU.
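For the RAM row, a rough rule of thumb is parameter count × bits per weight ÷ 8 for the weights themselves, plus headroom for the KV cache and runtime. A sketch of that estimate; the 1.2× overhead factor and the bits-per-weight figures are loose assumptions, not measurements.

```python
# Rough RAM estimate for a quantized model: weight bytes plus an assumed overhead factor.
# The 1.2x overhead (KV cache, buffers, runtime) is a loose assumption, not a measurement.

def approx_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1024**3

for name, params_b, bits in [
    ("7B at ~4.5 bpw (Q4_K_M)", 7, 4.5),
    ("13B at ~4.5 bpw", 13, 4.5),
    ("30B at ~5.5 bpw (Q5)", 30, 5.5),
    ("70B at ~4.5 bpw", 70, 4.5),
]:
    print(f"{name}: ~{approx_ram_gb(params_b, bits):.0f} GB")
```

By this estimate a 256GB box comfortably holds a quantized 70B model with room left over for several smaller ones, which is exactly the niche these servers fill.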
________________________________________
🤖 Model Recommendations (as of Q1 2025)
Model | Size | Notes
Mistral 7B (GGUF) | Small | Fast, solid reasoning; runs in 16–32GB RAM with quantization
DeepSeek-MoE / R1 | Huge (up to 671B) | Needs GPU(s) and lots of VRAM unless using sparse inference
Phi-2, TinyLlama | Tiny | Great small assistants, lightweight models
LLaMA 2 / 3 / Code Llama | Med–Large | Versatile, increasingly open, good quality
Mixtral / WizardCoder | Med–Large | Specialized reasoning / code agents
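If you just want to try the models in this table, local runtimes such as Ollama expose an HTTP endpoint you can script against. A sketch against Ollama’s default port; it assumes Ollama is installed and running, that a model has already been pulled, and that the model name and prompt are placeholders.

```python
# Query a locally served model over HTTP (Ollama's default endpoint on port 11434).
# Assumes Ollama is running and a model (e.g., via `ollama pull mistral`) is available.
import json
import urllib.request

payload = {
    "model": "mistral",  # placeholder model name
    "prompt": "Give me three good uses for a reconditioned Xeon server.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```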
Please send comments as appropriate.
-filed by Anti-Dave