Architecting a Local Multi-Agent AI Council: Overcoming Consumer Hardware Constraints through Sequential Orchestration

A complete, fact-based engineering guide to building a self-correcting three-model council on an RTX 4070 — exploiting NVMe bandwidth and sequential VRAM eviction to bypass the 12 GB barrier.

Executive Summary of Localized Inference

The deployment of large language models has recently undergone a profound structural shift. The historical reliance on monolithic, cloud-hosted models is actively being challenged by decentralized, locally hosted inference frameworks. As enterprise data privacy concerns escalate and the demand for offline processing intensifies, deploying highly capable models on consumer-grade hardware has become a primary objective for systems architects. Realizing this objective on a local workstation, however, runs into one severe, unyielding physical constraint: Video Random Access Memory (VRAM) capacity.

The NVIDIA GeForce RTX 4070, equipped with 12 GB of GDDR6X VRAM, occupies the critical intersection between basic desktop computing and advanced parallel processing. 12 GB is enough capacity to fully load and execute a single quantized 7–8 billion parameter language model at excellent speed, but it strictly prohibits concurrent loading of three such models, an operation that would require 15 to 24 GB depending on context window configuration.

While a solitary 8B model is highly capable, isolated neural networks frequently suffer from localized hallucinations, factual drift over prolonged context generations, and confirmation bias during complex problem-solving. To achieve analytical fidelity that rivals massive proprietary server clusters, systems architects have adopted Multi-Agent Systems (MAS). By orchestrating a "council" of autonomous digital entities — each powered by a distinct neural architecture with a designated persona — the resulting synthetic output is subjected to rigorous peer review, internal auditing, and adversarial cross-verification.

Given the RTX 4070's 12 GB ceiling, the architectural solution relies on strict sequential swapping: load a model, execute its task, immediately unload it, load the next. This pipeline is fast enough to be practical when paired with modern NVMe solid-state storage, and the combined inferential capabilities of the council substantially reduce output hallucinations.

Hardware Topography: The VRAM Economic Reality

The Mathematical Constraints of 12 Gigabytes

A standard neural network parameter, trained in 16-bit floating-point precision (FP16), consumes exactly two bytes of physical memory. Consequently, an 8-billion parameter model such as Meta's Llama 3.1 8B natively demands approximately 16 GB of VRAM merely to load the static neural weights — before any dynamic context window is established. This immediately disqualifies the RTX 4070 from executing the model in its original, uncompressed state.
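The arithmetic can be sanity-checked in a few lines of Python. This is a back-of-the-envelope sketch (the function name is ours; real loaders add alignment and buffer overhead on top):

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Static weight size in GB: parameters times bits per weight, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Llama 3.1 8B in native FP16: 8e9 params at 16 bits each
print(weight_footprint_gb(8e9, 16))  # 16.0 GB, over the 12 GB budget
# The same weights at a flat 4 bits land near 4 GB; the real Q4_K_M file
# is 4.7 GB because some sensitive tensors are kept at higher precision.
print(weight_footprint_gb(8e9, 4))   # 4.0 GB
```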

Attempting to force a 16 GB file into a 12 GB hardware buffer results in partial offloading to the host CPU RAM across the PCIe bus, degrading token generation speeds from rapid GPU matrix multiplication down to unacceptably slow CPU processing.

Quantization Dynamics and Footprint Compression

The engineering mechanism used to solve this bottleneck is quantization: a mathematical process that systematically compresses model weights into lower-precision numerical representations. The industry standard for consumer hardware is 4-bit quantization, particularly the Q4_K_M scheme from the llama.cpp inference backend. By truncating weight precision from 16 bits down to a roughly 4-bit average (Q4_K_M keeps a few sensitive tensors at higher precision), the physical size of the model is reduced by roughly 70 percent: an 8B model shrinks from 16.0 GB down to 4.7 GB.

| Model | Parameters | Precision | File Size | VRAM Required |
| --- | --- | --- | --- | --- |
| Llama 3.1 | 8 Billion | FP16 (16-bit) | 16.0 GB | > 16.0 GB |
| Llama 3.1 | 8 Billion | Q8_0 (8-bit) | 8.5 GB | ~ 8.5 GB |
| Llama 3.1 | 8 Billion | Q4_K_M (4-bit) | 4.7 GB | ~ 4.7 GB |
| Qwen 2.5 | 7 Billion | Q4_K_M (4-bit) | 4.4 GB | ~ 4.4 GB |
| Mistral v0.3 | 7 Billion | Q4_K_M (4-bit) | 4.1 GB | ~ 4.1 GB |

Static model weight sizes only account for a portion of total VRAM consumption. Active text generation requires additional dynamic memory allocation driven by the Key-Value (KV) Cache — a temporal buffer storing the internal multi-dimensional representations of all previous tokens within a given sequence. For a standard 8B model operating with a moderate context window of 8,192 tokens, the KV cache consumes an additional 1.0 GB to 1.5 GB. A fully operational Q4_K_M 8B model therefore realistically consumes 6.0 GB to 7.5 GB of total VRAM during active generation.
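The KV-cache figure can be reproduced from a model's published architecture. A sketch assuming Llama 3.1 8B's configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 cache entries; the function name is ours):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Cache size: keys and values (factor of 2), per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Llama 3.1 8B at an 8,192-token context
cache = kv_cache_bytes(32, 8, 128, 8192)
print(cache / 2**30)  # 1.0 GiB, matching the 1.0-1.5 GB range above
```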

⚠️ Given the RTX 4070's strict 12 GB ceiling, loading a second 7B model into the remaining ~4.5 GB will trigger immediate VRAM exhaustion. Benchmarks show partial CPU offloading can plummet generation speeds from 40 tokens/sec down to 8 tokens/sec. Parallel concurrent execution of multiple distinct models is strictly contraindicated on 12 GB GPU architecture.

Storage Bandwidth and the Physics of Model Swapping

Because parallel execution triggers memory overflow, the proposed architecture relies entirely on high-velocity sequential model swapping. The physical viability of this pipeline depends heavily on the read speed of the host system's storage drive: each time a 4-bit quantized 8B model is loaded, approximately 4.7 GB of static data must travel from disk, through system RAM, into VRAM.

A modern NVMe SSD on PCIe Gen 4.0 or Gen 5.0 boasts sequential read speeds of 3,500 – 7,500 MB/s. At these bandwidths, transferring 4.7 GB of neural weight data into VRAM requires approximately 0.6 to 1.5 seconds — virtually imperceptible. Conversely, a legacy mechanical HDD at 100 MB/s would result in a load latency exceeding 45 seconds per model swap, rendering the entire sequential methodology prohibitively slow. High-bandwidth NVMe storage is a mandatory prerequisite.
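The latency figures above follow directly from transfer-rate arithmetic. A best-case sketch (the helper name is ours; it ignores filesystem and driver overhead):

```python
def load_seconds(model_gb: float, read_mb_per_s: float) -> float:
    """Best-case sequential transfer time for a model of the given size."""
    return model_gb * 1000 / read_mb_per_s

print(round(load_seconds(4.7, 7500), 2))  # 0.63 s on a Gen 5 NVMe drive
print(round(load_seconds(4.7, 3500), 2))  # 1.34 s on a Gen 4 NVMe drive
print(round(load_seconds(4.7, 100), 1))   # 47.0 s on a mechanical HDD
```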

The Cognitive Trinity: Strategic Model Heterogeneity

If multiple agents are powered by the identical foundational model, they share identical pre-training data distributions, identical algorithmic biases, and identical latent space conceptualizations. A model tasked with critiquing its own output is highly susceptible to confirmation bias — it will likely follow the same flawed inferential pathway that generated the initial error during the review phase.

The optimal 12 GB VRAM-tier council fuses three models from entirely distinct organizations: Alibaba's Qwen 2.5, Meta's Llama 3.1, and Mistral AI's 7B variant. Because they originate from different creators with different training pipelines, their biases overlap far less, making them much less likely to rubber-stamp each other's mistakes.

The Lead Researcher: Qwen 2.5 (7B)

Alibaba's Qwen 2.5 was subjected to a pre-training regimen utilizing an unprecedented 18 trillion tokens — a sheer volume of data exposure rarely observed in models of this compact weight class. Its architecture was heavily fine-tuned using targeted synthetic data encompassing complex mathematics, scientific textbooks, and multi-language coding snippets. This yields an engine that is brilliant at coding and logic, making it ideal as the Lead Researcher responsible for generating the exhaustive, fact-based initial draft.

The Ruthless Skeptic: Llama 3.1 (8B)

Meta's Llama 3.1 8B exhibits exceptional instruction-following adherence and highly nuanced analytical processing, making it optimal for forensic textual deconstruction. Because its pre-training lineage is entirely distinct from Qwen, its latent conceptual space approaches data from a divergent angle. When prompted to act as a Ruthless Skeptic, Llama 3.1 aggressively analyzes the Qwen-generated draft, purposefully hunting for logical fallacies, missing context, and factual inconsistencies. It is explicitly instructed to not write a final answer — its sole purpose is a structured adversarial critique.

The Final Arbiter: Mistral (7B)

While Mistral models occasionally score lower than Qwen or Llama on raw synthetic logic benchmarks, human-preference evaluations consistently highlight Mistral's strengths in writing style, tonal adherence, and structural cleanliness. As the Final Arbiter, Mistral reads the initial draft and the critique, resolves the discrepancies between the two, and writes the final verdict in polished, professional prose.

| Agent Role | Model | Provider | Primary Strength | Function in Pipeline |
| --- | --- | --- | --- | --- |
| Lead Researcher | Qwen 2.5 (7B) | Alibaba | Deep logic, coding, 18T token dataset | Generates the exhaustive, fact-based initial draft |
| Ruthless Skeptic | Llama 3.1 (8B) | Meta AI | Instruction following, safety alignment | Audits the initial draft for logical flaws |
| Final Arbiter | Mistral (7B) | Mistral AI | Structured outputs, superior tonal quality | Synthesizes draft and critique into a polished final verdict |

Phase 1: Installing the Backend Engine and Orchestration Framework

Phase 1 — Infrastructure

Deploying this cognitive trinity requires two foundational installations.

Ollama serves as the localized inference server. Built upon the llama.cpp infrastructure, Ollama abstracts the complexities of direct hardware memory allocation, operating as a background daemon that listens for API requests. Download and install Ollama from ollama.com. It installs natively on Windows, macOS, and Linux, establishing a local server that binds to http://localhost:11434.

Simultaneously, download and install Python, ensuring the critical "Add Python to PATH" option is checked during installation. Then install the orchestration framework:

terminal — install crewai
pip install crewai litellm

CrewAI structures the system into tangible hierarchical objects: Agents, Tasks, and Crews. LiteLLM acts as a universal translation layer, routing standard framework API calls to local or remote providers. Installing both explicitly prevents the common runtime error ImportError: Fallback to LiteLLM is not available.
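With both pieces installed, it can be worth a quick sanity check that the Ollama daemon is actually listening before any agents are wired up. A minimal sketch using only the Python standard library (`/api/tags` is Ollama's model-listing endpoint; the helper name is ours):

```python
import urllib.request
import urllib.error

def ollama_reachable(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on base_url, False otherwise."""
    try:
        # /api/tags lists locally cached models; any 200 means the daemon is up
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if not ollama_reachable():
    print("Ollama is not running; start it before launching the council.")
```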

Phase 2: Downloading the 12 GB-Tier Council

Phase 2 — Model Acquisition

To ensure complete offline capability and protect data privacy, the neural weights of all three models must be physically downloaded and cached to the host machine's SSD. Run these commands one by one in your terminal:

terminal — pull all three models (~13.2 GB total)
ollama pull qwen2.5
ollama pull llama3.1
ollama pull mistral

The pull command downloads each model from the public registry and caches the .gguf file to the local drive without starting an interactive session. (Alternatively, ollama run <model> downloads the model if absent and then drops into a chat prompt; once the >>> prompt appears, type /bye to exit.) The models are now cached locally and ready for programmatic recall.

Phase 3: The Architecture of GPU Forcing

Phase 3 — VRAM Management

Before writing the executable code, it is critically important to understand the physical operation the architecture is forcing the GPU to execute:

  1. Qwen loads → writes a comprehensive draft → unloads.
  2. Llama loads → reads Qwen's draft → writes a harsh critique → unloads.
  3. Mistral loads → reads both draft and critique → writes the final truth → terminates.

The VRAM Persistence Bottleneck

While CrewAI's Process.sequential logic correctly orchestrates the chronological handoff of text between agents, the framework does not possess kernel-level control over Ollama's GPU memory allocation. This creates a severe bottleneck that will cause the RTX 4070 to fail if left unaddressed.

Ollama is fundamentally designed as a persistent server. By default, the keep_alive parameter is set to 5 minutes (5m). After an agent finishes generating a response, its multi-gigabyte model remains parked inside VRAM for 300 seconds. In a sequential pipeline, this triggers a catastrophic cascade:

  1. CrewAI triggers the Researcher. Ollama loads Qwen 2.5 → ~5.5 GB VRAM consumed.
  2. Qwen 2.5 finishes the draft. CrewAI moves to the next task.
  3. CrewAI triggers the Skeptic and requests Llama 3.1.
  4. Because the 5-minute timeout has not elapsed, Qwen 2.5 is still occupying 5.5 GB.
  5. Ollama attempts to load Llama 3.1, requiring an additional 5.5 GB. Total VRAM allocation instantly breaches 11+ GB alongside mandatory Windows display reservations.
  6. VRAM exhausted. The system either crashes with an OOM error or offloads Llama layers to CPU RAM, obliterating inference speed.
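The overflow condition is simple enough to model directly. A toy budget check, with names and figures illustrative and drawn from the numbers above:

```python
def fits_in_vram(resident_gb: list[float], incoming_gb: float,
                 budget_gb: float = 12.0, os_reserved_gb: float = 1.5) -> bool:
    """True only if the incoming model fits alongside everything already loaded."""
    return sum(resident_gb) + incoming_gb + os_reserved_gb <= budget_gb

# Qwen still parked in VRAM when Llama arrives: 5.5 + 5.5 + 1.5 = 12.5 GB
print(fits_in_vram([5.5], 5.5))  # False: overflow
# After a zero-second keep_alive purge the buffer is empty: 5.5 + 1.5 = 7.0 GB
print(fits_in_vram([], 5.5))     # True
```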

Forcing Immediate Eviction for Sequential Logic

The most elegant and reliable solution is to manipulate the host OS environment variables. The critical variable is OLLAMA_KEEP_ALIVE. Setting it to 0 instructs the server to implement a zero-second memory retention policy for all API requests.

🪟 Windows: Open Environment Variables (search in Start Menu → "Edit the system environment variables" → Environment Variables). Under System variables, click New. Variable name: OLLAMA_KEEP_ALIVE, Value: 0. Click OK, then completely restart the Ollama process (kill it from the system tray and relaunch).
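On Linux and macOS the same policy is set in the shell or the service manager rather than the Environment Variables dialog. A sketch (the `ollama` systemd unit name is the Linux installer's default; verify it on your system):

```shell
# Linux / macOS: set the variable in the environment of the server process,
# then restart Ollama so the new retention policy takes effect.
export OLLAMA_KEEP_ALIVE=0

# Linux systemd installs read environment from the service unit instead:
#   sudo systemctl edit ollama     # add: Environment="OLLAMA_KEEP_ALIVE=0"
#   sudo systemctl restart ollama
```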

Once this environmental variable is active, the physical lifecycle of the VRAM synchronizes perfectly with CrewAI's logic flow. Qwen 2.5 generates text and instantly purges from VRAM. The GPU VRAM plummets back to near 0 GB utilization. Llama 3.1 loads cleanly into an empty buffer, generates its critique, and instantly purges. This continuous flush-and-fill methodology maximizes the 12 GB bandwidth of the RTX 4070 and completely circumvents the hardware limitation.

Phase 4: The Python Script and Orchestration Logic

Phase 4 — Execution Code

Create a new folder on your computer. Inside it, create a file named council.py. Open it in any text editor (Notepad or VS Code) and paste the following code exactly:

python — council.py
from crewai import Agent, Task, Crew, Process, LLM

# 1. Connect to your local Ollama models
researcher_llm = LLM(model="ollama/qwen2.5", base_url="http://localhost:11434")
skeptic_llm    = LLM(model="ollama/llama3.1", base_url="http://localhost:11434")
judge_llm      = LLM(model="ollama/mistral",  base_url="http://localhost:11434")

# 2. Define the Council Members (Agents)
researcher = Agent(
    role='Lead Researcher',
    goal='Provide a comprehensive, highly accurate initial draft answering the user prompt.',
    backstory='You are a brilliant data scientist. You only care about facts.',
    llm=researcher_llm,
    verbose=True
)

skeptic = Agent(
    role='Ruthless Skeptic',
    goal="Find every logical flaw, factual error, or weak point in the Researcher's draft.",
    backstory='You are a cynical auditor. You assume the Researcher is wrong until proven '
              'otherwise. You do not write answers, you only critique.',
    llm=skeptic_llm,
    verbose=True
)

judge = Agent(
    role='Final Arbiter',
    goal='Synthesize the original draft and the critique into a perfect, flawless final answer.',
    backstory='You are an impartial judge. You weigh the Researcher data against the Skeptic '
              'critique and output the absolute truth.',
    llm=judge_llm,
    verbose=True
)

# 3. Define the Prompt (change this to whatever you want!)
user_topic = "What are the realistic physics limitations of building a space elevator on Earth?"

# 4. Define the Tasks
draft_task = Task(
    description=f'Write a detailed answer to this topic: {user_topic}',
    expected_output='A full, multi-paragraph explanation of the topic.',
    agent=researcher
)

critique_task = Task(
    description="Read the Researcher's output. Write a harsh critique pointing out "
                "missing context or logical failures.",
    expected_output='A bulleted list of flaws in the previous draft.',
    agent=skeptic
)

final_task = Task(
    description='Read the original draft and the critique. Write the final, corrected response.',
    expected_output='A polished, highly accurate, and final comprehensive answer.',
    agent=judge
)

# 5. Form the Crew and Execute
ai_council = Crew(
    agents=[researcher, skeptic, judge],
    tasks=[draft_task, critique_task, final_task],
    process=Process.sequential  # Forces one-at-a-time execution, saving your 12 GB VRAM
)

print("The Council is now in session...")
result = ai_council.kickoff()

print("\n\n=== FINAL VERDICT ===")
print(result)

Codebase Architecture and LiteLLM Routing

The LLM class uses the model name prefix to determine API routing via LiteLLM. When calling a local Ollama instance, the string identifier must be strictly prefixed with ollama/ — e.g., model="ollama/qwen2.5".

Local network communication requires specifying the local host endpoint. Critically, earlier deprecated versions of CrewAI used a parameter called api_base. This has been strictly deprecated in modern releases. The LLM class now expects the base_url keyword argument. Using the old variable will instantly trigger:

error — using deprecated api_base
TypeError: LLM.__init__() got an unexpected keyword argument 'api_base'

The code provided uses the correct base_url="http://localhost:11434" parameter, ensuring reliable communication with the background daemon.

Psychological Prompting and Task Chaining

An Agent in CrewAI requires three critical prompt components: Role (functional title), Goal (precise objective), and Backstory (deep contextual grounding that pushes the LLM to adopt a specific persona). Instructing the Skeptic that "You are a cynical auditor" significantly alters the token distributions produced during inference, steering the model to hunt for errors rather than simply summarize data.

A Task represents the discrete unit of computational work assigned to an agent. The Crew instantiation binds agents and tasks together. The critical parameter for hardware resource management is process=Process.sequential, which enforces a rigid linear progression. The textual output of each task is automatically and invisibly injected into the context window payload of the subsequent task — guaranteeing that the Skeptic always reads the Researcher's full draft, and the Arbiter always reads both.
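The chaining mechanic can be illustrated with a toy stand-in for the sequential process: pure Python, no CrewAI, and every name here is hypothetical.

```python
def run_sequential(tasks):
    """Run tasks in order; each task receives all accumulated prior output."""
    context = []
    for name, task in tasks:
        output = task("\n\n".join(context))  # prior outputs are the task's input
        context.append(f"[{name}]\n{output}")
    return context[-1]

# Each stage is a stand-in for one agent's generation step.
pipeline = [
    ("Researcher", lambda ctx: "Initial draft of the answer."),
    ("Skeptic",    lambda ctx: f"Critique of the {len(ctx)}-character draft."),
    ("Arbiter",    lambda ctx: f"Final verdict synthesized from {len(ctx)} characters."),
]
print(run_sequential(pipeline))
```

The Arbiter's input length grows with every stage, mirroring how CrewAI injects each task's output into the next task's context window.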

To run the council, open your terminal, navigate to the folder containing the script, and execute:

terminal — run the council
python council.py

System Execution Dynamics and Telemetry

Upon executing the application, the hardware and digital orchestration systems synchronize in a highly visible manner. VRAM utilization rests at a static baseline of roughly 0.5 GB to 1.5 GB (reserved by the OS display rendering). Upon initialization of draft_task, the NVMe SSD undergoes a massive sequential read, transferring 4.4 GB of neural data into the GPU in roughly one second. VRAM utilization spikes to approximately 6.0 GB as Qwen 2.5 allocates its KV cache and begins matrix multiplication.

The critical architectural intervention occurs the exact moment Qwen 2.5 generates its final token. Dictated by the environment variable, the model is violently and instantaneously purged from hardware memory. The VRAM telemetry graph plummets directly back to the 1.5 GB baseline. CrewAI seamlessly registers task completion and initiates critique_task, appending the entirety of Qwen's output into Llama's input context. The SSD fires again, rapidly loading the 4.7 GB Llama 3.1 8B model into the now completely vacant VRAM buffer.

Llama evaluates the ingested textual data, outputs its adversarial bulletin, and is similarly purged upon completion. Finally, Mistral is loaded, synthesizing the historical context loop into the final definitive verdict, before leaving the system clean and dormant.

"The capacity to orchestrate heterogeneous, multi-model agentic councils entirely locally represents a profound democratization of computational intelligence. By intelligently exploiting basic system bottlenecks — turning VRAM limitations into a chronological scheduling algorithm that leans heavily on PCIe SSD bandwidth — developers can match the logical depth and error-checking rigor previously reserved for massive proprietary server clusters."

By decoupling reliance on single homogeneous models and implementing competitive multi-agent peer review, the local computing ecosystem effectively transforms the RTX 4070 from a simple graphics renderer into a remarkably capable, self-correcting digital laboratory.