Executive Summary of Localized Inference
The deployment of large language models has recently undergone a profound structural shift. The historical reliance on monolithic, cloud-hosted models is actively being challenged by decentralized, locally hosted inference frameworks. As enterprise data privacy concerns escalate and the demand for offline processing intensifies, deploying highly capable machine learning architectures on consumer-grade hardware has become a premier objective for systems architects. Realizing this objective on a local workstation, however, runs into one severe, unyielding physical constraint: Video Random Access Memory (VRAM) capacity.
The NVIDIA GeForce RTX 4070, equipped with 12 GB of GDDR6X VRAM, occupies the critical intersection between basic desktop computing and advanced parallel processing. 12 GB provides sufficient capacity to fully load and execute a single 7–8 billion parameter language model with exceptional speed — but it strictly prohibits concurrent loading of three heavy models, an operation that would require 15 to 24 GB depending on context window configurations.
While a solitary 8B model is highly capable, isolated neural networks frequently suffer from localized hallucinations, factual drift over prolonged context generations, and confirmation bias during complex problem-solving. To achieve analytical fidelity that rivals massive proprietary server clusters, systems architects have adopted Multi-Agent Systems (MAS). By orchestrating a "council" of autonomous digital entities — each powered by a distinct neural architecture with a designated persona — the resulting synthetic output is subjected to rigorous peer review, internal auditing, and adversarial cross-verification.
Because the RTX 4070 possesses exactly 12 GB of VRAM, the architectural solution relies on sequential hardware forcing: load a model, execute a task, instantly unload it, load the next. This pipeline is exceptionally fast when paired with modern NVMe solid-state storage, and the combined inferential capabilities of the council drastically reduce output hallucinations.
Hardware Topography: The VRAM Economic Reality
The Mathematical Constraints of 12 Gigabytes
A standard neural network parameter, trained in 16-bit floating-point precision (FP16), consumes exactly two bytes of physical memory. Consequently, an 8-billion parameter model such as Meta's Llama 3.1 8B natively demands approximately 16 GB of VRAM merely to load the static neural weights — before any dynamic context window is established. This immediately disqualifies the RTX 4070 from executing the model in its original, uncompressed state.
Attempting to force a 16 GB file into a 12 GB hardware buffer results in partial offloading to the host CPU RAM across the PCIe bus, degrading token generation speeds from rapid GPU matrix multiplication down to unacceptably slow CPU processing.
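This arithmetic can be sanity-checked in a few lines of Python (decimal gigabytes, static weights only; real quantized files run slightly larger because some tensors keep higher precision):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Static weight size in decimal GB: parameters x bits / 8 bits-per-byte."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_footprint_gb(8, 16))  # 16.0 -> an FP16 8B model overflows 12 GB of VRAM
print(weight_footprint_gb(8, 4))   # 4.0  -> the theoretical floor for pure 4-bit weights
```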
Quantization Dynamics and Footprint Compression
The engineering mechanism used to solve this bottleneck is algorithmic
quantization: a mathematical process that systematically compresses neural model
weights into lower-precision numerical representations. The industry standard for optimal
performance on consumer hardware is 4-bit quantization, particularly the
Q4_K_M algorithm developed for the llama.cpp inference backend. By truncating
weight precision from 16 bits to 4 bits, the physical size of the model is reduced by
roughly 70 percent — an 8B model shrinks from 16.0 GB down to a compact 4.7 GB,
slightly above the theoretical 4.0 GB because Q4_K_M keeps its most sensitive tensors at higher precision.
| Model Foundation | Parameters | Precision | File Size | VRAM Required |
|---|---|---|---|---|
| Llama 3.1 | 8 Billion | FP16 (16-bit) | 16.0 GB | > 16.0 GB |
| Llama 3.1 | 8 Billion | Q8_0 (8-bit) | 8.5 GB | ~ 8.5 GB |
| Llama 3.1 | 8 Billion | Q4_K_M (4-bit) | 4.7 GB | ~ 4.7 GB |
| Qwen 2.5 | 7 Billion | Q4_K_M (4-bit) | 4.4 GB | ~ 4.4 GB |
| Mistral v0.3 | 7 Billion | Q4_K_M (4-bit) | 4.1 GB | ~ 4.1 GB |
Static model weight sizes only account for a portion of total VRAM consumption. Active text generation requires additional dynamic memory allocation driven by the Key-Value (KV) Cache — a temporal buffer storing the internal multi-dimensional representations of all previous tokens within a given sequence. For a standard 8B model operating with a moderate context window of 8,192 tokens, the KV cache consumes an additional 1.0 GB to 1.5 GB. A fully operational Q4_K_M 8B model therefore realistically consumes 6.0 GB to 7.5 GB of total VRAM during active generation.
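The KV-cache estimate can be reproduced from first principles. The sketch below assumes Llama 3.1 8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and a 16-bit cache:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens / 1024**3

# Llama 3.1 8B at an 8,192-token context window:
print(kv_cache_gib(32, 8, 128, 8192))  # 1.0 (GiB)
```

An older non-GQA design with 32 full KV heads would need four times as much cache, which is one reason grouped-query attention matters so much at this VRAM tier.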
Storage Bandwidth and the Physics of Model Swapping
Because parallel execution triggers memory overflow, the proposed architecture relies entirely on high-velocity sequential model swapping. The physical viability of this pipeline depends heavily on the read speed of the host system's secondary storage drive. When a 4-bit quantized 8B model is loaded, approximately 4.7 GB of static data must be physically transferred across the motherboard's architecture.
A modern NVMe SSD on PCIe Gen 4.0 or Gen 5.0 boasts sequential read speeds of 3,500 – 7,500 MB/s. At these bandwidths, transferring 4.7 GB of neural weight data into VRAM requires approximately 0.6 to 1.5 seconds — virtually imperceptible. Conversely, a legacy mechanical HDD at 100 MB/s would result in a load latency exceeding 45 seconds per model swap, rendering the entire sequential methodology prohibitively slow. High-bandwidth NVMe storage is a mandatory prerequisite.
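The swap latencies quoted above follow directly from file size over sustained read bandwidth; a quick check (decimal units, ignoring memory-mapping and driver overhead):

```python
def load_seconds(file_gb: float, read_mb_per_s: float) -> float:
    """Seconds to stream a model file at a sustained sequential read speed."""
    return file_gb * 1000 / read_mb_per_s

print(round(load_seconds(4.7, 7000), 2))  # 0.67 -> Gen 4/5 NVMe, near-instant
print(round(load_seconds(4.7, 3500), 2))  # 1.34 -> entry-level NVMe
print(round(load_seconds(4.7, 100), 1))   # 47.0 -> mechanical HDD, unusable
```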
The Cognitive Trinity: Strategic Model Heterogeneity
If multiple agents are powered by the identical foundational model, they share identical pre-training data distributions, identical algorithmic biases, and identical latent space conceptualizations. A model tasked with critiquing its own output is highly susceptible to confirmation bias — it will likely follow the same flawed inferential pathway that generated the initial error during the review phase.
The optimal 12 GB VRAM-tier council fuses three models from entirely distinct organizations: Alibaba's Qwen 2.5, Meta's Llama 3.1, and Mistral AI's 7B variant. Because they originate from different creators and different training corpora, their biases overlap far less, and they are far less likely to agree with one another's mistakes.
The Lead Researcher: Qwen 2.5 (7B)
Alibaba's Qwen 2.5 was subjected to a pre-training regimen utilizing an unprecedented 18 trillion tokens — a sheer volume of data exposure rarely observed in models of this compact weight class. Its architecture was heavily fine-tuned using targeted synthetic data encompassing complex mathematics, scientific textbooks, and multi-language coding snippets. This yields an engine that is brilliant at coding and logic, making it ideal as the Lead Researcher responsible for generating the exhaustive, fact-based initial draft.
The Ruthless Skeptic: Llama 3.1 (8B)
Meta's Llama 3.1 8B exhibits exceptional instruction-following adherence and highly nuanced analytical processing, making it optimal for forensic textual deconstruction. Because its pre-training lineage is entirely distinct from Qwen's, its latent conceptual space approaches data from a divergent angle. When prompted to act as a Ruthless Skeptic, Llama 3.1 aggressively analyzes the Qwen-generated draft, purposefully hunting for logical fallacies, missing context, and factual inconsistencies. It is explicitly instructed not to write a final answer — its sole purpose is a structured adversarial critique.
The Final Arbiter: Mistral (7B)
While Mistral models occasionally score lower than Qwen or Llama on raw synthetic logic benchmarks, human-preference evaluations consistently highlight Mistral's superiority in writing style, tonal adherence, and structural cleanliness. As the Final Arbiter, Mistral reads the initial draft and the harsh critique, resolves the discrepancies between the two, and writes the final, reconciled answer in polished, professional prose.
| Agent Role | Model | Provider | Primary Strength | Function in Pipeline |
|---|---|---|---|---|
| Lead Researcher | Qwen 2.5 (7B) | Alibaba | Deep logic, coding, 18T token dataset | Generates the exhaustive, fact-based initial draft |
| Ruthless Skeptic | Llama 3.1 (8B) | Meta AI | Instruction following, safety alignment | Audits the initial draft for logical flaws |
| Final Arbiter | Mistral (7B) | Mistral AI | Structured outputs, superior tonal quality | Synthesizes draft and critique into a polished final verdict |
Phase 1: Installing the Backend Engine and Orchestration Framework
Deploying this cognitive trinity requires two foundational installations.
Ollama serves as the localized inference server. Built upon the llama.cpp
infrastructure, Ollama abstracts the complexities of direct hardware memory allocation,
operating as a background daemon that listens for API requests. Download and install Ollama
from ollama.com. It installs natively on Windows, macOS, and Linux, establishing
a local server that binds to http://localhost:11434.
Simultaneously, download and install Python, ensuring the critical "Add Python to PATH" option is checked during installation. Then install the orchestration framework:
pip install crewai litellm
CrewAI structures the system into tangible hierarchical objects: Agents,
Tasks, and Crews. LiteLLM acts as a universal translation layer, routing
standard framework API calls to local or remote providers. Installing both explicitly prevents
the common runtime error ImportError: Fallback to LiteLLM is not available.
Phase 2: Downloading the 12 GB-Tier Council
To ensure complete offline capability and protect data privacy, the neural weights of all three models must be physically downloaded and cached to the host machine's SSD. Run these commands one by one in your terminal:
ollama run qwen2.5
ollama run llama3.1
ollama run mistral
The run command checks the local registry; if the model is absent, it initiates
a download from the public registry, caches the .gguf file to the drive,
and immediately initializes an interactive prompt. Once the >>> prompt
appears, type /bye to exit, then repeat for all three models. (The pull command
performs the same download and caching without opening an interactive session.)
The models are now securely saved and standing ready for recall.
Phase 3: The Architecture of GPU Forcing
Before writing the executable code, it is critically important to understand the physical operation the architecture forces the GPU to execute:
- Qwen loads → writes a comprehensive draft → unloads.
- Llama loads → reads Qwen's draft → writes a harsh critique → unloads.
- Mistral loads → reads both draft and critique → writes the final truth → terminates.
The VRAM Persistence Bottleneck
While CrewAI's Process.sequential logic correctly orchestrates the chronological
handoff of text between agents, the framework does not possess kernel-level control over
Ollama's GPU memory allocation. This creates a severe bottleneck that will cause the RTX 4070
to fail if left unaddressed.
Ollama is fundamentally designed as a persistent server. By default, the
keep_alive parameter is set to 5 minutes (5m). After an agent
finishes generating a response, its multi-gigabyte model remains parked inside VRAM for
300 seconds. In a sequential pipeline, this triggers a catastrophic cascade:
- CrewAI triggers the Researcher. Ollama loads Qwen 2.5 → ~5.5 GB VRAM consumed.
- Qwen 2.5 finishes the draft. CrewAI moves to the next task.
- CrewAI triggers the Skeptic and requests Llama 3.1.
- Because the 5-minute timeout has not elapsed, Qwen 2.5 is still occupying 5.5 GB.
- Ollama attempts to load Llama 3.1, requiring an additional ~5.5 GB. Combined with the mandatory Windows display reservation, total VRAM demand breaches the 12 GB ceiling.
- VRAM exhausted. The system either crashes with an OOM error or offloads Llama layers to CPU RAM, obliterating inference speed.
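The cascade can be made concrete with a toy scheduler. The VRAM figures are this article's estimates (model weights plus KV cache), the retention window stands in for keep_alive, and the one-second gap stands in for the hand-off between CrewAI tasks:

```python
def peak_vram(models_gb, keep_alive, gap_seconds=1.0, baseline=1.0):
    """Peak VRAM (GB) when models run back-to-back under a retention policy."""
    resident, t, peak = [], 0.0, baseline
    for size in models_gb:
        resident = [(s, u) for (s, u) in resident if u > t]  # evict expired models
        resident.append((size, t + keep_alive))              # load + schedule unload
        peak = max(peak, baseline + sum(s for s, _ in resident))
        t += gap_seconds                                     # next task begins
    return peak

council = [5.5, 5.5, 5.0]  # estimated footprints: Qwen, Llama, Mistral
print(peak_vram(council, keep_alive=300))  # 17.0 -> far beyond 12 GB, OOM
print(peak_vram(council, keep_alive=0))    # 6.5  -> never exceeds one model + baseline
```

With the default 300-second retention all three models pile up in memory; with zero retention, peak demand never exceeds a single model plus the desktop baseline.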
Forcing Immediate Eviction for Sequential Logic
The most elegant and reliable solution is to manipulate the host OS environment variables.
The critical variable is OLLAMA_KEEP_ALIVE. Setting it to 0
instructs the server to implement a zero-second memory retention policy for
all API requests.
On Windows, open System Properties → Environment Variables and create a new user variable with the name OLLAMA_KEEP_ALIVE and the value 0. Click OK, then completely restart the Ollama process (kill it from the system tray and relaunch).
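The GUI steps above apply to Windows; the equivalent on the command line is sketched below (the setx form persists for newly launched processes, and a Linux systemd install may instead require editing the service's environment):

```shell
# Windows (PowerShell or cmd) - persists for new processes; restart Ollama after:
setx OLLAMA_KEEP_ALIVE 0

# Linux / macOS - set in the shell (or shell profile) that launches the server:
export OLLAMA_KEEP_ALIVE=0
ollama serve
```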
Once this environmental variable is active, the physical lifecycle of the VRAM synchronizes perfectly with CrewAI's logic flow. Qwen 2.5 generates text and instantly purges from VRAM. The GPU VRAM plummets back to near 0 GB utilization. Llama 3.1 loads cleanly into an empty buffer, generates its critique, and instantly purges. This continuous flush-and-fill methodology maximizes the 12 GB bandwidth of the RTX 4070 and completely circumvents the hardware limitation.
Phase 4: The Python Script and Orchestration Logic
Create a new folder on your computer. Inside it, create a file named council.py.
Open it in any text editor (Notepad or VS Code) and paste the following code exactly:
from crewai import Agent, Task, Crew, Process, LLM

# 1. Connect to your local Ollama models
researcher_llm = LLM(model="ollama/qwen2.5", base_url="http://localhost:11434")
skeptic_llm = LLM(model="ollama/llama3.1", base_url="http://localhost:11434")
judge_llm = LLM(model="ollama/mistral", base_url="http://localhost:11434")

# 2. Define the Council Members (Agents)
researcher = Agent(
    role='Lead Researcher',
    goal='Provide a comprehensive, highly accurate initial draft answering the user prompt.',
    backstory='You are a brilliant data scientist. You only care about facts.',
    llm=researcher_llm,
    verbose=True
)

skeptic = Agent(
    role='Ruthless Skeptic',
    goal="Find every logical flaw, factual error, or weak point in the Researcher's draft.",
    backstory='You are a cynical auditor. You assume the Researcher is wrong until proven '
              'otherwise. You do not write answers, you only critique.',
    llm=skeptic_llm,
    verbose=True
)

judge = Agent(
    role='Final Arbiter',
    goal='Synthesize the original draft and the critique into a perfect, flawless final answer.',
    backstory='You are an impartial judge. You weigh the Researcher data against the Skeptic '
              'critique and output the absolute truth.',
    llm=judge_llm,
    verbose=True
)

# 3. Define the Prompt (change this to whatever you want!)
user_topic = "What are the realistic physics limitations of building a space elevator on Earth?"

# 4. Define the Tasks
draft_task = Task(
    description=f'Write a detailed answer to this topic: {user_topic}',
    expected_output='A full, multi-paragraph explanation of the topic.',
    agent=researcher
)

critique_task = Task(
    description="Read the Researcher's output. Write a harsh critique pointing out "
                "missing context or logical failures.",
    expected_output='A bulleted list of flaws in the previous draft.',
    agent=skeptic
)

final_task = Task(
    description='Read the original draft and the critique. Write the final, corrected response.',
    expected_output='A polished, highly accurate, and final comprehensive answer.',
    agent=judge
)

# 5. Form the Crew and Execute
ai_council = Crew(
    agents=[researcher, skeptic, judge],
    tasks=[draft_task, critique_task, final_task],
    process=Process.sequential  # Forces one-at-a-time execution, saving your 12 GB VRAM
)

print("The Council is now in session...")
result = ai_council.kickoff()

print("\n\n=== FINAL VERDICT ===")
print(result)
Codebase Architecture and LiteLLM Routing
The LLM class uses the model name prefix to determine API routing via LiteLLM.
When calling a local Ollama instance, the string identifier must be strictly prefixed with
ollama/ — e.g., model="ollama/qwen2.5".
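The provider prefix is simply a naming convention that the router splits on the first slash; an illustrative decomposition (not LiteLLM's actual internals):

```python
def split_provider(model_id: str) -> tuple[str, str]:
    """Split a 'provider/model' identifier into its routing components."""
    provider, _, name = model_id.partition("/")
    return provider, name

print(split_provider("ollama/qwen2.5"))  # ('ollama', 'qwen2.5')
```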
Local network communication requires specifying the local host endpoint. Critically, earlier
deprecated versions of CrewAI used a parameter called api_base. This has been
strictly deprecated in modern releases. The LLM class now
expects the base_url keyword argument. Using the old variable will instantly
trigger:
TypeError: LLM.__init__() got an unexpected keyword argument 'api_base'
The code provided uses the correct base_url="http://localhost:11434" parameter,
guaranteeing flawless communication with the background daemon.
Psychological Prompting and Task Chaining
An Agent in CrewAI requires three critical psychological prompts: Role
(functional title), Goal (precise objective), and Backstory
(deep contextual grounding that forces the LLM to adopt a specific persona). Instructing the
Skeptic that "You are a cynical auditor" significantly alters the weights triggered
during inference, forcing the model to hunt for errors rather than simply summarize data.
A Task represents the discrete unit of computational work assigned to an agent.
The Crew instantiation binds agents and tasks together. The critical parameter
for hardware resource management is process=Process.sequential, which enforces
a rigid linear progression. The textual output of each task is automatically and
invisibly injected into the context window payload of the subsequent task —
guaranteeing that the Skeptic always reads the Researcher's full draft, and the Arbiter always
reads both.
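The hand-off described above can be sketched in miniature. The lambdas below are stand-ins for actual model calls; the point is the context accumulation that Process.sequential performs between tasks:

```python
def run_pipeline(steps, topic: str):
    """Run named steps in order, injecting each output into the next context."""
    context, outputs = topic, []
    for name, fn in steps:
        result = fn(context)
        outputs.append((name, result))
        context += "\n\n" + result  # downstream agents see everything upstream
    return outputs

steps = [
    ("draft",    lambda ctx: "DRAFT: " + ctx.splitlines()[0]),
    ("critique", lambda ctx: "CRITIQUE: the draft above is incomplete"),
    ("verdict",  lambda ctx: "VERDICT: reconciled final answer"),
]
for name, text in run_pipeline(steps, "space elevator physics"):
    print(name, "->", text)
```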
To run the council, open your terminal, navigate to the folder containing the script, and execute:
python council.py
System Execution Dynamics and Telemetry
Upon executing the application, the hardware and digital orchestration systems synchronize in
a highly visible manner. VRAM utilization rests at a static baseline of roughly 0.5 GB
to 1.5 GB (reserved by the OS display rendering). Upon initialization of
draft_task, the NVMe SSD undergoes a massive sequential read, transferring
4.4 GB of neural data into the GPU in roughly one second. VRAM utilization spikes to
approximately 6.0 GB as Qwen 2.5 allocates its KV cache and begins
matrix multiplication.
The critical architectural intervention occurs the exact moment Qwen 2.5 generates its
final token. As dictated by the environment variable, the model is instantly
purged from hardware memory. The VRAM telemetry graph plummets directly back to the
1.5 GB baseline. CrewAI seamlessly registers task completion and initiates
critique_task, appending the entirety of Qwen's output into Llama's input
context. The SSD fires again, rapidly loading the 4.7 GB Llama 3.1 8B model into the now
completely vacant VRAM buffer.
Llama evaluates the ingested textual data, outputs its adversarial bulletin, and is similarly purged upon completion. Finally, Mistral is loaded, synthesizing the historical context loop into the final definitive verdict, before leaving the system clean and dormant.
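The flush-and-fill cycle is easy to watch yourself by polling nvidia-smi's CSV query interface while the script runs; a minimal sketch (the parser is split out so it can be tested without a GPU):

```python
import subprocess

def parse_used_mib(csv_output: str) -> int:
    """Parse nvidia-smi's noheader/nounits CSV output for the first GPU."""
    return int(csv_output.strip().splitlines()[0])

def vram_used_mib() -> int:
    """Poll current VRAM usage in MiB (requires the NVIDIA driver on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)

# During Qwen's draft this hovers near ~6000 MiB, then collapses back toward
# the ~1500 MiB desktop baseline the instant the model unloads.
```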
"The capacity to orchestrate heterogeneous, multi-model agentic councils entirely locally represents a profound democratization of computational intelligence. By intelligently exploiting basic system bottlenecks — turning VRAM limitations into a chronological scheduling algorithm that leans heavily on PCIe SSD bandwidth — developers can match the logical depth and error-checking rigor previously reserved for massive proprietary server clusters."
By decoupling reliance on single homogeneous models and implementing competitive multi-agent peer review, the local computing ecosystem effectively transforms the RTX 4070 from a simple graphics renderer into a remarkably capable digital laboratory.