
Browser Host

Run a DAISI host directly in your browser using WebGPU. No downloads, no install — just open a tab and start hosting.

What is Browser Host?

Browser Host turns any browser tab into a fully functional DAISI host. It loads GGUF models, runs inference entirely on your GPU via WebGPU compute shaders, and connects directly to the ORC (Orchestrator) via gRPC-web — all without any server involvement.

Your data stays on your device. The model runs locally in your browser. Only inference commands and responses travel over the network to the ORC.

Requirements

Requirement | Details
Browser | Chrome 113+ or Edge 113+ (WebGPU required)
GPU | Any GPU with WebGPU support (NVIDIA, AMD, Intel, Apple Silicon)
VRAM | Minimum 1 GB free; larger models need more (see the VRAM table below)
Account | A DAISI account at manager.daisinet.com

Getting Started

  1. Navigate to Browser Host in the Manager sidebar.
  2. The page will detect your GPU and display adapter information (device, architecture, features).
  3. Select a model from the dropdown. The first download may take a few minutes depending on your connection — once cached, future loads are instant.
  4. Once loaded, the host automatically creates a host identity and connects to the ORC.
  5. Your browser tab is now a live DAISI host, processing inference requests from the network.
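The GPU-detection step (step 2 above) can be sketched with the standard WebGPU API. This is an illustrative sketch, not Browser Host's actual code; `supportsWebGPU` and `describeAdapter` are names invented here.

```typescript
// Pure check, testable outside a browser: WebGPU is exposed as `navigator.gpu`.
function supportsWebGPU(nav: { gpu?: unknown }): boolean {
  return typeof nav.gpu !== "undefined" && nav.gpu !== null;
}

// In a real browser tab, request an adapter and read its info
// (vendor, architecture, device), as the Browser Host page displays.
async function describeAdapter(): Promise<string> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return "WebGPU not available";
  const adapter = await gpu.requestAdapter();
  if (!adapter) return "No suitable GPU adapter found";
  const info = adapter.info ?? {}; // GPUAdapterInfo: vendor, architecture, device
  return `${info.vendor ?? "unknown"} / ${info.architecture ?? "unknown"}`;
}
```

If `requestAdapter()` resolves to `null`, the browser supports the API but found no usable GPU, which is worth distinguishing from WebGPU being absent entirely.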

VRAM Requirements

VRAM usage depends on the model size and quantization. The browser host estimates VRAM before downloading and warns if a model may not fit.

Model | Quant | Approx. VRAM
Qwen 0.6B | Q4_0 | ~500 MB
Qwen 0.6B | Q8_0 | ~800 MB
TinyLlama 1.1B | Q8_0 | ~1.5 GB
Llama 3.2 1B | Q4_0 | ~1 GB
Llama 3.2 1B | Q8_0 | ~1.8 GB

WebGPU has a per-buffer limit of 2 GB, but total VRAM usage can exceed this since the engine uses multiple buffers.
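A pre-download estimate like the one described above can be approximated from parameter count and quantization width. The bits-per-weight figures and the 1.3x overhead factor below are rough assumptions for illustration (Q4_0 stores about 4.5 bits per weight once per-block scales are included, Q8_0 about 8.5), not the engine's actual formula.

```typescript
// Rough VRAM estimate from parameter count and quantization.
// Bits-per-weight include per-block scale overhead; the 1.3x factor is an
// assumed allowance for KV cache and activation buffers.
const BITS_PER_WEIGHT: Record<string, number> = {
  Q4_0: 4.5, // 4-bit weights + per-block scale
  Q8_0: 8.5, // 8-bit weights + per-block scale
};

function estimateVramMB(
  params: number,
  quant: "Q4_0" | "Q8_0",
  overhead = 1.3
): number {
  const weightBytes = (params * BITS_PER_WEIGHT[quant]) / 8;
  return Math.round((weightBytes * overhead) / 1e6);
}
```

Under these assumptions, `estimateVramMB(0.6e9, "Q4_0")` returns 439, in the same ballpark as the ~500 MB table entry, and the formula also reproduces the rule of thumb that Q4_0 needs roughly half the VRAM of Q8_0.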

Going Online

When the host connects to the ORC, it becomes available for inference requests from the network. Here's what happens:

  • Connection: The browser connects directly to the ORC via gRPC-web. No server proxy; your browser talks to the ORC the same way native hosts do.
  • Heartbeats: The host sends a heartbeat every 60 seconds with model information. The ORC uses these to know your host is alive and which models you have loaded.
  • Auto-Reconnect: If the connection drops, the host automatically reconnects with exponential backoff (up to 10 retries). No manual intervention needed.
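The reconnect behavior above can be sketched as a capped exponential backoff. Only the 10-retry limit comes from this page; the base delay, the cap, and the function names are assumptions for illustration.

```typescript
// Delay before the nth reconnect attempt (attempt = 0, 1, 2, ...).
// Base 1 s and 30 s cap are assumed values; the 10-retry limit is from the docs.
const MAX_RETRIES = 10;

function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry `connect` until it succeeds or the retry budget is exhausted.
async function reconnectWithBackoff(connect: () => Promise<void>): Promise<boolean> {
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      await connect();
      return true; // back online
    } catch {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  return false; // gave up after MAX_RETRIES attempts
}
```

The cap matters in practice: without it, the tenth attempt would wait over 17 minutes, long enough that the ORC would have already dropped the host.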

Private Chat

Use the Private Chat button to test the model locally. Your conversation runs entirely on your device — nothing is sent to a server. This is always free and always private.

Use the ORC Chat button to test the full pipeline: your prompt goes through the ORC, gets routed to your browser host, inference runs on your GPU, and tokens stream back through the ORC.

Supported Models

Browser Host supports GGUF models with the following configurations:

Feature | Support
Architecture | Llama, Qwen, Mistral (and compatible)
GPU Quantization | Q4_0, Q8_0 (native GPU shaders)
CPU Dequant | F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q4_K, Q6_K
Chat Templates | Automatic; reads the Jinja2 template from GGUF metadata
Attention | Multi-head + Grouped Query Attention (GQA)
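Automatic chat templating typically means reading the template string out of the model's GGUF metadata. `tokenizer.chat_template` is the conventional GGUF metadata key for this; the lookup function itself is an illustrative sketch, not the engine's API.

```typescript
// Sketch: pick the chat template out of parsed GGUF metadata.
// GGUF stores key/value metadata; chat templates conventionally live
// under "tokenizer.chat_template" as a Jinja2 string.
function getChatTemplate(metadata: Record<string, string>): string | null {
  return metadata["tokenizer.chat_template"] ?? null;
}
```

Models that ship without this key would need a fallback template chosen by architecture.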

Troubleshooting

  • "WebGPU not available": Make sure you're using Chrome 113+ or Edge 113+. On some systems you may need to enable WebGPU in chrome://flags.
  • "Model too large": Try a smaller model or a more compressed quantization (Q4_0 uses roughly half the VRAM of Q8_0). Close other GPU-intensive tabs.
  • "Disconnected from ORC": The host will auto-reconnect. If it doesn't, try reloading the page. Ensure your ORC address is correct in the Manager settings.
  • Slow inference: Use Q8_0 models for the best quality/speed tradeoff. Larger models are slower. Close other tabs that may be using the GPU.