webml-kit: running ML models in the browser via WebGPU/WASM.

Posted by init0@reddit | LocalLLaMA

If you've ever built a browser-ML demo, you know the drill: copy 150 lines of Web Worker boilerplate from the last project, wire up postMessage, add progress reporting, handle the GPU vanishing mid-inference, and pray the model is cached so your user doesn't wait 3 minutes. Every. Single. Time.
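
For anyone who hasn't felt that pain, the hand-rolled main-thread half looks roughly like this. This is a generic sketch of the usual boilerplate, not webml-kit's internals; the message shapes and the companion worker.js are made up for illustration:

// main.js -- the boilerplate you rewrite every project (generic sketch)
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });

worker.onmessage = ({ data }) => {
  switch (data.type) {
    case 'progress': console.log(`Loading: ${data.percent}%`); break; // progress reporting
    case 'token':    console.log(data.token);                  break; // streamed output
    case 'error':    console.error(data.error);                break; // GPU gone, OOM, ...
  }
};

// kick off a load; the worker fetches weights and posts events back
worker.postMessage({ type: 'load', modelId: 'onnx-community/Bonsai-1.7B-ONNX' });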

This library does that part for you. It wraps @huggingface/transformers with a sane API and handles the ugly bits: device detection, model caching, token streaming, KV-cache management, and GPU recovery.
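
That last one is worth dwelling on: a WebGPU device can be lost at any moment (driver reset, power events, the browser reclaiming it). At the raw API level this is standard WebGPU, nothing webml-kit-specific, and it surfaces like this:

// Standard WebGPU (not webml-kit): every GPUDevice exposes a `lost` promise
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU not available');
const device = await adapter.requestDevice();

device.lost.then((info) => {
  console.warn(`GPU device lost (${info.reason}): ${info.message}`);
  // Recovery means requesting a fresh device and re-uploading weights --
  // exactly the bookkeeping the library claims to do for you.
});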

import { ModelClient } from 'webml-kit';

const client = new ModelClient();
// or with an explicit worker path:
// const client = new ModelClient(new URL('webml-kit/worker', import.meta.url));

// What can this machine do?
const device = await client.detect();
console.log(device.backend);         // 'webgpu' or 'wasm' or 'cpu'
console.log(device.gpu?.vendor);      // 'apple'
console.log(device.recommendedDtype); // 'q4'

// Load a model
await client.load({
  task: 'text-generation',
  modelId: 'onnx-community/Bonsai-1.7B-ONNX',
  dtype: 'q4',
  onProgress: ({ percent }) => console.log(`Loading: ${percent}%`),
});

// Stream tokens as they're generated (tps = tokens/sec)
const out = document.querySelector('#output'); // any element on your page
for await (const { token, tps } of client.stream('Tell me a joke')) {
  out.textContent += token; // append each token as it arrives
}
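
A pattern that falls out naturally (using only the calls shown above) is feeding detect()'s recommendation straight into load(), rather than hardcoding a dtype:

// Let detection choose the quantization instead of hardcoding 'q4'
const device = await client.detect();
await client.load({
  task: 'text-generation',
  modelId: 'onnx-community/Bonsai-1.7B-ONNX',
  dtype: device.recommendedDtype, // e.g. 'q4' on the machine detected above
  onProgress: ({ percent }) => console.log(`Loading: ${percent}%`),
});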