webml-kit: running ML models in the browser via WebGPU/WASM.

Posted by init0@reddit | LocalLLaMA

If you've ever built a browser-ML demo, you know the drill: copy 150 lines of Web Worker boilerplate from the last project, wire up postMessage, add progress reporting, handle the GPU vanishing mid-inference, and pray the model is cached so your user doesn't wait 3 minutes. Every. Single. Time.
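
For anyone who hasn't felt that pain, the hand-rolled main-thread half looks roughly like this. This is a generic sketch of the usual boilerplate, not webml-kit's internals; the message shapes and the companion worker.js are made up for illustration:

// main.js -- the boilerplate you rewrite every project (generic sketch)
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });

worker.onmessage = ({ data }) => {
  switch (data.type) {
    case 'progress': console.log(`Loading: ${data.percent}%`); break; // progress reporting
    case 'token':    console.log(data.token);                  break; // streamed output
    case 'error':    console.error(data.error);                break; // GPU gone, OOM, ...
  }
};

// kick off a load; the worker fetches weights and posts events back
worker.postMessage({ type: 'load', modelId: 'onnx-community/Bonsai-1.7B-ONNX' });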

This library does that part for you. It wraps @huggingface/transformers with a sane API and handles the ugly bits: device detection, model caching, token streaming, KV-cache management, and GPU recovery.
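
That last one is worth dwelling on: a WebGPU device can be lost at any moment (driver reset, power events, the browser reclaiming it). At the raw API level this is standard WebGPU, nothing webml-kit-specific, and it surfaces like this:

// Standard WebGPU (not webml-kit): every GPUDevice exposes a `lost` promise
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU not available');
const device = await adapter.requestDevice();

device.lost.then((info) => {
  console.warn(`GPU device lost (${info.reason}): ${info.message}`);
  // Recovery means requesting a fresh device and re-uploading weights --
  // exactly the bookkeeping the library claims to do for you.
});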

import { ModelClient } from 'webml-kit';

const client = new ModelClient();
// or with an explicit worker path:
// const client = new ModelClient(new URL('webml-kit/worker', import.meta.url));

// What can this machine do?
const device = await client.detect();
console.log(device.backend);         // 'webgpu' or 'wasm' or 'cpu'
console.log(device.gpu?.vendor);      // 'apple'
console.log(device.recommendedDtype); // 'q4'

// Load a model
await client.load({
  task: 'text-generation',
  modelId: 'onnx-community/Bonsai-1.7B-ONNX',
  dtype: 'q4',
  onProgress: ({ percent }) => console.log(`Loading: ${percent}%`),
});

// Stream tokens as they're generated (tps = tokens/sec)
const out = document.querySelector('#output'); // any element on your page
for await (const { token, tps } of client.stream('Tell me a joke')) {
  out.textContent += token; // append each token as it arrives
}
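
A pattern that falls out naturally (using only the calls shown above) is feeding detect()'s recommendation straight into load(), rather than hardcoding a dtype:

// Let detection choose the quantization instead of hardcoding 'q4'
const device = await client.detect();
await client.load({
  task: 'text-generation',
  modelId: 'onnx-community/Bonsai-1.7B-ONNX',
  dtype: device.recommendedDtype, // e.g. 'q4' on the machine detected above
  onProgress: ({ percent }) => console.log(`Loading: ${percent}%`),
});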