Speech-to-Text (WebSocket)
Jodit AI Adapter exposes a WebSocket endpoint for streaming speech-to-text. The browser streams microphone audio over the socket and the server proxies it to the provider's realtime transcription API, so the provider key never reaches the browser and usage can be metered exactly like text AI requests.
Implementation status
This page documents the authenticated transport that is available today.
The realtime audio→transcript protocol (audio frames in, delta/final
transcript events out) and per-credit metering are being added in the
following releases. The connection contract below (endpoint, auth, lifecycle)
is stable.
Endpoint
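The path follows the configured routePrefix (default /ai), so with the
default configuration the socket URL is wss://your-host/ai/transcribe
(ws:// when the server is not behind TLS).
Query parameters select the provider, model, and language. As with the other adapter handlers, all of them are optional and fall back to the defaults below: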
| Param | Default | Meaning |
|---|---|---|
| provider | openai | Configured provider to use |
| model | provider default (gpt-4o-mini-transcribe) | Transcription model |
| language | en | BCP-47 language hint (e.g. en-US) |
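
As an illustration of how these parameters combine with the route prefix, the following is a minimal sketch of a URL builder. The helper name `buildTranscribeUrl`, its options type, and the `your-host` placeholder are assumptions made for this example; they are not part of the adapter's API.

```typescript
// Hypothetical helper (not part of jodit-ai-adapter): assembles the socket URL
// from the documented route prefix and optional query parameters.
interface TranscribeUrlOptions {
  host: string;          // placeholder host, e.g. "your-host"
  routePrefix?: string;  // documented default: "/ai"
  secure?: boolean;      // wss:// unless explicitly set to false
  provider?: string;
  model?: string;
  language?: string;
}

function buildTranscribeUrl(opts: TranscribeUrlOptions): string {
  const scheme = opts.secure === false ? 'ws' : 'wss';
  // Drop trailing slashes so "/ai/" and "/ai" produce the same path.
  const prefix = (opts.routePrefix ?? '/ai').replace(/\/+$/, '');
  const params = new URLSearchParams();
  if (opts.provider) params.set('provider', opts.provider);
  if (opts.model) params.set('model', opts.model);
  if (opts.language) params.set('language', opts.language);
  const query = params.toString();
  return `${scheme}://${opts.host}${prefix}/transcribe${query ? `?${query}` : ''}`;
}
```

Omitted parameters are simply left off the URL, so the server-side defaults from the table apply.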
Authentication
Authentication happens during the upgrade handshake and uses the same rules as the HTTP routes:
- API key: taken from, in order, Authorization: Bearer <key>, x-api-key: <key>, or the ?key= / ?apikey= query parameter.
- Format: validated against apiKeyPattern (default 36-char UUID shape).
- Referer: when requireReferer (or allowedReferers) is configured, the Referer/Origin of the handshake must match.
- checkAuthentication: if provided, it is awaited and must return a user id; otherwise the handshake is refused.
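
The lookup order and format check can be sketched roughly as follows. This is not the adapter's source: the function names are hypothetical, the headers are assumed to arrive with lower-cased names (as Node's HTTP server provides them), and the regular expression is only a guess at what a "36-char UUID shape" pattern looks like.

```typescript
// Assumed default pattern: a 36-character UUID. The adapter's actual
// apiKeyPattern is not shown on this page.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: returns the first key found, checking the sources in
// the documented order (Authorization, x-api-key, then ?key= / ?apikey=).
function extractApiKey(
  headers: Record<string, string | undefined>, // lower-cased header names
  requestUrl: string                            // e.g. "/ai/transcribe?key=..."
): string | null {
  const auth = headers['authorization'];
  if (auth) {
    const token = /^Bearer\s+(.+)$/i.exec(auth.trim())?.[1];
    if (token) return token.trim();
  }
  const headerKey = headers['x-api-key'];
  if (headerKey) return headerKey.trim();
  // The base URL is a throwaway; it only lets URL parse a relative path.
  const query = new URL(requestUrl, 'http://placeholder.invalid').searchParams;
  return query.get('key') ?? query.get('apikey');
}

// Hypothetical helper: applies the format check to whatever was extracted.
function isValidApiKey(key: string | null, pattern: RegExp = UUID_PATTERN): key is string {
  return key !== null && pattern.test(key);
}
```

Browsers cannot set custom headers on a WebSocket handshake, which is why the query-parameter fallback matters for in-page clients; server-side clients can use either header.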
An unauthenticated handshake is rejected with an HTTP status before the
WebSocket opens (the client receives an unexpected-response/error, not a
silently dropped socket).
Referer for the socket
The external checkReferer callback is HTTP-only (it receives an Express
Request) and is not invoked for the WebSocket. Hosts that need referer
allow-listing on the socket should enforce it in their own upgrade handler
(for example, jodit-startup-service does this). The built-in requireReferer
and allowedReferers pattern matching still applies to the handshake.
Wire protocol
Two frame kinds travel over the socket:
| Direction | Frame type | Payload |
|---|---|---|
| client → server | binary | mono PCM16 little-endian @ 24 kHz audio chunks |
| server → client | text (JSON) | control & transcript events |
Server → client events:
{ "type": "ready" } // sent on connect — start streaming audio
{ "type": "delta", "text": "..." } // interim transcript (partial)
{ "type": "final", "text": "..." } // committed transcript for a phrase
{ "type": "error", "message": "..." }
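
On the client, these four events map naturally onto a discriminated union. The sketch below is one possible way to type and validate incoming text frames; the type and function names are assumptions for illustration and are not exported by the adapter.

```typescript
// Hypothetical client-side types mirroring the documented event shapes.
type TranscriptionEvent =
  | { type: 'ready' }
  | { type: 'delta'; text: string }
  | { type: 'final'; text: string }
  | { type: 'error'; message: string };

// Parses one JSON text frame; returns null for anything that does not match
// a documented event shape, so callers can ignore unknown frames.
function parseTranscriptionEvent(raw: string): TranscriptionEvent | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== 'object' || data === null) return null;
  const obj = data as Record<string, unknown>;
  const kind = obj['type'];
  const text = obj['text'];
  const message = obj['message'];
  switch (kind) {
    case 'ready':
      return { type: 'ready' };
    case 'delta':
      return typeof text === 'string' ? { type: 'delta', text } : null;
    case 'final':
      return typeof text === 'string' ? { type: 'final', text } : null;
    case 'error':
      return typeof message === 'string' ? { type: 'error', message } : null;
    default:
      return null;
  }
}
```

Returning null rather than throwing keeps a client tolerant of event types that a future release might add.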
Status
The endpoint is fully wired: it authenticates the upgrade, resolves the
provider adapter, runs a realtime transcription session, and meters audio
usage through onUsage. The OpenAI provider emits ready, then delta
(interim) and final (committed) transcripts, and error on failure.
Sending microphone audio from the browser
Capture the microphone, downsample to PCM16 / 24 kHz, and send each chunk as a binary frame. This mirrors the format the server forwards to the realtime provider.
// apiKey is assumed to be defined elsewhere (your adapter API key).
const socket = new WebSocket(
  `wss://your-host/ai/transcribe?key=${encodeURIComponent(apiKey)}`
);
socket.binaryType = 'arraybuffer';

const TARGET_SAMPLE_RATE = 24000;

// Convert Web Audio float samples (-1..1) to signed 16-bit PCM.
// Int16Array uses the platform byte order, which is little-endian on
// the hardware browsers run on in practice.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// Nearest-sample downsampling: simple, with no anti-aliasing filter.
function downsample(input: Float32Array, inRate: number, outRate: number): Float32Array {
  if (outRate >= inRate) return input;
  const ratio = inRate / outRate;
  const length = Math.round(input.length / ratio);
  const result = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    result[i] = input[Math.min(Math.round(i * ratio), input.length - 1)];
  }
  return result;
}

socket.addEventListener('open', async () => {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true }
  });
  const ctx = new AudioContext();
  if (ctx.state === 'suspended') await ctx.resume();

  const source = ctx.createMediaStreamSource(stream);
  // ScriptProcessorNode is deprecated in favor of AudioWorklet, but it
  // keeps this example short.
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  const zeroGain = ctx.createGain();
  zeroGain.gain.value = 0; // keep the graph running without playback

  processor.onaudioprocess = (e) => {
    if (socket.readyState !== WebSocket.OPEN) return;
    const input = e.inputBuffer.getChannelData(0);
    const pcm = floatTo16BitPCM(
      downsample(input, e.inputBuffer.sampleRate, TARGET_SAMPLE_RATE)
    );
    socket.send(pcm.buffer); // binary audio frame
  };

  source.connect(processor);
  processor.connect(zeroGain);
  zeroGain.connect(ctx.destination);
});

socket.addEventListener('message', (event) => {
  // server → client events are JSON text frames
  if (typeof event.data !== 'string') return;
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case 'ready': /* streaming has started */ break;
    case 'delta': /* show interim transcript: msg.text */ break;
    case 'final': /* commit transcript: msg.text */ break;
    case 'error': console.error(msg.message); break;
  }
});
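
The example above picks the nearest input sample for each output sample, which is cheap but can alias high frequencies into the speech band. If that becomes audible, one common alternative is to average each group of input samples instead. The function below is a minimal sketch of that idea, not something the adapter requires; the name `downsampleAveraged` is hypothetical, and a proper low-pass filter would do better still.

```typescript
// Block-averaging downsampler: each output sample is the mean of the input
// samples that fall into its window, which acts as a crude low-pass filter.
function downsampleAveraged(
  input: Float32Array,
  inRate: number,
  outRate: number
): Float32Array {
  if (outRate >= inRate) return input; // never upsample; pass through
  const ratio = inRate / outRate;
  const length = Math.floor(input.length / ratio);
  const result = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), input.length);
    let sum = 0;
    for (let j = start; j < end; j++) {
      sum += input[j] ?? 0;
    }
    result[i] = end > start ? sum / (end - start) : 0;
  }
  return result;
}
```

It has the same signature as a nearest-sample `downsample(input, inRate, outRate)` function, so it can be swapped in without touching the rest of the capture code.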
HTTPS required
getUserMedia only works on secure origins (HTTPS / localhost), and the
socket must be wss:// in production.
Don't hand-roll this in the editor. The Jodit PRO AI Assistant ships a microphone button that wires the capture above into the assistant prompt for you (it reuses the editor's
speech-recognize plumbing). This raw example is for non-Jodit integrations or for understanding the protocol.
Provider support
Transcription is a provider capability. Each provider adapter may implement
openTranscriptionSession(session); the base adapter's default throws
"Speech-to-text transcription is not supported by this provider", so a provider
that hasn't implemented it fails fast with a clear message.
A session carries the browser-facing socket, the resolved options
({ model, language }, chosen by the caller like every other handler), an
AbortSignal, and a reportUsage callback the transport supplies so audio usage
flows through the same credits → onUsage path as text requests. See the
ITranscriptionSession / ITranscriptionContext / ITranscriptionUsage types.
The OpenAI adapter implements this against the OpenAI Realtime transcription
API (the realtime endpoint can be overridden with
config.options.realtimeTranscriptionUrl). It sends ready once the upstream
session is live, forwards interim delta and committed final transcripts, and
reports audio/text token usage when the session ends.
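
To make the fail-fast default concrete, here is a rough sketch of the override pattern. Every type and class name in it is a hypothetical stand-in: the real ITranscriptionSession may use different field names, the usage payload shape is a guess, and whether the real method is synchronous or returns a promise is not stated here. Only the error message is taken from this page.

```typescript
// Hypothetical stand-in for ITranscriptionSession; field names are assumptions.
interface SketchTranscriptionSession {
  options: { model: string; language: string };
  signal: AbortSignal;
  // Stand-in for writing a JSON event to the browser-facing socket.
  send(event: { type: string; text?: string; message?: string }): void;
  // Stand-in for the transport-supplied usage callback.
  reportUsage(usage: { audioTokens: number; textTokens: number }): void;
}

// Base behaviour: providers without transcription support fail fast.
class SketchBaseAdapter {
  openTranscriptionSession(_session: SketchTranscriptionSession): void {
    throw new Error('Speech-to-text transcription is not supported by this provider');
  }
}

// A provider that supports transcription overrides the method.
class SketchSupportedAdapter extends SketchBaseAdapter {
  openTranscriptionSession(session: SketchTranscriptionSession): void {
    // Tell the browser it may start streaming audio.
    session.send({ type: 'ready' });
    // Report usage once when the session is aborted or ends.
    session.signal.addEventListener(
      'abort',
      () => session.reportUsage({ audioTokens: 0, textTokens: 0 }),
      { once: true }
    );
  }
}
```

A real implementation would open the upstream realtime connection before sending ready and would report measured usage rather than zeros.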
Standalone vs. integration
In standalone mode (start()), the transcription WebSocket is attached to
the server automatically.
In integration mode (mounting the adapter onto an existing Express app), the
host owns the http.Server, so attach the WebSocket yourself:
import { attachTranscriptionWs } from 'jodit-ai-adapter';
const handle = attachTranscriptionWs(httpServer, config);
// later, to detach the upgrade listener and close the WebSocket server:
handle.close();
attachTranscriptionWs registers an upgrade listener that only handles the
${routePrefix}/transcribe path and ignores other paths, so it can coexist with
additional WebSocket handlers on the same server.
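
The coexistence described above comes down to a path check at the top of the upgrade listener. The following sketch shows what such a check might look like; it is an illustration under the assumption that matching is done on the exact pathname, and `isTranscribeUpgrade` is a hypothetical name, not a function exported by jodit-ai-adapter.

```typescript
// Hypothetical path filter: true only for `${routePrefix}/transcribe`,
// regardless of the query string.
function isTranscribeUpgrade(
  requestUrl: string | undefined,
  routePrefix: string = '/ai'
): boolean {
  if (!requestUrl) return false;
  // The base URL is a throwaway; it only lets URL parse a relative path.
  const { pathname } = new URL(requestUrl, 'http://placeholder.invalid');
  const prefix = routePrefix.replace(/\/+$/, '');
  return pathname === `${prefix}/transcribe`;
}

// Inside an http.Server 'upgrade' listener, a handler built this way would
// return early when the check fails, leaving the socket untouched so that
// another listener (for example a different WebSocket library) can claim it:
//
//   httpServer.on('upgrade', (req, socket, head) => {
//     if (!isTranscribeUpgrade(req.url, '/ai')) return;
//     /* authenticate and complete the transcription handshake */
//   });
```

The important design point is that a non-matching listener must neither write to nor destroy the socket, otherwise it would break the other handlers on the same server.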
See also the Credits System for how transcription usage will be
priced and reported through onUsage.