VaultClip: Building Private Video Q&A Fully in the Browser

I built VaultClip because I wanted a simple thing that still feels surprisingly hard to get today: upload a video, ask questions about it, and get useful answers, without sending the video to someone else's server.

The GitHub repo is here: https://github.com/abhinav-TB/VaultClip

Live demo: https://vaultclip.gamezdev.workers.dev

VaultClip is an open-source, browser-only video Q&A app. You select a video or audio file, the app extracts the audio locally, transcribes it locally, builds searchable context locally, and lets you chat with the media using local AI. No uploads. No accounts. No backend. No analytics pipeline quietly collecting your files.

That constraint, that everything happens in the browser, shaped almost every technical decision in the project.

Why build this?

A lot of useful information is trapped inside videos: lectures, meeting recordings, interviews, demos, tutorials, walkthroughs, research calls, debugging sessions, and personal notes. The usual way to find something is still manual scrubbing: drag the timeline, listen for a few seconds, miss it, go back, try again.

Cloud AI tools can solve that problem, but they often require uploading the file. That is fine for public content, but it is a bad fit for private recordings. A meeting recording, an interview, a classroom lecture, or an unreleased demo may contain sensitive information. Even when a provider has good policies, the user still has to trust an external system with the media.

VaultClip takes the opposite approach:

the file stays on your machine
inference runs in your browser
temporary data stays in browser memory
after the initial model download, the app can work offline
the code is open source and auditable

The goal was not just to make video Q&A work. The goal was to make it work while preserving the user's control over the media.

The product idea

The user flow is intentionally small:

Pick a video or audio file.
Preview it in the browser.
Let VaultClip extract and transcribe the audio.
Ask natural-language questions.
Get answers grounded in timestamped transcript context.

Examples of questions:

"What were the main decisions in this meeting?"
"When does the speaker explain the architecture?"
"Summarize the action items."
"Find the part where they talk about latency."
"What are the objections raised in this interview?"

The important part is that the app does not need to upload the file to answer these questions.

Architecture at a high level

VaultClip is a React + TypeScript single-page app built with Vite. The heavier work is moved out of the UI thread into Web Workers, because media processing and model inference can easily freeze the browser if handled directly in React components.

The pipeline looks like this:

User selects media
        ↓
Validate file size, duration, and format
        ↓
Preview the file locally
        ↓
Extract audio with ffmpeg.wasm in a Web Worker
        ↓
Transcribe audio chunks with Gemma via Transformers.js + WebGPU
        ↓
Create timestamped transcript segments
        ↓
Build local retrieval context
        ↓
Answer questions in a local chat interface

The current repository is here:

https://github.com/abhinav-TB/VaultClip

The app is deployed as a static app on Cloudflare Workers, but the deployed app is still not a backend for user files. The Worker serves the frontend assets and handles app routing. The actual media processing happens in the user's browser.

Why browser-only AI is interesting

For a long time, "AI app" almost automatically meant "send data to an API." That is still the right architecture for many products, especially when models are large or collaboration matters. But browsers have become much more capable.

With WebGPU, Web Workers, WebAssembly, and libraries like Transformers.js, it is now realistic to run useful local AI workflows inside a normal web app. The browser can:

decode and preview media
run ffmpeg.wasm for audio extraction
use WebGPU for accelerated inference
cache model files locally
manage temporary data in memory
provide a familiar UI without installation

This makes a new class of apps possible: local-first AI tools that feel like web apps but behave more like private desktop utilities.

VaultClip is an experiment in that direction.

Keeping media private

Privacy in VaultClip is not just a sentence in the README. It is a design constraint.

The selected media file is referenced through the browser's File API. It is not serialized into Redux, not uploaded to a server, and not persisted to a remote database. The extracted audio is also kept locally, and object URLs are cleaned up when they are no longer needed.

The app has no backend API for media. The only unavoidable network step is the initial download of model assets. Once the model is cached, the workflow can run locally.

This matters because video files are often high-context. They can include faces, voices, screenshares, credentials accidentally visible in a demo, internal project names, customer details, or personal conversations. A privacy-first architecture should not ask for that data unless it truly needs it.

VaultClip's answer is: for this use case, the app does not need it.

The local processing pipeline

The first technical challenge is turning an arbitrary media file into something the model can use.

Videos are large and varied, and browser support depends on codecs. VaultClip starts with guardrails: file size limits, duration limits, supported formats, and browser preview checks. These constraints are not glamorous, but they make the app much more reliable. Browser memory is finite, and local inference gets expensive quickly.
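To make the guardrail idea concrete, here is a minimal sketch of a validation step. The limit values, type names, and the checkGuardrails function are all hypothetical and chosen for illustration; the real limits live in the repo and may differ.

```typescript
// Hypothetical limits and names, for illustration only.
interface MediaLimits {
  maxBytes: number;
  maxDurationSec: number;
  allowedMimePrefixes: string[];
}

interface MediaMeta {
  fileName: string;
  sizeBytes: number;
  durationSec: number;
  mimeType: string;
}

type GuardrailIssue =
  | { kind: "too-large"; limitBytes: number }
  | { kind: "too-long"; limitSec: number }
  | { kind: "unsupported-format"; mimeType: string };

const DEFAULT_LIMITS: MediaLimits = {
  maxBytes: 500 * 1024 * 1024, // assumed 500 MiB cap
  maxDurationSec: 60 * 60, // assumed one-hour cap
  allowedMimePrefixes: ["video/", "audio/"],
};

// Returns every violated guardrail so the UI can explain all of them at once.
function checkGuardrails(
  meta: MediaMeta,
  limits: MediaLimits = DEFAULT_LIMITS,
): GuardrailIssue[] {
  const issues: GuardrailIssue[] = [];
  if (meta.sizeBytes > limits.maxBytes) {
    issues.push({ kind: "too-large", limitBytes: limits.maxBytes });
  }
  if (meta.durationSec > limits.maxDurationSec) {
    issues.push({ kind: "too-long", limitSec: limits.maxDurationSec });
  }
  const supported = limits.allowedMimePrefixes.some((prefix) =>
    meta.mimeType.startsWith(prefix),
  );
  if (!supported) {
    issues.push({ kind: "unsupported-format", mimeType: meta.mimeType });
  }
  return issues;
}
```

Returning a list of issues rather than throwing on the first one lets the interface show a clear failure state for each problem before any expensive work starts.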

Once the user confirms the file, VaultClip extracts audio using ffmpeg.wasm inside a Web Worker. That keeps the UI responsive while the app does CPU-heavy work. The extracted audio is converted into a transcription-friendly format, such as WAV at 16 kHz.
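As a rough illustration of the extraction step, the sketch below builds an ffmpeg argument list for 16 kHz WAV output. The flags are standard ffmpeg options; the function name, the option names, and the choice of mono audio are my assumptions, not something taken from the repo.

```typescript
// Hypothetical helper: builds the argument list a worker could hand to ffmpeg.wasm.
interface ExtractOptions {
  inputName: string; // file name inside ffmpeg.wasm's virtual filesystem
  outputName: string; // should end in .wav so ffmpeg picks the WAV muxer
  sampleRateHz?: number; // defaults to 16000, as described in the text
  channels?: number; // defaults to 1 (mono), an assumption
}

function buildExtractArgs(opts: ExtractOptions): string[] {
  const sampleRate = opts.sampleRateHz ?? 16000;
  const channels = opts.channels ?? 1;
  return [
    "-i", opts.inputName,
    "-vn", // drop the video stream, keep audio only
    "-ac", String(channels), // number of audio channels
    "-ar", String(sampleRate), // output sample rate in Hz
    "-c:a", "pcm_s16le", // 16-bit little-endian PCM
    opts.outputName,
  ];
}
```

Inside the worker, this array would be passed to whatever command entry point the installed ffmpeg.wasm version exposes, after the input file has been written into its virtual filesystem. The exact call differs between releases, so it is worth checking the documentation for the version in use.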

The transcription step processes audio in chunks. Chunking matters because long media files can exceed practical generation limits, and smaller chunks make progress easier to track. VaultClip keeps timestamp ranges for transcript segments so answers can point back to where the information came from.
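The chunking described above can be sketched as a small planning function that turns a duration into timestamped ranges. The 30-second default, the optional overlap, and all names here are assumptions for illustration; VaultClip's actual chunk size may be different.

```typescript
interface ChunkRange {
  index: number;
  startSec: number;
  endSec: number;
}

// Splits [0, durationSec) into fixed-size chunks; the last chunk may be shorter.
function planChunks(
  durationSec: number,
  chunkSec = 30, // assumed default
  overlapSec = 0, // optional overlap between neighbouring chunks
): ChunkRange[] {
  if (durationSec <= 0) return [];
  if (chunkSec <= 0 || overlapSec < 0 || overlapSec >= chunkSec) {
    throw new RangeError("chunkSec must be positive and larger than overlapSec");
  }
  const step = chunkSec - overlapSec;
  const chunks: ChunkRange[] = [];
  for (let start = 0; start < durationSec; start += step) {
    const end = Math.min(start + chunkSec, durationSec);
    chunks.push({ index: chunks.length, startSec: start, endSec: end });
    if (end >= durationSec) break; // the tail is already covered
  }
  return chunks;
}

// Converts a time range to sample offsets into the extracted PCM audio.
function chunkToSamples(chunk: ChunkRange, sampleRateHz = 16000) {
  return {
    startSample: Math.round(chunk.startSec * sampleRateHz),
    endSample: Math.round(chunk.endSec * sampleRateHz),
  };
}
```

Because each chunk carries its own start and end time, a transcript segment produced from it can inherit those timestamps, and progress can be reported as completed chunks over total chunks.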

After transcription, the app prepares local context for question answering. Instead of sending the whole transcript to a server, VaultClip builds a local retrieval layer and feeds relevant context into the chat flow.
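The post does not spell out how the retrieval layer ranks segments, so the following is only a stand-in: a plain keyword-overlap scorer over timestamped segments, plus a helper that formats the selected segments as context for the chat prompt. Every name here is hypothetical, and a real implementation could just as well use embeddings.

```typescript
interface TranscriptSegment {
  startSec: number;
  endSec: number;
  text: string;
}

// Lowercases, splits on non-alphanumerics, and drops very short tokens.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 2);
}

// Formats seconds as mm:ss for display next to each segment.
function formatTimestamp(totalSec: number): string {
  const whole = Math.max(0, Math.floor(totalSec));
  const pad = (n: number) => (n < 10 ? "0" + n : String(n));
  return pad(Math.floor(whole / 60)) + ":" + pad(whole % 60);
}

// Ranks segments by how many distinct question terms they contain.
function retrieve(
  question: string,
  segments: TranscriptSegment[],
  topK = 3,
): TranscriptSegment[] {
  const queryTerms = new Set(tokenize(question));
  return segments
    .map((segment) => {
      const terms = new Set(tokenize(segment.text));
      let score = 0;
      queryTerms.forEach((term) => {
        if (terms.has(term)) score += 1;
      });
      return { segment, score };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score || a.segment.startSec - b.segment.startSec)
    .slice(0, topK)
    .map((entry) => entry.segment);
}

// Joins the selected segments into a timestamped context block.
function buildContext(segments: TranscriptSegment[]): string {
  return segments
    .map((s) => `[${formatTimestamp(s.startSec)}-${formatTimestamp(s.endSec)}] ${s.text}`)
    .join("\n");
}
```

Keeping the timestamps in the context string is what lets an answer point back to the part of the recording it came from.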

Running Gemma in the browser

VaultClip uses Gemma through Transformers.js with WebGPU acceleration. The model lifecycle has a few stages:

load or fetch model assets
initialize the runtime
warm up inference
process transcription and chat tasks
surface progress and failure states in the UI

This is very different from calling a hosted API. With an API, the hard work is hidden behind an HTTP request. In the browser, the app has to care about model download size, browser support, GPU availability, memory pressure, initialization time, and failure recovery.

That tradeoff is worth it for VaultClip because the privacy model becomes much stronger. The user's media does not leave the device just because they want an AI summary.
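One way to keep the lifecycle stages listed above explicit is a small reducer over model events. This is a sketch under my own naming: the phases, events, and the reduceModel function are hypothetical and are not the slice used in the repo.

```typescript
type ModelPhase =
  | "idle"
  | "fetching"
  | "initializing"
  | "warming-up"
  | "ready"
  | "failed";

interface ModelState {
  phase: ModelPhase;
  progress: number; // download progress between 0 and 1
  error?: string;
}

type ModelEvent =
  | { type: "load-requested" }
  | { type: "download-progress"; fraction: number }
  | { type: "assets-ready" }
  | { type: "runtime-ready" }
  | { type: "warmup-done" }
  | { type: "failed"; message: string };

// Pure transition function: events that do not fit the current phase are ignored.
function reduceModel(state: ModelState, event: ModelEvent): ModelState {
  switch (event.type) {
    case "load-requested":
      return state.phase === "idle" || state.phase === "failed"
        ? { phase: "fetching", progress: 0 }
        : state;
    case "download-progress":
      return state.phase === "fetching"
        ? { ...state, progress: Math.min(1, Math.max(0, event.fraction)) }
        : state;
    case "assets-ready":
      return state.phase === "fetching"
        ? { phase: "initializing", progress: 1 }
        : state;
    case "runtime-ready":
      return state.phase === "initializing"
        ? { phase: "warming-up", progress: 1 }
        : state;
    case "warmup-done":
      return state.phase === "warming-up"
        ? { phase: "ready", progress: 1 }
        : state;
    case "failed":
      return { phase: "failed", progress: state.progress, error: event.message };
  }
}
```

Treating failure as a first-class phase, with a retry path back to fetching, is what makes it possible to surface download or GPU errors in the UI instead of leaving the user on a spinner.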

State management and workers

One lesson from the project: do not put large browser objects directly into global app state.

VaultClip uses Redux Toolkit for serializable UI and workflow state, but the actual File objects and raw audio bytes live in separate registries. Redux stores metadata like file name, size, duration, object URL, processing status, transcript segments, and model state.

That separation keeps the app easier to debug and avoids pushing non-serializable heavy objects through the normal React state flow.
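The split between serializable metadata and heavy objects can be sketched as a registry that hands back only plain data. The class and field names are hypothetical. The URL functions are injected so the sketch does not depend on the DOM; in the browser they would wrap URL.createObjectURL and URL.revokeObjectURL.

```typescript
// Plain, serializable data: safe to keep in a Redux store.
interface MediaMetadata {
  id: string;
  fileName: string;
  sizeBytes: number;
  objectUrl: string;
}

// Injected URL functions. In the browser, for a File:
//   { create: (f) => URL.createObjectURL(f), revoke: (u) => URL.revokeObjectURL(u) }
interface UrlApi<T> {
  create(value: T): string;
  revoke(url: string): void;
}

// Owns the heavy objects (File, raw audio bytes) outside of app state.
class MediaRegistry<T> {
  private readonly entries = new Map<string, { value: T; objectUrl: string }>();
  private readonly urls: UrlApi<T>;
  private nextId = 1;

  constructor(urls: UrlApi<T>) {
    this.urls = urls;
  }

  // Stores the object and returns only metadata for the store.
  register(value: T, fileName: string, sizeBytes: number): MediaMetadata {
    const id = `media-${this.nextId++}`;
    const objectUrl = this.urls.create(value);
    this.entries.set(id, { value, objectUrl });
    return { id, fileName, sizeBytes, objectUrl };
  }

  get(id: string): T | undefined {
    return this.entries.get(id)?.value;
  }

  // Revokes the object URL and drops the reference so memory can be reclaimed.
  release(id: string): boolean {
    const entry = this.entries.get(id);
    if (!entry) return false;
    this.urls.revoke(entry.objectUrl);
    this.entries.delete(id);
    return true;
  }
}
```

Components and reducers then pass around the id from the metadata, and only the code that actually needs the bytes asks the registry for them.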

The Web Worker boundary is also important. The UI should not know too much about how ffmpeg or the model runtime works. Instead, a worker client provides a promise-based interface for long-running tasks. This keeps the app structure cleaner:

React handles the user experience
Redux tracks state transitions
workers handle expensive compute
small registry modules own large in-memory objects
shared types define the contracts between pieces

That architecture made the app easier to grow without turning every component into an orchestration layer.
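A promise-based worker client of the kind described above might look roughly like this. The message shapes, the WorkerPort interface, and the WorkerClient class are assumptions for illustration; in a real app, WorkerPort would be a thin adapter around a Worker that forwards send to worker.postMessage and subscribes with worker.addEventListener("message", ...).

```typescript
interface WorkerRequest {
  id: number;
  task: string;
  payload: unknown;
}

type WorkerResponse =
  | { id: number; ok: true; result: unknown }
  | { id: number; ok: false; error: string };

// Hypothetical transport abstraction over a real Worker.
interface WorkerPort {
  send(message: WorkerRequest): void;
  subscribe(handler: (response: WorkerResponse) => void): void;
}

interface PendingCall {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

// Correlates requests and responses by id so callers can simply await a task.
class WorkerClient {
  private readonly pending = new Map<number, PendingCall>();
  private readonly port: WorkerPort;
  private nextId = 1;

  constructor(port: WorkerPort) {
    this.port = port;
    this.port.subscribe((response) => this.settle(response));
  }

  run(task: string, payload: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.port.send({ id, task, payload });
    });
  }

  get pendingCount(): number {
    return this.pending.size;
  }

  private settle(response: WorkerResponse): void {
    const call = this.pending.get(response.id);
    if (!call) return; // unknown or already settled id
    this.pending.delete(response.id);
    if (response.ok) call.resolve(response.result);
    else call.reject(new Error(response.error));
  }
}
```

With this shape, UI code can write something like `await client.run("extract-audio", { fileId })` without knowing anything about ffmpeg or the model runtime, where the task name and payload are again only examples.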

The hard parts

The hardest parts were not the UI screens. The hard parts were browser limitations and edge cases.

Browser memory is not straightforward. Web APIs do not give a reliable cross-browser way to know total system memory or GPU memory. That means the app needs conservative limits and clear failure states instead of pretending it can process every file.

WebGPU support is also still uneven. Chrome and Edge are the best targets right now. Firefox and Safari are improving, but for this kind of app, WebGPU support is a real requirement.
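Since WebGPU is a hard requirement, it helps to check for it before starting any download. The sketch below uses the standard navigator.gpu.requestAdapter() entry point, but takes the navigator as a parameter with a minimal hand-written type so it stays self-contained; the type names and the detectWebGpu function are my own.

```typescript
// Minimal structural types so the sketch does not need WebGPU type definitions.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

type WebGpuSupport =
  | { supported: true }
  | { supported: false; reason: "no-api" | "no-adapter" | "error" };

// navigator.gpu can exist while no usable adapter is available,
// so both the API and the adapter are checked.
async function detectWebGpu(nav: NavigatorLike | undefined): Promise<WebGpuSupport> {
  if (!nav || !nav.gpu) return { supported: false, reason: "no-api" };
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter
      ? { supported: true }
      : { supported: false, reason: "no-adapter" };
  } catch {
    return { supported: false, reason: "error" };
  }
}
```

In the browser this would be called with the global navigator object, and the reason field can drive a specific message such as suggesting Chrome or Edge when the API is missing entirely.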

Another subtle issue is transcription coverage. Longer chunks can produce incomplete coverage or weaker timestamp behavior, so VaultClip defaults to smaller chunk sizes. The app is designed around practical reliability instead of theoretical maximum throughput.

There is also the first-run experience: the initial model download is large. That is not as seamless as an API call, but the upside is that the model can be cached and reused locally.

What VaultClip is good for

VaultClip is useful when privacy matters more than cloud-scale convenience.

Some examples:

students reviewing long lectures
engineers searching through demo recordings
researchers analyzing interviews
creators reviewing raw footage
professionals summarizing meeting recordings
anyone who wants Q&A over media without uploading it

It is not trying to replace every cloud transcription or video intelligence product. It is designed for the local-first case: give me useful AI over my file, but keep the file with me.

Current limitations

The current version is still an MVP. The main limitations are:

WebGPU is required for the best experience
first-time model download is large
browser memory limits constrain file size and duration
transcript timestamps are segment-level, not word-level
the app focuses on one active media session at a time
very long recordings still need better chunking and persistence strategies

These constraints are acceptable for the first version because they keep the privacy model simple and the architecture understandable.

Why open source matters here

For privacy-focused tools, open source is not just a nice extra. It is part of the trust model.

If an app claims "your files never leave your device," users should be able to inspect how that is implemented. VaultClip's code is available here:

https://github.com/abhinav-TB/VaultClip

The repo includes the architecture, worker pipeline, model runtime integration, media guardrails, deployment setup, and local development instructions.

If you want to run it locally:

git clone https://github.com/abhinav-TB/VaultClip.git
cd VaultClip
npm install
npm run dev -- --host 127.0.0.1

Then open http://127.0.0.1:5173 in a WebGPU-capable browser like Chrome or Edge.

What I learned

The biggest lesson from building VaultClip is that browser AI is no longer just a toy category. It still has rough edges, but it is useful enough for real workflows when the scope is chosen carefully.

The second lesson is that privacy-first design needs to happen at the architecture level. You cannot bolt it on at the end. If the core workflow assumes uploads, then privacy becomes policy language. If the core workflow keeps data local, privacy becomes a property of the system.

VaultClip is my attempt to build the second kind of app.

Closing thoughts

We are used to thinking of the browser as a thin client for server-side intelligence. But for many personal AI tasks, especially ones involving private files, the browser can be the compute environment.

VaultClip shows that a modern web app can process media, run local models, build retrieval context, and provide useful Q&A, all without asking the user to hand over their video.

If you are interested in local-first AI, browser ML, WebGPU, or privacy-preserving media tools, check out the project here:

https://github.com/abhinav-TB/VaultClip

And try the live demo here:

https://vaultclip.gamezdev.workers.dev

If you find the idea useful or want to support more local-first browser AI experiments, please consider giving the GitHub repo a star. It helps the project reach more people who care about private AI tools.
