Most video AI gives you bullet points. I built one that gives you a map.

[ENG-001] May 12, 2026 · 9 min read

Flat summaries strip video of the one thing that makes it video: time. Vistral keeps it. Here is how I turned meetings, interviews, and podcasts into temporal knowledge graphs where every insight links back to the exact second it happened.

Open any video summarizer and the output looks the same. A title, six bullet points, maybe an action item or two. It reads like the minutes of a meeting nobody attended. It is fast, it is tidy, and it quietly throws away the most valuable thing the video ever had.

Video is not a document. It is a sequence. Someone makes a claim at 4:12, a slide contradicts them at 7:30, and a different speaker quietly walks it back at 11:05. A bullet point cannot hold that. It flattens an hour of evolving thought into a list of facts with no memory of how they got there.

I spent the Mistral AI Hackathon building a different answer. It is called Vistral, and instead of a summary it produces a temporal knowledge graph: a structured, queryable representation of a video where every insight is wired back to the exact moment it came from.

The problem with bullet points

A summary asks you to trust it. You read a claim like "the team agreed to ship in Q3" and you have no way to check it without scrubbing through the recording yourself. The summary is a dead end. It is the answer with the evidence deleted.

Worse, it has no sense of direction. Topics in a real conversation evolve. People change their minds. A good summary of a bad meeting still looks like a good summary. The structure that would tell you the conversation went in circles is exactly the structure that gets discarded.

A summary is the answer with the evidence deleted. A knowledge graph is the answer with the evidence attached.

I wanted the opposite of a dead end. I wanted every node to be a door: click an insight, jump to the second it was said, see who said it, see what else it connects to.

What a temporal knowledge graph actually is

Strip away the buzzwords and it is simple. Nodes are the things in the video: speakers, topics, claims, action items, KPIs. Edges are the relationships between them: this speaker made this claim, this claim supports this topic, this visual contradicts this statement. The temporal part is the detail that matters most: every node carries the timestamp where it appeared.

That single property changes everything downstream. Because the graph knows when, it can render an interactive timeline, detect that two claims about the same topic happened twenty minutes apart, and let a force-directed visualization cluster a conversation by what it was actually about rather than the order it was spoken.
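As a rough illustration of what the timestamp buys you, here is a minimal sketch that finds claims on the same topic made at least twenty minutes apart. It is not Vistral's actual code: the `Claim` class, its `topic` field, and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


def to_seconds(stamp: str) -> int:
    """Convert "MM:SS" or "HH:MM:SS" into seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


@dataclass(frozen=True)
class Claim:
    id: str
    topic: str  # hypothetical field: the topic this claim attaches to
    t: str      # timestamp where the claim appeared


def distant_claims(claims, min_gap_s=20 * 60):
    """Return (id, id, gap) for same-topic claims at least min_gap_s apart."""
    ordered = sorted(claims, key=lambda c: to_seconds(c.t))
    pairs = []
    for a, b in combinations(ordered, 2):
        gap = to_seconds(b.t) - to_seconds(a.t)
        if a.topic == b.topic and gap >= min_gap_s:
            pairs.append((a.id, b.id, gap))
    return pairs


# Hypothetical sample data.
claims = [
    Claim("clm_1", "pricing", "04:12"),
    Claim("clm_2", "hiring", "09:40"),
    Claim("clm_3", "pricing", "27:05"),
]
print(distant_claims(claims))  # [('clm_1', 'clm_3', 1373)]
```

A flat list of bullet points cannot answer this question at all, because the gap between the two claims is exactly the information a summary discards.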

01 Automatic speaker diarization with talk-time statistics
02 Topic segmentation with real temporal boundaries
03 Action items extracted with assignees and priority
04 KPI extraction with the surrounding context
05 Contradiction detection between what is said and what is shown
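To make the first item concrete, talk-time statistics fall out of diarized segments with a few lines of arithmetic. This is a sketch under assumptions: the `(speaker, start_s, end_s)` tuple shape and the sample values are hypothetical, not Vistral's actual diarization output.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum speaking time per speaker and report each share of total speech."""
    totals = defaultdict(float)
    for speaker, start_s, end_s in segments:
        totals[speaker] += end_s - start_s
    spoken = sum(totals.values())
    return {
        speaker: {
            "seconds": seconds,
            # Guard against a recording with no measurable speech.
            "share": round(seconds / spoken, 2) if spoken else 0.0,
        }
        for speaker, seconds in totals.items()
    }


# Hypothetical diarized segments: (speaker, start in seconds, end in seconds).
segments = [("Maya", 0.0, 30.0), ("Sam", 30.0, 40.0), ("Maya", 40.0, 50.0)]
stats = talk_time(segments)
print(stats["Maya"])  # {'seconds': 40.0, 'share': 0.8}
```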

Legend: Speaker, Topic, Claim, KPI, Action

A slice of a real graph. Nodes carry timestamps, edges carry relationships. The red edge is a contradiction Vistral flagged on its own.

The architecture: two passes, one graph

The naive approach is to dump the full transcript into a large model and ask for everything at once. It works for a demo. It falls apart on a one-hour video, because the model is simultaneously trying to perceive what happened and reason about what it means, and it does neither well.

Vistral splits those two jobs apart. The whole pipeline runs in eight stages, but the intelligence lives in two distinct LLM passes with a graph built in between.

INPUT · Video: Meeting, interview or podcast. Audio and frames are split apart.

PASS A · Perception: Mistral Small reads the Voxtral transcript and the Pixtral visual analysis and extracts entities. No interpretation, just a faithful inventory.

GRAPH · Temporal Knowledge Graph: Deterministic code stitches nodes and edges, anchored to timestamps.

PASS B · Reasoning: Mistral Small reads only the graph and writes insights with evidence chains.

OUTPUT · Insights: Every claim links back to a node and a timestamp.

The Vistral pipeline: perception and reasoning never run in the same breath. A deterministic graph sits between them as the contract.

Pass A is perception. It reads the transcript and the visual analysis and does nothing but extract entities: who spoke, what topics surfaced, what claims were made, what numbers were mentioned. No interpretation, no insight, no opinion. Just a faithful inventory of what is in the video.

Between the passes, those entities are assembled into the temporal knowledge graph. This is not an LLM step. It is deterministic code stitching nodes and edges together and attaching timestamps. The graph is the contract between perception and reasoning.

graph.json

{
  "nodes": [
    { "id": "spk_1", "type": "speaker", "label": "Maya",  "t": "00:00" },
    { "id": "clm_4", "type": "claim",   "label": "Revenue up in every region", "t": "04:12" },
    { "id": "kpi_2", "type": "kpi",     "label": "Q2 churn 3.1%", "t": "07:30" }
  ],
  "edges": [
    { "from": "spk_1", "to": "clm_4", "rel": "asserts" },
    { "from": "clm_4", "to": "kpi_2", "rel": "contradicts" }
  ]
}
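The stitching step can be pictured as ordinary code with no model call in it. The sketch below is an illustration, not Vistral's implementation: `build_graph`, the entity dictionaries, and the `(source, relation, target)` tuples are assumed shapes, chosen so the output matches the JSON above.

```python
import json


def to_seconds(stamp):
    """Convert "MM:SS" or "HH:MM:SS" into seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def build_graph(entities, relations):
    """Assemble a graph from Pass A output using plain, repeatable rules."""
    nodes = {}
    # Sort by time, then id, so the same input always yields the same graph.
    for ent in sorted(entities, key=lambda e: (to_seconds(e["t"]), e["id"])):
        nodes.setdefault(ent["id"], dict(ent))
    edges = []
    for src, rel, dst in relations:
        edge = {"from": src, "to": dst, "rel": rel}
        # Drop duplicates and edges that point at entities Pass A never produced.
        if src in nodes and dst in nodes and edge not in edges:
            edges.append(edge)
    return {"nodes": list(nodes.values()), "edges": edges}


# Hypothetical Pass A output, deliberately out of order.
entities = [
    {"id": "kpi_2", "type": "kpi", "label": "Q2 churn 3.1%", "t": "07:30"},
    {"id": "spk_1", "type": "speaker", "label": "Maya", "t": "00:00"},
    {"id": "clm_4", "type": "claim", "label": "Revenue up in every region", "t": "04:12"},
]
relations = [
    ("spk_1", "asserts", "clm_4"),
    ("clm_4", "contradicts", "kpi_2"),
    ("clm_4", "supports", "top_9"),  # dangling: top_9 was never extracted
]
graph = build_graph(entities, relations)
print(json.dumps(graph, indent=2))
```

Because nothing here is probabilistic, a malformed edge is rejected by a rule rather than passed along for the reasoning pass to interpret.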

Pass B is reasoning. It never sees the raw transcript. It sees only the serialized graph, and from that compact structure it generates the insights, each one carrying an evidence chain that points back to specific nodes and timestamps.
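One way to keep an evidence chain honest is to resolve every cited node id against the graph and reject anything that does not exist. This is a sketch under assumptions: the insight shape (`text` plus a list of `evidence` ids) and the `resolve_evidence` helper are hypothetical, not Vistral's schema.

```python
def resolve_evidence(graph, insight):
    """Map an insight's cited node ids to (id, label, timestamp) tuples.

    Raises ValueError when the insight cites a node that is not in the graph.
    """
    by_id = {node["id"]: node for node in graph["nodes"]}
    missing = [nid for nid in insight["evidence"] if nid not in by_id]
    if missing:
        raise ValueError(f"insight cites unknown nodes: {missing}")
    return [(nid, by_id[nid]["label"], by_id[nid]["t"]) for nid in insight["evidence"]]


graph = {
    "nodes": [
        {"id": "clm_4", "type": "claim", "label": "Revenue up in every region", "t": "04:12"},
        {"id": "kpi_2", "type": "kpi", "label": "Q2 churn 3.1%", "t": "07:30"},
    ],
    "edges": [{"from": "clm_4", "to": "kpi_2", "rel": "contradicts"}],
}

# Hypothetical Pass B output.
insight = {
    "text": "The revenue claim conflicts with the figure shown on screen.",
    "evidence": ["clm_4", "kpi_2"],
}
for node_id, label, stamp in resolve_evidence(graph, insight):
    print(f"{stamp}  {node_id}  {label}")
```

An insight that cites a node the graph never contained fails loudly instead of reaching the reader.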

Separating perception from reasoning is the single decision that made Vistral reliable. A model asked to do one job at a time hallucinates far less than a model asked to do everything in one breath.

The 91 percent trick

Here is the part I did not expect to matter as much as it did. A raw ten-minute transcript with visual annotations runs around forty thousand tokens. The temporal knowledge graph that represents the same video serializes to about three and a half thousand. That is a 91 percent reduction.

Raw transcript + vision: ~40,000 tokens

Serialized temporal graph: ~3,500 tokens

91% fewer tokens into the reasoning pass

The graph is not just smaller. It is pre-filtered signal, so the reasoning pass never reads filler, crosstalk or repetition.

The reasoning pass becomes faster, cheaper, and noticeably sharper, because the graph has already removed the noise. Filler words, repeated phrasing, and crosstalk never reach Pass B. The model reasons over signal, not over a transcript. A ten-minute video goes from upload to finished knowledge graph in roughly two minutes.
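The arithmetic behind the 91 percent figure, plus one possible compact serialization, can be sketched as follows. The line-per-node text format is purely an assumption for illustration; the post does not specify how Vistral serializes the graph, and the token counts are the approximate figures quoted above.

```python
def reduction(before_tokens, after_tokens):
    """Fraction of tokens removed when going from before to after."""
    return 1 - after_tokens / before_tokens


def serialize(graph):
    """Hypothetical compact format: one short line per node, then per edge."""
    lines = [f'{n["id"]}|{n["type"]}|{n["t"]}|{n["label"]}' for n in graph["nodes"]]
    lines += [f'{e["from"]}>{e["rel"]}>{e["to"]}' for e in graph["edges"]]
    return "\n".join(lines)


graph = {
    "nodes": [
        {"id": "spk_1", "type": "speaker", "label": "Maya", "t": "00:00"},
        {"id": "clm_4", "type": "claim", "label": "Revenue up in every region", "t": "04:12"},
    ],
    "edges": [{"from": "spk_1", "to": "clm_4", "rel": "asserts"}],
}
print(serialize(graph))
print(f"{reduction(40_000, 3_500):.0%}")  # 91%
```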

The same instinct shows up earlier in the pipeline. Before any vision model runs, frames are deduplicated with perceptual hashing, so a static slide that sits on screen for ninety seconds is analyzed once instead of ninety times. That alone cut vision costs by about 70 percent.
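Frame deduplication with a perceptual hash can be sketched in plain Python. The post does not say which hash Vistral uses, so this example swaps in an average hash as a stand-in, and it assumes each frame has already been downscaled to a small grayscale grid (8x8 here). The `max_distance` threshold is also an assumption.

```python
def average_hash(gray):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | int(p > mean)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(frames, max_distance=4):
    """Keep a frame only when its hash differs enough from the last kept frame."""
    kept = []
    last_hash = None
    for t, gray in frames:
        h = average_hash(gray)
        if last_hash is None or hamming(h, last_hash) > max_distance:
            kept.append(t)
            last_hash = h
    return kept


# Hypothetical 8x8 grayscale frames: a static slide, the same slide with
# slight noise in one pixel, and a visibly different slide.
slide = [[255 if x < 4 else 0 for x in range(8)] for _ in range(8)]
noisy = [row[:] for row in slide]
noisy[0][0] = 250
other = [[255 if y < 4 else 0 for _ in range(8)] for y in range(8)]

frames = [(0, slide), (1, noisy), (2, slide), (3, other)]
print(dedupe(frames))  # [0, 3]
```

Only the kept timestamps would be sent to the vision model, which is where the savings come from: the static slide is analyzed once however long it stays on screen.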

Catching the lie: cross-modal contradiction detection

The feature I am proudest of is the one that only a temporal graph makes possible. Because Vistral processes audio and video as separate signals and anchors both to the same timeline, it can compare them.

A speaker says revenue is up across every region. The slide on screen at that exact timestamp shows two regions in decline. A flat summary records the sentence and moves on. Vistral places both observations on the same point in time, notices they disagree, and surfaces a contradiction node. The audio claim and the visual evidence point at each other.

A summary believes whoever is talking. A temporal graph checks the slide.

Audio and video are anchored to one shared timeline. When the words and the slide disagree at the same second, Vistral raises a contradiction node.
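The shared timeline is what makes the comparison cheap to set up. The sketch below pairs each spoken claim with whatever was on screen at that moment, then asks a judge whether the two disagree. Everything named here is an assumption for illustration: the field names, the `tolerance_s` window, and especially `stub_judge`, a toy stand-in for the model call that would decide disagreement in a real pipeline.

```python
def candidate_pairs(claims, visuals, tolerance_s=5):
    """Pair each claim with every visual observation on screen at that time."""
    pairs = []
    for claim in claims:
        for vis in visuals:
            on_screen = vis["start_s"] - tolerance_s <= claim["t_s"] <= vis["end_s"] + tolerance_s
            if on_screen:
                pairs.append((claim, vis))
    return pairs


def contradiction_edges(claims, visuals, disagrees, tolerance_s=5):
    """Emit a 'contradicts' edge for every time-aligned pair the judge flags."""
    return [
        {"from": claim["id"], "to": vis["id"], "rel": "contradicts"}
        for claim, vis in candidate_pairs(claims, visuals, tolerance_s)
        if disagrees(claim, vis)
    ]


def stub_judge(claim, vis):
    """Toy stand-in for a model call: flags an 'up' claim against negative figures."""
    return "up" in claim["text"] and "-" in vis["text"]


# Hypothetical observations anchored to the same timeline, in seconds.
claims = [{"id": "clm_4", "t_s": 252, "text": "Revenue up in every region"}]
visuals = [
    {"id": "vis_7", "start_s": 240, "end_s": 300, "text": "EMEA -4%, APAC -2%"},
    {"id": "vis_8", "start_s": 400, "end_s": 460, "text": "Roadmap"},
]
print(contradiction_edges(claims, visuals, stub_judge))
```

The time filter runs first, so the expensive judgment is only ever made on pairs that actually coincided.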

The stack

The whole project leans on the Mistral model family, one model per job. Voxtral Mini handles speech recognition and diarization. Pixtral 12B does the vision work: OCR and scene understanding. Mistral Small carries both the perception and reasoning passes. The pipeline itself is Python and FastAPI, with FFmpeg for media and Server-Sent Events streaming live progress to the client.

HEARING · Voxtral Mini: Speech recognition and speaker diarization. Turns raw audio into an attributed transcript.

SEEING · Pixtral 12B: OCR and scene understanding on deduplicated frames. Reads slides, charts and on-screen text.

THINKING · Mistral Small: Runs both the perception and reasoning passes. Extracts entities, then writes evidence-backed insights.

One model per job. Each stage does a single thing well instead of one model carrying the whole pipeline.
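The live progress stream mentioned in the stack relies on the Server-Sent Events wire format, which is simple enough to sketch with the standard library alone. The stage names and payload fields below are hypothetical; the post does not list Vistral's actual events.

```python
import json


def sse_event(event, data):
    """Format one Server-Sent Events frame: event name, JSON payload, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_stream(stages):
    """Yield one progress frame per finished stage, then a completion frame."""
    total = len(stages)
    for done, stage in enumerate(stages, start=1):
        # In a real pipeline the stage's work would run here before reporting.
        yield sse_event("progress", {"stage": stage, "done": done, "total": total})
    yield sse_event("complete", {"total": total})


# Hypothetical stage names.
frames = list(progress_stream(["transcribe", "build_graph"]))
print("".join(frames), end="")
```

In a FastAPI app, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, so the browser's `EventSource` can consume it.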

The frontend is Next.js 16 and React 19. The knowledge graph is rendered with a force-directed layout so that, when you open a video, you are not reading minutes. You are looking at the shape of a conversation.

What I would do differently

01 Make the graph editable. The pipeline gets entities right most of the time, and a human should be able to correct the rest without rerunning everything.
02 Persist graphs across videos. Two meetings about the same project should share nodes, so the graph grows into an organizational memory rather than a per-video artifact.
03 Stream Pass A. Perception could begin the moment the first transcript segment lands instead of waiting for the full ASR run.

Why this matters

We are about to drown in video. Every meeting is recorded, every lecture is captured, every interview is archived. The bottleneck was never storage. It is retrieval with trust. A summary you cannot verify is a rumor with good formatting.

A temporal knowledge graph is a small bet on a different future, one where AI does not hand you a conclusion and ask for your faith, but hands you a map and invites you to walk it. Every claim is a node. Every node is a door. And every door opens onto the exact second the truth was on screen.

Vistral is open source under the MIT license. The repository is at github.com/DonTizi/vistral, and the pre-computed demos run without an API key.
