No marketing language. No hand-waving. Just the real architecture, algorithms, statistical methods, and production behaviors.
Drawn directly from the production implementation in betterwrk-discover.
We reject fixed-size chunking entirely.
for each document:
sentences = split_into_sentences(document)
embeddings = sentence_transformer.encode(sentences) # all-MiniLM-L6-v2 or equivalent
for i in 1..len(embeddings)-1:
sim = cosine_similarity(embeddings[i-1], embeddings[i])
if sim < dynamic_threshold(i, previous_chunk_coherence):
create_boundary()
carry_forward_context_summary(previous_chunk)
For every generated sentence we:
Thresholds are per-customer and per-use-case. Below threshold → either quarantined or surfaced with explicit "low grounding" warning to the analyst.
Raw event logs and local process topology never leave the organization. Only low-dimensional sufficient statistics ever leave, protected by shallow crypto.
Fixed-size bundles with dummy slots, randomized permutation under governance seed, padding, range encodings, and signatures. Ciphertext size and sparsity do not leak local vocabulary or density.
Untrusted central aggregator performs only homomorphic additions (and optional scalar multiplications by governed weights). Multiplicative depth zero or one. No deep encrypted sequence comparison circuits.
Quorum (t-of-n) performs verified partial decryptions. Global P(i,j) = C*(i,j) / D*(i) computed from aggregate counts after reconstruction — mathematically exact volume-weighted probabilities, not biased average of local probabilities.
Structural telemetry + media + user annotations fused into one canonical, tamper-evident log with intelligent failover.
OCR + computer vision + vision-language models invoked only on failure intervals. Source-flagged surrogate events (ocr_surrogate, vision_language_surrogate, etc.) with bounding boxes, confidence, media timestamp, and provenance tags. Not hallucinated into the log as native telemetry.
Every event hashed over canonical stringification containing: normalized activity data + media timestamp + exact input data length + data type identifier + previous event hash. Tampering breaks the hash chain at the point of modification.
Structural interaction evidence and media evidence captured concurrently under one session, with a live evolving process representation and targeted human clarification only when information gain justifies the interruption.
Remote plane maintains live process map and coverage metrics across dimensions (objective, actor, handoffs, exception rationale, completion condition). Prompt engine scores candidates on information gain × confidence × urgency minus dynamic human interruption cost (keystroke velocity, typing state, modal state, etc.).
Prompt notifications are invalidation triggers only. Service worker fetches canonical session state from server before rendering anything in the active document. Media capture in offscreen context is never paused.
Telemetry sources have different visibility depending on capture mode (tab, window, full desktop, opaque VDI, redacted zones). The system dynamically inverts reliability instead of applying static source priorities.
When capture-mode listener detects opaque remoting (VDI, remote app, etc.) for a region, structured/DOM source multiplier for that bounded region is driven to zero. Visual/OCR/vision-language claims receive primary weight. Absence of structured telemetry is not treated as negative evidence.
In redacted zones, visual multiplier is reduced proportionally to masked area or blur confidence. Structured semantic claims (where permitted) retain weight.
W_eff(s,e) = W_base(s,type) × M_capture(s,mode,region) × M_specificity(s,e) × exp(−|Δt|/τ)
Discordance penalty applied when materially inconsistent claims remain: final confidence of selected claim is reduced in proportion to aggregate weight of losing claims.