IMPLEMENTATION DETAILS

This is how the engine actually works.

No marketing language. No hand-waving. Just the real architecture, algorithms, statistical methods, and production behaviors.

Drawn directly from the production implementation in betterwrk-discover.

01 — SEMANTIC CHUNKING

Sentence-level embedding + dynamic cosine-similarity boundary detection

We reject fixed-size chunking entirely.

Algorithm:
for document in documents:
    sentences = split_into_sentences(document)
    embeddings = sentence_transformer.encode(sentences)   # all-MiniLM-L6-v2 or equivalent

    # Compare each sentence with its predecessor; a drop in similarity marks a topic shift.
    for i in range(1, len(embeddings)):
        sim = cosine_similarity(embeddings[i-1], embeddings[i])
        if sim < dynamic_threshold(i, previous_chunk_coherence):
            create_boundary()
            carry_forward_context_summary(previous_chunk)
Key production details
  • Dynamic threshold adapts per customer based on historical coherence
  • Minimum chunk size guard + maximum context window
  • Context carry-forward: last 2–3 sentences summarized and prepended to next chunk
  • +107% coherence lift measured on real client corpora (0.42 → 0.87 average)
This single change dramatically improves embedding quality for every downstream service.
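A minimal, runnable sketch of the boundary-detection loop, using toy 2-D vectors in place of real sentence embeddings. The fixed `threshold` and `min_sentences` parameters are illustrative stand-ins for the dynamic threshold and the minimum chunk size guard; the production values, the maximum context window, and the context carry-forward step are not shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def chunk_sentences(sentences, embeddings, threshold=0.5, min_sentences=2):
    """Group consecutive sentences, opening a new chunk when similarity drops.

    threshold and min_sentences are hypothetical fixed values standing in for
    the per-customer dynamic threshold and the minimum chunk size guard.
    """
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < threshold and len(chunks[-1]) >= min_sentences:
            chunks.append([sentences[i]])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentences[i])  # same topic: extend the current chunk
    return chunks

# Toy data: two sentences on one topic, then two on another.
sentences = ["Invoices arrive by email.", "Clerks key them into the ERP.",
             "Payroll runs on Fridays.", "HR approves overtime first."]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
chunks = chunk_sentences(sentences, embeddings)
print(chunks)
```

With these toy vectors the similarity falls sharply between the second and third sentence, so the sketch yields two chunks of two sentences each.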
02 — EMBEDDING QUALITY SERVICE

Nightly statistical process control on the vector space

Every night we compute:
  • Silhouette score (per cluster + global)
  • Kolmogorov-Smirnov test against previous day’s distribution
  • Coherence (intra-chunk sentence similarity)
  • Discriminability (inter-chunk separation)
Output: daily grade (Excellent/Good/Fair/Poor) + drift alert if KS p-value < 0.01 or silhouette drop > 0.08.
This service protects against model drift, domain shift, and garbage data ingestion before any insight is generated.
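The drift check can be sketched with the standard library alone. The example below assumes the KS test is run on a one-dimensional summary of the vector space (for instance per-chunk coherence values); the source does not say which quantity is compared. The p-value uses the asymptotic Kolmogorov series, so treat it as an approximation, not the production calculation.

```python
import math
from bisect import bisect_right

def ks_two_sample(a, b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    p comes from the asymptotic Kolmogorov series, which is only a
    reasonable approximation for samples of a few dozen points or more.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series is numerically unstable here and p is ~1
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))

def drift_alert(ks_p_value, silhouette_prev, silhouette_today):
    """Alert rule from the text: KS p-value < 0.01 or silhouette drop > 0.08."""
    return ks_p_value < 0.01 or (silhouette_prev - silhouette_today) > 0.08

# Hypothetical daily samples: one unchanged, one shifted by 0.5.
yesterday = [i / 100 for i in range(100)]
today_same = list(yesterday)
today_shifted = [x + 0.5 for x in yesterday]

d_same, p_same = ks_two_sample(yesterday, today_same)
d_shift, p_shift = ks_two_sample(yesterday, today_shifted)
print(drift_alert(p_same, 0.60, 0.58), drift_alert(p_shift, 0.60, 0.58))
```

The unchanged sample raises no alert, while the shifted sample trips the KS condition even though the silhouette score barely moved.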
03 — INSIGHT CONFIDENCE SERVICE

Bootstrap + Grounding + Evidence Strength on every output

Bootstrap 95% Confidence Intervals
For any metric (%, duration, cost), we draw 5,000+ bootstrap resamples from the actual supporting artifacts. The reported interval is the 2.5th–97.5th percentile of the resampled distribution.
Example output in production: “15% (95% CI: 12–18%, p=0.03) based on 47 artifacts”
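A minimal sketch of the percentile bootstrap described above, using only the standard library. The artifact data is hypothetical, the fixed seed exists only to make the sketch repeatable, and the p-value shown in the production example is not computed here.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.fmean, n_resamples=5000, seed=0):
    """Percentile bootstrap: return (point_estimate, ci_low, ci_high) at 95%."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    n = len(values)
    estimates = sorted(
        stat(rng.choices(values, k=n))  # resample with replacement, same size
        for _ in range(n_resamples)
    )
    # Nearest-rank 2.5th and 97.5th percentiles of the resampled statistic.
    low = estimates[round(0.025 * (n_resamples - 1))]
    high = estimates[round(0.975 * (n_resamples - 1))]
    return stat(values), low, high

# Hypothetical data: 47 artifacts, 7 of which show the behavior being measured.
artifacts = [1] * 7 + [0] * 40
point, low, high = bootstrap_ci(artifacts)
print(f"{point:.0%} (95% CI: {low:.0%}-{high:.0%}) based on {len(artifacts)} artifacts")
```

Because the interval is read directly from the resampled distribution, it needs no normality assumption, which matters for small artifact counts and skewed metrics such as durations.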
Grounding Score (0.0 – 1.0)
Cosine similarity between the generated sentence embedding and the mean embedding of its cited source chunks. Scores below the configured threshold trigger quarantine.
Evidence Strength + p-values
STRONG / MODERATE / WEAK / INSUFFICIENT
Calculated from effective sample size, variance, grounding, and statistical significance of the underlying signal.
04 — REAL-TIME ANOMALY DETECTOR

3-sigma Statistical Process Control at ingestion

As each document and data stream arrives, we maintain rolling windows and fire alerts on:
  • Volume: > 3σ deviation from 7-day rolling mean
  • Quality: sudden drop in average grounding or silhouette of new chunks
  • Pattern: emergence of new high-frequency process variants not seen in training window
  • Sentiment/linguistic drift in customer documentation
Alerts are pushed to Slack within 60–90 seconds of ingestion. This is the first line of defense against poisoned or degraded data reaching the insight layer.
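The volume rule above can be sketched as a small rolling-window monitor. This is an illustration under stated assumptions: observations are daily document counts, the window must be full before any alert fires, and anomalous values are still appended to the window. The class and parameter names are hypothetical.

```python
from collections import deque
import statistics

class VolumeMonitor:
    """Flags a count that deviates more than n_sigma from the rolling mean."""

    def __init__(self, window=7, n_sigma=3.0):
        self.history = deque(maxlen=window)  # oldest value drops out automatically
        self.n_sigma = n_sigma

    def observe(self, count):
        """Return True if count is anomalous relative to the current window."""
        is_anomaly = False
        if len(self.history) == self.history.maxlen:
            mean = statistics.fmean(self.history)
            sigma = statistics.pstdev(self.history)
            is_anomaly = sigma > 0 and abs(count - mean) > self.n_sigma * sigma
        self.history.append(count)
        return is_anomaly

monitor = VolumeMonitor()
warmup = [monitor.observe(c) for c in [100, 102, 98, 101, 99, 100, 100]]
normal_day = monitor.observe(101)
spike_day = monitor.observe(500)
print(warmup, normal_day, spike_day)
```

A production detector would likely exclude confirmed anomalies from the baseline so that one spike does not widen the band for the following week; the sketch keeps them for simplicity.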
ADDITIONAL IMPLEMENTATION NOTES
Data model isolation
All process and assessment data is strictly account-scoped at the database and vector layer. No cross-customer leakage is possible.
Nightly batch jobs
Embedding quality, drift detection, and re-grading of low-grounding insights run in isolated background workers with dead-letter queues and retry semantics.
Latency characteristics
Semantic chunking + initial embedding < 800ms per document. Full confidence scoring + grounding + evidence strength labeling typically completes in < 650ms for standard insights.
Observability & lineage
Every insight carries immutable metadata: embedding model version, grounding score at generation time, bootstrap sample count, p-value, and direct pointers to source artifacts.
HOW GROUNDING IS ACTUALLY COMPUTED

Not a vibe. A measurable score.

For every generated sentence we:

  1. Retrieve the top-k source chunks cited by the generation step.
  2. Mean-pool their embeddings.
  3. Compute cosine similarity between the generated sentence embedding and that mean source vector.
  4. Apply a length-and-specificity penalty.
  5. Clamp to [0, 1].

Thresholds are per-customer and per-use-case. An insight that scores below its threshold is either quarantined or surfaced to the analyst with an explicit "low grounding" warning.
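Steps 2 to 5 can be sketched as follows. The source does not define the length-and-specificity penalty, so the `penalty` argument below is a hypothetical multiplier in [0, 1] supplied by the caller; retrieval of the cited chunks (step 1) is assumed to have already happened.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def grounding_score(sentence_vec, source_vecs, penalty=1.0):
    """Mean-pool the source embeddings, compare, apply a penalty, clamp to [0, 1].

    penalty is a hypothetical stand-in for the length-and-specificity
    penalty, whose exact form is not given in the text.
    """
    dim = len(sentence_vec)
    mean_vec = [sum(v[i] for v in source_vecs) / len(source_vecs) for i in range(dim)]
    raw = cosine_similarity(sentence_vec, mean_vec) * penalty
    return max(0.0, min(1.0, raw))

sources = [[1.0, 0.0], [1.0, 0.0]]
supported = grounding_score([1.0, 0.0], sources)         # same direction as sources
penalized = grounding_score([1.0, 0.0], sources, 0.5)    # same, with a 0.5 penalty
unsupported = grounding_score([-1.0, 0.0], sources)      # opposite direction
print(supported, penalized, unsupported)
```

The clamp matters because raw cosine similarity ranges over [-1, 1]; a sentence pointing away from its cited sources scores 0 instead of a negative value.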

HARD PRODUCTION INVARIANTS
• No insight is ever shown without an associated grounding score.
• Bootstrap intervals are always 95% and use ≥ 2,000 resamples for any public-facing number.
• Embedding model version is pinned per customer and cannot change without explicit migration + re-grading.
• Anomaly detector runs on every ingestion before any chunk enters the active corpus.
• All low-grounding insights are logged with full lineage for later analyst review.
PATENTED CORE INNOVATIONS

The inventions that make the engine private, resilient, and verifiable at scale.

05 — PRIVACY-PRESERVING FEDERATED PROCESS MINING

Local sufficient statistics + shallow cryptographic aggregation + Merkle-verified lineage

Raw event logs and local process topology never leave the organization. Only low-dimensional sufficient statistics are shared, and they are encrypted so that aggregation needs only shallow (low multiplicative depth) operations.

Local Abstraction Generation
  • Transition-count matrix C(k), row-denominator D(k), causal relation-score R(k)
  • Optional duration T(k), squared-duration Q(k), emission E(k) accumulators
  • Activity labels mapped to governed index U(v) with hierarchical bucketing for rare/sensitive activities
Leakage-Mitigated Slot Packing

Fixed-size bundles with dummy slots, randomized permutation under governance seed, padding, range encodings, and signatures. Ciphertext size and sparsity do not leak local vocabulary or density.

Shallow Protected Aggregation

Untrusted central aggregator performs only homomorphic additions (and optional scalar multiplications by governed weights). Multiplicative depth zero or one. No deep encrypted sequence comparison circuits.

Threshold Reconstruction & Exact Global Model

A quorum (t-of-n) performs verified partial decryptions. The global P(i,j) = C*(i,j) / D*(i) is computed from the aggregate counts after reconstruction, which yields mathematically exact volume-weighted probabilities rather than a biased average of local probabilities.
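A small numeric illustration of why the ratio of aggregated counts differs from an average of local probabilities. Plain integer addition stands in for the homomorphic addition and threshold decryption, and the two organizations and their counts are hypothetical.

```python
def global_transition_probability(local_counts, local_denominators):
    """P(i,j) = C*(i,j) / D*(i), computed from summed counts.

    Plain sums stand in for the encrypted aggregation and reconstruction.
    """
    c_star = sum(local_counts)
    d_star = sum(local_denominators)
    return c_star / d_star

def averaged_local_probability(local_counts, local_denominators):
    """The biased alternative: average each organization's local ratio."""
    probs = [c / d for c, d in zip(local_counts, local_denominators)]
    return sum(probs) / len(probs)

# Hypothetical: a -> b observed 90 of 100 times at org 1, and 1 of 2 times at org 2.
counts, denominators = [90, 1], [100, 2]
exact = global_transition_probability(counts, denominators)
naive = averaged_local_probability(counts, denominators)
print(round(exact, 3), round(naive, 3))
```

Here the exact value is 91/102 (about 0.892), while the naive average is 0.7, because averaging gives the organization with two observations the same influence as the one with a hundred.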

Hybrid MRV + AI Lineage Gate
Merkle-root anchoring of canonicalized batches with adaptive packetization (audit-risk score). Selective Merkle proofs. AI lineage gate rejects features/training data unless batch has verified threshold reconstruction + Merkle lineage.
Key technical effects: reduced cryptographic circuit depth, lower communication volume than event-level MPC, exact aggregate stochastic modeling, threshold-governed decryption, auditable provenance with raw logs remaining off-chain and inside organizational boundaries.
06 — CANONICAL MULTIMODAL EVENT-LOG INTELLIGENCE LAYER

Telemetry failover + bounded visual recovery + per-event cryptographic provenance

Structural telemetry + media + user annotations fused into one canonical, tamper-evident log with intelligent failover.

Dual-Mechanism Telemetry Suspension Detector
  • Active heartbeat: signed ping/pong with sequence numbers over persistent channel
  • Passive temporal-overlap: media keyframe activity vs. semantic telemetry density in sliding window
  • Failure asserted only for bounded media segments (not entire stream)
Bounded Visual Semantic Recovery

OCR + computer vision + vision-language models invoked only on failure intervals. They emit source-flagged surrogate events (ocr_surrogate, vision_language_surrogate, etc.) with bounding boxes, confidence, media timestamp, and provenance tags. Surrogate events are never written into the log as native telemetry.

Five-Field Per-Event Cryptographic Provenance

Every event hashed over canonical stringification containing: normalized activity data + media timestamp + exact input data length + data type identifier + previous event hash. Tampering breaks the hash chain at the point of modification.
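A minimal sketch of a five-field hash chain. SHA-256, the JSON canonical form with sorted keys, the lowercase/strip normalization, and the all-zero genesis value are all assumptions made for illustration; the source names the five fields but not the hash function or the serialization.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical "previous hash" for the first event

def event_hash(activity, media_ts, data_type, prev_hash):
    """Hash the five provenance fields over an assumed canonical JSON string."""
    canonical = json.dumps({
        "activity": activity.strip().lower(),           # assumed normalization
        "media_timestamp": media_ts,
        "input_length": len(activity.encode("utf-8")),  # exact raw input length
        "data_type": data_type,
        "prev_hash": prev_hash,
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def build_chain(events):
    """Return the list of chained hashes for (activity, media_ts, data_type) events."""
    chain, prev = [], GENESIS
    for activity, media_ts, data_type in events:
        prev = event_hash(activity, media_ts, data_type, prev)
        chain.append(prev)
    return chain

def first_broken_index(events, chain):
    """Return the index of the first event whose hash no longer matches, else None."""
    prev = GENESIS
    for i, (activity, media_ts, data_type) in enumerate(events):
        if event_hash(activity, media_ts, data_type, prev) != chain[i]:
            return i
        prev = chain[i]
    return None

events = [("Open invoice", 0.0, "click"),
          ("Approve invoice", 1.5, "click"),
          ("Close tab", 3.0, "navigation")]
chain = build_chain(events)
tampered = list(events)
tampered[1] = ("Reject invoice", 1.5, "click")
print(first_broken_index(events, chain), first_broken_index(tampered, chain))
```

Verification walks the stored hashes in order, so the first mismatch identifies where the log was modified.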

Key Property
Raw media, screenshots, audio, and rejected claims live in separate evidence bundles (outside the primary hash chain). Cryptographic deletion of raw evidence does not invalidate the verified event log.
07 — LIVE DUAL-STREAM COPILOT ARCHITECTURE

Concurrent structural + media capture + live process map + adaptive in-context clarification

Structural interaction evidence and media evidence captured concurrently under one session, with a live evolving process representation and targeted human clarification only when information gain justifies the interruption.

Dual-Stream Capture with Resilience
  • Structural: interaction-tracker content script in active document (DOM, accessibility, sanitized network, page fingerprints)
  • Media: offscreen document / hidden capture context (continuous even during popup close or navigation)
  • Tab/domain switch: flush structural state, persist media buffers, migrate floating indicator, keep shared session ID
Live Process Map + Adaptive Prompt Policy

Remote plane maintains live process map and coverage metrics across dimensions (objective, actor, handoffs, exception rationale, completion condition). Prompt engine scores candidates on information gain × confidence × urgency minus dynamic human interruption cost (keystroke velocity, typing state, modal state, etc.).
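The scoring rule can be sketched as below. The candidate fields mirror the factors named above; how interruption cost is derived from keystroke velocity, typing state, or modal state is not specified, so it is passed in as a single number. The rule "prompt only if the net score is positive" is an assumption.

```python
from dataclasses import dataclass

@dataclass
class PromptCandidate:
    question: str
    information_gain: float
    confidence: float
    urgency: float

def prompt_score(candidate, interruption_cost):
    """information gain x confidence x urgency, minus the interruption cost."""
    benefit = candidate.information_gain * candidate.confidence * candidate.urgency
    return benefit - interruption_cost

def select_prompt(candidates, interruption_cost):
    """Return the best candidate if its net score is positive, else None.

    The zero cut-off is a hypothetical policy, not taken from the source.
    """
    best = max(candidates, key=lambda c: prompt_score(c, interruption_cost),
               default=None)
    if best is None or prompt_score(best, interruption_cost) <= 0:
        return None
    return best

candidates = [
    PromptCandidate("Who approves exceptions?", 0.8, 0.9, 0.5),
    PromptCandidate("What marks completion?", 0.4, 0.9, 0.5),
]
idle_choice = select_prompt(candidates, interruption_cost=0.1)    # user is idle
typing_choice = select_prompt(candidates, interruption_cost=0.5)  # user is typing
print(idle_choice, typing_choice)
```

With a low interruption cost the higher-gain question is selected; with a high cost no prompt is shown at all.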

WebSocket Invalidation + Canonical State Retrieval

Prompt notifications are invalidation triggers only. Service worker fetches canonical session state from server before rendering anything in the active document. Media capture in offscreen context is never paused.

OCR Semantic Recovery
When structural stream goes blind, system samples media at key timestamps and synthesizes source-flagged surrogate events (with provenance tags) rather than guessing or dropping coverage.
08 — DYNAMIC CROSS-MODAL TELEMETRY ARBITRATION

Capture-mode-aware reliability inversion + pre-hashing reconciliation gate

Telemetry sources have different visibility depending on capture mode (tab, window, full desktop, opaque VDI, redacted zones). The system dynamically inverts reliability instead of applying static source priorities.

Reliability Inversion on Structural Blindness

When capture-mode listener detects opaque remoting (VDI, remote app, etc.) for a region, structured/DOM source multiplier for that bounded region is driven to zero. Visual/OCR/vision-language claims receive primary weight. Absence of structured telemetry is not treated as negative evidence.

Redaction-Aware Visual Multipliers

In redacted zones, visual multiplier is reduced proportionally to masked area or blur confidence. Structured semantic claims (where permitted) retain weight.

Effective Weight Formula (per source and event candidate)
W_eff(s,e) = W_base(s,type) × M_capture(s,mode,region) × M_specificity(s,e) × exp(−|Δt|/τ)

Discordance penalty applied when materially inconsistent claims remain: final confidence of selected claim is reduced in proportion to aggregate weight of losing claims.
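A minimal sketch of the weight formula and a discordance penalty. The formula follows the text directly. The penalty form (scaling the winner's confidence by its share of total weight) is an assumption, as is treating any claim with a different label as materially inconsistent; all numeric values are hypothetical.

```python
import math

def effective_weight(w_base, m_capture, m_specificity, delta_t, tau):
    """W_eff = W_base x M_capture x M_specificity x exp(-|delta_t| / tau)."""
    return w_base * m_capture * m_specificity * math.exp(-abs(delta_t) / tau)

def arbitrate(claims):
    """Pick the heaviest claim and penalize its confidence for discordance.

    claims is a list of (label, confidence, effective_weight) tuples.
    The penalty form below is a hypothetical choice, not from the source.
    """
    total = sum(weight for _, _, weight in claims)
    if total <= 0:
        return None  # no source has usable visibility for this event
    label, confidence, _ = max(claims, key=lambda c: c[2])
    losing = sum(weight for other, _, weight in claims if other != label)
    return label, confidence * (1.0 - losing / total)

# Opaque VDI region: the DOM multiplier is driven to zero, visual sources carry the weight.
w_dom = effective_weight(0.9, 0.0, 1.0, delta_t=0.1, tau=2.0)
w_ocr = effective_weight(0.6, 1.0, 0.9, delta_t=0.5, tau=2.0)
w_vlm = effective_weight(0.5, 1.0, 0.8, delta_t=1.0, tau=2.0)

claims = [("click Submit", 0.9, w_dom),
          ("click Submit", 0.8, w_ocr),
          ("click Cancel", 0.6, w_vlm)]
winner, final_confidence = arbitrate(claims)
print(winner, round(final_confidence, 3))
```

The DOM claim contributes nothing because its capture multiplier is zero, and the OCR claim wins with reduced confidence because a conflicting vision-language claim still carries weight.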

Pre-Hashing Reconciliation Gate
Only the verified surrogate semantic claim + arbitration metadata enters the primary cryptographic event hash chain. Raw screenshots, audio, keystrokes, rejected claims, and unverified observations are routed to a separate, policy-governed evidence bundle (outside the hash chain). Cryptographic deletion of raw evidence does not break the verified log.