A Tour Of Learned And Reference-Free Bayesian Filters

A long-form guide to the Kalman, ELBO, distillation, IWAE, FIVO, ADF, and Power-EP filtering experiments in ml-examples.

Series: VBF Experiments, April 2026

This is a long technical note about a small research program in Bayesian filtering. The starting point is familiar if you know Kalman filters: there is a hidden state, there are noisy measurements, and the filter has to update its belief online.

The work here asks a narrower question:

Can we learn or design strict online filters that keep an explicit posterior belief, remain auditable, and survive nonlinear observations?

The short answer is that the linear-Gaussian case is now a reliable debugging suite, but the nonlinear case exposed several different failure modes. Plain local ELBO training tends to become under-dispersed. Reference distillation shows that good strict filters exist, but it is not a fair unsupervised result. Mixture IWAE objectives help state density a lot, but can hurt one-step observation prediction. Deterministic quadrature ADF and Power-EP baselines then show that much of the gap is algorithmic: if we use the known likelihood directly, a reference-free filter can get close to the grid reference in several regimes.

The intended reader knows the Kalman filter, but may not know variational filtering, assumed-density filtering, IWAE, FIVO, or Power-EP. I will define the pieces as we go.

The source code lives in dwrtz/ml-examples. The main implementation files are:

The filtering problem

A state-space model has hidden states \(z_t\), observations \(y_t\), optional observed covariates \(x_t\), a transition model, and an observation model:

\[ p(z_{0:T}, y_{1:T} \mid x_{1:T}) = p(z_0)\prod_{t=1}^T p(z_t \mid z_{t-1})p(y_t \mid z_t, x_t). \]

The online filtering posterior is:

\[ p(z_t \mid y_{1:t}, x_{1:t}). \]

For a linear-Gaussian model, the Kalman filter gives this posterior exactly. For nonlinear observations, exact filtering is usually unavailable, so we choose an approximate belief family and an update rule.

The strict filter contract in these experiments is:

\[ q^F_t = \operatorname{update}(q^F_{t-1}, x_t, y_t). \]

The update is online. It does not get future observations. The filter must carry an explicit posterior belief \(q^F_t(z_t)\), not just a hidden RNN state.

The variational Bayesian filtering form used in the repo carries one more piece: a backward conditional over the previous state. The local edge posterior is factorized as:

\[ q^E_t(z_t, z_{t-1}) = q^F_t(z_t)q^B_t(z_{t-1}\mid z_t). \]

This edge view matters because the transition model couples \(z_{t-1}\) and \(z_t\). If the filter emits both a current marginal and a backward conditional, we can score the local edge against the generative model.

The two benchmark worlds

The first world is scalar linear-Gaussian:

\[ z_t = z_{t-1} + w_t,\quad w_t \sim \mathcal{N}(0,Q), \]\[ y_t = x_t z_t + v_t,\quad v_t \sim \mathcal{N}(0,R). \]

Here we know the exact answer. This is where we check the mechanics: Kalman oracles, edge posteriors, ELBO terms, coverage, predictive likelihoods, and calibration.

The second world keeps the random-walk state but changes the observation:

\[ y_t = x_t \sin(z_t) + v_t,\quad v_t \sim \mathcal{N}(0,R). \]

This model is scalar, but it is not easy. The sine observation is periodic, so different latent states can explain the same observation. When \(x_t\) is small, the measurement carries little information about \(z_t\). The stress patterns used in the reports are:

| Pattern | What it tests |
| --- | --- |
| sinusoidal | ordinary nonlinear observation strength |
| weak sinusoidal | observations are systematically weak |
| intermittent sinusoidal | informative and uninformative spans alternate |
| zero | no state information in the observations |
| random normal | irregular observation scale and aliasing |

For nonlinear diagnostics, the repo uses a deterministic one-dimensional grid filter as a reference. That reference is expensive enough to cache, but it is not used in fully unsupervised training rows.

The evaluation metrics

Most tables use four quantities.

State NLL is the negative log likelihood of the true latent state under the filtering belief:

\[ \operatorname{NLL}_z = -\frac{1}{BT}\sum_{b,t}\log q^F_{b,t}(z_{b,t}^{\mathrm{true}}). \]

Coverage 90 is the fraction of true states inside the nominal 90 percent interval of the reported belief. For a calibrated Gaussian filter, this should be near \(0.90\).

Variance ratio compares the learned filtering variance to the reference filtering variance:

\[ \operatorname{var\ ratio} = \frac{\mathbb{E}_{b,t}[\operatorname{Var}_{q}(z_{b,t})]} {\mathbb{E}_{b,t}[\operatorname{Var}_{\mathrm{ref}}(z_{b,t})]}. \]

A variance ratio far below one is under-dispersion. It means the filter is too confident. In these experiments, under-dispersion is the recurring failure mode.

Predictive-y NLL scores the next measurement before assimilation:

\[ -\log p_q(y_t \mid y_{1:t-1}, x_{1:t}). \]

This is important because a filter can put density on the true latent state while still being a poor observation predictor. That tradeoff appears in the mixture and quadrature branches.
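For a scalar Gaussian belief, the first three metrics reduce to a few lines. The sketch below is illustrative only: the function name and the `(batch, time)` array layout are my assumptions, not the repo's evaluation code.

```python
import numpy as np

Z90 = 1.6448536269514722  # two-sided 90% quantile of a standard normal

def gaussian_filter_metrics(z_true, mean, var, ref_var):
    """State NLL, coverage 90, and variance ratio for scalar Gaussian beliefs.

    All inputs are arrays of shape (batch, time). Hypothetical helper.
    """
    nll = 0.5 * (np.log(2.0 * np.pi * var) + (z_true - mean) ** 2 / var)
    inside = np.abs(z_true - mean) <= Z90 * np.sqrt(var)
    return {
        "state_nll": float(nll.mean()),
        "coverage_90": float(inside.mean()),
        "var_ratio": float(var.mean() / ref_var.mean()),
    }

# Toy check: a belief that is calibrated by construction.
rng = np.random.default_rng(0)
mean = rng.normal(size=(64, 200))
var = np.full_like(mean, 0.5)
z_true = mean + np.sqrt(var) * rng.normal(size=mean.shape)
metrics = gaussian_filter_metrics(z_true, mean, var, ref_var=var)
```

On the toy data the coverage should sit near 0.90 and the variance ratio is one by construction. An under-dispersed filter would show coverage well below 0.90 together with a variance ratio far below one.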

Why labels matter

The reports use several training-signal labels. They are not bookkeeping trivia; they decide which claims are fair.

| Label | What training can use | What it means |
| --- | --- | --- |
| fully unsupervised | \(x,y\), known transition, known observation model, prior | fair reference-free learning result |
| reference-free deterministic | same model information, but no learned amortized objective | fair algorithmic filtering baseline |
| reference-distilled diagnostic | grid moments or reference beliefs | positive control, not fair unsupervised |
| oracle-calibrated diagnostic | oracle/reference variance targets | identifies bottlenecks, not a deployable objective |
| supervised | true latent states or oracle edge targets | useful engineering control |

The strongest-looking row is not always the strongest claim. A reference-distilled row can show that a posterior family is capable, while a fully unsupervised row shows that the objective found it without hidden help.

Technique 1: exact Kalman filtering

The Kalman filter is the anchor. In the scalar linear-Gaussian model, if the prior belief is \(z_{t-1}\sim \mathcal{N}(m_{t-1},P_{t-1})\), then the predictive state is:

\[ m^-_t = m_{t-1},\quad P^-_t = P_{t-1} + Q. \]

For observation \(y_t = x_t z_t + v_t\), the innovation and Kalman gain are:

\[ e_t = y_t - x_t m^-_t, \]\[ S_t = x_t^2P^-_t + R, \]\[ K_t = \frac{P^-_t x_t}{S_t}. \]

The filtering update is:

\[ m_t = m^-_t + K_t e_t,\quad P_t = P^-_t - K_t x_t P^-_t. \]

The repo keeps this exact path because it gives a hard standard for state NLL, coverage, variance, and predictive likelihood. In the scalar benchmark, exact Kalman coverage stays close to \(0.90\), and the frozen-marginal control matches it almost exactly.
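The equations above translate directly into code. This is a minimal NumPy sketch of one scalar step, written for this note rather than copied from the repo; the function name and argument order are assumptions.

```python
import numpy as np

def kalman_step(m_prev, P_prev, x_t, y_t, Q, R):
    """One predict/update step for z_t = z_{t-1} + w_t, y_t = x_t z_t + v_t."""
    m_pred = m_prev                # random-walk prediction
    P_pred = P_prev + Q
    e = y_t - x_t * m_pred         # innovation
    S = x_t**2 * P_pred + R        # innovation variance
    K = P_pred * x_t / S           # Kalman gain
    m = m_pred + K * e
    P = P_pred - K * x_t * P_pred
    # Log predictive density of y_t before assimilation.
    log_pred_y = -0.5 * (np.log(2.0 * np.pi * S) + e**2 / S)
    return m, P, log_pred_y

# With x_t = 0 the measurement is uninformative: the mean stays put and
# the variance grows by exactly Q.
m, P, _ = kalman_step(m_prev=0.3, P_prev=1.0, x_t=0.0, y_t=5.0, Q=0.1, R=0.5)
m2, P2, _ = kalman_step(m_prev=0.0, P_prev=1.0, x_t=1.0, y_t=1.0, Q=0.0, R=1.0)
```

The `x_t = 0` case is the behavior a calibrated filter should reproduce in the zero-observation regime discussed below.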

[Figure: Linear-Gaussian diagnostics over training]

That figure is from the linear-Gaussian ELBO run. The important role of the linear suite is not that it is hard. The role is that mistakes have nowhere to hide.

The compact linear report found:

| Regime | Exact/reference behavior | Learned behavior |
| --- | --- | --- |
| nominal sinusoidal | Kalman state NLL 0.402, coverage 0.900 | self-fed supervised state NLL 0.415 |
| weak sinusoidal | Kalman state NLL 1.175, coverage 0.899 | MC ELBO state NLL 1.291, coverage 0.813 |
| zero unobservable | Kalman state NLL 2.740, coverage 0.904 | MC ELBO state NLL 7.010, coverage 0.392 |
| random normal | Kalman state NLL 0.219, coverage 0.898 | MC ELBO state NLL 0.307, coverage 0.847 |

That zero-observation row is the warning sign. If observations contain no information, a calibrated filter should mostly propagate uncertainty. The local MC ELBO row became much too narrow.

Technique 2: frozen-marginal edge learning

The VBF model represents a local edge:

\[ q^E_t(z_t,z_{t-1}) = q^F_t(z_t)q^B_t(z_{t-1}\mid z_t). \]

In the frozen-marginal control, \(q^F_t\) is not learned. It is set to the exact Kalman marginal. The learned part is only the backward conditional \(q^B_t(z_{t-1}\mid z_t)\).

Why is that useful? It isolates whether the edge/backward machinery is correct. If the filtering marginal is exact and the backward head is trained, then any bad filtering metric would indicate plumbing trouble rather than objective trouble.

The frozen marginal row matched exact Kalman in the scalar reports. That gave confidence that the edge representation itself was sane.

Technique 3: residualized learned update cells

The first learned filters were not arbitrary neural networks. The structured MLP cell starts from an analytic Kalman-like update and learns corrections.

In the linear case, a simplified version of the update is:

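# Predict through the random walk.
pred_var = prev_var + q
innovation = y_t - x_t * prev_mean
innovation_var = x_t**2 * pred_var + r
# Analytic Kalman gain: the base that the learned terms correct.
base_gain = pred_var * x_t / innovation_var
# base_filter_var is not shown in the excerpt; assumed here to be the
# analytic Kalman posterior variance.
base_filter_var = pred_var - base_gain * x_t * pred_var
# Learned corrections: a gain multiplier in (0, 2) and a log-scale on the variance.
gain_scale = 2.0 * sigmoid(raw[..., 0])
filter_mean = prev_mean + gain_scale * base_gain * innovation
filter_var = base_filter_var * exp(clipped_raw_scale) + min_var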

That pattern is implemented in structured_mlp_step. The nonlinear structured cell uses an EKF-like local linearization:

\[ h(z,x) = x\sin(z),\quad \frac{\partial h}{\partial z} = x\cos(z). \]

The base update is then corrected by learned residual terms.

This design is conservative. It says: keep useful filtering structure when it is available, then learn the parts the analytic approximation gets wrong.
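To make the EKF-like base concrete, here is a minimal sketch of the analytic part only, with no learned residuals. It is my reconstruction from the linearization above, not the repo's nonlinear structured cell, and the function name is hypothetical.

```python
import numpy as np

def ekf_sine_base_step(m_prev, P_prev, x_t, y_t, Q, R):
    """EKF-style base update for y_t = x_t sin(z_t) + v_t with a random-walk state."""
    m_pred = m_prev
    P_pred = P_prev + Q
    H = x_t * np.cos(m_pred)            # dh/dz evaluated at the predicted mean
    e = y_t - x_t * np.sin(m_pred)      # innovation uses the nonlinear h
    S = H**2 * P_pred + R
    K = P_pred * H / S
    m = m_pred + K * e
    P = P_pred - K * H * P_pred
    return m, P

# Near z = 0 the sine is almost linear and the step matches the Kalman update.
m_lin, P_lin = ekf_sine_base_step(0.0, 1.0, 1.0, 1.0, Q=0.0, R=1.0)
# At z = pi/2 the slope is ~0, so the linearization ignores the measurement.
m_flat, P_flat = ekf_sine_base_step(np.pi / 2, 1.0, 1.0, 0.0, Q=0.1, R=0.5)
```

The flat-slope case shows one thing the analytic approximation gets wrong: the linearized update treats the measurement as uninformative even though the innovation is large. That is the kind of error the learned residual terms are meant to absorb.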

Technique 4: supervised and self-fed edge distillation

Supervised distillation uses an oracle edge posterior or reference filtering beliefs as targets. The most direct loss is a Gaussian KL:

\[ \operatorname{KL}\left(q_{\mathrm{oracle}}(z_t,z_{t-1}) \Vert q_{\theta}(z_t,z_{t-1})\right). \]

Teacher forcing feeds the model the reference previous belief. Self-fed rollout feeds the model its own previous belief.

The distinction matters. Teacher forcing can show one-step capacity. Self-fed training tests whether errors compound when the filter is actually used.

In the nonlinear reports, teacher-forced structured moment distillation showed that the structured head could learn the one-step reference map:

| Case | Mode | state NLL | coverage 90 | variance ratio |
| --- | --- | --- | --- | --- |
| weak | teacher-forced structured | 2.790 | 0.888 | 1.112 |
| intermittent | teacher-forced structured | 2.761 | 0.901 | 1.034 |

But self-fed structured rollout was unstable until horizon rollout distillation was introduced.

Technique 5: local edge ELBO

The unsupervised local edge ELBO scores the edge factor against the generative model:

\[ \begin{aligned} \mathcal{L}_t &= \mathbb{E}_{q^F_t(z_t)q^B_t(z_{t-1}\mid z_t)} \big[ \log p(y_t\mid z_t,x_t) {}+ \log p(z_t\mid z_{t-1}) \\ &\quad {}+ \log q^F_{t-1}(z_{t-1}) {}- \log q^F_t(z_t) {}- \log q^B_t(z_{t-1}\mid z_t) \big]. \end{aligned} \]

The JAX code samples \(z_t\), samples \(z_{t-1}\) from the backward conditional, and computes the log weight:

return (
    normal_log_prob(y, observation_mean, r)
    + normal_log_prob(z_t, z_tm1, q)
    + normal_log_prob(z_tm1, prev_mean, prev_var)
    - normal_log_prob(z_t, filter_mean, filter_var)
    - normal_log_prob(z_tm1, backward_mean, backward_var)
)

In the scalar linear-Gaussian case, this is a reasonable unsupervised baseline, but it is under-dispersed in weak and zero-observation regimes. In the nonlinear case, it is much worse: local explanations can be plausible at a single step but globally inconsistent over a trajectory.

That gives the core diagnosis:

\[ \text{local plausible explanation} \rightarrow \text{too-narrow posterior} \rightarrow \text{bad next prior} \rightarrow \text{compounding error}. \]
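One useful sanity check on the log weight: in the linear-Gaussian case, if the edge posterior is exact, every sampled log weight equals the one-step log evidence \(\log p(y_t \mid y_{1:t-1}, x_{1:t})\), so the ELBO is tight with zero variance. The NumPy sketch below verifies that for one step. It is an illustration under my own naming, not the repo's JAX code.

```python
import numpy as np

def normal_log_prob(value, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (value - mean) ** 2 / var)

def edge_log_weight(z_t, z_tm1, y, x, q, r, prev, filt, back):
    """Single-sample log weight with the same five terms as the local edge ELBO."""
    (prev_mean, prev_var), (f_mean, f_var), (b_mean, b_var) = prev, filt, back
    return (
        normal_log_prob(y, x * z_t, r)
        + normal_log_prob(z_t, z_tm1, q)
        + normal_log_prob(z_tm1, prev_mean, prev_var)
        - normal_log_prob(z_t, f_mean, f_var)
        - normal_log_prob(z_tm1, b_mean, b_var)
    )

# Exact linear-Gaussian edge posterior for one step.
m0, P0, x, y, q, r = 0.2, 1.0, 0.7, 1.3, 0.1, 0.5
P_pred = P0 + q
S = x**2 * P_pred + r
K = P_pred * x / S
m1, P1 = m0 + K * (y - x * m0), P_pred - K * x * P_pred
J = P0 / P_pred                       # backward gain for the random walk

rng = np.random.default_rng(0)
z_t = m1 + np.sqrt(P1) * rng.normal(size=1000)
b_mean = m0 + J * (z_t - m0)          # backward conditional mean given z_t
b_var = P0 - J**2 * P_pred
z_tm1 = b_mean + np.sqrt(b_var) * rng.normal(size=1000)

log_w = edge_log_weight(z_t, z_tm1, y, x, q, r, (m0, P0), (m1, P1), (b_mean, b_var))
log_evidence = normal_log_prob(y, x * m0, S)
```

Any spread in `log_w` therefore measures how far the learned edge is from the exact one. A too-narrow \(q^F_t\) shows up as heavy-tailed log weights.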

Technique 6: oracle variance calibration

Oracle variance calibration adds penalties that compare learned filtering variance to reference variance. For example:

\[ \left(\log \frac{\mathbb{E}[\operatorname{Var}_q(z_t)]} {\mathbb{E}[\operatorname{Var}_{\mathrm{ref}}(z_t)]} \right)^2. \]

There are variants:

  • global variance ratio;
  • time-local variance ratio;
  • low-observation weighted variance ratio;
  • regime-local variance ratio for randomized \(Q/R\).

These rows are not fully unsupervised because they use reference variance targets. They are still useful because they identify what went wrong.
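As a sketch of the penalty, the global and time-local variants differ only in which axes are averaged before taking the ratio. The reading of "time-local" as a per-time-step ratio averaged over the batch is my assumption, as is the function name.

```python
import numpy as np

def log_var_ratio_penalty(var_q, var_ref, axis=None):
    """Squared log ratio of mean variances. axis=None averages over everything."""
    ratio = np.mean(var_q, axis=axis) / np.mean(var_ref, axis=axis)
    return np.log(ratio) ** 2

var_ref = np.full((8, 50), 2.0)          # (batch, time) reference variances
var_q = 0.05 * var_ref                   # a badly under-dispersed filter
global_pen = log_var_ratio_penalty(var_q, var_ref)           # one number
time_local = log_var_ratio_penalty(var_q, var_ref, axis=0)   # one per time step
no_pen = log_var_ratio_penalty(var_ref, var_ref)
```

The log ratio makes the penalty symmetric: a filter that is 20 times too narrow is penalized as much as one that is 20 times too wide.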

In the zero-observation nonlinear case, reference calibration recovers the grid reference uncertainty:

| Case | Model | state NLL | coverage 90 | variance ratio |
| --- | --- | --- | --- | --- |
| zero | baseline | 15.471 | 0.264 | 0.050 |
| zero | time w1 | 2.732 | 0.910 | 1.004 |

That says the evaluation and reference machinery are coherent. It also says the unassisted objective was failing to preserve uncertainty.

Technique 7: grid reference filtering

For the nonlinear scalar benchmark, the reference filter discretizes a grid over \(z\). At each step it propagates mass through the Gaussian transition, multiplies by the nonlinear likelihood, and renormalizes:

\[ \tilde{p}_t(z_t) = \int p(z_t\mid z_{t-1})p_{t-1}(z_{t-1})\,dz_{t-1}, \]\[ p_t(z_t) \propto p(y_t\mid z_t,x_t)\tilde{p}_t(z_t). \]

In code, that is a log-space matrix update over the grid:

pred_log_mass = logsumexp(
    prev_log_mass[:, :, None] + transition_log_mass[None, :, :],
    axis=1,
)
filter_log_mass = pred_log_mass + normal_log_prob(y_t[:, None], obs_mean, r)
filter_log_mass = filter_log_mass - logsumexp(filter_log_mass, axis=1, keepdims=True)

The grid filter is used for diagnostics and reference-distilled controls. It is not used to train fully unsupervised rows.
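For readers who want to run the idea end to end, here is a self-contained single-sequence grid filter in NumPy. It is a sketch of the same recursion, not the repo's batched JAX reference; the grid bounds, resolution, and prior are assumptions chosen for the example.

```python
import numpy as np

def logsumexp(a, axis, keepdims=False):
    a_max = np.max(a, axis=axis, keepdims=True)
    out = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def normal_log_prob(value, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (value - mean) ** 2 / var)

def grid_filter(x, y, grid, q, r, prior_mean=0.0, prior_var=1.0):
    """Filter one sequence of y_t = x_t sin(z_t) + v_t on a fixed grid over z."""
    # trans[i, j] = log mass moved from grid[i] to grid[j]; rows are normalized.
    trans = normal_log_prob(grid[None, :], grid[:, None], q)
    trans = trans - logsumexp(trans, axis=1, keepdims=True)
    log_mass = normal_log_prob(grid, prior_mean, prior_var)
    log_mass = log_mass - logsumexp(log_mass, axis=0)
    means, variances = [], []
    for x_t, y_t in zip(x, y):
        pred = logsumexp(log_mass[:, None] + trans, axis=0)      # propagate
        log_mass = pred + normal_log_prob(y_t, x_t * np.sin(grid), r)
        log_mass = log_mass - logsumexp(log_mass, axis=0)        # renormalize
        mass = np.exp(log_mass)
        mean = np.sum(mass * grid)
        means.append(mean)
        variances.append(np.sum(mass * (grid - mean) ** 2))
    return np.array(means), np.array(variances)

grid = np.linspace(-12.0, 12.0, 961)
# Zero pattern: x_t = 0, so the belief should only diffuse.
means, variances = grid_filter(np.zeros(5), np.ones(5), grid, q=0.1, r=0.5)
```

With `x_t = 0` the variance should grow by `q` per step, which gives a quick check that the transition matrix and normalization are wired correctly.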

Technique 8: reference moment distillation

Reference moment distillation trains a neural filter to match grid-reference moments:

\[ \frac{(m_\theta - m_{\mathrm{ref}})^2} {P_{\mathrm{ref}}} + \left(\log P_\theta - \log P_{\mathrm{ref}}\right)^2. \]

This is not a fair unsupervised method, but it answers a critical question: does the strict family have enough capacity?

The answer was yes. A direct moment-distilled head reached:

| Case | state NLL | coverage 90 | variance ratio |
| --- | --- | --- | --- |
| weak | 2.806 | 0.832 | 0.665 |
| intermittent | 2.806 | 0.832 | 0.656 |

That was far better than vanilla nonlinear ELBO. It showed that a strict Gaussian filter can perform well if the training signal is strong enough.

Technique 9: horizon rollout distillation

Structured teacher-forced distillation had one-step capacity but poor self-fed rollout. Horizon rollout distillation addresses that by starting from reference beliefs, rolling the learned update for \(H\) steps, and penalizing reference moment error throughout the rollout.

For horizon \(H=4\), the structured head became much more stable:

| Case | state NLL | coverage 90 | variance ratio |
| --- | --- | --- | --- |
| weak | 2.808 | 0.881 | 0.993 |
| intermittent | 3.340 | 0.861 | 1.124 |

The interpretation is subtle. The structured head can be calibrated, but it needs a training signal that sees short self-fed futures. A one-step objective is not enough.

Technique 10: windowed joint ELBO

The first fully unsupervised nonlinear repair tried to make the ELBO trajectory-consistent. Instead of scoring each edge independently, it samples a short latent path from the terminal filtering marginal and backward conditionals.

For a window ending at \(s+H\):

\[ q(z_{s-1:s+H}) = q^F_{s+H}(z_{s+H}) \prod_{t=s}^{s+H}q^B_t(z_{t-1}\mid z_t). \]

The objective scores the sampled path under the carried prior and the generative model:

\[ \begin{aligned} \mathcal{L}_{s,H} &= \mathbb{E}_{q} \big[ \log q^F_{s-1}(z_{s-1}) {}+ \sum_{t=s}^{s+H}\log p(z_t\mid z_{t-1}) \\ &\quad {}+ \sum_{t=s}^{s+H}\log p(y_t\mid z_t,x_t) {}- \log q^F_{s+H}(z_{s+H}) {}- \sum_{t=s}^{s+H}\log q^B_t(z_{t-1}\mid z_t) \big]. \end{aligned} \]

The implementation includes the sanity check that horizon one matches the local edge ELBO. That is covered by tests/test_train_nonlinear_objectives.py.

[Figure: Nonlinear objective repair sweep]

The promoted Gaussian unsupervised repair combined the windowed joint ELBO, a predictive-y score, and masked-y spans in a single row:

structured_joint_elbo_h4_w005_predictive_y_masked_y_spans_h4

It improved degraded-observation robustness:

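| Condition | vanilla structured NLL | promoted NLL | vanilla var ratio | promoted var ratio |
| --- | --- | --- | --- | --- |
| sinusoidal | 52.989 | 54.930 | 0.041 | 0.083 |
| weak | 20.865 | 14.672 | 0.058 | 0.090 |
| intermittent | 37.853 | 22.992 | 0.038 | 0.060 |
| zero | 13.474 | 8.414 | 0.056 | 0.107 |
| random normal | 113.958 | 60.109 | 0.014 | 0.040 |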

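This is a real improvement, but not a solved filter. The promoted variance ratios remain very low (0.04 to 0.11), and state NLL on the clean sinusoidal pattern got slightly worse (52.989 to 54.930).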

Technique 11: causal predictive-y scoring

A filter should be useful before it assimilates the current measurement. The pre-assimilation predictive term asks whether the previous filtering belief, propagated through the transition, assigns probability to \(y_t\):

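\[ p_q(y_t\mid y_{1:t-1}, x_{1:t}) = \int p(y_t\mid z_t,x_t)\int p(z_t\mid z_{t-1})\,q^F_{t-1}(z_{t-1})\,dz_{t-1}\,dz_t. \]

For the sine observation this is computed with Gauss-Hermite quadrature: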

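nodes, weights = hermgauss(num_points)
log_weights = log(weights)
# Change of variables z = m + sqrt(2 P) * node for a Gaussian N(m, P) predictive.
z = prev_mean[..., None] + sqrt(2.0 * pred_state_var[..., None]) * nodes
obs_mean = x[..., None] * sin(z)
log_likelihood = normal_log_prob(y[..., None], obs_mean, r)
# Gauss-Hermite weights sum to sqrt(pi), hence the -0.5 * log(pi) normalizer.
log_prob_y = logsumexp(log_weights + log_likelihood, axis=-1) - 0.5 * log(pi)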

This term is causal only if the update path has not already used \(y_t\) to construct the prediction. The reports treat that as a guardrail.
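A runnable NumPy version of the quadrature makes the causality guardrail easy to see: the function only receives the previous belief, never the updated one. This is a sketch with assumed names and a fixed node count, not the repo's JAX function.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def predictive_y_log_prob(y, x, prev_mean, prev_var, q, r, num_points=32):
    """log p_q(y_t | past) for y_t = x_t sin(z_t) + v_t, from the previous belief only."""
    nodes, weights = hermgauss(num_points)
    pred_var = prev_var + q                               # random-walk prediction
    z = prev_mean[..., None] + np.sqrt(2.0 * pred_var[..., None]) * nodes
    obs_mean = x[..., None] * np.sin(z)
    log_lik = -0.5 * (np.log(2.0 * np.pi * r) + (y[..., None] - obs_mean) ** 2 / r)
    log_terms = np.log(weights) + log_lik
    top = log_terms.max(axis=-1, keepdims=True)           # stable log-sum-exp
    log_sum = top[..., 0] + np.log(np.exp(log_terms - top).sum(axis=-1))
    return log_sum - 0.5 * np.log(np.pi)

y = np.array([0.4, -1.0])
prev_mean = np.array([0.1, 2.0])
prev_var = np.array([0.5, 1.5])
# With x = 0 the prediction cannot depend on z, so the result must be N(y; 0, r).
log_p_zero = predictive_y_log_prob(y, np.zeros(2), prev_mean, prev_var, q=0.1, r=0.5)
log_p_sine = predictive_y_log_prob(y, np.ones(2), prev_mean, prev_var, q=0.1, r=0.5)
```

The `x = 0` case has a closed form, which makes it a convenient unit test for the normalizer.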

Technique 12: masked-y span training

Masked-y training withholds measurements from the update path for random points or spans. When \(y_t\) is masked, the filter should perform the exact transition prediction:

\[ m_t = m_{t-1},\quad P_t = P_{t-1} + Q. \]

That behavior is tested in the nonlinear reference tests. The purpose is to make the carried belief robust during uninformative spans. In the promoted Gaussian objective, masked spans are paired with the predictive-y score and windowed ELBO.

Technique 13: Gaussian mixture filters

The sine likelihood is periodic, so a single Gaussian belief can be a poor description of the posterior. The mixture branch uses:

\[ q^F_t(z_t) = \sum_{k=1}^K \pi_{t,k}\mathcal{N}(z_t;\mu_{t,k},\sigma^2_{t,k}). \]

The direct mixture head emits weights, component means, component variances, and componentwise backward conditionals. In simplified code:

raw = hidden @ w2 + b2
# Split the last axis into (component, 6 raw outputs per component).
raw = raw.reshape(*raw.shape[:-1], num_components, 6)
filter_weights = softmax(raw[..., 0], axis=-1)
component_mean = prev_mean[..., None] + raw[..., 1]
component_var = softplus(raw[..., 2]) + min_var

Mixtures by themselves did not solve the problem. The useful result came from pairing a mixture family with a multi-sample objective.
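Scoring a mixture belief with the metrics used in this note needs two small pieces: the mixture log density for state NLL, and the overall mixture variance for the variance ratio. The helpers below are a NumPy sketch with hypothetical names, not the repo's mixture code.

```python
import numpy as np

def mixture_log_prob(z, weights, means, variances):
    """log q(z) for a 1-D Gaussian mixture; the component axis is last."""
    comp = -0.5 * (np.log(2.0 * np.pi * variances)
                   + (z[..., None] - means) ** 2 / variances)
    log_terms = np.log(weights) + comp
    top = log_terms.max(axis=-1, keepdims=True)
    return top[..., 0] + np.log(np.exp(log_terms - top).sum(axis=-1))

def mixture_moments(weights, means, variances):
    """Overall mean and variance of the mixture."""
    mean = (weights * means).sum(axis=-1)
    second = (weights * (variances + means**2)).sum(axis=-1)
    return mean, second - mean**2

# Two equally weighted components one sine period apart.
w = np.array([0.5, 0.5])
mu = np.array([0.0, 2.0 * np.pi])
var = np.array([0.1, 0.1])
mix_mean, mix_var = mixture_moments(w, mu, var)
log_q = mixture_log_prob(np.array([0.0, np.pi]), w, mu, var)
```

The example shows why the variance ratio alone can mislead for mixtures: the overall variance is dominated by the gap between aliases, while the density at the midpoint between them is tiny.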

Technique 14: IWAE window objectives

The IWAE objective changes how samples are combined. If \(w_k\) are importance weights for sampled paths, the ordinary ELBO averages log weights:

\[ \mathcal{L}_{\mathrm{ELBO}} = \frac{1}{K}\sum_{k=1}^K \log w_k. \]

IWAE uses:

\[ \mathcal{L}_{\mathrm{IWAE}} = \log\left(\frac{1}{K}\sum_{k=1}^K w_k\right). \]

In code:

if objective_family == "iwae":
    return logsumexp(log_weights, axis=0) - log(num_samples)

The best K2 mixture IWAE row in the April 29 report was:

direct_mixture_k2_joint_iwae_h4_k32

[Figure: Mixture IWAE sweep]

It improved state density dramatically:

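| Pattern | promoted Gaussian NLL | K2 IWAE h4 k32 NLL | promoted var ratio | K2 IWAE var ratio |
| --- | --- | --- | --- | --- |
| sinusoidal | 54.930 | 6.402 | 0.083 | 0.125 |
| weak | 14.672 | 4.220 | 0.090 | 0.232 |
| intermittent | 22.992 | 4.335 | 0.060 | 0.254 |
| zero | 8.414 | 5.105 | 0.107 | 0.111 |
| random normal | 60.109 | 8.452 | 0.040 | 0.087 |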

But predictive-y NLL regressed on nonzero-observation stressors. That means the row was not promotable as a fully solved filter:

| Pattern | promoted pred-y NLL | K2 IWAE pred-y NLL |
| --- | --- | --- |
| sinusoidal | 0.571 | 0.681 |
| weak | 0.322 | 0.334 |
| intermittent | 0.374 | 0.389 |
| random normal | 0.614 | 0.726 |

This was the first major coupled-bottleneck result: posterior expressivity helps when paired with the right objective, but a state-density win can be an observation-prediction loss.

Technique 15: Renyi and alpha objectives

The Renyi objective generalizes the sample aggregation:

\[ \mathcal{L}_{\alpha} = \frac{1}{1-\alpha} \log\left(\frac{1}{K}\sum_{k=1}^K w_k^{1-\alpha}\right). \]

As \(\alpha\to 1\), it approaches the ELBO, and at \(\alpha=0\) it reduces to the IWAE objective. Smaller values of \(\alpha\) put more emphasis on high-weight samples.

In the reported Gaussian branch, changing only the Gaussian objective to IWAE or Renyi did not solve calibration. The mixture plus IWAE pairing mattered more.
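The ELBO, IWAE, and Renyi aggregations can share one function, since they differ only in the exponent applied to the weights. Below is a NumPy sketch with names of my own choosing; the random log weights are placeholders, not experiment outputs.

```python
import numpy as np

def logmeanexp(a, axis=0):
    top = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(top, axis=axis) + np.log(np.mean(np.exp(a - top), axis=axis))

def renyi_bound(log_weights, alpha, axis=0):
    """alpha = 0 gives IWAE; alpha -> 1 gives the ELBO (mean log weight)."""
    if np.isclose(alpha, 1.0):
        return np.mean(log_weights, axis=axis)
    return logmeanexp((1.0 - alpha) * log_weights, axis=axis) / (1.0 - alpha)

rng = np.random.default_rng(0)
log_w = rng.normal(size=(32, 5))      # 32 samples for each of 5 windows
elbo = renyi_bound(log_w, alpha=1.0)
half = renyi_bound(log_w, alpha=0.5)
iwae = renyi_bound(log_w, alpha=0.0)
```

For any fixed set of samples the three values are ordered, `elbo <= half <= iwae`, by the power-mean inequality. The gap between them grows with the spread of the log weights, which is why the choice of aggregation matters most when the proposal is poor.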

Technique 16: FIVO-style objectives

FIVO-style objectives estimate a sequence marginal likelihood with particles and resampling. The broad shape is:

\[ \sum_{t=1}^T \log\left(\frac{1}{K}\sum_{k=1}^K w_{t,k}\right), \]

where particles are propagated and optionally resampled over time.

The repo includes several proposal families:

  • marginal filter proposal;
  • transition-filter bridge proposal;
  • learned transition-filter bridge proposal.

The bridge proposal combines a transition from the previous particle with the filter belief at the current time. In Gaussian form:

\[ \sigma^2_{\mathrm{bridge}} = \left(\frac{1}{Q}+\frac{1}{\sigma_t^2}\right)^{-1}, \]\[ \mu_{\mathrm{bridge}} = \sigma^2_{\mathrm{bridge}} \left(\frac{z_{t-1}}{Q}+\frac{\mu_t}{\sigma_t^2}\right). \]

The latest short FIVO bridge resampling suite shows the method is sensitive to resampling and proposal details. Stop-gradient resampling improved coverage in some 250-step K4 runs, but the auxiliary learned bridge was poor in the same short-budget setting. This branch is still exploratory.
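The Gaussian bridge formula above is a product of two Gaussians in \(z_t\): the transition from the previous particle and the current filter belief. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def bridge_proposal(z_prev, filter_mean, filter_var, q):
    """Precision-weighted combination of N(z_prev, q) and N(filter_mean, filter_var)."""
    precision = 1.0 / q + 1.0 / filter_var
    var = 1.0 / precision
    mean = var * (z_prev / q + filter_mean / filter_var)
    return mean, var

# Equal variances: the bridge mean is the midpoint and the variance halves.
mean_eq, var_eq = bridge_proposal(z_prev=0.0, filter_mean=1.0, filter_var=0.2, q=0.2)
# A very wide filter belief leaves the proposal close to the transition.
mean_wide, var_wide = bridge_proposal(z_prev=0.0, filter_mean=1.0, filter_var=1e6, q=0.2)
```

The second case shows the limiting behavior: when the filter belief is uninformative, the bridge falls back to the bootstrap proposal.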

Technique 17: bootstrap particle filter reference

A bootstrap particle filter is also used as a diagnostic reference. It samples particles from the transition:

\[ z_t^{(k)} \sim p(z_t\mid z_{t-1}^{(k)}), \]

weights them by the observation likelihood:

\[ w_t^{(k)} \propto p(y_t\mid z_t^{(k)},x_t), \]

then resamples.

The particle filter is not the headline reference because the grid filter is deterministic and scalar. But particle filtering is useful for sanity checking mixture and quadrature rows, especially when comparing state density and predictive-y behavior.
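The three steps above fit in a short loop. The sketch below is a plain NumPy bootstrap filter for the sine model with multinomial resampling at every step. It is not the repo's implementation, and the particle count and prior are assumptions. It also accumulates the log marginal likelihood estimate, which has the same per-step form as the FIVO-style sum.

```python
import numpy as np

def bootstrap_particle_filter(x, y, q, r, num_particles=2000, seed=0,
                              prior_mean=0.0, prior_var=1.0):
    """Bootstrap PF for z_t = z_{t-1} + w_t, y_t = x_t sin(z_t) + v_t."""
    rng = np.random.default_rng(seed)
    z = prior_mean + np.sqrt(prior_var) * rng.normal(size=num_particles)
    means, variances, log_evidence = [], [], 0.0
    for x_t, y_t in zip(x, y):
        z = z + np.sqrt(q) * rng.normal(size=num_particles)     # propagate
        log_w = -0.5 * (np.log(2.0 * np.pi * r) + (y_t - x_t * np.sin(z)) ** 2 / r)
        top = log_w.max()
        w = np.exp(log_w - top)
        log_evidence += top + np.log(w.mean())                   # log (1/K) sum_k w_k
        w = w / w.sum()
        mean = np.sum(w * z)
        means.append(mean)
        variances.append(np.sum(w * (z - mean) ** 2))
        # Multinomial resampling according to the normalized weights.
        z = z[rng.choice(num_particles, size=num_particles, p=w)]
    return np.array(means), np.array(variances), log_evidence

means, variances, log_ev = bootstrap_particle_filter(
    np.zeros(5), np.ones(5), q=0.1, r=0.5
)
```

On the zero pattern all weights are equal, so the evidence estimate is exact and the particle variance should track the grid reference up to Monte Carlo noise.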

Technique 18: local tilted projection

The quadrature and ADF branches start from the local tilted distribution:

\[ \tilde{p}(z_t) \propto p(y_t\mid z_t,x_t) \int p(z_t\mid z_{t-1})q^F_{t-1}(z_{t-1})\,dz_{t-1}. \]

Then they project \(\tilde{p}\) back into the chosen belief family.

For a Gaussian ADF update, the projection is moment matching:

\[ q^F_t(z_t) = \mathcal{N} \left( \mathbb{E}_{\tilde{p}}[z_t], \operatorname{Var}_{\tilde{p}}[z_t] \right). \]

This is reference-free because it uses the known transition, known observation model, and current observations, not the grid posterior cache.
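For the scalar sine model, a Gaussian ADF step can be sketched with Gauss-Hermite nodes placed under the predictive Gaussian. This is a minimal illustration under assumed names and node counts, not the repo's quadrature code. Centering the nodes on the prior is a simplification that loses accuracy when the likelihood is much sharper than the predictive belief.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gaussian_adf_step(prev_mean, prev_var, x_t, y_t, q, r, num_points=32):
    """Moment-match the tilted distribution for y_t = x_t sin(z_t) + v_t."""
    nodes, weights = hermgauss(num_points)
    pred_mean, pred_var = prev_mean, prev_var + q        # random-walk prediction
    z = pred_mean + np.sqrt(2.0 * pred_var) * nodes
    log_lik = -0.5 * (np.log(2.0 * np.pi * r) + (y_t - x_t * np.sin(z)) ** 2 / r)
    log_terms = np.log(weights) + log_lik
    tilted = np.exp(log_terms - log_terms.max())
    tilted = tilted / tilted.sum()                       # normalized tilted mass on the nodes
    mean = np.sum(tilted * z)
    var = np.sum(tilted * (z - mean) ** 2)
    return mean, var

# Uninformative measurement: the update reduces to the transition prediction.
m_zero, P_zero = gaussian_adf_step(0.3, 1.0, x_t=0.0, y_t=5.0, q=0.1, r=0.5)
# Informative measurement with a narrow prior: the variance shrinks.
m_obs, P_obs = gaussian_adf_step(0.0, 0.05, x_t=1.0, y_t=0.2, q=0.01, r=0.1)
```

Nothing here touches a reference posterior: the inputs are the previous belief, the known \(Q\) and \(R\), and the current \((x_t, y_t)\).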

Technique 19: quadrature ADF

For the scalar sine model, the tilted distribution can be integrated with deterministic quadrature. This is a different class from learned amortized filters. It is closer to classic assumed-density filtering.

The April 30 quadrature suite found:

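| Pattern | grid ref NLL | Gaussian ADF | K4 ADF spread | K4 Power-EP alpha 0.5 |
| --- | --- | --- | --- | --- |
| sinusoidal | 2.691 | 6.054 | 6.840 | 2.867 |
| weak | 2.692 | 2.749 | 2.648 | 2.638 |
| intermittent | 2.695 | 2.769 | 2.674 | 2.528 |
| zero | 2.693 | 2.693 | 2.736 | 2.736 |
| random normal | 2.698 | 8.834 | 8.976 | 5.002 |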

That was an important reversal. The best neural fully unsupervised filters were still far from the grid reference, but deterministic reference-free updates could get much closer.

Technique 20: Power-EP

Expectation propagation uses local factors and projections. Power-EP introduces a power parameter \(\alpha\), which changes how strongly the likelihood tilts the current belief before projection.

In these reports, Power-EP style updates are used as reference-free filtering rules. They are not trained on reference moments.

[Figure: Quadrature state NLL and predictive-y comparison]

The K4 Power-EP row with \(2\pi\)-spread components was strong on weak and intermittent patterns. It still struggled on random-normal observations until alias-indexed components were introduced.

Technique 21: alias-indexed mixture components

The sine likelihood has a built-in alias structure:

\[ \sin(z) = \sin(z + 2\pi k). \]

Instead of asking a neural network to discover this from scratch, the alias-indexed quadrature branch places mixture components on \(2\pi\)-spaced aliases and weights them with the prior and likelihood.
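The alias idea can be illustrated in a few lines. Exact \(2\pi\) aliases have identical likelihood, so among them only the prior can break the tie. The helper below is a hypothetical sketch of that weighting, not the repo's alias-indexed Power-EP update.

```python
import numpy as np

def alias_components(base_mean, pred_mean, pred_var, num_aliases=5):
    """Centers at base_mean + 2*pi*k, weighted by the predictive prior density."""
    half = num_aliases // 2
    k = np.arange(-half, half + 1)
    centers = base_mean + 2.0 * np.pi * k
    log_w = -0.5 * (centers - pred_mean) ** 2 / pred_var
    w = np.exp(log_w - log_w.max())
    return centers, w / w.sum()

# A wide predictive belief keeps several aliases alive.
centers, weights = alias_components(base_mean=0.4, pred_mean=0.0, pred_var=30.0)
# A narrow predictive belief collapses onto the nearest alias.
_, narrow_weights = alias_components(base_mean=0.4, pred_mean=0.0, pred_var=0.5)
```

When the predictive variance is large relative to \((2\pi)^2\), several components carry real mass, which is the situation a single Gaussian belief cannot represent.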

The prior-weighted K5 alias Power-EP row reached near-reference state NLL across all five stressors:

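| Pattern | K5 alias state NLL | ref state NLL | coverage 90 | variance ratio | pred-y NLL | ref pred-y NLL |
| --- | --- | --- | --- | --- | --- | --- |
| sinusoidal | 2.602 | 2.691 | 0.924 | 1.533 | 0.710 | 0.457 |
| weak | 2.750 | 2.692 | 0.968 | 1.562 | 0.334 | 0.301 |
| intermittent | 2.739 | 2.695 | 0.962 | 1.549 | 0.393 | 0.351 |
| zero | 2.758 | 2.693 | 0.969 | 1.585 | 0.260 | 0.260 |
| random normal | 2.717 | 2.698 | 0.888 | 2.011 | 0.827 | 0.530 |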

The state-density result is excellent, but the row is over-dispersed (variance ratios between 1.5 and 2.0), and predictive-y NLL trails the reference in the clean sinusoidal and random-normal settings.

Shrink variants reduce variance:

| Pattern | shrink state NLL | coverage 90 | variance ratio |
| --- | --- | --- | --- |
| weak | 2.687 | 0.913 | 0.998 |
| intermittent | 2.683 | 0.909 | 0.994 |
| zero | 2.695 | 0.916 | 1.016 |

But shrink hurts other patterns. This is now a Pareto problem, not a single-row victory.

What has been learned

The research arc can be summarized as a sequence of increasingly specific diagnoses.

First, the linear-Gaussian case established the plumbing. Exact Kalman, frozen-marginal controls, supervised edge learning, and ELBO metrics all work. The scalar MC ELBO is already under-dispersed in weak and zero-observation regimes.

Second, the nonlinear learned Gaussian filter exposed a much larger under-dispersion problem. More steps, simple resampling, and variance penalties did not fix the fully unsupervised objective.

Third, reference-distilled diagnostics showed that good strict filters exist. The issue is not simply that online Gaussian filtering is impossible. The issue is finding the right reference-free objective or update rule.

Fourth, a combined unsupervised Gaussian objective (windowed ELBO, predictive-y, and masked-y spans) improved robustness but remained badly under-dispersed.

Fifth, direct K2 mixture IWAE improved state density substantially. This showed that posterior family and objective are coupled. It also introduced a new failure: state-density improvement can regress predictive-y NLL.

Sixth, deterministic quadrature ADF and Power-EP baselines showed that the known nonlinear likelihood contains enough structure to get close to the grid reference without reference targets. The current strongest state-density rows are algorithmic reference-free filters, not learned amortized ELBO filters.

A rough scorecard

This table compresses the main techniques into one view. Values are indicative from the reports, not a single unified sweep.

| Technique | Training signal | Strong point | Main weakness |
| --- | --- | --- | --- |
| exact Kalman | exact linear-Gaussian inference | gold standard in scalar linear case | not available for nonlinear sine |
| frozen marginal | supervised/control | verifies edge/backward machinery | not a learned filter |
| structured local ELBO | fully unsupervised | clean VBF objective | severe nonlinear under-dispersion |
| oracle variance calibration | oracle diagnostic | proves variance is bottleneck | uses reference variance |
| direct moment distillation | reference-distilled | strong nonlinear state NLL | not unsupervised |
| structured rollout h4 | reference-distilled | calibrated structured comparison | still reference-assisted |
| joint ELBO + predictive-y + masking | fully unsupervised | improves degraded robustness | variance ratios remain low |
| K2 mixture IWAE | fully unsupervised | large state-density gains | predictive-y regression |
| FIVO bridge | fully unsupervised | promising particle objective | sensitive and still exploratory |
| quadrature ADF | deterministic reference-free | strong classic filtering baseline | uneven on aliases/random-normal |
| Power-EP K4/K5 | deterministic reference-free | near-reference state NLL in many regimes | state-predictive tradeoff remains |

The most important caveat

The latest generated report artifacts are not all equally reliable as final summaries. One generated final report had empty robustness and diagnostic tables while still asserting a fixed best row. The source metrics and narrower summary files are more reliable than that incomplete aggregate. Until the aggregate report is regenerated with the full intended input set, I would treat the following as the clean state:

  • linear-Gaussian benchmark: report-ready;
  • nonlinear learned Gaussian objective repair: partial success, not solved;
  • nonlinear K2 mixture IWAE: best learned fully unsupervised state-density row in the April 29 branch, blocked by predictive-y regression;
  • quadrature ADF / Power-EP: strongest current reference-free algorithmic baselines, with a remaining state-density versus predictive-y tradeoff.

How to reproduce pieces

The repo uses uv and JAX. The common commands are:

uv sync --dev
make test

Run the scalar linear-Gaussian aggregate:

make aggregate-linear-gaussian-reports

Run nonlinear learned sweeps from the sweep script:

uv run python scripts/sweep_nonlinear_learned.py \
  --models structured_joint_elbo_h4_w005_predictive_y_masked_y_spans_h4 \
  --steps 1000

Run the direct nonlinear training script from a config:

uv run python scripts/train_nonlinear.py \
  --config experiments/nonlinear/09_structured_elbo_sine_mlp.yaml

The exact command surface changes as the research harness evolves, so the most stable entry points are the scripts and committed configs in:

Where I would go next

I would split the next work into two tracks.

The first track is report hygiene. The latest findings should be consolidated into one committed status note that separates:

  • learned neural filters;
  • deterministic reference-free filters;
  • reference-distilled diagnostics;
  • oracle-calibrated diagnostics.

The second track is methodological. The quadrature results are now too strong to ignore. The next learned filters should probably learn from the structure that quadrature exposed, not just run another local ELBO sweep. Good next questions:

  • Can an amortized filter learn the alias-aware Power-EP update?
  • Can the state-density and predictive-y objectives be separated into a principled two-head selection rule?
  • Can mixture components be initialized or constrained around known periodic aliases without using reference moments?
  • Can the predictive-y regression be fixed without giving back the state-density gains?

The main lesson is not that one technique won. It is that filtering quality is multi-dimensional. A filter must carry calibrated state uncertainty, predict future measurements, survive weak observations, and remain online. The Kalman filter gets all of that for free in the linear-Gaussian case. In the nonlinear case, every approximation decides which part of that contract it is willing to pay for.

Source artifacts

The most useful source artifacts for this synthesis are linked below. Stable committed source files are linked to GitHub; generated summary snapshots used for the April 30 interpretation are bundled with this post under artifacts/.