Find flaky tests
A test is flaky if it sometimes passes and sometimes fails on the same DUT under the same conditions. Real flakes hide one of three things: a marginal limit, a measurement that depends on uncontrolled environment, or a race in setup. This recipe uses the operator UI and the parquet store to identify which.
Prerequisites
- A few weeks of accumulated runs in the project's data dir (the retest signal needs repeated DUT serials across sessions to mean anything).
- litmus serve running on the bench.
1. Find the suspects in the Metrics → Retest tab
Open /metrics, click the
Retest tab. The chart shows the percentage of unique DUTs that
needed more than one attempt to clear the same step, bucketed by
period. The table below shows Period / Serials / Retested / Rate /
Avg retries.
High retest rates flag flaky tests OR marginal hardware. To narrow to "is it the test", filter the same Metrics view by product or station with the filter bar above the tabs and see whether the spike follows the test, the station, or the product.
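As a rough illustration of what the Rate column measures, the definition above (unique DUTs that needed more than one attempt at the same step, per period) can be sketched in plain Python. This is one reading of the chart's description, not the litmus implementation; the function name and the `(period, dut_serial, step_path)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def retest_rate(attempts):
    """Return {period: percent of unique serials that retried any step}.

    `attempts` is an iterable of (period, dut_serial, step_path) tuples,
    one tuple per attempt. This layout is hypothetical, chosen for the sketch.
    """
    # Count attempts per (period, serial, step).
    counts = defaultdict(int)
    for period, serial, step in attempts:
        counts[(period, serial, step)] += 1

    serials = defaultdict(set)   # every serial seen in a period
    retested = defaultdict(set)  # serials with more than one attempt at a step
    for (period, serial, _step), n in counts.items():
        serials[period].add(serial)
        if n > 1:
            retested[period].add(serial)

    return {p: 100.0 * len(retested[p]) / len(serials[p]) for p in serials}


attempts = [
    ("2024-W01", "DPB001-0001", "test_output_voltage"),
    ("2024-W01", "DPB001-0001", "test_output_voltage"),  # second attempt
    ("2024-W01", "DPB001-0002", "test_output_voltage"),
    ("2024-W02", "DPB001-0003", "test_output_voltage"),
]
rates = retest_rate(attempts)
```

Running the same calculation separately per station or per product mirrors what the filter bar does: if the rate stays high only for one station, suspect the station rather than the test.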
2. Pin the test that's flaking
The Retest tab is aggregate; for the specific test, open
/results. The list
doesn't text-filter by DUT serial, so sort by Started descending
and scan the DUT column for one of the affected serials. A flaky
test shows up as a serial that has both passed and failed rows
in its history without an obvious code change between them.
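If the run list has been exported, the same scan can be automated. The sketch below assumes a hypothetical list of `(dut_serial, outcome)` pairs with outcomes spelled `passed` and `failed`; it only shows the grouping logic, not a litmus API.

```python
from collections import defaultdict


def mixed_outcome_serials(runs):
    """Return the sorted serials that have both passed and failed runs.

    `runs` is an iterable of (dut_serial, outcome) pairs; the pair layout
    is an assumption for this sketch.
    """
    seen = defaultdict(set)
    for serial, outcome in runs:
        seen[serial].add(outcome)
    # A flake candidate has at least one pass and at least one fail.
    return sorted(s for s, o in seen.items() if {"passed", "failed"} <= o)


runs = [
    ("DPB001-0001", "passed"),
    ("DPB001-0001", "failed"),
    ("DPB001-0002", "failed"),
    ("DPB001-0002", "failed"),
    ("DPB001-0003", "passed"),
]
suspects = mixed_outcome_serials(runs)
```

Here only `DPB001-0001` is a suspect; `DPB001-0002` fails consistently, which points to a real defect rather than a flake.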
Click into a failing run. The
Results detail step
tree shows one row per (step_path, vector_index) regardless of
retry count; to see the individual attempts, jump to the parquet
query in step 4. Check the failing step's measurements
table: a borderline value just outside the limit points to a marginal
limit; a wild value points to environment or hardware.
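One way to make "borderline" versus "wild" less subjective is to compare how far the value overshoots the limit with the width of the limit window. The sketch below is an illustration only: the function name and the 5% threshold are assumptions and should be tuned per measurement.

```python
def classify_failure(value, low, high, borderline_fraction=0.05):
    """Label a measurement relative to its limits.

    A value outside [low, high] by no more than `borderline_fraction` of the
    limit span is called "borderline"; anything further out is "wild".
    The 5% default is a hypothetical threshold, not a litmus setting.
    """
    if low <= value <= high:
        return "in_limits"
    span = high - low
    overshoot = (low - value) if value < low else (value - high)
    return "borderline" if overshoot <= borderline_fraction * span else "wild"


# Example limits for a nominal 5 V rail: 4.75 V to 5.25 V.
near_miss = classify_failure(5.26, 4.75, 5.25)
far_off = classify_failure(3.10, 4.75, 5.25)
nominal = classify_failure(5.00, 4.75, 5.25)
```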
3. Make the retry behaviour explicit
If the test is genuinely intermittent and you can't fix the root
cause yet, set an explicit retry policy with the
@pytest.mark.litmus_retry
marker:
    import pytest

    @pytest.mark.litmus_retry(max_retries=2, delay=0.5, on=["AssertionError"])
    def test_output_voltage(context, verify):
        ...

This translates to pytest-rerunfailures under the hood. Every
retry produces parquet rows with the same vector_index and an
incremented vector_retry. The operator UI's step tree counts
those and shows them as retries.
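As a mental model of how one vector turns into several rows, the loop below re-runs a callable and records one entry per attempt. It is a standalone sketch, not how litmus or pytest-rerunfailures is implemented. It assumes that `max_retries` counts re-runs after the first attempt, that `vector_retry` starts at 0, and that `on` is matched against the exception class name; the record keys are hypothetical.

```python
import time


def run_with_retries(test_fn, max_retries=2, delay=0.5, on=("AssertionError",)):
    """Call test_fn until it passes or retries run out; return one record per attempt."""
    records = []
    for vector_retry in range(max_retries + 1):
        try:
            test_fn()
        except Exception as exc:
            records.append(
                {"vector_index": 0, "vector_retry": vector_retry, "outcome": "failed"}
            )
            retryable = type(exc).__name__ in on
            # Stop on a non-retryable error or when the retry budget is spent.
            if not retryable or vector_retry == max_retries:
                break
            time.sleep(delay)
        else:
            records.append(
                {"vector_index": 0, "vector_retry": vector_retry, "outcome": "passed"}
            )
            break
    return records


calls = {"n": 0}


def flaky_once():
    """Fail on the first call, pass on every later call."""
    calls["n"] += 1
    assert calls["n"] > 1, "first attempt fails"


records = run_with_retries(flaky_once, delay=0.0)
```

The two records share `vector_index` 0 and differ only in `vector_retry` (0, then 1), which is the pattern to look for in the parquet rows.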
4. Confirm with a parquet query
To see every attempt for one (step, serial) combination across all runs in the project, query the parquet store directly:
    duckdb -c "
    SELECT run_id, dut_serial, step_path, vector_index, vector_retry,
           measurement_outcome, measurement_value
    FROM read_parquet('<data_dir>/runs/**/*.parquet')
    WHERE step_path = 'test_output_voltage'
      AND dut_serial = 'DPB001-0001'
      AND record_type = 'measurement'
    ORDER BY run_started_at DESC, vector_retry ASC
    "

A row where vector_retry increments past 0 is a retried attempt.
When the final retry's measurement_outcome is passed
but earlier retries were failed, the step is a real intermittent: the
unit is good and the test just had to try again. When every
retry of the same step on the same serial fails the same way, it is
not a flake at all; it's a deterministic failure.
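The reading rules above can be written down as a small classifier over the outcomes of one step in one run, ordered by `vector_retry`. This is a sketch: the function name and labels are made up for the example, and it compares outcomes only, so confirming that failures happen "the same way" still means looking at `measurement_value`.

```python
def classify_attempts(outcomes):
    """Classify one step's attempts, ordered by vector_retry ascending.

    `outcomes` is a list of "passed" / "failed" strings. Only outcomes are
    compared; measurement values are not inspected in this sketch.
    """
    if not outcomes:
        return "no_data"
    if all(o == "passed" for o in outcomes):
        return "clean_pass"
    if all(o == "failed" for o in outcomes):
        return "deterministic_failure"
    if outcomes[-1] == "passed":
        return "intermittent"
    # Passed at some point but ended on a failure: needs a manual look.
    return "mixed"


flake = classify_attempts(["failed", "passed"])
hard_fail = classify_attempts(["failed", "failed", "failed"])
clean = classify_attempts(["passed"])
```

A single failed attempt with no retries also lands in `deterministic_failure` here; with only one data point, treat that label as provisional.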
Resolve <data_dir> from
ProjectConfig or check the
Three Stores page for the default
locations.
5. Cross-check the environment with channels
If the measurement is wild but the DUT is fine, the cause is
usually environmental. Note the session ID on the failing run's
detail page, open /channels, and look at any
power-rail, temperature, or supply-current channel logged during
that session. A 50 mV brown-out on the supply rail during the
failing window is a smoking gun.
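If the channel data can be exported as timestamped samples, the brown-out check can be scripted. The sketch below assumes a hypothetical list of `(timestamp_seconds, volts)` samples and a known nominal rail voltage; it does not use any ChannelStore API.

```python
def find_brownouts(samples, nominal, drop=0.050, window=None):
    """Return the (timestamp, volts) samples that sag at least `drop` volts below nominal.

    `samples` is an iterable of (timestamp_seconds, volts) pairs and
    `window` an optional (start, end) pair limiting the search to the
    failing step's time range. Both layouts are assumptions for this sketch.
    """
    hits = []
    for t, v in samples:
        if window is not None and not (window[0] <= t <= window[1]):
            continue
        if nominal - v >= drop:
            hits.append((t, v))
    return hits


samples = [(0.0, 12.00), (1.0, 11.99), (2.0, 11.94), (3.0, 12.01)]
hits = find_brownouts(samples, nominal=12.0, window=(1.0, 3.0))
```

In this example the only hit is the 11.94 V sample at t = 2.0 s, a 60 mV sag inside the failing window.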
Related
- Metrics — Retest tab — the chart used in step 1
- Results — detail view — the step tree used in step 2
- litmus_retry marker — the retry policy in step 3
- Parquet schema → Retries — vector_retry column semantics
- Three stores — ParquetBackend + ChannelStore
- Compare two runs — what to do once you've narrowed it to two specific runs