Re: Preparing Argo and Lyra for the CI (Was: Preparing Ursa and Lyra for the CI)

Hi Christian,

On 2023-11-01 05:34, Christian Kastner wrote:

On 2023-11-01 03:46, Cordell Bloor wrote:

One failing test suite is that of rocfft, which times out after five
hours [2]. These old servers have terrible single-thread performance,
so it takes a long time to run the rocfft test suite.

FYI, the limit of 5h was arbitrarily chosen by me, so we can always
increase it.

The reason why I chose 5h, instead of 10h or 24h or whatever, is that
the limit is global, and we do have test suites where at least one test
hangs in an infinite loop, namely rocthrust [4]. (Incidentally, it
passes on gfx803, but in <2min, which must also be a bug).

Making timeouts more package-specific was a recent discussion on the
debian-ci list [5]. I've created an issue to add this to our debci [6];
this shouldn't be much work, in fact.

In the short term, we could just increase the global timeout until rocfft can finish on Argo and Lyra. In the medium term, it would make sense to have package-specific time limits. We may also want to look into whether we can speed up FFTW, as the CPU reference library is probably the vast majority of the runtime. We might be able to cache some wisdom files or configure FFTW to use more than one thread.
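
To make the package-specific idea a bit more concrete, here is a rough sketch of a per-package limit that falls back to the global one. The names and the rocfft value are hypothetical placeholders, not debci's actual configuration and not a measured runtime.

```python
# Hypothetical sketch: per-package timeouts with a global fallback.
# None of these names come from debci; they are for illustration only.
GLOBAL_TIMEOUT_HOURS = 5  # the current global limit

# Placeholder override; the value is an assumption, not a measurement.
PACKAGE_TIMEOUT_HOURS = {
    "rocfft": 10,
}


def timeout_seconds(package):
    """Return the timeout for a package, falling back to the global limit."""
    hours = PACKAGE_TIMEOUT_HOURS.get(package, GLOBAL_TIMEOUT_HOURS)
    return hours * 3600
```

With something like this, a hanging suite such as rocthrust would still be cut off at the global limit, while slow but finite suites could be given more room.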

The rocsparse, rocblas, and rocsolver packages are also failing.
Those tests crash with the error "Illegal instruction" [3]. I've not
yet determined the cause of this problem, but it does not occur when
the QEMU CPU model is configured as pass-through. It's not clear to
me why this problem is not seen on the gfx1030 CI machine.

Found it: the autopkgtest QEMU machinery treats AMD and Intel CPUs
differently [7]. ci-worker-ckk{01,02} both have AMD CPUs and these are
passed through as-is. Intel CPUs have a more complex configuration, to
enable nested KVM.

Wow. Good job finding that one!

Though, I still wonder what the illegal instruction is. It would be pretty handy if we could get a stack trace when the tests crash. That would narrow down the problem quite a bit, even without interactive debugging.
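
One possible way to get that, assuming gdb is installed in the testbed, would be to run the failing test under gdb in batch mode. This is only a sketch: the helper names are hypothetical, and the test command is whatever the autopkgtest script actually invokes.

```python
import subprocess


def gdb_backtrace_command(test_argv):
    """Build a gdb command line that runs the test non-interactively.

    If the test stops on a signal (e.g. SIGILL), gdb prints a backtrace
    and disassembles the instruction at the program counter.
    """
    return [
        "gdb", "-batch",
        "-ex", "run",
        "-ex", "bt",
        "-ex", "x/i $pc",
        "--args", *test_argv,
    ]


def run_with_backtrace(test_argv):
    # Assumption: gdb is available on PATH inside the test VM.
    return subprocess.run(
        gdb_backtrace_command(test_argv),
        capture_output=True,
        text=True,
    )
```

The output of run_with_backtrace could then be saved with the test artifacts, so the offending instruction is visible in the CI logs.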

I think the nested KVM feature is used by the official ci.debian.net (at
least for some workers: Cloud -> Debian VM -> autopkgtest VM), so it
might be a model-specific thing.

In any case, making the CPU configurable seems like something that might
be worthwhile to add to src:autopkgtest. I could add that to our fork
for now. Would that help?

That would be useful. The illegal instruction may be an indicator of a legitimate issue (or it might be KVM-specific), but we do not want it to obscure all other test results.

I'll keep this in mind for the meta-worker I'm hacking on. The normal
debci worker is an unbounded listener (on one queue). The meta-worker is
a bounded listener (on multiple queues) and can perform actions when
boundaries are met.

In practice, that means that you could e.g. have the host wake up
periodically by BIOS RTC alarm, and have the meta-worker initiate
shutdown once all its relevant queues are empty.

That sounds nice.
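
If I understand the idea correctly, the bounded listener would look roughly like the sketch below. This is just my reading of it, not the meta-worker's actual code: the in-process queue.Queue objects stand in for whatever message queues the worker really listens on, and run_bounded, handle, and on_done are hypothetical names.

```python
import queue


def run_bounded(queues, handle, on_done):
    """Process jobs from several queues until all of them are empty.

    Once a full pass over the queues finds no work, the boundary is
    reached and on_done() is called (e.g. to initiate shutdown).
    Returns the number of jobs processed.
    """
    processed = 0
    while True:
        did_work = False
        for q in queues:
            try:
                job = q.get_nowait()
            except queue.Empty:
                continue
            handle(job)
            processed += 1
            did_work = True
        if not did_work:
            break
    on_done()
    return processed
```

In the RTC wake-up scenario, on_done would be the hook that powers the host off again.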

In the next few months, I would like to start adding laptop GPUs like the AMD Radeon 680M (gfx1035) and the AMD Radeon 780M (gfx1103) to the test matrix. The test systems for those will probably be mini PCs. They might be suitable for double duty as both CI workers and controllers for the local cluster of machines.

Sincerely,
Cory Bloor