Hi Christian,
On 2023-11-01 05:34, Christian Kastner wrote:
On 2023-11-01 03:46, Cordell Bloor wrote:
One failing test suite is that of rocfft, which times out after five
hours [2]. These old servers have terrible single-thread performance,
so it takes a long time to run the rocfft test suite.
FYI, the limit of 5h was arbitrarily chosen by me, so we can always
increase it.
The reason why I chose 5h, instead of 10h or 24h or whatever, is that
the limit is global, and we do have test suites where at least one test
hangs in an infinite loop, namely rocthrust [4]. (Incidentally, it
passes on gfx803, but in <2min, which must also be a bug).
Making timeouts more package-specific was a recent discussion on the
debian-ci list [5]. I've created an issue to add this to our debci [6];
in fact, this shouldn't be much work.
In the short term, we could just increase the global timeout until
rocfft can finish on Argo and Lyra. In the medium term, it would
make sense to have package-specific time limits. We may also want
to look into whether we can speed up FFTW, as the CPU reference
library is probably the vast majority of the runtime. We might be
able to cache some wisdom files or configure FFTW to use more than
one thread.
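To make the package-specific limits concrete, here is a minimal sketch
of the lookup. It is not how debci implements or will implement this;
the names GLOBAL_TIMEOUT, PACKAGE_TIMEOUTS and timeout_for are
hypothetical, and the 12h figure for rocfft is an assumed placeholder.
Only the 5h global limit comes from this thread.

```python
from datetime import timedelta

# The 5h global limit is the one discussed in this thread.
GLOBAL_TIMEOUT = timedelta(hours=5)

# Hypothetical per-package overrides; 12h is an assumed placeholder,
# not a measured runtime.
PACKAGE_TIMEOUTS = {
    "rocfft": timedelta(hours=12),
}


def timeout_for(package: str) -> timedelta:
    """Return the package-specific limit, else the global one."""
    return PACKAGE_TIMEOUTS.get(package, GLOBAL_TIMEOUT)


# rocfft gets more time; a hanging suite such as rocthrust is still
# cut off at the global limit.
print(timeout_for("rocfft"))
print(timeout_for("rocthrust"))
```

With this shape, raising the limit for one slow package does not extend
the wait for suites that hang.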
The rocsparse, rocblas, and rocsolver packages are also failing.
Those tests crash with the error "Illegal instruction" [3]. I've not
yet determined the cause of this problem, but it does not occur when
the QEMU CPU model is configured as pass-through. It's not clear to
me why this problem is not seen on the gfx1030 CI machine.
Found it: the autopkgtest QEMU machinery treats AMD and Intel CPUs
differently [7]. ci-worker-ckk{01,02} both have AMD CPUs and these are
passed through as-is. Intel CPUs have a more complex configuration, to
enable nested KVM.
Wow. Good job finding that one!
Though, I still wonder what the illegal instruction is. It would
be pretty handy if we could get a stack trace when the tests
crash. That would narrow the problem quite a bit, even without
interactive debugging.
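One way to get such a trace without interactive debugging would be to
run the crashing test under gdb in batch mode. The sketch below only
builds the command line. The helper name gdb_backtrace_command and the
test binary and filter in the usage lines are placeholders, and it
assumes gdb is installed in the test VM.

```python
import shlex


def gdb_backtrace_command(test_argv):
    """Wrap a test command so gdb reports where it crashed."""
    return [
        "gdb", "--batch",
        "-ex", "run",        # start the program; gdb stops on SIGILL
        "-ex", "bt",         # print the stack trace at the stop
        "-ex", "x/i $pc",    # disassemble the faulting instruction
        "--args", *test_argv,
    ]


# Placeholder test invocation, for illustration only.
cmd = gdb_backtrace_command(["./rocblas-test", "--gtest_filter=*quick*"])
print(shlex.join(cmd))
```

The x/i $pc step is there because the instruction at the program
counter is exactly what we want to identify after an "Illegal
instruction" crash.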
I think the nested KVM feature is used by the official ci.debian.net
(at least for some workers: Cloud -> Debian VM -> autopkgtest VM), so
it might be a model-specific thing.
In any case, making the CPU configurable seems like something that
might be worthwhile to add to src:autopkgtest. I could add that to our
fork for now. Would that help?
That would be useful. The illegal instruction may be an indicator
of a legitimate issue (or it might be kvm-specific), but we do not
want it to obscure all other test results.
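As a rough illustration of what "configurable" could mean here, the
sketch below picks the value for QEMU's -cpu option from an environment
variable. This is not the autopkgtest code: the variable name
AUTOPKGTEST_QEMU_CPU and the helper qemu_cpu_args are hypothetical, and
defaulting to "host" (pass-through, which requires KVM) is an
assumption.

```python
import os


def qemu_cpu_args(environ=os.environ):
    """Return the QEMU arguments that select the guest CPU model."""
    # Hypothetical variable name; "host" means CPU pass-through.
    model = environ.get("AUTOPKGTEST_QEMU_CPU", "host")
    return ["-cpu", model]


# Default: pass the host CPU through.
print(qemu_cpu_args({}))
# Override with a named QEMU model to compare behaviour.
print(qemu_cpu_args({"AUTOPKGTEST_QEMU_CPU": "qemu64"}))
```

Being able to switch models per worker would let us check whether the
illegal instruction follows the CPU configuration.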
I'll keep this in mind for the meta-worker I'm hacking on. The normal
debci worker is an unbounded listener (on one queue). The meta-worker
is a bounded listener (on multiple queues) and can perform actions when
boundaries are met.
In practice, that means that you could e.g. have the host wake up
periodically by BIOS RTC alarm, and have the meta-worker initiate
shutdown once all its relevant queues are empty.
That sounds nice.
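For my own understanding, the bounded-listener idea could be sketched
roughly like this. It is not the meta-worker's actual code: in-memory
queue.Queue objects stand in for the real job queues, and run_bounded,
handle_job and on_drained are hypothetical names.

```python
import queue


def run_bounded(queues, handle_job, on_drained):
    """Process jobs from several queues until all of them are empty."""
    processed = 0
    while True:
        progressed = False
        for q in queues:
            try:
                job = q.get_nowait()
            except queue.Empty:
                continue  # nothing here right now, try the next queue
            handle_job(job)
            processed += 1
            progressed = True
        if not progressed:
            break  # boundary met: every queue was empty in this pass
    on_drained()  # e.g. initiate shutdown of the host
    return processed


# Stand-in queues and actions, for illustration only.
gfx803, gfx900 = queue.Queue(), queue.Queue()
gfx803.put("rocfft")
gfx803.put("rocblas")
gfx900.put("rocsparse")

log = []
count = run_bounded([gfx803, gfx900], log.append,
                    lambda: log.append("shutdown"))
print(count, log)
```

An unbounded listener would simply never reach the on_drained step.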
In the next few months, I would like to start adding laptop GPUs
like the AMD Radeon 680M (gfx1035) and the AMD Radeon 780M
(gfx1103) to the test matrix. The test systems for those will
probably be mini-PCs. They might be suitable for doing double duty
as both CI workers and controllers for the local cluster of
machines.
Sincerely,
Cory Bloor