Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking



On Sat, Mar 19, 2022 at 10:41:39AM +0100, Salvatore Bonaccorso wrote:
> > From the upstream discussion on the linux-pci mailing list [*]:
> > 
> > > Yes. My understanding is that the issue is because AWS is using older
> > > versions of Xen. They are in the process of updating their fleet to a
> > > newer version of Xen so the change introduced with Stefan's commit
> > > isn't an issue any longer.
> > > 
> > > I think the changes are scheduled to be completed in the next 10-12
> > > weeks. For now we are carrying a revert in the Fedora Kernel.
> > > 
> > > You can follow this Fedora CoreOS issue if you'd like to know more
> > > about when the change lands in their backend. We work closely with one
> > > of their partner engineers and he keeps us updated.
> > > https://github.com/coreos/fedora-coreos-tracker/issues/1066
> > 
> > Ideally we can revert the upstream commit from the stable kernels, since
> > otherwise Debian users on AWS Xen instance types may be stuck using
> > older, unsafe kernels.  Especially if we have time to include the change
> > in the upcoming bullseye and buster point releases.  If the kernel
> > updates for those stable updates have already been built, though, it
> > might be too late to matter.  By the time we publish our next kernel
> > builds, the AWS Xen update may be complete.
> 
> Where can one track the update status for their Xen version directly,
> or is following the above the only reference?

It's just for reference; the deployment timeline isn't published.  As
far as I know, it's also subject to change in the event that unexpected
issues arise or it's preempted by some high severity issue.

> How frequent is this particular combination of hardware/software? We
> have had the change applied in bullseye for a while; buster has been
> newly impacted since the last update done for security fixes.

The impacted instance types aren't the most common, as they're not the
latest generation.  So I expect that the majority of the impact is felt
by people or organizations that haven't yet been able to make time to
switch to newer instance types.  The implication here, of course, is
that many of these deployments may be production environments where
stability is prioritized over migration to the new thing.

We get a little bit of data about what instance types are used with
Debian on AWS, but it's incomplete as it only reflects usage by AWS
customers who access Debian via the AWS Marketplace.  Consider it
something like popcon data; it's essentially opt-in.  If the data we get
from the Marketplace covering the past 3 days' worth of activity is
representative of the Debian usage in general, then it looks like
roughly 1% of Debian users on AWS are trying to use the impacted
instance types.

> Are there workarounds for the affected users of this combination? I
> see some options listed in https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye 

People can use newer generation instance types, which are not impacted.
Depending on the use case, that could be a trivial change, but it could
also be disruptive.  Newer instance types aren't based on Xen at all and
expose a different hardware device model to the instance.  Debian
supports the newer instance types, but the end user workload may still
need additional nontrivial qualification.
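
For the simple case, changing the instance type is a stop/modify/start
sequence.  A minimal sketch of that sequence follows, expressed as AWS
CLI invocations built in Python.  The instance ID and the m5.large
target type are illustrative assumptions, not taken from this bug
report, and the helper name resize_commands is hypothetical.  The sketch
only builds and prints the commands; it does not run them, and it does
not cover the workload qualification mentioned above.

```python
import shlex


def resize_commands(instance_id, new_type):
    """Return the AWS CLI invocations, in order, for an instance type change.

    The instance must be stopped before its type can be modified, so the
    sequence is: stop, wait until stopped, modify, start.
    """
    return [
        ["aws", "ec2", "stop-instances", "--instance-ids", instance_id],
        ["aws", "ec2", "wait", "instance-stopped", "--instance-ids", instance_id],
        ["aws", "ec2", "modify-instance-attribute", "--instance-id", instance_id,
         "--instance-type", "Value=" + new_type],
        ["aws", "ec2", "start-instances", "--instance-ids", instance_id],
    ]


if __name__ == "__main__":
    # Hypothetical instance ID and target type, for illustration only.
    for argv in resize_commands("i-0123456789abcdef0", "m5.large"):
        print(shlex.join(argv))
```

Printing the commands rather than executing them leaves room to review
each step before touching a production instance.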

> If we revert the commit, it reverts a fix for a bug with Marvell NVMe
> devices.
> 
> But we cannot just revert the commit for the cloud images.

Understood.

> If we know something about the release schedule from Amazon to update
> their Xen instances (which is the way to move forward, since upstream
> won't revert the commit) then we should leave the status as it is for
> bullseye (and now for buster). For bullseye there are
> CVE-2022-0847 fixes they would need to pick up.

Yes, the problem will go away when the Xen fleet is updated.  It sounds
like we're looking at roughly a 3 month timeline, after which point the
patch won't be a problem.  However, until then, people who need to use
Xen instances will be stuck either running an unsafe kernel or building
their own.

noah

