
Bug#1052697: marked as done (tech-ctte: proposed change to apt-listchanges algorithm needs expert consideration)



Your message dated Fri, 06 Oct 2023 20:47:23 +0100
with message-id <87fs2nbkj8.fsf@melete.silentflame.com>
and subject line Re: Bug#1052697: tech-ctte: proposed change to apt-listchanges algorithm needs expert consideration
has caused the Debian Bug report #1052697,
regarding tech-ctte: proposed change to apt-listchanges algorithm needs expert consideration
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
1052697: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1052697
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: tech-ctte
Severity: normal

Hello all,

I am hoping that this is an acceptable request to make of the
technical committee, for reasons which I explain below, but if not,
then I apologize for bothering you and you should kindly tell me to
buzz off. Thanks.

I'm in the process of adopting apt-listchanges, and as part of doing
that I am reviewing all the outstanding bugs. Bug #748631 is
particularly concerning to me, because it involves apt-listchanges
failing to display changelog entries that it should have.

That bug prompted me to do a deep dive into the algorithm that
apt-listchanges uses to determine which changelog and NEWS entries to
display to the user. After doing that, I don't think the current
algorithm is accurate enough, and I want to revamp it.

I have put a lot of thought into what to replace it with and come up
with a plan that I think is solid, but I've been in this business for
long enough to know that I certainly may be missing something
important.

I believe -- and maybe I'm wrong about this! -- that it's important
enough to accurately tell users what's changing during upgrades that
significantly changing the apt-listchanges algorithm is not something
I should do unilaterally, hence this request for y'all to review my
plan and let me know if I'm missing anything or if you think I'm
entirely off-base. If there's a group I should take this to instead of
the Technical Committee, please let me know; I'm still learning my way
around here.

My proposal is appended below. I wrote this from the point of view of
a fait accompli, i.e., as if the change has already been implemented,
because if I do end up implementing this, then I intend to commit the
proposal into the source tree as a design document to help whoever
comes after me understand why the program works the way it does.

Please let me know your thoughts.

Thank you,

Jonathan Kamens

---

# Determining which entries the user has already seen

## Historical perspective

Earlier versions of this program used the following approach to
determine which changelog or NEWS entries (hereafter "entries") are new
and should be displayed to the user:

- Group packages by source package.
- Keep track of the highest version number of any of the packages in
  the group and use that as the threshold for identifying new entries.
- Display any entries with version numbers not less than the
  previously determined version number.

This approach was based on two assumptions, neither of which is always
true:

1. Assume that the version numbering for all packages that come from
   the same source package is in the same series.
2. Assume that the version numbering of entries always matches the
   aforementioned version numbering.

For an example of where these assumptions break down, look at the
dmsetup package:

- The source package for dmsetup is lvm2.
- The version number for the dmsetup package is lower, but with a
  higher epoch than, the version number for the lvm2 package.
- The entries in changelog.Debian.gz use lvm2 version numbers, while
  the ones in changelog.Debian.devmapper.gz use dmsetup version
  numbers.
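
To make the failure concrete, here is a rough sketch of the historical
threshold logic applied to this case. It is not the actual
apt-listchanges code: the version strings are illustrative, and
`version_key` is a hypothetical, heavily simplified stand-in for real
Debian version comparison (it handles only the epoch and a dotted
numeric upstream version).

```python
def version_key(version: str) -> tuple[int, tuple[int, ...]]:
    """Simplified sort key: (epoch, numeric upstream components).

    Assumption: real Debian version comparison is more involved; this
    ignores the Debian revision and any non-numeric parts.
    """
    epoch, _, rest = version.rpartition(":")
    upstream = rest.split("-")[0]
    return (int(epoch or 0), tuple(int(part) for part in upstream.split(".")))


# Installed binary packages built from the lvm2 source package
# (illustrative version numbers).
installed = {"lvm2": "2.03.16-2", "dmsetup": "2:1.02.185-2"}

# Historical approach: the highest version in the group is the threshold.
threshold = max(installed.values(), key=version_key)

# Entries in changelog.Debian.gz carry lvm2-style versions with no epoch.
incoming_entries = ["2.03.22-1", "2.03.16-2"]
shown = [v for v in incoming_entries if version_key(v) >= version_key(threshold)]
```

Because the dmsetup epoch dominates the comparison, `threshold` ends up
being the dmsetup version, and every lvm2-numbered entry falls below
it, so nothing is shown.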

This approach was also limited in that it only looked at
NEWS.Debian[.gz], changelog.Debian[.gz], changelog.Debian.arch[.gz],
and changelog[.gz]. For an example of where this fails, again look at
dmsetup, which has changelog.Debian.devmapper.gz.

Another technique used in earlier versions of this program was to
attempt heuristically to ignore version number suffixes which should
not be considered when evaluating whether a particular entry was new.
The employed heuristics were brittle, potentially leading to missed
entries or entries displayed multiple times.

## Current approach

The current approach abandons the dependency on version numbers and
relies instead on entry checksums.

The program maintains a persistent database of previously seen
changelog entries containing the following data:

- The checksum of the most recently seen entry in each changelog file,
  including the special file "/network/[package]" for changelog
  entries fetched over the network for the package named [package].
- The checksums of all entries seen in the past three years, _with the
  header line of each entry removed_, for each source package.
  
We index content by source package because the same changelog entries
frequently appear in multiple binary packages built from the same
source package, and we only want the user to see those once.

We remove the header line of each entry in the second set of checksums
because sometimes a package version uploaded to stable and a different
version uploaded to unstable use different header lines for the same
changelog entry.
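
A minimal sketch of the two kinds of checksum, assuming SHA-256 (this
document does not specify a hash algorithm) and a hypothetical helper
named `entry_checksum`. The changelog entry is made up for
illustration.

```python
import hashlib


def entry_checksum(entry: str, strip_header: bool = False) -> str:
    """Checksum an entry, optionally ignoring its first (header) line."""
    lines = entry.strip().splitlines()
    if strip_header:
        lines = lines[1:]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


stable = """pkg (1.0-1+deb12u1) bookworm; urgency=medium

  * Fix a crash on startup.

 -- A Maintainer <maint@example.org>  Mon, 02 Oct 2023 10:00:00 +0000"""

# Same change uploaded to unstable under a different header line.
unstable = stable.replace("(1.0-1+deb12u1) bookworm", "(1.0-2) unstable")

full_differs = entry_checksum(stable) != entry_checksum(unstable)
body_matches = entry_checksum(stable, True) == entry_checksum(unstable, True)
```

With the header line included, the two uploads look like different
entries; with it removed, they are recognized as the same one.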

Given this stored data, the filtering algorithm is simple: Ignore any
entry whose content checksum is in the database, and stop reading a
file when the checksum for the most recently seen entry is reached.

The database used by the current approach is significantly larger than
the database required for the historical approach -- a few megabytes
vs. a few kilobytes -- but it is still relatively small and we consider
this an acceptable amount of space to use for a significantly
better-performing algorithm.

Because this approach uses entry checksums, it is able to include
entries from files like changelog.Debian.devmapper that the historical
approach ignored.

### Edge case: no database, or no data for a file in the database

When the persistent database is not being used in a particular
invocation of the program, or when there is no data for a particular
file in the database, then the above approach requires modification.

In this case, we read the copy of the file already installed at the
same path on disk and calculate checksums for its entries to seed the
database before we parse the file in the package.
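
A rough sketch of that seeding step, assuming the database is a plain
dict mapping a file path to the checksum of its newest entry. The names
`seed_from_disk` and `split_entries`, the header regular expression,
and the choice of SHA-256 are all assumptions made for illustration.

```python
import gzip
import hashlib
import re
from pathlib import Path

# Assumed shape of an entry header line: "package (version) ..."
HEADER = re.compile(r"^\S+ \(.+\) ")


def split_entries(text: str) -> list[str]:
    """Split changelog text into entries, newest first as in the file."""
    entries, current = [], []
    for line in text.splitlines():
        if HEADER.match(line) and current:
            entries.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current).strip())
    return entries


def seed_from_disk(path: Path, database: dict[str, str]) -> None:
    """Record the newest installed entry if the file is not yet known."""
    if str(path) in database or not path.exists():
        return
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        entries = split_entries(handle.read())
    if entries:
        newest = entries[0].encode("utf-8")
        database[str(path)] = hashlib.sha256(newest).hexdigest()
```

Once the installed copy has been recorded this way, the file in the new
package can be filtered as usual, stopping at the recorded entry.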

### Edge case: no database, changelog data from network

When the persistent database is not being used in a particular
invocation of the program, and the changelog data for a package is
being fetched over the network because it is not present in the
package, there is no reliable way to determine which changelog entries
have been displayed already, so the program displays all of them.

This is sufficiently rare, both because the program is usually used
with a persistent database and because there are relatively few
packages without embedded changelogs, that it is considered an
acceptable performance degradation to exchange for better overall
performance.

It is also preferable to the historical approach because it errs by
displaying extra information to the user rather than by failing to
display data that it should have.

--- End Message ---
--- Begin Message ---
Hello,

On Tue 26 Sep 2023 at 06:11am -04, Jonathan Kamens wrote:

> I am hoping that this is an acceptable request to make of the
> technical committee, for reasons which I explain below, but if not,
> then I apologize for bothering you and you should kindly tell me to
> buzz off. Thanks.

I'm going to close the bug because we wouldn't want to apply ourselves
to it as a whole committee before the topic is first discussed in a
venue like debian-devel@lists.debian.org.

It's fine to continue discussion here on the bug despite it being
closed, but you might want to move it to d-devel, with a reference to
this bug.

Thank you for these efforts.

-- 
Sean Whitton



--- End Message ---
