Re: deduplicating file systems: VDO with Debian?
On Mon, 2022-11-07 at 16:29 +0100, hede wrote:
> On 07.11.2022 02:57, hw wrote:
> > Hi,
> >
> > Is there no VDO in Debian, and what would be good to use for
> > deduplication with Debian? Why isn't VDO in the standard kernel?
> > Or is it?
>
> I used VDO in Debian some time ago and don't remember any big
> problems. AFAIR I compiled it myself - no prebuilt packages.
Cool, I could give that a try, ty.
> I switched to btrfs for other reasons. Not even for performance. The VDO
> Layer eats performance, yes, but compared to naked ext4 even btrfs is
> slow.
Really? I never noticed that btrfs would be slow. But then, it's been a long
time since I used ext4 ...
> > There is no point in deduplicating backups after they're done because
> > I don't need to save disk space for them when I can fit them in the
> > first place.
>
> That's only one point.
What are the others?
> And it's not really a valid one, I think, as
> you typically do not run into space problems with one single action
> (YMMV). Running multiple sessions and out-of-band deduplication between
> them works for me.
That still requires you to have enough disk space for at least two full backups.
I can see it working for three backups because you can deduplicate the first
two, but not for two. And why would I deduplicate when I have sufficient disk
space?
> In-band deduplication (that's the one you want) has some drawbacks, too:
> High resource usage. You need plenty of RAM (up to several gigabytes
> per terabyte of storage) and write success is delayed (-> slow direct I/O).
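To put a rough number on the RAM claim: an in-band deduplicator has to look up
every incoming block in an index, so the index grows with the number of blocks.
A back-of-the-envelope sketch follows. The 4 KiB block size and the 16 bytes per
index entry are illustrative assumptions, not figures taken from VDO or any
other implementation, and the function name is made up for this example.

```python
TIB = 1 << 40
GIB = 1 << 30


def dedup_index_bytes(storage_bytes, block_size=4096, bytes_per_entry=16):
    """Rough size of a block index that holds one entry per stored block.

    block_size and bytes_per_entry are assumed values for illustration only;
    real implementations differ and may keep only part of the index in RAM.
    """
    blocks = storage_bytes // block_size
    return blocks * bytes_per_entry


if __name__ == "__main__":
    size = dedup_index_bytes(1 * TIB)
    print(f"{size / GIB:.1f} GiB of index per TiB of storage")
```

With these assumed values, 1 TiB of storage comes out at 4 GiB of index, which
is in line with "several gigabytes per terabyte". Larger blocks shrink the index
proportionally, at the cost of finding fewer duplicates.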
Well, if it takes 5 days or so to make a backup, that won't be very useful. It
takes more than long enough already because my discs can only sustain so much.
> For out-of-band deduplication there are multiple different
> implementations. File-based dedup on a directory basis can be very fast
> and resource economical, for example via rdfind or jdupes. Block-based
> dedup, like via bees for btrfs (that's the one I use), is closer to in-band
> deduplication (including high RAM usage). Bees can be switched off and
> on at any time (for example if it's a small home-system which runs more
> demanding tasks from time to time) and switching it on again resumes at
> the last state (it starts at the last transaction id which was processed
> -> btrfs knows its transactions).
Hm. I wouldn't mind running it from time to time, though I don't know that I
would have a lot of duplicate data other than backups. How much space might I
expect to gain from using bees, and how much memory does it require to run?
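
One rough way to get a feel for the possible gain before installing anything
would be to count how many bytes sit in byte-identical files. The sketch below
does that in the file-based style of rdfind or jdupes (group by size, then by
hash); it is not how those tools or bees are implemented, and the function
names are made up for this example. It only sees whole-file duplicates, so a
block-based deduplicator could find more than this reports.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reclaimable_bytes(root):
    """Estimate the bytes that whole-file deduplication could free under root."""
    by_size = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, device nodes, ...
            inode = (st.st_dev, st.st_ino)
            if inode in seen_inodes:
                continue  # hard link to a file already counted
            seen_inodes.add(inode)
            by_size[st.st_size].append(path)

    saved = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # only files of equal, non-zero size can be duplicates
        by_hash = defaultdict(int)
        for path in paths:
            by_hash[file_digest(path)] += 1
        # Every copy beyond the first in a group is reclaimable.
        saved += sum(size * (count - 1) for count in by_hash.values())
    return saved
```

Calling reclaimable_bytes() on the directory that holds the backups returns the
estimate in bytes. It has to read every file that shares its size with another
one, so on large backups it will take a while.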