
SMART Uncorrectable_Error_Cnt rising - should I be worried?



This is not directly Debian-related, except insofar as the system
involved is running Debian, but we've already had a somewhat similar
thread recently and this forum is as likely as any I'm aware of to have
people who might have the experience to address the question(s). I would
be open to recommendations for alternate / better forums for this
inquiry, if people have such.


For background: I have an eight-drive RAID-6 array of 2TB SSDs, built
back in early-to-mid 2021.

Until recently, as far as I'm aware, there had not been any problems
related to it.


Within the past few weeks, I got root-mail notifications from smartd
that the ATA error count on two of the drives had increased - one from 0
to a fairly low value (I think between 10 and 20), the other from 0 to
1. I figured this was nothing to worry about - because of the relatively
low values, because the other drives had not shown any such thing, and
because of the expected stability and lifetime of good-quality SSDs.


On Sunday (two days ago), I got root-mail notifications from smartd
about *all* of the drives in the array. This time, the total error
counts had gone up to values in the multiple hundreds per drive. Since
then (yesterday), I've also gotten further notification mails about at
least one of the drives increasing further. So far today I have not
gotten any such notifications.

One thing I don't know, which may or may not be important, is whether
these alert mails are being triggered when the error-count increase
happens, or when a scheduled check of some type is run. If it's the
latter, then it might be that there's a monthly check and that's the
reason why all eight drives got mails sent at once, but if it's the
former, then the so-close-in-time alerts from all eight drives would
seem more likely to reflect a real problem.


I've looked at the SMART attributes for the drives, and am having a hard
time determining whether or not there's anything worth being actually
concerned about here. Some of the information I'm seeing seems to
suggest yes, but other information seems to suggest no.

Relevant-seeming excerpts from the output of 'smartctl -a' on one of the
drives are attached (rather than inline, to avoid line-wrapping). I can
provide full output of that command for that drive, or even for all of
the drives, if desired.


Things that seem to suggest that there may be reason to be concerned
include, but may not be limited to:

The Uncorrectable_Error_Cnt, which is the value referenced by the alert
mails, has risen well above its apparent previous value of 0, and it
looks likely to keep rising.

The Runtime_Bad_Block count is nonzero.

The ECC_Error_Rate is nonzero (and, at least in the case of this
specific drive, also equal to the Uncorrectable_Error_Cnt).

Most of the attributes are listed as of type "Old_age". That strikes me
as unexpected; two and a half years of mostly-read-based operation does
not seem like enough to qualify an SSD as "old", although my expectations
here may well be off. (I would be inclined to expect five-to-ten years
of operation out of a non-defective drive, assuming reasonable physical
treatment otherwise, if not considerably more.)

As mentioned above, the increase in Uncorrectable_Error_Cnt has happened
at nearly the same time (relative to drive installation date) for all
the drives, and for some of the drives it seems to be continuing to
increase.
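
For reference, the "two and a half years" figure mentioned above can be
checked against the drive's own Power_On_Hours counter. This is only a
small sketch; it assumes the raw value is a plain count of hours and
that the drive has been powered on more or less continuously, and the
variable names are mine.

```python
# Raw value copied from the attached smartctl excerpt (attribute 9).
POWER_ON_HOURS = 22286

# Assumption: the raw value is a simple count of powered-on hours.
HOURS_PER_YEAR = 24 * 365.25

years_powered_on = POWER_ON_HOURS / HOURS_PER_YEAR
print(f"{POWER_ON_HOURS} hours is about {years_powered_on:.2f} years")
```

That comes out to roughly 2.54 years of powered-on time.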


I don't know how to interpret the "Pre-fail" notation for the other
attributes. That terminology could be intended to mean "This drive has
entered the final stage before failure, and its failure is expected to
be imminent" - or it could equally well be the status that the
attributes *start* in, with the intended meaning "This drive has not yet
reached a stage where there is any reason to think it might fail".


Things that seem to suggest that there may *not* be a reason to be
concerned include, but may not be limited to:

The "VALUE" column for each of the attributes remains high; most are in
the range from 098 to 100, and excluding the Airflow_Temperature_Cel
figure, the lowest is 095, for Power_On_Hours. From what I've managed to
find in reading online, this column is typically a percentage value,
with lower percentages indicating that the drive is closer to failure.

The Total_LBAs_Written value, when combined with the Sector Size,
results (if my math is correct) in a total-data-written figure of
between 3TB and 4TB. That should be *well* under the advertised write
endurance of this drive, given that the drive is 2TB and (both IIRC and
from what I've found in reading up on such things again after these
errors started to occur) those advertised values for similar-capacity
drives seem to start in the hundreds of TB and go up.
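
In case anyone wants to check my arithmetic, here is a minimal Python
sketch of the calculation. It assumes that the raw Total_LBAs_Written
value counts logical sectors of the 512-byte size that smartctl reports
for this drive; I have not confirmed that assumption, and the variable
names are just mine.

```python
# Raw values copied from the attached smartctl excerpt.
TOTAL_LBAS_WRITTEN = 6950839497   # attribute 241, RAW_VALUE
SECTOR_SIZE_BYTES = 512           # "Sector Size: 512 bytes logical/physical"

# Assumption: each counted LBA corresponds to one 512-byte logical sector.
bytes_written = TOTAL_LBAS_WRITTEN * SECTOR_SIZE_BYTES

tb_written = bytes_written / 10**12    # decimal terabytes, as vendors count them
tib_written = bytes_written / 2**40    # binary tebibytes

print(f"{bytes_written} bytes = {tb_written:.2f} TB = {tib_written:.2f} TiB")
```

That works out to roughly 3.56 TB (about 3.24 TiB), which is where the
"between 3TB and 4TB" figure comes from.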


So... as the Subject asks, should I be worried? How do I interpret these
results, and at what point do they start to reflect something to take
action over? If there is no reason to be worried, what *do* these
alerts indicate, and at what point *should* I start to be worried about
them?

I already *am* worried, to the point of having heartburn and difficulty
sleeping over the possibility of data loss (there's enough on here that
external backup would be somewhat difficult to arrange), but I'm not
sure whether or not that is warranted.

My default plan is to identify an appropriate model and buy a pair of
replacement drives, but not install them yet; buy another two drives
every six months, until I have a full replacement set; and start failing
drives out of the RAID array and installing replacements as soon as one
either fails, or looks like it's imminently about to fail. But if the
mass notification mails indicate that all eight are nearing failure,
that might not be enough - and if they don't indicate any likelihood of
failure this year, then buying replacement drives now might be
premature.

Which drives I choose to buy as replacements would also be influenced by
how likely it is that this indicates impending failure. If it doesn't,
then drives similar to what I already have would probably still be
appropriate; if it does, then I'm going to want to go up-market and buy
long-endurance drives intended for high uptime - i.e., data-center
storage drives, which are likely to be more expensive.

-- 
   The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man.         -- George Bernard Shaw
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 2TB
Serial Number:    S620NJ0R410888A
LU WWN Device Id: 5 002538 f31440901
Firmware Version: SVT01B6Q
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  9 07:32:13 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       31
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       22286
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       11
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   098   098   010    Pre-fail  Always       -       31
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   098   098   010    Pre-fail  Always       -       31
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       598
190 Airflow_Temperature_Cel 0x0032   069   050   000    Old_age   Always       -       31
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       598
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       21
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       6950839497
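
As a rough way to see how far each attribute's normalized VALUE sits
above its THRESH, rows like the ones above could be parsed along these
lines. This is only a sketch, assuming the ten-column layout that
smartctl printed here; the function name parse_attributes and the
"margin" field are made up for illustration, and only a few rows are
pasted in to keep it short.

```python
# A few rows copied verbatim from the attribute table above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       31
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       598
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       598
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       6950839497
"""


def parse_attributes(text):
    """Parse smartctl attribute rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers, blank lines, anything unexpected
        value, thresh = int(fields[3]), int(fields[5])
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": value,
            "thresh": thresh,
            "type": fields[6],
            "raw": fields[9].strip(),
            "margin": value - thresh,  # how far VALUE sits above THRESH
        })
    return rows


for row in parse_attributes(SAMPLE):
    print(f'{row["id"]:>3} {row["name"]:<24} margin={row["margin"]:>3} raw={row["raw"]}')
```

The same function could be pointed at the full attribute output for each
of the eight drives to compare them side by side.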


