
[Nbd] Bug with large reads and protocol issue



I have found an interesting problem with large reads.

I have been trying to ascertain what the correct protocol is
for read errors.

What nbd-server currently does is process the read in chunks
of BUF_SIZE. If any chunk errors, it sends an error
response. This is problematic because the client cannot
correctly process an error response sent half-way
through a stream of data blocks; it causes the connection
to hang. As the error code may be interpreted as data,
which might then be acted upon, it is theoretically possible
that this could cause corruption (though this is unlikely
with the current client, as the error response is so
much smaller than a block).
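
To illustrate, the logic is roughly as follows. This is only a
sketch of the behaviour described above; the helper names are mine
and do not come from the actual nbd-server source:

  /* Sketch only: helper names are invented, not real nbd-server code.
   * The reply header (error == 0) goes out before the first chunk, so
   * a later error reply lands in the middle of what the client is
   * treating as read data. */
  send_reply_header(client, handle, 0);       /* "success", sent early */
  while (len > 0) {
      size_t chunk = len < BUF_SIZE ? len : BUF_SIZE;
      if (read_backend(client, buf, offset, chunk) < 0) {
          send_error_reply(client, handle, EIO);  /* sent mid-stream */
          return;                                 /* client hangs */
      }
      send_data(client, buf, chunk);          /* raw data, no framing */
      offset += chunk;
      len    -= chunk;
  }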

Reading the protocol, there is only one possible interpretation
of what is meant to happen (as far as I can tell). Either
the response is meant to error, in which case no data is
sent at all, or the response does not error, in which case
all the data is meant to be sent. There is (rightly) no
"send half the data and an error" variant.

But this is really problematic for the reason set out below.

Let's suppose that a given server can handle large reads
efficiently. What I want to do is start sending the data down the
TCP channel before I've read all of it. This is in fact
what nbd-server attempts to do right now if the read is
bigger than BUF_SIZE.

The problem occurs if a read other than the first errors (or,
more accurately, if any read errors after we have sent any data).
How do we represent that error to the client? We've already
told it that the operation has succeeded.

To do proper error handling (which nbd-server doesn't, as far
as I can tell), we'd need to hold the whole read in memory before
replying, which is (a) memory inefficient, and (b) throughput
inefficient, as no data could be sent until the entire backend
read had completed.
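
The fully-buffered version would look something like this (again a
sketch with made-up helpers), which shows where both costs come from:

  /* Hypothetical fully-buffered read: the whole backend read completes
   * before the reply header is built, so the error field is accurate,
   * but we use 'len' bytes of memory and send nothing until the end. */
  char *buf = malloc(len);
  uint32_t err = 0;

  if (buf == NULL || read_backend(client, buf, offset, len) < 0)
      err = EIO;
  send_reply_header(client, handle, err);
  if (err == 0)
      send_data(client, buf, len);
  free(buf);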

One answer to this is "don't use large reads, then". However,
in certain situations (e.g. servers that can parallelize
requests), it's far more efficient to do larger reads.
Even now, we wait until a large amount of data has been read
before sending any.

Given that errors are really unlikely in the great scheme of
things, a relatively low-overhead solution would be to send the
read data followed by the error code again at the end (we could
signal this by returning "EDONTKNOW" or something similar in the
original error field). If this trailing error code was non-zero,
the client would discard all the data and use it as the error
code. This would waste 4 octets on every read reply where
EDONTKNOW was used, which would only be large read requests.
Obviously, as the real error code would not arrive until the end
of a large read, an early error would mean sending a large amount
of junk over TCP before it could be reported, but this is hardly
a problem.
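
On the server side that would come out roughly as below. EDONTKNOW
and the helpers are hypothetical; the point is the wire layout: a
normal header carrying EDONTKNOW, the usual len bytes of data (junk
past any failure point), then the trailing 4-octet error code.

  /* Sketch of the proposed reply; EDONTKNOW and helpers are hypothetical. */
  send_reply_header(client, handle, EDONTKNOW);  /* "result not yet known" */

  uint32_t final_error = 0;
  while (len > 0) {
      size_t chunk = len < BUF_SIZE ? len : BUF_SIZE;
      if (final_error == 0 && read_backend(client, buf, offset, chunk) < 0)
          final_error = EIO;      /* keep streaming junk to preserve framing */
      send_data(client, buf, chunk);
      offset += chunk;
      len    -= chunk;
  }
  send_u32(client, final_error);  /* the extra 4 octets; 0 on success */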

Whilst in theory we'd need to signal EDONTKNOW support, in practice
large reads (> BUF_SIZE) are pretty dodgy already, in that any error
will cause a disconnect. Paul suggests we never get them anyway
due to kernel request size limitations, though Wouter seems
skeptical. So I am tempted just to put EDONTKNOW support
into nbd-server and the kernel without any signalling. There
cannot be many people using large reads reliably, as prior to the
last release they were full of all sorts of, um, interesting
features.

--
Alex Bligh


