Ladvise Lock Ahead design

Lock ahead is a new Lustre feature aimed at solving a long standing problem
with shared file write performance in Lustre.  It requires client and server
support.  It will be used primarily via the MPI-I/O library, not directly from
user applications.

The first part of this document (sections 1 and 2) is an overview of the
problem and high level description of the solution.  Section 3 explains how the
library will make use of this feature, and sections 4 and 5 describe the design
of the Lustre changes.

1. Overview: Purpose & Interface
Lock ahead is intended to allow optimization of certain I/O patterns which
would otherwise suffer LDLM* lock contention.  It allows applications to
manually request locks on specific extents of a file, avoiding the usual
server side optimizations. This applications which know their I/O pattern to
use that information to avoid false conflicts due to server side optimizations.

*Lustre distributed lock manager.  This is the locking layer shared between
clients and servers, to manage access between clients.

Normally, clients get locks automatically as the first step of an I/O.
The client asks for a lock which covers exactly the area of interest (ie, a
read or write lock of n bytes at offset x), but the server attempts to optimize
this by expanding the lock to cover as much of the file as possible.  This is
useful for a single client, but can be trouble for multiple clients.

In cases where multiple clients wish to write to the same file, this
optimization can result in locks that conflict when the actual I/O operations
do not.  This requires clients to wait for one another to complete I/O, even
when there is no conflict between actual I/O requests.  This can significantly
reduce performance (Anywhere from 40-90%, depending on system specs) for some
workloads.

The lockahead feature makes it possible to avoid this problem by acquiring the
necessary locks in advance, by explicit requests with server side extent
changes disabled.  We add a new lfs advice type, LU_LADVISE_LOCKAHEAD,
which allows lock requests from userspace on the client, specifying the extent
and the I/O mode (read/write) for the lock.  These lock requests explicitly
disable server side changes to the lock extent, so the lock returned to the
client covers only the extent requested.

When using this feature, clients which intend to write to a file can request
locks to cover their I/O pattern, wait a moment for the locks to be granted,
then write or read the file.

In this way, a set of clients which knows their I/O pattern in advance can
force the LDLM layer to grant locks appropriate for that I/O pattern.  This
allows applications which are poorly handled by the default lock optimization
behavior to significantly improve their performance.

2. I/O Pattern & Locking problems
2. A. Strided writing and MPI-I/O
There is a thorough explanation and overview of strided writing and the
benefits of this functionality in the slides from the lock ahead presentation
at LUG 2015.  It is highly recommended to read that first, as the graphics are
much clearer than the prose here.

See slides 1-13:
http://wiki.lustre.org/images/f/f9/Shared-File-Performance-in-Lustre_Farrell.pdf

MPI-I/O uses strided writing when doing I/O from a large job to a single file.
I/O is aggregated from all the nodes running a particular application to a
small number of I/O aggregator nodes which then write out the data, in a
strided manner.

In strided writing, different clients take turns writing different blocks of a
file (A block is some arbitrary number of bytes).  Client 1 is responsible for
writes to block 0, block 2, block 4, etc., client 2 is responsible for block 1,
block 3, etc.

Without the ability to manually request locks, strided writing is set up in
concert with Lustre file striping so each client writes to one OST.  (IE, for a
file striped to three OSTs, we would write from three clients.)

The particular case of interest is when we want to use more than one client
per OST.  This is important, because an OST typically has much more bandwidth
than one client.  Strided writes are non-overlapping, so they should be able to
proceed in parallel with more than one client per OST.  In practice, on Lustre,
they do not, due to lock expansion.

2. B. Locking problems
We will now describe locking when there is more than one client per OST.  This
behavior is the same on a per OST basis in a file striped across multiple OSTs.
When the first client asks to write block 0, it asks for the required lock from
the server.  When it receives this request, the server sees that there are no
other locks on the file.  Since it assumes the client will want to write to the
file again, the server expands the lock as far as possible.  In this case, it
expands the lock to the maximum file size (effectively, to infinity), then
grants it to client 1.

When client 2 wants to write block 1, it conflicts with the expanded lock
granted to client 1.  The server then must revoke (In Lustre terms,
'call back') the lock granted to client 1 so it can grant a lock to client 2.
After the lock granted to client is revoked, there are no locks on the file.
The server sees this when processing the lock request from client 2, and
expands that lock to cover the whole file.

Client 1 then wishes to write block 3 of the file...  And the cycle continues.
The two clients exchange the extended lock throughout the write, allowing only
one client to write at a time, plus latency to exchange the lock.  The effect is
dramatic: Two clients are actually slower than one.  (Similar behavior is seen
with more than two clients.)

The solution is to use this new advice type to acquire locks before they are
needed.  In effect, before it starts writing to the file, client 1 requests
locks on block 0, block 2, etc. It locks 'ahead' a certain (tunable) number of
locks. Client 2 does the same.  Then they both begin to write, and are able to
do so in parallel.  A description of the actual library implementation follows.

3. Library implementation
Actually implementing this in the library carries a number of wrinkles.
The basic pattern is this:
Before writing, an I/O aggregator requests a certain number of locks on blocks
that it is responsible for.  It may or may not ever write to these blocks, but
it takes locks knowing it might.  It then begins to write, tracking how many of
the locks it has used.  When the number of locks 'ahead' of the I/O is low
enough, it requests more locks in advance of the I/O.

For technical reasons which are explained in the implementation section, these
lock requests are either asynchronous and non-blocking or synchronous and
blocking.  In Lustre terms, non-blocking means if there is already a lock on
the relevant extent of the file, the manual lock request is not granted.  This
means that if there is already a lock on the file (quite common; imagine
writing to a file which was previously read by another process), these lock
requests will be denied.  However, once the first 'real' write arrives that
was hoping to use a lockahead lock, that write will cause the blocking lock to
be cancelled, so this interference is not fatal.

It is of course possible for another process to get in the way by immediately
asking for a lock on the file.  This is something users should try to avoid.
When writing out a file, repeatedly trying to read it will impact performance
even without this feature.

These interfering locks can also happen if a manually requested lock is, for
some reason, not available in time for the write which intended to use it.
The lock which results from this write request is expanded using the
normal rules.  So it's possible for that lock (depending on the position of
other locks at the time) to be extended to cover the rest of the file.  That
will block future lockahead locks.

The expanded lock will be revoked when a write happens (from another client)
in the range covered by that lock, but the lock for that write will be expanded
as well - And then we return to handing the lock back and forth between
clients.  These expanded locks will still block future lockahead locks,
rendering them useless.

The way to avoid this is to turn off lock expansion for I/Os which are
supposed to be using these manually requested locks.  That way, if the
manually requested lock is not available, the lock request for the I/O will not
be expanded.  Instead, that request (which is blocking, unlike a lockahead
request) will cancel any interfering locks, but the resulting lock will not be
expanded.  This leaves the later parts of the file open, allowing future
manual lock requests to succeed.  This means that if an interfering lock blocks
some manual requests, those are lost, but the next set of manual requests can
proceed as normal.

In effect, the 'locking ahead of I/O' is interrupted, but then is able to
re-assert itself. The feature used here is referred to as 'no expansion'
locking (as only the extent required by the actual I/O operation is locked)
and is turned on with another new ladvise advice, LU_LADVISE_NOEXPAND.  This
feature is added as part of the lockahead patch.  The strided writing library
will use this advice on the file descriptor it uses for writing.

4. Client side design
4. A. Ladvise lockahead
Requestlock uses the existing asynchronous lock request functionality
implemented for asynchronous glimpse locks (AGLs), a long standing Lustre
feature.  AGLs are locks which are requested by statahead, which are used to
get file size information before it's requested.  The key thing about an
asynchronous lock request is that it does not have a specific I/O operation
waiting for the lock.

This means two key things:

1. There is no OSC lock (lock layer above LDLM for data locking) associated
with the LDLM lock
2. There is no thread waiting for the LDLM lock, so lock grant processing
must be handled by the ptlrpc daemon thread which received the reply

Since both of these issues are addressed by the asynchronous lock request code
which lockahead shares with AGL, we will not explore them in depth here.

Finally, lockahead requests set the CEF_LOCK_NO_EXPAND flag, which tells the
OSC (the per OST layer of the client) to set LDLM_FL_NO_EXPANSION on any lock
requests.  LDLM_FL_NO_EXPANSION is a new LDLM lock flag which tells the server
not to expand the lock extent.

This leaves the user facing interface.  Requestlock is implemented as a new
ladvise advice, and it uses the ladvise feature of multiple advices in one API
call to put many lock requests in to an array of advices.

The arguments required for this advice are a mode (read or write), range (start
and end), and flags.

The client will then make lock requests on these extents, one at a time.
Because the lock requests are asynchronous (replies are handled by ptlrpcd),
many requests can be made quickly by overlapping them, rather than waiting for
each one to complete.  (This requires that they be non-blocking, as the
ptlrpcd threads must not wait in the ldlm layer.)

4. B. LU_LADVISE_LOCKNOEXPAND
The lock no expand ladvise advice sets a boolean in a Lustre data structure
associated with a file descriptor.  When an I/O is done to this file
descriptor, the flag is picked up and passed through to the ldlm layer, where
it sets LDLM_FL_NO_EXPANSION on lock requests made for that I/O.

5. Server side changes
Implementing lockahead requires server support for LDLM_FL_NO_EXPANSION, but
it also required an additional pair of server side changes to fix issues which
came up because of lockahead.  These changes are not part of the core design
instead, they are separate fixes which are required for it to work.

5. A. Support LDLM_FL_NO_EXPANSION

Disabling server side lock expansion is done with a new LDLM flag.  This is
done with a simple check for that flag on the server before attempting to
expand the lock.  If the flag is found, lock expansion is skipped.

5. B. Implement LDLM_FL_SPECULATIVE

As described above, lock ahead locks are non-blocking. The BLOCK_NOWAIT LDLM
flag is used now to implement some nonblocking behavior, but it only considers
group locks blocking.  But, for asynchronous lock requests to work correctly,
they cannot wait for any other locks.  For this purpose, we add
LDLM_FL_SPECULATIVE.  This new flag is used for asynchronous lock requests,
and implements the broader non-blocking behavior they require.

5. C. File size & ofd_intent_policy changes

Knowing the current file size during writes is tricky on a distributed file
system, because multiple clients can be writing to a file at any time.  When
writes are in progress, the server must identify which client is currently
responsible for growing the file size, and ask that client what the file size
is.

To do this, the server uses glimpse locking (in ofd_intent_policy) to get the
current file size from the clients.  This code uses the assumption that the
holder of the highest write lock (PW lock) knows the current file size.  A
client learns the (then current) file size when a lock is granted.  Because
only the holder of the highest lock can grow a file, either the size hasn't
changed, or that client knows the new size; so the server only has to contact
the client which holds this lock, and it knows the current file size.

Note that the above is actually racy. When the server asks, the client can
still be writing, or another client could acquire a higher lock during this
time.  The goal is a good approximation while the file is being written, and a
correct answer once all the clients are done writing.  This is achieved because
once writes to a file are complete, the holder of that highest lock is
guaranteed to know the current file size.  This is where manually requested
locks cause trouble.

By creating write locks in advance of an actual I/O, lockahead breaks the
assumption that the holder of the highest lock knows the file size.

This assumption is normally true because locks which are created as part of
IO - rather than in advance of it - are guaranteed to be 'active', IE,
involved in IO, and the holder of the highest 'active' lock always knows the
current file size, because the size is either not changing or the holder of
that lock is responsible for updating it.

Consider:  Two clients, A and B, strided writing.  Each client requests, for
example, 2 manually requested locks.  (Real numbers are much higher.)  Client A
holds locks on segments 0 and 2, client B holds locks on segments 1 and 3.

The request comes to write 3 segments of data.  Client A writes to segment 0,
client B writes to segment 1, and client A also writes to segment 2.  No data
is written to segment 3.  At this point, the server checks the file size, by
glimpsing the highest lock . The lock on segment 3.  Client B does not know
about the writing done by client A to segment 2, so it gives an incorrect file
size.

This would be OK if client B had pending writes to segment 3, but it does not.
In this situation, the server will never get the correct file size while this
lock exists.

The solution is relatively straightforward: The server needs to glimpse every
client holding a write lock (starting from the top) until we find one holding
an 'active' lock (because the size is known to be at least the size returned
from an 'active' lock), and take the largest size returned. This avoids asking
only a client which may not know the correct file size.

Unfortunately, there is no way to know if a manually requested lock is active
from the server side.  So when we see such a lock, we must send a glimpse to
the holder (unless we have already sent a glimpse to that client*).  However,
because locks without LDLM_FL_NO_EXPANSION set are guaranteed to be 'active',
once we reach the first such lock, we can stop glimpsing.

*This is because when we glimpse a specific lock, the client holding it returns
its best idea of the size information, so we only need to send one glimpse to
each client.

This is less efficient than the standard "glimpse only the top lock"
methodology, but since we only need to glimpse one lock per client (and the
number of clients writing to the part of a file on a given OST is fairly
limited), the cost is restrained.

Additionally, lock cancellation methods such as early lock cancel aggressively
clean up older locks, particularly when the LRU limit is exceeded, so the
total lock count should also remain manageable.

In the end, the final verdict here is performance. Requestlock testing for the
strided I/O case has shown good performance results.
