The NGX_DECLINED is replaced with NGX_DONE to match closer to return code
of ngx_quic_handle_packet() and ngx_quic_close_connection() rc argument.
The ngx_quic_close_connection() rc code is used only when quic connection
exists, thus anything goes if qc == NULL.
The ngx_quic_handle_datagram() does not return NG_OK in cases when quic
connection is not yet created.
Previously, closure detection for server-initiated uni streams was not properly
implemented. Instead, HTTP/3 code relied on QUIC code posting the read event
and setting rev->error when it needed to close the stream. Then, regular
uni stream read handler called c->recv() and received error, which closed the
stream. This was an ad-hoc solution. If, for whatever reason, the read
handler was called earlier, c->recv() would return 0, which would also close
the stream.
Now server-initiated uni streams have a separate read event handler for
tracking stream closure. The handler calls c->recv(), which normally returns
0, but may return error in case of closure.
Sending the instruction is delayed until the end of the current event cycle.
Delaying the instruction is allowed by quic-qpack-21, section 2.2.2.3.
The goal is to reduce the amount of data sent back to client by accumulating
several inserts in one instruction and sometimes not sending the instruction at
all, if Section Acknowledgement was sent just before it.
Operations like ngx_quic_open_stream(), ngx_http_quic_get_connection(),
ngx_http_v3_finalize_connection(), ngx_http_v3_shutdown_connection() used to
receive a QUIC stream connection. Now they can receive the main QUIC
connection as well. This is useful when calling them from a stream context.
This is to ease transition with oldish BoringSSL versions,
the default for SSL_set_quic_use_legacy_codepoint() has been
flipped in BoringSSL a1d3bfb64fd7ef2cb178b5b515522ffd75d7b8c5.
Chrome only uses TLS session tickets once with TLS 1.3, likely following
RFC 8446 Appendix C.4 recommendation. With OpenSSL, this works fine with
built-in session tickets, since these are explicitly renewed in case of
TLS 1.3 on each session reuse, but results in only two connections being
reused after an initial handshake when using ssl_session_ticket_key.
Fix is to always renew TLS session tickets in case of TLS 1.3 when using
ssl_session_ticket_key, similarly to how it is done by OpenSSL internally.
Previously, when input ended on a QUIC buffer boundary, input chain was not
advanced to the next buffer. As a result, ngx_quic_write_chain() returned
a chain with an empty buffer instead of NULL. This broke HTTP write filter,
preventing it from closing the HTTP request and eventually timing out.
Now input chain is always advanced to a buffer that has data, before checking
QUIC buffer boundary condition.
RFC 9000, 9.3. Responding to Connection Migration:
An endpoint only changes the address to which it sends packets in
response to the highest-numbered non-probing packet.
The patch extends this requirement to probing packets. Although it may
seem excessive, it helps with mitigation of reply attacks (when an off-path
attacker has copied packet with PATH_CHALLENGE and uses different
addresses to exhaust available connection ids).
The quic connection now holds active, backup and probe paths instead
of sockets. The number of migration paths is now limited and cannot
be inflated by a bad client or an attacker.
The client id is now associated with path rather than socket. This allows
to simplify processing of output and connection ids handling.
New migration abandons any previously started migrations. This allows to
free consumed client ids and request new for use in future migrations and
make progress in case when connection id limit is hit during migration.
A path now can be revalidated without losing its state.
The patch also fixes various issues with NAT rebinding case handling:
- paths are now validated (previously, there was no validation
and paths were left in limited state)
- attempt to reuse id on different path is now again verified
(this was broken in 40445fc7c403)
- former path is now validated in case of apparent migration
The function splits a buffer at given offset. The function is now
called from ngx_quic_read_chain() and ngx_quic_write_chain(), which
simplifies both functions.
After 9ae239d2547d, ngx_quic_handle_write_event() no longer runs into
ngx_send_lowat() for QUIC connections, so the check became excessive.
It is assumed that external modules operating with SO_SNDLOWAT
(I'm not aware of any) should do this check on their own.
Previously, ngx_quic_write_chain() treated each input buffer as a memory
buffer, which is not always the case. Special buffers were not skipped, which
is especially important when hitting the input byte limit.
The issue manifested itself with ngx_quic_write_chain() returning a non-empty
chain consisting of a special last_buf buffer when called from QUIC stream
send_chain(). In order for this to happen, input byte limit should be equal to
the chain length, and the input chain should end with an empty last_buf buffer.
An easy way to achieve this is the following:
location /empty {
return 200;
}
When this non-empty chain was returned from send_chain(), it signalled to the
caller that input was blocked, while in fact it wasn't. This prevented HTTP
request from finalization, which prevented QUIC from sending STREAM FIN to
the client. The QUIC stream was then reset after a timeout.
Now special buffers are skipped and send_chain() returns NULL in the case
above, which signals to the caller a successful operation.
Also, original byte limit is now passed to ngx_quic_write_chain() from
send_chain() instead of actual chain length to make sure it's never zero.
Previously, when a STREAM FIN frame with no data bytes was received after all
prior stream data were already read by the application layer, the frame was
ignored and eof was not reported to the application.
When a worker process is shutting down, keepalive is not used: this is checked
before the ngx_http_set_keepalive() call in ngx_http_finalize_connection().
Yet the "Connection: keep-alive" header was still sent, even if we know that
the worker process is shutting down, potentially resulting in additional
requests being sent to the connection which is going to be closed anyway.
While clients are expected to be able to handle asynchronous close events
(see ticket #1022), it is certainly possible to send the "Connection: close"
header instead, informing the client that the connection is going to be closed
and potentially saving some unneeded work.
With this change, we additionally check for worker process shutdown just
before sending response headers, and disable keepalive accordingly.
Linux with EPOLLEXCLUSIVE usually notifies only the process which was first
to add the listening socket to the epoll instance. As a result most of the
connections are handled by the first worker process (ticket #2285). To fix
this, we re-add the socket periodically, so other workers will get a chance
to accept connections.
The SF_NOCACHE flag, introduced in FreeBSD 11 along with the new non-blocking
sendfile() implementation by glebius@, makes it possible to use sendfile()
along with the "directio" directive.
Starting with FreeBSD 11, there is no need to use AIO operations to preload
data into cache for sendfile(SF_NODISKIO) to work. Instead, sendfile()
handles non-blocking loading data from disk by itself. It still can, however,
return EBUSY if a page is already being loaded (for example, by a different
process). If this happens, we now post an event for the next event loop
iteration, so sendfile() is retried "after a short period", as manpage
recommends.
The limit of the number of EBUSY tolerated without any progress is preserved,
but now it does not result in an alert, since on an idle system event loop
iteration might be very short and EBUSY can happen many times in a row.
Instead, SF_NODISKIO is simply disabled for one call once the limit is
reached.
With this change, sendfile(SF_NODISKIO) is now used automatically as long as
sendfile() is enabled, and no longer requires "aio on;".
It was mostly copy of the ngx_quic_listen(). Now ngx_quic_listen() no
longer generates server id and increments seqnum. Instead, the server
id is generated when the socket is created.
The ngx_quic_alloc_socket() function is renamed to ngx_quic_create_socket().
Notably, NAXSI is known to misuse ngx_regex_compile() with rc.options set
to PCRE_CASELESS | PCRE_MULTILINE. With PCRE2 support, and notably binary
compatibility changes, it is no longer possible to set PCRE[2]_MULTILINE
option without using proper interface. To facilitate correct usage,
this change adds the NGX_REGEX_MULTILINE option.
With this change, dynamic modules using nginx regex interface can be used
regardless of the variant of the PCRE library nginx was compiled with.
If a module is compiled with different PCRE library variant, in case of
ngx_regex_exec() errors it will report wrong function name in error
messages. This is believed to be tolerable, given that fixing this will
require interface changes.
The PCRE2 library is now used by default if found, instead of the
original PCRE library. If needed for some reason, this can be disabled
with the --without-pcre2 configure option.
To make it possible to specify paths to the library and include files
via --with-cc-opt / --with-ld-opt, the library is first tested without
any additional paths and options. If this fails, the pcre2-config script
is used.
Similarly to the original PCRE library, it is now possible to build PCRE2
from sources with nginx configure, by using the --with-pcre= option.
It automatically detects if PCRE or PCRE2 sources are provided.
Note that compiling PCRE2 10.33 and later requires inttypes.h. When
compiling on Windows with MSVC, inttypes.h is only available starting
with MSVC 2013. In older versions some replacement needs to be provided
("echo '#include <stdint.h>' > pcre2-10.xx/src/inttypes.h" is good enough
for MSVC 2010).
The interface on nginx side remains unchanged.
If a configuration parsing fails for some reason, ngx_regex_module_init()
is not called, and ngx_pcre_studies remained set despite the fact that
the pool it was allocated from is already freed. This might result in
a segmentation fault during runtime regular expression compilation, such
as in SSI, for example, in the single process mode, or if a worker process
died and was respawned from a master process in such an inconsistent state.
Fix is to clear ngx_pcre_studies from the pool cleanup handler (which is
anyway used to free JIT-compiled patterns).
Previously, buffer lists was used to track used buffers. Now reference
counter is used instead. The new implementation is simpler and faster with
many buffer clones.
They are replaced with ngx_quic_write_chain() and ngx_quic_read_chain().
These functions represent the API to data buffering.
The first function adds data of given size at given offset to the buffer.
Now it returns the unwritten part of the chain similar to c->send_chain().
The second function returns data of given size from the beginning of the buffer.
Its second argument and return value are swapped compared to
ngx_quic_split_bufs() to better match ngx_quic_write_chain().
Added, returned and stored data are regular ngx_chain_t/ngx_buf_t chains.
Missing data is marked with b->sync flag.
The functions are now used in both send and recv data chains in QUIC streams.
Previously, when a few bytes were send to a QUIC stream by the application, a
4K buffer was allocated for these bytes. Then a STREAM frame was created and
that entire buffer was used as data for that frame. The frame with the buffer
were in use up until the frame was acked by client. Meanwhile, when more
bytes were send to the stream, more buffers were allocated and assigned as
data to newer STREAM frames. In this scenario most buffer memory is unused.
Now the unused part of the stream output buffer is available for further
stream output while earlier parts of the buffer are waiting to be acked.
This is achieved by splitting the output buffer.
The output is always sent to the active path, which is stored in the
quic connection. There is no need to pass it in arguments.
When output has to be send to to a specific path (in rare cases, such as
path probing), a separate method exists (ngx_quic_frame_sendto()).
The path validation status and anti-amplification limit status is actually
two different variables. It is possible that validating path should not
be limited (for example, when re-validating former path).
Previously, path was considered valid during arbitrary selected 10m timeout
since validation. This is quite not what RFC 9000 says; the relevant
part is:
An endpoint MAY skip validation of a peer address if that
address has been seen recently.
The patch considers a path to be 'recently seen' if packets were received
during idle timeout. If a packet is received from the path that was seen
not so recently, such path is considered new, and anti-amplification
restrictions apply.
After creation, a client stream is added to qc->streams.uninitialized queue.
After initialization it's removed from the queue. If a stream is never
initialized, it is freed in ngx_quic_close_streams(). Stream initializer
is now set as read event handler in stream connection.
Previously qc->streams.uninitialized was used only for delayed stream
initialization.
The change makes it possible not to handle separately the case of a new stream
in stream-related frame handlers. It makes these handlers simpler since new
streams and existing streams are now handled by the same code.
With sendfile() in threads ("aio threads; sendfile on;"), client connection
can block on writing, waiting for sendfile() to complete. In HTTP/2 this
might result in the request hang, since an attempt to continue processing
in thread event handler will call request's write event handler, which
is usually stopped by ngx_http_v2_send_chain(): it does nothing if there
are no additional data and stream->queued is set. Further, HTTP/2 resets
stream's c->write->ready to 0 if writing blocks, so just fixing
ngx_http_v2_send_chain() is not enough.
Can be reproduced with test suite on Linux with:
TEST_NGINX_GLOBALS_HTTP="aio threads; sendfile on;" prove h2*.t
The following tests currently fail: h2_keepalive.t, h2_priority.t,
h2_proxy_max_temp_file_size.t, h2.t, h2_trailers.t.
Similarly, sendfile() with AIO preloading on FreeBSD can block as well,
with similar results. This is, however, harder to reproduce, especially
on modern FreeBSD systems, since sendfile() usually does not return EBUSY.
Fix is to modify ngx_http_v2_send_chain() so it actually tries to send
data to the main connection when called, and to make sure that
c->write->ready is set by the relevant event handlers.
With sendfile in threads, "task already active" alerts might appear in logs
if a write event happens on the main HTTP/2 connection, triggering a sendfile
in threads while another thread operation is already running. Observed
with "aio threads; aio_write on; sendfile on;" and with thread event handlers
modified to post a write event to the main HTTP/2 connection (though can
happen without any modifications).
Similarly, sendfile() with AIO preloading on FreeBSD can trigger duplicate
aio operation, resulting in "second aio post" alerts. This is, however,
harder to reproduce, especially on modern FreeBSD systems, since sendfile()
usually does not return EBUSY.
Fix is to avoid starting a sendfile operation if other thread operation
is active by checking r->aio in the thread handler (and, similarly, in
aio preload handler). The added check also makes duplicate calls protection
redundant, so it is removed.
The SSL_OP_ENABLE_MIDDLEBOX_COMPAT option is provided by QuicTLS and enabled
by default in the newly created SSL contexts. SSL_set_quic_method() is used
to clear it, which is required for SSL handshake to work on QUIC connections.
Switching context in the ngx_http_ssl_servername() SNI callback overrides SSL
options from the new SSL context. This results in the option set again.
Fix is to explicitly clear it when switching to another SSL context.
Initially reported here (in Russian):
http://mailman.nginx.org/pipermail/nginx-ru/2021-November/063989.html
ngx_http_v3_tables.h and ngx_http_v3_tables.c are renamed to
ngx_http_v3_table.h and ngx_http_v3_table.c to better match HTTP/2 code.
ngx_http_v3_streams.h and ngx_http_v3_streams.c are renamed to
ngx_http_v3_uni.h and ngx_http_v3_uni.c to better match their content.
Directives that set transport parameters are removed from the configuration.
Corresponding values are derived from the quic configuration or initialized
to default. Whenever possible, quic configuration parameters are taken from
higher-level protocol settings, i.e. HTTP/3.
A new variable $http3 is added. The variable equals to "h3" for HTTP/3
connections, "hq" for hq connections and is an empty string otherwise.
The variable $quic is eliminated.
The new variable is similar to $http2 variable.
RFC 9000 19.16
The sequence number specified in a RETIRE_CONNECTION_ID frame MUST NOT
refer to the Destination Connection ID field of the packet in which the
frame is contained.
Before the patch, the RETIRE_CONNECTION_ID frame was sent before switching
to the new client id. If retired client id was currently in use, this lead
to violation of the spec.
The c->udp->dgram may be NULL only if the quic connection was just
created: the ngx_event_udp_recvmsg() passes information about datagrams
to existing connections by providing information in c->udp.
If case of a new connection, c->udp is allocated by the QUIC code during
creation of quic connection (it uses c->sockaddr to initialize qsock->path).
Thus the check for qsock->path is excessive and can be read wrong, assuming
that other options possible, leading to warnings from clang static analyzer.
Removed sending CLOSE_CONNECTION directly to avoid duplicate frames,
since it is sent later again in SSL_do_handshake() error handling.
As such, removed redundant settings of error fields set elsewhere.
While here, improved debug message.
All open sockets are stored in a queue. There is no need to close some
of them separately. If it happens that active and backup point to same
socket, double close may happen (leading to possible segfault).
The RFC 9000 allows a packet from known CID arrive from unknown path:
These requirements regarding connection ID reuse apply only to the
sending of packets, as unintentional changes in path without a change
in connection ID are possible. For example, after a period of
network inactivity, NAT rebinding might cause packets to be sent on a
new path when the client resumes sending.
Before the patch, such packets were rejected with an error in the
ngx_quic_check_migration() function. Removing the check makes the
separate function excessive - remaining checks are early migration
check and "disable_active_migration" check. The latter is a transport
parameter sent to client and it should not be used by server.
The server should send "disable_active_migration" "if the endpoint does
not support active connection migration" (18.2). The support status depends
on nginx configuration: to have migration working with multiple workers,
you need bpf helper, available on recent Linux systems. The patch does
not set "disable_active_migration" automatically and leaves it for the
administrator. By default, active migration is enabled.
RFC 900 says that it is ok to migrate if the peer violates
"disable_active_migration" flag requirements:
If the peer violates this requirement,
the endpoint MUST either drop the incoming packets on that path without
generating a Stateless Reset
OR
proceed with path validation and allow the peer to migrate. Generating a
Stateless Reset or closing the connection would allow third parties in the
network to cause connections to close by spoofing or otherwise manipulating
observed traffic.
So, nginx adheres to the second option and proceeds to path validation.
Note:
The ngtcp2 may be used for testing both active migration and NAT rebinding:
ngtcp2/client --change-local-addr=200ms --delay-stream=500ms <ip> <port> <url>
ngtcp2/client --change-local-addr=200ms --delay-stream=500ms --nat-rebinding \
<ip> <port> <url>
Single UDP datagram may contain multiple QUIC datagrams. In order to
facilitate handling of such cases, 'first' flag in the ngx_quic_header_t
structure is introduced.
Previously, the retired socket was not closed if it didn't match
active or backup.
New sockets could not be created (due to count limit), since retired socket
was not closed before calling ngx_quic_create_sockets().
When replacing retired socket, new socket is only requested after closing
old one, to avoid hitting the limit on the number of active connection ids.
Together with added restrictions, this fixes an issue when a current socket
could be closed during migration, recreated and erroneously reused leading
to null pointer dereference.
Previously the frame was not handled and connection was closed with an error.
Now, after receiving this frame, global flow control is updated and new
flow control credit is sent to client.
Previously, after receiving STREAM_DATA_BLOCKED, current flow control limit
was sent to client. Now, if the limit can be updated to the full window size,
it is updated and the new value is sent to client, otherwise nothing is sent.
The change lets client update flow control credit on demand. Also, it saves
traffic by not sending MAX_STREAM_DATA with the same value twice.
The reasons why a stream may not be created by server currently include hitting
worker_connections limit and memory allocation error. Previously in these
cases the entire QUIC connection was closed and all its streams were shut down.
Now the new stream is rejected and existing streams continue working.
To reject an HTTP/3 request stream, RESET_STREAM and STOP_SENDING with
H3_REQUEST_REJECTED error code are sent to client. HTTP/3 uni streams and
Stream streams are not rejected.
The variable contains a negotiated curve used for the handshake key
exchange process. Known curves are listed by their names, unknown
ones are shown in hex.
Note that for resumed sessions in TLSv1.2 and older protocols,
$ssl_curve contains the curve used during the initial handshake,
while in TLSv1.3 it contains the curve used during the session
resumption (see the SSL_get_negotiated_group manual page for
details).
The variable is only meaningful when using OpenSSL 3.0 and above.
With older versions the variable is empty.
Without this change, aio used with HTTP/2 can result in connection hang,
as observed with "aio threads; aio_write on;" and proxying (ticket #2248).
The problem is that HTTP/2 updates buffers outside of the output filters
(notably, marks them as sent), and then posts a write event to call
output filters. If a filter does not call the next one for some reason
(for example, because of an AIO operation in progress), this might
result in a state when the owner of a buffer already called
ngx_chain_update_chains() and can reuse the buffer, while the same buffer
is still sitting in the busy chain of some other filter.
In the particular case a buffer was sitting in output chain's ctx->busy,
and was reused by event pipe. Output chain's ctx->busy was permanently
blocked by it, and this resulted in connection hang.
Fix is to change ngx_chain_update_chains() to skip buffers from other
modules unconditionally, without trying to wait for these buffers to
become empty.
The "sendfile_max_chunk" directive is important to prevent worker
monopolization by fast connections. The 2m value implies maximum 200ms
delay with 100 Mbps links, 20ms delay with 1 Gbps links, and 2ms on
10 Gbps links. It also seems to be a good value for disks.
Previously, connections to upstream servers used sendfile() if it was
enabled, but never honored sendfile_max_chunk. This might result
in worker monopolization for a long time if large request bodies
are allowed.
On Linux starting with 2.6.16, sendfile() silently limits all operations
to MAX_RW_COUNT, defined as (INT_MAX & PAGE_MASK). This incorrectly
triggered the interrupt check, and resulted in 0-sized writev() on the
next loop iteration.
Fix is to make sure the limit is always checked, so we will return from
the loop if the limit is already reached even if number of bytes sent is
not exactly equal to the number of bytes we've tried to send.
Previously, it was checked that sendfile_max_chunk was enabled and
almost whole sendfile_max_chunk was sent (see e67ef50c3176), to avoid
delaying connections where sendfile_max_chunk wasn't reached (for example,
when sending responses smaller than sendfile_max_chunk). Now we instead
check if there are unsent data, and the connection is still ready for writing.
Additionally we also check c->write->delayed to ignore connections already
delayed by limit_rate.
This approach is believed to be more robust, and correctly handles
not only sendfile_max_chunk, but also internal limits of c->send_chain(),
such as sendfile() maximum supported length (ticket #1870).
Previously, 1 millisecond delay was used instead. In certain edge cases
this might result in noticeable performance degradation though, notably on
Linux with typical CONFIG_HZ=250 (so 1ms delay becomes 4ms),
sendfile_max_chunk 2m, and link speed above 2.5 Gbps.
Using posted next events removes the artificial delay and makes processing
fast in all cases.
The directive enables including all frames from start time to the most recent
key frame in the result. Those frames are removed from presentation timeline
using mp4 edit lists.
Edit lists are currently supported by popular players and browsers such as
Chrome, Safari, QuickTime and ffmpeg. Among those not supporting them properly
is Firefox[1].
Based on a patch by Tracey Jaquith, Internet Archive.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1735300
The function updates the duration field of mdhd atom. Previously it was
updated in ngx_http_mp4_read_mdhd_atom(). The change makes it possible to
alter track duration as a result of processing track frames.
As per quic-qpack-21:
When a stream is reset or reading is abandoned, the decoder emits a
Stream Cancellation instruction.
Previously the instruction was not sent. Now it's sent when closing QUIC
stream connection if dynamic table capacity is non-zero and eof was not
received from client. The latter condition means that a trailers section
may still be on its way from client and the stream needs to be cancelled.
A QUIC stream connection is treated as reusable until first bytes of request
arrive, which is also when the request object is now allocated. A connection
closed as a result of draining, is reset with the error code
H3_REQUEST_REJECTED. Such behavior is allowed by quic-http-34:
Once a request stream has been opened, the request MAY be cancelled
by either endpoint. Clients cancel requests if the response is no
longer of interest; servers cancel requests if they are unable to or
choose not to respond.
When the server cancels a request without performing any application
processing, the request is considered "rejected." The server SHOULD
abort its response stream with the error code H3_REQUEST_REJECTED.
The client can treat requests rejected by the server as though they had
never been sent at all, thereby allowing them to be retried later.
When an HTTP/3 function returns an error in context of a QUIC stream, it's
this function's responsibility now to finalize the entire QUIC connection
with the right code, if required. Previously, QUIC connection finalization
could be done both outside and inside such functions. The new rule follows
a similar rule for logging, leads to cleaner code, and allows to provide more
details about the error.
While here, a few error cases are no longer treated as fatal and QUIC connection
is no longer finalized in these cases. A few other cases now lead to
stream reset instead of connection finalization.
The PATH_RESPONSE frame must be expanded to 1200, except the case
when anti-amplification limit is in effect, i.e. on unvalidated paths.
Previously, the anti-amplification limit was always applied.
If client ID was never used, its refcount is zero. To keep things simple,
the ngx_quic_unref_client_id() function is now aware of such IDs.
If client ID was used, the ngx_quic_replace_retired_client_id() function
is supposed to find all users and unref the ID, thus ngx_quic_unref_client_id()
should not be called after it.
Previously, it was not enforced in the stream module.
Now, since b9e02e9b2f1d it is possible to specify protocols.
Since ALPN is always required, the 'require_alpn' setting is now obsolete.
The "min" and "max" arguments refer to UDP datagram size. Generating payload
requires to account properly for header size, which is variable and depends on
payload size and packet number.
After fe919fd63b0b, processing QUIC streams was postponed until after handshake
completion, which means that 0-RTT is effectively off. With ssl_ocsp enabled,
it could be further delayed. This differs from how OCSP validation works with
SSL_read_early_data(). With this change, processing QUIC streams is unlocked
when obtaining 0-RTT secret.
The sent queue is sorted by packet number. It is possible to avoid
traversing full queue while handling ack ranges. It makes sense to
start traversing from the queue head (i.e. check oldest packets first).
With this patch, all traffic over a QUIC connection is compared to traffic
over QUIC streams. As long as total traffic is many times larger than stream
traffic, we consider this to be a flood.
With this patch, all traffic over HTTP/3 bidi and uni streams is counted in
the h3c->total_bytes field, and payload traffic is counted in the
h3c->payload_bytes field. As long as total traffic is many times larger than
payload traffic, we consider this to be a flood.
Request header traffic is counted as if all fields are literal. Response
header traffic is counted as is.
Checking the reset after encryption avoids false positives. More importantly,
it avoids the check entirely in the usual case where decryption succeeds.
RFC 9000, 10.3.1 Detecting a Stateless Reset
Endpoints MAY skip this check if any packet from a datagram is
successfully processed.
As per RFC 9000:
An endpoint that receives a STOP_SENDING frame MUST send a RESET_STREAM
frame if the stream is in the "Ready" or "Send" state.
An endpoint SHOULD copy the error code from the STOP_SENDING frame to
the RESET_STREAM frame it sends, but it can use any application error code.
The flag indicates that the entire response was sent to the socket up to the
last_buf flag. The flag is only usable for protocol implementations that call
ngx_http_write_filter() from header filter, such as HTTP/1.x and HTTP/3.
Similar to the previous change, a segmentation fault occurres when evaluating
SSL certificates on a QUIC connection due to an uninitialized stream session.
The fix is to adjust initializing the QUIC part of a connection until after
it has session and variables initialized.
Similarly, this appends logging error context for QUIC connections:
- client 127.0.0.1:54749 connected to 127.0.0.1:8880 while handling frames
- quic client timed out (60: Operation timed out) while handling quic input
A QUIC connection doesn't have c->log->data and friends initialized to sensible
values. Yet, a request can be created in the certificate callback with such an
assumption, which leads to a segmentation fault due to null pointer dereference
in ngx_http_free_request(). The fix is to adjust initializing the QUIC part of
a connection such that it has all of that in place.
Further, this appends logging error context for unsuccessful QUIC handshakes:
- cannot load certificate .. while handling frames
- SSL_do_handshake() failed .. while sending frames
Previously the counter was not incremented for HTTP/3 streams, but still
decremented in ngx_http_close_connection(). There are two solutions here, one
is to increment the counter for HTTP/3 streams, and the other one is not to
decrement the counter for HTTP/3 streams. The latter solution looks
inconsistent with ngx_stat_reading/ngx_stat_writing, which are incremented on a
per-request basis. The change adds ngx_stat_active increment for HTTP/3
request and push streams.
Notably, it is to avoid setting the TCP_NODELAY flag for QUIC streams
in ngx_http_upstream_send_response(). It is an invalid operation on
inherently SOCK_DGRAM sockets, which leads to QUIC connection close.
The change reduces diff to the default branch in stream content phase.
This function was only referenced from ngx_http_v3_create_push_request() to
initialize push connection log. Now the log handler is copied from the parent
request connection.
The change reduces diff to the default branch.
The functions ngx_quic_handle_read_event() and ngx_quic_handle_write_event()
are added. Previously this code was a part of ngx_handle_read_event() and
ngx_handle_write_event().
The change simplifies ngx_handle_read_event() and ngx_handle_write_event()
by moving QUIC-related code to a QUIC source file.
Previously it had -1 as fd. This fixes proxying, which relies on downstream
connection having a real fd. Also, this reduces diff to the default branch for
ngx_close_connection().
The request body filter chain is no longer called after processing
a DATA frame. Instead, we now post a read event to do this. This
ensures that multiple small DATA frames read during the same event loop
iteration are coalesced together, resulting in much faster processing.
Since rb->buf can now contain unprocessed data, window update is no
longer sent in ngx_http_v2_state_read_data() in case of flow control
being used due to filter buffering. Instead, window will be updated
by ngx_http_v2_read_client_request_body_handler() in the posted read
event.
Following rb->filter_need_buffering changes, request body reading is
only finished after the filter chain is called and rb->last_saved is set.
As such, with r->request_body_no_buffering, timer on fc->read is no
longer removed when the last part of the body is received, potentially
resulting in incorrect behaviour.
The fix is to call ngx_http_v2_process_request_body() from the
ngx_http_v2_read_unbuffered_request_body() function instead of
directly calling ngx_http_v2_filter_request_body(), so the timer
is properly removed.
In the body read handler, the window was incorrectly calculated
based on the full buffer size instead of the amount of free space
in the buffer. If the request body is buffered by a filter, and
the buffer is not empty after the read event is generated by the
filter to resume request body processing, this could result in
"http2 negative window update" alerts.
Further, in the body ready handler and in ngx_http_v2_state_read_data()
the buffer wasn't cleared when the data were already written to disk,
so the client might stuck without window updates.
If a MAX_DATA frame was received before any stream was created, then the worker
process would crash in nginx_quic_handle_max_data_frame() while traversing the
stream tree. The issue is solved by adding a check that makes sure the tree is
not empty.
If a filter wants to buffer the request body during reading (for
example, to check an external scanner), it can now do so. To make
it possible, the code now checks rb->last_saved (introduced in the
previous change) along with rb->rest == 0.
Since in HTTP/2 this requires flow control to avoid overflowing the
request body buffer, so filters which need buffering have to set
the rb->filter_need_buffering flag on the first filter call. (Note
that each filter is expected to call the next filter, so all filters
will be able set the flag if needed.)
It indicates that the last buffer was received by the save filter,
and can be used to check this at higher levels. To be used in the
following changes.
If due to an error ngx_http_request_body_save_filter() is called
more than once with rb->rest == 0, this used to result in a segmentation
fault. Added an alert to catch such errors, just in case.
Previously, fully preread unbuffered requests larger than client body
buffer size were saved to disk, despite the fact that "unbuffered" is
expected to imply no disk buffering.
The save body filter saves the request body to disk once the buffer is full.
Yet in HTTP/2 this might happen even if there is no need to save anything
to disk, notably when content length is known and the END_STREAM flag is
sent in a separate empty DATA frame. Workaround is to provide additional
byte in the buffer, so saving the request body won't be triggered.
This fixes unexpected request body disk buffering in HTTP/2 observed after
the previous change when content length is known and the END_STREAM flag
is sent in a separate empty DATA frame.
In particular, now the code always uses a buffer limited by
client_body_buffer_size. At the cost of an additional copy it
ensures that small DATA frames are not directly mapped to small
write() syscalls, but rather buffered in memory before writing.
Further, requests without Content-Length are no longer forced
to use temporary files.
With SSL it is possible that an established connection is ready for
reading after the handshake. Further, events might be already disabled
in case of level-triggered event methods. If this happens and
ngx_http_upstream_send_request() blocks waiting for some data from
the upstream, such as flow control in case of gRPC, the connection
will time out due to no read events on the upstream connection.
Fix is to explicitly check the c->read->ready flag if sending request
blocks and post a read event if it is set.
Note that while it is possible to modify ngx_ssl_handshake() to keep
read events active, this won't completely resolve the issue, since
there can be data already received during the SSL handshake
(see 573bd30e46b4).
Hash initialization ignores elements with key.data set to NULL.
Nevertheless, the initial hash bucket size check didn't skip them,
resulting in unnecessary restrictions on, for example, variables with
long names and with the NGX_HTTP_VARIABLE_NOHASH flag.
Fix is to update the initial hash bucket size check to skip elements
with key.data set to NULL, similarly to how it is done in other parts
of the code.
Requires OpenSSL 3.0 compiled with "enable-ktls" option. Further, KTLS
needs to be enabled in kernel, and in OpenSSL, either via OpenSSL
configuration file or with "ssl_conf_command Options KTLS;" in nginx
configuration.
On FreeBSD, kernel TLS is available starting with FreeBSD 13.0, and
can be enabled with "sysctl kern.ipc.tls.enable=1" and "kldload ktls_ocf"
to load a software backend, see man ktls(4) for details.
On Linux, kernel TLS is available starting with kernel 4.13 (at least 5.2
is recommended), and needs kernel compiled with CONFIG_TLS=y (with
CONFIG_TLS=m, which is used at least on Ubuntu 21.04 by default,
the tls module needs to be loaded with "modprobe tls").
While clock_gettime(CLOCK_MONOTONIC_COARSE) is faster than
clock_gettime(CLOCK_MONOTONIC), the latter is fast enough on Linux for
practical usage, and the difference is negligible compared to other costs
at each event loop iteration. On the other hand, CLOCK_MONOTONIC_COARSE
causes various issues with typical CONFIG_HZ=250, notably very inaccurate
limit_rate handling in some edge cases (ticket #1678) and negative difference
between $request_time and $upstream_response_time (ticket #1965).
This is a recommended behavior by RFC 7301 and is useful
for mitigation of protocol confusion attacks [1].
To avoid possible negative effects, list of supported protocols
was extended to include all possible HTTP protocol ALPN IDs
registered by IANA [2], i.e. "http/1.0" and "http/0.9".
[1] https://alpaca-attack.com/
[2] https://www.iana.org/assignments/tls-extensiontype-values/
In b87b7092cedb (nginx 1.21.1), logging level of "upstream sent invalid
header" errors was accidentally changed to "info". This change restores
the "error" level, which is a proper logging level for upstream-side
errors.
The u->keepalive flag is initialized early if the response has no body
(or an empty body), and needs to be reset if there are any extra data,
similarly to how it is done in ngx_http_proxy_copy_filter(). Missed
in 83c4622053b0.
The "proxy_half_close" directive enables handling of TCP half close. If
enabled, connection to proxied server is kept open until both read ends get
EOF. Write end shutdown is properly transmitted via proxy.
Do this only when the entire request body is empty and
r->request_body_in_file_only is set.
The issue manifested itself with missing warning "a client request body is
buffered to a temporary file" when the entire rb->buf is full and all buffers
are delayed by a filter.
This adds new Auth-SSL-Protocol and Auth-SSL-Cipher headers to
the mail proxy auth protocol when SSL is enabled.
This can be useful for detecting users using older clients that
negotiate old ciphers when you want to upgrade to newer
TLS versions of remove suppport for old and insecure ciphers.
You can use your auth backend to notify these users before the
upgrade that they either need to upgrade their client software
or contact your support team to work out an upgrade path.
To load old/weak server or client certificates it might be needed to adjust
the security level, as introduced in OpenSSL 1.1.0. This change ensures that
ciphers are set before loading the certificates, so security level changes
via the cipher string apply to certificate loading.
Export ciphers are forbidden to negotiate in TLS 1.1 and later protocol modes.
They are disabled since OpenSSL 1.0.2g by default unless explicitly configured
with "enable-weak-ssl-ciphers", and completely removed in OpenSSL 1.1.0.
A new behaviour was introduced in OpenSSL 1.1.1e, when a peer does not send
close_notify before closing the connection. Previously, it was to return
SSL_ERROR_SYSCALL with errno 0, known since at least OpenSSL 0.9.7, and is
handled gracefully in nginx. Now it returns SSL_ERROR_SSL with a distinct
reason SSL_R_UNEXPECTED_EOF_WHILE_READING ("unexpected eof while reading").
This leads to critical errors seen in nginx within various routines such as
SSL_do_handshake(), SSL_read(), SSL_shutdown(). The behaviour was restored
in OpenSSL 1.1.1f, but presents in OpenSSL 3.0 by default.
Use of the SSL_OP_IGNORE_UNEXPECTED_EOF option added in OpenSSL 3.0 allows
to set a compatible behaviour to return SSL_ERROR_ZERO_RETURN:
https://git.openssl.org/?p=openssl.git;a=commitdiff;h=09b90e0
See for additional details: https://github.com/openssl/openssl/issues/11381
The OPENSSL_SUPPRESS_DEPRECATED macro is used to suppress deprecation warnings.
This covers Session Tickets keys, SSL Engine, DH low level API for DHE ciphers.
Unlike OPENSSL_API_COMPAT, it works well with OpenSSL built with no-deprecated.
In particular, it doesn't unhide various macros in OpenSSL includes, which are
meant to be hidden under OPENSSL_NO_DEPRECATED.
The only consumer is a callback function for SSL_CTX_set_tmp_rsa_callback()
deprecated in OpenSSL 1.1.0. Now the function is conditionally compiled too.
The latest HTTP/1.1 draft describes Transfer-Encoding in HTTP/1.0 as having
potentially faulty message framing as that could have been forwarded without
handling of the chunked encoding, and forbids processing subsequest requests
over that connection: https://github.com/httpwg/http-core/issues/879.
While handling of such requests is permitted, the most secure approach seems
to reject them.
The c->read->ready and c->write->ready flags might be reset during
the handshake, and not set again if the handshake was finished on
the other event. At the same time, some data might be read from
the socket during the handshake, so missing c->read->ready flag might
result in a connection hang, for example, when waiting for an SMTP
greeting (which was already received during the handshake).
Found by Sergey Kandaurov.
Previously, when cleaning up a QUIC stream in shutdown mode,
ngx_quic_shutdown_quic() was called, which could close the QUIC connection
right away. This could be a problem if the connection was referenced up the
stack. For example, this could happen in ngx_quic_init_streams(),
ngx_quic_close_streams(), ngx_quic_create_client_stream() etc.
With a typical HTTP/3 client the issue is unlikely because of HTTP/3 uni
streams which need a posted event to close. In this case QUIC connection
cannot be closed right away.
Now QUIC connection read event is posted and it will shut down the connection
asynchronously.
After receiving GOAWAY, client is not supposed to create new streams. However,
until client reads this frame, we allow it to create new streams, which are
gracefully rejected. To prevent client from abusing this algorithm, a new
limit is introduced. Upon reaching keepalive_requests * 2, server now closes
the entire QUIC connection claiming excessive load.
The "hq" mode is HTTP/0.9-1.1 over QUIC. The following limits are introduced:
- uni streams are not allowed
- keepalive_requests is enforced
- keepalive_time is enforced
In case of error, QUIC connection is finalized with 0x101 code. This code
corresponds to HTTP/3 General Protocol Error.
Previously, in-flight byte counter and congestion window were properly
maintained, but the limit was not properly implemented.
Now a new datagram is sent only if in-flight byte counter is less than window.
The limit is datagram-based, which means that a single datagram may lead to
exceeding the limit, but the next one will not be sent.
Previously, the error was ignored leading to unnecessary retransmits.
Now, unsent frames are returned into output queue, state is reset, and
timer is started for the next send attempt.
As per quic-http-34:
Endpoints SHOULD create the HTTP control stream as well as the
unidirectional streams required by mandatory extensions (such as the
QPACK encoder and decoder streams) first, and then create additional
streams as allowed by their peer.
Previously, client could create and destroy additional uni streams unlimited
number of times before creating mandatory streams.
OpenSSL is known to provide read keys for an encryption level before the
level is active in TLS, following the old BoringSSL API. In BoringSSL,
it was then fixed to defer releasing read keys until QUIC may use them.
The directive enables usage of UDP segmentation offloading by quic.
By default, gso is disabled since it is not always operational when
detected (depends on interface configuration).
To improve output performance, UDP segmentation offloading is used
if available. If there is a significant amount of data in an output
queue and path is verified, QUIC packets are not sent one-by-one,
but instead are collected in a buffer, which is then passed to kernel
in a single sendmsg call, using UDP GSO. Such method greatly decreases
number of system calls and thus system load.
Additionally, the ngx_init_srcaddr_cmsg() function is introduced which
initializes control message with connection local address.
The NGX_HAVE_ADDRINFO_CMSG macro is defined when at least one of methods
to deal with corresponding control message is available.