syzkaller hit a WARN_ON_ONCE(!scm->pid) in scm_pidfd_recv().
In unix_stream_read_generic(), if there is no skb in the queue, we could
bail out the do-while loop without calling scm_set_cred():
1. No skb in the queue
2. sk is non-blocking
or
shutdown(sk, RCV_SHUTDOWN) is called concurrently
or
peer calls close()
If the socket is configured with SO_PASSPIDFD, scm_pidfd_recv() would
populate cmsg with garbage emitting the warning.
Let's skip SCM_PIDFD if scm->pid is NULL in scm_pidfd_recv().
Note another way would be skip calling scm_recv() in such cases, but this
caused a regression resulting in commit 9d797ee2dce1 ("Revert "af_unix:
Call scm_recv() only after scm_set_cred()."").
netlink: Add __sock_i_ino() for __netlink_diag_dump().
syzbot reported a warning in __local_bh_enable_ip(). [0]
Commit 8d61f926d420 ("netlink: fix potential deadlock in
netlink_set_err()") converted read_lock(&nl_table_lock) to
read_lock_irqsave() in __netlink_diag_dump() to prevent a deadlock.
However, __netlink_diag_dump() calls sock_i_ino() that uses
read_lock_bh() and read_unlock_bh(). If CONFIG_TRACE_IRQFLAGS=y,
read_unlock_bh() finally enables IRQ even though it should stay
disabled until the following read_unlock_irqrestore().
Using read_lock() in sock_i_ino() would trigger a lockdep splat
in another place that was fixed in commit f064af1e500a ("net: fix
a lockdep splat"), so let's add __sock_i_ino() that would be safe
to use under BH disabled.
Vladimir Oltean [Mon, 26 Jun 2023 15:44:02 +0000 (18:44 +0300)]
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
When using the felix driver (the only one which supports UC filtering
and MC filtering) as a DSA master for a random other DSA switch, one can
see the following stack trace when the downstream switch ports join a
VLAN-aware bridge:
What it's saying is that vlan_for_each() expects rtnl_lock() context and
it's not getting it, when it's called from the DSA master's ndo_set_rx_mode().
The caller of that - dsa_slave_set_rx_mode() - is the slave DSA
interface's dsa_port_bridge_host_fdb_add() which comes from the deferred
dsa_slave_switchdev_event_work().
We went to great lengths to avoid the rtnl_lock() context in that call
path in commit 0faf890fc519 ("net: dsa: drop rtnl_lock from
dsa_slave_switchdev_event_work"), and calling rtnl_lock() is simply not
an option due to the possibility of deadlocking when calling
dsa_flush_workqueue() from the call paths that do hold rtnl_lock() -
basically all of them.
So, when the DSA master calls vlan_for_each() from its ndo_set_rx_mode(),
the state of the 8021q driver on this device is really not protected
from concurrent access by anything.
Looking at net/8021q/, I don't think that vlan_info->vid_list was
particularly designed with RCU traversal in mind, so introducing an RCU
read-side form of vlan_for_each() - vlan_for_each_rcu() - won't be so
easy, and it also wouldn't be exactly what we need anyway.
In general I believe that the solution isn't in net/8021q/ anyway;
vlan_for_each() is not cut out for this task. DSA doesn't need rtnl_lock()
to be held per se - since it's not a netdev state change that we're
blocking, but rather, just concurrent additions/removals to a VLAN list.
We don't even need sleepable context - the callback of vlan_for_each()
just schedules deferred work.
The proposed escape is to remove the dependency on vlan_for_each() and
to open-code a non-sleepable, rtnl-free alternative to that, based on
copies of the VLAN list modified from .ndo_vlan_rx_add_vid() and
.ndo_vlan_rx_kill_vid().
David Howells [Tue, 27 Jun 2023 13:49:48 +0000 (14:49 +0100)]
libceph: Partially revert changes to support MSG_SPLICE_PAGES
Fix the mishandling of MSG_DONTWAIT and also reinstates the per-page
checking of the source pages (which might have come from a DIO write by
userspace) by partially reverting the changes to support MSG_SPLICE_PAGES
and doing things a little differently. In messenger_v1:
(1) The ceph_tcp_sendpage() is resurrected and the callers reverted to use
that.
(2) The callers now pass MSG_MORE unconditionally. Previously, they were
passing in MSG_MORE|MSG_SENDPAGE_NOTLAST and then degrading that to
just MSG_MORE on the last call to ->sendpage().
(3) Make ceph_tcp_sendpage() a wrapper around sendmsg() rather than
sendpage(), setting MSG_SPLICE_PAGES if sendpage_ok() returns true on
the page.
In messenger_v2:
(4) Bring back do_try_sendpage() and make the callers use that.
(5) Make do_try_sendpage() use sendmsg() for both cases and set
MSG_SPLICE_PAGES if sendpage_ok() is set.
Vladimir Oltean [Tue, 27 Jun 2023 13:42:35 +0000 (16:42 +0300)]
net: phy: mscc: fix packet loss due to RGMII delays
Two deadly typos break RX and TX traffic on the VSC8502 PHY using RGMII
if phy-mode = "rgmii-id" or "rgmii-txid", and no "tx-internal-delay-ps"
override exists. The negative error code from phy_get_internal_delay()
does not get overridden with the delay deduced from the phy-mode, and
later gets committed to hardware. Also, the rx_delay gets overridden by
what should have been the tx_delay.
Fixes: dbb050d2bfc8 ("phy: mscc: Add support for RGMII delay configuration") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Harini Katakam <harini.katakam@amd.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230627134235.3453358-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
v2: This series uses vmalloc_array and vcalloc instead of
array_size. It also leaves a multiplication of a constant by a
sizeof as is. Two patches are thus dropped from the series.
====================
Davide Tronchin [Mon, 26 Jun 2023 12:53:36 +0000 (14:53 +0200)]
net: usb: qmi_wwan: add u-blox 0x1312 composition
Add RmNet support for LARA-R6 01B.
The new LARA-R6 product variant identified by the "01B" string can be
configured (by AT interface) in three different USB modes:
* Default mode (Vendor ID: 0x1546 Product ID: 0x1311) with 4 serial
interfaces
* RmNet mode (Vendor ID: 0x1546 Product ID: 0x1312) with 4 serial
interfaces and 1 RmNet virtual network interface
* CDC-ECM mode (Vendor ID: 0x1546 Product ID: 0x1313) with 4 serial
interface and 1 CDC-ECM virtual network interface
The first 4 interfaces of all the 3 configurations (default, RmNet, ECM)
are the same.
In RmNet mode LARA-R6 01B exposes the following interfaces:
If 0: Diagnostic
If 1: AT parser
If 2: AT parser
If 3: AT parset/alternative functions
If 4: RMNET interface
Matthieu Baerts [Mon, 26 Jun 2023 09:02:39 +0000 (11:02 +0200)]
perf trace: fix MSG_SPLICE_PAGES build error
Our MPTCP CI and Stephen got this error:
In file included from builtin-trace.c:907:
trace/beauty/msg_flags.c: In function 'syscall_arg__scnprintf_msg_flags':
trace/beauty/msg_flags.c:28:21: error: 'MSG_SPLICE_PAGES' undeclared (first use in this function)
28 | if (flags & MSG_##n) { | ^~~~
trace/beauty/msg_flags.c:50:9: note: in expansion of macro 'P_MSG_FLAG'
50 | P_MSG_FLAG(SPLICE_PAGES);
| ^~~~~~~~~~
trace/beauty/msg_flags.c:28:21: note: each undeclared identifier is reported only once for each function it appears in
28 | if (flags & MSG_##n) { | ^~~~
trace/beauty/msg_flags.c:50:9: note: in expansion of macro 'P_MSG_FLAG'
50 | P_MSG_FLAG(SPLICE_PAGES);
| ^~~~~~~~~~
The fix is similar to what was done with MSG_FASTOPEN: the new macro is
defined if it is not defined in the system headers.
Paolo Abeni [Tue, 27 Jun 2023 10:47:43 +0000 (12:47 +0200)]
Merge tag 'nf-23-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Reset shift on Boyer-Moore string match for each block,
from Jeremy Sowden.
2) Fix acccess to non-linear area in DCCP conntrack helper,
from Florian Westphal.
3) Fix kernel-doc warnings, by Randy Dunlap.
4) Bail out if expires= does not show in SIP helper message,
or make ct_sip_parse_numerical_param() tristate and report
error if expires= cannot be parsed.
5) Unbind non-anonymous set in case rule construction fails.
6) Fix underflow in chain reference counter in case set element
already exists or it cannot be created.
Cambda Zhu [Mon, 26 Jun 2023 09:33:47 +0000 (17:33 +0800)]
ipvlan: Fix return value of ipvlan_queue_xmit()
ipvlan_queue_xmit() should return NET_XMIT_XXX, but
ipvlan_xmit_mode_l2/l3() returns rx_handler_result_t or NET_RX_XXX
in some cases. ipvlan_rcv_frame() will only return RX_HANDLER_CONSUMED
in ipvlan_xmit_mode_l2/l3() because 'local' is true. It's equal to
NET_XMIT_SUCCESS. But dev_forward_skb() can return NET_RX_SUCCESS or
NET_RX_DROP, and returning NET_RX_DROP(NET_XMIT_DROP) will increase
both ipvlan and ipvlan->phy_dev drops counter.
The skb to forward can be treated as xmitted successfully. This patch
makes ipvlan_queue_xmit() return NET_XMIT_SUCCESS for forward skb.
Jakub Kicinski [Mon, 26 Jun 2023 19:59:18 +0000 (12:59 -0700)]
Merge tag 'nf-next-23-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
1) Allow slightly larger IPVS connection table size from Kconfig for
64-bit arch, from Abhijeet Rastogi.
2) Since IPVS connection table might be larger than 2^20 after previous
patch, allow to limit it depending on the available memory.
Moreover, use kvmalloc. From Julian Anastasov.
3) Do not rebuild VLAN header in nft_payload when matching source and
destination MAC address.
4) Remove nested rcu read lock side in ip_set_test(), from Florian Westphal.
5) Allow to update set size, also from Florian.
6) Improve NAT tuple selection when connection is closing,
from Florian Westphal.
7) Support for resetting set element stateful expression, from Phil Sutter.
8) Use NLA_POLICY_MAX to narrow down maximum attribute value in nf_tables,
from Florian Westphal.
* tag 'nf-next-23-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: limit allowed range via nla_policy
netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET
netfilter: snat: evict closing tcp entries on reply tuple collision
netfilter: nf_tables: permit update of set size
netfilter: ipset: remove rcu_read_lock_bh pair from ip_set_test
netfilter: nft_payload: rebuild vlan header when needed
ipvs: dynamically limit the connection hash table
ipvs: increase ip_vs_conn_tab_bits range for 64BIT
====================
netfilter: nf_tables: fix underflow in chain reference counter
Set element addition error path decrements reference counter on chains
twice: once on element release and again via nft_data_release().
Then, d6b478666ffa ("netfilter: nf_tables: fix underflow in object
reference counter") incorrectly fixed this by removing the stateful
object reference count decrement.
Restore the stateful object decrement as in b91d90368837 ("netfilter:
nf_tables: fix leaking object reference count") and let
nft_data_release() decrement the chain reference counter, so this is
done only once.
Fixes: d6b478666ffa ("netfilter: nf_tables: fix underflow in object reference counter") Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ilia.Gavrilov [Fri, 23 Jun 2023 11:23:46 +0000 (11:23 +0000)]
netfilter: nf_conntrack_sip: fix the ct_sip_parse_numerical_param() return value.
ct_sip_parse_numerical_param() returns only 0 or 1 now.
But process_register_request() and process_register_response() imply
checking for a negative value if parsing of a numerical header parameter
failed.
The invocation in nf_nat_sip() looks correct:
if (ct_sip_parse_numerical_param(...) > 0 &&
...) { ... }
Make the return value of the function ct_sip_parse_numerical_param()
a tristate to fix all the cases
a) return 1 if value is found; *val is set
b) return 0 if value is not found; *val is unchanged
c) return -1 on error; *val is undefined
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Randy Dunlap [Fri, 23 Jun 2023 06:11:01 +0000 (23:11 -0700)]
linux/netfilter.h: fix kernel-doc warnings
kernel-doc does not support DECLARE_PER_CPU(), so don't mark it with
kernel-doc notation.
One comment block is not kernel-doc notation, so just use
"/*" to begin the comment.
Quietens these warnings:
netfilter.h:493: warning: Function parameter or member 'bool' not described in 'DECLARE_PER_CPU'
netfilter.h:493: warning: Function parameter or member 'nf_skb_duplicated' not described in 'DECLARE_PER_CPU'
netfilter.h:493: warning: expecting prototype for nf_skb_duplicated(). Prototype was for DECLARE_PER_CPU() instead
netfilter.h:496: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Contains bitmask of ctnetlink event subscribers, if any.
Fixes: e7c8899f3e6f ("netfilter: move tee_active to core") Fixes: fdf6491193e4 ("netfilter: ctnetlink: make event listener tracking global") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
And nothing more is 'pulled' from the packet, depending on the content.
dh->dccph_doff, and/or dh->dccph_x ...)
So dccp_ack_seq() is happily reading stuff past the _dh buffer.
BUG: KASAN: stack-out-of-bounds in nf_conntrack_dccp_packet+0x1134/0x11c0
Read of size 4 at addr ffff000128f66e0c by task syz-executor.2/29371
[..]
Fix this by increasing the stack buffer to also include room for
the extra sequence numbers and all the known dccp packet type headers,
then pull again after the initial validation of the basic header.
While at it, mark packets invalid that lack 48bit sequence bit but
where RFC says the type MUST use them.
Compile tested only.
v2: first skb_header_pointer() now needs to adjust the size to
only pull the generic header. (Eric)
Heads-up: I intend to remove dccp conntrack support later this year.
Fixes: 2bc780499aa3 ("[NETFILTER]: nf_conntrack: add DCCP protocol support") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jeremy Sowden [Mon, 19 Jun 2023 19:06:57 +0000 (20:06 +0100)]
lib/ts_bm: reset initial match offset for every block of text
The `shift` variable which indicates the offset in the string at which
to start matching the pattern is initialized to `bm->patlen - 1`, but it
is not reset when a new block is retrieved. This means the implemen-
tation may start looking at later and later positions in each successive
block and miss occurrences of the pattern at the beginning. E.g.,
consider a HTTP packet held in a non-linear skb, where the HTTP request
line occurs in the second block:
[... 52 bytes of packet headers ...]
GET /bmtest HTTP/1.1\r\nHost: www.example.com\r\n\r\n
and the pattern is "GET /bmtest".
Once the first block comprising the packet headers has been examined,
`shift` will be pointing to somewhere near the end of the block, and so
when the second block is examined the request line at the beginning will
be missed.
Reinitialize the variable for each new block.
Fixes: 8082e4ed0a61 ("[LIB]: Boyer-Moore extension for textsearch infrastructure strike #2") Link: https://bugzilla.netfilter.org/show_bug.cgi?id=1390 Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Lin Ma [Sun, 25 Jun 2023 09:10:07 +0000 (17:10 +0800)]
net: nfc: Fix use-after-free caused by nfc_llcp_find_local
This commit fixes several use-after-free that caused by function
nfc_llcp_find_local(). For example, one UAF can happen when below buggy
time window occurs.
The buggy address belongs to the object at ffff888105b0e400
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 16 bytes inside of
freed 1024-byte region [ffff888105b0e400, ffff888105b0e800)
Memory state around the buggy address: ffff888105b0e300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888105b0e380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888105b0e400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff888105b0e480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888105b0e500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
In summary, this patch solves those use-after-free by
1. Re-implement the nfc_llcp_find_local(). The current version does not
grab the reference when getting the local from the linked list. For
example, the llcp_sock_bind() gets the reference like below:
// llcp_sock_bind()
local = nfc_llcp_find_local(dev); // A
..... \
| raceable
..... /
llcp_sock->local = nfc_llcp_local_get(local); // B
There is an apparent race window that one can drop the reference
and free the local object fetched in (A) before (B) gets the reference.
2. Some callers of the nfc_llcp_find_local() do not grab the reference
at all. For example, the nfc_genl_llc_{{get/set}_params/sdreq} functions.
We add the nfc_llcp_local_put() for them. Moreover, we add the necessary
error handling function to put the reference.
3. Add the nfc_llcp_remove_local() helper. The local object is removed
from the linked list in local_release() when all reference is gone. This
patch removes it when nfc_llcp_unregister_device() is called.
Therefore, every caller of nfc_llcp_find_local() will get a reference
even when the nfc_llcp_unregister_device() is called. This promises no
use-after-free for the local object is ever possible.
Fixes: 52feb444a903 ("NFC: Extend netlink interface for LTO, RW, and MIUX parameters support") Fixes: c7aa12252f51 ("NFC: Take a reference on the LLCP local pointer when creating a socket") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 26 Jun 2023 09:36:48 +0000 (10:36 +0100)]
Merge branch 'sfc-next'
Edward Cree says:
====================
sfc: fix unaligned access in loopback selftests
Arnd reported that the sfc drivers each define a packed loopback_payload
structure with an ethernet header followed by an IP header, whereas the
kernel definition of iphdr specifies that this is 4-byte aligned,
causing a W=1 warning.
Fix this in each case by adding two bytes of leading padding to the
struct, taking care that these are not sent on the wire.
Tested on EF10; build-tested on Siena and Falcon.
Changed in v2:
* added __aligned(4) to payload struct definitions (Arnd)
* fixed dodgy whitespace (checkpatch)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 23 Jun 2023 18:38:06 +0000 (19:38 +0100)]
sfc: falcon: use padding to fix alignment in loopback test
Add two bytes of padding to the start of struct ef4_loopback_payload,
which are not sent on the wire. This ensures the 'ip' member is
4-byte aligned, preventing the following W=1 warning:
net/ethernet/sfc/falcon/selftest.c:43:15: error: field ip within 'struct ef4_loopback_payload' is less aligned than 'struct iphdr' and is usually due to 'struct ef4_loopback_payload' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access]
struct iphdr ip;
Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 23 Jun 2023 18:38:05 +0000 (19:38 +0100)]
sfc: siena: use padding to fix alignment in loopback test
Add two bytes of padding to the start of struct efx_loopback_payload,
which are not sent on the wire. This ensures the 'ip' member is
4-byte aligned, preventing the following W=1 warning:
net/ethernet/sfc/siena/selftest.c:46:15: error: field ip within 'struct efx_loopback_payload' is less aligned than 'struct iphdr' and is usually due to 'struct efx_loopback_payload' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access]
struct iphdr ip;
Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 23 Jun 2023 18:38:04 +0000 (19:38 +0100)]
sfc: use padding to fix alignment in loopback test
Add two bytes of padding to the start of struct efx_loopback_payload,
which are not sent on the wire. This ensures the 'ip' member is
4-byte aligned, preventing the following W=1 warning:
net/ethernet/sfc/selftest.c:46:15: error: field ip within 'struct efx_loopback_payload' is less aligned than 'struct iphdr' and is usually due to 'struct efx_loopback_payload' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access]
struct iphdr ip;
Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 23 Jun 2023 14:34:48 +0000 (15:34 +0100)]
sfc: fix crash when reading stats while NIC is resetting
efx_net_stats() (.ndo_get_stats64) can be called during an ethtool
selftest, during which time nic_data->mc_stats is NULL as the NIC has
been fini'd. In this case do not attempt to fetch the latest stats
from the hardware, else we will crash on a NULL dereference:
BUG: kernel NULL pointer dereference, address: 0000000000000038
RIP efx_nic_update_stats
abridged calltrace:
efx_ef10_update_stats_pf
efx_net_stats
dev_get_stats
dev_seq_printf_stats
Skipping the read is safe, we will simply give out stale stats.
To ensure that the free in efx_ef10_fini_nic() does not race against
efx_ef10_update_stats_pf(), which could cause a TOCTTOU bug, take the
efx->stats_lock in fini_nic (it is already held across update_stats).
Fixes: d3142c193dca ("sfc: refactor EF10 stats handling") Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Analogous to NFT_MSG_GETOBJ_RESET, but for set elements with a timeout
or attached stateful expressions like counters or quotas - reset them
all at once. Respect a per element timeout value if present to reset the
'expires' value to.
Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: snat: evict closing tcp entries on reply tuple collision
When all tried source tuples are in use, the connection request (skb)
and the new conntrack will be dropped in nf_confirm() due to the
non-recoverable clash.
Make it so that the last 32 attempts are allowed to evict a colliding
entry if this connection is already closing and the new sequence number
has advanced past the old one.
Such "all tuples taken" secenario can happen with tcp-rpc workloads where
same dst:dport gets queried repeatedly.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
====================
splice, net: Switch over users of sendpage() and remove it
Here's the final set of patches towards the removal of sendpage. All the
drivers that use sendpage() get switched over to using sendmsg() with
MSG_SPLICE_PAGES.
The following changes are made:
(1) Make the protocol drivers behave according to MSG_MORE, not
MSG_SENDPAGE_NOTLAST. The latter is restricted to turning on MSG_MORE
in the sendpage() wrappers.
(2) Fix ocfs2 to allocate its global protocol buffers with folio_alloc()
rather than kzalloc() so as not to invoke the !sendpage_ok warning in
skb_splice_from_iter().
(3) Make ceph/rds, skb_send_sock, dlm, nvme, smc, ocfs2, drbd and iscsi
use sendmsg(), not sendpage and make them specify MSG_MORE instead of
MSG_SENDPAGE_NOTLAST.
(4) Kill off sendpage and clean up MSG_SENDPAGE_NOTLAST.
David Howells [Fri, 23 Jun 2023 22:55:13 +0000 (23:55 +0100)]
net: Kill MSG_SENDPAGE_NOTLAST
Now that ->sendpage() has been removed, MSG_SENDPAGE_NOTLAST can be cleaned
up. Things were converted to use MSG_MORE instead, but the protocol
sendpage stubs still convert MSG_SENDPAGE_NOTLAST to MSG_MORE, which is now
unnecessary.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
cc: mptcp@lists.linux.dev
cc: rds-devel@oss.oracle.com
cc: tipc-discussion@lists.sourceforge.net
cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-17-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:12 +0000 (23:55 +0100)]
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with
MSG_SPLICE_PAGES should be used instead. This allows multiple pages and
multipage folios to be passed through.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
cc: mptcp@lists.linux.dev
cc: rds-devel@oss.oracle.com
cc: tipc-discussion@lists.sourceforge.net
cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:11 +0000 (23:55 +0100)]
ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
Switch ocfs2 from using sendpage() to using sendmsg() + MSG_SPLICE_PAGES so
that sendpage can be phased out.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Mark Fasheh <mark@fasheh.com>
cc: Joel Becker <jlbec@evilplan.org>
cc: Joseph Qi <joseph.qi@linux.alibaba.com>
cc: ocfs2-devel@oss.oracle.com Link: https://lore.kernel.org/r/20230623225513.2732256-15-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:10 +0000 (23:55 +0100)]
ocfs2: Fix use of slab data with sendpage
ocfs2 uses kzalloc() to allocate buffers for o2net_hand, o2net_keep_req and
o2net_keep_resp and then passes these to sendpage. This isn't really
allowed as the lifetime of slab objects is not controlled by page ref -
though in this case it will probably work. sendmsg() with MSG_SPLICE_PAGES
will, however, print a warning and give an error.
Fix it to use folio_alloc() instead to allocate a buffer for the handshake
message, keepalive request and reply messages.
Fixes: 98211489d414 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Mark Fasheh <mark@fasheh.com>
cc: Kurt Hackel <kurt.hackel@oracle.com>
cc: Joel Becker <jlbec@evilplan.org>
cc: Joseph Qi <joseph.qi@linux.alibaba.com>
cc: ocfs2-devel@oss.oracle.com Link: https://lore.kernel.org/r/20230623225513.2732256-14-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:09 +0000 (23:55 +0100)]
scsi: target: iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage. This allows
multiple pages and multipage folios to be passed through.
TODO: iscsit_fe_sendpage_sg() should perhaps set up a bio_vec array for the
entire set of pages it's going to transfer plus two for the header and
trailer and page fragments to hold the header and trailer - and then call
sendmsg once for the entire message.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Mike Christie <michael.christie@oracle.com>
cc: Maurizio Lombardi <mlombard@redhat.com>
cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
cc: "Martin K. Petersen" <martin.petersen@oracle.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: open-iscsi@googlegroups.com Link: https://lore.kernel.org/r/20230623225513.2732256-13-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:06 +0000 (23:55 +0100)]
smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES
Drop the smc_sendpage() code as smc_sendmsg() just passes the call down to
the underlying TCP socket and smc_tx_sendpage() is just a wrapper around
its sendmsg implementation.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Karsten Graul <kgraul@linux.ibm.com>
cc: Wenjia Zhang <wenjia@linux.ibm.com>
cc: Jan Karcher <jaka@linux.ibm.com>
cc: "D. Wythe" <alibuda@linux.alibaba.com>
cc: Tony Lu <tonylu@linux.alibaba.com>
cc: Wen Gu <guwen@linux.alibaba.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/r/20230623225513.2732256-10-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:05 +0000 (23:55 +0100)]
nvmet-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
copied instead of calling sendpage.
Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Willem de Bruijn <willemb@google.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-nvme@lists.infradead.org Link: https://lore.kernel.org/r/20230623225513.2732256-9-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:03 +0000 (23:55 +0100)]
dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather using
sendpage. This allows ->sendpage() to be replaced by something that can
handle multiple multipage folios in a single transaction.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christine Caulfield <ccaulfie@redhat.com>
cc: David Teigland <teigland@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: cluster-devel@redhat.com Link: https://lore.kernel.org/r/20230623225513.2732256-7-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:01 +0000 (23:55 +0100)]
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/r/20230623225513.2732256-5-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:55:00 +0000 (23:55 +0100)]
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/r/20230623225513.2732256-4-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Fri, 23 Jun 2023 22:54:59 +0000 (23:54 +0100)]
net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage in
skb_send_sock(). This causes pages to be spliced from the source iterator
if possible.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Note that this could perhaps be improved to fill out a bvec array with all
the frags and then make a single sendmsg call, possibly sticking the header
on the front also.
As MSG_SENDPAGE_NOTLAST is being phased out along with sendpage(), don't
use it further in than the sendpage methods, but rather translate it to
MSG_MORE and use that instead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Leon Romanovsky <leon@kernel.org>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: David Ahern <dsahern@kernel.org>
cc: Karsten Graul <kgraul@linux.ibm.com>
cc: Wenjia Zhang <wenjia@linux.ibm.com>
cc: Jan Karcher <jaka@linux.ibm.com>
cc: "D. Wythe" <alibuda@linux.alibaba.com>
cc: Tony Lu <tonylu@linux.alibaba.com>
cc: Wen Gu <guwen@linux.alibaba.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: Steffen Klassert <steffen.klassert@secunet.com>
cc: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/20230623225513.2732256-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 24 Jun 2023 22:48:04 +0000 (15:48 -0700)]
Merge tag 'mlx5-updates-2023-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-06-21
mlx5 driver minor cleanup and fixes to net-next
* tag 'mlx5-updates-2023-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Remove pointless vport lookup from mlx5_esw_check_port_type()
net/mlx5: Remove redundant check from mlx5_esw_query_vport_vhca_id()
net/mlx5: Remove redundant is_mdev_switchdev_mode() check from is_ib_rep_supported()
net/mlx5: Remove redundant MLX5_ESWITCH_MANAGER() check from is_ib_rep_supported()
net/mlx5e: E-Switch, Fix shared fdb error flow
net/mlx5e: Remove redundant comment
net/mlx5e: E-Switch, Pass other_vport flag if vport is not 0
net/mlx5e: E-Switch, Use xarray for devcom paired device index
net/mlx5e: E-Switch, Add peer fdb miss rules for vport manager or ecpf
net/mlx5e: Use vhca_id for device index in vport rx rules
net/mlx5: Lag, Remove duplicate code checking lag is supported
net/mlx5: Fix error code in mlx5_is_reset_now_capable()
net/mlx5: Fix reserved at offset in hca_cap register
net/mlx5: Fix SFs kernel documentation error
net/mlx5: Fix UAF in mlx5_eswitch_cleanup()
====================
Jakub Kicinski [Sat, 24 Jun 2023 22:45:51 +0000 (15:45 -0700)]
Merge branch 'netlink-add-display-hint-to-ynl'
Donald Hunter says:
====================
netlink: add display-hint to ynl
Add a display-hint property to the netlink schema, to be used by generic
netlink clients as hints about how to display attribute values.
A display-hint on an attribute definition is intended for letting a
client such as ynl know that, for example, a u32 should be rendered as
an ipv4 address. The display-hint enumeration includes a small number of
networking domain-specific value types.
====================
Donald Hunter [Fri, 23 Jun 2023 20:19:26 +0000 (21:19 +0100)]
netlink: specs: add display-hint to schema definitions
Add a display-hint property to the netlink schema that is for providing
optional hints to generic netlink clients about how to display attribute
values. A display-hint on an attribute definition is intended for
letting a client such as ynl know that, for example, a u32 should be
rendered as an ipv4 address. The display-hint enumeration includes a
small number of networking domain-specific value types.
Jakub Kicinski [Sat, 24 Jun 2023 22:41:46 +0000 (15:41 -0700)]
Merge tag 'ieee802154-for-net-next-2023-06-23' of gitolite.kernel.org:pub/scm/linux/kernel/git/wpan/wpan-next
Miquel Raynal says:
====================
Core WPAN changes:
- Support for active scans
- Support for answering BEACON_REQ
- Specific MLME handling for limited devices
WPAN driver changes:
- ca8210:
- Flag the devices as limited
- Remove stray gpiod_unexport() call
* tag 'ieee802154-for-net-next-2023-06-23' of gitolite.kernel.org:pub/scm/linux/kernel/git/wpan/wpan-next:
ieee802154: ca8210: Remove stray gpiod_unexport() call
ieee802154: ca8210: Flag the driver as being limited
net: ieee802154: Handle limited devices with only datagram support
mac802154: Handle received BEACON_REQ
ieee802154: Add support for allowing to answer BEACON_REQ
mac802154: Handle active scanning
ieee802154: Add support for user active scan requests
====================
Maxim Kochetkov [Thu, 22 Jun 2023 19:22:45 +0000 (22:22 +0300)]
net: axienet: Move reset before 64-bit DMA detection
64-bit DMA detection will fail if axienet was started before (by boot
loader, boot ROM, etc). In this state axienet will not start properly.
XAXIDMA_TX_CDESC_OFFSET + 4 register (MM2S_CURDESC_MSB) is used to detect
64-bit DMA capability here. But datasheet says: When DMACR.RS is 1
(axienet is in enabled state), CURDESC_PTR becomes Read Only (RO) and
is used to fetch the first descriptor. So iowrite32()/ioread32() trick
to this register to detect 64-bit DMA will not work.
So move axienet reset before 64-bit DMA detection.
Geliang Tang [Fri, 23 Jun 2023 17:34:13 +0000 (10:34 -0700)]
selftests: mptcp: add pm_nl_set_endpoint helper
This patch moves endpoint settings out of do_transfer() into a new
helper pm_nl_set_endpoint(). And invoke this helper in do_transfer().
This makes the code much more clearer.
Geliang Tang [Fri, 23 Jun 2023 17:34:12 +0000 (10:34 -0700)]
selftests: mptcp: drop sflags parameter
run_tests() accepts too many optional parameters. Before this modification,
it was required to set all of then when only the last one had to be
changed. That's not clear to see all these 0 and it makes the maintenance
harder:
run_tests $ns1 $ns2 10.0.1.1 1 2 3 slow
Instead, the parameter can be set as an env var with a limited scope:
Geliang Tang [Fri, 23 Jun 2023 17:34:11 +0000 (10:34 -0700)]
selftests: mptcp: drop addr_nr_ns1/2 parameters
run_tests() accepts too many optional parameters. Before this modification,
it was required to set all of then when only the last one had to be
changed. That's not clear to see all these 0 and it makes the maintenance
harder:
run_tests $ns1 $ns2 10.0.1.1 1 2 3 slow
Instead, the parameter can be set as an env var with a limited scope:
This patch switches to key/value "addr_nr_ns1=*, addr_nr_ns2=*" instead
of positional parameters addr_nr_ns1 and addr_nr_ns2 of do_transfer()
and run_tests().
Geliang Tang [Fri, 23 Jun 2023 17:34:10 +0000 (10:34 -0700)]
selftests: mptcp: drop test_linkfail parameter
run_tests() accepts too many optional parameters. Before this modification,
it was required to set all of then when only the last one had to be
changed. That's not clear to see all these 0 and it makes the maintenance
harder:
run_tests $ns1 $ns2 10.0.1.1 1 2 3 slow
Instead, the parameter can be set as an env var with a limited scope:
Geliang Tang [Fri, 23 Jun 2023 17:34:08 +0000 (10:34 -0700)]
selftests: mptcp: check subflow and addr infos
New MPTCP info are being checked in multiple places to improve the code
coverage when using the userspace PM.
This patch makes chk_mptcp_info() more generic to be able to check
subflows, add_addr_signal and add_addr_accepted info (and even more
later). New arguments are now required to get different infos from the
two namespaces because some counters are specific to the client or the
server.
Geliang Tang [Fri, 23 Jun 2023 17:34:07 +0000 (10:34 -0700)]
selftests: mptcp: test userspace pm out of transfer
This patch moves userspace pm tests out of do_transfer(). Move add address
test into a new function userspace_pm_add_addr(), and remove address test
into userspace_pm_rm_sf_addr_ns1(). Move add subflow test into
userspace_pm_add_sf() and remove subflow into
userspace_pm_rm_sf_addr_ns2().
====================
net: stmmac: introduce devres helpers for stmmac platform drivers
The goal of this series is two-fold: to make the API for stmmac platforms more
logically correct (by providing functions that acquire resources with release
counterparts that undo only their actions and nothing more) and to provide
devres variants of commonly use registration functions that allows to
significantly simplify the platform drivers.
The current pattern for stmmac platform drivers is to call
stmmac_probe_config_dt(), possibly the platform's init() callback and then
call stmmac_drv_probe(). The resources allocated by these calls will then
be released by calling stmmac_pltfr_remove(). This goes against the commonly
accepted way of providing each function that allocated a resource with a
function that frees it.
First: provide wrappers around platform's init() and exit() callbacks that
allow users to skip checking if the callbacks exist manually.
Second: provide stmmac_pltfr_probe() which calls the platform init() callback
and then calls stmmac_drv_probe() together with a variant of
stmmac_pltfr_remove() that DOES NOT call stmmac_remove_config_dt(). For now
this variant is called stmmac_pltfr_remove_no_dt() but once all users of
the old stmmac_pltfr_remove() are converted to the devres helper, it will be
renamed back to stmmac_pltfr_remove() and the no_dt function removed.
Finally use the devres helpers in dwmac-qco-ethqos to show how much simplier
the driver's probe() becomes.
This series obviously just starts the conversion process and other platform
drivers will need to be converted once the helpers land in net/.
====================
net: stmmac: dwmac-qco-ethqos: use devm_stmmac_probe_config_dt()
Significantly simplify the driver's probe() function by using the devres
variant of stmmac_probe_config_dt(). This allows to drop the goto jumps
entirely.
The remove_new() callback now needs to be switched to
stmmac_pltfr_remove_no_dt().
net: stmmac: platform: provide stmmac_pltfr_remove_no_dt()
Add a variant of stmmac_pltfr_remove() that only frees resources
allocated by stmmac_pltfr_probe() and - unlike stmmac_pltfr_remove() -
does not call stmmac_remove_config_dt().
net: stmmac: platform: provide stmmac_pltfr_probe()
Implement stmmac_pltfr_probe() which is the logical API counterpart
for stmmac_pltfr_remove(). It calls the platform's init() callback and
then probes the stmmac device.
Pavel Begunkov [Fri, 23 Jun 2023 12:38:55 +0000 (13:38 +0100)]
net/tcp: optimise locking for blocking splice
Even when tcp_splice_read() reads all it was asked for, for blocking
sockets it'll release and immediately regrab the socket lock, loop
around and break on the while check.
Check tss.len right after we adjust it, and return if we're done.
That saves us one release_sock(); lock_sock(); pair per successful
blocking splice read.
syzkaller reported use-after-free in __gtp_encap_destroy(). [0]
It shows the same process freed sk and touched it illegally.
Commit e198987e7dd7 ("gtp: fix suspicious RCU usage") added lock_sock()
and release_sock() in __gtp_encap_destroy() to protect sk->sk_user_data,
but release_sock() is called after sock_put() releases the last refcnt.
[0]:
BUG: KASAN: slab-use-after-free in instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
BUG: KASAN: slab-use-after-free in atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:541 [inline]
BUG: KASAN: slab-use-after-free in queued_spin_lock include/asm-generic/qspinlock.h:111 [inline]
BUG: KASAN: slab-use-after-free in do_raw_spin_lock include/linux/spinlock.h:186 [inline]
BUG: KASAN: slab-use-after-free in __raw_spin_lock_bh include/linux/spinlock_api_smp.h:127 [inline]
BUG: KASAN: slab-use-after-free in _raw_spin_lock_bh+0x75/0xe0 kernel/locking/spinlock.c:178
Write of size 4 at addr ffff88800dbef398 by task syz-executor.2/2401
The buggy address belongs to the object at ffff88800dbef300
which belongs to the cache UDPv6 of size 1344
The buggy address is located 152 bytes inside of
freed 1344-byte region [ffff88800dbef300, ffff88800dbef840)
Memory state around the buggy address: ffff88800dbef280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88800dbef300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88800dbef380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff88800dbef400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88800dbef480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Sabrina Dubroca [Thu, 22 Jun 2023 21:03:34 +0000 (23:03 +0200)]
selftests: rtnetlink: remove netdevsim device after ipsec offload test
On systems where netdevsim is built-in or loaded before the test
starts, kci_test_ipsec_offload doesn't remove the netdevsim device it
created during the test.
Eric Dumazet [Thu, 22 Jun 2023 18:15:03 +0000 (18:15 +0000)]
sch_netem: fix issues in netem_change() vs get_dist_table()
In blamed commit, I missed that get_dist_table() was allocating
memory using GFP_KERNEL, and acquiring qdisc lock to perform
the swap of newly allocated table with current one.
In this patch, get_dist_table() is allocating memory and
copy user data before we acquire the qdisc lock.
Then we perform swap operations while being protected by the lock.
Note that after this patch netem_change() no longer can do partial changes.
If an error is returned, qdisc conf is left unchanged.
Fixes: 2174a08db80d ("sch_netem: acquire qdisc lock in netem_change()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230622181503.2327695-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 24 Jun 2023 22:12:05 +0000 (15:12 -0700)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2023-06-22 (iavf)
This series contains updates to iavf driver only.
Przemek defers removing, previous, primary MAC address until after
getting result of adding its replacement. He also does some cleanup by
removing unused functions and making applicable functions static.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
iavf: make functions static where possible
iavf: remove some unused functions and pointless wrappers
iavf: fix err handling for MAC replace
====================
Randy Dunlap [Thu, 22 Jun 2023 15:54:09 +0000 (08:54 -0700)]
revert "s390/net: lcs: use IS_ENABLED() for kconfig detection"
The referenced patch is causing build errors when ETHERNET=y and
FDDI=m. While we work out the preferred patch(es), revert this patch
to make the pain go away.
Fixes: 128272336120 ("s390/net: lcs: use IS_ENABLED() for kconfig detection") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com>
Link: lore.kernel.org/r/202306202129.pl0AqK8G-lkp@intel.com Cc: Alexandra Winter <wintera@linux.ibm.com> Cc: Wenjia Zhang <wenjia@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230622155409.27311-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 24 Jun 2023 22:04:12 +0000 (15:04 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
igc: TX timestamping fixes
This is the fixes part of the series intended to add support for using
the 4 timestamp registers present in i225/i226.
Moving the timestamp handling to be inline with the interrupt handling
has the advantage of improving the TX timestamping retrieval latency,
here are some numbers using ntpperf:
During this series, and to show that as is always the case, things are
never easy as they should be, a hardware issue was found, and it took
some time to find the workaround(s). The bug and workaround are better
explained in patch 4/4.
Note: the workaround has a simpler alternative, but it would involve
adding support for the other timestamp registers, and only using the
TXSTMP{H/L}_0 as a way to clear the interrupt. But I feel bad about
throwing this kind of resources away. Didn't test this extensively but
it should work.
Also, as Marc Kleine-Budde suggested, after some consensus is reached
on this series, most parts of it will be proposed for igb.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: Work around HW bug causing missing timestamps
igc: Retrieve TX timestamp during interrupt handling
igc: Check if hardware TX timestamping is enabled earlier
igc: Fix race condition in PTP tx code
====================
We've added 49 non-merge commits during the last 24 day(s) which contain
a total of 70 files changed, 1935 insertions(+), 442 deletions(-).
The main changes are:
1) Extend bpf_fib_lookup helper to allow passing the route table ID,
from Louis DeLosSantos.
2) Fix regsafe() in verifier to call check_ids() for scalar registers,
from Eduard Zingerman.
3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and()
and a rework of bpf_cpumask_any*() kfuncs. Additionally,
add selftests, from David Vernet.
4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings,
from Gilad Sever.
5) Change bpf_link_put() to use workqueue unconditionally to fix it
under PREEMPT_RT, from Sebastian Andrzej Siewior.
6) Follow-ups to address issues in the bpf_refcount shared ownership
implementation, from Dave Marchevsky.
7) A few general refactorings to BPF map and program creation permissions
checks which were part of the BPF token series, from Andrii Nakryiko.
8) Various fixes for benchmark framework and add a new benchmark
for BPF memory allocator to BPF selftests, from Hou Tao.
9) Documentation improvements around iterators and trusted pointers,
from Anton Protopopov.
10) Small cleanup in verifier to improve allocated object check,
from Daniel T. Lee.
11) Improve performance of bpf_xdp_pointer() by avoiding access
to shared_info when XDP packet does not have frags,
from Jesper Dangaard Brouer.
12) Silence a harmless syzbot-reported warning in btf_type_id_size(),
from Yonghong Song.
13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper,
from Jarkko Sakkinen.
14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS,
from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
bpf, docs: Document existing macros instead of deprecated
bpf, docs: BPF Iterator Document
selftests/bpf: Fix compilation failure for prog vrf_socket_lookup
selftests/bpf: Add vrf_socket_lookup tests
bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint
bpf: Factor out socket lookup functions for the TC hookpoint.
selftests/bpf: Set the default value of consumer_cnt as 0
selftests/bpf: Ensure that next_cpu() returns a valid CPU number
selftests/bpf: Output the correct error code for pthread APIs
selftests/bpf: Use producer_cnt to allocate local counter array
xsk: Remove unused inline function xsk_buff_discard()
bpf: Keep BPF_PROG_LOAD permission checks clear of validations
bpf: Centralize permissions checks for all BPF map types
bpf: Inline map creation logic in map_create() function
bpf: Move unprivileged checks into map_create() and bpf_prog_load()
bpf: Remove in_atomic() from bpf_link_put().
selftests/bpf: Verify that check_ids() is used for scalars in regsafe()
bpf: Verify scalar ids mapping in regsafe() using check_ids()
selftests/bpf: Check if mark_chain_precision() follows scalar ids
...
====================
The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.
As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.
The situation is going to be made better by implementing a range of replays
and post-hoc offloads.
This patch set lays the ground for replay of next hops. The particular
issue that it deals with is that currently, driver-specific bookkeeping for
next hops is hooked off RIF objects, which come and go across the lifetime
of a netdevice. We would rather keep these objects at an entity that
mirrors the lifetime of the netdevice itself. That way they are at hand and
can be offloaded when a RIF is eventually created.
To that end, with this patchset, mlxsw keeps a hash table of CRIFs:
candidate RIFs, persistent handles for netdevices that mlxsw deems
potentially interesting. The lifetime of a CRIF matches that of the
underlying netdevice, and thus a RIF can always assume a CRIF exists. A
CRIF is where next hops are kept, and when RIF is created, these next hops
can be easily offloaded. (Previously only the next hops created after the
RIF was created were offloaded.)
- Patches #1 and #2 are minor adjustments.
- In patches #3 and #4, add CRIF bookkeeping.
- In patch #5, link CRIFs to RIFs such that given a netdevice-backed RIF,
the corresponding CRIF is easy to look up.
- Patch #6 is a clean-up allowed by the previous patches
- Patches #7 and #8 move next hop tracking to CRIFs
No observable effects are intended as of yet. This will be useful once
there is support for RIF creation for netdevices that become mlxsw uppers,
which will come in following patch sets.
====================
Petr Machata [Thu, 22 Jun 2023 13:33:09 +0000 (15:33 +0200)]
mlxsw: spectrum_router: Track next hops at CRIFs
Move the list of next hops from struct mlxsw_sp_rif to mlxsw_sp_crif. The
reason is that eventually, next hops for mlxsw uppers should be offloaded
and unoffloaded on demand as a netdevice becomes an upper, or stops being
one. Currently, next hops are tracked at RIFs, but RIFs do not exist when a
netdevice is not an mlxsw uppers. CRIFs are kept track of throughout the
netdevice lifetime.
Correspondingly, track at each next hop not its RIF, but its CRIF (from
which a RIF can always be deduced).
Note that now that next hops are tracked at a CRIF, it is not necessary to
move each over to a new RIF when it is necessary to edit a RIF. Therefore
drop mlxsw_sp_nexthop_rif_migrate() and have mlxsw_sp_rif_migrate_destroy()
call mlxsw_sp_nexthop_rif_update() directly.
Petr Machata [Thu, 22 Jun 2023 13:33:08 +0000 (15:33 +0200)]
mlxsw: spectrum_router: Split nexthop finalization to two stages
Nexthop finalization consists of two steps: the part where the offload is
removed, because the backing RIF is now gone; and the part where the
association to the RIF is severed.
Extract from mlxsw_sp_nexthop_type_fini() a helper that covers the
unoffloading part, mlxsw_sp_nexthop_type_rif_gone(), so that it can later
be called independently.
Note that this swaps around the ordering of mlxsw_sp_nexthop_ipip_fini()
vs. mlxsw_sp_nexthop_rif_fini(). The current ordering is more of a
historical happenstance than a conscious decision. The two cleanups do not
depend on each other, and this change should have no observable effects.
Petr Machata [Thu, 22 Jun 2023 13:33:07 +0000 (15:33 +0200)]
mlxsw: spectrum_router: Use router.lb_crif instead of .lb_rif_index
A previous patch added a pointer to loopback CRIF to the router data
structure. That makes the loopback RIF index redundant, as everything
necessary can be derived from the CRIF. Drop the field and adjust the code
accordingly.
Petr Machata [Thu, 22 Jun 2023 13:33:06 +0000 (15:33 +0200)]
mlxsw: spectrum_router: Link CRIFs to RIFs
When a RIF is about to be created, the registration of the netdevice that
it should be associated with must have been seen in the past, and a CRIF
created. Therefore make this a hard requirement by looking up the CRIF
during RIF creation, and complaining loudly when there isn't one.
This then allows to keep a link between a RIF and its corresponding
CRIF (and back, as the relationship is one-to-at-most-one), which do.
The CRIF will later be useful as the objects tracked there will be
offloaded lazily as a result of RIF creation.
CRIFs are created when an "interesting" netdevice is registered, and
destroyed after such device is unregistered. CRIFs are supposed to already
exist when a RIF creation request arises, and exist at least as long as
that RIF exists. This makes for a simple invariant: it is always safe to
dereference CRIF pointer from "its" RIF.
To guarantee this, CRIFs cannot be removed immediately when the UNREGISTER
event is delivered. The reason is that if a RIF's netdevices has an IPv6
address, removal of this address is notified in an atomic block. To remove
the RIF, the IPv6 removal handler schedules a work item. It must be safe
for this work item to access the associated CRIF as well.
Thus when a netdevice that backs the CRIF is removed, if it still has a
RIF, do not actually free the CRIF, only toggle its can_destroy flag, which
this patch adds. Later on, mlxsw_sp_rif_destroy() collects the CRIF.