| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Having ->async_data doesn't mean it's initialised and previously we vere
relying on setting F_CLEANUP at the right moment. With zc sendmsg
though, we set F_CLEANUP early in prep when we alloc a notif and so we
may allocate async_data, fail in copy_msg_hdr() leaving
struct io_async_msghdr not initialised correctly but with F_CLEANUP
set, which causes a ->free_iov double free and probably other nastiness.
Always initialise ->free_iov. Also, now it might point to fast_iov when
fails, so avoid freeing it during cleanups.
Reported-by: syzbot+edfd15cd4246a3fc615a@syzkaller.appspotmail.com
Fixes: 493108d95f146 ("io_uring/net: zerocopy sendmsg")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on. If we guarantee that we don't reorder for
the unlikely event of a CQ ring overflow, then we can further extend
this to not have to terminate multishot requests if it happens. For
other operations, like zerocopy sends, we have no choice but to honor
CQE ordering.
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We should not assume anything about ->free_iov just from
REQ_F_ASYNC_DATA but rather rely on REQ_F_NEED_CLEANUP, as we may
allocate ->async_data but failed init would leave the field in not
consistent state. The easiest solution is to remove removing
REQ_F_NEED_CLEANUP and so ->async_data dealloc from io_sendrecv_fail()
and let io_send_zc_cleanup() do the job. The catch here is that we also
need to prevent double notif flushing, just test it for NULL and zero
where it's needed.
BUG: KASAN: use-after-free in io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
Write of size 8 at addr ffff8880771b4080 by task syz-executor.3/30199
CPU: 1 PID: 30199 Comm: syz-executor.3 Not tainted 6.0.0-rc6-next-20220923-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:284 [inline]
print_report+0x15e/0x45d mm/kasan/report.c:395
kasan_report+0xbb/0x1f0 mm/kasan/report.c:495
io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
io_req_complete_failed+0x155/0x1b0 io_uring/io_uring.c:873
io_drain_req io_uring/io_uring.c:1648 [inline]
io_queue_sqe_fallback.cold+0x29f/0x788 io_uring/io_uring.c:1931
io_submit_sqe io_uring/io_uring.c:2160 [inline]
io_submit_sqes+0x1180/0x1df0 io_uring/io_uring.c:2276
__do_sys_io_uring_enter+0xac6/0x2410 io_uring/io_uring.c:3216
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: c4c0009e0b56e ("io_uring/net: combine fail handlers")
Reported-by: syzbot+4c597a574a3f5a251bda@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/23ab8346e407ea50b1198a172c8a97e1cf22915b.1663945875.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
io_uring will run task_work from contexts that have been prepared for
waiting, and in doing so it'll implicitly set the task running again
to avoid issues with blocking conditions. The new deferred local
task_work doesn't do that, which can result in spews on this being
an invalid condition:
[ 112.917576] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ad64af64>] prepare_to_wait_exclusive+0x3f/0xd0
[ 112.983088] WARNING: CPU: 1 PID: 190 at kernel/sched/core.c:9819 __might_sleep+0x5a/0x60
[ 112.987240] Modules linked in:
[ 112.990504] CPU: 1 PID: 190 Comm: io_uring Not tainted 6.0.0-rc6+ #1617
[ 113.053136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 113.133650] RIP: 0010:__might_sleep+0x5a/0x60
[ 113.136507] Code: ee 48 89 df 5b 31 d2 5d e9 33 ff ff ff 48 8b 90 30 0b 00 00 48 c7 c7 90 de 45 82 c6 05 20 8b 79 01 01 48 89 d1 e8 3a 49 77 00 <0f> 0b eb d1 66 90 0f 1f 44 00 00 9c 58 f6 c4 02 74 35 65 8b 05 ed
[ 113.223940] RSP: 0018:ffffc90000537ca0 EFLAGS: 00010286
[ 113.232903] RAX: 0000000000000000 RBX: ffffffff8246782c RCX: ffffffff8270bcc8
IOPS=133.15K, BW=520MiB/s, IOS/call=32/31
[ 113.353457] RDX: ffffc90000537b50 RSI: 00000000ffffdfff RDI: 0000000000000001
[ 113.358970] RBP: 00000000000003bc R08: 0000000000000000 R09: c0000000ffffdfff
[ 113.361746] R10: 0000000000000001 R11: ffffc90000537b48 R12: ffff888103f97280
[ 113.424038] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
[ 113.428009] FS: 00007f67ae7fc700(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[ 113.432794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 113.503186] CR2: 00007f67b8b9b3b0 CR3: 0000000102b9b005 CR4: 0000000000770ee0
[ 113.507291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 113.512669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 113.574374] PKRU: 55555554
[ 113.576800] Call Trace:
[ 113.578325] <TASK>
[ 113.579799] set_page_dirty_lock+0x1b/0x90
[ 113.582411] __bio_release_pages+0x141/0x160
[ 113.673078] ? set_next_entity+0xd7/0x190
[ 113.675632] blk_rq_unmap_user+0xaa/0x210
[ 113.678398] ? timerqueue_del+0x2a/0x40
[ 113.679578] nvme_uring_task_cb+0x94/0xb0
[ 113.683025] __io_run_local_work+0x8a/0x150
[ 113.743724] ? io_cqring_wait+0x33d/0x500
[ 113.746091] io_run_local_work.part.76+0x2e/0x60
[ 113.750091] io_cqring_wait+0x2e7/0x500
[ 113.752395] ? trace_event_raw_event_io_uring_req_failed+0x180/0x180
[ 113.823533] __x64_sys_io_uring_enter+0x131/0x3c0
[ 113.827382] ? switch_fpu_return+0x49/0xc0
[ 113.830753] do_syscall_64+0x34/0x80
[ 113.832620] entry_SYSCALL_64_after_hwframe+0x5e/0xc8
Ensure that we mark current as TASK_RUNNING for deferred task_work
as well.
Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Reported-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add a zerocopy version of sendmsg.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aabc4bdfc0ec78df6ec9328137e394af9d4e7ef.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Merge io_send_zc_fail() into io_sendrecv_fail(), saves a few lines of
code and some headache for following patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e0eba1d577413aef5602cd45f588b9230207082d.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Simple renaming of io_sendzc*() functions in preparatio to adding
a zerocopy sendmsg variant.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/265af46829e6076dd220011b1858dc3151969226.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We have normal sends, but what is missing is sendto-like requests. Add
sendto() capabilities to IORING_OP_SEND by passing in addr just as we do
for IORING_OP_SEND_ZC.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/69fbd8b2cb830e57d1bf9ec351e9bf95c5b77e3f.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Instead of passing the right address into io_setup_async_addr() only
specify local on-stack storage and let the function infer where to grab
it from. It optimises out one local variable we have to deal with.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6bfa9ab810d776853eb26ed59301e2536c3a5471.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Partial zc send may end up in io_req_complete_failed(), which not only
would return invalid result but also mask out the notification leading
to lifetime issues.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5673285b5e83e6ceca323727b4ddaa584b5cc91e.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Just as with rw, partial send/recv may end up in
io_req_complete_failed() and loose the result, make sure we return the
number of bytes processed.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4ff95897b5419356fca9ea55db91ac15b2975f9.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
A partially done read/write may end up in io_req_complete_failed() and
loose the result, make sure we return the number of bytes processed.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/05e0879c226bcd53b441bf92868eadd4bf04e2fc.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Sometimes we have to do a little bit of a fixup on a request failuer in
io_req_complete_failed(). Add a callback in opdef for that.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b734cff4e67cb30cca976b9face321023f37549a.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If we have doubly sized SQEs, then we need to shift the sq index by 1
to account for using two entries for a single request. The CQE dumping
gets this right, but the SQE one does not.
Improve the SQE dumping in general, the information dumped is pretty
sparse and doesn't even cover the whole basic part of the SQE. Include
information on the extended part of the SQE, if doubly sized SQEs are
in use. A typical dump now looks like the following:
[...]
SQEs: 32
32: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2721, e0:0x0, e1:0xffffb8041000, e2:0x100000000000, e3:0x5500, e4:0x7, e5:0x0, e6:0x0, e7:0x0
33: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2722, e0:0x0, e1:0xffffb8043000, e2:0x100000000000, e3:0x5508, e4:0x7, e5:0x0, e6:0x0, e7:0x0
34: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2723, e0:0x0, e1:0xffffb8045000, e2:0x100000000000, e3:0x5510, e4:0x7, e5:0x0, e6:0x0, e7:0x0
[...]
Fixes: ebdeb7c01d02 ("io_uring: add support for 128-byte SQEs")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We already have the cq_shift, just use that to tell if we have doubly
sized CQEs or not.
While in there, cleanup the CQE32 vs normal CQE size printing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We removed conditional io_commit_cqring_flush() guarding against
spurious eventfd and the io_disarm_next()'s return value is not used
anymore, just void it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a441c9a32a58bcc586076fa9a7d0dc33f1fb3cb.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We'll grab uring_lock and call __io_run_local_work() with several
atomics inside even if there are no task works. Skip it if ->work_llist
is empty.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/f6a885f372bad2d77d9cd87341b0a86a4000c0ff.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Let's keep checks for whether to break the iopoll loop or not same for
normal and defer tw, this includes ->cached_cq_tail checks guarding
against polling more than asked for.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2fa8a44f8114f55a4807528da438cde93815360.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We may propagate a positive return value of io_run_task_work() out of
io_iopoll_check(), which breaks our tests. io_run_task_work() doesn't
return anything useful for us, ignore the return value.
Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/c442bb87f79cea10b3f857cbd4b9a4f0a0493fa3.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set,
but if nothing has been submitted yet it'll allow any waiter, which
violates the contract.
Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In case of DEFER_TASK_WORK we try to restrict waiters to only one task,
which is also the only submitter; however, we don't do it reliably,
which might be very confusing and backfire in the future. E.g. we
currently allow multiple tasks in io_iopoll_check().
Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Reuse struct io_sr_msg for zerocopy sends, which is handy. There is
only one zerocopy specific field, namely .notif, and we have enough
space for it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/408c5b1b2d8869e1a12da5f5a78ed72cac112149.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In preparation for using struct io_sr_msg for zerocopy sends, clean up
types. First, flags can be u16 as it's provided by the userspace in u16
ioprio, as well as addr_len. This saves us 4 bytes. Also use unsigned
for size and done_io, both are as well limited to u32.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/42c2639d6385b8b2181342d2af3a42d3b1c5bcd2.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add a sg_from_iter() for when we initiate non-bvec zerocopy sends, which
helps us to remove some extra steps from io_sg_from_iter(). The only
thing the new function has to do before giving control away to
__zerocopy_sg_from_iter() is to check if the skb has managed frags and
downgrade them if so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cda3dea0d36f7931f63a70f350130f085ac3f3dd.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We already keep io_async_msghdr caches for normal send/recv requests,
use them also for zerocopy send.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/42fa615b6e0be25f47a685c35d7b5e4f1b03d348.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
send/recv have async_data caches but there are only used from within
issue handlers. Extend their use also to ->prep_async, should be handy
with links and IOSQE_ASYNC.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9a2264b807582a97ed606c5bfcdc2399384e8a5.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We should prioritise send/recv retry cases over failures, they're more
important. Shuffle -ERESTARTSYS after we handled retries.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d9059691b30d0963b7269fa4a0c81ee7720555e6.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There is one place when we forgot to change hand coded spin locking with
io_cq_lock(), change it to be more consistent. Note, the unlock part is
already __io_cq_unlock_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91699b9a00a07128f7ca66136bdbbfc67a64659e.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Request referencing has changed a while ago and there is no notion left
of submission/completion references, kill an outdated comment.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/38902e7229d68cecd62702436d627d4858b0d9d4.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In commit 934447a603b2 ("io_uring: do not recycle buffer in READV") a
temporary fix was put in io_kbuf_recycle to simply never recycle READV
buffers.
Instead of that, rather treat READV with REQ_F_BUFFER_SELECTED the same as
a READ with REQ_F_BUFFER_SELECTED. Since READV requires iov_len of 1 they
are essentially the same.
In order to do this inside io_prep_rw() add some validation to check that
it is in fact only length 1, and also extract the length of the buffer at
prep time.
This allows removal of the io_iov_buffer_select codepaths as they are only
used from the READV op.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220907165152.994979-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Combine the two checks we have for task_work running and whether or not
we need to shuffle the mutex into one, so we unify how task_work is run
in the iopoll loop. This helps ensure that local task_work is run when
needed, and also optimizes that path to avoid a mutex shuffle if it's
not needed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We have a few spots that drop the mutex just to run local task_work,
which immediately tries to grab it again. Add a helper that just passes
in whether we're locked already.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
After the addition of iopoll support for passthrough, there's a bit of
a mixup here. Clean it up and get rid of the casting for the passthrough
command type.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Put this up in the same way as iopoll is done for regular read/write IO.
Make place for storing a cookie into struct io_uring_cmd on submission.
Perform the completion using the ->uring_cmd_iopoll handler.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add tracing for io_run_local_task_work
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Some workloads rely on a registered eventfd (via
io_uring_register_eventfd(3)) in order to wake up and process the
io_uring.
In the case of a ring setup with IORING_SETUP_DEFER_TASKRUN, that eventfd
also needs to be signalled when there are tasks to run.
This changes an old behaviour which assumed 1 eventfd signal implied at
least 1 CQE, however only when this new flag is set (and so old users will
not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN
flag as it is not guaranteed that every task will result in a CQE.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-7-dylany@fb.com
[axboe: fold in call_rcu() serialization fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Non functional change: move this function above io_eventfd_signal so it
can be used from there
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
io_uring_setup time. This functionality requires that the later
io_uring_enter will be called from the same submission task, and therefore
restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
set.
Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.
For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.
The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.
This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.
There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.
[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100"
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This is not needed, and it is normally better to wait for task work until
after submissions. This will allow greater batching if either work arrives
in the meanwhile, or if the submissions cause task work to be queued up.
For SQPOLL this also no longer runs task work, but this is handled inside
the SQPOLL loop anyway.
For IOPOLL io_iopoll_check will run task work anyway
And otherwise io_cqring_wait will run task work
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This will be used later to know if the ring has outstanding work. Right
now just if there is overflow CQEs to copy to the main CQE ring, but later
will include deferred tasks
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
'running' is set once and read once, so can easily just remove it
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pull vfs constification updates from Al Viro:
"whack-a-mole: constifying struct path *"
* tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ecryptfs: constify path
spufs: constify path
nd_jump_link(): constify path
audit_init_parent(): constify path
__io_setxattr(): constify path
do_proc_readlink(): constify path
overlayfs: constify path
fs/notify: constify path
may_linkat(): constify path
do_sys_name_to_handle(): constify path
->getprocattr(): attribute name is const char *, TYVM...
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|\ \ \ \ \
| |_|_|_|/
|/| | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pull io_uring fixes from Jens Axboe:
"Two fixes that should go into 6.0:
- Tweak the single issuer logic to register the task at creation,
rather than at first submit. SINGLE_ISSUER was added for 6.0, and
after some discussion on this, we decided to make it a bit stricter
while it's still possible to do so (Dylan).
- Stefan from Samba had some doubts on the level triggered poll that
was added for this release. Rather than attempt to mess around with
it now, just do the quick one-liner to disable it for release and
we have time to discuss and change it for 6.1 instead (me)"
* tag 'io_uring-6.0-2022-09-29' of git://git.kernel.dk/linux:
io_uring/poll: disable level triggered poll
io_uring: register single issuer task at creation
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Stefan reports that there are issues with the level triggered
notification. Since we're late in the cycle, and it was introduced for
the 6.0 release, just disable it at prep time and we can bring this
back when Samba is happy with it.
Reported-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Instead of picking the task from the first submitter task, rather use the
creator task or in the case of disabled (IORING_SETUP_R_DISABLED) the
enabling task.
This approach allows a lot of simplification of the logic here. This
removes init logic from the submission path, which can always be a bit
confusing, but also removes the need for locking to write (or read) the
submitter_task.
Users that want to move a ring before submitting can create the ring
disabled and then enable it on the submitting task.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a444 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\| | | |
| |_|/ /
|/| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull io_uring fix from Jens Axboe:
"Just a single fix for an issue with un-reaped IOPOLL requests on ring
exit"
* tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux:
io_uring: ensure that cached task references are always put on exit
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.
One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.
Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.
Cc: stable@vger.kernel.org
Fixes: e98e49b2bbf7 ("io_uring: extend task put optimisations")
Reported-by: Homin Rhee <hominlab@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull io_uring fixes from Jens Axboe:
"Nothing really major here, but figured it'd be nicer to just get these
flushed out for -rc6 so that the 6.1 branch will have them as well.
That'll make our lives easier going forward in terms of development,
and avoid trivial conflicts in this area.
- Simple trace rename so that the returned opcode name is consistent
with the enum definition (Stefan)
- Send zc rsrc request vs notification lifetime fix (Pavel)"
* tag 'io_uring-6.0-2022-09-18' of git://git.kernel.dk/linux:
io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
io_uring/net: fix zc fixed buf lifetime
|