* erofs: support unaligned data decompression  (Gao Xiang, 2021-12-31; 1 file changed, -8/+9)

Previously, compressed data was assumed to be block-aligned. This should be changed due to in-block tail-packing inline data.

Link: https://lore.kernel.org/r/20211228054604.114518-4-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

* erofs: introduce z_erofs_fixup_insize  (Gao Xiang, 2021-12-29; 3 files changed, -21/+36)

To prepare for the upcoming ztailpacking feature, introduce z_erofs_fixup_insize() and pageofs_in to wrap up the process to get the exact compressed size via zero padding.

Link: https://lore.kernel.org/r/20211228054604.114518-3-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

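The idea of recovering an exact payload size from a zero-padded region can be illustrated with a minimal userspace sketch. This is an assumption-level illustration only: the buffer layout below (zero bytes preceding the payload) and the helper name are invented for the example and are not the actual erofs on-disk format or the real z_erofs_fixup_insize() code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: skip leading zero padding to find the payload size. */
static size_t payload_size_after_zero_padding(const uint8_t *blk, size_t blksz)
{
	size_t pad = 0;

	while (pad < blksz && blk[pad] == 0)	/* count the zero padding */
		pad++;
	return blksz - pad;			/* remaining bytes are payload */
}

int main(void)
{
	uint8_t block[16] = { 0, 0, 0, 0,	/* padding */
			      0xf1, 0x22, 0x33, 0x44, 1, 2, 3, 4, 5, 6, 7, 8 };

	printf("payload: %zu bytes\n",
	       payload_size_after_zero_padding(block, sizeof(block)));
	return 0;
}
```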
* erofs: tidy up z_erofs_lz4_decompress  (Gao Xiang, 2021-12-29; 1 file changed, -39/+44)

To prepare for the upcoming ztailpacking feature and further cleanups, introduce a unique z_erofs_lz4_decompress_ctx to keep the context, including inpages, outpages and oend, which are frequently used by the lz4 decompressor.

No logic changes.

Link: https://lore.kernel.org/r/20211228054604.114518-2-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

* erofs: clean up erofs_map_blocks tracepoints  (Gao Xiang, 2021-12-09; 2 files changed, -24/+19)

Since the new chunk-based file format has been introduced, there is no need to keep the flatmode tracepoints. Rename them to erofs_map_blocks instead.

Link: https://lore.kernel.org/r/20211209012918.30337-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

* erofs: Replace zero-length array with flexible-array member  (Gao Xiang, 2021-12-08; 1 file changed, -2/+2)

There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members" [1] for these cases. The older style of one-element or zero-length arrays should no longer be used [2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.15/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://lore.kernel.org/r/20211206121702.221331-1-hsiangkao@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

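A minimal sketch of the pattern being replaced, using a generic structure rather than the actual erofs one: the deprecated trailing zero-length array becomes a C99 flexible array member, and the allocation accounts for the trailing elements explicitly (in-kernel code would typically use struct_size() for the size calculation).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct blob_old {
	unsigned int len;
	unsigned char data[0];	/* deprecated zero-length array */
};

struct blob {
	unsigned int len;
	unsigned char data[];	/* flexible array member */
};

static struct blob *blob_alloc(const void *src, unsigned int len)
{
	/* header plus the trailing bytes; kernel code: struct_size(b, data, len) */
	struct blob *b = malloc(sizeof(*b) + len);

	if (!b)
		return NULL;
	b->len = len;
	memcpy(b->data, src, len);
	return b;
}

int main(void)
{
	struct blob *b = blob_alloc("hello", 5);

	if (b) {
		printf("%u trailing bytes\n", b->len);
		free(b);
	}
	return 0;
}
```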
* erofs: add sysfs node to control sync decompression strategy  (Huang Jianan, 2021-12-08; 5 files changed, -7/+55)

Although readpage is a synchronous path, there will be no additional kworker scheduling overhead in non-atomic contexts together with dm-verity. Let's add a sysfs node to disable sync decompression as an option.

Link: https://lore.kernel.org/r/20211206143552.8384-1-huangjianan@oppo.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

* erofs: add sysfs interface  (Huang Jianan, 2021-12-08; 6 files changed, -1/+279)

Add a sysfs interface to configure erofs-related parameters later.

Link: https://lore.kernel.org/r/20211201145436.4357-1-huangjianan@oppo.com
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

* erofs: rename lz4_0pading to zero_padding  (Huang Jianan, 2021-12-01; 3 files changed, -5/+5)

Renaming lz4_0padding to zero_padding globally since LZMA and later algorithms also need that.

Link: https://lore.kernel.org/r/20211112160935.19394-1-jnhuang95@gmail.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

* Linux 5.16-rc3  (Linus Torvalds, 2021-11-28; 1 file changed, -1/+1)

* Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost  (Linus Torvalds, 2021-11-28; 9 files changed, -76/+9)

Pull vhost, virtio, vdpa bugfixes from Michael Tsirkin:
 "Misc fixes all over the place.

  Revert of virtio used length validation series: the approach taken
  does not seem to work, breaking too many guests in the process. We'll
  need to do length validation using some other approach"

[ This merge also ends up reverting commit f7a36b03a732 ("vsock/virtio:
  suppress used length validation"), which came in through the networking
  tree in the meantime, and was part of that whole used length validation
  series  - Linus ]

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa_sim: avoid putting an uninitialized iova_domain
  vhost-vdpa: clean irqs before reseting vdpa device
  virtio-blk: modify the value type of num in virtio_queue_rq()
  vhost/vsock: cleanup removing `len` variable
  vhost/vsock: fix incorrect used length reported to the guest
  Revert "virtio_ring: validate used buffer length"
  Revert "virtio-net: don't let virtio core to validate used length"
  Revert "virtio-blk: don't let virtio core to validate used length"
  Revert "virtio-scsi: don't let virtio core to validate used buffer length"

| * vdpa_sim: avoid putting an uninitialized iova_domain  (Longpeng, 2021-11-24; 1 file changed, -2/+5)

The system will crash if we put an uninitialized iova_domain; this can happen when an error occurs before initializing the iova_domain in vdpasim_create().

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  ...
  RIP: 0010:__cpuhp_state_remove_instance+0x96/0x1c0
  ...
  Call Trace:
   <TASK>
   put_iova_domain+0x29/0x220
   vdpasim_free+0xd1/0x120 [vdpa_sim]
   vdpa_release_dev+0x21/0x40 [vdpa]
   device_release+0x33/0x90
   kobject_release+0x63/0x160
   vdpasim_create+0x127/0x2a0 [vdpa_sim]
   vdpasim_net_dev_add+0x7d/0xfe [vdpa_sim_net]
   vdpa_nl_cmd_dev_add_set_doit+0xe1/0x1a0 [vdpa]
   genl_family_rcv_msg_doit+0x112/0x140
   genl_rcv_msg+0xdf/0x1d0
   ...

So we must make sure the iova_domain is already initialized before putting it.

In addition, we may get the following warning in this case:

  WARNING: ... drivers/iommu/iova.c:344 iova_cache_put+0x58/0x70

So we must also make sure iova_cache_put() is invoked only if iova_cache_get() has already been invoked. Let's fix both together.

Cc: stable@vger.kernel.org
Fixes: 4080fc106750 ("vdpa_sim: use iova module to allocate IOVA addresses")
Signed-off-by: Longpeng <longpeng2@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20211124015215.119-1-longpeng2@huawei.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

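The general pattern behind this fix, shown as a hedged standalone sketch (the struct, flags, and printf stand-ins below are illustrative, not the vdpa_sim code): teardown should only undo the setup steps that actually ran, so the free path checks what was initialized before releasing it.

```c
#include <stdbool.h>
#include <stdio.h>

/* Track which setup steps completed so teardown stays balanced. */
struct sim_dev {
	bool cache_got;		/* iova_cache_get()-style step succeeded */
	bool domain_inited;	/* init_iova_domain()-style step ran */
};

static void sim_free(struct sim_dev *dev)
{
	if (dev->domain_inited)
		printf("put domain\n");		/* safe: it was initialized */
	if (dev->cache_got)
		printf("put cache\n");		/* balanced with the get */
}

int main(void)
{
	/* error happened before the iova setup: nothing must be released */
	struct sim_dev dev = { .cache_got = false, .domain_inited = false };

	sim_free(&dev);
	return 0;
}
```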
| * vhost-vdpa: clean irqs before reseting vdpa device  (Wu Zongyong, 2021-11-24; 1 file changed, -1/+1)

Vdpa devices should be reset after unsetting the irqs of virtqueues, or we will get errors when killing the qemu process:

  >> pi_update_irte: failed to update PI IRTE
  >> irq bypass consumer (token 0000000065102a43) unregistration fails: -22

Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
Link: https://lore.kernel.org/r/a2cb60cf73be9da5c4e6399242117d8818f975ae.1636946171.git.wuzongyong@linux.alibaba.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>

| * virtio-blk: modify the value type of num in virtio_queue_rq()  (Ye Guojin, 2021-11-24; 1 file changed, -1/+1)

This was found by coccicheck:

  ./drivers/block/virtio_blk.c, 334, 14-17, WARNING
  Unsigned expression compared with zero  num < 0

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Ye Guojin <ye.guojin@zte.com.cn>
Link: https://lore.kernel.org/r/20211117063955.160777-1-ye.guojin@zte.com.cn
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 02746e26c39e ("virtio-blk: avoid preallocating big SGL for data")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

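A minimal reproduction of the class of bug the coccicheck warning points at, independent of the virtio-blk code: when the variable holding a possibly negative return value is unsigned, the `< 0` error check is dead code; the fix is to use a signed type, as the commit does for `num`.

```c
#include <stdio.h>

static int fill_sgl(void)
{
	return -1;		/* stand-in for a helper that can fail */
}

int main(void)
{
	unsigned int num_broken = fill_sgl();	/* -1 wraps to a huge value */
	int num_fixed = fill_sgl();

	if (num_broken < 0)			/* always false: dead branch */
		printf("never reached\n");
	if (num_fixed < 0)
		printf("error handled\n");
	return 0;
}
```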
| * vhost/vsock: cleanup removing `len` variable  (Stefano Garzarella, 2021-11-24; 1 file changed, -5/+1)

We can increment `total_len` directly and remove `len` since it is no longer used for vhost_add_used().

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20211122163525.294024-3-sgarzare@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

| * vhost/vsock: fix incorrect used length reported to the guest  (Stefano Garzarella, 2021-11-24; 1 file changed, -1/+1)

The "used length" reported by calling vhost_add_used() must be the number of bytes written by the device (using "in" buffers). In vhost_vsock_handle_tx_kick() the device only reads the guest buffers (they are all "out" buffers), without writing anything, so we must pass 0 as the "used length" to comply with the virtio spec.

Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
Cc: stable@vger.kernel.org
Reported-by: Halil Pasic <pasic@linux.ibm.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20211122163525.294024-2-sgarzare@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>

| * Revert "virtio_ring: validate used buffer length"Michael S. Tsirkin2021-11-242-62/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 939779f5152d161b34f612af29e7dc1ac4472fcf. Attempts to validate length in the core did not work out: there turn out to exist multiple broken devices, and in particular legacy devices are known to be broken in this respect. We have ideas for handling this better in the next version but for now let's revert to a known good state to make sure drivers work for people. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| * Revert "virtio-net: don't let virtio core to validate used length"Michael S. Tsirkin2021-11-241-1/+0
| | | | | | | | | | | | | | | | | | This reverts commit 816625c13652cef5b2c49082d652875da6f2ad7a. Attempts to validate length in the core did not work out. We'll drop them, so revert the dependent changes in drivers. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| * Revert "virtio-blk: don't let virtio core to validate used length"Michael S. Tsirkin2021-11-241-1/+0
| | | | | | | | | | | | | | | | | | This reverts commit a40392edf1b2c7822bc0ce68413106661a9d4232. Attempts to validate length in the core did not work out. We'll drop them, so revert the dependent changes in drivers. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| * Revert "virtio-scsi: don't let virtio core to validate used buffer length"Michael S. Tsirkin2021-11-241-1/+0
| | | | | | | | | | | | | | | | | | This reverts commit c57911ebfbfe745cb95da2bcf547c5bae000590f. Attempts to validate length in the core did not work out. We'll drop them for now, so revert the dependent changes in drivers. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* | Merge tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2021-11-28; 1 file changed, -1/+1)

Pull x86 build fix from Thomas Gleixner:
 "A single fix for a missing __init annotation of prepare_command_line()"

* tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Mark prepare_command_line() __init

| * | x86/boot: Mark prepare_command_line() __init  (Borislav Petkov, 2021-11-24; 1 file changed, -1/+1)

Fix:

  WARNING: modpost: vmlinux.o(.text.unlikely+0x64d0): Section mismatch in reference \
   from the function prepare_command_line() to the variable .init.data:command_line
  The function prepare_command_line() references
  the variable __initdata command_line.
  This is often because prepare_command_line lacks a __initdata
  annotation or the annotation of command_line is wrong.

Apparently some toolchains make different inlining decisions.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YZySgpmBcNNM2qca@zn.tnic

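To make the section-mismatch idea concrete, here is a self-contained sketch. The `__init`/`__initdata` macros are defined locally so the file compiles outside the kernel (in the kernel they come from <linux/init.h>); the point is that data in the init section is discarded after boot, so a function referencing it must live in the init text section too, which is exactly what adding `__init` to prepare_command_line() does.

```c
#define __init		__attribute__((section(".init.text")))
#define __initdata	__attribute__((section(".init.data")))

/* data that would be freed after boot in a real kernel */
static char command_line[64] __initdata;

/* the fix: the referencing function is placed in .init.text as well */
static __init char *prepare_command_line(void)
{
	return command_line;
}

int main(void)
{
	return prepare_command_line() == command_line ? 0 : 1;
}
```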
* | Merge tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2021-11-28; 2 files changed, -4/+7)

Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix to ensure that there is no stale KASAN shadow
  state left on the idle task's stack when a CPU is brought up after it
  was brought down before"

* tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/scs: Reset task stack state in bringup_cpu()

| * | sched/scs: Reset task stack state in bringup_cpu()  (Mark Rutland, 2021-11-24; 2 files changed, -4/+7)

To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames. When shadow call stacks (SCS) are in use, the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack.

When a CPU is offlined then onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable.

We previously fixed the KASAN issue in commit:

  e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug")

... by removing any stale KASAN stack poison immediately prior to onlining a CPU.

Subsequently in commit:

  f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")

... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above.

We fixed SCS (but not KASAN) in commit:

  63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit")

... but as this runs in the context of the idle task being offlined it's potentially fragile.

To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). This ensures the idle task always has a consistent state when it is running, and removes the need to do so when exiting an idle task.

Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for the idle thread within init_idle(), as this was only necessary to handle hotplug cycles.

I've tested this on arm64 with:

* gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
* clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK

... offlining and onlining CPUs with:

  | while true; do
  |   for C in /sys/devices/system/cpu/cpu*/online; do
  |     echo 0 > $C;
  |     echo 1 > $C;
  |   done
  | done

Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/

* | Merge tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2021-11-28; 1 file changed, -0/+3)

Pull perf fix from Thomas Gleixner:
 "A single fix for perf to prevent it from sending SIGTRAP to another
  task from a trace point event as it's not possible to deliver a
  synchronous signal to a different task from there"

* tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Ignore sigtrap for tracepoints destined for other tasks

| * | perf: Ignore sigtrap for tracepoints destined for other tasks  (Marco Elver, 2021-11-23; 1 file changed, -0/+3)

syzbot reported that the warning in perf_sigtrap() fires, saying that the event's task does not match current:

 | WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
 | Modules linked in:
 | CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
 | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 | RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
 | RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
 | RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
 | ...
 | Call Trace:
 |  <IRQ>
 |  irq_work_single+0x106/0x220 kernel/irq_work.c:211
 |  irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
 |  irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
 |  __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
 |  sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
 |  </IRQ>
 |  <TASK>
 |  asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
 | RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
 | RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
 | ...
 |  coredump_task_exit kernel/exit.c:371 [inline]
 |  do_exit+0x1865/0x25c0 kernel/exit.c:771
 |  do_group_exit+0xe7/0x290 kernel/exit.c:929
 |  get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
 |  arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
 |  handle_signal_work kernel/entry/common.c:148 [inline]
 |  exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
 |  exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
 |  __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
 |  syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
 |  do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 |  entry_SYSCALL_64_after_hwframe+0x44/0xae

This shouldn't happen on x86, which has arch_irq_work_raise().

The test program sets up a perf event with sigtrap set to fire on the 'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().

This happened because the 'sched_wakeup' tracepoint also takes a task argument passed on to perf_tp_event(), which is used to deliver the event to that other task.

Since we cannot deliver synchronous signals to other tasks, skip an event if perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is set, which will avoid ever entering perf_sigtrap() for such events.

Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com

* | Merge tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2021-11-28; 1 file changed, -93/+89)

Pull locking fixes from Thomas Gleixner:
 "Two regression fixes for reader writer semaphores:

  - Plug a race in the lock handoff which is caused by inconsistency of
    the reader and writer path and can lead to corruption of the
    underlying counter.

  - down_read_trylock() is suboptimal when the lock is contended and
    multiple readers trylock concurrently. That's due to the initial
    value being read non-atomically which results in at least two
    compare exchange loops. Making the initial readout atomic reduces
    this significantly. With 40 readers, by 11% in a benchmark which
    enforces contention on mmap_sem"

* tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Optimize down_read_trylock() under highly contended case
  locking/rwsem: Make handoff bit handling more consistent

| * | locking/rwsem: Optimize down_read_trylock() under highly contended case  (Muchun Song, 2021-11-23; 1 file changed, -7/+4)

We found that a process with 10 thousand threads encountered a regression problem going from Linux v4.14 to Linux v5.4. It is a kind of workload which will concurrently allocate lots of memory in different threads sometimes. In this case, we will see down_read_trylock() as a high hotspot. Therefore, we suppose that rwsem has had a regression at least since Linux v5.4. In order to easily debug this problem, we wrote a simple benchmark to create a similar situation, like the following.

```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>

#include <cstdio>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>

volatile int mutex;

void trigger(int cpu, char* ptr, std::size_t sz)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

	while (mutex);

	for (std::size_t i = 0; i < sz; i += 4096) {
		*ptr = '\0';
		ptr += 4096;
	}
}

int main(int argc, char* argv[])
{
	std::size_t sz = 100;

	if (argc > 1)
		sz = atoi(argv[1]);

	auto nproc = std::thread::hardware_concurrency();
	std::vector<std::thread> thr;
	sz <<= 30;
	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE,
			 MAP_ANON | MAP_PRIVATE, -1, 0);
	assert(ptr != MAP_FAILED);
	char* cptr = static_cast<char*>(ptr);
	auto run = sz / nproc;
	run = (run >> 12) << 12;

	mutex = 1;

	for (auto i = 0U; i < nproc; ++i) {
		thr.emplace_back(std::thread([i, cptr, run]() {
			trigger(i, cptr, run);
		}));
		cptr += run;
	}

	rusage usage_start;
	getrusage(RUSAGE_SELF, &usage_start);
	auto start = std::chrono::system_clock::now();

	mutex = 0;

	for (auto& t : thr)
		t.join();

	rusage usage_end;
	getrusage(RUSAGE_SELF, &usage_end);
	auto end = std::chrono::system_clock::now();
	timeval utime;
	timeval stime;
	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
	printf("real: %lu\n",
	       std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());

	return 0;
}
```

The above program simply creates `nproc` threads and each of them tries to touch memory (trigger page faults) on a different CPU. Then we will see a profile similar to the following with `perf top`:

  25.55%  [kernel]  [k] down_read_trylock
  14.78%  [kernel]  [k] handle_mm_fault
  13.45%  [kernel]  [k] up_read
   8.61%  [kernel]  [k] clear_page_erms
   3.89%  [kernel]  [k] __do_page_fault

The hottest instruction, accounting for about 92% of down_read_trylock(), is the cmpxchg:

  91.89 │ lock cmpxchg %rdx,(%rdi)

Since the problem was found by migrating from Linux v4.14 to Linux v5.4, we easily found that commit ddb20d1d3aed ("locking/rwsem: Optimize down_read_trylock()") caused the regression. The reason is that the commit assumes the rwsem is not contended at all. But that is not always true for the mmap lock, which can be contended by thousands of threads. So most threads almost always need to run at least two "cmpxchg" instructions to acquire the lock. The overhead of the atomic operation is higher than that of non-atomic instructions, which caused the regression.

Using the above benchmark, the real execution times on an x86-64 system before and after the patch were:

                Before Patch  After Patch
  # of Threads      real         real      reduced by
  ------------     ------       ------     ----------
        1          65,373       65,206        ~0.0%
        4          15,467       15,378        ~0.5%
       40           6,214        5,528       ~11.0%

For the uncontended case, the new down_read_trylock() is the same as before. For the contended cases, the new down_read_trylock() is faster than before. The more contended, the faster.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com

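The effect described above can be sketched with C11 atomics. This is not the kernel's rwsem code (the real counter also encodes writer and handoff bits); it only contrasts the two trylock strategies: starting the compare-exchange from an assumed-uncontended value versus reading the counter first, and counts how many cmpxchg attempts each needs when readers are already present.

```c
#include <stdatomic.h>
#include <stdio.h>

#define READER_BIAS 1L	/* one reader increment; negative means writer-owned here */

/* Old approach: assume the lock is uncontended and cmpxchg from zero. */
static int trylock_assume_uncontended(atomic_long *cnt, int *cmpxchgs)
{
	long expected = 0;

	do {
		(*cmpxchgs)++;
		if (atomic_compare_exchange_weak(cnt, &expected,
						 expected + READER_BIAS))
			return 1;
	} while (expected >= 0);	/* retry while no writer holds it */
	return 0;
}

/* New approach: read the current value first, then cmpxchg from it. */
static int trylock_read_first(atomic_long *cnt, int *cmpxchgs)
{
	long expected = atomic_load(cnt);

	do {
		if (expected < 0)	/* writer-owned: give up */
			return 0;
		(*cmpxchgs)++;
	} while (!atomic_compare_exchange_weak(cnt, &expected,
					       expected + READER_BIAS));
	return 1;
}

int main(void)
{
	atomic_long cnt = 3 * READER_BIAS;	/* three readers already hold it */
	int old_tries = 0, new_tries = 0;

	trylock_assume_uncontended(&cnt, &old_tries);
	trylock_read_first(&cnt, &new_tries);
	printf("contended acquire: old=%d cmpxchg(s), new=%d cmpxchg(s)\n",
	       old_tries, new_tries);
	return 0;
}
```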
| * | locking/rwsem: Make handoff bit handling more consistent  (Waiman Long, 2021-11-23; 1 file changed, -86/+85)

There are some inconsistencies in the way that the handoff bit is being handled in readers and writers that lead to a race condition.

Firstly, when a queue head writer sets the handoff bit, it will clear it when the writer is being killed or interrupted on its way out without acquiring the lock. That is not the case for a queue head reader. The handoff bit will simply be inherited by the next waiter.

Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both the waiter and handoff bits are cleared if the wait queue becomes empty. For rwsem_down_write_slowpath(), however, the handoff bit is not checked and cleared if the wait queue is empty. This can potentially make the handoff bit set with an empty wait queue.

Worse, the situation in rwsem_down_write_slowpath() relies on wstate, a variable set outside of the critical section containing the ->count manipulation; this leads to a race condition where RWSEM_FLAG_HANDOFF can be double subtracted, corrupting ->count.

To make the handoff bit handling more consistent and robust, extract out the handoff bit clearing code into the new rwsem_del_waiter() helper function. Also, completely eradicate wstate; always evaluate everything inside the same critical section.

The common function will only use atomic_long_andnot() to clear bits when the wait queue is empty to avoid a possible race condition. If the first waiter with the handoff bit set is killed or interrupted to exit the slowpath without acquiring the lock, the next waiter will inherit the handoff bit.

While at it, simplify the trylock for loop in rwsem_down_write_slowpath() to make it easier to read.

Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com

* | Merge tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds, 2021-11-28; 1 file changed, -1/+1)

Pull another tracing fix from Steven Rostedt:
 "Fix the fix of pid filtering

  The setting of the pid filtering flag tested the 'trace only this
  pid' case twice, and ignored the 'trace everything but this pid'
  case.

  The 5.15 kernel does things a little differently due to the new
  sparse pid mask introduced in 5.16, and as the bug was discovered
  running the 5.15 kernel, and the first fix was initially done for
  that kernel, that fix handled both cases (only pid and all but pid),
  but the forward port to 5.16 created this bug"

* tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Test the 'Do not trace this pid' case in create event

| * | tracing: Test the 'Do not trace this pid' case in create event  (Steven Rostedt (VMware), 2021-11-27; 1 file changed, -1/+1)

When creating a new event (via a module, kprobe, eprobe, etc), the descriptors that are created must add flags for pid filtering if an instance has pid filtering enabled, as the flags are used at the time the event is executed to know whether pid filtering should be done or not.

The "Only trace this pid" case was added, but a cut-and-paste error caused that case to be checked twice, instead of checking the "Trace all but this pid" case.

Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/
Fixes: 6cb206508b62 ("tracing: Check pid filtering when creating events")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

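A generic sketch of the cut-and-paste bug class described above (the struct and field names are illustrative, not the actual tracing code): the filtering flag must be set when either pid list is active, but the broken version tests the same list twice and silently ignores the "all but this pid" list.

```c
#include <stdbool.h>
#include <stdio.h>

struct instance {
	bool only_pids;		/* "only trace this pid" list is non-empty */
	bool exclude_pids;	/* "trace all but this pid" list is non-empty */
};

static bool needs_pid_filter_broken(const struct instance *tr)
{
	return tr->only_pids || tr->only_pids;		/* second check pasted wrong */
}

static bool needs_pid_filter_fixed(const struct instance *tr)
{
	return tr->only_pids || tr->exclude_pids;	/* checks both lists */
}

int main(void)
{
	struct instance tr = { .only_pids = false, .exclude_pids = true };

	printf("broken: %d, fixed: %d\n",
	       needs_pid_filter_broken(&tr), needs_pid_filter_fixed(&tr));
	return 0;
}
```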
* | | Merge tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu  (Linus Torvalds, 2021-11-28; 5 files changed, -17/+10)

Pull iommu fixes from Joerg Roedel:

 - Intel VT-d fixes:
     - Remove unused PASID_DISABLED
     - Fix RCU locking
     - Fix for the unmap_pages call-back

 - Rockchip RK3568 address mask fix

 - AMD IOMMUv2 log message clarification

* tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Fix unmap_pages support
  iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()
  iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568
  iommu/amd: Clarify AMD IOMMUv2 initialization messages
  iommu/vt-d: Remove unused PASID_DISABLED

| * | | iommu/vt-d: Fix unmap_pages support  (Alex Williamson, 2021-11-26; 1 file changed, -4/+2)

When supporting only the .map and .unmap callbacks of iommu_ops, the IOMMU driver can make assumptions about the size and alignment used for mappings based on the driver-provided pgsize_bitmap. VT-d previously used essentially PAGE_MASK for this bitmap as any power-of-two mapping was acceptably filled by native page sizes.

However, with the .map_pages and .unmap_pages interface we're now getting page-size and count arguments. If we simply combine these as (page-size * count) and make use of the previous map/unmap functions internally, any size and alignment assumptions are very different.

As an example, a given vfio device assignment VM will often create a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff]. On a system that does not support IOMMU super pages, the unmap_pages interface will ask to unmap 1024 4KB pages at the base IOVA. dma_pte_clear_level() will recurse down to level 2 of the page table where the first half of the pfn range exactly matches the entire pte level. We clear the pte, increment the pfn by the level size, but (oops) the next pte is on a new page, so we exit the loop and pop back up a level. When we then update the pfn based on that higher level, we seem to assume that the previous pfn value was at the start of the level. In this case the level size is 256K pfns, which we add to the base pfn and get a result of 0x7fe00, which is clearly greater than 0x401ff, so we're done. Meanwhile we never cleared the ptes for the remainder of the range. When the VM remaps this range, we're overwriting valid ptes and the VT-d driver complains loudly, as reported by the user report linked below.

The fix for this seems relatively simple: if each iteration of the loop in dma_pte_clear_level() is assumed to clear to the end of the level pte page, then our next pfn should be calculated from level_pfn rather than our working pfn.

Fixes: 3f34f1259776 ("iommu/vt-d: Implement map/unmap_pages() iommu_ops callback")
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Link: https://lore.kernel.org/all/20211002124012.18186-1-ajaygargnsit@gmail.com/
Link: https://lore.kernel.org/r/163659074748.1617923.12716161410774184024.stgit@omen
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

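The pfn arithmetic from that description, worked through in a small standalone program. The numbers come from the commit text (unmapping pfns 0x3fe00-0x401ff at a page-table level that covers 0x40000, i.e. 256K, pfns); the variable names follow the commit's wording rather than the driver source.

```c
#include <stdio.h>

int main(void)
{
	unsigned long start_pfn = 0x3fe00, last_pfn = 0x401ff;
	unsigned long level_size = 0x40000;			/* 256K pfns */
	unsigned long level_pfn = start_pfn & ~(level_size - 1);	/* 0x00000 */

	unsigned long buggy_next = start_pfn + level_size;	/* working pfn + size */
	unsigned long fixed_next = level_pfn + level_size;	/* level start + size */

	printf("buggy next pfn 0x%lx > 0x%lx: loop exits, rest never cleared\n",
	       buggy_next, last_pfn);
	printf("fixed next pfn 0x%lx <= 0x%lx: remaining ptes get cleared\n",
	       fixed_next, last_pfn);
	return 0;
}
```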
| * | | iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()  (Christophe JAILLET, 2021-11-26; 1 file changed, -2/+3)

If we return -EOPNOTSUPP here, the rcu read lock remains held. Go through the end of the function instead, so that the missing 'rcu_read_unlock()' is called.

Fixes: 7afd7f6aa21a ("iommu/vt-d: Check FL and SL capability sanity in scalable mode")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/40cc077ca5f543614eab2a10e84d29dd190273f6.1636217517.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

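The locking pattern behind this fix, shown with a plain mutex so the example is self-contained and runnable (the driver uses rcu_read_lock()/rcu_read_unlock(), not a mutex): an early `return -EOPNOTSUPP` inside the locked region leaks the lock, while branching to a common exit label keeps it balanced.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int check_capability(int supported)
{
	int ret = 0;

	pthread_mutex_lock(&lock);
	if (!supported) {
		ret = -EOPNOTSUPP;
		goto out;	/* fixed: fall through to the unlock */
		/* buggy version: return -EOPNOTSUPP;  -- lock stays held */
	}
	/* ... inspect shared state under the lock ... */
out:
	pthread_mutex_unlock(&lock);
	return ret;
}

int main(void)
{
	printf("unsupported: %d\n", check_capability(0));
	printf("supported:   %d\n", check_capability(1));
	return 0;
}
```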
| * | | iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568  (Alex Bee, 2021-11-26; 1 file changed, -2/+2)

With the submission of the iommu driver for RK3568 a subtle bug was introduced: PAGE_DESC_HI_MASK1 and PAGE_DESC_HI_MASK2 have to be the other way around; the mix-up leads to random errors, especially when addresses beyond 32 bits are used. Fix it.

Fixes: c55356c534aa ("iommu: rockchip: Add support for iommu v2")
Signed-off-by: Alex Bee <knaerzche@gmail.com>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Dan Johansen <strit@manjaro.org>
Reviewed-by: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Link: https://lore.kernel.org/r/20211124021325.858139-1-knaerzche@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

| * | | iommu/amd: Clarify AMD IOMMUv2 initialization messages  (Joerg Roedel, 2021-11-26; 1 file changed, -3/+3)

The messages printed on the initialization of the AMD IOMMUv2 driver have caused some confusion in the past. Clarify the messages to lower the confusion in the future.

Cc: stable@vger.kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20211123105507.7654-3-joro@8bytes.org

| * | | iommu/vt-d: Remove unused PASID_DISABLED  (Joerg Roedel, 2021-11-26; 1 file changed, -6/+0)

The macro is unused after commit 00ecd5401349a, so it can be removed.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 00ecd5401349a ("iommu/vt-d: Clean up unused PASID updating functions")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org

* | | Merge tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd  (Linus Torvalds, 2021-11-27; 2 files changed, -18/+22)

Pull ksmbd fixes from Steve French:
 "Five ksmbd server fixes, four of them for stable:

  - memleak fix

  - fix for default data stream on filesystems that don't support xattr

  - error logging fix

  - session setup fix

  - minor doc cleanup"

* tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix memleak in get_file_stream_info()
  ksmbd: contain default data stream even if xattr is empty
  ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec()
  docs: filesystem: cifs: ksmbd: Fix small layout issues
  ksmbd: Fix an error handling path in 'smb2_sess_setup()'

| * | | ksmbd: fix memleak in get_file_stream_info()  (Namjae Jeon, 2021-11-25; 1 file changed, -1/+3)

Fix memleak in get_file_stream_info().

Fixes: 34061d6b76a4 ("ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests")
Cc: stable@vger.kernel.org # v5.15
Reported-by: Coverity Scan <scan-admin@coverity.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

| * | | ksmbd: contain default data stream even if xattr is empty  (Namjae Jeon, 2021-11-25; 1 file changed, -9/+9)

If xattrs are not supported, as on exfat or fat, the ksmbd server doesn't include the default data stream in the FILE_STREAM_INFORMATION response. This causes ppt or doc file update issues when the local filesystem is one of those. This patch moves the goto statement so the default stream is included.

Fixes: 9f6323311c70 ("ksmbd: add default data stream name in FILE_STREAM_INFORMATION")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

| * | | ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec()  (Namjae Jeon, 2021-11-25; 1 file changed, -1/+1)

During file transfers through a Windows client, this error message floods the log. The flood causes performance degradation and gives the false impression that the server has a problem.

Fixes: e294f78d3478 ("ksmbd: allow PROTECTED_DACL_SECINFO and UNPROTECTED_DACL_SECINFO addition information in smb2 set info security")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

| * | | docs: filesystem: cifs: ksmbd: Fix small layout issues  (Salvatore Bonaccorso, 2021-11-25; 1 file changed, -5/+5)

In some sentences there were missing spaces between words. Fix the wording of the item showing which prints are enabled, and add a space between the cat command and its argument.

Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steve French <sfrench@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: linux-cifs@vger.kernel.org
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

| * | | ksmbd: Fix an error handling path in 'smb2_sess_setup()'  (Christophe JAILLET, 2021-11-25; 1 file changed, -2/+4)

All the error handling paths of 'smb2_sess_setup()' lead to 'out_err', except for the new error handling path added by the commit given in the Fixes tag below.

Fix this error handling path to branch to 'out_err' as well.

Fixes: 0d994cd482ee ("ksmbd: add buffer validation in session setup")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>

* | | vmxnet3: Use generic Kconfig option for page size limit  (Guenter Roeck, 2021-11-27; 1 file changed, -3/+1)

Use the architecture-independent Kconfig option PAGE_SIZE_LESS_THAN_64KB to indicate that VMXNET3 requires a page size smaller than 64kB.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* | | fs: ntfs: Limit NTFS_RW to page sizes smaller than 64k  (Guenter Roeck, 2021-11-27; 1 file changed, -0/+1)

NTFS_RW code allocates page-size dependent arrays on the stack. This results in build failures if the page size is 64k or larger.

  fs/ntfs/aops.c: In function 'ntfs_write_mst_block':
  fs/ntfs/aops.c:1311:1: error: the frame size of 2240 bytes is larger than 2048 bytes

Since commit f22969a66041 ("powerpc/64s: Default to 64K pages for 64 bit book3s") this affects ppc:allmodconfig builds, but other architectures supporting page sizes of 64k or larger are also affected.

Increasing the maximum frame size for affected architectures just to silence this error does not really help. The frame size would have to be set to a really large value for 256k pages. Also, a large frame size could potentially result in stack overruns in this code and elsewhere and is therefore not desirable. Make NTFS_RW dependent on page sizes smaller than 64k instead.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

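A rough, assumption-level illustration of why the frame size grows with the page size: if a function keeps one pointer per potential 512-byte buffer attached to a page (as the MAX_BUF_PER_PAGE-style arrays in the NTFS code do), the on-stack footprint scales with PAGE_SIZE. The values below are computed by hand and are not taken from the kernel build.

```c
#include <stdio.h>

#define BLOCK_SIZE	512UL
#define PAGE_SIZE_4K	(4UL * 1024)
#define PAGE_SIZE_64K	(64UL * 1024)

static unsigned long frame_bytes(unsigned long page_size)
{
	/* one pointer per potential buffer head attached to the page */
	unsigned long max_buf_per_page = page_size / BLOCK_SIZE;

	return max_buf_per_page * sizeof(void *);
}

int main(void)
{
	printf("4k pages:  %lu bytes of on-stack pointers\n",
	       frame_bytes(PAGE_SIZE_4K));	/* 8 pointers  -> 64 bytes */
	printf("64k pages: %lu bytes of on-stack pointers\n",
	       frame_bytes(PAGE_SIZE_64K));	/* 128 pointers -> 1024 bytes */
	return 0;
}
```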
* | | arch: Add generic Kconfig option indicating page size smaller than 64k  (Guenter Roeck, 2021-11-27; 1 file changed, -0/+10)

NTFS_RW and VMXNET3 require a page size smaller than 64kB. Add a generic Kconfig option for use outside architecture code to avoid architecture-specific Kconfig options in that code.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* | | Merge tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds, 2021-11-27; 3 files changed, -34/+8)

Pull xfs fixes from Darrick Wong:
 "Fixes for a resource leak and a build robot complaint about totally
  dead code:

  - Fix buffer resource leak that could lead to livelock on corrupt fs.

  - Remove unused function xfs_inew_wait to shut up the build robots"

* tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove xfs_inew_wait
  xfs: Fix the free logic of state in xfs_attr_node_hasname

| * | | xfs: remove xfs_inew_wait  (Christoph Hellwig, 2021-11-24; 2 files changed, -24/+1)

With the removal of xfs_dqrele_all_inodes, xfs_inew_wait and all the infrastructure used to wake the XFS_INEW bit waitqueue are unused.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 777eb1fa857e ("xfs: remove xfs_dqrele_all_inodes")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

| * | | xfs: Fix the free logic of state in xfs_attr_node_hasname  (Yang Xu, 2021-11-24; 1 file changed, -10/+7)

When testing xfstests xfs/126 on the latest upstream kernel, it hangs on some machines. By adding a getxattr operation after the xattr is corrupted, I can reproduce it 100% of the time. The deadlock is as below:

  [983.923403] task:setfattr state:D stack: 0 pid:17639 ppid: 14687 flags:0x00000080
  [ 983.923405] Call Trace:
  [ 983.923410] __schedule+0x2c4/0x700
  [ 983.923412] schedule+0x37/0xa0
  [ 983.923414] schedule_timeout+0x274/0x300
  [ 983.923416] __down+0x9b/0xf0
  [ 983.923451] ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
  [ 983.923453] down+0x3b/0x50
  [ 983.923471] xfs_buf_lock+0x33/0xf0 [xfs]
  [ 983.923490] xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
  [ 983.923508] xfs_buf_get_map+0x4c/0x320 [xfs]
  [ 983.923525] xfs_buf_read_map+0x53/0x310 [xfs]
  [ 983.923541] ? xfs_da_read_buf+0xcf/0x120 [xfs]
  [ 983.923560] xfs_trans_read_buf_map+0x1cf/0x360 [xfs]
  [ 983.923575] ? xfs_da_read_buf+0xcf/0x120 [xfs]
  [ 983.923590] xfs_da_read_buf+0xcf/0x120 [xfs]
  [ 983.923606] xfs_da3_node_read+0x1f/0x40 [xfs]
  [ 983.923621] xfs_da3_node_lookup_int+0x69/0x4a0 [xfs]
  [ 983.923624] ? kmem_cache_alloc+0x12e/0x270
  [ 983.923637] xfs_attr_node_hasname+0x6e/0xa0 [xfs]
  [ 983.923651] xfs_has_attr+0x6e/0xd0 [xfs]
  [ 983.923664] xfs_attr_set+0x273/0x320 [xfs]
  [ 983.923683] xfs_xattr_set+0x87/0xd0 [xfs]
  [ 983.923686] __vfs_removexattr+0x4d/0x60
  [ 983.923688] __vfs_removexattr_locked+0xac/0x130
  [ 983.923689] vfs_removexattr+0x4e/0xf0
  [ 983.923690] removexattr+0x4d/0x80
  [ 983.923693] ? __check_object_size+0xa8/0x16b
  [ 983.923695] ? strncpy_from_user+0x47/0x1a0
  [ 983.923696] ? getname_flags+0x6a/0x1e0
  [ 983.923697] ? _cond_resched+0x15/0x30
  [ 983.923699] ? __sb_start_write+0x1e/0x70
  [ 983.923700] ? mnt_want_write+0x28/0x50
  [ 983.923701] path_removexattr+0x9b/0xb0
  [ 983.923702] __x64_sys_removexattr+0x17/0x20
  [ 983.923704] do_syscall_64+0x5b/0x1a0
  [ 983.923705] entry_SYSCALL_64_after_hwframe+0x65/0xca
  [ 983.923707] RIP: 0033:0x7f080f10ee1b

When getxattr calls the xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in xfs_attr_node_hasname because we have used blocktrash to corrupt the block in xfs/126. So the state is freed internally, and xfs_attr_node_get doesn't do the xfs_buf_trans release job. The subsequent removexattr then hangs because of it.

This bug was introduced by kernel commit 07120f1abdff ("xfs: Add xfs_has_attr and subroutines"). It adds the xfs_attr_node_hasname helper and says the caller will be responsible for freeing the state in this case. But xfs_attr_node_hasname frees the state itself instead of leaving it to the caller if xfs_da3_node_lookup_int fails.

Fix this bug by moving the step of freeing the state into the caller. Also, use "goto error/out" instead of returning the error directly in the xfs_attr_node_addname_find_attr and xfs_attr_node_removename_setup functions, because we should free the state ourselves.

Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* | | Merge tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds, 2021-11-27; 1 file changed, -10/+16)

Pull iomap fixes from Darrick Wong:
 "A single iomap bug fix and a cleanup for 5.16-rc2.

  The bug fix changes how iomap deals with reading from an inline data
  region -- whereas the current code (incorrectly) lets the iomap read
  iter try for more bytes after reading the inline region (which zeroes
  the rest of the page!) and hopes the next iteration terminates, we
  surveyed the inlinedata implementations and realized that all
  inlinedata implementations also require that the inlinedata region
  end at EOF, so we can simply terminate the read.

  The second patch documents these assumptions in the code so that
  they're not subtle implications anymore, and cleans up some of the
  grosser parts of that function.

  Summary:

  - Fix an accounting problem where unaligned inline data reads can
    run off the end of the read iomap iterator. iomap has historically
    required that inline data mappings only exist at the end of a
    file, though this wasn't documented anywhere.

  - Document iomap_read_inline_data and change its return type to be
    appropriate for the information that it's actually returning"

* tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: iomap_read_inline_data cleanup
  iomap: Fix inline extent handling in iomap_readpage

| * | | iomap: iomap_read_inline_data cleanup  (Andreas Gruenbacher, 2021-11-24; 1 file changed, -16/+15)

Change iomap_read_inline_data to return 0 or an error code; this simplifies the callers. Add a description.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: document the return value of iomap_read_inline_data explicitly]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
