aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* powerpc/bpf/32: Fix Oops on tail call testsChristophe Leroy2022-11-241-31/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | test_bpf tail call tests end up as: test_bpf: #0 Tail call leaf jited:1 85 PASS test_bpf: #1 Tail call 2 jited:1 111 PASS test_bpf: #2 Tail call 3 jited:1 145 PASS test_bpf: #3 Tail call 4 jited:1 170 PASS test_bpf: #4 Tail call load/store leaf jited:1 190 PASS test_bpf: #5 Tail call load/store jited:1 BUG: Unable to handle kernel data access on write at 0xf1b4e000 Faulting instruction address: 0xbe86b710 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash PowerMac Modules linked in: test_bpf(+) CPU: 0 PID: 97 Comm: insmod Not tainted 6.1.0-rc4+ #195 Hardware name: PowerMac3,1 750CL 0x87210 PowerMac NIP: be86b710 LR: be857e88 CTR: be86b704 REGS: f1b4df20 TRAP: 0300 Not tainted (6.1.0-rc4+) MSR: 00009032 <EE,ME,IR,DR,RI> CR: 28008242 XER: 00000000 DAR: f1b4e000 DSISR: 42000000 GPR00: 00000001 f1b4dfe0 c11d2280 00000000 00000000 00000000 00000002 00000000 GPR08: f1b4e000 be86b704 f1b4e000 00000000 00000000 100d816a f2440000 fe73baa8 GPR16: f2458000 00000000 c1941ae4 f1fe2248 00000045 c0de0000 f2458030 00000000 GPR24: 000003e8 0000000f f2458000 f1b4dc90 3e584b46 00000000 f24466a0 c1941a00 NIP [be86b710] 0xbe86b710 LR [be857e88] __run_one+0xec/0x264 [test_bpf] Call Trace: [f1b4dfe0] [00000002] 0x2 (unreliable) Instruction dump: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX ---[ end trace 0000000000000000 ]--- This is a tentative to write above the stack. The problem is encoutered with tests added by commit 38608ee7b690 ("bpf, tests: Add load store test case for tail call") This happens because tail call is done to a BPF prog with a different stack_depth. At the time being, the stack is kept as is when the caller tail calls its callee. But at exit, the callee restores the stack based on its own properties. Therefore here, at each run, r1 is erroneously increased by 32 - 16 = 16 bytes. This was done that way in order to pass the tail call count from caller to callee through the stack. As powerpc32 doesn't have a red zone in the stack, it was necessary the maintain the stack as is for the tail call. But it was not anticipated that the BPF frame size could be different. Let's take a new approach. Use register r4 to carry the tail call count during the tail call, and save it into the stack at function entry if required. This means the input parameter must be in r3, which is more correct as it is a 32 bits parameter, then tail call better match with normal BPF function entry, the down side being that we move that input parameter back and forth between r3 and r4. That can be optimised later. Doing that also has the advantage of maximising the common parts between tail calls and a normal function exit. With the fix, tail call tests are now successfull: test_bpf: #0 Tail call leaf jited:1 53 PASS test_bpf: #1 Tail call 2 jited:1 115 PASS test_bpf: #2 Tail call 3 jited:1 154 PASS test_bpf: #3 Tail call 4 jited:1 165 PASS test_bpf: #4 Tail call load/store leaf jited:1 101 PASS test_bpf: #5 Tail call load/store jited:1 141 PASS test_bpf: #6 Tail call error path, max count reached jited:1 994 PASS test_bpf: #7 Tail call count preserved across function calls jited:1 140975 PASS test_bpf: #8 Tail call error path, NULL target jited:1 110 PASS test_bpf: #9 Tail call error path, index out of range jited:1 69 PASS test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed] Suggested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Fixes: 51c66ad849a7 ("powerpc/bpf: Implement extended BPF on PPC32") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/757acccb7fbfc78efa42dcf3c974b46678198905.1669278887.git.christophe.leroy@csgroup.eu
* powerpc: Fix writable sections being moved into the rodata regionNicholas Piggin2022-11-161-1/+1
| | | | | | | | | | | | | .data.rel.ro* catches .data.rel.root_cpuacct, and the kernel crashes on a store in css_clear_dir. At least we know read-only data protection is working... Fixes: b6adc6d6d3272 ("powerpc/build: move .data.rel.ro, .sdata2 to read-only") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221116043954.3307852-1-npiggin@gmail.com
* powerpc/32: Select ARCH_SPLIT_ARG64Michael Ellerman2022-11-011-0/+1
| | | | | | | | | | | | | | | | | | | | On 32-bit kernels, 64-bit syscall arguments are split into two registers. For that to work with syscall wrappers, the prototype of the syscall must have the argument split so that the wrapper macro properly unpacks the arguments from pt_regs. The fanotify_mark() syscall is one such syscall, which already has a split prototype, guarded behind ARCH_SPLIT_ARG64. So select ARCH_SPLIT_ARG64 to get that prototype and fix fanotify_mark() on 32-bit kernels with syscall wrappers. Note also that fanotify_mark() is the only usage of ARCH_SPLIT_ARG64. Fixes: 7e92e01b7245 ("powerpc: Provide syscall wrapper") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221101034852.2340319-1-mpe@ellerman.id.au
* powerpc/32: fix syscall wrappers with 64-bit argumentsAndreas Schwab2022-11-013-3/+24
| | | | | | | | | | | | | | With the introduction of syscall wrappers all wrappers for syscalls with 64-bit arguments must be handled specially, not only those that have unaligned 64-bit arguments. This left out the fallocate() and sync_file_range2() syscalls. Fixes: 7e92e01b7245 ("powerpc: Provide syscall wrapper") Fixes: e23750623835 ("powerpc/32: fix syscall wrappers with 64-bit arguments of unaligned register-pairs") Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/87mt9cxd6g.fsf_-_@igel.home
* asm-generic: compat: fix compat_arg_u64() and compat_arg_u64_dual()Andreas Schwab2022-11-011-1/+1
| | | | | | | | | | | | | | | | | | | | The macros are defined backwards. This affects the following compat syscalls: - compat_sys_truncate64() - compat_sys_ftruncate64() - compat_sys_fallocate() - compat_sys_sync_file_range() - compat_sys_fadvise64_64() - compat_sys_readahead() - compat_sys_pread64() - compat_sys_pwrite64() Fixes: 43d5de2b67d7 ("asm-generic: compat: Support BE for long long args in 32-bit ABIs") Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> [mpe: Add list of affected syscalls] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/871qqoyvni.fsf_-_@igel.home
* powerpc/64e: Fix amdgpu build on Book3E w/o AltiVecMichael Ellerman2022-10-311-1/+1
| | | | | | | | | | | | | | | | | | There's a build failure for Book3E without AltiVec: Error: cc1: error: AltiVec not supported in this target make[6]: *** [/linux/scripts/Makefile.build:250: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/display_mode_lib.o] Error 1 This happens because the amdgpu build is only gated by PPC_LONG_DOUBLE_128, but that symbol can be enabled even though AltiVec is disabled. The only user of PPC_LONG_DOUBLE_128 is amdgpu, so just add a dependency on AltiVec to that symbol to fix the build. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221027125626.1383092-1-mpe@ellerman.id.au
* Merge tag 'v6.1-rc2' into fixesMichael Ellerman2022-10-31418-5066/+5471
|\ | | | | | | | | | | | | Merge rc2 into our fixes branch, which was based on rc1 but wasn't merged until rc3, so that for the remainder of the release our fixes branch will be based on rc2 for the purposes of resolving conflicts with other trees (if necessary).
| * Linux 6.1-rc2Linus Torvalds2022-10-231-1/+1
| |
| * Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2022-10-2318-99/+180
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "RISC-V: - Fix compilation without RISCV_ISA_ZICBOM - Fix kvm_riscv_vcpu_timer_pending() for Sstc ARM: - Fix a bug preventing restoring an ITS containing mappings for very large and very sparse device topology - Work around a relocation handling error when compiling the nVHE object with profile optimisation - Fix for stage-2 invalidation holding the VM MMU lock for too long by limiting the walk to the largest block mapping size - Enable stack protection and branch profiling for VHE - Two selftest fixes x86: - add compat implementation for KVM_X86_SET_MSR_FILTER ioctl selftests: - synchronize includes between include/uapi and tools/include/uapi" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: tools: include: sync include/api/linux/kvm.h KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTER KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter() kvm: Add support for arch compat vm ioctls RISC-V: KVM: Fix kvm_riscv_vcpu_timer_pending() for Sstc RISC-V: Fix compilation without RISCV_ISA_ZICBOM KVM: arm64: vgic: Fix exit condition in scan_its_table() KVM: arm64: nvhe: Fix build with profile optimization KVM: selftests: Fix number of pages for memory slot in memslot_modification_stress_test KVM: arm64: selftests: Fix multiple versions of GIC creation KVM: arm64: Enable stack protection and branch profiling for VHE KVM: arm64: Limit stage2_apply_range() batch size to largest block KVM: arm64: Work out supported block level at compile time
| | * tools: include: sync include/api/linux/kvm.hPaolo Bonzini2022-10-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Provide a definition of KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option") Cc: Marc Zyngier <maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTERAlexander Graf2022-10-221-0/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The KVM_X86_SET_MSR_FILTER ioctls contains a pointer in the passed in struct which means it has a different struct size depending on whether it gets called from 32bit or 64bit code. This patch introduces compat code that converts from the 32bit struct to its 64bit counterpart which then gets used going forward internally. With this applied, 32bit QEMU can successfully set MSR bitmaps when running on 64bit kernels. Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com> Fixes: 1a155254ff937 ("KVM: x86: Introduce MSR filtering") Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-4-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()Alexander Graf2022-10-221-14/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the next patch we want to introduce a second caller to set_msr_filter() which constructs its own filter list on the stack. Refactor the original function so it takes it as argument instead of reading it through copy_from_user(). Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-3-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * kvm: Add support for arch compat vm ioctlsAlexander Graf2022-10-222-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will introduce the first architecture specific compat vm ioctl in the next patch. Add all necessary boilerplate to allow architectures to override compat vm ioctls when necessary. Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-2-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * Merge tag 'kvm-riscv-fixes-6.1-1' of https://github.com/kvm-riscv/linux into ↵Paolo Bonzini2022-10-226-51/+57
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | HEAD KVM/riscv fixes for 6.1, take #1 - Fix compilation without RISCV_ISA_ZICBOM - Fix kvm_riscv_vcpu_timer_pending() for Sstc
| | | * RISC-V: KVM: Fix kvm_riscv_vcpu_timer_pending() for SstcAnup Patel2022-10-213-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kvm_riscv_vcpu_timer_pending() checks per-VCPU next_cycles and per-VCPU software injected VS timer interrupt. This function returns incorrect value when Sstc is available because the per-VCPU next_cycles are only updated by kvm_riscv_vcpu_timer_save() called from kvm_arch_vcpu_put(). As a result, when Sstc is available the VCPU does not block properly upon WFI traps. To fix the above issue, we introduce kvm_riscv_vcpu_timer_sync() which will update per-VCPU next_cycles upon every VM exit instead of kvm_riscv_vcpu_timer_save(). Fixes: 8f5cb44b1bae ("RISC-V: KVM: Support sstc extension") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
| | | * RISC-V: Fix compilation without RISCV_ISA_ZICBOMAndrew Jones2022-10-213-49/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | riscv_cbom_block_size and riscv_init_cbom_blocksize() should always be available and riscv_init_cbom_blocksize() should always be invoked, even when compiling without RISCV_ISA_ZICBOM enabled. This is because disabling RISCV_ISA_ZICBOM means "don't use zicbom instructions in the kernel" not "pretend there isn't zicbom, even when there is". When zicbom is available, whether the kernel enables its use with RISCV_ISA_ZICBOM or not, KVM will offer it to guests. Ensure we can build KVM and that the block size is initialized even when compiling without RISCV_ISA_ZICBOM. Fixes: 8f7e001e0325 ("RISC-V: Clean up the Zicbom block size probing") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Anup Patel <anup@brainfault.org>
| | * | Merge tag 'kvmarm-fixes-6.1-2' of ↵Paolo Bonzini2022-10-222-1/+8
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.1, take #2 - Fix a bug preventing restoring an ITS containing mappings for very large and very sparse device topology - Work around a relocation handling error when compiling the nVHE object with profile optimisation
| | | * | KVM: arm64: vgic: Fix exit condition in scan_its_table()Eric Ren2022-10-151-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With some PCIe topologies, restoring a guest fails while parsing the ITS device tables. Reproducer hints: 1. Create ARM virt VM with pxb-pcie bus which adds extra host bridges, with qemu command like: ``` -device pxb-pcie,bus_nr=8,id=pci.x,numa_node=0,bus=pcie.0 \ -device pcie-root-port,..,bus=pci.x \ ... -device pxb-pcie,bus_nr=37,id=pci.y,numa_node=1,bus=pcie.0 \ -device pcie-root-port,..,bus=pci.y \ ... ``` 2. Ensure the guest uses 2-level device table 3. Perform VM migration which calls save/restore device tables In that setup, we get a big "offset" between 2 device_ids, which makes unsigned "len" round up a big positive number, causing the scan loop to continue with a bad GPA. For example: 1. L1 table has 2 entries; 2. and we are now scanning at L2 table entry index 2075 (pointed to by L1 first entry) 3. if next device id is 9472, we will get a big offset: 7397; 4. with unsigned 'len', 'len -= offset * esz', len will underflow to a positive number, mistakenly into next iteration with a bad GPA; (It should break out of the current L2 table scanning, and jump into the next L1 table entry) 5. that bad GPA fails the guest read. Fix it by stopping the L2 table scan when the next device id is outside of the current table, allowing the scan to continue from the next L1 table entry. Thanks to Eric Auger for the fix suggestion. Fixes: 920a7a8fa92a ("KVM: arm64: vgic-its: Add infrastructure for tableookup") Suggested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Eric Ren <renzhengeek@gmail.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d9c3a564af9e2c5bf63f48a7dcbf08cd593c5c0b.1665802985.git.renzhengeek@gmail.com
| | | * | KVM: arm64: nvhe: Fix build with profile optimizationDenis Nikitin2022-10-151-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel build with clang and KCFLAGS=-fprofile-sample-use=<profile> fails with: error: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.tmp.o: Unexpected SHT_REL section ".rel.llvm.call-graph-profile" Starting from 13.0.0 llvm can generate SHT_REL section, see https://reviews.llvm.org/rGca3bdb57fa1ac98b711a735de048c12b5fdd8086. gen-hyprel does not support SHT_REL relocation section. Filter out profile use flags to fix the build with profile optimization. Signed-off-by: Denis Nikitin <denik@chromium.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221014184532.3153551-1-denik@chromium.org
| | * | | Merge tag 'kvmarm-fixes-6.1-1' of ↵Paolo Bonzini2022-10-227-33/+28
| | |\| | | | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.1, take #1 - Fix for stage-2 invalidation holding the VM MMU lock for too long by limiting the walk to the largest block mapping size - Enable stack protection and branch profiling for VHE - Two selftest fixes
| | | * KVM: selftests: Fix number of pages for memory slot in ↵Gavin Shan2022-10-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memslot_modification_stress_test It's required by vm_userspace_mem_region_add() that memory size should be aligned to host page size. However, one guest page is provided by memslot_modification_stress_test. It triggers failure in the scenario of 64KB-page-size-host and 4KB-page-size-guest, as the following messages indicate. # ./memslot_modification_stress_test Testing guest mode: PA-bits:40, VA-bits:48, 4K pages guest physical test memory: [0xffbfff0000, 0xffffff0000) Finished creating vCPUs Started all vCPUs ==== Test Assertion Failure ==== lib/kvm_util.c:824: vm_adjust_num_guest_pages(vm->mode, npages) == npages pid=5712 tid=5712 errno=0 - Success 1 0x0000000000404eeb: vm_userspace_mem_region_add at kvm_util.c:822 2 0x0000000000401a5b: add_remove_memslot at memslot_modification_stress_test.c:82 3 (inlined by) run_test at memslot_modification_stress_test.c:110 4 0x0000000000402417: for_each_guest_mode at guest_modes.c:100 5 0x00000000004016a7: main at memslot_modification_stress_test.c:187 6 0x0000ffffb8cd4383: ?? ??:0 7 0x0000000000401827: _start at :? Number of guest pages is not compatible with the host. Try npages=16 Fix the issue by providing 16 guest pages to the memory slot for this particular combination of 64KB-page-size-host and 4KB-page-size-guest on aarch64. Fixes: ef4c9f4f65462 ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()") Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221013063020.201856-1-gshan@redhat.com
| | | * KVM: arm64: selftests: Fix multiple versions of GIC creationZenghui Yu2022-10-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 98f94ce42ac6 ("KVM: selftests: Move KVM_CREATE_DEVICE_TEST code to separate helper") wrongly converted a "real" GIC device creation to __kvm_test_create_device() and caused the test failure on my D05 (which supports v2 emulation). Fix it. Fixes: 98f94ce42ac6 ("KVM: selftests: Move KVM_CREATE_DEVICE_TEST code to separate helper") Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221009033131.365-1-yuzenghui@huawei.com
| | | * KVM: arm64: Enable stack protection and branch profiling for VHEVincent Donnefort2022-10-092-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For historical reasons, the VHE code inherited the build configuration from nVHE. Now those two parts have their own folder and makefile, we can enable stack protection and branch profiling for VHE. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Reviewed-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221004154216.2833636-1-vdonnefort@google.com
| | | * KVM: arm64: Limit stage2_apply_range() batch size to largest blockOliver Upton2022-10-092-21/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Presently stage2_apply_range() works on a batch of memory addressed by a stage 2 root table entry for the VM. Depending on the IPA limit of the VM and PAGE_SIZE of the host, this could address a massive range of memory. Some examples: 4 level, 4K paging -> 512 GB batch size 3 level, 64K paging -> 4TB batch size Unsurprisingly, working on such a large range of memory can lead to soft lockups. When running dirty_log_perf_test: ./dirty_log_perf_test -m -2 -s anonymous_thp -b 4G -v 48 watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [dirty_log_perf_:16703] Modules linked in: vfat fat cdc_ether usbnet mii xhci_pci xhci_hcd sha3_generic gq(O) CPU: 0 PID: 16703 Comm: dirty_log_perf_ Tainted: G O 6.0.0-smp-DEV #1 pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : dcache_clean_inval_poc+0x24/0x38 lr : clean_dcache_guest_page+0x28/0x4c sp : ffff800021763990 pmr_save: 000000e0 x29: ffff800021763990 x28: 0000000000000005 x27: 0000000000000de0 x26: 0000000000000001 x25: 00400830b13bc77f x24: ffffad4f91ead9c0 x23: 0000000000000000 x22: ffff8000082ad9c8 x21: 0000fffafa7bc000 x20: ffffad4f9066ce50 x19: 0000000000000003 x18: ffffad4f92402000 x17: 000000000000011b x16: 000000000000011b x15: 0000000000000124 x14: ffff07ff8301d280 x13: 0000000000000000 x12: 00000000ffffffff x11: 0000000000010001 x10: fffffc0000000000 x9 : ffffad4f9069e580 x8 : 000000000000000c x7 : 0000000000000000 x6 : 000000000000003f x5 : ffff07ffa2076980 x4 : 0000000000000001 x3 : 000000000000003f x2 : 0000000000000040 x1 : ffff0830313bd000 x0 : ffff0830313bcc40 Call trace: dcache_clean_inval_poc+0x24/0x38 stage2_unmap_walker+0x138/0x1ec __kvm_pgtable_walk+0x130/0x1d4 __kvm_pgtable_walk+0x170/0x1d4 __kvm_pgtable_walk+0x170/0x1d4 __kvm_pgtable_walk+0x170/0x1d4 kvm_pgtable_stage2_unmap+0xc4/0xf8 kvm_arch_flush_shadow_memslot+0xa4/0x10c kvm_set_memslot+0xb8/0x454 __kvm_set_memory_region+0x194/0x244 kvm_vm_ioctl_set_memory_region+0x58/0x7c kvm_vm_ioctl+0x49c/0x560 __arm64_sys_ioctl+0x9c/0xd4 invoke_syscall+0x4c/0x124 el0_svc_common+0xc8/0x194 do_el0_svc+0x38/0xc0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x84/0xf0 el0t_64_sync+0x1a0/0x1a4 Use the largest supported block mapping for the configured page size as the batch granularity. In so doing the walker is guaranteed to visit a leaf only once. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221007234151.461779-3-oliver.upton@linux.dev
| | | * KVM: arm64: Work out supported block level at compile timeOliver Upton2022-10-091-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Work out the minimum page table level where KVM supports block mappings at compile time. While at it, rewrite the comment around supported block mappings to directly describe what KVM supports instead of phrasing in terms of what it does not. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221007234151.461779-2-oliver.upton@linux.dev
| * | | Revert "mfd: syscon: Remove repetition of the regmap_get_val_endian()"Jason A. Donenfeld2022-10-231-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 72a95859728a7866522e6633818bebc1c2519b17. It broke reboots on big-endian MIPS and MIPS64 malta QEMU instances, which use the syscon driver. Little-endian is not effected, which means likely it's important to handle regmap_get_val_endian() in this function after all. Fixes: 72a95859728a ("mfd: syscon: Remove repetition of the regmap_get_val_endian()") Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Lee Jones <lee@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | kernel/utsname_sysctl.c: Fix hostname pollingLinus Torvalds2022-10-232-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch") added a new entry to the uts_kern_table[] array, but didn't update the UTS_PROC_xyz enumerators of older entries, breaking anything that used them. Which is admittedly not many cases: it's really just the two uses of uts_proc_notify() in kernel/sys.c. But apparently journald-systemd actually uses this to detect hostname changes. Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Fixes: bfca3dd3d068 ("kernel/utsname_sysctl.c: print kernel arch") Link: https://lore.kernel.org/lkml/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/ Link: https://linux-regtracking.leemhuis.info/regzbot/regression/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/ Cc: Petr Vorel <pvorel@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | Merge tag 'perf_urgent_for_v6.1_rc2' of ↵Linus Torvalds2022-10-235-46/+163
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Fix raw data handling when perf events are used in bpf - Rework how SIGTRAPs get delivered to events to address a bunch of problems with it. Add a selftest for that too * tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: bpf: Fix sample_flags for bpf_perf_event_output selftests/perf_events: Add a SIGTRAP stress test with disables perf: Fix missing SIGTRAPs
| | * | | bpf: Fix sample_flags for bpf_perf_event_outputSumanth Korikkar2022-10-171-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Raw data is also filled by bpf_perf_event_output. * Add sample_flags to indicate raw data. * This eliminates the segfaults as shown below: Run ./samples/bpf/trace_output BUG pid 9 cookie 1001000000004 sized 4 BUG pid 9 cookie 1001000000004 sized 4 BUG pid 9 cookie 1001000000004 sized 4 Segmentation fault (core dumped) Fixes: 838d9bb62d13 ("perf: Use sample_flags for raw_data") Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/r/20221007081327.1047552-1-sumanthk@linux.ibm.com
| | * | | selftests/perf_events: Add a SIGTRAP stress test with disablesMarco Elver2022-10-171-3/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a SIGTRAP stress test that exercises repeatedly enabling/disabling an event while it concurrently keeps firing. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/Y0E3uG7jOywn7vy3@elver.google.com/
| | * | | perf: Fix missing SIGTRAPsPeter Zijlstra2022-10-173-43/+129
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Marco reported: Due to the implementation of how SIGTRAP are delivered if perf_event_attr::sigtrap is set, we've noticed 3 issues: 1. Missing SIGTRAP due to a race with event_sched_out() (more details below). 2. Hardware PMU events being disabled due to returning 1 from perf_event_overflow(). The only way to re-enable the event is for user space to first "properly" disable the event and then re-enable it. 3. The inability to automatically disable an event after a specified number of overflows via PERF_EVENT_IOC_REFRESH. The worst of the 3 issues is problem (1), which occurs when a pending_disable is "consumed" by a racing event_sched_out(), observed as follows: CPU0 | CPU1 --------------------------------+--------------------------- __perf_event_overflow() | perf_event_disable_inatomic() | pending_disable = CPU0 | ... | _perf_event_enable() | event_function_call() | task_function_call() | /* sends IPI to CPU0 */ <IPI> | ... __perf_event_enable() +--------------------------- ctx_resched() task_ctx_sched_out() ctx_sched_out() group_sched_out() event_sched_out() pending_disable = -1 </IPI> <IRQ-work> perf_pending_event() perf_pending_event_disable() /* Fails to send SIGTRAP because no pending_disable! */ </IRQ-work> In the above case, not only is that particular SIGTRAP missed, but also all future SIGTRAPs because 'event_limit' is not reset back to 1. To fix, rework pending delivery of SIGTRAP via IRQ-work by introduction of a separate 'pending_sigtrap', no longer using 'event_limit' and 'pending_disable' for its delivery. Additionally; and different to Marco's proposed patch: - recognise that pending_disable effectively duplicates oncpu for the case where it is set. As such, change the irq_work handler to use ->oncpu to target the event and use pending_* as boolean toggles. - observe that SIGTRAP targets the ctx->task, so the context switch optimization that carries contexts between tasks is invalid. If the irq_work were delayed enough to hit after a context switch the SIGTRAP would be delivered to the wrong task. - observe that if the event gets scheduled out (rotation/migration/context-switch/...) the irq-work would be insufficient to deliver the SIGTRAP when the event gets scheduled back in (the irq-work might still be pending on the old CPU). Therefore have event_sched_out() convert the pending sigtrap into a task_work which will deliver the signal at return_to_user. Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events") Reported-by: Dmitry Vyukov <dvyukov@google.com> Debugged-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Marco Elver <elver@google.com> Debugged-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Marco Elver <elver@google.com> Tested-by: Marco Elver <elver@google.com>
| * | | Merge tag 'sched_urgent_for_v6.1_rc2' of ↵Linus Torvalds2022-10-234-29/+35
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Adjust code to not trip up CFI - Fix sched group cookie matching * tag 'sched_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Introduce struct balance_callback to avoid CFI mismatches sched/core: Fix comparison in sched_group_cookie_match()
| | * | | sched: Introduce struct balance_callback to avoid CFI mismatchesKees Cook2022-10-174-20/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce distinct struct balance_callback instead of performing function pointer casting which will trip CFI. Avoids warnings as found by Clang's future -Wcast-function-type-strict option: In file included from kernel/sched/core.c:84: kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict] head->func = (void (*)(struct callback_head *))func; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ No binary differences result from this change. This patch is a cleanup based on Brad Spengler/PaX Team's modifications to sched code in their last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Reported-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1724 Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org
| | * | | sched/core: Fix comparison in sched_group_cookie_match()Lin Shengwang2022-10-171-9/+9
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 97886d9dcd86 ("sched: Migration changes for core scheduling"), sched_group_cookie_match() was added to help determine if a cookie matches the core state. However, while it iterates the SMT group, it fails to actually use the RQ for each of the CPUs iterated, use cpu_rq(cpu) instead of rq to fix things. Fixes: 97886d9dcd86 ("sched: Migration changes for core scheduling") Signed-off-by: Lin Shengwang <linshengwang1@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221008022709.642-1-linshengwang1@huawei.com
| * | | Merge tag 'objtool_urgent_for_v6.1_rc2' of ↵Linus Torvalds2022-10-231-1/+1
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fix from Borislav Petkov: - Fix ORC stack unwinding when GCOV is enabled * tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/unwind/orc: Fix unreliable stack dump with gcov
| | * | | x86/unwind/orc: Fix unreliable stack dump with gcovChen Zhongjin2022-10-211-1/+1
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a console stack dump is initiated with CONFIG_GCOV_PROFILE_ALL enabled, show_trace_log_lvl() gets out of sync with the ORC unwinder, causing the stack trace to show all text addresses as unreliable: # echo l > /proc/sysrq-trigger [ 477.521031] sysrq: Show backtrace of all active CPUs [ 477.523813] NMI backtrace for cpu 0 [ 477.524492] CPU: 0 PID: 1021 Comm: bash Not tainted 6.0.0 #65 [ 477.525295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014 [ 477.526439] Call Trace: [ 477.526854] <TASK> [ 477.527216] ? dump_stack_lvl+0xc7/0x114 [ 477.527801] ? dump_stack+0x13/0x1f [ 477.528331] ? nmi_cpu_backtrace.cold+0xb5/0x10d [ 477.528998] ? lapic_can_unplug_cpu+0xa0/0xa0 [ 477.529641] ? nmi_trigger_cpumask_backtrace+0x16a/0x1f0 [ 477.530393] ? arch_trigger_cpumask_backtrace+0x1d/0x30 [ 477.531136] ? sysrq_handle_showallcpus+0x1b/0x30 [ 477.531818] ? __handle_sysrq.cold+0x4e/0x1ae [ 477.532451] ? write_sysrq_trigger+0x63/0x80 [ 477.533080] ? proc_reg_write+0x92/0x110 [ 477.533663] ? vfs_write+0x174/0x530 [ 477.534265] ? handle_mm_fault+0x16f/0x500 [ 477.534940] ? ksys_write+0x7b/0x170 [ 477.535543] ? __x64_sys_write+0x1d/0x30 [ 477.536191] ? do_syscall_64+0x6b/0x100 [ 477.536809] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 477.537609] </TASK> This happens when the compiled code for show_stack() has a single word on the stack, and doesn't use a tail call to show_stack_log_lvl(). (CONFIG_GCOV_PROFILE_ALL=y is the only known case of this.) Then the __unwind_start() skip logic hits an off-by-one bug and fails to unwind all the way to the intended starting frame. Fix it by reverting the following commit: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") The original justification for that commit no longer exists. That original issue was later fixed in a different way, with the following commit: f2ac57a4c49d ("x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels") Fixes: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> [jpoimboe: rewrite commit log] Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
| * | | Merge tag 'x86_urgent_for_v6.0_rc2' of ↵Linus Torvalds2022-10-2311-86/+122
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "As usually the case, right after a major release, the tip urgent branches accumulate a couple more fixes than normal. And here is the x86, a bit bigger, urgent pile. - Use the correct CPU capability clearing function on the error path in Intel perf LBR - A CFI fix to ftrace along with a simplification - Adjust handling of zero capacity bit mask for resctrl cache allocation on AMD - A fix to the AMD microcode loader to attempt patch application on every logical thread - A couple of topology fixes to handle CPUID leaf 0x1f enumeration info properly - Drop a -mabi=ms compiler option check as both compilers support it now anyway - A couple of fixes to how the initial, statically allocated FPU buffer state is setup and its interaction with dynamic states at runtime" * tag 'x86_urgent_for_v6.0_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly perf/x86/intel/lbr: Use setup_clear_cpu_cap() instead of clear_cpu_cap() ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph() x86/ftrace: Remove ftrace_epilogue() x86/resctrl: Fix min_cbm_bits for AMD x86/microcode/AMD: Apply the patch early on every logical thread x86/topology: Fix duplicated core ID within a package x86/topology: Fix multiple packages shown on a single-package system hwmon/coretemp: Handle large core ID value x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUB x86/fpu: Exclude dynamic states from init_fpstate x86/fpu: Fix the init_fpstate size check with the actual size x86/fpu: Configure init_fpstate attributes orderly
| | * | | x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctlyChang S. Bae2022-10-211-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an extended state component is not present in fpstate, but in init state, the function copies from init_fpstate via copy_feature(). But, dynamic states are not present in init_fpstate because of all-zeros init states. Then retrieving them from init_fpstate will explode like this: BUG: kernel NULL pointer dereference, address: 0000000000000000 ... RIP: 0010:memcpy_erms+0x6/0x10 ? __copy_xstate_to_uabi_buf+0x381/0x870 fpu_copy_guest_fpstate_to_uabi+0x28/0x80 kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm] ? __this_cpu_preempt_check+0x13/0x20 ? vmx_vcpu_put+0x2e/0x260 [kvm_intel] kvm_vcpu_ioctl+0xea/0x6b0 [kvm] ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm] ? __fget_light+0xd4/0x130 __x64_sys_ioctl+0xe3/0x910 ? debug_smp_processor_id+0x17/0x20 ? fpregs_assert_state_consistent+0x27/0x50 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Adjust the 'mask' to zero out the userspace buffer for the features that are not available both from fpstate and from init_fpstate. The dynamic features depend on the compacted XSAVE format. Ensure it is enabled before reading XCOMP_BV in init_fpstate. Fixes: 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode") Reported-by: Yuan Yao <yuan.yao@intel.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/lkml/BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com/ Link: https://lkml.kernel.org/r/20221021185844.13472-1-chang.seok.bae@intel.com
| | * | | perf/x86/intel/lbr: Use setup_clear_cpu_cap() instead of clear_cpu_cap()Maxim Levitsky2022-10-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clear_cpu_cap(&boot_cpu_data) is very similar to setup_clear_cpu_cap() except that the latter also sets a bit in 'cpu_caps_cleared' which later clears the same cap in secondary cpus, which is likely what is meant here. Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lkml.kernel.org/r/20220718141123.136106-2-mlevitsk@redhat.com
| | * | | ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()Peter Zijlstra2022-10-203-15/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Different function signatures means they needs to be different functions; otherwise CFI gets upset. As triggered by the ftrace boot tests: [] CFI failure at ftrace_return_to_handler+0xac/0x16c (target: ftrace_stub+0x0/0x14; expected type: 0x0a5d5347) Fixes: 3c516f89e17e ("x86: Add support for CONFIG_CFI_CLANG") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lkml.kernel.org/r/Y06dg4e1xF6JTdQq@hirez.programming.kicks-ass.net
| | * | | x86/ftrace: Remove ftrace_epilogue()Peter Zijlstra2022-10-201-15/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the weird jumps to RET and simply use RET. This then promotes ftrace_stub() to a real function; which becomes important for kcfi. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111148.719080593@infradead.org Signed-off-by: Peter Zijlstra <peterz@infradead.org>
| | * | | x86/resctrl: Fix min_cbm_bits for AMDBabu Moger2022-10-181-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AMD systems support zero CBM (capacity bit mask) for cache allocation. That is reflected in rdt_init_res_defs_amd() by: r->cache.arch_has_empty_bitmaps = true; However given the unified code in cbm_validate(), checking for: val == 0 && !arch_has_empty_bitmaps is not enough because of another check in cbm_validate(): if ((zero_bit - first_bit) < r->cache.min_cbm_bits) The default value of r->cache.min_cbm_bits = 1. Leading to: $ cd /sys/fs/resctrl $ mkdir foo $ cd foo $ echo L3:0=0 > schemata -bash: echo: write error: Invalid argument $ cat /sys/fs/resctrl/info/last_cmd_status Need at least 1 bits in the mask Initialize the min_cbm_bits to 0 for AMD. Also, remove the default setting of min_cbm_bits and initialize it separately. After the fix: $ cd /sys/fs/resctrl $ mkdir foo $ cd foo $ echo L3:0=0 > schemata $ cat /sys/fs/resctrl/info/last_cmd_status ok Fixes: 316e7f901f5a ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps") Co-developed-by: Stephane Eranian <eranian@google.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/lkml/20220517001234.3137157-1-eranian@google.com
| | * | | x86/microcode/AMD: Apply the patch early on every logical threadBorislav Petkov2022-10-181-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the patch application logic checks whether the revision needs to be applied on each logical CPU (SMT thread). Therefore, on SMT designs where the microcode engine is shared between the two threads, the application happens only on one of them as that is enough to update the shared microcode engine. However, there are microcode patches which do per-thread modification, see Link tag below. Therefore, drop the revision check and try applying on each thread. This is what the BIOS does too so this method is very much tested. Btw, change only the early paths. On the late loading paths, there's no point in doing per-thread modification because if is it some case like in the bugzilla below - removing a CPUID flag - the kernel cannot go and un-use features it has detected are there early. For that, one should use early loading anyway. [ bp: Fixes does not contain the oldest commit which did check for equality but that is good enough. ] Fixes: 8801b3fcb574 ("x86/microcode/AMD: Rework container parsing") Reported-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com> Cc: <stable@vger.kernel.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216211
| | * | | x86/topology: Fix duplicated core ID within a packageZhang Rui2022-10-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today, core ID is assumed to be unique within each package. But an AlderLake-N platform adds a Module level between core and package, Linux excludes the unknown modules bits from the core ID, resulting in duplicate core ID's. To keep core ID unique within a package, Linux must include all APIC-ID bits for known or unknown levels above the core and below the package in the core ID. It is important to understand that core ID's have always come directly from the APIC-ID encoding, which comes from the BIOS. Thus there is no guarantee that they start at 0, or that they are contiguous. As such, naively using them for array indexes can be problematic. [ dhansen: un-known -> unknown ] Fixes: 7745f03eb395 ("x86/topology: Add CPUID.1F multi-die/package support") Suggested-by: Len Brown <len.brown@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221014090147.1836-5-rui.zhang@intel.com
| | * | | x86/topology: Fix multiple packages shown on a single-package systemZhang Rui2022-10-171-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CPUID.1F/B does not enumerate Package level explicitly, instead, all the APIC-ID bits above the enumerated levels are assumed to be package ID bits. Current code gets package ID by shifting out all the APIC-ID bits that Linux supports, rather than shifting out all the APIC-ID bits that CPUID.1F enumerates. This introduces problems when CPUID.1F enumerates a level that Linux does not support. For example, on a single package AlderLake-N, there are 2 Ecore Modules with 4 atom cores in each module. Linux does not support the Module level and interprets the Module ID bits as package ID and erroneously reports a multi module system as a multi-package system. Fix this by using APIC-ID bits above all the CPUID.1F enumerated levels as package ID. [ dhansen: spelling fix ] Fixes: 7745f03eb395 ("x86/topology: Add CPUID.1F multi-die/package support") Suggested-by: Len Brown <len.brown@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221014090147.1836-4-rui.zhang@intel.com
| | * | | hwmon/coretemp: Handle large core ID valueZhang Rui2022-10-171-15/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The coretemp driver supports up to a hard-coded limit of 128 cores. Today, the driver can not support a core with an ID above that limit. Yet, the encoding of core ID's is arbitrary (BIOS APIC-ID) and so they may be sparse and they may be large. Update the driver to map arbitrary core ID numbers into appropriate array indexes so that 128 cores can be supported, no matter the encoding of core ID's. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Len Brown <len.brown@intel.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221014090147.1836-3-rui.zhang@intel.com
| | * | | x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUBNathan Chancellor2022-10-171-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent change in LLVM made CONFIG_EFI_STUB unselectable because it no longer pretends to support -mabi=ms, breaking the dependency in Kconfig. Lack of CONFIG_EFI_STUB can prevent kernels from booting via EFI in certain circumstances. This check was added by 8f24f8c2fc82 ("efi/libstub: Annotate firmware routines as __efiapi") to ensure that __attribute__((ms_abi)) was available, as -mabi=ms is not actually used in any cflags. According to the GCC documentation, this attribute has been supported since GCC 4.4.7. The kernel currently requires GCC 5.1 so this check is not necessary; even when that change landed in 5.6, the kernel required GCC 4.9 so it was unnecessary then as well. Clang supports __attribute__((ms_abi)) for all versions that are supported for building the kernel so no additional check is needed. Remove the 'depends on' line altogether to allow CONFIG_EFI_STUB to be selected when CONFIG_EFI is enabled, regardless of compiler. Fixes: 8f24f8c2fc82 ("efi/libstub: Annotate firmware routines as __efiapi") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Cc: stable@vger.kernel.org Link: https://github.com/llvm/llvm-project/commit/d1ad006a8f64bdc17f618deffa9e7c91d82c444d
| | * | | x86/fpu: Exclude dynamic states from init_fpstateChang S. Bae2022-10-171-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | == Background == The XSTATE init code initializes all enabled and supported components. Then, the init states are saved in the init_fpstate buffer that is statically allocated in about one page. The AMX TILE_DATA state is large (8KB) but its init state is zero. And the feature comes only with the compacted format with these established dependencies: AMX->XFD->XSAVES. So this state is excludable from init_fpstate. == Problem == But the buffer is formatted to include that large state. Then, this can be the cause of a noisy splat like the below. This came from XRSTORS for the task with init_fpstate in its XSAVE buffer. It is reproducible on AMX systems when the running kernel is built with CONFIG_DEBUG_PAGEALLOC=y and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y: Bad FPU state detected at restore_fpregs_from_fpstate+0x57/0xd0, reinitializing FPU registers. ... RIP: 0010:restore_fpregs_from_fpstate+0x57/0xd0 ? restore_fpregs_from_fpstate+0x45/0xd0 switch_fpu_return+0x4e/0xe0 exit_to_user_mode_prepare+0x17b/0x1b0 syscall_exit_to_user_mode+0x29/0x40 do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? exc_page_fault+0x86/0x180 entry_SYSCALL_64_after_hwframe+0x63/0xcd == Solution == Adjust init_fpstate to exclude dynamic states. XRSTORS from init_fpstate still initializes those states when their bits are set in the requested-feature bitmap. Fixes: 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode") Reported-by: Lin X Wang <lin.x.wang@intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Lin X Wang <lin.x.wang@intel.com> Link: https://lore.kernel.org/r/20220824191223.1248-4-chang.seok.bae@intel.com
| | * | | x86/fpu: Fix the init_fpstate size check with the actual sizeChang S. Bae2022-10-171-18/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The init_fpstate buffer is statically allocated. Thus, the sanity test was established to check whether the pre-allocated buffer is enough for the calculated size or not. The currently measured size is not strictly relevant. Fix to validate the calculated init_fpstate size with the pre-allocated area. Also, replace the sanity check function with open code for clarity. The abstraction itself and the function naming do not tend to represent simply what it does. Fixes: 2ae996e0c1a3 ("x86/fpu: Calculate the default sizes independently") Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220824191223.1248-3-chang.seok.bae@intel.com
| | * | | x86/fpu: Configure init_fpstate attributes orderlyChang S. Bae2022-10-172-9/+5
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The init_fpstate setup code is spread out and out of order. The init image is recorded before its scoped features and the buffer size are determined. Determine the scope of init_fpstate components and its size before recording the init state. Also move the relevant code together. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: neelnatu@google.com Link: https://lore.kernel.org/r/20220824191223.1248-2-chang.seok.bae@intel.com