linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot	Sean Christopherson	2021-04-17	1	-4/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been detected in the memslots, i.e. don't take the lock for notifications that don't affect the guest. For small VMs, spurious locking is a minor annoyance. And for "volatile" setups where the majority of notifications _are_ relevant, this barely qualifies as an optimization. But, for large VMs (hundreds of threads) with static setups, e.g. no page migration, no swapping, etc..., the vast majority of MMU notifier callbacks will be unrelated to the guest, e.g. will often be in response to the userspace VMM adjusting its own virtual address space. In such large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from handling page faults. In some scenarios it can even be "fatal" in the sense that it causes unacceptable brownouts, e.g. when rebuilding huge pages after live migration, a significant percentage of vCPUs will be attempting to handle page faults. x86's TDP MMU implementation is especially susceptible to spurious locking due it taking mmu_lock for read when handling page faults. Because rwlock is fair, a single writer will stall future readers, while the writer is itself stalled waiting for in-progress readers to complete. This is exacerbated by the MMU notifiers often firing multiple times in quick succession, e.g. moving a page will (always?) invoke three separate notifiers: .invalidate_range_start(), invalidate_range_end(), and .change_pte(). Unnecessarily taking mmu_lock each time means even a single spurious sequence can be problematic. Note, this optimizes only the unpaired callbacks. Optimizing the .invalidate_range_{start,end}() pairs is more complex and will be done in a future patch. Suggested-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: Move MMU notifier's mmu_lock acquisition into common helper	Sean Christopherson	2021-04-17	1	-41/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Acquire and release mmu_lock in the __kvm_handle_hva_range() helper instead of requiring the caller to do the same. This paves the way for future patches to take mmu_lock if and only if an overlapping memslot is found, without also having to introduce the on_lock() shenanigans used to manipulate the notifier count and sequence. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: Kill off the old hva-based MMU notifier callbacks	Sean Christopherson	2021-04-17	6	-97/+0
\| \| \| \| \| \| \| \| \| \| \|	Yank out the hva-based MMU notifier APIs now that all architectures that use the notifiers have moved to the gfn-based APIs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: PPC: Convert to the gfn-based MMU notifier callbacks	Sean Christopherson	2021-04-17	10	-173/+95
\| \| \| \| \| \| \| \| \| \| \| \| \|	Move PPC to the gfn-base MMU notifier APIs, and update all 15 bajillion PPC-internal hooks to work with gfns instead of hvas. No meaningful functional change intended, though the exact order of operations is slightly different since the memslot lookups occur before calling into arch code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks	Sean Christopherson	2021-04-17	2	-79/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move MIPS to the gfn-based MMU notifier APIs, which do the hva->gfn lookup in common code, and whose code is nearly identical to MIPS' lookup. No meaningful functional change intended, though the exact order of operations is slightly different since the memslot lookups occur before calling into arch code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: arm64: Convert to the gfn-based MMU notifier callbacks	Sean Christopherson	2021-04-17	2	-84/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move arm64 to the gfn-base MMU notifier APIs, which do the hva->gfn lookup in common code. No meaningful functional change intended, though the exact order of operations is slightly different since the memslot lookups occur before calling into arch code. Reviewed-by: Marc Zyngier <maz@kernel.org> Tested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: Move x86's MMU notifier memslot walkers to generic code	Sean Christopherson	2021-04-17	6	-246/+314
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the hva->gfn lookup for MMU notifiers into common code. Every arch does a similar lookup, and some arch code is all but identical across multiple architectures. In addition to consolidating code, this will allow introducing optimizations that will benefit all architectures without incurring multiple walks of the memslots, e.g. by taking mmu_lock if and only if a relevant range exists in the memslots. The use of __always_inline to avoid indirect call retpolines, as done by x86, may also benefit other architectures. Consolidating the lookups also fixes a wart in x86, where the legacy MMU and TDP MMU each do their own memslot walks. Lastly, future enhancements to the memslot implementation, e.g. to add an interval tree to track host address, will need to touch far less arch specific code. MIPS, PPC, and arm64 will be converted one at a time in future patches. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210402005658.3024832-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: Assert that notifier count is elevated in .change_pte()	Sean Christopherson	2021-04-17	1	-2/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In KVM's .change_pte() notification callback, replace the notifier sequence bump with a WARN_ON assertion that the notifier count is elevated. An elevated count provides stricter protections than bumping the sequence, and the sequence is guarnateed to be bumped before the count hits zero. When .change_pte() was added by commit 828502d30073 ("ksm: add mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary as .change_pte() would be invoked without any surrounding notifications. However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end"), all calls to .change_pte() are guaranteed to be surrounded by start() and end(), and so are guaranteed to run with an elevated notifier count. Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats the purpose of .change_pte(). Every arch's kvm_set_spte_hva() assumes .change_pte() is called when the relevant SPTE is present in KVM's MMU, as the original goal was to accelerate Kernel Samepage Merging (KSM) by updating KVM's SPTEs without requiring a VM-Exit (due to invalidating the SPTE). I.e. it means that .change_pte() is effectively dead code on _all_ architectures. x86 and MIPS are clearcut nops if the old SPTE is not-present, and that is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE, which again should be a nop due to the invalidation. arm64 is a bit murky, but it's also likely a nop because kvm_pgtable_stage2_map() is called without a cache pointer, which means it will map an entry if and only if an existing PTE was found. For now, take advantage of the bug to simplify future consolidation of KVMs's MMU notifier code. Doing so will not greatly complicate fixing .change_pte(), assuming it's even worth fixing. .change_pte() has been broken for 8+ years and no one has complained. Even if there are KSM+KVM users that care deeply about its performance, the benefits of avoiding VM-Exits via .change_pte() need to be reevaluated to justify the added complexity and testing burden. Ripping out .change_pte() entirely would be a lot easier. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: MIPS: defer flush to generic MMU notifier code	Paolo Bonzini	2021-04-17	1	-9/+2
\| \| \| \| \| \| \|	Return 1 from kvm_unmap_hva_range and kvm_set_spte_hva if a flush is needed, so that the generic code can coalesce the flushes. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: MIPS: let generic code call prepare_flush_shadow	Paolo Bonzini	2021-04-17	3	-10/+10
\| \| \| \| \| \| \| \| \| \|	Since all calls to kvm_flush_remote_tlbs must be preceded by kvm_mips_callbacks->prepare_flush_shadow, repurpose kvm_arch_flush_remote_tlb to invoke it. This makes it possible to use the TLB flushing mechanism provided by the generic MMU notifier code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush	Paolo Bonzini	2021-04-17	5	-43/+19
\| \| \| \| \| \| \| \| \| \| \| \|	Both trap-and-emulate and VZ have a single implementation that covers both .flush_shadow_all and .flush_shadow_memslot, and both of them end with a call to kvm_flush_remote_tlbs. Unify the callbacks into one and extract the call to kvm_flush_remote_tlbs. The next patches will pull it further out of the the architecture-specific MMU notifier functions kvm_unmap_hva_range and kvm_set_spte_hva. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: constify kvm_arch_flush_remote_tlbs_memslot	Paolo Bonzini	2021-04-17	4	-4/+4
\| \| \| \| \| \| \|	memslots are stored in RCU and there should be no need to change them. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: Explicitly use GFP_KERNEL_ACCOUNT for 'struct kvm_vcpu' allocations	Sean Christopherson	2021-04-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious that that the allocations are accounted, to make it easier to audit KVM's allocations in the future, and to be consistent with other cache usage in KVM. When using SLAB/SLUB, this is a nop as the cache itself is created with SLAB_ACCOUNT. When using SLOB, there are caveats within caveats. SLOB doesn't honor SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU allocations now being accounted. But, even that depends on internal SLOB details as SLOB will only go to the page allocator when its cache is depleted. That just happens to be extremely likely for vCPUs because the size of kvm_vcpu is larger than the a page for almost all combinations of architecture and page size. Whether or not the SLOB behavior is by design is unknown; it's just as likely that no SLOB users care about accounding and so no one has bothered to implemented support in SLOB. Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup users, if any exist. Reviewed-by: Wanpeng Li <kernellwp@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210406190740.4055679-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: MMU: protect TDP MMU pages only down to required level	Paolo Bonzini	2021-04-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	When using manual protection of dirty pages, it is not necessary to protect nested page tables down to the 4K level; instead KVM can protect only hugepages in order to split them lazily, and delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG. This was overlooked in the TDP MMU, so do it there as well. Fixes: a6a0b05da9f37 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Cc: Ben Gardon <bgardon@google.com> Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2	Maxim Levitsky	2021-04-17	2	-0/+7
\| \| \| \| \| \| \| \| \|	Define KVM_GUESTDBG_VALID_MASK and use it to implement this capabiity. Compile tested only. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401135451.1004564-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2	Maxim Levitsky	2021-04-17	3	-5/+6
\| \| \| \| \| \| \| \| \| \|	Move KVM_GUESTDBG_VALID_MASK to kvm_host.h and use it to return the value of this capability. Compile tested only. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401135451.1004564-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2	Maxim Levitsky	2021-04-17	2	-0/+11
\| \| \| \| \| \| \| \| \|	Store the supported bits into KVM_GUESTDBG_VALID_MASK macro, similar to how arm does this. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: introduce KVM_CAP_SET_GUEST_DEBUG2	Paolo Bonzini	2021-04-17	2	-0/+4
\| \| \| \| \| \| \| \| \|	This capability will allow the user to know which KVM_GUESTDBG_* bits are supported. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401135451.1004564-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: pending exceptions must not be blocked by an injected event	Maxim Levitsky	2021-04-17	2	-3/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Injected interrupts/nmi should not block a pending exception, but rather be either lost if nested hypervisor doesn't intercept the pending exception (as in stock x86), or be delivered in exitintinfo/IDT_VECTORING_INFO field, as a part of a VMexit that corresponds to the pending exception. The only reason for an exception to be blocked is when nested run is pending (and that can't really happen currently but still worth checking for). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: selftests: remove redundant semi-colon	Yang Yingliang	2021-04-17	1	-1/+1
\| \| \| \| \| \|	Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Message-Id: <20210401142514.1688199-1-yangyingliang@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: nSVM: call nested_svm_load_cr3 on nested state load	Maxim Levitsky	2021-04-17	1	-18/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While KVM's MMU should be fully reset by loading of nested CR0/CR3/CR4 by KVM_SET_SREGS, we are not in nested mode yet when we do it and therefore only root_mmu is reset. On regular nested entries we call nested_svm_load_cr3 which both updates the guest's CR3 in the MMU when it is needed, and it also initializes the mmu again which makes it initialize the walk_mmu as well when nested paging is enabled in both host and guest. Since we don't call nested_svm_load_cr3 on nested state load, the walk_mmu can be left uninitialized, which can lead to a NULL pointer dereference while accessing it if we happen to get a nested page fault right after entering the nested guest first time after the migration and we decide to emulate it, which leads to the emulator trying to access walk_mmu->gva_to_gpa which is NULL. Therefore we should call this function on nested state load as well. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401141814.1029036-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: dump_vmcs should include the autoload/autostore MSR lists	David Edmondson	2021-04-17	1	-0/+16
\| \| \| \| \| \| \| \| \| \|	When dumping the current VMCS state, include the MSRs that are being automatically loaded/stored during VM entry/exit. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-6-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: dump_vmcs should show the effective EFER	David Edmondson	2021-04-17	2	-5/+17
\| \| \| \| \| \| \| \| \| \|	If EFER is not being loaded from the VMCS, show the effective value by reference to the MSR autoload list or calculation. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-5-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: dump_vmcs should consider only the load controls of EFER/PAT	David Edmondson	2021-04-17	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \|	When deciding whether to dump the GUEST_IA32_EFER and GUEST_IA32_PAT fields of the VMCS, examine only the VM entry load controls, as saving on VM exit has no effect on whether VM entry succeeds or fails. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-4-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: dump_vmcs should not conflate EFER and PAT presence in VMCS	David Edmondson	2021-04-17	1	-9/+10
\| \| \| \| \| \| \| \|	Show EFER and PAT based on their individual entry/exit controls. Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-3-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid	David Edmondson	2021-04-17	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the VM entry/exit controls for loading/saving MSR_EFER are either not available (an older processor or explicitly disabled) or not used (host and guest values are the same), reading GUEST_IA32_EFER from the VMCS returns an inaccurate value. Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide whether to print the PDPTRs - always do so if the fields exist. Fixes: 4eb64dce8d0a ("KVM: x86: dump VMCS on invalid entry") Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-2-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: nSVM: improve SYSENTER emulation on AMD	Maxim Levitsky	2021-04-17	2	-37/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently to support Intel->AMD migration, if CPU vendor is GenuineIntel, we emulate the full 64 value for MSR_IA32_SYSENTER_{EIP\|ESP} msrs, and we also emulate the sysenter/sysexit instruction in long mode. (Emulator does still refuse to emulate sysenter in 64 bit mode, on the ground that the code for that wasn't tested and likely has no users) However when virtual vmload/vmsave is enabled, the vmload instruction will update these 32 bit msrs without triggering their msr intercept, which will lead to having stale values in kvm's shadow copy of these msrs, which relies on the intercept to be up to date. Fix/optimize this by doing the following: 1. Enable the MSR intercepts for SYSENTER MSRs iff vendor=GenuineIntel (This is both a tiny optimization and also ensures that in case the guest cpu vendor is AMD, the msrs will be 32 bit wide as AMD defined). 2. Store only high 32 bit part of these msrs on interception and combine it with hardware msr value on intercepted read/writes iff vendor=GenuineIntel. 3. Disable vmload/vmsave virtualization if vendor=GenuineIntel. (It is somewhat insane to set vendor=GenuineIntel and still enable SVM for the guest but well whatever). Then zero the high 32 bit parts when kvm intercepts and emulates vmload. Thanks a lot to Paulo Bonzini for helping me with fixing this in the most correct way. This patch fixes nested migration of 32 bit nested guests, that was broken because incorrect cached values of SYSENTER msrs were stored in the migration stream if L1 changed these msrs with vmload prior to L2 entry. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401111928.996871-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: add guest_cpuid_is_intel	Maxim Levitsky	2021-04-17	1	-0/+8
\| \| \| \| \| \| \| \|	This is similar to existing 'guest_cpuid_is_amd_or_hygon' Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401111928.996871-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86: Account a variety of miscellaneous allocations	Sean Christopherson	2021-04-17	3	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \|	Switch to GFP_KERNEL_ACCOUNT for a handful of allocations that are clearly associated with a single task/VM. Note, there are a several SEV allocations that aren't accounted, but those can (hopefully) be fixed by using the local stack for memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331023025.2485960-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: SVM: Do not allow SEV/SEV-ES initialization after vCPUs are created	Sean Christopherson	2021-04-17	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reject KVM_SEV_INIT and KVM_SEV_ES_INIT if they are attempted after one or more vCPUs have been created. KVM assumes a VM is tagged SEV/SEV-ES prior to vCPU creation, e.g. init_vmcb() needs to mark the VMCB as SEV enabled, and svm_create_vcpu() needs to allocate the VMSA. At best, creating vCPUs before SEV/SEV-ES init will lead to unexpected errors and/or behavior, and at worst it will crash the host, e.g. sev_launch_update_vmsa() will dereference a null svm->vmsa pointer. Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command") Fixes: ad73109ae7ec ("KVM: SVM: Provide support to launch and run an SEV-ES guest") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331031936.2495277-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: SVM: Do not set sev->es_active until KVM_SEV_ES_INIT completes	Sean Christopherson	2021-04-17	1	-17/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Set sev->es_active only after the guts of KVM_SEV_ES_INIT succeeds. If the command fails, e.g. because SEV is already active or there are no available ASIDs, then es_active will be left set even though the VM is not fully SEV-ES capable. Refactor the code so that "es_active" is passed on the stack instead of being prematurely shoved into sev_info, both to avoid having to unwind sev_info and so that it's more obvious what actually consumes es_active in sev_guest_init() and its helpers. Fixes: ad73109ae7ec ("KVM: SVM: Provide support to launch and run an SEV-ES guest") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331031936.2495277-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: SVM: Use online_vcpus, not created_vcpus, to iterate over vCPUs	Sean Christopherson	2021-04-17	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the kvm_for_each_vcpu() helper to iterate over vCPUs when encrypting VMSAs for SEV, which effectively switches to use online_vcpus instead of created_vcpus. This fixes a possible null-pointer dereference as created_vcpus does not guarantee a vCPU exists, since it is updated at the very beginning of KVM_CREATE_VCPU. created_vcpus exists to allow the bulk of vCPU creation to run in parallel, while still correctly restricting the max number of max vCPUs. Fixes: ad73109ae7ec ("KVM: SVM: Provide support to launch and run an SEV-ES guest") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331031936.2495277-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Simplify code for aging SPTEs in TDP MMU	Sean Christopherson	2021-04-17	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use a basic NOT+AND sequence to clear the Accessed bit in TDP MMU SPTEs, as opposed to the fancy ffs()+clear_bit() logic that was copied from the legacy MMU. The legacy MMU uses clear_bit() because it is operating on the SPTE itself, i.e. clearing needs to be atomic. The TDP MMU operates on a local variable that it later writes to the SPTE, and so doesn't need to be atomic or even resident in memory. Opportunistically drop unnecessary initialization of new_spte, it's guaranteed to be written before being accessed. Using NOT+AND instead of ffs()+clear_bit() reduces the sequence from: 0x0000000000058be6 <+134>: test %rax,%rax 0x0000000000058be9 <+137>: je 0x58bf4 <age_gfn_range+148> 0x0000000000058beb <+139>: test %rax,%rdi 0x0000000000058bee <+142>: je 0x58cdc <age_gfn_range+380> 0x0000000000058bf4 <+148>: mov %rdi,0x8(%rsp) 0x0000000000058bf9 <+153>: mov $0xffffffff,%edx 0x0000000000058bfe <+158>: bsf %eax,%edx 0x0000000000058c01 <+161>: movslq %edx,%rdx 0x0000000000058c04 <+164>: lock btr %rdx,0x8(%rsp) 0x0000000000058c0b <+171>: mov 0x8(%rsp),%r15 to: 0x0000000000058bdd <+125>: test %rax,%rax 0x0000000000058be0 <+128>: je 0x58beb <age_gfn_range+139> 0x0000000000058be2 <+130>: test %rax,%r8 0x0000000000058be5 <+133>: je 0x58cc0 <age_gfn_range+352> 0x0000000000058beb <+139>: not %rax 0x0000000000058bee <+142>: and %r8,%rax 0x0000000000058bf1 <+145>: mov %rax,%r15 thus eliminating several memory accesses, including a locked access. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331004942.2444916-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Remove spurious clearing of dirty bit from TDP MMU SPTE	Sean Christopherson	2021-04-17	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't clear the dirty bit when aging a TDP MMU SPTE (in response to a MMU notifier event). Prematurely clearing the dirty bit could cause spurious PML updates if aging a page happened to coincide with dirty logging. Note, tdp_mmu_set_spte_no_acc_track() flows into __handle_changed_spte(), so the host PFN will be marked dirty, i.e. there is no potential for data corruption. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331004942.2444916-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint	Sean Christopherson	2021-04-17	3	-27/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Remove x86's trace_kvm_age_page() tracepoint. It's mostly redundant with the common trace_kvm_age_hva() tracepoint, and if there is a need for the extra details, e.g. gfn, referenced, etc... those details should be added to the common tracepoint so that all architectures and MMUs benefit from the info. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: Move arm64's MMU notifier trace events to generic code	Sean Christopherson	2021-04-17	6	-88/+78
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move arm64's MMU notifier trace events into common code in preparation for doing the hva->gfn lookup in common code. The alternative would be to trace the gfn instead of hva, but that's not obviously better and could also be done in common code. Tracing the notifiers is also quite handy for debug regardless of architecture. Remove a completely redundant tracepoint from PPC e500. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: Move prototypes for MMU notifier callbacks to generic code	Sean Christopherson	2021-04-17	5	-22/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	Move the prototypes for the MMU notifier callbacks out of arch code and into common code. There is no benefit to having each arch replicate the prototypes since any deviation from the invocation in common code will explode. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing SPTE	Sean Christopherson	2021-04-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the leaf-only TDP iterator when changing the SPTE in reaction to a MMU notifier. Practically speaking, this is a nop since the guts of the loop explicitly looks for 4k SPTEs, which are always leaf SPTEs. Switch the iterator to match age_gfn_range() and test_age_gfn() so that a future patch can consolidate the core iterating logic. No real functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Pass address space ID to TDP MMU root walkers	Sean Christopherson	2021-04-17	2	-66/+44
\| \| \| \| \| \| \| \| \| \| \|	Move the address space ID check that is performed when iterating over roots into the macro helpers to consolidate code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()	Sean Christopherson	2021-04-17	3	-18/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pass the address space ID to TDP MMU's primary "zap gfn range" helper to allow the MMU notifier paths to iterate over memslots exactly once. Currently, both the legacy MMU and TDP MMU iterate over memslots when looking for an overlapping hva range, which can be quite costly if there are a large number of memslots. Add a "flush" parameter so that iterating over multiple address spaces in the caller will continue to do the right thing when yielding while a flush is pending from a previous address space. Note, this also has a functional change in the form of coalescing TLB flushes across multiple address spaces in kvm_zap_gfn_range(), and also optimizes the TDP MMU to utilize range-based flushing when running as L1 with Hyper-V enlightenments. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-6-seanjc@google.com> [Keep separate for loops to prepare for other incoming patches. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range zap	Sean Christopherson	2021-04-17	1	-9/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	Gather pending TLB flushes across both address spaces when zapping a given gfn range. This requires feeding "flush" back into subsequent calls, but on the plus side sets the stage for further batching between the legacy MMU and TDP MMU. It also allows refactoring the address space iteration to cover the legacy and TDP MMUs without introducing truly ugly code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs	Sean Christopherson	2021-04-17	3	-9/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	Gather pending TLB flushes across both the legacy and TDP MMUs when zapping collapsible SPTEs to avoid multiple flushes if both the legacy MMU (for nested guests) and TDP MMU have mappings for the memslot. Note, this also optimizes the TDP MMU to flush only the relevant range when running as L1 with Hyper-V enlightenments. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy MMU	Sean Christopherson	2021-04-17	1	-18/+19
\| \| \| \| \| \| \| \| \| \| \| \| \|	Place the onus on the caller of slot_handle_*() to flush the TLB, rather than handling the flush in the helper, and rename parameters accordingly. This will allow future patches to coalesce flushes between address spaces and between the legacy and TDP MMUs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible SPTEs	Sean Christopherson	2021-04-17	1	-9/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When zapping collapsible SPTEs across multiple roots, gather pending flushes and perform a single remote TLB flush at the end, as opposed to flushing after processing every root. Note, flush may be cleared by the result of zap_collapsible_spte_range(). This is intended and correct, e.g. yielding may have serviced a prior pending flush. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: x86/vPMU: Forbid reading from MSR_F15H_PERF MSRs when guest doesn't ↵	Vitaly Kuznetsov	2021-04-17	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \|	have X86_FEATURE_PERFCTR_CORE MSR_F15H_PERF_CTL0-5, MSR_F15H_PERF_CTR0-5 MSRs have a CPUID bit assigned to them (X86_FEATURE_PERFCTR_CORE) and when it wasn't exposed to the guest the correct behavior is to inject #GP an not just return zero. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210329124804.170173-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: nSVM: If VMRUN is single-stepped, queue the #DB intercept in ↵	Krish Sadhukhan	2021-04-17	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nested_svm_vmexit() According to APM, the #DB intercept for a single-stepped VMRUN must happen after the completion of that instruction, when the guest does #VMEXIT to the host. However, in the current implementation of KVM, the #DB intercept for a single-stepped VMRUN happens after the completion of the instruction that follows the VMRUN instruction. When the #DB intercept handler is invoked, it shows the RIP of the instruction that follows VMRUN, instead of of VMRUN itself. This is an incorrect RIP as far as single-stepping VMRUN is concerned. This patch fixes the problem by checking, in nested_svm_vmexit(), for the condition that the VMRUN instruction is being single-stepped and if so, queues the pending #DB intercept so that the #DB is accounted for before we execute L1's next instruction. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oraacle.com> Message-Id: <20210323175006.73249-2-krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	KVM: MMU: load PDPTRs outside mmu_lock	Paolo Bonzini	2021-04-17	1	-15/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	On SVM, reading PDPTRs might access guest memory, which might fault and thus might sleep. On the other hand, it is not possible to release the lock after make_mmu_pages_available has been called. Therefore, push the call to make_mmu_pages_available and the mmu_lock critical section within mmu_alloc_direct_roots and mmu_alloc_shadow_roots. Reported-by: Wanpeng Li <wanpengli@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
*	Merge remote-tracking branch 'tip/x86/sgx' into kvm-next	Paolo Bonzini	2021-04-17	21	-224/+861
\|\ \| \| \| \| \| \|	Pull generic x86 SGX changes needed to support SGX in virtual machines.
\| *	x86/sgx: Mark sgx_vepc_vm_ops static	Wei Yongjun	2021-04-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix the following sparse warning: arch/x86/kernel/cpu/sgx/virt.c:95:35: warning: symbol 'sgx_vepc_vm_ops' was not declared. Should it be static? This symbol is not used outside of virt.c so mark it static. [ bp: Massage commit message. ] Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210412160023.193850-1-weiyongjun1@huawei.com
\| *	x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()	Jarkko Sakkinen	2021-04-08	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The commit in Fixes: changed the SGX EPC page sanitization to end up in sgx_free_epc_page() which puts clean and sanitized pages on the free list. This was done for the reason that it is best to keep the logic to assign available-for-use EPC pages to the correct NUMA lists in a single location. sgx_nr_free_pages is also incremented by sgx_free_epc_pages() but those pages which are being added there per EPC section do not belong to the free list yet because they haven't been sanitized yet - they land on the dirty list first and the sanitization happens later when ksgxd starts massaging them. So remove that addition there and have sgx_free_epc_page() do that solely. [ bp: Sanitize commit message too. ] Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210408092924.7032-1-jarkko@kernel.org