aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'kvm-arm-for-v4.15' of ↵Radim Krčmář2017-11-0832-635/+838
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into next KVM/ARM Changes for v4.15 Changes include: - Optimized arch timer handling for KVM/ARM - Improvements to the VGIC ITS code and introduction of an ITS reset ioctl - Unification of the 32-bit fault injection logic - More exact external abort matching logic
| * KVM: arm/arm64: fix the incompatible matching for external abortDongjiu Geng2017-11-062-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_vcpu_dabt_isextabt() tries to match a full fault syndrome, but calls kvm_vcpu_trap_get_fault_type() that only returns the fault class, thus reducing the scope of the check. This doesn't cause any observable bug yet as we end-up matching a closely related syndrome for which we return the same value. Using kvm_vcpu_trap_get_fault() instead fixes it for good. Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: arm/arm64: Unify 32bit fault injectionMarc Zyngier2017-11-065-218/+131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Both arm and arm64 implementations are capable of injecting faults, and yet have completely divergent implementations, leading to different bugs and reduced maintainability. Let's elect the arm64 version as the canonical one and move it into aarch32.c, which is common to both architectures. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: arm/arm64: vgic-its: Implement KVM_DEV_ARM_ITS_CTRL_RESETEric Auger2017-11-063-49/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On reset we clear the valid bits of GITS_CBASER and GITS_BASER<n>. We also clear command queue registers and free the cache (device, collection, and lpi lists). As we need to take the same locks as save/restore functions, we create a vgic_its_ctrl() wrapper that handles KVM_DEV_ARM_VGIC_GRP_CTRL group functions. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: arm/arm64: Document KVM_DEV_ARM_ITS_CTRL_RESETEric Auger2017-11-061-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment, the in-kernel emulated ITS is not properly reset. On guest restart/reset some registers keep their old values and internal structures like device, ITE, and collection lists are not freed. This may lead to various bugs. Among them, we can have incorrect state backup or failure when saving the ITS state at early guest boot stage. This patch documents a new attribute, KVM_DEV_ARM_ITS_CTRL_RESET in the KVM_DEV_ARM_VGIC_GRP_CTRL group. Upon this action, we can reset registers and especially those pointing to tables previously allocated by the guest and free the internal data structures storing the list of devices, collections and lpis. The usual approach for device reset of having userspace write the reset values of the registers to the kernel via the register read/write APIs doesn't work for the ITS because it has some internal state (caches) which is not exposed as registers, and there is no register interface for "drop cached data without writing it back to RAM". So we need a KVM API which mimics the hardware's reset line, to provide the equivalent behaviour to a "pull the power cord out of the back of the machine" reset. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Reported-by: wanghaibin <wanghaibin.wang@huawei.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: arm/arm64: vgic-its: Free caches when GITS_BASER Valid bit is clearedEric Auger2017-11-061-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | When the GITS_BASER<n>.Valid gets cleared, the data structures in guest RAM are not valid anymore. The device, collection and LPI lists stored in the in-kernel ITS represent the same information in some form of cache. So let's void the cache. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: arm/arm64: vgic-its: New helper functions to free the cacheswanghaibin2017-11-061-21/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We create two new functions that free the device and collection lists. They are currently called by vgic_its_destroy() and other callers will be added in subsequent patches. We also remove the check on its->device_list.next. Lists are initialized in vgic_create_its() and the device is added to the device list only if this latter succeeds. vgic_its_destroy is the device destroy ops. This latter is called by kvm_destroy_devices() which loops on all created devices. So at this point the list is initialized. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: wanghaibin <wanghaibin.wang@huawei.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: arm/arm64: vgic-its: Remove kvm_its_unmap_deviceEric Auger2017-11-061-12/+2
| | | | | | | | | | | | | | | | | | | | Let's remove kvm_its_unmap_device and use kvm_its_free_device as both functions are identical. Signed-off-by: Eric Auger <eric.auger@redhat.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * arm/arm64: KVM: Load the timer state when enabling the timerChristoffer Dall2017-11-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After being lazy with saving/restoring the timer state, we defer that work to vcpu_load and vcpu_put, which ensure that the timer state is loaded on the hardware timers whenever the VCPU runs. Unfortunately, we are failing to do that the first time vcpu_load() runs, because the timer has not yet been enabled at that time. As long as the initialized timer state matches what happens to be in the hardware (a disabled timer, because we never leave the timer screaming), this does not show up as a problem, but is nevertheless incorrect. The solution is simple; disable preemption while setting the timer to be enabled, and call the timer load function when first enabling the timer. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: arm/arm64: Rework kvm_timer_should_fireChristoffer Dall2017-11-063-4/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_timer_should_fire() can be called in two different situations from the kvm_vcpu_block(). The first case is before calling kvm_timer_schedule(), used for wait polling, and in this case the VCPU thread is running and the timer state is loaded onto the hardware so all we have to do is check if the virtual interrupt lines are asserted, becasue the timer interrupt handler functions will raise those lines as appropriate. The second case is inside the wait loop of kvm_vcpu_block(), where we have already called kvm_timer_schedule() and therefore the hardware will be disabled and the software view of the timer state is up to date (timer->loaded is false), and so we can simply check if the timer should fire by looking at the software state. Signed-off-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Get rid of kvm_timer_flush_hwstateChristoffer Dall2017-11-063-26/+0
| | | | | | | | | | | | | | | | | | | | | | Now when both the vtimer and the ptimer when using both the in-kernel vgic emulation and a userspace IRQ chip are driven by the timer signals and at the vcpu load/put boundaries, instead of recomputing the timer state at every entry/exit to/from the guest, we can get entirely rid of the flush hwstate function. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exitChristoffer Dall2017-11-061-24/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to schedule and cancel a hrtimer when entering and exiting the guest, because we know when the physical timer is going to fire when the guest programs it, and we can simply program the hrtimer at that point. Now when the register modifications from the guest go through the kvm_arm_timer_set/get_reg functions, which always call kvm_timer_update_state(), we can simply consider the timer state in this function and schedule and cancel the timers as needed. This avoids looking at the physical timer emulation state when entering and exiting the VCPU, allowing for faster servicing of the VM when needed. Signed-off-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Move phys_timer_emulate functionChristoffer Dall2017-11-061-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | We are about to call phys_timer_emulate() from kvm_timer_update_state() and modify phys_timer_emulate() at the same time. Moving the function and modifying it in a single patch makes the diff hard to read, so do this separately first. No functional change. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Use kvm_arm_timer_set/get_reg for guest register trapsChristoffer Dall2017-11-061-27/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When trapping on a guest access to one of the timer registers, we were messing with the internals of the timer state from the sysregs handling code, and that logic was about to receive more added complexity when optimizing the timer handling code. Therefore, since we already have timer register access functions (to access registers from userspace), reuse those for the timer register traps from a VM and let the timer code maintain its own consistency. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Support EL1 phys timer register access in set/get regChristoffer Dall2017-11-063-2/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | Add suport for the physical timer registers in kvm_arm_timer_set_reg and kvm_arm_timer_get_reg so that these functions can be reused to interact with the rest of the system. Note that this paves part of the way for the physical timer state save/restore, but we still need to add those registers to KVM_GET_REG_LIST before we support migrating the physical timer state. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
| * KVM: arm/arm64: Avoid timer save/restore in vcpu entry/exitChristoffer Dall2017-11-063-94/+178
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need to save and restore the hardware timer state and examine if it generates interrupts on on every entry/exit to the guest. The timer hardware is perfectly capable of telling us when it has expired by signaling interrupts. When taking a vtimer interrupt in the host, we don't want to mess with the timer configuration, we just want to forward the physical interrupt to the guest as a virtual interrupt. We can use the split priority drop and deactivate feature of the GIC to do this, which leaves an EOI'ed interrupt active on the physical distributor, making sure we don't keep taking timer interrupts which would prevent the guest from running. We can then forward the physical interrupt to the VM using the HW bit in the LR of the GIC, like we do already, which lets the guest directly deactivate both the physical and virtual timer simultaneously, allowing the timer hardware to exit the VM and generate a new physical interrupt when the timer output is again asserted later on. We do need to capture this state when migrating VCPUs between physical CPUs, however, which we use the vcpu put/load functions for, which are called through preempt notifiers whenever the thread is scheduled away from the CPU or called directly if we return from the ioctl to userspace. One caveat is that we have to save and restore the timer state in both kvm_timer_vcpu_[put/load] and kvm_timer_[schedule/unschedule], because we can have the following flows: 1. kvm_vcpu_block 2. kvm_timer_schedule 3. schedule 4. kvm_timer_vcpu_put (preempt notifier) 5. schedule (vcpu thread gets scheduled back) 6. kvm_timer_vcpu_load (preempt notifier) 7. kvm_timer_unschedule And a version where we don't actually call schedule: 1. kvm_vcpu_block 2. kvm_timer_schedule 7. kvm_timer_unschedule Since kvm_timer_[schedule/unschedule] may not be followed by put/load, but put/load also may be called independently, we call the timer save/restore functions from both paths. Since they rely on the loaded flag to never save/restore when unnecessary, this doesn't cause any harm, and we ensure that all invokations of either set of functions work as intended. An added benefit beyond not having to read and write the timer sysregs on every entry and exit is that we no longer have to actively write the active state to the physical distributor, because we configured the irq for the vtimer to only get a priority drop when handling the interrupt in the GIC driver (we called irq_set_vcpu_affinity()), and the interrupt stays active after firing on the host. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
| * KVM: arm/arm64: Set VCPU affinity for virt timer irqChristoffer Dall2017-11-061-0/+9
| | | | | | | | | | | | | | | | | | As we are about to take physical interrupts for the virtual timer on the host but want to leave those active while running the VM (and let the VM deactivate them), we need to set the vtimer PPI affinity accordingly. Signed-off-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Move timer save/restore out of the hyp codeChristoffer Dall2017-11-068-52/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As we are about to be lazy with saving and restoring the timer registers, we prepare by moving all possible timer configuration logic out of the hyp code. All virtual timer registers can be programmed from EL1 and since the arch timer is always a level triggered interrupt we can safely do this with interrupts disabled in the host kernel on the way to the guest without taking vtimer interrupts in the host kernel (yet). The downside is that the cntvoff register can only be programmed from hyp mode, so we jump into hyp mode and back to program it. This is also safe, because the host kernel doesn't use the virtual timer in the KVM code. It may add a little performance performance penalty, but only until following commits where we move this operation to vcpu load/put. Signed-off-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Use separate timer for phys timer emulationChristoffer Dall2017-11-062-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were using the same hrtimer for emulating the physical timer and for making sure a blocking VCPU thread would be eventually woken up. That worked fine in the previous arch timer design, but as we are about to actually use the soft timer expire function for the physical timer emulation, change the logic to use a dedicated hrtimer. This has the added benefit of not having to cancel any work in the sync path, which in turn allows us to run the flush and sync with IRQs disabled. Note that the hrtimer used to program the host kernel's timer to generate an exit from the guest when the emulated physical timer fires never has to inject any work, and to share the soft_timer_cancel() function with the bg_timer, we change the function to only cancel any pending work if the pointer to the work struct is not null. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
| * KVM: arm/arm64: Move timer/vgic flush/sync under disabled irqChristoffer Dall2017-11-061-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As we are about to play tricks with the timer to be more lazy in saving and restoring state, we need to move the timer sync and flush functions under a disabled irq section and since we have to flush the vgic state after the timer and PMU state, we do the whole flush/sync sequence with disabled irqs. The only downside is a slightly longer delay before being able to process hardware interrupts and run softirqs. Signed-off-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Rename soft timer to bg_timerChristoffer Dall2017-11-062-10/+10
| | | | | | | | | | | | | | | | | | As we are about to introduce a separate hrtimer for the physical timer, call this timer bg_timer, because we refer to this timer as the background timer in the code and comments elsewhere. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
| * KVM: arm/arm64: Make timer_arm and timer_disarm helpers more genericChristoffer Dall2017-11-062-25/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are about to add an additional soft timer to the arch timer state for a VCPU and would like to be able to reuse the functions to program and cancel a timer, so we make them slightly more generic and rename to make it more clear that these functions work on soft timers and not the hardware resource that this code is managing. The armed flag on the timer state is only used to assert a condition, and we don't rely on this assertion in any meaningful way, so we can simply get rid of this flack and slightly reduce complexity. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
| * KVM: arm/arm64: Check that system supports split eoi/deactivateChristoffer Dall2017-11-062-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some systems without proper firmware and/or hardware description data don't support the split EOI and deactivate operation. On such systems, we cannot leave the physical interrupt active after the timer handler on the host has run, so we cannot support KVM with an in-kernel GIC with the timer changes we are about to introduce. This patch makes sure that trying to initialize the KVM GIC code will fail on such systems. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
| * KVM: arm/arm64: Support calling vgic_update_irq_pending from irq contextChristoffer Dall2017-11-068-72/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are about to optimize our timer handling logic which involves injecting irqs to the vgic directly from the irq handler. Unfortunately, the injection path can take any AP list lock and irq lock and we must therefore make sure to use spin_lock_irqsave where ever interrupts are enabled and we are taking any of those locks, to avoid deadlocking between process context and the ISR. This changes a lot of the VGIC code, but the good news are that the changes are mostly mechanical. Acked-by: Marc Zyngier <marc,zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
| * KVM: arm/arm64: Guard kvm_vgic_map_is_active against !vgic_initializedChristoffer Dall2017-11-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | If the vgic is not initialized, don't try to grab its spinlocks or traverse its data structures. This is important because we soon have to start considering the active state of a virtual interrupts when doing vcpu_load, which may happen early on before the vgic is initialized. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
| * arm64: Use physical counter for in-kernel reads when booted in EL2Christoffer Dall2017-11-062-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using the physical counter allows KVM to retain the offset between the virtual and physical counter as long as it is actively running a VCPU. As soon as a VCPU is released, another thread is scheduled or we start running userspace applications, we reset the offset to 0, so that userspace accessing the virtual timer can still read the virtual counter and get the same view of time as the kernel. This opens up potential improvements for KVM performance, but we have to make a few adjustments to preserve system consistency. Currently get_cycles() is hardwired to arch_counter_get_cntvct() on arm64, but as we move to using the physical timer for the in-kernel time-keeping on systems that boot in EL2, we should use the same counter for get_cycles() as for other in-kernel timekeeping operations. Similarly, implementations of arch_timer_set_next_event_phys() is modified to use the counter specific to the timer being programmed. VHE kernels or kernels continuing to use the virtual timer are unaffected. Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
| * arm64: Implement arch_counter_get_cntpct to read the physical counterChristoffer Dall2017-11-062-5/+26
| | | | | | | | | | | | | | | | | | | | | | | | As we are about to use the physical counter on arm64 systems that have KVM support, implement arch_counter_get_cntpct() and the associated errata workaround functionality for stable timer reads. Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* | Merge branch 'kvm-ppc-next' of ↵Paolo Bonzini2017-11-0218-338/+797
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD Apart from various bugfixes and code cleanups, the major new feature is the ability to run guests using the hashed page table (HPT) MMU mode on a host that is using the radix MMU mode. Because of limitations in the current POWER9 chip (all SMT threads in each core must use the same MMU mode, HPT or radix), this requires the host to be configured to run similar to POWER8: the host runs in single-threaded mode (only thread 0 of each core online), and have KVM be able to wake up the other threads when a KVM guest is to be run, and use the other threads for running guest VCPUs. A new module parameter, called "indep_threads_mode", is normally Y on POWER9 but must be set to N before any HPT guests can be run on a radix host: # echo N >/sys/module/kvm_hv/parameters/indep_threads_mode # ppc64_cpu --smt=off Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hostsPaul Mackerras2017-11-015-17/+212
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the restriction that a radix host can only run radix guests, allowing us to run HPT (hashed page table) guests as well. This is useful because it provides a way to run old guest kernels that know about POWER8 but not POWER9. Unfortunately, POWER9 currently has a restriction that all threads in a given code must either all be in HPT mode, or all in radix mode. This means that when entering a HPT guest, we have to obtain control of all 4 threads in the core and get them to switch their LPIDR and LPCR registers, even if they are not going to run a guest. On guest exit we also have to get all threads to switch LPIDR and LPCR back to host values. To make this feasible, we require that KVM not be in the "independent threads" mode, and that the CPU cores be in single-threaded mode from the host kernel's perspective (only thread 0 online; threads 1, 2 and 3 offline). That allows us to use the same code as on POWER8 for obtaining control of the secondary threads. To manage the LPCR/LPIDR changes required, we extend the kvm_split_info struct to contain the information needed by the secondary threads. All threads perform a barrier synchronization (where all threads wait for every other thread to reach the synchronization point) on guest entry, both before and after loading LPCR and LPIDR. On guest exit, they all once again perform a barrier synchronization both before and after loading host values into LPCR and LPIDR. Finally, it is also currently necessary to flush the entire TLB every time we enter a HPT guest on a radix host. We do this on thread 0 with a loop of tlbiel instructions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded modePaul Mackerras2017-11-013-33/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows for a mode on POWER9 hosts where we control all the threads of a core, much as we do on POWER8. The mode is controlled by a module parameter on the kvm_hv module, called "indep_threads_mode". The normal mode on POWER9 is the "independent threads" mode, with indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any desired SMT mode) and each thread independently enters and exits from KVM guests without reference to what other threads in the core are doing. If indep_threads_mode is set to N at the point when a VM is started, KVM will expect every core that the guest runs on to be in single threaded mode (that is, threads 1, 2 and 3 offline), and will set the flag that prevents secondary threads from coming online. We can still use all four threads; the code that implements dynamic micro-threading on POWER8 will become active in over-commit situations and will allow up to three other VCPUs to be run on the secondary threads of the core whenever a VCPU is run. The reason for wanting this mode is that this will allow us to run HPT guests on a radix host on a POWER9 machine that does not support "mixed mode", that is, having some threads in a core be in HPT mode while other threads are in radix mode. It will also make it possible to implement a "strict threads" mode in future, if desired. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix hostPaul Mackerras2017-11-015-37/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This sets up the machinery for switching a guest between HPT (hashed page table) and radix MMU modes, so that in future we can run a HPT guest on a radix host on POWER9 machines. * The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or radix mode, on a radix host. * The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9 with HV KVM on a radix host. * The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a radix host. * The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the guest to HPT mode and allocate a HPT. * For simplicity, we now allocate the rmap array for each memslot, even on a radix host, since it will be needed if the guest switches to HPT mode. * Since we cannot yet run a HPT guest on a radix host, the KVM_RUN ioctl will return an EINVAL error in that case. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Unify dirty page map between HPT and radixPaul Mackerras2017-11-017-113/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the HPT code in HV KVM maintains a dirty bit per guest page in the rmap array, whether or not dirty page tracking has been enabled for the memory slot. In contrast, the radix code maintains a dirty bit per guest page in memslot->dirty_bitmap, and only does so when dirty page tracking has been enabled. This changes the HPT code to maintain the dirty bits in the memslot dirty_bitmap like radix does. This results in slightly less code overall, and will mean that we do not lose the dirty bits when transitioning between HPT and radix mode in future. There is one minor change to behaviour as a result. With HPT, when dirty tracking was enabled for a memslot, we would previously clear all the dirty bits at that point (both in the HPT entries and in the rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately following would show no pages as dirty (assuming no vcpus have run in the meantime). With this change, the dirty bits on HPT entries are not cleared at the point where dirty tracking is enabled, so KVM_GET_DIRTY_LOG would show as dirty any guest pages that are resident in the HPT and dirty. This is consistent with what happens on radix. This also fixes a bug in the mark_pages_dirty() function for radix (in the sense that the function no longer exists). In the case where a large page of 64 normal pages or more is marked dirty, the addressing of the dirty bitmap was incorrect and could write past the end of the bitmap. Fortunately this case was never hit in practice because a 2MB large page is only 32 x 64kB pages, and we don't support backing the guest with 1GB huge pages at this point. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_readyPaul Mackerras2017-11-013-33/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This renames the kvm->arch.hpte_setup_done field to mmu_ready because we will want to use it for radix guests too -- both for setting things up before vcpu execution, and for excluding vcpus from executing while MMU-related things get changed, such as in future switching the MMU from radix to HPT mode or vice-versa. This also moves the call to kvmppc_setup_partition_table() that was done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting of mmu_ready, into the caller in kvmppc_vcpu_run_hv(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Don't rely on host's page size informationPaul Mackerras2017-11-014-53/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes the dependence of KVM on the mmu_psize_defs array (which stores information about hardware support for various page sizes) and the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(), hpte_actual_page_size() and get_sllp_encoding(). We also no longer rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature bit. The reason for doing this is so we can support a HPT guest on a radix host. In a radix host, the mmu_psize_defs array contains information about page sizes supported by the MMU in radix mode rather than the page sizes supported by the MMU in HPT mode. Similarly, mmu_slb_size and the MMU_FTR_1T_SEGMENTS bit are not set. Instead we hard-code knowledge of the behaviour of the HPT MMU in the POWER7, POWER8 and POWER9 processors (which are the only processors supported by HV KVM) - specifically the encoding of the LP fields in the HPT and SLB entries, and the fact that they have 32 SLB entries and support 1TB segments. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras2017-11-015-50/+14
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This merges in the ppc-kvm topic branch of the powerpc tree to get the commit that reverts the patch "KVM: PPC: Book3S HV: POWER9 does not require secondary thread management". This is needed for subsequent patches which will be applied on this branch. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Tie KVM_CAP_PPC_HTM to the user-visible TM featureMichael Ellerman2017-10-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we use CPU_FTR_TM to decide if the CPU/kernel can support TM (Transactional Memory), and if it's true we advertise that to Qemu (or similar) via KVM_CAP_PPC_HTM. PPC_FEATURE2_HTM is the user-visible feature bit, which indicates that the CPU and kernel can support TM. Currently CPU_FTR_TM and PPC_FEATURE2_HTM always have the same value, either true or false, so using the former for KVM_CAP_PPC_HTM is correct. However some Power9 CPUs can operate in a mode where TM is enabled but TM suspended state is disabled. In this mode CPU_FTR_TM is true, but PPC_FEATURE2_HTM is false. Instead a different PPC_FEATURE2 bit is set, to indicate that this different mode of TM is available. It is not safe to let guests use TM as-is, when the CPU is in this mode. So to prevent that from happening, use PPC_FEATURE2_HTM to determine the value of KVM_CAP_PPC_HTM. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| | * | Revert "KVM: PPC: Book3S HV: POWER9 does not require secondary thread ↵Paul Mackerras2017-10-194-48/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | management" This reverts commit 94a04bc25a2c6296bd0c5e82c10e8231c2b11f77. In order to run HPT guests on a radix POWER9 host, we will have to run the host in single-threaded mode, because POWER9 processors do not currently support running some threads of a core in HPT mode while others are in radix mode ("mixed mode"). That means that we will need the same mechanisms that are used on POWER8 to make the secondary threads available to KVM, which were disabled on POWER9 by commit 94a04bc25a2c. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | KVM: PPC: Book3S: Fix gas warning due to using r0 as immediate 0Nicholas Piggin2017-11-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes the message: arch/powerpc/kvm/book3s_segment.S: Assembler messages: arch/powerpc/kvm/book3s_segment.S:330: Warning: invalid register expression Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S PR: Only install valid SLBs during KVM_SET_SREGSGreg Kurz2017-11-011-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS, some of which are valid (ie, SLB_ESID_V is set) and the rest are likely all-zeroes (with QEMU at least). Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which assumes to find the SLB index in the 3 lower bits of its rb argument. When passed zeroed arguments, it happily overwrites the 0th SLB entry with zeroes. This is exactly what happens while doing live migration with QEMU when the destination pushes the incoming SLB descriptors to KVM PR. When reloading the SLBs at the next synchronization, QEMU first clears its SLB array and only restore valid ones, but the 0th one is now gone and we cannot access the corresponding memory anymore: (qemu) x/x $pc c0000000000b742c: Cannot access memory To avoid this, let's filter out non-valid SLB entries. While here, we also force a full SLB flush before installing new entries. Since SLB is for 64-bit only, we now build this path conditionally to avoid a build break on 32-bit, which doesn't define SLB_ESID_V. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not enabledPaul Mackerras2017-11-011-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running a guest on a POWER9 system with the in-kernel XICS emulation disabled (for example by running QEMU with the parameter "-machine pseries,kernel_irqchip=off"), the kernel does not pass the XICS-related hypercalls such as H_CPPR up to userspace for emulation there as it should. The reason for this is that the real-mode handlers for these hypercalls don't check whether a XICS device has been instantiated before calling the xics-on-xive code. That code doesn't check either, leading to potential NULL pointer dereferences because vcpu->arch.xive_vcpu is NULL. Those dereferences won't cause an exception in real mode but will lead to kernel memory corruption. This fixes it by adding kvmppc_xics_enabled() checks before calling the XICS functions. Cc: stable@vger.kernel.org # v4.11+ Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Explicitly disable HPT operations on radix guestsPaul Mackerras2017-10-161-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds code to make sure that we don't try to access the non-existent HPT for a radix guest using the htab file for the VM in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls. At present nothing bad happens if userspace does access these interfaces on a radix guest, mostly because kvmppc_hpt_npte() gives 0 for a radix guest, which in turn is because 1 << -4 comes out as 0 on POWER processors. However, that relies on undefined behaviour, so it is better to be explicit about not accessing the HPT for a radix guest. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S PR: Enable in-kernel TCE handlers for PR KVMAlexey Kardashevskiy2017-10-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The handlers support PR KVM from the day one; however the PR KVM's enable/disable hcalls handler missed these ones. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Delete an error message for a failed memory allocation ↵Markus Elfring2017-10-141-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | in kvmppc_allocate_hpt() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: BookE: Use vma_pages functionThomas Meyer2017-10-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use vma_pages function on vma object instead of explicit computation. Found by coccinelle spatch "api/vma_pages.cocci" Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Use ARRAY_SIZE macroThomas Meyer2017-10-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use ARRAY_SIZE macro, rather than explicitly coding some variant of it yourself. Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) /ARRAY_SIZE(\1)/g' and manual check/verification. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | | KVM: PPC: Book3S HV: Handle unexpected interrupts betterPaul Mackerras2017-10-142-1/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present, if an interrupt (i.e. an exception or trap) occurs in the code where KVM is switching the MMU to or from guest context, we jump to kvmppc_bad_host_intr, where we simply spin with interrupts disabled. In this situation, it is hard to debug what happened because we get no indication as to which interrupt occurred or where. Typically we get a cascade of stall and soft lockup warnings from other CPUs. In order to get more information for debugging, this adds code to create a stack frame on the emergency stack and save register values to it. We start half-way down the emergency stack in order to give ourselves some chance of being able to do a stack trace on secondary threads that are already on the emergency stack. On POWER7 or POWER8, we then just spin, as before, because we don't know what state the MMU context is in or what other threads are doing, and we can't switch back to host context without coordinating with other threads. On POWER9 we can do better; there we load up the host MMU context and jump to C code, which prints an oops message to the console and panics. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
* | | | KVM: X86: #GP when guest attempts to write MCi_STATUS register w/o 0Wanpeng Li2017-10-201-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both Intel SDM and AMD APM mentioned that MCi_STATUS, when the register is implemented, this register can be cleared by explicitly writing 0s to this register. Writing 1s to this register will cause a general-protection exception. The mce is emulated in qemu, so just the guest attempts to write 1 to this register should cause a #GP, this patch does it. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | | | KVM: VMX: Fix VPID capability detectionWanpeng Li2017-10-201-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In my setup, EPT is not exposed to L1, the VPID capability is exposed and can be observed by vmxcap tool in L1: INVVPID supported yes Individual-address INVVPID yes Single-context INVVPID yes All-context INVVPID yes Single-context-retaining-globals INVVPID yes However, the module parameter of VPID observed in L1 is always N, the cpu_has_vmx_invvpid() check in L1 KVM fails since vmx_capability.vpid is 0 and it is not read from MSR due to EPT is not exposed. The VPID can be used to tag linear mappings when EPT is not enabled. However, current logic just detects VPID capability if EPT is enabled, this patch fixes it. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | | | KVM: nVMX: Fix EPT switching advertisingWanpeng Li2017-10-201-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I can use vmxcap tool to observe "EPTP Switching yes" even if EPT is not exposed to L1. EPT switching is advertised unconditionally since it is emulated, however, it can be treated as an extended feature for EPT and it should not be advertised if EPT itself is not exposed. This patch fixes it. Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | | | KVM: SVM: detect opening of SMI window using STGI interceptLadi Prosek2017-10-184-9/+36
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 05cade71cf3b ("KVM: nSVM: fix SMI injection in guest mode") made KVM mask SMI if GIF=0 but it didn't do anything to unmask it when GIF is enabled. The issue manifests for me as a significantly longer boot time of Windows guests when running with SMM-enabled OVMF. This commit fixes it by intercepting STGI instead of requesting immediate exit if the reason why SMM was masked is GIF. Fixes: 05cade71cf3b ("KVM: nSVM: fix SMI injection in guest mode") Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>