aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* ext4: verify the depth of extent tree in ext4_find_extent()Theodore Ts'o2018-06-142-0/+7
| | | | | | | | | | | | | If there is a corupted file system where the claimed depth of the extent tree is -1, this can cause a massive buffer overrun leading to sadness. This addresses CVE-2018-10877. https://bugzilla.kernel.org/show_bug.cgi?id=199417 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* ext4: only look at the bg_flags field if it is validTheodore Ts'o2018-06-144-6/+36
| | | | | | | | | | | | | | | | | | The bg_flags field in the block group descripts is only valid if the uninit_bg or metadata_csum feature is enabled. We were not consistently looking at this field; fix this. Also block group #0 must never have uninitialized allocation bitmaps, or need to be zeroed, since that's where the root inode, and other special inodes are set up. Check for these conditions and mark the file system as corrupted if they are detected. This addresses CVE-2018-10876. https://bugzilla.kernel.org/show_bug.cgi?id=199403 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* ext4: make sure bitmaps and the inode table don't overlap with bg descriptorsTheodore Ts'o2018-06-131-0/+25
| | | | | | | | | | | | It's really bad when the allocation bitmaps and the inode table overlap with the block group descriptors, since it causes random corruption of the bg descriptors. So we really want to head those off at the pass. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* ext4: always check block group bounds in ext4_init_block_bitmap()Theodore Ts'o2018-06-131-7/+3
| | | | | | | | | | | Regardless of whether the flex_bg feature is set, we should always check to make sure the bits we are setting in the block bitmap are within the block group bounds. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* ext4: always verify the magic number in xattr blocksTheodore Ts'o2018-06-131-3/+3
| | | | | | | | | | | | | | | | | | | If there an inode points to a block which is also some other type of metadata block (such as a block allocation bitmap), the buffer_verified flag can be set when it was validated as that other metadata block type; however, it would make a really terrible external attribute block. The reason why we use the verified flag is to avoid constantly reverifying the block. However, it doesn't take much overhead to make sure the magic number of the xattr block is correct, and this will avoid potential crashes. This addresses CVE-2018-10879. https://bugzilla.kernel.org/show_bug.cgi?id=200001 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org
* ext4: add corruption check in ext4_xattr_set_entry()Theodore Ts'o2018-06-131-2/+8
| | | | | | | | | | | | | | In theory this should have been caught earlier when the xattr list was verified, but in case it got missed, it's simple enough to add check to make sure we don't overrun the xattr buffer. This addresses CVE-2018-10879. https://bugzilla.kernel.org/show_bug.cgi?id=200001 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org
* ext4: add warn_on_error mount optionTheodore Ts'o2018-06-122-1/+13
| | | | | | | This is very handy when debugging bugs handling maliciously corrupted file systems. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix fencepost error in check for inode count overflow during resizeJan Kara2018-05-251-1/+1
| | | | | | | | | | | | | | ext4_resize_fs() has an off-by-one bug when checking whether growing of a filesystem will not overflow inode count. As a result it allows a filesystem with 8192 inodes per group to grow to 64TB which overflows inode count to 0 and makes filesystem unusable. Fix it. Cc: stable@vger.kernel.org Fixes: 3f8a6411fbada1fa482276591e037f3b1adcf55b Reported-by: Jaco Kroon <jaco@uls.co.za> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* ext4: correctly handle a zero-length xattr with a non-zero e_value_offsTheodore Ts'o2018-05-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 will always create ext4 extended attributes which do not have a value (where e_value_size is zero) with e_value_offs set to zero. In most places e_value_offs will not be used in a substantive way if e_value_size is zero. There was one exception to this, which is in ext4_xattr_set_entry(), where if there is a maliciously crafted file system where there is an extended attribute with e_value_offs is non-zero and e_value_size is 0, the attempt to remove this xattr will result in a negative value getting passed to memmove, leading to the following sadness: [ 41.225365] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null) [ 44.538641] BUG: unable to handle kernel paging request at ffff9ec9a3000000 [ 44.538733] IP: __memmove+0x81/0x1a0 [ 44.538755] PGD 1249bd067 P4D 1249bd067 PUD 1249c1067 PMD 80000001230000e1 [ 44.538793] Oops: 0003 [#1] SMP PTI [ 44.539074] CPU: 0 PID: 1470 Comm: poc Not tainted 4.16.0-rc1+ #1 ... [ 44.539475] Call Trace: [ 44.539832] ext4_xattr_set_entry+0x9e7/0xf80 ... [ 44.539972] ext4_xattr_block_set+0x212/0xea0 ... [ 44.540041] ext4_xattr_set_handle+0x514/0x610 [ 44.540065] ext4_xattr_set+0x7f/0x120 [ 44.540090] __vfs_removexattr+0x4d/0x60 [ 44.540112] vfs_removexattr+0x75/0xe0 [ 44.540132] removexattr+0x4d/0x80 ... [ 44.540279] path_removexattr+0x91/0xb0 [ 44.540300] SyS_removexattr+0xf/0x20 [ 44.540322] do_syscall_64+0x71/0x120 [ 44.540344] entry_SYSCALL_64_after_hwframe+0x21/0x86 https://bugzilla.kernel.org/show_bug.cgi?id=199347 This addresses CVE-2018-10840. Reported-by: "Xu, Wen" <wen.xu@gatech.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org Fixes: dec214d00e0d7 ("ext4: xattr inode deduplication")
* ext4: bubble errors from ext4_find_inline_data_nolock() up to ext4_iget()Theodore Ts'o2018-05-221-3/+7
| | | | | | | | | | | | | | | | If ext4_find_inline_data_nolock() returns an error it needs to get reflected up to ext4_iget(). In order to fix this, ext4_iget_extra_inode() needs to return an error (and not return void). This is related to "ext4: do not allow external inodes for inline data" (which fixes CVE-2018-11412) in that in the errors=continue case, it would be useful to for userspace to receive an error indicating that file system is corrupted. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org
* ext4: do not allow external inodes for inline dataTheodore Ts'o2018-05-221-0/+6
| | | | | | | | | | | | | | | | | | | | | The inline data feature was implemented before we added support for external inodes for xattrs. It makes no sense to support that combination, but the problem is that there are a number of extended attribute checks that are skipped if e_value_inum is non-zero. Unfortunately, the inline data code is completely e_value_inum unaware, and attempts to interpret the xattr fields as if it were an inline xattr --- at which point, Hilarty Ensues. This addresses CVE-2018-11412. https://bugzilla.kernel.org/show_bug.cgi?id=199803 Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: e50e5129f384 ("ext4: xattr-in-inode support") Cc: stable@kernel.org
* ext4: report delalloc reserve as non-free in statfs for project quotaKonstantin Khlebnikov2018-05-201-1/+2
| | | | | | | | | | This reserved space isn't committed yet but cannot be used for allocations. For userspace it has no difference from used space. XFS already does this. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Fixes: 689c958cbe6b ("ext4: add project quota support")
* ext4: remove NULL check before calling kmem_cache_destroy()Sean Fu2018-05-202-4/+2
| | | | | Signed-off-by: Sean Fu <fxinrong@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: remove NULL check before calling kmem_cache_destroy()Wang Long2018-05-203-23/+13
| | | | | | | | | | | | The kmem_cache_destroy() function already checks for null pointers, so we can remove the check at the call site. This patch also sets jbd2_handle_cache and jbd2_inode_cache to be NULL after freeing them in jbd2_journal_destroy_handle_cache(). Signed-off-by: Wang Long <wanglong19@meituan.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* jbd2: remove bunch of empty lines with jbd2 debugWang Shilong2018-05-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | See following dmesg output with jbd2 debug enabled: ...(start_this_handle, 313): New handle 00000000c88d6ceb going live. ...(start_this_handle, 383): Handle 00000000c88d6ceb given 53 credits (total 53, free 32681) ...(do_get_write_access, 838): journal_head 0000000002856fc0, force_copy 0 ...(jbd2_journal_cancel_revoke, 421): journal_head 0000000002856fc0, cancelling revoke We have an extra line with every messages, this is a waste of buffer, we can fix it by removing "\n" in the caller or remove it in the __jbd2_debug(), i checked every jbd2_debug() passed '\n' explicitly. To avoid more lines, let's remove it inside __jbd2_debug(). Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: handle errors on ext4_commit_superJaegeuk Kim2018-05-131-14/+21
| | | | | | | | | | | | | | | | | | | | | | | | | When remounting ext4 from ro to rw, currently it allows its transition, even if ext4_commit_super() returns EIO. Even worse thing is, after that, fs/buffer complains buffer dirty bits like: Call trace: [<ffffff9750c259dc>] mark_buffer_dirty+0x184/0x1a4 [<ffffff9750cb398c>] __ext4_handle_dirty_super+0x4c/0xfc [<ffffff9750c7a9fc>] ext4_file_open+0x154/0x1c0 [<ffffff9750bea51c>] do_dentry_open+0x114/0x2d0 [<ffffff9750bea75c>] vfs_open+0x5c/0x94 [<ffffff9750bf879c>] path_openat+0x668/0xfe8 [<ffffff9750bf8088>] do_filp_open+0x74/0x120 [<ffffff9750beac98>] do_sys_open+0x148/0x254 [<ffffff9750beade0>] SyS_openat+0x10/0x18 [<ffffff9750a83ab0>] el0_svc_naked+0x24/0x28 EXT4-fs (dm-1): previous I/O error to superblock detected Buffer I/O error on dev dm-1, logical block 0, lost sync page write EXT4-fs (dm-1): re-mounted. Opts: (null) Buffer I/O error on dev dm-1, logical block 80, lost async page write Signed-off-by: Jaegeuk Kim <jaegeuk@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: do not update s_last_mounted of a frozen fsAmir Goldstein2018-05-131-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If fs is frozen after mount and before the first file open, the update of s_last_mounted bypasses freeze protection and prints out a WARNING splat: $ mount /vdf $ fsfreeze -f /vdf $ cat /vdf/foo [ 31.578555] WARNING: CPU: 1 PID: 1415 at fs/ext4/ext4_jbd2.c:53 ext4_journal_check_start+0x48/0x82 [ 31.614016] Call Trace: [ 31.614997] __ext4_journal_start_sb+0xe4/0x1a4 [ 31.616771] ? ext4_file_open+0xb6/0x189 [ 31.618094] ext4_file_open+0xb6/0x189 If fs is frozen, skip s_last_mounted update. [backport hint: to apply to stable tree, need to apply also patches vfs: add the sb_start_intwrite_trylock() helper ext4: factor out helper ext4_sample_last_mounted()] Cc: stable@vger.kernel.org Fixes: bc0b0d6d69ee ("ext4: update the s_last_mounted field in the superblock") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: factor out helper ext4_sample_last_mounted()Amir Goldstein2018-05-131-36/+46
| | | | | | Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* vfs: add the sb_start_intwrite_trylock() helperAmir Goldstein2018-05-131-0/+5
| | | | | | | | Needed by ext4 to test frozen fs before updating s_last_mounted. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: update mtime in ext4_punch_hole even if no blocks are releasedLukas Czerner2018-05-131-18/+18
| | | | | | | | | | | | | | Currently in ext4_punch_hole we're going to skip the mtime update if there are no actual blocks to release. However we've actually modified the file by zeroing the partial block so the mtime should be updated. Moreover the sync and datasync handling is skipped as well, which is also wrong. Fix it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Joe Habermann <joe.habermann@quantum.com> Cc: <stable@vger.kernel.org>
* ext4: add verifier check for symlink with append/immutable flagsLuis R. Rodriguez2018-05-131-0/+7
| | | | | | | | | | The Linux VFS does not allow a way to set append/immuttable attributes to symlinks, this is just not possible. If this is detected inform the user as the filesystem must be corrupted. Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* fs: ext4: add new return type vm_fault_tSouptick Joarder2018-05-131-3/+4
| | | | | | | | | | | | | Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f422059ae ("mm: change return type to vm_fault_t") Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
* ext4: fix hole length detection in ext4_ind_map_blocks()Jan Kara2018-05-121-4/+10
| | | | | | | | | | | | | | | | | When ext4_ind_map_blocks() computes a length of a hole, it doesn't count with the fact that mapped offset may be somewhere in the middle of the completely empty subtree. In such case it will return too large length of the hole which then results in lseek(SEEK_DATA) to end up returning an incorrect offset beyond the end of the hole. Fix the problem by correctly taking offset within a subtree into account when computing a length of a hole. Fixes: facab4d9711e7aa3532cb82643803e8f1b9518e8 CC: stable@vger.kernel.org Reported-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: mark block bitmap corrupted when foundWang Shilong2018-05-122-0/+10
| | | | | | | | | | | | | | | | There are still some cases that we missed to set block bitmaps corrupted bit properly: 1) block bitmap number is wrong. 2) failed to read block bitmap due to disk errors. 3) double free block bitmaps.. 4) some mismatch check with bitmaps vs buddy information. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Wang Shilong <wshilong@ddn.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* ext4: mark inode bitmap corrupted when foundWang Shilong2018-05-121-4/+9
| | | | | | | | | | | | | | | | | There are still some cases that we missed to set block bitmaps corrupted bit properly: 1)inode bitmap number is wrong. 2)failed to read block bitmap due to disk errors. 3)double allocations from bitmap Also remove a duplicated call ext4_error() afer ext4_read_inode_bitmap(), as ext4_error() have been called inside ext4_read_inode_bitmap() properly. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* ext4: add new ext4_mark_group_bitmap_corrupted() helperWang Shilong2018-05-125-48/+52
| | | | | | | | | | Since there are many places to set inode/block bitmap corrupt bit, add a new helper for it, which will make codes more clear. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* ext4: fix wrong return value in ext4_read_inode_bitmap()Wang Shilong2018-05-121-1/+1
| | | | | | | | | The only reason that sb_getblk() could fail is out of memory, ext4 codes have returned -ENOMME for all other places except this one, let's fix it here too. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use raw i_version value for ea_inodeEryu Guan2018-05-101-2/+22
| | | | | | | | | | | | | | | | | | | | | | Currently, creating large xattr (e.g. 2k) in ea_inode would cause ea_inode refcount corruption, e.g. Pass 4: Checking reference counts Extended attribute inode 13 ref count is 0, should be 1. Fix? no This is because that we save the lower 32bit of refcount in inode->i_version and store it in raw_inode->i_disk_version on disk. But since commit ee73f9a52a34 ("ext4: convert to new i_version API"), we load/store modified i_disk_version from/to disk instead of raw value, which causes on-disk ea_inode refcount corruption. Fix it by loading/storing raw i_version/i_disk_version, because it's a self-managed value in this case. Fixes: ee73f9a52a34 ("ext4: convert to new i_version API") Cc: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use XATTR_CREATE in ext4_initxattrs()Eryu Guan2018-05-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I hit ENOSPC error when creating new file in a newly created ext4 with ea_inode feature enabled, if selinux is enabled and ext4 is mounted without any selinux context. e.g. mkfs -t ext4 -O ea_inode -F /dev/sda5 mount /dev/sda5 /mnt/ext4 touch /mnt/ext4/testfile # got ENOSPC here It turns out that we run out of journal credits in ext4_xattr_set_handle() when creating new selinux label for the newly created inode. This is because that in __ext4_new_inode() we use __ext4_xattr_set_credits() to calculate the reserved credits for new xattr, with the 'is_create' argument being true, which implies less credits in the ea_inode case. But we calculate the required credits in ext4_xattr_set_handle() with 'is_create' being false, which means we need more credits if ea_inode feature is enabled. So we don't have enough credits and error out with ENOSPC. Fix it by simply calling ext4_xattr_set_handle() with XATTR_CREATE flag in ext4_initxattrs(), so we end up with requiring less credits than reserved. The semantic of XATTR_CREATE is "Perform a pure create, which fails if the named attribute exists already." (from setxattr(2)), which is fine in this case, because we only call ext4_initxattrs() on newly created inode. Fixes: af65207c76ce ("ext4: fix __ext4_new_inode() journal credits calculation") Cc: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: make function ‘ext4_getfsmap_find_fixed_metadata’ staticMathieu Malaterre2018-05-101-2/+2
| | | | | | | | | | Since function ‘ext4_getfsmap_find_fixed_metadata’ can be made static, make it so. Remove the following gcc warning (W=1): fs/ext4/fsmap.c:405:5: warning: no previous prototype for ‘ext4_getfsmap_find_fixed_metadata’ [-Wmissing-prototypes] Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Linux 4.17-rc4Linus Torvalds2018-05-061-2/+2
|
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-05-0610-85/+122
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pll KVM fixes from Radim Krčmář: "ARM: - Fix proxying of GICv2 CPU interface accesses - Fix crash when switching to BE - Track source vcpu git GICv2 SGIs - Fix an outdated bit of documentation x86: - Speed up injection of expired timers (for stable)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: remove APIC Timer periodic/oneshot spikes arm64: vgic-v2: Fix proxying of cpuif access KVM: arm/arm64: vgic_init: Cleanup reference to process_maintenance KVM: arm64: Fix order of vcpu_write_sys_reg() arguments KVM: arm/arm64: vgic: Fix source vcpu issues for GICv2 SGI
| * KVM: x86: remove APIC Timer periodic/oneshot spikesAnthoine Bourgeois2018-05-051-17/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the commit "8003c9ae204e: add APIC Timer periodic/oneshot mode VMX preemption timer support", a Windows 10 guest has some erratic timer spikes. Here the results on a 150000 times 1ms timer without any load: Before 8003c9ae204e | After 8003c9ae204e Max 1834us | 86000us Mean 1100us | 1021us Deviation 59us | 149us Here the results on a 150000 times 1ms timer with a cpu-z stress test: Before 8003c9ae204e | After 8003c9ae204e Max 32000us | 140000us Mean 1006us | 1997us Deviation 140us | 11095us The root cause of the problem is starting hrtimer with an expiry time already in the past can take more than 20 milliseconds to trigger the timer function. It can be solved by forward such past timers immediately, rather than submitting them to hrtimer_start(). In case the timer is periodic, update the target expiration and call hrtimer_start with it. v2: Check if the tsc deadline is already expired. Thank you Mika. v3: Execute the past timers immediately rather than submitting them to hrtimer_start(). v4: Rearm the periodic timer with advance_periodic_target_expiration() a simpler version of set_target_expiration(). Thank you Paolo. Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Wanpeng Li <kernellwp@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com> 8003c9ae204e ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * Merge tag 'kvmarm-fixes-for-4.17-2' of ↵Radim Krčmář2018-05-059-68/+102
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm KVM/arm fixes for 4.17, take #2 - Fix proxying of GICv2 CPU interface accesses - Fix crash when switching to BE - Track source vcpu git GICv2 SGIs - Fix an outdated bit of documentation
| | * arm64: vgic-v2: Fix proxying of cpuif accessJames Morse2018-05-041-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Proxying the cpuif accesses at EL2 makes use of vcpu_data_guest_to_host and co, which check the endianness, which call into vcpu_read_sys_reg... which isn't mapped at EL2 (it was inlined before, and got moved OoL with the VHE optimizations). The result is of course a nice panic. Let's add some specialized cruft to keep the broken platforms that require this hack alive. But, this code used vcpu_data_guest_to_host(), which expected us to write the value to host memory, instead we have trapped the guest's read or write to an mmio-device, and are about to replay it using the host's readl()/writel() which also perform swabbing based on the host endianness. This goes wrong when both host and guest are big-endian, as readl()/writel() will undo the guest's swabbing, causing the big-endian value to be written to device-memory. What needs doing? A big-endian guest will have pre-swabbed data before storing, undo this. If its necessary for the host, writel() will re-swab it. For a read a big-endian guest expects to swab the data after the load. The hosts's readl() will correct for host endianness, giving us the device-memory's value in the register. For a big-endian guest, swab it as if we'd only done the load. For a little-endian guest, nothing needs doing as readl()/writel() leave the correct device-memory value in registers. Tested on Juno with that rarest of things: a big-endian 64K host. Based on a patch from Marc Zyngier. Reported-by: Suzuki K Poulose <suzuki.poulose@arm.com> Fixes: bf8feb39642b ("arm64: KVM: vgic-v2: Add GICV access from HYP") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic_init: Cleanup reference to process_maintenanceValentin Schneider2018-05-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One comment still mentioned process_maintenance operations after commit af0614991ab6 ("KVM: arm/arm64: vgic: Get rid of unnecessary process_maintenance operation") Update the comment to point to vgic_fold_lr_state instead, which is where maintenance interrupts are taken care of. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm64: Fix order of vcpu_write_sys_reg() argumentsJames Morse2018-05-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A typo in kvm_vcpu_set_be()'s call: | vcpu_write_sys_reg(vcpu, SCTLR_EL1, sctlr) causes us to use the 32bit register value as an index into the sys_reg[] array, and sail off the end of the linear map when we try to bring up big-endian secondaries. | Unable to handle kernel paging request at virtual address ffff80098b982c00 | Mem abort info: | ESR = 0x96000045 | Exception class = DABT (current EL), IL = 32 bits | SET = 0, FnV = 0 | EA = 0, S1PTW = 0 | Data abort info: | ISV = 0, ISS = 0x00000045 | CM = 0, WnR = 1 | swapper pgtable: 4k pages, 48-bit VAs, pgdp = 000000002ea0571a | [ffff80098b982c00] pgd=00000009ffff8803, pud=0000000000000000 | Internal error: Oops: 96000045 [#1] PREEMPT SMP | Modules linked in: | CPU: 2 PID: 1561 Comm: kvm-vcpu-0 Not tainted 4.17.0-rc3-00001-ga912e2261ca6-dirty #1323 | Hardware name: ARM Juno development board (r1) (DT) | pstate: 60000005 (nZCv daif -PAN -UAO) | pc : vcpu_write_sys_reg+0x50/0x134 | lr : vcpu_write_sys_reg+0x50/0x134 | Process kvm-vcpu-0 (pid: 1561, stack limit = 0x000000006df4728b) | Call trace: | vcpu_write_sys_reg+0x50/0x134 | kvm_psci_vcpu_on+0x14c/0x150 | kvm_psci_0_2_call+0x244/0x2a4 | kvm_hvc_call_handler+0x1cc/0x258 | handle_hvc+0x20/0x3c | handle_exit+0x130/0x1ec | kvm_arch_vcpu_ioctl_run+0x340/0x614 | kvm_vcpu_ioctl+0x4d0/0x840 | do_vfs_ioctl+0xc8/0x8d0 | ksys_ioctl+0x78/0xa8 | sys_ioctl+0xc/0x18 | el0_svc_naked+0x30/0x34 | Code: 73620291 604d00b0 00201891 1ab10194 (957a33f8) |---[ end trace 4b4a4f9628596602 ]--- Fix the order of the arguments. Fixes: 8d404c4c24613 ("KVM: arm64: Rewrite system register accessors to read/write functions") CC: Christoffer Dall <cdall@cs.columbia.edu> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic: Fix source vcpu issues for GICv2 SGIMarc Zyngier2018-04-276-61/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we make sure we don't inject multiple instances of the same GICv2 SGI at the same time, we've made another bug more obvious: If we exit with an active SGI, we completely lose track of which vcpu it came from. On the next entry, we restore it with 0 as a source, and if that wasn't the right one, too bad. While this doesn't seem to trouble GIC-400, the architectural model gets offended and doesn't deactivate the interrupt on EOI. Another connected issue is that we will happilly make pending an interrupt from another vcpu, overriding the above zero with something that is just as inconsistent. Don't do that. The final issue is that we signal a maintenance interrupt when no pending interrupts are present in the LR. Assuming we've fixed the two issues above, we end-up in a situation where we keep exiting as soon as we've reached the active state, and not be able to inject the following pending. The fix comes in 3 parts: - GICv2 SGIs have their source vcpu saved if they are active on exit, and restored on entry - Multi-SGIs cannot go via the Pending+Active state, as this would corrupt the source field - Multi-SGIs are converted to using MI on EOI instead of NPIE Fixes: 16ca6a607d84bef0 ("KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintid") Reported-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
* | | Merge tag 'iommu-fixes-v4.17-rc4' of ↵Linus Torvalds2018-05-065-34/+37
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - fix a compile warning in the AMD IOMMU driver with irq remapping disabled - fix for VT-d interrupt remapping and invalidation size (caused a BUG_ON when trying to invalidate more than 4GB) - build fix and a regression fix for broken graphics with old DTS for the rockchip iommu driver - a revert in the PCI window reservation code which fixes a regression with VFIO. * tag 'iommu-fixes-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu: rockchip: fix building without CONFIG_OF iommu/vt-d: Use WARN_ON_ONCE instead of BUG_ON in qi_flush_dev_iotlb() iommu/vt-d: fix shift-out-of-bounds in bug checking iommu/dma: Move PCI window region reservation back into dma specific path. iommu/rockchip: Make clock handling optional iommu/amd: Hide unused iommu_table_lock iommu/vt-d: Fix usage of force parameter in intel_ir_reconfigure_irte()
| * | | iommu: rockchip: fix building without CONFIG_OFArnd Bergmann2018-05-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We get a build error when compiling the iommu driver without CONFIG_OF: drivers/iommu/rockchip-iommu.c: In function 'rk_iommu_of_xlate': drivers/iommu/rockchip-iommu.c:1101:2: error: implicit declaration of function 'of_dev_put'; did you mean 'of_node_put'? [-Werror=implicit-function-declaration] This replaces the of_dev_put() with the equivalent platform_device_put(). Fixes: 5fd577c3eac3 ("iommu/rockchip: Use OF_IOMMU to attach devices automatically") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | iommu/vt-d: Use WARN_ON_ONCE instead of BUG_ON in qi_flush_dev_iotlb()Joerg Roedel2018-05-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | A misaligned address is only worth a warning, and not stopping the while execution path with a BUG_ON(). Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | iommu/vt-d: fix shift-out-of-bounds in bug checkingChangbin Du2018-05-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It allows to flush more than 4GB of device TLBs. So the mask should be 64bit wide. UBSAN captured this fault as below. [ 3.760024] ================================================================================ [ 3.768440] UBSAN: Undefined behaviour in drivers/iommu/dmar.c:1348:3 [ 3.774864] shift exponent 64 is too large for 32-bit type 'int' [ 3.780853] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G U 4.17.0-rc1+ #89 [ 3.788661] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016 [ 3.796034] Call Trace: [ 3.798472] <IRQ> [ 3.800479] dump_stack+0x90/0xfb [ 3.803787] ubsan_epilogue+0x9/0x40 [ 3.807353] __ubsan_handle_shift_out_of_bounds+0x10e/0x170 [ 3.812916] ? qi_flush_dev_iotlb+0x124/0x180 [ 3.817261] qi_flush_dev_iotlb+0x124/0x180 [ 3.821437] iommu_flush_dev_iotlb+0x94/0xf0 [ 3.825698] iommu_flush_iova+0x10b/0x1c0 [ 3.829699] ? fq_ring_free+0x1d0/0x1d0 [ 3.833527] iova_domain_flush+0x25/0x40 [ 3.837448] fq_flush_timeout+0x55/0x160 [ 3.841368] ? fq_ring_free+0x1d0/0x1d0 [ 3.845200] ? fq_ring_free+0x1d0/0x1d0 [ 3.849034] call_timer_fn+0xbe/0x310 [ 3.852696] ? fq_ring_free+0x1d0/0x1d0 [ 3.856530] run_timer_softirq+0x223/0x6e0 [ 3.860625] ? sched_clock+0x5/0x10 [ 3.864108] ? sched_clock+0x5/0x10 [ 3.867594] __do_softirq+0x1b5/0x6f5 [ 3.871250] irq_exit+0xd4/0x130 [ 3.874470] smp_apic_timer_interrupt+0xb8/0x2f0 [ 3.879075] apic_timer_interrupt+0xf/0x20 [ 3.883159] </IRQ> [ 3.885255] RIP: 0010:poll_idle+0x60/0xe7 [ 3.889252] RSP: 0018:ffffb1b201943e30 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [ 3.896802] RAX: 0000000080200000 RBX: 000000000000008e RCX: 000000000000001f [ 3.903918] RDX: 0000000000000000 RSI: 000000002819aa06 RDI: 0000000000000000 [ 3.911031] RBP: ffff9e93c6b33280 R08: 00000010f717d567 R09: 000000000010d205 [ 3.918146] R10: ffffb1b201943df8 R11: 0000000000000001 R12: 00000000e01b169d [ 3.925260] R13: 0000000000000000 R14: ffffffffb12aa400 R15: 0000000000000000 [ 3.932382] cpuidle_enter_state+0xb4/0x470 [ 3.936558] do_idle+0x222/0x310 [ 3.939779] cpu_startup_entry+0x78/0x90 [ 3.943693] start_secondary+0x205/0x2e0 [ 3.947607] secondary_startup_64+0xa5/0xb0 [ 3.951783] ================================================================================ Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | iommu/dma: Move PCI window region reservation back into dma specific path.Shameer Kolothum2018-05-031-29/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This pretty much reverts commit 273df9635385 ("iommu/dma: Make PCI window reservation generic") by moving the PCI window region reservation back into the dma specific path so that these regions doesn't get exposed via the IOMMU API interface. With this change, the vfio interface will report only iommu specific reserved regions to the user space. Cc: Joerg Roedel <joro@8bytes.org> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Fixes: 273df9635385 ('iommu/dma: Make PCI window reservation generic') Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | iommu/rockchip: Make clock handling optionalHeiko Stuebner2018-05-031-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iommu clocks are optional, so the driver should not fail if they are not present. Instead just set the number of clocks to 0, which the clk-blk APIs can handle just fine. Fixes: f2e3a5f557ad ("iommu/rockchip: Control clocks needed to access the IOMMU") Signed-off-by: Heiko Stuebner <heiko@sntech.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | iommu/amd: Hide unused iommu_table_lockArnd Bergmann2018-05-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The newly introduced lock is only used when CONFIG_IRQ_REMAP is enabled: drivers/iommu/amd_iommu.c:86:24: error: 'iommu_table_lock' defined but not used [-Werror=unused-variable] static DEFINE_SPINLOCK(iommu_table_lock); This moves the definition next to the user, within the #ifdef protected section of the file. Fixes: ea6166f4b83e ("iommu/amd: Split irq_lookup_table out of the amd_iommu_devtable_lock") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | iommu/vt-d: Fix usage of force parameter in intel_ir_reconfigure_irte()Jagannathan Raman2018-05-031-1/+1
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It was noticed that the IRTE configured for guest OS kernel was over-written while the guest was running. As a result, vt-d Posted Interrupts configured for the guest are not being delivered directly, and instead bounces off the host. Every interrupt delivery takes a VM Exit. It was noticed that the following stack is doing the over-write: [ 147.463177] modify_irte+0x171/0x1f0 [ 147.463405] intel_ir_set_affinity+0x5c/0x80 [ 147.463641] msi_domain_set_affinity+0x32/0x90 [ 147.463881] irq_do_set_affinity+0x37/0xd0 [ 147.464125] irq_set_affinity_locked+0x9d/0xb0 [ 147.464374] __irq_set_affinity+0x42/0x70 [ 147.464627] write_irq_affinity.isra.5+0xe1/0x110 [ 147.464895] proc_reg_write+0x38/0x70 [ 147.465150] __vfs_write+0x36/0x180 [ 147.465408] ? handle_mm_fault+0xdf/0x200 [ 147.465671] ? _cond_resched+0x15/0x30 [ 147.465936] vfs_write+0xad/0x1a0 [ 147.466204] SyS_write+0x52/0xc0 [ 147.466472] do_syscall_64+0x74/0x1a0 [ 147.466744] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 reversing the sense of force check in intel_ir_reconfigure_irte() restores proper posted interrupt functionality Signed-off-by: Jagannathan Raman <jag.raman@oracle.com> Fixes: d491bdff888e ('iommu/vt-d: Reevaluate vector configuration on activate()') Signed-off-by: Joerg Roedel <jroedel@suse.de>
* | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2018-05-061-1/+5
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "Unbreak the CPUID CPUID_8000_0008_EBX reload which got dropped when the evaluation of physical and virtual bits which uses the same CPUID leaf was moved out of get_cpu_cap()" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Restore CPUID_8000_0008_EBX reload
| * | | x86/cpu: Restore CPUID_8000_0008_EBX reloadThomas Gleixner2018-05-021-1/+5
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent commt which addresses the x86_phys_bits corruption with encrypted memory on CPUID reload after a microcode update lost the reload of CPUID_8000_0008_EBX as well. As a consequence IBRS and IBRS_FW are not longer detected Restore the behaviour by bringing the reload of CPUID_8000_0008_EBX back. This restore has a twist due to the convoluted way the cpuid analysis works: CPUID_8000_0008_EBX is used by AMD to enumerate IBRB, IBRS, STIBP. On Intel EBX is not used. But the speculation control code sets the AMD bits when running on Intel depending on the Intel specific speculation control bits. This was done to use the same bits for alternatives. The change which moved the 8000_0008 evaluation out of get_cpu_cap() broke this nasty scheme due to ordering. So that on Intel the store to CPUID_8000_0008_EBX clears the IBRB, IBRS, STIBP bits which had been set before by software. So the actual CPUID_8000_0008_EBX needs to go back to the place where it was and the phys/virt address space calculation cannot touch it. In hindsight this should have used completely synthetic bits for IBRB, IBRS, STIBP instead of reusing the AMD bits, but that's for 4.18. /me needs to find time to cleanup that steaming pile of ... Fixes: d94a155c59c9 ("x86/cpu: Prevent cpuinfo_x86::x86_phys_bits adjustment corruption") Reported-by: Jörg Otte <jrg.otte@gmail.com> Reported-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jörg Otte <jrg.otte@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: kirill.shutemov@linux.intel.com Cc: Borislav Petkov <bp@alien8.de Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1805021043510.1668@nanos.tec.linutronix.de
* | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2018-05-062-30/+55
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull clocksource fixes from Thomas Gleixner: "The recent addition of the early TSC clocksource breaks on machines which have an unstable TSC because in case that TSC is disabled, then the clocksource selection logic falls back to the early TSC which is obviously bogus. That also unearthed a few robustness issues in the clocksource derating code which are addressed as well" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Rework stale comment clocksource: Consistent de-rate when marking unstable x86/tsc: Fix mark_tsc_unstable() clocksource: Initialize cs->wd_list clocksource: Allow clocksource_mark_unstable() on unregistered clocksources x86/tsc: Always unregister clocksource_tsc_early
| * | | clocksource: Rework stale commentPeter Zijlstra2018-05-021-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AFAICS the hotplug code no longer uses this function. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: len.brown@intel.com Cc: rjw@rjwysocki.net Cc: diego.viola@gmail.com Cc: rui.zhang@intel.com Link: https://lkml.kernel.org/r/20180430100344.656525644@infradead.org