aboutsummaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* wrappers for ->i_mutex accessAl Viro2016-01-221-2/+27
| | | | | | | | | | | parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge tag 'ext4_for_linus' of ↵Linus Torvalds2016-01-222-5/+28
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Some locking and page fault bug fixes from Jan Kara, some ext4 encryption fixes from me, and Li Xi's Project Quota commits" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: fs: clean up the flags definition in uapi/linux/fs.h ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support ext4: add project quota support ext4: adds project ID support ext4 crypto: simplify interfaces to directory entry insert functions ext4 crypto: add missing locking for keyring_key access ext4: use pre-zeroed blocks for DAX page faults ext4: implement allocation of pre-zeroed blocks ext4: provide ext4_issue_zeroout() ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag ext4: document lock ordering ext4: fix races of writeback with punch hole and zero range ext4: fix races between buffered IO and collapse / insert range ext4: move unlocked dio protection from ext4_alloc_file_blocks() ext4: fix races between page faults and hole punching
| * fs: clean up the flags definition in uapi/linux/fs.hTheodore Ts'o2016-01-081-4/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an explanation for the flags used by FS_IOC_[GS]ETFLAGS and remind people that changes should be revised by linux-fsdevel and linux-api. Add flags that are used on-disk for ext4, and remove FS_DIRECTIO_FL since it was used only by gfs2 and support was removed in 2008 in commit c9f6a6bbc28 ("The ability to mark files for direct i/o access when opened normally is both unused and pointless, so this patch removes support for that feature.") Now we have _two_ remaining flags left. But since we want to discourage people from assigning new flags, that's OK. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: implement allocation of pre-zeroed blocksJan Kara2015-12-071-1/+2
| | | | | | | | | | | | | | | | | | DAX page fault path needs to get blocks that are pre-zeroed to avoid races when two concurrent page faults happen in the same block of a file. Implement support for this in ext4_map_blocks(). Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flagJan Kara2015-12-071-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When dioread_nolock mode is enabled, we grab i_data_sem in ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block() not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However holding i_data_sem over overwrite direct IO isn't needed these days. We have exclusion against truncate / hole punching because we increase i_dio_count under i_mutex in ext4_ext_direct_IO() so once ext4_file_write_iter() verifies blocks are allocated & written, they are guaranteed to stay so during the whole direct IO even after we drop i_mutex. So we can just remove this locking abuse and the no longer necessary EXT4_GET_BLOCKS_NO_LOCK flag. Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | Merge tag 'xfs-for-linus-4.5-2' of ↵Linus Torvalds2016-01-221-0/+33
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull more xfs updates from Dave Chinner: "This is the second update for XFS that I mentioned in the original pull request last week. It contains a revert for a suspend regression in 4.4 and a fix for a long standing log recovery issue that has been further exposed by all the log recovery changes made in the original 4.5 merge. There is one more thing in this pull request - one that I forgot to merge into the origin. That is, pulling the XFS_IOC_FS[GS]ETXATTR ioctl up to the VFS level so that other filesystems can also use it for modifying project quota IDs Summary: - promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that it can be shared with other filesystems. The ext4 project quota functionality is the first target for this. The commits in this series have not been updated with review or final SOB tags because the branch they were originally published in was needed by ext4. Those tags are: Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dave Chinner <david@fromrobit.com> - Revert a change that is causing suspend failures. - Fix a use-after-free that can occur on log mount failures. Been around forever, but now exposed by other changes to log recovery made in the first 4.5 merge" * tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: log mount failures don't wait for buffers to be released Revert "xfs: clear PF_NOFREEZE for xfsaild kthread" xfs: introduce per-inode DAX enablement xfs: use FS_XFLAG definitions directly fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion
| * | xfs: introduce per-inode DAX enablementDave Chinner2016-01-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than just being able to turn DAX on and off via a mount option, some applications may only want to enable DAX for certain performance critical files in a filesystem. This patch introduces a new inode flag to enable DAX in the v3 inode di_flags2 field. It adds support for setting and clearing flags in the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the S_DAX inode flag appropriately when it is seen. When this flag is set on a directory, it acts as an "inherit flag". That is, inodes created in the directory will automatically inherit the on-disk inode DAX flag, enabling administrators to set up directory heirarchies that automatically use DAX. Setting this flag on an empty root directory will make the entire filesystem use DAX by default. Signed-off-by: Dave Chinner <dchinner@redhat.com>
| * | fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotionDave Chinner2016-01-041-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API from fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls can be used by all filesystems, not just XFS. This enables (initially) ext4 to use the ioctl to set project IDs on inodes. Based-on-patch-from: Li Xi <lixi@ddn.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2016-01-222-0/+5
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Embarrassing braino fix + pipe page accounting + fixing an eyesore in find_filesystem() (checking that s1 is equal to prefix of s2 of given length can be done in many ways, but "compare strlen(s1) with length and then do strncmp()" is not a good one...)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [regression] fix braino in fs/dlm/user.c pipe: limit the per-user amount of pages allocated in pipes find_filesystem(): simplify comparison
| * | | pipe: limit the per-user amount of pages allocated in pipesWilly Tarreau2016-01-192-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On no-so-small systems, it is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. A typical process filling 4000 pipes with 1 MB of data will use 4 GB of memory. On small systems it may be tricky to set the pipe max size to prevent this from happening. This patch makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page, effectively limiting them to 4 kB each, as well as a hard limit above which no new pipes may be created for this user. This has the effect of protecting the system against memory abuse without hurting other users, and still allowing pipes to work correctly though with less data at once. The limit are controlled by two new sysctls : pipe-user-pages-soft, and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB), thus reaching a limit of 64MB before starting to create only smaller pipes. With 256 processes limited to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for a user. The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splicing). Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-01-222-14/+14
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "Six fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock reiserfs: fix dereference of ERR_PTR ratelimit: fix bug in time interval by resetting right begin time mm: fix kernel crash in khugepaged thread mm: fix mlock accouting thp: change pmd_trans_huge_lock() interface to return ptl
| * | | | mm: fix kernel crash in khugepaged threadyalin wang2016-01-211-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This crash is caused by NULL pointer deference, in page_to_pfn() marco, when page == NULL : Unable to handle kernel NULL pointer dereference at virtual address 00000000 Internal error: Oops: 94000006 [#1] SMP Modules linked in: CPU: 1 PID: 26 Comm: khugepaged Tainted: G W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3 PC is at khugepaged+0x378/0x1af8 LR is at khugepaged+0x418/0x1af8 Process khugepaged (pid: 26, stack limit = 0xffffffc079638020) Call trace: khugepaged+0x378/0x1af8 kthread+0xdc/0xf4 ret_from_fork+0xc/0x40 Code: 35001700 f0002c60 aa0703e3 f9009fa0 (f94000e0) ---[ end trace 637503d8e28ae69e ]--- Kernel panic - not syncing: Fatal exception CPU2: stopping CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3 Hardware name: linux,dummy-virt (DT) [akpm@linux-foundation.org: fix fat-fingered merge resolution] Signed-off-by: yalin wang <yalin.wang2010@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | thp: change pmd_trans_huge_lock() interface to return ptlKirill A. Shutemov2016-01-211-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After THP refcounting rework we have only two possible return values from pmd_trans_huge_lock(): success and failure. Return-by-pointer for ptl doesn't make much sense in this case. Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on failure. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge branch 'for-4.5/nvme' of git://git.kernel.dk/linux-blockLinus Torvalds2016-01-216-27/+38
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe updates from Jens Axboe: "Last branch for this series is the nvme changes. It's in a separate branch to avoid splitting too much between core and NVMe changes, since NVMe is still helping drive some blk-mq changes. That said, not a huge amount of core changes in here. The grunt of the work is the continued split of the code" * 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits) uapi: update install list after nvme.h rename NVMe: Export NVMe attributes to sysfs group NVMe: Shutdown controller only for power-off NVMe: IO queue deletion re-write NVMe: Remove queue freezing on resets NVMe: Use a retryable error code on reset NVMe: Fix admin queue ring wrap nvme: make SG_IO support optional nvme: fixes for NVME_IOCTL_IO_CMD on the char device nvme: synchronize access to ctrl->namespaces nvme: Move nvme_freeze/unfreeze_queues to nvme core PCI/AER: include header file NVMe: Export namespace attributes to sysfs NVMe: Add pci error handlers block: remove REQ_NO_TIMEOUT flag nvme: merge iod and cmd_info nvme: meta_sg doesn't have to be an array nvme: properly free resources for cancelled command nvme: simplify completion handling nvme: special case AEN requests ...
| * | | | | uapi: update install list after nvme.h renameMike Frysinger2016-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9d99a8dda154 ("nvme: move hardware structures out of the uapi version of nvme.h") renamed nvme.h to nvme_ioctl.h, but the uapi list still refers to nvme.h. People trying to install the headers hit a failure as the header no longer exists. Cc: stable@vger.kernel.org Signed-off-by: Mike Frysinger <vapier@gentoo.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | PCI/AER: include header fileSudip Mukherjee2015-12-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are having build failure with sparc allmodconfig with the error: drivers/nvme/host/pci.c:15:0: include/linux/aer.h: In function 'pci_enable_pcie_error_reporting': include/linux/aer.h:49:10: error: 'EINVAL' undeclared (first use in this function) The file aer.h is using the error values but they are defined in errno.h. Include errno.h so that we have the definitions of the error codes. Fixes: a0a3408ee614 ("NVMe: Add pci error handlers") Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | block: remove REQ_NO_TIMEOUT flagChristoph Hellwig2015-12-221-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was added for the 'magic' AEN requests in the NVMe driver that never return. We now handle them purely inside the driver and don't need this core hack any more. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | block: defer timeouts to a workqueueChristoph Hellwig2015-12-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Timer context is not very useful for drivers to perform any meaningful abort action from. So instead of calling the driver from this useless context defer it to a workqueue as soon as possible. Note that while a delayed_work item would seem the right thing here I didn't dare to use it due to the magic in blk_add_timer that pokes deep into timer internals. But maybe this encourages Tejun to add a sensible API for that to the workqueue API and we'll all be fine in the end :) Contains a major update from Keith Bush: "This patch removes synchronizing the timeout work so that the timer can start a freeze on its own queue. The timer enters the queue, so timer context can only start a freeze, but not wait for frozen." Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | blk-integrity: empty implementation when disabledKeith Busch2015-12-031-10/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch moves the blk_integrity_payload definition outside the CONFIG_BLK_DEV_INTERITY dependency and provides empty function implementations when the kernel configuration disables integrity extensions. This simplifies drivers that make use of these to map user data so they don't need to repeat the same configuration checks. Signed-off-by: Keith Busch <keith.busch@intel.com> Updated by Jens to pass an error pointer return from bio_integrity_alloc(), otherwise if CONFIG_BLK_DEV_INTEGRITY isn't set, we return a weird ENOMEM from __nvme_submit_user_cmd() if a meta buffer is set. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | nvme: use offset instead of a struct for registersChristoph Hellwig2015-12-011-14/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This makes life easier for future non-PCI drivers where access to the registers might be more complicated. Note that Linux drivers are pretty evenly split between the two versions, and in fact the NVMe driver already uses offsets for the doorbells. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> [Fixed CMBSZ offset] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | | Merge branch 'for-4.5/lightnvm' of git://git.kernel.dk/linux-blockLinus Torvalds2016-01-212-65/+204
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull lightnvm fixes and updates from Jens Axboe: "This should have been part of the drivers branch, but it arrived a bit late and wasn't based on the official core block driver branch. So they got a small scolding, but got a pass since it's still new. Hence it's in a separate branch. This is mostly pure fixes, contained to lightnvm/, and minor feature additions" * 'for-4.5/lightnvm' of git://git.kernel.dk/linux-block: (26 commits) lightnvm: ensure that nvm_dev_ops can be used without CONFIG_NVM lightnvm: introduce factory reset lightnvm: use system block for mm initialization lightnvm: introduce ioctl to initialize device lightnvm: core on-disk initialization lightnvm: introduce mlc lower page table mappings lightnvm: add mccap support lightnvm: manage open and closed blocks separately lightnvm: fix missing grown bad block type lightnvm: reference rrpc lun in rrpc block lightnvm: introduce nvm_submit_ppa lightnvm: move rq->error to nvm_rq->error lightnvm: support multiple ppas in nvm_erase_ppa lightnvm: move the pages per block check out of the loop lightnvm: sectors first in ppa list lightnvm: fix locking and mempool in rrpc_lun_gc lightnvm: put block back to gc list on its reclaim fail lightnvm: check bi_error in gc lightnvm: return the get_bb_tbl return value lightnvm: refactor end_io functions for sync ...
| * | | | | | lightnvm: ensure that nvm_dev_ops can be used without CONFIG_NVMJens Axboe2016-01-131-57/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | null_blk defines an empty version of this ops structure if CONFIG_NVM isn't set, but it doesn't know the type. Move those bits out of the protection of CONFIG_NVM in the main lightnvm include. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: introduce factory resetMatias Bjørling2016-01-122-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that a device can be managed using the system blocks, a method to reset the device is necessary as well. This patch introduces logic to reset the device easily to factory state and exposes it through an ioctl. The ioctl takes the following flags: NVM_FACTORY_ERASE_ONLY_USER By default all blocks, except host-reserved blocks are erased upon factory reset. Instead of this, only erase host-reserved blocks. NVM_FACTORY_RESET_HOST_BLKS Mark host-reserved blocks to be erased and set their type to free. NVM_FACTORY_RESET_GRWN_BBLKS Mark "grown bad blocks" to be erased and set their type to free. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: use system block for mm initializationMatias Bjørling2016-01-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use system block information to register the appropriate media manager. This enables the LightNVM subsystem to instantiate a media manager selected by the user, instead of relying on automatic detection by each media manager loaded in the kernel. A device must now be initialized before it can proceed to initialize its media manager. Upon initialization, the configured media manager is automatically initialized as well. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: introduce ioctl to initialize deviceMatias Bjørling2016-01-121-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on the previous patch, we now introduce an ioctl to initialize the device using nvm_init_sysblock and create the necessary system blocks. The user may specify the media manager that they wish to instantiate on top. Default from user-space will be "gennvm". Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: core on-disk initializationMatias Bjørling2016-01-122-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An Open-Channel SSD shall be initialized before use. To initialize, we define an on-disk format, that keeps a small set of metadata to bring up the media manager on top of the device. The initial step is introduced to allow a user to format the disks for a given media manager. During format, a system block is stored on one to three separate luns on the device. Each lun has the system block duplicated. During initialization, the system block can be retrieved and the appropriate media manager can initialized. The on-disk format currently covers (struct nvm_system_block): - Magic value "NVMS". - Monotonic increasing sequence number. - The physical block erase count. - Version of the system block format. - Media manager type. - Media manager superblock physical address. The interface provides three functions to manage the system block: int nvm_init_sysblock(struct nvm_dev *, struct nvm_sb_info *) int nvm_get_sysblock(struct nvm *dev, struct nvm_sb_info *) int nvm_update_sysblock(struct nvm *dev, struct nvm_sb_info *) Each implement a part of the logic to manage the system block. The initialization creates the first system blocks and mark them on the device. Get retrieves the latest system block by scanning all pages in the associated system blocks. The update sysblock writes new metadata and allocates new block if necessary. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: introduce mlc lower page table mappingsMatias Bjørling2016-01-121-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NAND MLC memories have both lower and upper pages. When programming, both of these must be written, before data can be read. However, these lower and upper pages might not placed at even and odd flash pages, but can be skipped. Therefore each flash memory has its lower pages defined, which can then be used when programming and to know when padding are necessary. This patch implements the lower page definition in the specification, and exposes it through a simple lookup table at dev->lptbl. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: add mccap supportMatias Bjørling2016-01-121-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some flash media has extended capabilities, such as programming SLC pages on MLC/TLC flash, erase/program suspend, scramble and encryption. MCCAP is introduced to detect support for these capabilities in the command set. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: manage open and closed blocks separatelyJavier González2016-01-121-3/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LightNVM targets need to know the state of the flash block when doing flash optimizations. An example is implementing a write buffer to respect the flash page size. Currently, block state is not accounted for; the media manager only differentiates among free, bad and in-use blocks. This patch adds the logic in the generic media manager to enable targets manage blocks into open and close separately, and it implements such management in rrpc. It also adds a set of flags to describe the state of the block (open, closed, free, bad). In order to avoid taking two locks (nvm_lun and rrpc_lun) consecutively, we introduce lockless get_/put_block primitives so that the open and close list locks and future common logic is handled within the nvm_lun lock. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: fix missing grown bad block typeMatias Bjørling2016-01-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The get/set bad block interface defines good block, factory bad block, grown bad block, device reserved block, and host reserved block. Unfortunately the grown bad block was missing, leaving the offsets wrong for device and host side reserved blocks. This patch adds the missing type and corrects the offsets. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: introduce nvm_submit_ppaMatias Bjørling2016-01-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Internal logic for both core and media managers, does not have a backing bio for issuing I/Os. Introduce nvm_submit_ppa to allow raw I/Os to be submitted to the underlying device driver. The function request the device, ppa, data buffer and its length and will submit the I/O synchronously to the device. The return value may therefore be used to detect any errors regarding the issued I/O. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: move rq->error to nvm_rq->errorMatias Bjørling2016-01-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of passing request error into the LightNVM modules, incorporate it into the nvm_rq. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: support multiple ppas in nvm_erase_ppaMatias Bjørling2016-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes a user want to erase multiple PPAs at the same time. Extend nvm_erase_ppa to take multiple ppas and number of ppas to be erased. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: refactor end_io functions for syncMatias Bjørling2016-01-121-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To implement sync I/O support within the LightNVM core, the end_io functions are refactored to take an end_io function pointer instead of testing for initialized media manager, followed by calling its end_io function. Sync I/O can then be implemented using a callback that signal I/O completion. This is similar to the logic found in blk_to_execute_io(). By implementing it this way, the underlying device I/Os submission logic is abstracted away from core, targets, and media managers. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: refactor rqd ppa list into set/freeMatias Bjørling2016-01-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A device may be driven in single, double or quad plane mode. In that case, the rqd must have either one, two, or four PPAs set for a single PPA sent to the device. Refactor this logic into their own functions to be shared by program/erase/read in the core. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | | lightnvm: move ppa erase logic to coreMatias Bjørling2016-01-121-0/+3
| | |_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A device may function in single, dual or quad plane mode. The gennvm media manager manages this with explicit helpers. They convert a single ppa to 1, 2 or 4 separate ppas in a ppa list. To aid implementation of recovery and system blocks, this functionality can be moved directly into the core. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | | Merge branch 'for-4.5/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2016-01-215-3/+236
|\ \ \ \ \ \ | |_|_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "This is the block driver pull request for 4.5, with the exception of NVMe, which is in a separate branch and will be posted after this one. This pull request contains: - A set of bcache stability fixes, which have been acked by Kent. These have been used and tested for more than a year by the community, so it's about time that they got in. - A set of drbd updates from the drbd team (Andreas, Lars, Philipp) and Markus Elfring, Oleg Drokin. - A set of fixes for xen blkback/front from the usual suspects, (Bob, Konrad) as well as community based fixes from Kiri, Julien, and Peng. - A 2038 time fix for sx8 from Shraddha, with a fix from me. - A small mtip32xx cleanup from Zhu Yanjun. - A null_blk division fix from Arnd" * 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits) null_blk: use sector_div instead of do_div mtip32xx: restrict variables visible in current code module xen/blkfront: Fix crash if backend doesn't follow the right states. xen/blkback: Fix two memory leaks. xen/blkback: make st_ statistics per ring xen/blkfront: Handle non-indirect grant with 64KB pages xen-blkfront: Introduce blkif_ring_get_request xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule() xen/blkback: Free resources if connect_ring failed. xen/blocks: Return -EXX instead of -1 xen/blkback: make pool of persistent grants and free pages per-queue xen/blkback: get the number of hardware queues/rings from blkfront xen/blkback: pseudo support for multi hardware queues/rings xen/blkback: separate ring information out of struct xen_blkif xen/blkfront: correct setting for xen_blkif_max_ring_order xen/blkfront: make persistent grants pool per-queue xen/blkfront: Remove duplicate setting of ->xbdev. xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors. xen/blkfront: negotiate number of queues/rings to be used with backend xen/blkfront: split per device io_lock ...
| * | | | | xen/blkif: document blkif multi-queue/ring extensionBob Liu2016-01-041-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Document the multi-queue/ring feature in terms of XenStore keys to be written by the backend and by the frontend. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | | | lru_cache: Converted lc_seq_printf_status to return voidRoland Kammerer2015-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the semantic of lc_seq_printf. Currently, it always returns 0 and the return value is unused, therefore, convert the return type to void. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | drbd: make drbd known to lsblk: use bd_link_disk_holderLars Ellenberg2015-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lsblk should be able to pick up stacking device driver relations involving DRBD conveniently. Even though upstream kernel since 2011 says "DON'T USE THIS UNLESS YOU'RE ALREADY USING IT." a new user has been added since (bcache), which sets the precedences for us to use it as well. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | drbd: drop remnants of connector -- we don't use it anymore in drbd 8.4Lars Ellenberg2015-11-251-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | drbd: Backport the "status" commandAndreas Gruenbacher2015-11-252-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The status command originates the drbd9 code base. While for now we keep the status information in /proc/drbd available, this commit allows the user base to gracefully migrate their monitoring infrastructure to the new status reporting interface. In drbd9 no status information is exposed through /proc/drbd. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | drbd: Backport the "events2" commandAndreas Gruenbacher2015-11-252-0/+130
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The events2 command originates from drbd-9 development. It features more information but requires a incompatible change in output format. Therefore the previous events command continues to exist, the new improved events2 command becomes available now. This prepares the user-base for a later switch to the complete drbd9 code base. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | drbd: Move enum write_ordering_e to drbd.hAndreas Gruenbacher2015-11-251-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also change the enum values to all-capital letters. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | | Merge branch 'uaccess' (batched user access infrastructure)Linus Torvalds2016-01-211-0/+7
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expose an interface to allow users to mark several accesses together as being user space accesses, allowing batching of the surrounding user space access markers (SMAP on x86, PAN on arm64, domain register switching on arm). This is currently only used for the user string lenth and copying functions, where the SMAP overhead on x86 drowned the actual user accesses (only noticeable on newer microarchitectures that support SMAP in the first place, of course). * user access batching branch: Use the new batched user accesses in generic user string handling Add 'unsafe' user access functions for batched accesses x86: reorganize SMAP handling in user space accesses
| * | | | | | Add 'unsafe' user access functions for batched accessesLinus Torvalds2015-12-171-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The naming is meant to discourage random use: the helper functions are not really any more "unsafe" than the traditional double-underscore functions (which need the address range checking), but they do need even more infrastructure around them, and should not be used willy-nilly. In addition to checking the access range, these user access functions require that you wrap the user access with a "user_acess_{begin,end}()" around it. That allows architectures that implement kernel user access control (x86: SMAP, arm64: PAN) to do the user access control in the wrapping user_access_begin/end part, and then batch up the actual user space accesses using the new interfaces. The main (and hopefully only) use for these are for core generic access helpers, initially just the generic user string functions (strnlen_user() and strncpy_from_user()). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-01-2122-659/+595
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge third patch-bomb from Andrew Morton: "I'm pretty much done for -rc1 now: - the rest of MM, basically - lib/ updates - checkpatch, epoll, hfs, fatfs, ptrace, coredump, exit - cpu_mask simplifications - kexec, rapidio, MAINTAINERS etc, etc. - more dma-mapping cleanups/simplifications from hch" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits) MAINTAINERS: add/fix git URLs for various subsystems mm: memcontrol: add "sock" to cgroup2 memory.stat mm: memcontrol: basic memory statistics in cgroup2 memory controller mm: memcontrol: do not uncharge old page in page cache replacement Documentation: cgroup: add memory.swap.{current,max} description mm: free swap cache aggressively if memcg swap is full mm: vmscan: do not scan anon pages if memcg swap limit is hit swap.h: move memcg related stuff to the end of the file mm: memcontrol: replace mem_cgroup_lruvec_online with mem_cgroup_online mm: vmscan: pass memcg to get_scan_count() mm: memcontrol: charge swap to cgroup2 mm: memcontrol: clean up alloc, online, offline, free functions mm: memcontrol: flatten struct cg_proto mm: memcontrol: rein in the CONFIG space madness net: drop tcp_memcontrol.c mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM mm: memcontrol: allow to disable kmem accounting for cgroup2 mm: memcontrol: account "kmem" consumers in cgroup2 memory controller mm: memcontrol: move kmem accounting code to CONFIG_MEMCG mm: memcontrol: separate kmem code from legacy tcp accounting code ...
| * | | | | | | mm: memcontrol: add "sock" to cgroup2 memory.statJohannes Weiner2016-01-201-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide statistics on how much of a cgroup's memory footprint is made up of socket buffers from network connections owned by the group. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | mm: free swap cache aggressively if memcg swap is fullVladimir Davydov2016-01-201-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Swap cache pages are freed aggressively if swap is nearly full (>50% currently), because otherwise we are likely to stop scanning anonymous when we near the swap limit even if there is plenty of freeable swap cache pages. We should follow the same trend in case of memory cgroup, which has its own swap limit. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | mm: vmscan: do not scan anon pages if memcg swap limit is hitVladimir Davydov2016-01-201-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't scan anonymous memory if we ran out of swap, neither should we do it in case memcg swap limit is hit, because swap out is impossible anyway. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>