aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-blockLinus Torvalds2012-08-0121-301/+530
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block IO bits from Jens Axboe: "The most complicated part if this is the request allocation rework by Tejun, which has been queued up for a long time and has been in for-next ditto as well. There are a few commits from yesterday and today, mostly trivial and obvious fixes. So I'm pretty confident that it is sound. It's also smaller than usual." * 'for-3.6/core' of git://git.kernel.dk/linux-block: block: remove dead func declaration block: add partition resize function to blkpg ioctl block: uninitialized ioc->nr_tasks triggers WARN_ON block: do not artificially constrain max_sectors for stacking drivers blkcg: implement per-blkg request allocation block: prepare for multiple request_lists block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv blkcg: inline bio_blkcg() and friends block: allocate io_context upfront block: refactor get_request[_wait]() block: drop custom queue draining used by scsi_transport_{iscsi|fc} mempool: add @gfp_mask to mempool_create_node() blkcg: make root blkcg allocation use %GFP_KERNEL blkcg: __blkg_lookup_create() doesn't need radix preload
| * block: remove dead func declarationYuanhan Liu2012-08-011-1/+0
| | | | | | | | | | | | | | | | | | __generic_unplug_device() function is removed with commit 7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the declaration at meantime. Here remove it. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add partition resize function to blkpg ioctlVivek Goyal2012-08-015-9/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that allows altering the size of an existing partition, even if it is currently in use. This patch converts hd_struct->nr_sects into sequence counter because One might extend a partition while IO is happening to it and update of nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This can lead to issues like reading inconsistent size of a partition. Sequence counter have been used so that readers don't have to take bdev mutex lock as we call sector_in_part() very frequently. Now all the access to hd_struct->nr_sects should happen using sequence counter read/update helper functions part_nr_sects_read/part_nr_sects_write. There is one exception though, set_capacity()/get_capacity(). I think theoritically race should exist there too but this patch does not modify set_capacity()/get_capacity() due to sheer number of call sites and I am afraid that change might break something. I have left that as a TODO item. We can handle it later if need be. This patch does not introduce any new races as such w.r.t set_capacity()/get_capacity(). v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Phillip Susi <psusi@ubuntu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: uninitialized ioc->nr_tasks triggers WARN_ONOlof Johansson2012-08-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hi, I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the below warning as of 3.5-rc1 and later (3.4 is fine): [ 10.886893] ------------[ cut here ]------------ [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560() [ 10.886905] Hardware name: Bochs [ 10.886906] Modules linked in: [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27 [ 10.886908] Call Trace: [ 10.886911] [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0 [ 10.886912] [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20 [ 10.886913] [<ffffffff8107c088>] copy_process+0x1488/0x1560 [ 10.886914] [<ffffffff8107c244>] do_fork+0xb4/0x340 [ 10.886918] [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50 [ 10.886919] [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80 [ 10.886920] [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60 [ 10.886923] [<ffffffff81051db3>] sys_clone+0x23/0x30 [ 10.886925] [<ffffffff8179bd73>] stub_clone+0x13/0x20 [ 10.886927] [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b [ 10.886928] ---[ end trace 32a14af7ee6a590b ]--- Reproducing is easy, I can hit it on a KVM system with a very basic config (x86_64 make defconfig + enable the drivers needed). To hit it, just install dump (on debian/ubuntu, not sure what the package might be called on Fedora), and: dump -o -f /tmp/foo / You'll see the warning in dmesg once it forks off the I/O process and starts dumping filesystem contents. I bisected it down to the following commit: commit f6e8d01bee036460e03bd4f6a79d014f98ba712e Author: Tejun Heo <tj@kernel.org> Date: Mon Mar 5 13:15:26 2012 -0800 block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> It seems like the init of ioc->nr_tasks was removed in that patch, so it starts out at 0 instead of 1. Tejun, is the right thing here to add back the init, or should something else be done? The below patch removes the warning, but I haven't done any more extensive testing on it. Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: do not artificially constrain max_sectors for stacking driversMike Snitzer2012-08-011-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_set_stacking_limits is intended to allow stacking drivers to build up the limits of the stacked device based on the underlying devices' limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024) doesn't allow the stacking driver to inherit a max_sectors larger than 1024 -- due to blk_stack_limits' use of min_not_zero. It is now clear that this artificial limit is getting in the way so change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows stacking drivers like dm-multipath to inherit 'max_sectors' from the underlying paths). Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com> Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blkcg: implement per-blkg request allocationTejun Heo2012-06-266-30/+216
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: prepare for multiple request_listsTejun Heo2012-06-254-46/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Request allocation is about to be made per-blkg meaning that there'll be multiple request lists. * Make queue full state per request_list. blk_*queue_full() functions are renamed to blk_*rl_full() and takes @rl instead of @q. * Rename blk_init_free_list() to blk_init_rl() and make it take @rl instead of @q. Also add @gfp_mask parameter. * Add blk_exit_rl() instead of destroying rl directly from blk_release_queue(). * Add request_list->q and make request alloc/free functions - blk_free_request(), [__]freed_request(), __get_request() - take @rl instead of @q. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvprivTejun Heo2012-06-252-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated to use q->nr_rqs[] instead of q->rq.count[]. These counters separates queue-wide request statistics from the request list and allow implementation of per-queue request allocation. While at it, properly indent fields of struct request_list. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blkcg: inline bio_blkcg() and friendsTejun Heo2012-06-252-25/+22
| | | | | | | | | | | | | | | | | | | | | | | | Make bio_blkcg() and friends inline. They all are very simple and used only in few places. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: allocate io_context upfrontTejun Heo2012-06-252-30/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block layer very lazy allocation of ioc. It waits until the moment ioc is absolutely necessary; unfortunately, that time could be inside queue lock and __get_request() performs unlock - try alloc - retry dancing. Just allocate it up-front on entry to block layer. We're not saving the rain forest by deferring it to the last possible moment and complicating things unnecessarily. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: refactor get_request[_wait]()Tejun Heo2012-06-251-39/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there are two request allocation functions - get_request() and get_request_wait(). The former tries to allocate a request once and the latter keeps retrying until it succeeds. The latter wraps the former and keeps retrying until allocation succeeds. The combination of two functions deliver fallible non-wait allocation, fallible wait allocation and unfailing wait allocation. However, given that forward progress is guaranteed, fallible wait allocation isn't all that useful and in fact nobody uses it. This patch simplifies the interface as follows. * get_request() is renamed to __get_request() and is only used by the wrapper function. * get_request_wait() is renamed to get_request(). It now takes @gfp_mask and retries iff it contains %__GFP_WAIT. This patch doesn't introduce any functional change and is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: drop custom queue draining used by scsi_transport_{iscsi|fc}Tejun Heo2012-06-254-93/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iscsi_remove_host() uses bsg_remove_queue() which implements custom queue draining. fc_bsg_remove() open-codes mostly identical logic. The draining logic isn't correct in that blk_stop_queue() doesn't prevent new requests from being queued - it just stops processing, so nothing prevents new requests to be queued after the logic determines that the queue is drained. blk_cleanup_queue() now implements proper queue draining and these custom draining logics aren't necessary. Drop them and use bsg_unregister_queue() + blk_cleanup_queue() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: James Smart <james.smart@emulex.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * mempool: add @gfp_mask to mempool_create_node()Tejun Heo2012-06-253-8/+11
| | | | | | | | | | | | | | | | | | | | | | mempool_create_node() currently assumes %GFP_KERNEL. Its only user, blk_init_free_list(), is about to be updated to use other allocation flags - add @gfp_mask argument to the function. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blkcg: make root blkcg allocation use %GFP_KERNELTejun Heo2012-06-251-16/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation from __blkg_lookup_create() for root blkcg creation. This could make policy fail unnecessarily. Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an optional @new_blkg for preallocated blkg, and blkcg_activate_policy() preload radix tree and preallocate blkg with %GFP_KERNEL before trying to create the root blkg. v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure instead of ERR_PTR() value. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blkcg: __blkg_lookup_create() doesn't need radix preloadTejun Heo2012-06-251-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no point in calling radix_tree_preload() if preloading doesn't use more permissible GFP mask. Drop preloading from __blkg_lookup_create(). While at it, drop sparse locking annotation which no longer applies. v2: Vivek pointed out the odd preload usage. Instead of updating, just drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge branch 'for-next' of git://neil.brown.name/mdLinus Torvalds2012-08-019-219/+426
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md updates from NeilBrown. * 'for-next' of git://neil.brown.name/md: DM RAID: Add support for MD RAID10 md/RAID1: Add missing case for attempting to repair known bad blocks. md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. md/raid1: don't abort a resync on the first badblock. md: remove duplicated test on ->openers when calling do_md_stop() raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer md/raid1: prevent merging too large request md/raid1: read balance chooses idlest disk for SSD md/raid1: make sequential read detection per disk based MD RAID10: Export md_raid10_congested MD: Move macros from raid1*.h to raid1*.c MD RAID1: rename mirror_info structure MD RAID10: rename mirror_info structure MD RAID10: Fix compiler warning. raid5: add a per-stripe lock raid5: remove unnecessary bitmap write optimization raid5: lockless access raid5 overrided bi_phys_segments raid5: reduce chance release_stripe() taking device_lock
| * | DM RAID: Add support for MD RAID10Jonathan Brassow2012-08-012-5/+116
| | | | | | | | | | | | | | | | | | | | | Support the MD RAID10 personality through dm-raid.c Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | Merge commit 'c039c332f23e794deb6d6f37b9f07ff3b27fb2cf' into mdNeilBrown2012-08-014433-104467/+180036
| |\ \ | | | | | | | | | | | | Pull in pre-requisites for adding raid10 support to dm-raid.
| * | | md/RAID1: Add missing case for attempting to repair known bad blocks.Alexander Lyakas2012-07-311-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing resync or repair, attempt to correct bad blocks, according to WriteErrorSeen policy Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.majianpeng2012-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'sync' writes set both REQ_SYNC and REQ_NOIDLE. O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE. We currently assume that a REQ_SYNC request will not be followed by more requests and so set STRIPE_PREREAD_ACTIVE to expedite the request. This is appropriate for sync requests, but not for O_DIRECT requests. So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE rather than REQ_SYNC. This is consistent with the documented meaning of REQ_NOIDLE: __REQ_NOIDLE, /* don't anticipate more IO after this one */ Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | md/raid1: don't abort a resync on the first badblock.NeilBrown2012-07-311-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a resync of a RAID1 array with 2 devices finds a known bad block one device it will neither read from, or write to, that device for this block offset. So there will be one read_target (The other device) and zero write targets. This condition causes md/raid1 to abort the resync assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded it could lead to some data corruption, so it suitable for -stable. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | md: remove duplicated test on ->openers when calling do_md_stop()NeilBrown2012-07-311-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_md_stop tests mddev->openers while holding ->open_mutex, and fails if this count is too high. So callers do not need to check mddev->openers and doing so isn't very meaningful as they don't hold ->open_mutex so the number could change. So remove the unnecessary tests on mddev->openers. These are not called often enough for there to be any gain in an early test on ->open_mutex to avoid the need for a slightly more costly mutex_lock call. Signed-off-by: NeilBrown <neilb@suse.de>
| * | | raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layermajianpeng2012-07-312-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because bios will merge at block-layer,so bios-error may caused by other bio which be merged into to the same request. Using this flag,it will find exactly error-sector and not do redundant operation like re-write and re-read. V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block layer. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | md/raid1: prevent merging too large requestShaohua Li2012-07-312-7/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For SSD, if request size exceeds specific value (optimal io size), request size isn't important for bandwidth. In such condition, if making request size bigger will cause some disks idle, the total throughput will actually drop. A good example is doing a readahead in a two-disk raid1 setup. So when should we split big requests? We absolutly don't want to split big request to very small requests. Even in SSD, big request transfer is more efficient. This patch only considers request with size above optimal io size. If all disks are busy, is it worth doing a split? Say optimal io size is 16k, two requests 32k and two disks. We can let each disk run one 32k request, or split the requests to 4 16k requests and each disk runs two. It's hard to say which case is better, depending on hardware. So only consider case where there are idle disks. For readahead, split is always better in this case. And in my test, below patch can improve > 30% thoughput. Hmm, not 100%, because disk isn't 100% busy. Such case can happen not just in readahead, for example, in directio. But I suppose directio usually will have bigger IO depth and make all disks busy, so I ignored it. Note: if the raid uses any hard disk, we don't prevent merging. That will make performace worse. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | md/raid1: read balance chooses idlest disk for SSDShaohua Li2012-07-311-3/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SSD hasn't spindle, distance between requests means nothing. And the original distance based algorithm sometimes can cause severe performance issue for SSD raid. Considering two thread groups, one accesses file A, the other access file B. The first group will access one disk and the second will access the other disk, because requests are near from one group and far between groups. In this case, read balance might keep one disk very busy but the other relative idle. For SSD, we should try best to distribute requests to as many disks as possible. There isn't spindle move penality anyway. With below patch, I can see more than 50% throughput improvement sometimes depending on workloads. The only exception is small requests can be merged to a big request which typically can drive higher throughput for SSD too. Such small requests are sequential reads. Unlike hard disk, sequential read which can't be merged (for example direct IO, or read without readahead) can be ignored for SSD. Again there is no spindle move penality. readahead dispatches small requests and such requests can be merged. Last patch can help detect sequential read well, at least if concurrent read number isn't greater than raid disk number. In that case, distance based algorithm doesn't work well too. V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for random IO too. This makes the algorithm generic for raid with SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | md/raid1: make sequential read detection per disk basedShaohua Li2012-07-312-34/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the sequential read detection is global wide. It's natural to make it per disk based, which can improve the detection for concurrent multiple sequential reads. And next patch will make SSD read balance not use distance based algorithm, where this change help detect truly sequential read for SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | MD RAID10: Export md_raid10_congestedJonathan Brassow2012-07-312-22/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md/raid10: Export is_congested test. In similar fashion to commits 11d8a6e3719519fbc0e2c9d61b6fa931b84bf813 1ed7242e591af7e233234d483f12d33818b189d9 we export the RAID10 congestion checking function so that dm-raid.c can make use of it and make use of the personality. The 'queue' and 'gendisk' structures will not be available to the MD code when device-mapper sets up the device, so we conditionalize access to these fields also. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | MD: Move macros from raid1*.h to raid1*.cJonathan Brassow2012-07-314-29/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | MD RAID1: rename mirror_info structureJonathan Brassow2012-07-312-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD RAID1: Rename the structure 'mirror_info' to 'raid1_info' The same structure name ('mirror_info') is used by raid10. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. While only one of these structure names needs to change, this patch adds consistency to the naming of the structure. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | MD RAID10: rename mirror_info structureJonathan Brassow2012-07-312-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD RAID10: Rename the structure 'mirror_info' to 'raid10_info' The same structure name ('mirror_info') is used by raid1. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | MD RAID10: Fix compiler warning.Jonathan Brassow2012-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD RAID10: Fix compiler warning. Initialize variable to prevent compiler warning. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | raid5: add a per-stripe lockShaohua Li2012-07-192-16/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a per-stripe lock to protect stripe specific data. The purpose is to reduce lock contention of conf->device_lock. stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio list of the stripe is always serialized by this lock, so adding bio to the lists (add_stripe_bio()) and removing bio from the lists (like ops_run_biofill()) not race. If bio in ->read, ->written ... list are not shared by multiple stripes, we don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will protect them. If the bio are shared, there are two protections: 1. bi_phys_segments acts as a reference count 2. traverse the list uses r5_next_bio, which makes traverse never access bio not belonging to the stripe Let's have an example: | stripe1 | stripe2 | stripe3 | ...bio1......|bio2|bio3|....bio4..... stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2 because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and stripe3 will end_bio bio4. before add_stripe_bio() addes a bio to a stripe, we already increament the bio bi_phys_segments, so don't worry other stripes release the bio. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | raid5: remove unnecessary bitmap write optimizationShaohua Li2012-07-191-12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neil pointed out the bitmap write optimization in handle_stripe_clean_event() is unnecessary, because the chance one stripe gets written twice in the mean time is rare. We can always do a bitmap_startwrite when a write request is added to a stripe and bitmap_endwrite after write request is done. Delete the optimization. With it, we can delete some cases of device_lock. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | raid5: lockless access raid5 overrided bi_phys_segmentsShaohua Li2012-07-191-30/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold, which is unnecessary, We can make it lockless actually. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * | | raid5: reduce chance release_stripe() taking device_lockShaohua Li2012-07-191-34/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | release_stripe() is a place conf->device_lock is heavily contended. We take the lock even stripe count isn't 1, which isn't required. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | | locks: remove unused lm_release_privateJ. Bruce Fields2012-08-013-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 3b6e2723f32d ("locks: prevent side-effects of locks_release_private before file_lock is initialized") we removed the last user of lm_release_private without removing the field itself. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds2012-07-317-169/+248
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull irqdomain changes from Grant Likely: "Round of refactoring and enhancements to irq_domain infrastructure. This series starts the process of simplifying irqdomain. The ultimate goal is to merge LEGACY, LINEAR and TREE mappings into a single system, but had to back off from that after some last minute bugs. Instead it mainly reorganizes the code and ensures that the reverse map gets populated when the irq is mapped instead of the first time it is looked up. Merging of the irq_domain types is deferred to v3.7 In other news, this series adds helpers for creating static mappings on a linear or tree mapping." * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: irqdomain: Improve diagnostics when a domain mapping fails irqdomain: eliminate slow-path revmap lookups irqdomain: Fix irq_create_direct_mapping() to test irq_domain type. irqdomain: Eliminate dedicated radix lookup functions irqdomain: Support for static IRQ mapping and association. irqdomain: Always update revmap when setting up a virq irqdomain: Split disassociating code into separate function irq_domain: correct a minor wrong comment for linear revmap irq_domain: Standardise legacy/linear domain selection irqdomain: Make ops->map hook optional irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY irqdomain: Simple NUMA awareness. devicetree: add helper inline for retrieving a node's full name
| * | | | irqdomain: Improve diagnostics when a domain mapping failsMark Brown2012-07-241-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the map operation fails log the error code we get and add a WARN_ON() so we get a backtrace (which should help work out which interrupt is the source of the issue). Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * | | | irqdomain: eliminate slow-path revmap lookupsGrant Likely2012-07-241-40/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the current state of irq_domain, the reverse map is always updated when new IRQs get mapped. This means that the irq_find_mapping() function can be simplified to execute the revmap lookup functions unconditionally This patch adds lookup functions for the revmaps that don't yet have one and removes the slow path lookup code path. v8: Broke out unrelated changes into separate patches. Rebased on Paul's irq association patches. v7: Rebased to irqdomain/next for v3.4 and applied before the removal of 'hint' v6: Remove the slow path entirely. The only place where the slow path could get called is for a linear mapping if the hwirq number is larger than the linear revmap size. There shouldn't be any interrupt controllers that do that. v5: rewrite to not use a ->revmap() callback. It is simpler, smaller, safer and faster to open code each of the revmap lookups directly into irq_find_mapping() via a switch statement. v4: Fix build failure on incorrect variable reference. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Milton Miller <miltonm@bga.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | | Merge remote-tracking branch 'origin' into irqdomain/nextGrant Likely2012-07-244548-104854/+181547
| |\ \ \ \
| * | | | | irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.Grant Likely2012-07-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | irq_create_direct_mapping can only be used with the NOMAP type. Make the function test to ensure it is passed the correct type of irq_domain. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | | | irqdomain: Eliminate dedicated radix lookup functionsGrant Likely2012-07-114-65/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to remove the slow revmap path, eliminate the public radix revmap lookup functions. This simplifies the code and makes the slowpath removal patch a lot simpler. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | | | irqdomain: Support for static IRQ mapping and association.Grant Likely2012-07-112-25/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a new strict mapping API for supporting creation of linux IRQs at existing positions within the domain. The new routines are as follows: For dynamic allocation and insertion to specified ranges: - irq_create_identity_mapping() - irq_create_strict_mappings() These will allocate and associate a range of linux IRQs at the specified location. This can be used by controllers that have their own static linux IRQ definitions to map a hwirq range to, as well as for platforms that wish to establish 1:1 identity mapping between linux and hwirq space. For insertion to specified ranges by platforms that do their own irq_desc management: - irq_domain_associate() - irq_domain_associate_many() These in turn call back in to the domain's ->map() routine, for further processing by the platform. Disassociation of IRQs get handled through irq_dispose_mapping() as normal. With these in place it should be possible to begin migration of legacy IRQ domains to linear ones, without requiring special handling for static vs dynamic IRQ definitions in DT vs non-DT paths. This also makes it possible for domains with static mappings to adopt whichever tree model best fits their needs, rather than simply restricting them to linear revmaps. Signed-off-by: Paul Mundt <lethal@linux-sh.org> [grant.likely: Reorganized irq_domain_associate{,_many} to have all logic in one place] [grant.likely: Add error checking for unallocated irq_descs at associate time] Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | | | irqdomain: Always update revmap when setting up a virqGrant Likely2012-07-112-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At irq_setup_virq() time all of the data needed to update the reverse map is available, but the current code ignores it and relies upon the slow path to insert revmap records. This patch adds revmap updating to the setup path so the slow path will no longer be necessary. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | | | irqdomain: Split disassociating code into separate functionGrant Likely2012-07-111-28/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch moves the irq disassociation code out into a separate function in preparation to extend irq_setup_virq to handle multiple irqs and rename it for use by interrupt controller drivers. The new function will be used by irq_setup_virq() in its error path. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | | | Merge tag 'v3.5-rc6' into irqdomain/nextGrant Likely2012-07-11995-5037/+9700
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | Linux 3.5-rc6
| * | | | | | irq_domain: correct a minor wrong comment for linear revmapDong Aisheng2012-07-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The revmap type should be linear for irq_domain_add_linear function. Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * | | | | | irq_domain: Standardise legacy/linear domain selectionMark Brown2012-07-113-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A large proportion of interrupt controllers that support legacy mappings do so because non-DT systems need to use fixed IRQ numbers when registering devices via buses but can otherwise use a linear mapping. The interrupt controller itself typically is not affected by the mapping used and best practice is to use a linear mapping where possible so drivers frequently select at runtime depending on if a legacy range has been allocated to them. Standardise this behaviour by providing irq_domain_register_simple() which will allocate a linear mapping unless a positive first_irq is provided in which case it will fall back to a legacy mapping. This helps make best practice for irq_domain adoption clearer. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * | | | | | irqdomain: Make ops->map hook optionalGrant Likely2012-06-171-10/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There isn't a really compelling reason to force ->map to be populated, so allow it to be left unset. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | | | | irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACYGrant Likely2012-06-151-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Where irq_domain_associate() is called in irq_create_mapping, there is no need to test for IRQ_DOMAIN_MAP_LEGACY because it is already tested for earlier in the routine. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>