| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull core block IO bits from Jens Axboe:
"The most complicated part if this is the request allocation rework by
Tejun, which has been queued up for a long time and has been in
for-next ditto as well.
There are a few commits from yesterday and today, mostly trivial and
obvious fixes. So I'm pretty confident that it is sound. It's also
smaller than usual."
* 'for-3.6/core' of git://git.kernel.dk/linux-block:
block: remove dead func declaration
block: add partition resize function to blkpg ioctl
block: uninitialized ioc->nr_tasks triggers WARN_ON
block: do not artificially constrain max_sectors for stacking drivers
blkcg: implement per-blkg request allocation
block: prepare for multiple request_lists
block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
blkcg: inline bio_blkcg() and friends
block: allocate io_context upfront
block: refactor get_request[_wait]()
block: drop custom queue draining used by scsi_transport_{iscsi|fc}
mempool: add @gfp_mask to mempool_create_node()
blkcg: make root blkcg allocation use %GFP_KERNEL
blkcg: __blkg_lookup_create() doesn't need radix preload
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
__generic_unplug_device() function is removed with commit
7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to
remove the declaration at meantime. Here remove it.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.
This patch converts hd_struct->nr_sects into sequence counter because
One might extend a partition while IO is happening to it and update of
nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
can lead to issues like reading inconsistent size of a partition. Sequence
counter have been used so that readers don't have to take bdev mutex lock
as we call sector_in_part() very frequently.
Now all the access to hd_struct->nr_sects should happen using sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoritically race should exist there too but this patch does not
modify set_capacity()/get_capacity() due to sheer number of call sites
and I am afraid that change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t set_capacity()/get_capacity().
v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Phillip Susi <psusi@ubuntu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Hi,
I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
below warning as of 3.5-rc1 and later (3.4 is fine):
[ 10.886893] ------------[ cut here ]------------
[ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
[ 10.886905] Hardware name: Bochs
[ 10.886906] Modules linked in:
[ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
[ 10.886908] Call Trace:
[ 10.886911] [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
[ 10.886912] [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
[ 10.886913] [<ffffffff8107c088>] copy_process+0x1488/0x1560
[ 10.886914] [<ffffffff8107c244>] do_fork+0xb4/0x340
[ 10.886918] [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
[ 10.886919] [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
[ 10.886920] [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
[ 10.886923] [<ffffffff81051db3>] sys_clone+0x23/0x30
[ 10.886925] [<ffffffff8179bd73>] stub_clone+0x13/0x20
[ 10.886927] [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
[ 10.886928] ---[ end trace 32a14af7ee6a590b ]---
Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:
dump -o -f /tmp/foo /
You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.
I bisected it down to the following commit:
commit f6e8d01bee036460e03bd4f6a79d014f98ba712e
Author: Tejun Heo <tj@kernel.org>
Date: Mon Mar 5 13:15:26 2012 -0800
block: add io_context->active_ref
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.
This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.
Tejun, is the right thing here to add back the init, or should something else
be done?
The below patch removes the warning, but I haven't done any more extensive
testing on it.
Signed-off-by: Olof Johansson <olof@lixom.net>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
blk_set_stacking_limits is intended to allow stacking drivers to build
up the limits of the stacked device based on the underlying devices'
limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
doesn't allow the stacking driver to inherit a max_sectors larger than
1024 -- due to blk_stack_limits' use of min_not_zero.
It is now clear that this artificial limit is getting in the way so
change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
stacking drivers like dm-multipath to inherit 'max_sectors' from the
underlying paths).
Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit harier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description udpated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Request allocation is about to be made per-blkg meaning that there'll
be multiple request lists.
* Make queue full state per request_list. blk_*queue_full() functions
are renamed to blk_*rl_full() and takes @rl instead of @q.
* Rename blk_init_free_list() to blk_init_rl() and make it take @rl
instead of @q. Also add @gfp_mask parameter.
* Add blk_exit_rl() instead of destroying rl directly from
blk_release_queue().
* Add request_list->q and make request alloc/free functions -
blk_free_request(), [__]freed_request(), __get_request() - take @rl
instead of @q.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated
to use q->nr_rqs[] instead of q->rq.count[].
These counters separates queue-wide request statistics from the
request list and allow implementation of per-queue request allocation.
While at it, properly indent fields of struct request_list.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Make bio_blkcg() and friends inline. They all are very simple and
used only in few places.
This patch is to prepare for further updates to request allocation
path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Block layer very lazy allocation of ioc. It waits until the moment
ioc is absolutely necessary; unfortunately, that time could be inside
queue lock and __get_request() performs unlock - try alloc - retry
dancing.
Just allocate it up-front on entry to block layer. We're not saving
the rain forest by deferring it to the last possible moment and
complicating things unnecessarily.
This patch is to prepare for further updates to request allocation
path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, there are two request allocation functions - get_request()
and get_request_wait(). The former tries to allocate a request once
and the latter keeps retrying until it succeeds. The latter wraps the
former and keeps retrying until allocation succeeds.
The combination of two functions deliver fallible non-wait allocation,
fallible wait allocation and unfailing wait allocation. However,
given that forward progress is guaranteed, fallible wait allocation
isn't all that useful and in fact nobody uses it.
This patch simplifies the interface as follows.
* get_request() is renamed to __get_request() and is only used by the
wrapper function.
* get_request_wait() is renamed to get_request(). It now takes
@gfp_mask and retries iff it contains %__GFP_WAIT.
This patch doesn't introduce any functional change and is to prepare
for further updates to request allocation path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
iscsi_remove_host() uses bsg_remove_queue() which implements custom
queue draining. fc_bsg_remove() open-codes mostly identical logic.
The draining logic isn't correct in that blk_stop_queue() doesn't
prevent new requests from being queued - it just stops processing, so
nothing prevents new requests to be queued after the logic determines
that the queue is drained.
blk_cleanup_queue() now implements proper queue draining and these
custom draining logics aren't necessary. Drop them and use
bsg_unregister_queue() + blk_cleanup_queue() instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@emulex.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
blk_init_free_list(), is about to be updated to use other allocation
flags - add @gfp_mask argument to the function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
from __blkg_lookup_create() for root blkcg creation. This could make
policy fail unnecessarily.
Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
preload radix tree and preallocate blkg with %GFP_KERNEL before trying
to create the root blkg.
v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
instead of ERR_PTR() value. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There's no point in calling radix_tree_preload() if preloading doesn't
use more permissible GFP mask. Drop preloading from
__blkg_lookup_create().
While at it, drop sparse locking annotation which no longer applies.
v2: Vivek pointed out the odd preload usage. Instead of updating,
just drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull md updates from NeilBrown.
* 'for-next' of git://neil.brown.name/md:
DM RAID: Add support for MD RAID10
md/RAID1: Add missing case for attempting to repair known bad blocks.
md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
md/raid1: don't abort a resync on the first badblock.
md: remove duplicated test on ->openers when calling do_md_stop()
raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
md/raid1: prevent merging too large request
md/raid1: read balance chooses idlest disk for SSD
md/raid1: make sequential read detection per disk based
MD RAID10: Export md_raid10_congested
MD: Move macros from raid1*.h to raid1*.c
MD RAID1: rename mirror_info structure
MD RAID10: rename mirror_info structure
MD RAID10: Fix compiler warning.
raid5: add a per-stripe lock
raid5: remove unnecessary bitmap write optimization
raid5: lockless access raid5 overrided bi_phys_segments
raid5: reduce chance release_stripe() taking device_lock
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Support the MD RAID10 personality through dm-raid.c
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| |\ \
| | | |
| | | |
| | | | |
Pull in pre-requisites for adding raid10 support to dm-raid.
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When doing resync or repair, attempt to correct bad blocks, according
to WriteErrorSeen policy
Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
'sync' writes set both REQ_SYNC and REQ_NOIDLE.
O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.
We currently assume that a REQ_SYNC request will not be followed by
more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
request.
This is appropriate for sync requests, but not for O_DIRECT requests.
So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
rather than REQ_SYNC. This is consistent with the documented meaning
of REQ_NOIDLE:
__REQ_NOIDLE, /* don't anticipate more IO after this one */
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If a resync of a RAID1 array with 2 devices finds a known bad block
one device it will neither read from, or write to, that device for
this block offset.
So there will be one read_target (The other device) and zero write
targets.
This condition causes md/raid1 to abort the resync assuming that it
has finished - without known bad blocks this would be true.
When there are no write targets because of the presence of bad blocks
we should only skip over the area covered by the bad block.
RAID10 already gets this right, raid1 doesn't. Or didn't.
As this can cause a 'sync' to abort early and appear to have succeeded
it could lead to some data corruption, so it suitable for -stable.
Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
do_md_stop tests mddev->openers while holding ->open_mutex,
and fails if this count is too high.
So callers do not need to check mddev->openers and doing so isn't
very meaningful as they don't hold ->open_mutex so the number could
change.
So remove the unnecessary tests on mddev->openers.
These are not called often enough for there to be any gain in
an early test on ->open_mutex to avoid the need for a slightly more
costly mutex_lock call.
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Because bios will merge at block-layer,so bios-error may caused by other
bio which be merged into to the same request.
Using this flag,it will find exactly error-sector and not do redundant
operation like re-write and re-read.
V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block
layer.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
For SSD, if request size exceeds specific value (optimal io size), request size
isn't important for bandwidth. In such condition, if making request size bigger
will cause some disks idle, the total throughput will actually drop. A good
example is doing a readahead in a two-disk raid1 setup.
So when should we split big requests? We absolutly don't want to split big
request to very small requests. Even in SSD, big request transfer is more
efficient. This patch only considers request with size above optimal io size.
If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
two requests 32k and two disks. We can let each disk run one 32k request, or
split the requests to 4 16k requests and each disk runs two. It's hard to say
which case is better, depending on hardware.
So only consider case where there are idle disks. For readahead, split is
always better in this case. And in my test, below patch can improve > 30%
thoughput. Hmm, not 100%, because disk isn't 100% busy.
Such case can happen not just in readahead, for example, in directio. But I
suppose directio usually will have bigger IO depth and make all disks busy, so
I ignored it.
Note: if the raid uses any hard disk, we don't prevent merging. That will make
performace worse.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
SSD hasn't spindle, distance between requests means nothing. And the original
distance based algorithm sometimes can cause severe performance issue for SSD
raid.
Considering two thread groups, one accesses file A, the other access file B.
The first group will access one disk and the second will access the other disk,
because requests are near from one group and far between groups. In this case,
read balance might keep one disk very busy but the other relative idle. For
SSD, we should try best to distribute requests to as many disks as possible.
There isn't spindle move penality anyway.
With below patch, I can see more than 50% throughput improvement sometimes
depending on workloads.
The only exception is small requests can be merged to a big request which
typically can drive higher throughput for SSD too. Such small requests are
sequential reads. Unlike hard disk, sequential read which can't be merged (for
example direct IO, or read without readahead) can be ignored for SSD. Again
there is no spindle move penality. readahead dispatches small requests and such
requests can be merged.
Last patch can help detect sequential read well, at least if concurrent read
number isn't greater than raid disk number. In that case, distance based
algorithm doesn't work well too.
V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
random IO too. This makes the algorithm generic for raid with SSD.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Currently the sequential read detection is global wide. It's natural to make it
per disk based, which can improve the detection for concurrent multiple
sequential reads. And next patch will make SSD read balance not use distance
based algorithm, where this change help detect truly sequential read for SSD.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
md/raid10: Export is_congested test.
In similar fashion to commits
11d8a6e3719519fbc0e2c9d61b6fa931b84bf813
1ed7242e591af7e233234d483f12d33818b189d9
we export the RAID10 congestion checking function so that dm-raid.c can
make use of it and make use of the personality. The 'queue' and 'gendisk'
structures will not be available to the MD code when device-mapper sets
up the device, so we conditionalize access to these fields also.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
MD RAID1/RAID10: Move some macros from .h file to .c file
There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined
in both raid1.h and raid10.h. They are only used in there respective .c files.
However, if we wish to make RAID10 accessible to the device-mapper RAID
target (dm-raid.c), then we need to move these macros into the .c files where
they are used so that they do not conflict with each other.
The macros from the two files are identical and could be moved into md.h, but
I chose to leave the duplication and have them remain in the personality
files.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
The same structure name ('mirror_info') is used by raid10. Each of these
structures are defined in there respective header files. If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide. While only one of these structure
names needs to change, this patch adds consistency to the naming of the
structure.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'
The same structure name ('mirror_info') is used by raid1. Each of these
structures are defined in there respective header files. If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
MD RAID10: Fix compiler warning.
Initialize variable to prevent compiler warning.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Neil pointed out the bitmap write optimization in handle_stripe_clean_event()
is unnecessary, because the chance one stripe gets written twice in the mean
time is rare. We can always do a bitmap_startwrite when a write request is
added to a stripe and bitmap_endwrite after write request is done. Delete the
optimization. With it, we can delete some cases of device_lock.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold,
which is unnecessary, We can make it lockless actually.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
release_stripe() is a place conf->device_lock is heavily contended. We take the
lock even stripe count isn't 1, which isn't required.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In commit 3b6e2723f32d ("locks: prevent side-effects of
locks_release_private before file_lock is initialized") we removed the
last user of lm_release_private without removing the field itself.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pull irqdomain changes from Grant Likely:
"Round of refactoring and enhancements to irq_domain infrastructure.
This series starts the process of simplifying irqdomain. The ultimate
goal is to merge LEGACY, LINEAR and TREE mappings into a single
system, but had to back off from that after some last minute bugs.
Instead it mainly reorganizes the code and ensures that the reverse
map gets populated when the irq is mapped instead of the first time it
is looked up.
Merging of the irq_domain types is deferred to v3.7
In other news, this series adds helpers for creating static mappings
on a linear or tree mapping."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
irqdomain: Improve diagnostics when a domain mapping fails
irqdomain: eliminate slow-path revmap lookups
irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
irqdomain: Eliminate dedicated radix lookup functions
irqdomain: Support for static IRQ mapping and association.
irqdomain: Always update revmap when setting up a virq
irqdomain: Split disassociating code into separate function
irq_domain: correct a minor wrong comment for linear revmap
irq_domain: Standardise legacy/linear domain selection
irqdomain: Make ops->map hook optional
irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
irqdomain: Simple NUMA awareness.
devicetree: add helper inline for retrieving a node's full name
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When the map operation fails log the error code we get and add a WARN_ON()
so we get a backtrace (which should help work out which interrupt is the
source of the issue).
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
With the current state of irq_domain, the reverse map is always updated
when new IRQs get mapped. This means that the irq_find_mapping() function
can be simplified to execute the revmap lookup functions unconditionally
This patch adds lookup functions for the revmaps that don't yet have one
and removes the slow path lookup code path.
v8: Broke out unrelated changes into separate patches. Rebased on Paul's irq
association patches.
v7: Rebased to irqdomain/next for v3.4 and applied before the removal of 'hint'
v6: Remove the slow path entirely. The only place where the slow path
could get called is for a linear mapping if the hwirq number is larger
than the linear revmap size. There shouldn't be any interrupt
controllers that do that.
v5: rewrite to not use a ->revmap() callback. It is simpler, smaller,
safer and faster to open code each of the revmap lookups directly into
irq_find_mapping() via a switch statement.
v4: Fix build failure on incorrect variable reference.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Rob Herring <rob.herring@calxeda.com>
|
| |\ \ \ \ |
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
irq_create_direct_mapping can only be used with the NOMAP type. Make
the function test to ensure it is passed the correct type of
irq_domain.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
In preparation to remove the slow revmap path, eliminate the public
radix revmap lookup functions. This simplifies the code and makes the
slowpath removal patch a lot simpler.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
This adds a new strict mapping API for supporting creation of linux IRQs
at existing positions within the domain. The new routines are as follows:
For dynamic allocation and insertion to specified ranges:
- irq_create_identity_mapping()
- irq_create_strict_mappings()
These will allocate and associate a range of linux IRQs at the specified
location. This can be used by controllers that have their own static linux IRQ
definitions to map a hwirq range to, as well as for platforms that wish to
establish 1:1 identity mapping between linux and hwirq space.
For insertion to specified ranges by platforms that do their own irq_desc
management:
- irq_domain_associate()
- irq_domain_associate_many()
These in turn call back in to the domain's ->map() routine, for further
processing by the platform. Disassociation of IRQs get handled through
irq_dispose_mapping() as normal.
With these in place it should be possible to begin migration of legacy IRQ
domains to linear ones, without requiring special handling for static vs
dynamic IRQ definitions in DT vs non-DT paths. This also makes it possible
for domains with static mappings to adopt whichever tree model best fits
their needs, rather than simply restricting them to linear revmaps.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
[grant.likely: Reorganized irq_domain_associate{,_many} to have all logic in one place]
[grant.likely: Add error checking for unallocated irq_descs at associate time]
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
At irq_setup_virq() time all of the data needed to update the reverse
map is available, but the current code ignores it and relies upon the
slow path to insert revmap records. This patch adds revmap updating
to the setup path so the slow path will no longer be necessary.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
This patch moves the irq disassociation code out into a separate
function in preparation to extend irq_setup_virq to handle multiple
irqs and rename it for use by interrupt controller drivers. The new
function will be used by irq_setup_virq() in its error path.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
|
| |\ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | | |
Linux 3.5-rc6
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
The revmap type should be linear for irq_domain_add_linear function.
Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
A large proportion of interrupt controllers that support legacy mappings
do so because non-DT systems need to use fixed IRQ numbers when registering
devices via buses but can otherwise use a linear mapping. The interrupt
controller itself typically is not affected by the mapping used and best
practice is to use a linear mapping where possible so drivers frequently
select at runtime depending on if a legacy range has been allocated to
them.
Standardise this behaviour by providing irq_domain_register_simple() which
will allocate a linear mapping unless a positive first_irq is provided in
which case it will fall back to a legacy mapping. This helps make best
practice for irq_domain adoption clearer.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
There isn't a really compelling reason to force ->map to be populated,
so allow it to be left unset.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Where irq_domain_associate() is called in irq_create_mapping, there is
no need to test for IRQ_DOMAIN_MAP_LEGACY because it is already tested
for earlier in the routine.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
|