path: root/fs/btrfs
Commit message (Author, Age, Files, Lines)
...
| * Btrfs: fix use after free when close_ctree frees the orphan_rsv (Chris Mason, 2015-04-10, 3 files, -1/+7)
| | Near the end of close_ctree, we're calling btrfs_free_block_rsv to free up the orphan rsv. The problem is this call updates the space_info, which has already been freed.
| |
| | This adds a new __ function that directly calls kfree instead of trying to update the space infos.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: allow block group cache writeout outside critical section in commit (Chris Mason, 2015-04-10, 9 files, -37/+341)
| | We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is correct, we do this while no other writers are allowed in the commit.
| |
| | If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new processes trying to change the filesystem.
| |
| | This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: don't use highmem for free space cache pages (Chris Mason, 2015-04-10, 1 file, -7/+5)
| | In order to create the free space cache concurrently with FS modifications, we need to take a few block group locks.
| |
| | The cache code also does kmap, which would schedule with the locks held. Instead of going through kmap_atomic, let's just use lowmem for the cache pages.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: two stage dirty block group writeout (Chris Mason, 2015-04-10, 4 files, -32/+170)
| | Block group cache writeout is currently waiting on the pages for each block group cache before moving on to writing the next one. This commit switches things around to send down all the caches and then wait on them in batches.
| |
| | The end result is much faster, since we're keeping the disk pipeline full.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
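The difference between the two orderings can be sketched as follows. This is only an illustration of the scheduling idea, not the btrfs code: `NR_CACHES`, `struct step`, and both functions are hypothetical names, and the "I/O" is just a log of which step would happen when.

```c
#include <assert.h>

#define NR_CACHES 3

enum step_kind { STEP_SUBMIT, STEP_WAIT };

/* One entry in the log of what the writeout loop did, in order. */
struct step {
    enum step_kind kind;
    int cache;
};

/* Old ordering: wait for each cache before submitting the next one. */
static int writeout_one_by_one(struct step *log)
{
    int n = 0;

    for (int i = 0; i < NR_CACHES; i++) {
        log[n++] = (struct step){ STEP_SUBMIT, i };
        log[n++] = (struct step){ STEP_WAIT, i };
    }
    return n;
}

/* New ordering: submit every cache first, then wait on all of them. */
static int writeout_two_stage(struct step *log)
{
    int n = 0;

    for (int i = 0; i < NR_CACHES; i++)
        log[n++] = (struct step){ STEP_SUBMIT, i };
    for (int i = 0; i < NR_CACHES; i++)
        log[n++] = (struct step){ STEP_WAIT, i };
    return n;
}
```

In the two-stage variant every submission is issued before the first wait, which is what keeps the device queue busy.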
| * btrfs: move struct io_ctl into ctree.h and rename it (Chris Mason, 2015-04-10, 2 files, -33/+33)
| | We'll need to put the io_ctl into the block_group cache struct, so name it struct btrfs_io_ctl and move it into ctree.h.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: don't steal from the global reserve if we don't have the space (Josef Bacik, 2015-04-10, 1 file, -2/+44)
| | btrfs_evict_inode() needs to be more careful about stealing from the global_rsv. We don't want to end up aborting commit with ENOSPC just because the evict_inode code was too greedy.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: don't commit the transaction in the async space flushing (Josef Bacik, 2015-04-10, 1 file, -6/+8)
| | We're triggering a huge number of commits from btrfs_async_reclaim_metadata_space. These aren't really required, because everyone calling the async reclaim code is going to end up triggering a commit on their own.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: reserve space for block groups (Josef Bacik, 2015-04-10, 3 files, -3/+11)
| | This changes our delayed refs calculations to include the space needed to write back dirty block groups.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: refill block reserves during truncate (Chris Mason, 2015-04-10, 3 files, -11/+46)
| | When truncate starts, it allocates some space in the block reserves so that we'll have enough to update metadata along the way.
| |
| | For very large files, we can easily go through all of that space as we loop through the extents. This changes truncate to refill the space reservation as it progresses through the file.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: account for crcs in delayed ref processing (Josef Bacik, 2015-04-10, 5 files, -24/+83)
| | As we delete large extents, we end up doing huge amounts of COW in order to delete the corresponding crcs. This adds accounting so that we keep track of that space and flushing of delayed refs so that we don't build up too much delayed crc work.
| |
| | This helps limit the delayed work that must be done at commit time and tries to avoid ENOSPC aborts because the crcs eat all the global reserves.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: actively run the delayed refs while deleting large files (Chris Mason, 2015-04-10, 4 files, -5/+52)
| | When we are deleting large files with large extents, we are building up a huge set of delayed refs for processing. Truncate isn't checking often enough to see if we need to back off and process those, or let a commit proceed.
| |
| | The end result is long stalls after the rm, and very long commit times. During the commits, other processes back up waiting to start new transactions and we get into trouble.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| * fs: btrfs: Add missing include file (Guenter Roeck, 2015-04-01, 1 file, -0/+1)
| | Building alpha:allmodconfig fails with
| |
| |     fs/btrfs/inode.c: In function 'check_direct_IO':
| |     fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment'
| |
| | due to a missing include file.
| |
| | Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h")
| | Cc: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Guenter Roeck <linux@roeck-us.net>
| | Acked-by: David Sterba <dsterba@suse.cz>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent() (Chris Mason, 2015-04-01, 1 file, -1/+1)
| | The error handling path for alloc_reserved_tree_block is calling btrfs_free_and_pin_reserved_extent with a spinning tree lock held. This might sleep as we allocate extent_state objects:
| |
| |     BUG: sleeping function called from invalid context at mm/slub.c:1268
| |     in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7
| |     5 locks held by kworker/u4:7/11093:
| |      #0: ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
| |      #1: ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
| |      #2: (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs]
| |      #3: (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs]
| |      #4: (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs]
| |     CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G W 4.0.0-rc6-default+ #246
| |     Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
| |     Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
| |      00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848
| |      ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898
| |      0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb
| |     Call Trace:
| |      [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c
| |      [<ffffffff8109d516>] ___might_sleep+0xf6/0x140
| |      [<ffffffff8109d5b2>] __might_sleep+0x52/0x90
| |      [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
| |      [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0
| |      [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs]
| |      [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs]
| |      [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs]
| |      [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs]
| |      [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
| |      [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs]
| |      [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs]
| |      [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs]
| |      [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs]
| |      [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs]
| |      [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs]
| |      [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs]
| |      [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs]
| |      [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs]
| |      [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs]
| |      [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs]
| |      [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
| |      [<ffffffff81091dcb>] process_one_work+0x1cb/0x520
| |      [<ffffffff81091d51>] ? process_one_work+0x151/0x520
| |      [<ffffffff811c7abf>] ? seq_read+0x3f/0x400
| |      [<ffffffff8109260b>] worker_thread+0x5b/0x4e0
| |      [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0
| |      [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450
| |      [<ffffffff81098686>] kthread+0xf6/0x120
| |      [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
| |      [<ffffffff81ab8088>] ret_from_fork+0x58/0x90
| |      [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
| |     ------------[ cut here ]------------
| |
| | This changes things to free the path first, which will also unlock the extent buffer.
| |
| | Signed-off-by: Chris Mason <clm@fb.com>
| | Reported-by: Dave Sterba <dsterba@suse.cz>
| | Tested-by: Dave Sterba <dsterba@suse.cz>
| * Btrfs: Remove the check for old-style mkfs (Liu Bo, 2015-03-26, 1 file, -6/+0)
| | This check was used to accommodate a fresh btrfs created by an older mkfs.btrfs, but it also allows us to mount a buggy btrfs if that btrfs has the right superblock head part but has something wrong with the chunk tree part[1]. After that we can hit BUG_ON()s that were placed in the code to guard against supposedly impossible states.
| |
| | Since David has released "Btrfs progs v3.19-rc2", just remove the check. Anyone who wants to make a fresh btrfs should use the latest version.
| |
| | [1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html
| |
| | Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
| | Reviewed-by: Omar Sandoval <osandov@osandov.com>
| | Reviewed-by: David Sterba <dsterba@suse.cz>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: cleanup orphans while looking up default subvolume (Jeff Mahoney, 2015-03-26, 1 file, -0/+9)
| | Orphans in the fs tree are cleaned up via open_ctree and subvolume orphans are cleaned via btrfs_lookup_dentry -- except when a default subvolume is in use. The name for the default subvolume uses a manual lookup that doesn't trigger orphan cleanup and needs to trigger it manually as well.
| |
| | This doesn't apply to the remount case since the subvolumes are cleaned up by walking the root radix tree.
| |
| | Signed-off-by: Jeff Mahoney <jeffm@suse.com>
| | Reviewed-by: David Sterba <dsterba@suse.cz>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: explicitly set control file's private_data (Tom Van Braeckel, 2015-03-26, 1 file, -0/+12)
| | The private_data member of the Btrfs control device file (/dev/btrfs-control) is used to hold the current transaction and needs to be initialized to NULL to signify that no transaction is in progress.
| |
| | We explicitly set the control file's private_data to NULL to be independent of whatever value the misc subsystem initializes it to.
| |
| | Backstory:
| | ----------
| | The misc subsystem (which is used by /dev/btrfs-control) initializes a file's private_data to point to the misc device when a driver has registered a custom open file operation and initializes it to NULL when a custom open file operation has *not* been provided.
| |
| | This subtle quirk is confusing, to the point where kernel code registers *empty* file open operations to have private_data point to the misc device structure. And it leads to bugs, where the addition or removal of a custom open file operation surprisingly changes the initial contents of a file's private_data structure.
| |
| | To simplify things in the misc subsystem, a patch [1] has been proposed to *always* set private_data to point to the misc device instead of only doing this when a custom open file operation has been registered.
| |
| | But before we can fix this in the misc subsystem itself, we need to modify the (few) drivers that rely on this very subtle behavior.
| |
| | [1] https://lkml.org/lkml/2014/12/4/939
| |
| | Signed-off-by: Martin Kepplinger <martink@posteo.de>
| | Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: incorrect handling for fiemap_fill_next_extent return (Chengyu Song, 2015-03-26, 1 file, -1/+4)
| | fiemap_fill_next_extent returns 0 on success, -errno on error, and 1 if this was the last extent that will fit in the user array. If 1 is returned, the return value may eventually be returned to user space, which should not happen according to the ioctl man page.
| |
| | Signed-off-by: Chengyu Song <csong84@gatech.edu>
| | Reviewed-by: David Sterba <dsterba@suse.cz>
| | Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
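A minimal sketch of the return-value handling described above, under stated assumptions: `struct fake_fiemap`, `fake_fill_next_extent()` and `walk_extents()` are hypothetical stand-ins that only mimic the documented 0 / 1 / -errno contract, and the `-EINVAL` case is an arbitrary error chosen for the demonstration. This is not the btrfs patch itself.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the user-supplied extent array. */
struct fake_fiemap {
    int capacity;   /* how many extents the caller's array can hold */
    int filled;     /* how many extents have been stored so far */
};

/*
 * Mimics the contract from the commit message: 0 on success, 1 when the
 * extent just stored was the last one that fits, -errno on error.
 */
static int fake_fill_next_extent(struct fake_fiemap *f)
{
    if (f->filled >= f->capacity)
        return -EINVAL;     /* arbitrary error for the demo */
    f->filled++;
    return f->filled == f->capacity ? 1 : 0;
}

/* Walk nr_extents extents; never let the positive "array full" value escape. */
static int walk_extents(struct fake_fiemap *f, int nr_extents)
{
    int ret = 0;

    for (int i = 0; i < nr_extents; i++) {
        ret = fake_fill_next_extent(f);
        if (ret)
            break;          /* 1 means stop, negative means error */
    }
    if (ret > 0)
        ret = 0;            /* "array full" is a normal end, not an error */
    return ret;
}
```

The caller then sees either 0 or a negative errno, which matches what an ioctl is expected to return.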
| * btrfs: don't accept bare namespace as a valid xattr (David Sterba, 2015-03-26, 1 file, -14/+39)
| | Due to an insufficient check in btrfs_is_valid_xattr, this unexpectedly works:
| |
| |     $ touch file
| |     $ setfattr -n user. -v 1 file
| |     $ getfattr -d file
| |     user.="1"
| |
| | i.e. the attribute name after the namespace is missing.
| |
| | Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291
| | Reported-by: William Douglas <william.douglas@intel.com>
| | CC: <stable@vger.kernel.org> # 2.6.29+
| | Signed-off-by: David Sterba <dsterba@suse.cz>
| | Signed-off-by: Chris Mason <clm@fb.com>
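The kind of check that rejects a bare namespace can be sketched as below. This is an assumption-laden illustration, not the kernel's btrfs_is_valid_xattr(): the helper name `xattr_name_is_valid()` and the prefix list are hypothetical, and the real function may accept a different set of namespaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: a name is valid only if it starts with a known
 * namespace prefix AND has at least one character after that prefix.
 */
static bool xattr_name_is_valid(const char *name)
{
    /* Illustrative prefix list; the kernel's list may differ. */
    static const char *const prefixes[] = {
        "user.", "trusted.", "security.", "system.",
    };

    for (size_t i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
        size_t len = strlen(prefixes[i]);

        if (strncmp(name, prefixes[i], len) == 0)
            return name[len] != '\0';   /* reject the bare namespace */
    }
    return false;                       /* unknown namespace */
}
```

With this check, "user." is rejected while "user.attr1" is accepted.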
| * Btrfs: fix log tree corruption when fs mounted with -o discard (Filipe Manana, 2015-03-26, 1 file, -3/+2)
| | While committing a transaction we free the log roots before we write the new super block. Freeing the log roots implies marking the disk location of every node/leaf (metadata extent) as pinned before the new super block is written. This is to prevent the disk location of log metadata extents from being reused before the new super block is written, otherwise we would have a corrupted log tree if before the new super block is written a crash/reboot happens and the location of any log tree metadata extent ended up being reused and rewritten.
| |
| | Even though we pinned the log tree's metadata extents, we were issuing a discard against them if the fs was mounted with the -o discard option, resulting in corruption of the log tree if a crash/reboot happened before writing the new super block - the next time the fs was mounted, during the log replay process we would find nodes/leafs of the log btree with a content full of zeroes, causing the process to fail and require the use of the tool btrfs-zero-log to wipe out the log tree (and all data previously fsynced becoming lost forever).
| |
| | Fix this by not doing a discard when pinning an extent. The discard will be done later when it's safe (after the new super block is committed) at extent-tree.c:btrfs_finish_extent_commit().
| |
| | Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log)
| | CC: <stable@vger.kernel.org>
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix metadata inconsistencies after directory fsync (Filipe Manana, 2015-03-26, 5 files, -10/+253)
| | We can get into inconsistency between inodes and directory entries after fsyncing a directory. The issue is that while a directory gets the new dentries persisted in the fsync log and replayed at mount time, the link count of the inode that directory entries point to doesn't get updated, staying with an incorrect link count (smaller than the correct value). This later leads to stale file handle errors when accessing (including attempt to delete) some of the links if all the other ones are removed, which also implies impossibility to delete the parent directories, since the dentries can not be removed.
| |
| | Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2), when fsyncing a directory, new files aren't logged (their metadata and dentries) nor any child directories. So this patch fixes this issue too, since it has the same resolution as the incorrect inode link count issue mentioned before.
| |
| | This is very easy to reproduce, and the following excerpt from my test case for xfstests shows how:
| |
| |     _scratch_mkfs >> $seqres.full 2>&1
| |     _init_flakey
| |     _mount_flakey
| |
| |     # Create our main test file and directory.
| |     $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
| |     mkdir $SCRATCH_MNT/mydir
| |
| |     # Make sure all metadata and data are durably persisted.
| |     sync
| |
| |     # Add a hard link to 'foo' inside our test directory and fsync only the
| |     # directory. The btrfs fsync implementation had a bug that caused the new
| |     # directory entry to be visible after the fsync log replay but, the inode
| |     # of our file remained with a link count of 1.
| |     ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
| |
| |     # Add a few more links and new files.
| |     # This is just to verify nothing breaks or gives incorrect results after the
| |     # fsync log is replayed.
| |     ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
| |     $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
| |     ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
| |
| |     # Add some subdirectories and new files and links to them. This is to verify
| |     # that after fsyncing our top level directory 'mydir', all the subdirectories
| |     # and their files/links are registered in the fsync log and exist after the
| |     # fsync log is replayed.
| |     mkdir -p $SCRATCH_MNT/mydir/x/y/z
| |     ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
| |     ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
| |     touch $SCRATCH_MNT/mydir/x/y/z/qwerty
| |
| |     # Now fsync only our top directory.
| |     $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
| |
| |     # And fsync now our new file named 'hello', just to verify later that it has
| |     # the expected content and that the previous fsync on the directory 'mydir' had
| |     # no bad influence on this fsync.
| |     $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
| |
| |     # Simulate a crash/power loss.
| |     _load_flakey_table $FLAKEY_DROP_WRITES
| |     _unmount_flakey
| |
| |     _load_flakey_table $FLAKEY_ALLOW_WRITES
| |     _mount_flakey
| |
| |     # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
| |     # all with the value 0xaa.
| |     echo "File 'foo' content after log replay:"
| |     od -t x1 $SCRATCH_MNT/foo
| |
| |     # Remove the first name of our inode. Because of the directory fsync bug, the
| |     # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
| |     # deleting the inode and the other names became stale directory entries (still
| |     # visible to applications). Attempting to remove or access the remaining
| |     # dentries pointing to that inode resulted in stale file handle errors and
| |     # made it impossible to remove the parent directories since it was impossible
| |     # for them to become empty.
| |     echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
| |     rm -f $SCRATCH_MNT/foo
| |
| |     # Now verify that all files, links and directories created before fsyncing our
| |     # directory exist after the fsync log was replayed.
| |     [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
| |     [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
| |     [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
| |     [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
| |     [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
| |         echo "Link mydir/x/y/foo_y_link is missing"
| |     [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
| |         echo "Link mydir/x/y/z/foo_z_link is missing"
| |     [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
| |         echo "File mydir/x/y/z/qwerty is missing"
| |
| |     # We expect our file here to have a size of 64Kb and all the bytes having the
| |     # value 0xff.
| |     echo "file 'hello' content after log replay:"
| |     od -t x1 $SCRATCH_MNT/hello
| |
| |     # Now remove all files/links, under our test directory 'mydir', and verify we
| |     # can remove all the directories.
| |     rm -f $SCRATCH_MNT/mydir/x/y/z/*
| |     rmdir $SCRATCH_MNT/mydir/x/y/z
| |     rm -f $SCRATCH_MNT/mydir/x/y/*
| |     rmdir $SCRATCH_MNT/mydir/x/y
| |     rmdir $SCRATCH_MNT/mydir/x
| |     rm -f $SCRATCH_MNT/mydir/*
| |     rmdir $SCRATCH_MNT/mydir
| |
| |     # An fsck, run by the fstests framework every time a test finishes, also detected
| |     # the inconsistency and printed the following error message:
| |     #
| |     # root 5 inode 257 errors 2001, no inode item, link count wrong
| |     #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
| |     #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
| |
| |     status=0
| |     exit
| |
| | The expected golden output for the test is:
| |
| |     wrote 8192/8192 bytes at offset 0
| |     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
| |     wrote 65536/65536 bytes at offset 0
| |     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
| |     File 'foo' content after log replay:
| |     0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
| |     *
| |     0020000
| |     file 'foo' link count after log replay: 5
| |     file 'hello' content after log replay:
| |     0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
| |     *
| |     0200000
| |
| | Which is the output after this patch and when running the test against ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's output is:
| |
| |     wrote 8192/8192 bytes at offset 0
| |     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
| |     wrote 65536/65536 bytes at offset 0
| |     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
| |     File 'foo' content after log replay:
| |     0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
| |     *
| |     0020000
| |     file 'foo' link count after log replay: 1
| |     Link mydir/foo_2 is missing
| |     Link mydir/foo_3 is missing
| |     Link mydir/x/y/foo_y_link is missing
| |     Link mydir/x/y/z/foo_z_link is missing
| |     File mydir/x/y/z/qwerty is missing
| |     file 'hello' content after log replay:
| |     0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
| |     *
| |     0200000
| |     rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
| |     rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
| |     rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
| |     rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
| |     rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
| |     rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
| |
| | Fsck, without this fix, also complains about the wrong link count:
| |
| |     root 5 inode 257 errors 2001, no inode item, link count wrong
| |         unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
| |         unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
| |
| | So fix this by logging the inodes that the dentries point to when fsyncing a directory.
| |
| | A test case for xfstests follows.
| |
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: change the insertion criteria for the qgroup operations rbtree (Filipe Manana, 2015-03-26, 1 file, -4/+4)
| | After looking at Liu Bo's recent patch (titled "Btrfs: fix comp_oper to get right order") I realized the search made by qgroup_oper_exists() was buggy because its rbtree navigation comparison function, comp_oper_exist(), only looks at the fields bytenr and ref_root of a tree node, ignoring the seq field completely.
| |
| | This was wrong because when we insert a node into the rbtree we use comp_oper(), which takes a decision based first on bytenr, then on seq and then on the ref_root field. That means qgroup_oper_exists() could miss the fact that at least one operation with given bytenr and ref_root exists.
| |
| | Consider the following simple example of a 3-node qgroup operations rbtree (created using comp_oper before this patch), where each node's key is a tuple with the shape (bytenr, seq, ref_root, op):
| |
| |                  [ (4096, 2, 20, op X) ]
| |                   /                   \
| |                  /                     \
| |     [ (4096, 1, 5, op Y) ]     [ (4096, 3, 10, op Z) ]
| |
| | qgroup_oper_exists() when called to search for an existing operation for bytenr 4096 and ref root 10 wouldn't find anything because it would go to the left subtree instead of the right subtree, since comp_oper_exist() ignores the seq field completely.
| |
| | Fix this by changing the insertion navigation function to use the ref_root field right after using the bytenr field and before using the seq field, so that qgroup_oper_exists() / comp_oper_exist() work as expected.
| |
| | This patch applies on top of the patch mentioned above from Liu.
| |
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
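The mismatch between the insertion order and the search order can be reproduced with a small unbalanced binary search tree. This is a sketch under stated assumptions: `struct oper_node`, `insert_cmp_old()`, `insert_cmp_new()`, `exist_cmp()` and `demo_lookup()` are hypothetical names, and a plain BST stands in for the kernel rbtree (rebalancing is omitted because it does not change the ordering argument).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node; the real code keeps these in an rbtree. */
struct oper_node {
    uint64_t bytenr, seq, ref_root;
    struct oper_node *left, *right;
};

typedef int (*oper_cmp)(const struct oper_node *, const struct oper_node *);

static int cmp_u64(uint64_t a, uint64_t b) { return (a > b) - (a < b); }

/* Insertion order before the patch: bytenr, then seq, then ref_root. */
static int insert_cmp_old(const struct oper_node *a, const struct oper_node *b)
{
    int c = cmp_u64(a->bytenr, b->bytenr);
    if (!c)
        c = cmp_u64(a->seq, b->seq);
    if (!c)
        c = cmp_u64(a->ref_root, b->ref_root);
    return c;
}

/* Insertion order after the patch: bytenr, then ref_root, then seq. */
static int insert_cmp_new(const struct oper_node *a, const struct oper_node *b)
{
    int c = cmp_u64(a->bytenr, b->bytenr);
    if (!c)
        c = cmp_u64(a->ref_root, b->ref_root);
    if (!c)
        c = cmp_u64(a->seq, b->seq);
    return c;
}

/* Search order: bytenr and ref_root only, seq is ignored. */
static int exist_cmp(const struct oper_node *a, const struct oper_node *b)
{
    int c = cmp_u64(a->bytenr, b->bytenr);
    if (!c)
        c = cmp_u64(a->ref_root, b->ref_root);
    return c;
}

static void insert(struct oper_node **root, struct oper_node *n, oper_cmp cmp)
{
    while (*root)
        root = cmp(n, *root) < 0 ? &(*root)->left : &(*root)->right;
    n->left = n->right = NULL;
    *root = n;
}

static int exists(const struct oper_node *root, uint64_t bytenr, uint64_t ref_root)
{
    struct oper_node key = { bytenr, 0, ref_root, NULL, NULL };

    while (root) {
        int c = exist_cmp(&key, root);
        if (c == 0)
            return 1;
        root = c < 0 ? root->left : root->right;
    }
    return 0;
}

/* Build the three-node example and look up (bytenr 4096, ref_root 10). */
static int demo_lookup(oper_cmp insert_order)
{
    struct oper_node x = { 4096, 2, 20, NULL, NULL };
    struct oper_node y = { 4096, 1, 5, NULL, NULL };
    struct oper_node z = { 4096, 3, 10, NULL, NULL };
    struct oper_node *root = NULL;

    insert(&root, &x, insert_order);
    insert(&root, &y, insert_order);
    insert(&root, &z, insert_order);
    return exists(root, 4096, 10);
}
```

With `insert_cmp_old` the lookup descends left from the root and misses node Z; with `insert_cmp_new` the search order is a prefix of the insertion order, so the lookup finds it.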
| * Btrfs: add missing inode item update in fallocate() (Filipe Manana, 2015-03-26, 1 file, -9/+20)
| | If we fallocate(), without the keep size flag, into an area already covered by an extent previously fallocated, we were updating the inode's i_size but we weren't updating the inode item in the fs/subvol tree. A following umount + mount would result in a loss of the inode's size (and an fsync would also miss the fact that the inode changed).
| |
| | Reproducer:
| |
| |     $ mkfs.btrfs -f /dev/sdd
| |     $ mount /dev/sdd /mnt
| |     $ fallocate -n -l 1M /mnt/foobar
| |     $ fallocate -l 512K /mnt/foobar
| |     $ umount /mnt
| |     $ mount /dev/sdd /mnt
| |     $ od -t x1 /mnt/foobar
| |     0000000
| |
| | The expected result is:
| |
| |     $ od -t x1 /mnt/foobar
| |     0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| |     *
| |     2000000
| |
| | A test case for fstests follows soon.
| |
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: incremental send, remove dead code (Filipe Manana, 2015-03-26, 1 file, -59/+0)
| | The logic to detect path loops when attempting to apply a pending directory rename, introduced in commit f959492fc15b (Btrfs: send, fix more issues related to directory renames) is no longer needed, and the respective fstests test case for that commit, btrfs/045, now passes without this code (as well as all the other test cases for send/receive).
| |
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: incremental send, clear name from cache after orphanization (Filipe Manana, 2015-03-26, 1 file, -0/+17)
| | If a directory's reference ends up being orphanized, because the inode currently being processed has a new path that matches that directory's path, make sure we evict the name of the directory from the name cache. This is because there might be descendant inodes (either directories or regular files) that will be orphanized later too, and therefore the orphan name of the ancestor must be used, otherwise we issue rename operations with a wrong path in the send stream.
| |
| | Reproducer:
| |
| |     $ mkfs.btrfs -f /dev/sdb
| |     $ mount /dev/sdb /mnt
| |
| |     $ mkdir -p /mnt/data/n1/n2/p1/p2
| |     $ mkdir /mnt/data/n4
| |     $ mkdir -p /mnt/data/p1/p2
| |
| |     $ btrfs subvolume snapshot -r /mnt /mnt/snap1
| |
| |     $ mv /mnt/data/p1/p2 /mnt/data
| |     $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/p1
| |     $ mv /mnt/data/p2 /mnt/data/n1/n2/p1
| |     $ mv /mnt/data/n1/n2 /mnt/data/p1
| |     $ mv /mnt/data/p1 /mnt/data/n4
| |     $ mv /mnt/data/n4/p1/n2/p1 /mnt/data
| |
| |     $ btrfs subvolume snapshot -r /mnt /mnt/snap2
| |
| |     $ btrfs send /mnt/snap1 -f /tmp/1.send
| |     $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send
| |
| |     $ mkfs.btrfs -f /dev/sdc
| |     $ mount /dev/sdc /mnt2
| |     $ btrfs receive /mnt2 -f /tmp/1.send
| |     $ btrfs receive /mnt2 -f /tmp/2.send
| |     ERROR: rename data/p1/p2 -> data/n4/p1/p2 failed. no such file or directory
| |
| | Directories data/p1 (inode 263) and data/p1/p2 (inode 264) in the parent snapshot are both orphanized during the incremental send, and as soon as data/p1 is orphanized, we must make sure that when orphanizing data/p1/p2 we use a source path of o263-6-o/p2 for the rename operation instead of the old path data/p1/p2 (the one before the orphanization of inode 263).
| |
| | A test case for xfstests follows soon.
| |
| | Reported-by: Robbie Ko <robbieko@synology.com>
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: send, don't leave without decrementing clone root's send_progress (Filipe Manana, 2015-03-26, 1 file, -2/+2)
| | If the clone root was not readonly or the dead flag was set on it, we were leaving without decrementing the root's send_in_progress counter (which we had just incremented). If a concurrent snapshot deletion was in progress and ended up being aborted, it would be impossible to later attempt to delete the snapshot again, since the root's send_in_progress counter could never go back to 0.
| |
| | We were also setting clone_sources_to_rollback to i + 1 too early - if we bailed out because the clone root we got is not readonly or flagged as dead we ended up later dereferencing a null pointer because we didn't assign the clone root to sctx->clone_roots[i].root:
| |
| |     for (i = 0; sctx && i < clone_sources_to_rollback; i++)
| |             btrfs_root_dec_send_in_progress(
| |                             sctx->clone_roots[i].root);
| |
| | So just don't increment the send_in_progress counter if the root is not readonly or is flagged as dead.
| |
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Reviewed-by: David Sterba <dsterba@suse.cz>
| | Signed-off-by: Chris Mason <clm@fb.com>
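The ordering this fix relies on (validate first, increment second, and only roll back what was actually incremented) can be sketched as follows. All names here are hypothetical: `struct fake_root` and `acquire_clone_roots()` are stand-ins for illustration, `-EPERM` is an arbitrary error code, and the locking that the real code needs is left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a root with a send_in_progress counter. */
struct fake_root {
    int readonly;
    int dead;
    int send_in_progress;
};

/*
 * Take a send_in_progress reference on every root, or on none of them.
 * A root is checked before its counter is touched, so the rollback loop
 * only ever visits roots that were really incremented.
 */
static int acquire_clone_roots(struct fake_root **roots, int nr)
{
    int acquired = 0;

    for (int i = 0; i < nr; i++) {
        struct fake_root *root = roots[i];

        if (!root->readonly || root->dead) {
            /* Undo the references taken so far, then bail out. */
            while (acquired > 0)
                roots[--acquired]->send_in_progress--;
            return -EPERM;
        }
        root->send_in_progress++;
        acquired = i + 1;   /* updated only after the increment */
    }
    return 0;
}
```

Because `acquired` advances only after a successful increment, a failure on root `i` leaves every counter at its original value.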
| * Btrfs: send, add missing check for dead clone root (Filipe Manana, 2015-03-26, 1 file, -1/+2)
| | After we locked the root's root item, a concurrent snapshot deletion call might have set the dead flag on it. So check if the dead flag is set and abort if it is, just like we do for the parent root.
| |
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Reviewed-by: David Sterba <dsterba@suse.cz>
| | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: remove deleted xattrs on fsync log replay (Filipe Manana, 2015-03-26, 1 file, -14/+109)
| | If we deleted xattrs from a file and fsynced the file, after a log replay
| | the xattrs would remain associated with the file. This was an unexpected
| | behaviour and differs from what other filesystems do, such as xfs and
| | ext3/4.
| |
| | Fix this by checking, on fsync log replay, whether every xattr in the
| | fs/subvol tree (that belongs to a logged inode) has a matching xattr in
| | the log, and deleting it from the fs/subvol tree if it does not. This is
| | a similar approach to what we do for dentries when we replay a directory
| | from the fsync log.
| |
| | This issue is trivial to reproduce, and the following excerpt from my test
| | for xfstests triggers the issue:
| |
| |   _crash_and_mount()
| |   {
| |        # Simulate a crash/power loss.
| |        _load_flakey_table $FLAKEY_DROP_WRITES
| |        _unmount_flakey
| |        _load_flakey_table $FLAKEY_ALLOW_WRITES
| |        _mount_flakey
| |   }
| |
| |   rm -f $seqres.full
| |
| |   _scratch_mkfs >> $seqres.full 2>&1
| |   _init_flakey
| |   _mount_flakey
| |
| |   # Create our test file and add 3 xattrs to it.
| |   touch $SCRATCH_MNT/foobar
| |   $SETFATTR_PROG -n user.attr1 -v val1 $SCRATCH_MNT/foobar
| |   $SETFATTR_PROG -n user.attr2 -v val2 $SCRATCH_MNT/foobar
| |   $SETFATTR_PROG -n user.attr3 -v val3 $SCRATCH_MNT/foobar
| |
| |   # Make sure everything is durably persisted.
| |   sync
| |
| |   # Now delete the second xattr and fsync the inode.
| |   $SETFATTR_PROG -x user.attr2 $SCRATCH_MNT/foobar
| |   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
| |
| |   _crash_and_mount
| |
| |   # After the fsync log is replayed, the file should have only 2 xattrs, the ones
| |   # named user.attr1 and user.attr3. The btrfs fsync log replay bug left the file
| |   # with the 3 xattrs that we had before deleting the second one and fsyncing the
| |   # file.
| |   echo "xattr names and values after first fsync log replay:"
| |   $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch
| |
| |   # Now write some data to our file, fsync it, remove the first xattr, add a new
| |   # hard link to our file and commit the fsync log by fsyncing some other new
| |   # file. This is to verify that after log replay our first xattr does not exist
| |   # anymore.
| |   echo "hello world!" >> $SCRATCH_MNT/foobar
| |   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
| |   $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar
| |   ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link
| |   touch $SCRATCH_MNT/qwerty
| |   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty
| |
| |   _crash_and_mount
| |
| |   # Now only the xattr with name user.attr3 should be set in our file.
| |   echo "xattr names and values after second fsync log replay:"
| |   $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch
| |
| |   status=0
| |   exit
| |
| | The expected golden output, which is produced with this patch applied or
| | when testing against xfs or ext3/4, is:
| |
| |   xattr names and values after first fsync log replay:
| |   # file: SCRATCH_MNT/foobar
| |   user.attr1="val1"
| |   user.attr3="val3"
| |
| |   xattr names and values after second fsync log replay:
| |   # file: SCRATCH_MNT/foobar
| |   user.attr3="val3"
| |
| | Without this patch applied, the output is:
| |
| |   xattr names and values after first fsync log replay:
| |   # file: SCRATCH_MNT/foobar
| |   user.attr1="val1"
| |   user.attr2="val2"
| |   user.attr3="val3"
| |
| |   xattr names and values after second fsync log replay:
| |   # file: SCRATCH_MNT/foobar
| |   user.attr1="val1"
| |   user.attr2="val2"
| |   user.attr3="val3"
| |
| | A patch with a test case for xfstests follows soon.
| |
| | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | Signed-off-by: Chris Mason <clm@fb.com>
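The reconciliation step described in the xattr log replay commit above can be modeled in userspace. The following is a minimal sketch under stated assumptions, not the kernel implementation: `struct xattr_set`, `xattr_set_contains()` and `replay_xattr_deletes()` are hypothetical names, and plain arrays of strings stand in for the xattr items of one logged inode in the fs/subvol tree and in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model: the xattr names attached to one inode. */
struct xattr_set {
	const char *names[8];
	size_t count;
};

/* Return 1 if name is present in set, 0 otherwise. */
int xattr_set_contains(const struct xattr_set *set, const char *name)
{
	size_t i;

	for (i = 0; i < set->count; i++)
		if (strcmp(set->names[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Drop every xattr from the fs/subvol tree copy that has no matching
 * entry in the log copy, compacting the array in place.
 * Returns the number of xattrs removed.
 */
size_t replay_xattr_deletes(struct xattr_set *tree, const struct xattr_set *log)
{
	size_t i, kept = 0, removed = 0;

	assert(tree != NULL && log != NULL);

	for (i = 0; i < tree->count; i++) {
		if (xattr_set_contains(log, tree->names[i]))
			tree->names[kept++] = tree->names[i];
		else
			removed++;
	}
	tree->count = kept;
	return removed;
}
```

With a tree holding user.attr1, user.attr2 and user.attr3 and a log holding only user.attr1 and user.attr3, the model removes user.attr2, which corresponds to the golden output after the first log replay.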
| * Merge branch 'cleanups-post-3.19' of ↵Chris Mason2015-03-2525-350/+384
| |\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 Signed-off-by: Chris Mason <clm@fb.com> Conflicts: fs/btrfs/disk-io.c
| | * btrfs: cleanup, reduce temporary variables in btrfs_read_rootsDavid Sterba2015-02-161-29/+25
| | | | | | | | | | | | Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: use correct type for workqueue flagsDavid Sterba2015-02-164-5/+5
| | | | | | | | | | | | | | | | | | | | | Through all the local wrappers to alloc_workqueue, __alloc_workqueue_key takes an unsigned int. Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_read_roots() out of open_ctree()Eric Sandeen2015-02-161-65/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also, remove the two local variables create_uuid_tree and check_uuid_tree; we can use the existence of the uuid root and/or the RESCAN_UUID_TREE flag to determine what action to take. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_replay_log() out of open_ctree()Eric Sandeen2015-02-161-40/+53
| | | | | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_init_workqueues() out of open_ctree()Eric Sandeen2015-02-161-70/+83
| | | | | | | | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_workqueues] Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_init_qgroup() out of open_ctree()Eric Sandeen2015-02-161-11/+15
| | | | | | | | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_qgroup] Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_init_dev_replace_locks() out of open_ctree()Eric Sandeen2015-02-161-6/+12
| | | | | | | | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_dev_replace_locks] Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_init_btree_inode() out of open_ctree()Eric Sandeen2015-02-161-25/+31
| | | | | | | | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_btree_inode] Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_init_balance() out of open_ctree()Eric Sandeen2015-02-161-8/+12
| | | | | | | | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_balance] Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: factor btrfs_init_scrub() out of open_ctree()Eric Sandeen2015-02-161-7/+12
| | | | | | | | | | | | | | | | | | Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_scrub] Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: consistently use fs_info in close_ctree()Eric Sandeen2015-02-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | close_ctree() has a local fs_info var for convenience; use it consistently. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: remove unused fs_info arg from btrfs_close_extra_devices()Eric Sandeen2015-02-163-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 8dabb74 ("Btrfs: change core code of btrfs to support the device replace operations") added the fs_info argument but never used it, so remove it again. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: fix sizeof format specifier in btrfs_check_super_valid()Fabian Frederick2015-02-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a mips compilation warning: fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid': fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat] Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David Sterba <dsterba@suse.cz>
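The warning above arises because sizeof yields a size_t, which is 'unsigned int' on some ABIs and 'unsigned long' on others, so a fixed '%lu' or '%u' cannot be right everywhere. A small userspace sketch of the portable form, assuming the C99 `%zu` conversion; `format_size()` is a hypothetical helper for illustration and is not the call site touched by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a size_t value into buf. The 'z' length modifier tells the
 * formatter that the argument has the width of size_t on this ABI.
 * Returns the number of characters that the full output requires.
 */
int format_size(char *buf, size_t buflen, size_t value)
{
	assert(buf != NULL);
	return snprintf(buf, buflen, "%zu", value);
}
```

The same call compiles without a format warning whether size_t is 32 or 64 bits wide, which is the property the fix is after.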
| | * btrfs: cleanup: use for() loop in btrfs_map_bio()Zhao Lei2015-02-161-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | A for() loop is clearly a better fit for this code block. Also remove an unused initial value, which reduces the binary size by about 6 bytes. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: remove unused chunk_tree argument in several functionsZhao Lei2015-02-161-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | These functions have carried an unused chunk_tree argument from the beginning. It is time to remove it and to clean up the related code in the callers that prepared its value. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: cleanup: remove no-used alloc_chunk in btrfs_check_data_free_space()Zhao Lei2015-02-161-2/+2
| | | | | | | | | | | | | | | | | | | | | int alloc_chunk is never used in this function, remove it. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * btrfs: constify structs with op functions or static definitionsDavid Sterba2015-02-1610-11/+13
| | | | | | | | | | | | | | | | | | | | | Some op tables can easily be made const, as can the sysfs feature and raid tables. This is motivated by the PaX CONSTIFY plugin. Signed-off-by: David Sterba <dsterba@suse.cz>
| | * Btrfs: switch to kvfree() helperWang Shilong2015-02-162-14/+4
| | | | | | | | | | | | | | | | | | | | | A new helper kvfree() in mm/util.c will do this. Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * Btrfs: disk-io: replace root args iff only fs_info usedDaniel Dressler2015-02-166-38/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the 3rd independent patch of a larger project to clean up btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will clean up between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) csum_tree_block 2) csum_dirty_buffer 3) check_tree_block_fsid 4) btrfs_find_tree_block 5) clean_tree_block Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * Btrfs: delayed-inode: replace root args iff only fs_info usedDaniel Dressler2015-02-161-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the second independent patch of a larger project to clean up btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will clean up between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) btrfs_wq_run_delayed_node Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
| | * Btrfs: ctree: reduce args where only fs_info usedDaniel Dressler2015-02-164-18/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is part of a larger project to clean up btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. This causes programmers to ponder which root can be passed. Since only the fs_info is read, the affected functions can accept any root, but this is only obvious upon inspection. This patch reduces the specificity of such functions to accept the fs_info directly. This patch does not address the two functions in ctree.c (insert_ptr and split_item) which only use root for BUG_ONs. This patch affects the following functions: 1) fixup_low_keys 2) btrfs_set_item_key_safe Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
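The argument narrowing that these fs_info cleanup patches describe can be illustrated with a reduced userspace model. This is only a sketch: `struct fs_info_model`, `struct root_model`, `node_size_from_root()` and `node_size()` are hypothetical stand-ins, not the kernel's `struct btrfs_fs_info`, `struct btrfs_root` or any function named in the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct fs_info_model {
	unsigned int nodesize;
};

struct root_model {
	struct fs_info_model *fs_info;	/* shared by every root of one filesystem */
	unsigned long long objectid;	/* differs per root, unused below */
};

/*
 * Before: the signature asks for a root, yet only root->fs_info is read,
 * so a caller cannot tell from the prototype that any root would do.
 */
unsigned int node_size_from_root(const struct root_model *root)
{
	assert(root != NULL && root->fs_info != NULL);
	return root->fs_info->nodesize;
}

/* After: the signature asks for exactly what the body uses. */
unsigned int node_size(const struct fs_info_model *fs_info)
{
	assert(fs_info != NULL);
	return fs_info->nodesize;
}
```

Two roots that share one fs_info give the same answer through the first form, which is the non-obvious property the patches make explicit by switching to the second form.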
| * | Merge branch 'cleanups-for-4.1-v2' of ↵Chris Mason2015-03-2521-135/+126
| |\ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1