From: Linus Torvalds
Date: Fri, 28 Apr 2023 02:42:02 +0000 (-0700)
Subject: Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel...
X-Git-Tag: v6.6-pxa1908~1349
X-Git-Url: https://git.dujemihanovic.xyz/?a=commitdiff_plain;h=7fa8a8ee9400fe8ec188426e40e481717bc5e924;p=linux.git

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd,
   permitting userspace to wr-protect unpopulated ptes in anonymous
   memory.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also converted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu-isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changes.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page reclaim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
---

7fa8a8ee9400fe8ec188426e40e481717bc5e924
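Most of the filesystem hunks below are conflict resolutions around two page-cache changes pulled in here: __filemap_get_folio() now returns an ERR_PTR() rather than NULL (the "make __filemap_get_folio()'s return value more useful" item above), and the FGP_WRITEBEGIN flag combination added in the include/linux/pagemap.h hunk. A minimal sketch of the resulting calling convention; example_write_begin() is a hypothetical helper used only for illustration, not code from this merge:

/*
 * Illustrative sketch only -- not part of the patch.
 */
#include <linux/pagemap.h>
#include <linux/err.h>

static int example_write_begin(struct address_space *mapping, pgoff_t index,
                               struct folio **foliop)
{
        struct folio *folio;

        /* Before this merge, the page-cache lookup returned NULL on failure. */
        folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
                                    mapping_gfp_mask(mapping));
        if (IS_ERR(folio))              /* never NULL any more */
                return PTR_ERR(folio);  /* e.g. -ENOMEM */

        /*
         * FGP_WRITEBEGIN = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE,
         * so the folio comes back locked and stable for writing.
         */
        *foliop = folio;
        return 0;       /* caller later does folio_unlock() + folio_put() */
}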
diff --cc fs/ext4/inline.c
index b9fb1177fff6,1602d74b5eeb..859bc4e2c9b0
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@@ -564,11 -564,12 +564,11 @@@ retry
  	/* We cannot recurse into the filesystem as the transaction is already
  	 * started */
- 	flags = memalloc_nofs_save();
- 	page = grab_cache_page_write_begin(mapping, 0);
- 	memalloc_nofs_restore(flags);
- 	if (!page) {
- 		ret = -ENOMEM;
- 		goto out;
+ 	folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN | FGP_NOFS,
+ 					mapping_gfp_mask(mapping));
- 	if (!folio) {
- 		ret = -ENOMEM;
- 		goto out;
++	if (IS_ERR(folio)) {
++		ret = PTR_ERR(folio);
++		goto out_nofolio;
  	}
  	ext4_write_lock_xattr(inode, &no_expand);
@@@ -626,13 -627,13 +626,14 @@@
  	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
  		goto retry;
- 	if (page)
- 		block_commit_write(page, from, to);
+ 	if (folio)
+ 		block_commit_write(&folio->page, from, to);
  out:
- 	if (page) {
- 		unlock_page(page);
- 		put_page(page);
+ 	if (folio) {
+ 		folio_unlock(folio);
+ 		folio_put(folio);
  	}
++out_nofolio:
  	if (sem_held)
  		ext4_write_unlock_xattr(inode, &no_expand);
  	if (handle)
@@@ -691,10 -693,11 +692,10 @@@ int ext4_try_to_write_inline_data(struc
  	if (ret)
  		goto out;
- 	flags = memalloc_nofs_save();
- 	page = grab_cache_page_write_begin(mapping, 0);
- 	memalloc_nofs_restore(flags);
- 	if (!page) {
- 		ret = -ENOMEM;
+ 	folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN | FGP_NOFS,
+ 					mapping_gfp_mask(mapping));
- 	if (!folio) {
- 		ret = -ENOMEM;
++	if (IS_ERR(folio)) {
++		ret = PTR_ERR(folio);
  		goto out;
  	}
@@@ -850,12 -852,11 +851,12 @@@ static int ext4_da_convert_inline_data_
  					    void **fsdata)
  {
  	int ret = 0, inline_size;
- 	struct page *page;
+ 	struct folio *folio;
- 	page = grab_cache_page_write_begin(mapping, 0);
- 	if (!page)
- 		return -ENOMEM;
+ 	folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN,
+ 					mapping_gfp_mask(mapping));
- 	if (!folio)
- 		return -ENOMEM;
++	if (IS_ERR(folio))
++		return PTR_ERR(folio);
  	down_read(&EXT4_I(inode)->xattr_sem);
  	if (!ext4_has_inline_data(inode)) {
@@@ -945,10 -947,11 +946,10 @@@ retry_journal
  	 * We cannot recurse into the filesystem as the transaction
  	 * is already started.
  	 */
- 	flags = memalloc_nofs_save();
- 	page = grab_cache_page_write_begin(mapping, 0);
- 	memalloc_nofs_restore(flags);
- 	if (!page) {
- 		ret = -ENOMEM;
+ 	folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN | FGP_NOFS,
+ 					mapping_gfp_mask(mapping));
- 	if (!folio) {
- 		ret = -ENOMEM;
++	if (IS_ERR(folio)) {
++		ret = PTR_ERR(folio);
  		goto out_journal;
  	}
diff --cc fs/ext4/inode.c
index 8dbd352e3986,d7973743417b..ffbbd9626bd8
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@@ -1153,17 -1184,16 +1153,17 @@@ static int ext4_write_begin(struct fil
  	}
  	/*
- 	 * grab_cache_page_write_begin() can take a long time if the
- 	 * system is thrashing due to memory pressure, or if the page
+ 	 * __filemap_get_folio() can take a long time if the
+ 	 * system is thrashing due to memory pressure, or if the folio
  	 * is being written back. So grab it first before we start
  	 * the transaction handle. This also allows us to allocate
- 	 * the page (if needed) without using GFP_NOFS.
+ 	 * the folio (if needed) without using GFP_NOFS.
  	 */
  retry_grab:
- 	page = grab_cache_page_write_begin(mapping, index);
- 	if (!page)
- 		return -ENOMEM;
+ 	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+ 					mapping_gfp_mask(mapping));
- 	if (!folio)
- 		return -ENOMEM;
++	if (IS_ERR(folio))
++		return PTR_ERR(folio);
  	/*
  	 * The same as page allocation, we prealloc buffer heads before
  	 * starting the handle.
@@@ -2904,22 -3070,22 +2904,22 @@@ static int ext4_da_write_begin(struct f
  	}
  retry:
- 	page = grab_cache_page_write_begin(mapping, index);
- 	if (!page)
- 		return -ENOMEM;
+ 	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+ 					mapping_gfp_mask(mapping));
- 	if (!folio)
- 		return -ENOMEM;
++	if (IS_ERR(folio))
++		return PTR_ERR(folio);
- 	/* In case writeback began while the page was unlocked */
- 	wait_for_stable_page(page);
+ 	/* In case writeback began while the folio was unlocked */
+ 	folio_wait_stable(folio);
  #ifdef CONFIG_FS_ENCRYPTION
- 	ret = ext4_block_write_begin(page, pos, len,
- 					ext4_da_get_block_prep);
+ 	ret = ext4_block_write_begin(folio, pos, len, ext4_da_get_block_prep);
  #else
- 	ret = __block_write_begin(page, pos, len, ext4_da_get_block_prep);
+ 	ret = __block_write_begin(&folio->page, pos, len, ext4_da_get_block_prep);
  #endif
  	if (ret < 0) {
- 		unlock_page(page);
- 		put_page(page);
+ 		folio_unlock(folio);
+ 		folio_put(folio);
  		/*
  		 * block_write_begin may have instantiated a few blocks
  		 * outside i_size. Trim these off again. Don't need
@@@ -3612,14 -3809,13 +3612,14 @@@ static int __ext4_block_zero_page_range
  	ext4_lblk_t iblock;
  	struct inode *inode = mapping->host;
  	struct buffer_head *bh;
- 	struct page *page;
+ 	struct folio *folio;
  	int err = 0;
- 	page = find_or_create_page(mapping, from >> PAGE_SHIFT,
- 				   mapping_gfp_constraint(mapping, ~__GFP_FS));
- 	if (!page)
- 		return -ENOMEM;
+ 	folio = __filemap_get_folio(mapping, from >> PAGE_SHIFT,
+ 					FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+ 					mapping_gfp_constraint(mapping, ~__GFP_FS));
- 	if (!folio)
- 		return -ENOMEM;
++	if (IS_ERR(folio))
++		return PTR_ERR(folio);
  	blocksize = inode->i_sb->s_blocksize;
diff --cc fs/ext4/move_extent.c
index e509c22a21ed,7bf6d069199c..b5af2fc03b2f
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@@ -138,20 -139,20 +138,20 @@@ mext_folio_double_lock(struct inode *in
  	}
  	flags = memalloc_nofs_save();
- 	folio[0] = __filemap_get_folio(mapping[0], index1, fgp_flags,
+ 	folio[0] = __filemap_get_folio(mapping[0], index1, FGP_WRITEBEGIN,
  			mapping_gfp_mask(mapping[0]));
- 	if (!folio[0]) {
+ 	if (IS_ERR(folio[0])) {
  		memalloc_nofs_restore(flags);
- 		return -ENOMEM;
+ 		return PTR_ERR(folio[0]);
  	}
- 	folio[1] = __filemap_get_folio(mapping[1], index2, fgp_flags,
+ 	folio[1] = __filemap_get_folio(mapping[1], index2, FGP_WRITEBEGIN,
  			mapping_gfp_mask(mapping[1]));
  	memalloc_nofs_restore(flags);
- 	if (!folio[1]) {
+ 	if (IS_ERR(folio[1])) {
  		folio_unlock(folio[0]);
  		folio_put(folio[0]);
- 		return -ENOMEM;
+ 		return PTR_ERR(folio[1]);
  	}
  	/*
  	 * __filemap_get_folio() may not wait on folio's writeback if
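The mext_folio_double_lock() hunk above keeps its memalloc_nofs_save()/memalloc_nofs_restore() scope, while the ext4 inline-data paths earlier switched to passing FGP_NOFS per lookup; both keep the page-cache allocation from recursing into the filesystem. A sketch of the scoped variant, using a hypothetical example_get_two() helper rather than the real mext_folio_double_lock() logic:

/*
 * Illustrative sketch only -- not part of the patch.
 */
#include <linux/mm.h>
#include <linux/sched/mm.h>
#include <linux/pagemap.h>
#include <linux/err.h>

static int example_get_two(struct address_space *m0, pgoff_t i0,
                           struct address_space *m1, pgoff_t i1,
                           struct folio *folio[2])
{
        unsigned int flags;

        /* Every allocation between save/restore behaves as GFP_NOFS. */
        flags = memalloc_nofs_save();
        folio[0] = __filemap_get_folio(m0, i0, FGP_WRITEBEGIN,
                                       mapping_gfp_mask(m0));
        folio[1] = __filemap_get_folio(m1, i1, FGP_WRITEBEGIN,
                                       mapping_gfp_mask(m1));
        memalloc_nofs_restore(flags);

        if (IS_ERR(folio[0]) || IS_ERR(folio[1])) {
                /* Unwind whichever lookup succeeded before reporting. */
                if (!IS_ERR(folio[0])) {
                        folio_unlock(folio[0]);
                        folio_put(folio[0]);
                }
                if (!IS_ERR(folio[1])) {
                        folio_unlock(folio[1]);
                        folio_put(folio[1]);
                }
                return IS_ERR(folio[0]) ? PTR_ERR(folio[0]) : PTR_ERR(folio[1]);
        }
        return 0;
}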
diff --cc fs/ext4/verity.c
index 3b01247066dd,e4da1704438e..2f37e1ea3955
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@@ -365,17 -367,17 +365,19 @@@ static struct page *ext4_read_merkle_tr
  	index += ext4_verity_metadata_pos(inode) >> PAGE_SHIFT;
- 	page = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
- 	if (!page || !PageUptodate(page)) {
+ 	folio = __filemap_get_folio(inode->i_mapping, index, FGP_ACCESSED, 0);
- 	if (!folio || !folio_test_uptodate(folio)) {
++	if (IS_ERR(folio) || !folio_test_uptodate(folio)) {
  		DEFINE_READAHEAD(ractl, NULL, NULL, inode->i_mapping, index);
- 		if (folio)
- 		if (page)
- 			put_page(page);
++		if (!IS_ERR(folio))
+ 			folio_put(folio);
  		else if (num_ra_pages > 1)
  			page_cache_ra_unbounded(&ractl, num_ra_pages, 0);
- 		page = read_mapping_page(inode->i_mapping, index, NULL);
+ 		folio = read_mapping_folio(inode->i_mapping, index, NULL);
++		if (IS_ERR(folio))
++			return ERR_CAST(folio);
  	}
- 	return page;
+ 	return folio_file_page(folio, index);
  }
diff --cc fs/iomap/buffered-io.c
index 10a203515583,96bb56c203f4..063133ec77f4
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@@ -467,8 -467,7 +467,7 @@@ EXPORT_SYMBOL_GPL(iomap_is_partially_up
   */
  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
  {
- 	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
+ 	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
- 	struct folio *folio;
  	if (iter->flags & IOMAP_NOWAIT)
  		fgp |= FGP_NOWAIT;
diff --cc fs/netfs/buffered_read.c
index e3d754a9e1b0,209726a9cfdb..3404707ddbe7
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@@ -347,10 -348,10 +347,10 @@@ int netfs_write_begin(struct netfs_inod
  	DEFINE_READAHEAD(ractl, file, NULL, mapping, index);
  retry:
- 	folio = __filemap_get_folio(mapping, index, fgp_flags,
+ 	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
  				    mapping_gfp_mask(mapping));
- 	if (!folio)
- 		return -ENOMEM;
+ 	if (IS_ERR(folio))
+ 		return PTR_ERR(folio);
  	if (ctx->ops->check_write_begin) {
  		/* Allow the netfs (eg. ceph) to flush conflicts. */
diff --cc fs/nfs/file.c
index 2474cbc30712,1d03406e6c03..f0edf5a36237
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@@ -326,10 -335,9 +326,10 @@@ static int nfs_write_begin(struct file
  		file, mapping->host->i_ino, len, (long long) pos);
  start:
- 	folio = nfs_folio_grab_cache_write_begin(mapping, pos >> PAGE_SHIFT);
+ 	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, FGP_WRITEBEGIN,
+ 				    mapping_gfp_mask(mapping));
- 	if (!folio)
- 		return -ENOMEM;
+ 	if (IS_ERR(folio))
+ 		return PTR_ERR(folio);
  	*pagep = &folio->page;
  	ret = nfs_flush_incompatible(file, folio);
diff --cc include/linux/pagemap.h
index c4698dcc70ba,fdcd595d2294..a56308a9d1a4
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@@ -504,11 -504,9 +504,11 @@@ pgoff_t page_cache_prev_miss(struct add
  #define FGP_NOFS		0x00000010
  #define FGP_NOWAIT		0x00000020
  #define FGP_FOR_MMAP		0x00000040
- #define FGP_ENTRY		0x00000080
- #define FGP_STABLE		0x00000100
+ #define FGP_STABLE		0x00000080
 +#define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 +
+ void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
  		int fgp_flags, gfp_t gfp);
  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
diff --cc include/linux/userfaultfd_k.h
index fff49fec0258,a2c53e98dfd6..d78b01524349
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@@ -36,38 -36,55 +36,53 @@@
  #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
  #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 -extern int sysctl_unprivileged_userfaultfd;
 -
  extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
- /*
-  * The mode of operation for __mcopy_atomic and its helpers.
-  *
-  * This is almost an implementation detail (mcopy_atomic below doesn't take this
-  * as a parameter), but it's exposed here because memory-kind-specific
-  * implementations (e.g. hugetlbfs) need to know the mode of operation.
-  */
- enum mcopy_atomic_mode {
- 	/* A normal copy_from_user into the destination range. */
- 	MCOPY_ATOMIC_NORMAL,
- 	/* Don't copy; map the destination range to the zero page. */
- 	MCOPY_ATOMIC_ZEROPAGE,
- 	/* Just install pte(s) with the existing page(s) in the page cache. */
- 	MCOPY_ATOMIC_CONTINUE,
+ /* A combined operation mode + behavior flags. */
+ typedef unsigned int __bitwise uffd_flags_t;
+ 
+ /* Mutually exclusive modes of operation. */
+ enum mfill_atomic_mode {
+ 	MFILL_ATOMIC_COPY,
+ 	MFILL_ATOMIC_ZEROPAGE,
+ 	MFILL_ATOMIC_CONTINUE,
+ 	NR_MFILL_ATOMIC_MODES,
  };
- extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+ #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
+ #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
+ #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
+ #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1))
+ 
+ static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
+ {
+ 	return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected);
+ }
+ 
+ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
+ {
+ 	flags &= ~MFILL_ATOMIC_MODE_MASK;
+ 	return flags | ((__force uffd_flags_t) mode);
+ }
+ 
+ /* Flags controlling behavior. These behavior changes are mode-independent. */
+ #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0)
+ 
+ extern int mfill_atomic_install_pte(pmd_t *dst_pmd,
  				    struct vm_area_struct *dst_vma,
  				    unsigned long dst_addr, struct page *page,
- 				    bool newly_allocated, bool wp_copy);
- 
- extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
- 			    unsigned long src_start, unsigned long len,
- 			    atomic_t *mmap_changing, __u64 mode);
- extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
- 			      unsigned long dst_start,
- 			      unsigned long len,
- 			      atomic_t *mmap_changing);
- extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
- 			      unsigned long len, atomic_t *mmap_changing);
+ 				    bool newly_allocated, uffd_flags_t flags);
+ 
+ extern ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start,
+ 				 unsigned long src_start, unsigned long len,
+ 				 atomic_t *mmap_changing, uffd_flags_t flags);
+ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
+ 				     unsigned long dst_start,
+ 				     unsigned long len,
+ 				     atomic_t *mmap_changing);
+ extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
+ 				     unsigned long len, atomic_t *mmap_changing,
+ 				     uffd_flags_t flags);
  extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
  			       unsigned long len, bool enable_wp,
  			       atomic_t *mmap_changing);
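The userfaultfd_k.h hunk above replaces enum mcopy_atomic_mode with uffd_flags_t, which packs the mutually exclusive mode into the low MFILL_ATOMIC_MODE_BITS bits and independent behavior flags (such as MFILL_ATOMIC_WP) into the bits above them. A small userspace restatement of that encoding, with the kernel's __bitwise/__force annotations and const_ilog2()/BIT() helpers replaced by plain C so the arithmetic can be checked directly:

/*
 * Illustrative only: not kernel code, just the bit layout from the
 * hunk above re-stated so it can be compiled and run in userspace.
 */
#include <stdio.h>

typedef unsigned int uffd_flags_t;

enum mfill_atomic_mode {
        MFILL_ATOMIC_COPY,
        MFILL_ATOMIC_ZEROPAGE,
        MFILL_ATOMIC_CONTINUE,
        NR_MFILL_ATOMIC_MODES,
};

#define MFILL_ATOMIC_MODE_BITS  2       /* const_ilog2(3 - 1) + 1 for the 3 modes */
#define MFILL_ATOMIC_BIT(nr)    (1u << (MFILL_ATOMIC_MODE_BITS + (nr)))
#define MFILL_ATOMIC_WP         MFILL_ATOMIC_BIT(0)
#define MFILL_ATOMIC_MODE_MASK  (MFILL_ATOMIC_BIT(0) - 1)

static int uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
{
        return (flags & MFILL_ATOMIC_MODE_MASK) == (uffd_flags_t)expected;
}

static uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
{
        flags &= ~MFILL_ATOMIC_MODE_MASK;
        return flags | (uffd_flags_t)mode;
}

int main(void)
{
        /* A write-protecting copy: behavior flag plus mode in one word. */
        uffd_flags_t f = uffd_flags_set_mode(MFILL_ATOMIC_WP, MFILL_ATOMIC_COPY);

        printf("flags=%#x copy?=%d wp?=%d\n",
               f, uffd_flags_mode_is(f, MFILL_ATOMIC_COPY),
               !!(f & MFILL_ATOMIC_WP));        /* flags=0x4 copy?=1 wp?=1 */
        return 0;
}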
diff --cc mm/mmap.c
index eefa6f0cda28,536bbb8fa0ae..5522130ae606
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@@ -978,11 -1008,11 +1008,11 @@@ struct vm_area_struct *vma_merge(struc
  			vma = next;			/* case 3 */
  			vma_start = addr;
  			vma_end = next->vm_end;
- 			vma_pgoff = next->vm_pgoff;
+ 			vma_pgoff = next->vm_pgoff - pglen;
- 			err = 0;
- 			if (mid != next) {	/* case 8 */
- 				remove = mid;
- 				err = dup_anon_vma(res, remove);
+ 			if (curr) {			/* case 8 */
+ 				vma_pgoff = curr->vm_pgoff;
+ 				remove = curr;
+ 				err = dup_anon_vma(next, curr);
  			}
  		}
  	}
diff --cc mm/zswap.c
index f2fc0373b967,af97e8f9d678..e1e621d0b6a0
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@@ -1537,8 -1570,16 +1570,15 @@@ cache_fail
  	zswap_enabled = false;
  	return -ENOMEM;
  }
+ 
+ static int __init zswap_init(void)
+ {
+ 	if (!zswap_enabled)
+ 		return 0;
+ 	return zswap_setup();
+ }
  /* must be late so crypto has time to come up */
- late_initcall(init_zswap);
+ late_initcall(zswap_init);
 -MODULE_LICENSE("GPL");
  MODULE_AUTHOR("Seth Jennings ");
  MODULE_DESCRIPTION("Compressed cache for swap pages");
diff --cc tools/testing/selftests/mm/Makefile
index fc35050b5542,f764f2b3e34b..23af4633f0f4
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@@ -90,10 -93,10 +93,10 @@@ endi
  endif
 -ifneq (,$(filter $(MACHINE),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64))
 +ifneq (,$(filter $(MACHINE),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sparc64 x86_64))
- TEST_GEN_FILES += va_128TBswitch
- TEST_GEN_FILES += virtual_address_range
- TEST_GEN_FILES += write_to_hugetlbfs
+ TEST_GEN_PROGS += va_high_addr_switch
+ TEST_GEN_PROGS += virtual_address_range
+ TEST_GEN_PROGS += write_to_hugetlbfs
  endif
  TEST_PROGS := run_vmtests.sh
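None of the hunks shown here touch KSM itself, but the per-process KSM interface mentioned in the pull text ("mm: add new api to enable ksm per process") is exercised from userspace with a single prctl() call. A hedged sketch, assuming the PR_SET_MEMORY_MERGE/PR_GET_MEMORY_MERGE request numbers added by that series (defined locally in case the installed prctl.h predates them); the call may also require CAP_SYS_RESOURCE and CONFIG_KSM:

/*
 * Illustrative only: opt a whole process into KSM merging.
 * The prctl names/values below are assumptions if your headers
 * predate this merge.
 */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE	67
#define PR_GET_MEMORY_MERGE	68
#endif

int main(void)
{
	/* Mark current and future compatible VMAs of this process mergeable. */
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
		perror("PR_SET_MEMORY_MERGE");

	printf("process-wide KSM enabled: %d\n",
	       prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
	return 0;
}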