mm_modules_091_100_kern-2.6.39.patch

diff -Naru8 a/Documentation/mm_modules.txt b/Documentation/mm_modules.txt
--- a/Documentation/mm_modules.txt	1970-01-01 01:00:00.000000000 +0100
+++ b/Documentation/mm_modules.txt	2011-06-08 12:13:07.361021729 +0100
@@ -0,0 +1,284 @@
+	Virtual (MM_MODULES) and Physical (PMEM_MODULES) modules for Linux 2.6.32
+
+	Gil Tene <gil@azulsystems.com>
+
+In order to support extended functionality for virtual and physical memory,
+enabling loadable modules to deliver integrated memory management
+functionality is desirable.
+
+Examples of valuable extended functionality can include:
+
+- Support for mappings with multiple and mixed page sizes
+	- Including transitioning of mapped addresses from large to small page
+	  mappings, or small to large.
+- Support for very high sustained mapping modification rates:
+	- Allowing concurrent modifications within the same address space
+	- Allowing user to [safely] indicate lazy TLB invalidation and
+	  thereby dramatically reduce per change costs
+	- Supporting fast, safe application of very large "batch" sets
+	  of mapping modifications (remaps and mprotects), such that
+	  all changes become visible within the same, extremely short
+	  period of time.
+- Support for a large number of disjoint mappings with arbitrary manipulations
+  at high rates
+
+In order to support such functionality, memory management modules need
+to interact with several points in the virtual and physical memory systems
+that are "lower" than those of a typical module or driver under previous
+kernel interfaces. The specific interface points are itemized below.
+
+A set of proposed patches against the RHEL 2.6.32-rc7-git1 kernel is provided,
+which creates the appropriate interfaces for modules to register with,
+allowing them to interact with vmas, mm structures and pages through
+their needed lifecycle transitions, as well as keep state they
+may need associated with vmas and mm structures.
+
+----------------------------------------------------------------------
+
+At a high level, the patches represent:
+
+Changes to existing data structures:
+
+- An added "mm_modules" field to struct mm_struct.
+
+- Two added fields to struct vm_area_struct (mm_module_ops, mm_module_vma_state).
+
+New data structures:
+
+- Four new common data types (struct mm_module_struct,
+  struct pmem_module_struct, struct pmem_module_operations_struct,
+  struct mm_module_operations_struct)
+
+Code changes:
+
+- changes to add calls into mm_module_ops and pmem_module_ops at
+  various appropriate locations.
+
+- changes to disable or make invalid certain operations (e.g. vma
+  split, merge, remap) for vmas that are controlled by mm_modules
+
+- A change to fault handling to allow handle_mm_fault to return
+  an indication for SEGV_MAPERR or SEGV_ACCERR (allow for sparsely
+  mapped, and non-homogeneously protected vmas).
+
+- A change to gup_fast (arch/x86/mm/gup.c) to make it safely independent
+  of any page table locking and invalidation schemes (as long as whatever
+  they do is safe in an SMP environment), including mechanisms that
+  may ref-count pages down to 0 before tlb-invalidating their mappings.
+
+----------------------------------------------------------------------
+Note: About the need for physical memory support
+
+While virtual memory functionality alone can support some of the 
+possible extended functionality, high performance functionality 
+requires physical memory management and control as well. A good example
+of this is in-process recycling of memory and in-process memory free
+lists and their use in dramatically dampening TLB invalidate requirements
+on allocation or deallocation edges. When a system needs to sustain
+a high rate of new mappings (e.g. 20GB/sec of sustained random, disjoint 
+map/remap/unmap operations), such in-process physical memory free lists
+become a must. 
+
+----------------------------------------------------------------------
+Note: About hugetlb
+
+To increase the likelihood of usefulness to generic virtual memory
+functionality additions, the module interface was designed such that
+all current hugetlb functionality could be developed as a loadable
+kernel module under the proposed interface.
+
+----------------------------------------------------------------------
+
+Some high level design points:
+
+- Virtual memory modules (mm_modules) are generally responsible for
+  creating and controlling their own vmas. [whole] vmas can be torn down
+  by the kernel.
+
+- The kernel's "normal" memory manipulation system calls will not modify
+  the bounds of an mm_module managed vma. [i.e. no merging, no splitting,
+  no remapping]. mm_modules may support such functionality through their own
+  entry points.
+
+- mm_modules must adhere to the kernel's convention for locking the
+  page table hierarchy for any part of the hierarchy that may be manipulated
+  by other code. While mm_modules may apply private locking schemes
+  to parts of the hierarchy (e.g. below the pmd level), they must do
+  so only with parts of the hierarchy that are known to be completely owned
+  by the module. [e.g. 2MB aligned vmas can separately control locking at
+  the pmd level and below]
+
+- mm_modules can carry unique state per mm, and unique state per vma.
+
+- mm_modules provide their own fault handling functionality. They may
+  indicate a need to SEGV with a mapping or protection si_code (sparsely
+  mapped vmas are a good example of this need).
+
+- pmem_modules manage their own lists of physical pages, and are expected
+  to be aware of physical pages that they are supposed to control, even 
+  when (and especially when) those pages carry a 0 ref count. They can
+  do so in any way they want (e.g. a module-private vmemmap mirror, or
+  one using much larger aligned page sizes).
+
+- registered pmem_modules intercept all physical page releases at
+  put_page() and release_pages(), such that when a page is ref-counted down
+  to 0, the pmem_module would pick it up before it reaches the system's
+  normal free lists.
+
+- pmem_modules are expected to support hot_plug functionality. When physical
+  memory is added to the system, all current pmem_modules must adjust their
+  internal maps of physical memory to be able to correctly handle physical
+  pages of the newly discovered range.  
+
+----------------------------------------------------------------------
+Virtual Memory Module Interface Points:
+
+Fault handling:
+	int (*handle_mm_fault)(struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long addr,
+			unsigned int flags);
+
+	Called from handle_mm_fault() for vmas managed by the mm_module to 
+	satisfy fault handling needs. May return an indication of SEGV_ACCERR
+	or SEGV_MAPERR if fault address is not mapped (e.g. for sparsely
+	populated vmas).
+
+Protection changes:
+	int (*change_protection)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned long newflags);
+
+	Called from mprotect_fixup() to change the protection of all mapped
+	pages within a vma managed by the mm_module. [needed for e.g. hugetlb
+	interaction with mprotect()]. Note: the module may (and likely will)
+	provide its own, finer grain protection control calls.
+
+Page range duplication:
+	int (*copy_page_range)(struct mm_struct *dst_mm,
+			struct mm_struct *src_mm, struct vm_area_struct *vma);
+
+	Called from copy_page_range() to duplicate a vma managed by an mm_module
+	from a src_mm to a dst_mm. Used for specialized forking behavior.
+
+Page following:
+	int (*follow_page)(struct mm_struct *mm, struct vm_area_struct *vma,
+			struct page **pages, struct vm_area_struct **vmas,
+			unsigned long *position, int *length,
+			int i, int write);
+
+	Called from get_user_pages() to get a vector of pages associated with a
+	range of addresses within a vma managed by the mm_module. Required
+	for core dumping, gdb, etc.
+
+Probe mapping protection and range:
+	int (*probe_mapped)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long *end_range, unsigned long *range_vm_flags);
+
+	Returns an indication of whether an address within a vma is mapped or
+	not, along with its protection and the range of identical protection
+	mapping. Used by core dump functionality (e.g. elf_core_dump()) for
+	efficient traversal and dumping of very large and sparsely populated
+	vmas (e.g. 16TB vma containing 300MB of mapped data).
+
+Unmapping:
+	unsigned long (*unmap_page_range)(struct mmu_gather **tlbp,
+			struct vm_area_struct *vma, unsigned long addr,
+			unsigned long end, long *zap_work,
+			struct zap_details *details); 
+
+	Called by unmap_vmas() to unmap and release all pages within a vma
+	managed by the mm_module.
+
+	void (*free_pgd_range)(struct mmu_gather *tlb, unsigned long addr,
+			unsigned long end, unsigned long floor,
+			unsigned long ceiling);
+
+	Called by free_pgtables() to tear down all page table hierarchy
+	storage associated with a vma managed by the mm_module.
+
+vma lifecycle:
+	int (*init_module_vma)(struct vm_area_struct *vma,
+			struct vm_area_struct *old_vma);
+
+	Called by dup_mmap() to initialize the mm_module state associated with
+	a newly duplicated vma managed by the mm_module.
+
+	void (*exit_module_vma)(struct vm_area_struct *vma);
+
+	Called by remove_vma() to tear down the mm_module state associated with
+	the vma managed by the mm_module.
+
+mm lifecycle:
+	int (*init_module_mm)(struct mm_struct *mm,
+			struct mm_module_struct *mm_mod);
+
+	Called from mm_init() to initialize the mm_module state associated with a
+	newly duplicated mm.
+
+	int (*exit_module_mm)(struct mm_struct *mm,
+			struct mm_module_struct *mm_mod);
+
+	Called by mmput() to tear down the mm_module state associated with an mm.
+
+struct mm_module_operations_struct {
+	int (*handle_mm_fault)(struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long addr,
+			unsigned int flags);
+	int (*change_protection)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned long newflags);
+	int (*copy_page_range)(struct mm_struct *dst_mm,
+			struct mm_struct *src_mm, struct vm_area_struct *vma);
+	int (*follow_page)(struct mm_struct *mm, struct vm_area_struct *vma,
+			struct page **pages, struct vm_area_struct **vmas,
+			unsigned long *position, int *length,
+			int i, int write);
+	int (*probe_mapped)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long *end_range, unsigned long *range_vm_flags);
+	unsigned long (*unmap_page_range)(struct mmu_gather **tlbp,
+			struct vm_area_struct *vma, unsigned long addr,
+			unsigned long end, long *zap_work,
+			struct zap_details *details); 
+	void (*free_pgd_range)(struct mmu_gather *tlb, unsigned long addr,
+			unsigned long end, unsigned long floor,
+			unsigned long ceiling);
+	int (*init_module_vma)(struct vm_area_struct *vma,
+			struct vm_area_struct *old_vma);
+	void (*exit_module_vma)(struct vm_area_struct *vma);
+	int (*init_module_mm)(struct mm_struct *mm,
+			struct mm_module_struct *mm_mod);
+	int (*exit_module_mm)(struct mm_struct *mm,
+			struct mm_module_struct *mm_mod);
+};
+
+---------------------------------------------------------------
+Physical Memory Module Interface Points:
+
+Page release interception:
+	int (*put_page)(struct page *page);
+
+	Called by put_page() to allow a pmem_module to receive a released page
+	under its management. Returns 1 if page was "taken" (determined to
+	belong to the pmem_module), or 0 if not.
+
+	int (*release_page)(struct page *page, struct zone **zonep,
+			unsigned long flags);
+
+	Called by release_pages() to allow a pmem_module to receive a released
+	page under its management. Returns 1 if page was "taken" (determined to
+	belong to the pmem_module), or 0 if not. If page was taken, the spinlock
+	&(*zonep)->lru_lock must also be released as per similar behavior in
+	release_pages().
+
+Memory hotplug support:
+	int (*sparse_mem_map_populate)(unsigned long pnum, int nid);
+
+	Called by kmalloc_section_memmap() to allow the pmem_module to 
+	initialize page mapping state associated with newly discovered
+	physical memory. Must return 0 if not successful.
+
+struct pmem_module_operations_struct {
+	int (*put_page)(struct page *page);
+	int (*release_page)(struct page *page, struct zone **zonep,
+			unsigned long flags);
+	int (*sparse_mem_map_populate)(unsigned long pnum, int nid);
+};
+
diff -Naru8 a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c	2011-05-19 05:06:34.000000000 +0100
+++ b/arch/x86/mm/fault.c	2011-06-08 12:13:07.362021584 +0100
@@ -1128,16 +1128,26 @@
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
+#ifdef CONFIG_MM_MODULES
+		if (fault & (VM_FAULT_SIGSEGV)) {
+			if (fault & VM_FAULT_SEGV_ACCERR)
+				bad_area_access_error(regs, error_code,
+					address);
+			else
+				bad_area(regs, error_code, address);
+			return;
+		}
+#endif /* CONFIG_MM_MODULES */
 		mm_fault_error(regs, error_code, address, fault);
 		return;
 	}
 
 	/*
 	 * Major/minor page fault accounting is only done on the
 	 * initial attempt. If we go through a retry, it is extremely
 	 * likely that the page will be found in page cache at that point.
diff -Naru8 a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c	2011-05-19 05:06:34.000000000 +0100
+++ b/arch/x86/mm/gup.c	2011-06-08 12:28:18.533000257 +0100
@@ -58,16 +58,30 @@
 	smp_rmb();
 	if (unlikely(pte.pte_low != ptep->pte_low))
 		goto retry;
 
 	return pte;
 #endif
 }
 
+#ifdef CONFIG_MM_MODULES
+static inline int get_page_not_zero(struct page *page)
+{
+	page = compound_head(page);
+	return atomic_inc_not_zero(&page->_count);
+}
+
+static inline int get_head_page_multiple_not_zero(struct page *page, int nr)
+{
+	VM_BUG_ON(page != compound_head(page));
+	return atomic_add_unless(&page->_count, nr, 0);
+}
+#endif /* CONFIG_MM_MODULES */
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
  * register pressure.
  */
 static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
@@ -84,17 +98,25 @@
 		struct page *page;
 
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
 			pte_unmap(ptep);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+ #ifdef CONFIG_MM_MODULES
+ 		/* indicate failure if page ref count was already 0 */
+ 		if (!get_page_not_zero(page)) {
+ 			pte_unmap(ptep);
+ 			return 0;
+ 		}
+ #else /* !CONFIG_MM_MODULES */
 		get_page(page);
+ #endif /* CONFIG_MM_MODULES */
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
 
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(ptep - 1);
 
 	return 1;
@@ -210,19 +230,28 @@
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
+#ifdef CONFIG_MM_MODULES
+	if (!get_head_page_multiple_not_zero(head, refs)) {
+		/* revert nr pages, indicate failure (ref count was 0) */
+		(*nr) -= refs;
+		return 0;
+	}
+	return 1;
+#else /* !CONFIG_MM_MODULES */
 	get_head_page_multiple(head, refs);
 
 	return 1;
+#endif /* CONFIG_MM_MODULES */
 }
 
 static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			int write, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
diff -Naru8 a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
--- a/drivers/misc/sgi-gru/grufault.c	2011-05-19 05:06:34.000000000 +0100
+++ b/drivers/misc/sgi-gru/grufault.c	2011-06-08 12:13:07.362021584 +0100
@@ -187,16 +187,20 @@
  * 		  1 - (atomic only) try again in non-atomic context
  */
 static int non_atomic_pte_lookup(struct vm_area_struct *vma,
 				 unsigned long vaddr, int write,
 				 unsigned long *paddr, int *pageshift)
 {
 	struct page *page;
 
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops)
+		return -EFAULT;
+#endif /* CONFIG_MM_MODULES */
 #ifdef CONFIG_HUGETLB_PAGE
 	*pageshift = is_vm_hugetlb_page(vma) ? HPAGE_SHIFT : PAGE_SHIFT;
 #else
 	*pageshift = PAGE_SHIFT;
 #endif
 	if (get_user_pages
 	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
 		return -EFAULT;
@@ -246,16 +250,19 @@
 		return 1;
 
 	*paddr = pte_pfn(pte) << PAGE_SHIFT;
 #ifdef CONFIG_HUGETLB_PAGE
 	*pageshift = is_vm_hugetlb_page(vma) ? HPAGE_SHIFT : PAGE_SHIFT;
 #else
 	*pageshift = PAGE_SHIFT;
 #endif
+#ifdef CONFIG_MM_MODULES
+	*pageshift = (pmd_large(*pmdp)) ? HPAGE_SHIFT : PAGE_SHIFT;
+#endif /* CONFIG_MM_MODULES */
 	return 0;
 
 err:
 	return 1;
 }
 
 static int gru_vtop(struct gru_thread_state *gts, unsigned long vaddr,
 		    int write, int atomic, unsigned long *gpa, int *pageshift)
diff -Naru8 a/fs/binfmt_elf.c b/fs/binfmt_elf.c
--- a/fs/binfmt_elf.c	2011-05-19 05:06:34.000000000 +0100
+++ b/fs/binfmt_elf.c	2011-06-08 12:27:32.272000202 +0100
@@ -1909,16 +1909,44 @@
 	 */
 	segs = current->mm->map_count;
 	segs += elf_core_extra_phdrs();
 
 	gate_vma = get_gate_vma(current->mm);
 	if (gate_vma != NULL)
 		segs++;
 
+  #ifdef CONFIG_MM_MODULES
+  	/* Find out how many extra regions (if any) there are in vmas: */
+  	for (vma = first_vma(current, gate_vma); vma != NULL;
+  		vma = next_vma(vma, gate_vma)) {
+  		if (vma->mm_module_ops) {
+  			int nr_regions = 0;
+  			unsigned long range_start = vma->vm_start;
+  			unsigned long range_end;
+  			do {
+  				BUG_ON(!vma->mm_module_ops->probe_mapped);
+  				if (vma->mm_module_ops->probe_mapped(vma,
+  							range_start,
+  							&range_end, NULL) &&
+  				    vma_dump_size(vma, cprm->mm_flags) > 0)
+  					nr_regions++;
+  			} while (range_start = range_end, range_start <
+  					vma->vm_end);
+  			/*
+  			 * Adjust segment count according to # of regions in
+  			 * vma. Note: this should decrement segment count for
+  			 * vmas with no mapped regions.
+  			 */
+  			segs += nr_regions - 1;
+  		}
+  	}
+  #endif /* CONFIG_MM_MODULES */
+
+
 	/* for notes section */
 	segs++;
 
 	/* If segs > PN_XNUM(0xffff), then e_phnum overflows. To avoid
 	 * this, kernel supports extended numbering. Have a look at
 	 * include/linux/elf.h for further information. */
 	e_phnum = segs > PN_XNUM ? PN_XNUM : segs;
 
@@ -1976,17 +2004,52 @@
 	if (size > cprm->limit
 	    || !dump_write(cprm->file, phdr4note, sizeof(*phdr4note)))
 		goto end_coredump;
 
 	/* Write program headers for segments dump */
 	for (vma = first_vma(current, gate_vma); vma != NULL;
 			vma = next_vma(vma, gate_vma)) {
 		struct elf_phdr phdr;
-
+#ifdef CONFIG_MM_MODULES
+		unsigned long range_start = vma->vm_start;
+		unsigned long range_end = vma->vm_end;
+		unsigned long range_vm_flags = vma->vm_flags;
+		do {
+			if (vma->mm_module_ops) {
+				BUG_ON(!vma->mm_module_ops->probe_mapped);
+				if (!vma->mm_module_ops->
+				    probe_mapped(vma, range_start, &range_end,
+						 &range_vm_flags) ||
+				    vma_dump_size(vma, cprm->mm_flags) == 0)
+					continue;
+			}
+
+			phdr.p_type = PT_LOAD;
+			phdr.p_offset = offset;
+			phdr.p_vaddr = range_start;
+			phdr.p_paddr = 0;
+			phdr.p_filesz = vma->mm_module_ops ?
+				(range_end - range_start) :
+				vma_dump_size(vma, cprm->mm_flags);
+			phdr.p_memsz = range_end - range_start;
+			offset += phdr.p_filesz;
+			phdr.p_flags = range_vm_flags & VM_READ ? PF_R : 0;
+			if (range_vm_flags & VM_WRITE)
+				phdr.p_flags |= PF_W;
+			if (range_vm_flags & VM_EXEC)
+				phdr.p_flags |= PF_X;
+			phdr.p_align = ELF_EXEC_PAGESIZE;
+
+			size += sizeof(phdr);
+			if (size > cprm->limit
+		    	|| !dump_write(cprm->file, &phdr, sizeof(phdr)))
+				goto end_coredump;
+		} while (range_start = range_end, range_start < vma->vm_end);
+#else /* CONFIG_MM_MODULES */
 		phdr.p_type = PT_LOAD;
 		phdr.p_offset = offset;
 		phdr.p_vaddr = vma->vm_start;
 		phdr.p_paddr = 0;
 		phdr.p_filesz = vma_dump_size(vma, cprm->mm_flags);
 		phdr.p_memsz = vma->vm_end - vma->vm_start;
 		offset += phdr.p_filesz;
 		phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
@@ -1995,16 +2058,17 @@
 		if (vma->vm_flags & VM_EXEC)
 			phdr.p_flags |= PF_X;
 		phdr.p_align = ELF_EXEC_PAGESIZE;
 
 		size += sizeof(phdr);
 		if (size > cprm->limit
 		    || !dump_write(cprm->file, &phdr, sizeof(phdr)))
 			goto end_coredump;
+#endif /* CONFIG_MM_MODULES */
 	}
 
 	if (!elf_core_write_extra_phdrs(cprm->file, offset, &size, cprm->limit))
 		goto end_coredump;
 
  	/* write out the notes section */
 	if (!write_note_info(&info, cprm->file, &foffset))
 		goto end_coredump;
@@ -2016,16 +2080,53 @@
 	if (!dump_seek(cprm->file, dataoff - foffset))
 		goto end_coredump;
 
 	for (vma = first_vma(current, gate_vma); vma != NULL;
 			vma = next_vma(vma, gate_vma)) {
 		unsigned long addr;
 		unsigned long end;
 
+#ifdef CONFIG_MM_MODULES
+		unsigned long range_start = vma->vm_start;
+		unsigned long range_end;
+		end = (vma->mm_module_ops ? vma->vm_end :
+		       vma->vm_start + vma_dump_size(vma, cprm->mm_flags));
+		range_end = end;
+
+		do {
+			if (vma->mm_module_ops) {
+				BUG_ON(!vma->mm_module_ops->probe_mapped);
+				if (!vma->mm_module_ops->
+				    probe_mapped(vma, range_start, &range_end,
+						 NULL) || 
+				    vma_dump_size(vma, cprm->mm_flags) == 0)
+					continue;
+			}
+
+			for (addr = range_start; addr < range_end;
+					addr += PAGE_SIZE) {
+				struct page *page;
+				int stop;
+
+				page = get_dump_page(addr);
+				if (page) {
+					void *kaddr = kmap(page);
+					stop = ((size += PAGE_SIZE) > cprm->limit) ||
+						!dump_write(cprm->file, kaddr,
+								PAGE_SIZE);
+					kunmap(page);
+					page_cache_release(page);
+				} else
+					stop = !dump_seek(cprm->file, PAGE_SIZE);
+				if (stop)
+					goto end_coredump;
+			}
+		} while (range_start = range_end, range_start < end);
+#else /* CONFIG_MM_MODULES */
 		end = vma->vm_start + vma_dump_size(vma, cprm->mm_flags);
 
 		for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
 			struct page *page;
 			int stop;
 
 			page = get_dump_page(addr);
 			if (page) {
@@ -2035,16 +2136,17 @@
 						    PAGE_SIZE);
 				kunmap(page);
 				page_cache_release(page);
 			} else
 				stop = !dump_seek(cprm->file, PAGE_SIZE);
 			if (stop)
 				goto end_coredump;
 		}
+#endif /* CONFIG_MM_MODULES */
 	}
 
 	if (!elf_core_write_extra_data(cprm->file, &size, cprm->limit))
 		goto end_coredump;
 
 	if (e_phnum == PN_XNUM) {
 		size += sizeof(*shdr4extnum);
 		if (size > cprm->limit
diff -Naru8 a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c	2011-05-19 05:06:34.000000000 +0100
+++ b/fs/proc/task_mmu.c	2011-06-08 12:26:34.392000212 +0100
@@ -228,25 +228,38 @@
 	/* We don't show the stack guard page in /proc/maps */
 	start = vma->vm_start;
 	if (stack_guard_page_start(vma, start))
 		start += PAGE_SIZE;
 	end = vma->vm_end;
 	if (stack_guard_page_end(vma, end))
 		end -= PAGE_SIZE;
 
+ #ifdef CONFIG_MM_MODULES
+ 	seq_printf(m, "%08lx-%08lx %c%c%c%c%c %08llx %02x:%02x %lu %n",
+ 			start,
+ 			end,
+ 			flags & VM_READ ? 'r' : '-',
+ 			flags & VM_WRITE ? 'w' : '-',
+ 			flags & VM_EXEC ? 'x' : '-',
+ 			flags & VM_MAYSHARE ? 's' : 'p',
+ 			vma->mm_module_ops ? 'm' : '\0',
+ 			pgoff,
+ 			MAJOR(dev), MINOR(dev), ino, &len);
+ #else /* CONFIG_MM_MODULES */
 	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
 			start,
 			end,
 			flags & VM_READ ? 'r' : '-',
 			flags & VM_WRITE ? 'w' : '-',
 			flags & VM_EXEC ? 'x' : '-',
 			flags & VM_MAYSHARE ? 's' : 'p',
 			pgoff,
 			MAJOR(dev), MINOR(dev), ino, &len);
+ #endif /* CONFIG_MM_MODULES */
 
 	/*
 	 * Print the dentry name for named mappings, and a
 	 * special [heap] marker for the heap:
 	 */
 	if (file) {
 		pad_len_spaces(m, len);
 		seq_path(m, &file->f_path, "\n");
@@ -430,16 +443,19 @@
 		.pmd_entry = smaps_pte_range,
 		.mm = vma->vm_mm,
 		.private = &mss,
 	};
 
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
 	/* mmap_sem is held in m_start */
+#ifdef CONFIG_MM_MODULES
+	if (!vma->mm_module_ops)
+#endif /* CONFIG_MM_MODULES */
 	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
 		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
 
 	show_map_vma(m, vma);
 
 	seq_printf(m,
 		   "Size:           %8lu kB\n"
 		   "Rss:            %8lu kB\n"
@@ -554,16 +570,19 @@
 	if (mm) {
 		struct mm_walk clear_refs_walk = {
 			.pmd_entry = clear_refs_pte_range,
 			.mm = mm,
 		};
 		down_read(&mm->mmap_sem);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			clear_refs_walk.private = vma;
+#ifdef CONFIG_MM_MODULES
+			if (!vma->mm_module_ops)
+#endif /* CONFIG_MM_MODULES */
 			if (is_vm_hugetlb_page(vma))
 				continue;
 			/*
 			 * Writing 1 to /proc/pid/clear_refs affects all pages.
 			 *
 			 * Writing 2 to /proc/pid/clear_refs only affects
 			 * Anonymous pages.
 			 *
@@ -672,16 +691,19 @@
 		/* check to see if we've left 'vma' behind
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end))
 			vma = find_vma(walk->mm, addr);
 
 		/* check that 'vma' actually covers this address,
 		 * and that it isn't a huge page vma */
 		if (vma && (vma->vm_start <= addr) &&
+#ifdef CONFIG_MM_MODULES
+		    !vma->mm_module_ops &&
+#endif /* CONFIG_MM_MODULES */
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
 			pfn = pte_to_pagemap_entry(*pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
 		err = add_to_pagemap(addr, pfn, pm);
 		if (err)
diff -Naru8 a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h	2011-05-19 05:06:34.000000000 +0100
+++ b/include/linux/mm.h	2011-06-08 12:22:36.013000182 +0100
@@ -237,26 +237,95 @@
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
 };
 
 struct mmu_gather;
 struct inode;
 
+  #ifdef CONFIG_PMEM_MODULES
+  struct pmem_module_operations_struct {
+  	int (*put_page)(struct page *page);
+  	int (*get_page)(struct page *page);
+  	int (*sparse_mem_map_populate)(unsigned long pnum, int nid);
+  };
+  #endif /* CONFIG_PMEM_MODULES */
+  
+  #ifdef CONFIG_MM_MODULES
+  struct zap_details;
+  
+  struct mm_module_operations_struct {
+  	int (*handle_mm_fault)(struct mm_struct *mm,
+  			struct vm_area_struct *vma, unsigned long addr,
+  			unsigned int flags);
+  	int (*change_protection)(struct vm_area_struct *vma, unsigned long start,
+  			unsigned long end, unsigned long newflags);
+  	int (*copy_page_range)(struct mm_struct *dst_mm,
+  			struct mm_struct *src_mm, struct vm_area_struct *vma);
+  	int (*follow_page)(struct mm_struct *mm, struct vm_area_struct *vma,
+  			struct page **pages, struct vm_area_struct **vmas,
+  			unsigned long *position, int *length,
+  			int i, int write);
+  	int (*probe_mapped)(struct vm_area_struct *vma, unsigned long start,
+  			unsigned long *end_range, unsigned long *range_vm_flags);
+  	unsigned long (*unmap_page_range)(struct mmu_gather **tlbp,
+  			struct vm_area_struct *vma, unsigned long addr,
+  			unsigned long end, long *zap_work,
+  			struct zap_details *details); 
+  	void (*free_pgd_range)(struct mmu_gather *tlb, unsigned long addr,
+  			unsigned long end, unsigned long floor,
+  			unsigned long ceiling);
+  	int (*init_module_vma)(struct vm_area_struct *vma,
+  			struct vm_area_struct *old_vma);
+  	void (*exit_module_vma)(struct vm_area_struct *vma);
+  	int (*init_module_mm)(struct mm_struct *mm,
+  			struct mm_module_struct *mm_mod);
+  	int (*exit_module_mm)(struct mm_struct *mm,
+  			struct mm_module_struct *mm_mod);
+  };
+  #endif /* CONFIG_MM_MODULES */
+  
+
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
 /*
  * FIXME: take this include out, include page-flags.h in
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
 #include <linux/huge_mm.h>
 
+#ifdef CONFIG_PMEM_MODULES
+static inline void pmem_modules_get_page(struct page *page)
+{
+	struct pmem_module_struct *module = pmem_modules;
+	for (module = pmem_modules; module; module = module->next) {
+		VM_BUG_ON(!module->pmem_module_ops->get_page);
+		if (module->pmem_module_ops->get_page(page))
+			return;
+	}
+	/* One of the modules should have picked the page up */
+	BUG();
+}
+
+static inline void pmem_modules_put_page(struct page *page)
+{
+	struct pmem_module_struct *module = pmem_modules;
+	for (module = pmem_modules; module; module = module->next) {
+		VM_BUG_ON(!module->pmem_module_ops->put_page);
+		if (module->pmem_module_ops->put_page(page))
+			return;
+	}
+	/* One of the modules should have picked the page up */
+	BUG();
+}
+#endif /* CONFIG_PMEM_MODULES */
+
 /*
  * Methods to modify the page usage count.
  *
  * What counts for a page usage:
  * - cache mapping   (page->mapping)
  * - private data    (page->private)
  * - page mapped in a task's page tables, each mapping
  *   is counted separately
@@ -356,16 +425,22 @@
 
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
 static inline void get_page(struct page *page)
 {
+ #ifdef CONFIG_PMEM_MODULES
+ 	if (unlikely(PagePmemModule(page))) {
+ 		pmem_modules_get_page(page);
+ 		return;
+ 	}
+ #endif /* CONFIG_PMEM_MODULES */
 	/*
 	 * Getting a normal page or the head of a compound page
 	 * requires to already have an elevated page->_count. Only if
 	 * we're getting a tail page, the elevated page->_count is
 	 * required only in the head page, so for tail pages the
 	 * bugcheck only verifies that the page->_count isn't
 	 * negative.
 	 */
@@ -842,18 +917,27 @@
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
+#ifdef CONFIG_MM_MODULES
+#define VM_FAULT_SEGV_ACCERR	0x0040	/* SIGSEGV, si_code SEGV_ACCERR */
+#define VM_FAULT_SEGV_MAPERR	0x0080	/* SIGSEGV, si_code SEGV_MAPERR */
+#define VM_FAULT_SIGSEGV	(VM_FAULT_SEGV_ACCERR | VM_FAULT_SEGV_MAPERR)
+#define VM_FAULT_ERROR \
+	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
+	 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEGV)
+#else /* !CONFIG_MM_MODULES */
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
 			 VM_FAULT_HWPOISON_LARGE)
+#endif /* CONFIG_MM_MODULES */
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
 #define VM_FAULT_GET_HINDEX(x) (((x) >> 12) & 0xf)
 
 /*
  * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
  */
diff -Naru8 a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h	2011-05-19 05:06:34.000000000 +0100
+++ b/include/linux/mm_types.h	2011-06-08 12:24:07.916000179 +0100
@@ -178,16 +178,20 @@
 	unsigned long vm_truncate_count;/* truncate_count or restart_addr */
 
 #ifndef CONFIG_MMU
 	struct vm_region *vm_region;	/* NOMMU mapping region */
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+#ifdef CONFIG_MM_MODULES
+	struct mm_module_operations_struct * mm_module_ops;
+	void * mm_module_vma_state;
+#endif /* CONFIG_MM_MODULES */
 };
 
 struct core_thread {
 	struct task_struct *task;
 	struct core_thread *next;
 };
 
 struct core_state {
@@ -214,16 +218,33 @@
 	int count[NR_MM_COUNTERS];
 };
 #else  /* !USE_SPLIT_PTLOCKS */
 struct mm_rss_stat {
 	unsigned long count[NR_MM_COUNTERS];
 };
 #endif /* !USE_SPLIT_PTLOCKS */
 
+#ifdef CONFIG_PMEM_MODULES
+struct pmem_module_struct {
+	struct pmem_module_operations_struct * pmem_module_ops;
+	struct pmem_module_struct *next;
+};
+
+extern struct pmem_module_struct *pmem_modules;
+#endif /* CONFIG_PMEM_MODULES */
+
+#ifdef CONFIG_MM_MODULES
+struct mm_module_struct {
+	struct mm_module_operations_struct * mm_module_ops;
+	void * mm_module_mm_state;
+	struct mm_module_struct *next;
+};
+#endif /* CONFIG_MM_MODULES */
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
 	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -309,16 +330,19 @@
 #ifdef CONFIG_PROC_FS
 	/* store ref to file /proc/<pid>/exe symlink points to */
 	struct file *exe_file;
 	unsigned long num_exe_file_vmas;
 #endif
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MM_MODULES
+	struct mm_module_struct *mm_modules;
+#endif /* CONFIG_MM_MODULES */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 #define mm_cpumask(mm) (&(mm)->cpu_vm_mask)
 
diff -Naru8 a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h	2011-05-19 05:06:34.000000000 +0100
+++ b/include/linux/page-flags.h	2011-06-08 12:23:25.717000251 +0100
@@ -99,16 +99,19 @@
 	PG_mlocked,		/* Page is vma mlocked */
 #endif
 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+#ifdef CONFIG_PMEM_MODULES
+	PG_pmem_module,
+#endif /* CONFIG_PMEM_MODULES */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
 	PG_checked = PG_owner_priv_1,
 
@@ -275,16 +278,22 @@
 #define __PG_HWPOISON (1UL << PG_hwpoison)
 #else
 PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
 u64 stable_page_flags(struct page *page);
 
+#ifdef CONFIG_PMEM_MODULES
+PAGEFLAG(PmemModule, pmem_module)
+#else /* !CONFIG_PMEM_MODULES */
+PAGEFLAG_FALSE(PmemModule)
+#endif /* CONFIG_PMEM_MODULES */
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);
 
 	/*
 	 * Must ensure that the data we read out of the page is loaded
 	 * _after_ we've loaded page->flags to check for PageUptodate.
 	 * We can skip the barrier if the page is not uptodate, because
diff -Naru8 a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c	2011-05-19 05:06:34.000000000 +0100
+++ b/kernel/fork.c	2011-06-08 12:18:27.614000144 +0100
@@ -389,16 +389,23 @@
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(mapping);
 			spin_unlock(&mapping->i_mmap_lock);
 		}
 
+#ifdef CONFIG_MM_MODULES
+		if (mpnt->mm_module_ops) {
+			BUG_ON(!mpnt->mm_module_ops->init_module_vma);
+			mpnt->mm_module_ops->init_module_vma(tmp, mpnt);
+		}
+#endif /* CONFIG_MM_MODULES */
+
 		/*
 		 * Clear hugetlb-related page reserves for children. This only
 		 * affects MAP_PRIVATE mappings. Faults generated by the child
 		 * are not guaranteed to succeed, even if read-only
 		 */
 		if (is_vm_hugetlb_page(tmp))
 			reset_vma_resv_huge_pages(tmp);
 
@@ -481,16 +488,51 @@
 static void mm_init_aio(struct mm_struct *mm)
 {
 #ifdef CONFIG_AIO
 	spin_lock_init(&mm->ioctx_lock);
 	INIT_HLIST_HEAD(&mm->ioctx_list);
 #endif
 }
 
+#ifdef CONFIG_MM_MODULES
+void mm_modules_init(struct mm_struct * mm)
+{
+	struct mm_module_struct *old_mm_modules;
+	struct mm_module_struct *module;
+
+	/* Extract old mm's modules list, initialize new mm's list: */
+	old_mm_modules = mm->mm_modules;
+	mm->mm_modules = NULL;
+
+	/* Iterate on modules. Allow each to initialize new mm based on old: */
+	for (module = old_mm_modules; module; module = module->next) {
+		BUG_ON(!module->mm_module_ops);
+		BUG_ON(!module->mm_module_ops->init_module_mm);
+		module->mm_module_ops->init_module_mm(mm, module);
+	}
+}
+
+void mm_modules_exit(struct mm_struct * mm)
+{
+	struct mm_module_struct *module;
+
+	/*
+	 * Each module removes its own mm_module_struct from the list and
+	 * frees it, so keep calling the head module's exit_module_mm
+	 * callback until none are left.
+	 */
+	while ((module = mm->mm_modules)) {
+		BUG_ON(!module->mm_module_ops);
+		BUG_ON(!module->mm_module_ops->exit_module_mm);
+		module->mm_module_ops->exit_module_mm(mm, module);
+	}
+}
+#endif /* CONFIG_MM_MODULES */
+
 static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 {
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ?
 		(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
@@ -499,16 +541,20 @@
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	atomic_set(&mm->oom_disable_count, 0);
 
+#ifdef CONFIG_MM_MODULES
+	mm_modules_init(mm);
+#endif /* CONFIG_MM_MODULES */
+
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
 		mmu_notifier_mm_init(mm);
 		return mm;
 	}
 
 	free_mm(mm);
 	return NULL;
@@ -553,16 +599,19 @@
 void mmput(struct mm_struct *mm)
 {
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
+#ifdef CONFIG_MM_MODULES
+		mm_modules_exit(mm);
+#endif /* CONFIG_MM_MODULES */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
 			spin_unlock(&mmlist_lock);
 		}
 		put_swap_token(mm);
diff -Naru8 a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/Kconfig	2011-06-08 12:13:07.653998033 +0100
@@ -126,16 +126,30 @@
 	help
 	 SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
 	 pfn_to_page and page_to_pfn operations.  This is the most
 	 efficient option when sufficient kernel resources are available.
 
 config HAVE_MEMBLOCK
 	boolean
 
+config MM_MODULES
+	bool "Memory Management Modules"
+	depends on MMU_NOTIFIER
+	default y
+	help
+	 Provides support for dynamically loadable memory management modules
+
+config PMEM_MODULES
+	bool "Physical Memory Management Modules"
+	default y
+	help
+	 Provides support for dynamically loadable physical memory
+	 management modules
+
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
 	depends on SPARSEMEM || X86_64_ACPI_NUMA
 	depends on HOTPLUG && ARCH_ENABLE_MEMORY_HOTPLUG
 	depends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390)
 
 config MEMORY_HOTPLUG_SPARSE
diff -Naru8 a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/Makefile	2011-06-08 12:16:13.671000323 +0100
@@ -44,8 +44,9 @@
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_MM_MODULES) += mm_module_exports.o
diff -Naru8 a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/memory.c	2011-06-08 12:13:07.656998231 +0100
@@ -62,16 +62,22 @@
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/pgtable.h>
 
 #include "internal.h"
 
+#ifdef CONFIG_PMEM_MODULES
+struct pmem_module_struct *pmem_modules = NULL;
+EXPORT_SYMBOL_GPL(pmem_modules);
+EXPORT_SYMBOL_GPL(ptep_clear_flush);
+#endif /* CONFIG_PMEM_MODULES */
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
 struct page *mem_map;
 
 EXPORT_SYMBOL(max_mapnr);
 EXPORT_SYMBOL(mem_map);
 #endif
@@ -368,24 +373,35 @@
 
 		/*
 		 * Hide vma from rmap and truncate_pagecache before freeing
 		 * pgtables
 		 */
 		unlink_anon_vmas(vma);
 		unlink_file_vma(vma);
 
+#ifdef CONFIG_MM_MODULES
+		if (vma->mm_module_ops) {
+			BUG_ON(!vma->mm_module_ops->free_pgd_range);
+			vma->mm_module_ops->free_pgd_range(tlb, addr,
+					vma->vm_end, floor,
+					next? next->vm_start: ceiling);
+		} else
+#endif /* CONFIG_MM_MODULES */
 		if (is_vm_hugetlb_page(vma)) {
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
+#ifdef CONFIG_MM_MODULES
+				&& !next->mm_module_ops
+#endif /* CONFIG_MM_MODULES */
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
 				next = vma->vm_next;
 				unlink_anon_vmas(vma);
 				unlink_file_vma(vma);
 			}
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
@@ -854,16 +870,23 @@
 	int ret;
 
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.
 	 * Fork becomes much lighter when there are big shared or private
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops) {
+		BUG_ON(!vma->mm_module_ops->copy_page_range);
+		return vma->mm_module_ops->copy_page_range(dst_mm, src_mm, vma);
+	}
+#endif /* CONFIG_MM_MODULES */
+
 	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
 		if (!vma->anon_vma)
 			return 0;
 	}
 
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
@@ -1152,16 +1175,24 @@
 			untrack_pfn_vma(vma, 0, 0);
 
 		while (start != end) {
 			if (!tlb_start_valid) {
 				tlb_start = start;
 				tlb_start_valid = 1;
 			}
 
+#ifdef CONFIG_MM_MODULES
+			if (unlikely(vma->mm_module_ops)) {
+				BUG_ON(!vma->mm_module_ops->unmap_page_range);
+				start = vma->mm_module_ops->unmap_page_range(
+							tlbp, vma, start, end,
+							&zap_work, details);
+			} else
+#endif /* CONFIG_MM_MODULES */
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
 				 * It is undesirable to test vma->vm_file as it
 				 * should be non-null for valid hugetlb area.
 				 * However, vm_file will be NULL in the error
 				 * cleanup path of do_mmap_pgoff. When
 				 * hugetlbfs ->mmap method fails,
 				 * do_mmap_pgoff() nullifies vma->vm_file
@@ -1539,16 +1570,25 @@
 			goto next_page;
 		}
 
 		if (!vma ||
 		    (vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
 		    !(vm_flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
+#ifdef CONFIG_MM_MODULES
+		if (vma->mm_module_ops) {
+			BUG_ON(!vma->mm_module_ops->follow_page);
+			i = vma->mm_module_ops->follow_page(mm, vma, pages,
+					vmas, &start, &nr_pages, i, gup_flags);
+			continue;
+		}
+#endif /* CONFIG_MM_MODULES */
+
 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
 					&start, &nr_pages, i, gup_flags);
 			continue;
 		}
 
 		do {
 			struct page *page;
@@ -3356,16 +3396,24 @@
 
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
 
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
 
+#ifdef CONFIG_MM_MODULES
+	if (unlikely(vma->mm_module_ops)) {
+		BUG_ON(!vma->mm_module_ops->handle_mm_fault);
+		return vma->mm_module_ops->handle_mm_fault(mm, vma,
+				address, flags);
+	}
+#endif /* CONFIG_MM_MODULES */
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
 		return VM_FAULT_OOM;
 	pmd = pmd_alloc(mm, pud, address);
diff -Naru8 a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/mempolicy.c	2011-06-08 12:13:07.656998231 +0100
@@ -584,16 +584,19 @@
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
 		if (!is_vm_hugetlb_page(vma) &&
+#ifdef CONFIG_MM_MODULES
+		    !vma->mm_module_ops &&
+#endif /* CONFIG_MM_MODULES */
 		    ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
 				vma_migratable(vma)))) {
 			unsigned long endvma = vma->vm_end;
 
 			if (endvma > end)
 				endvma = end;
 			if (vma->vm_start > start)
@@ -2637,16 +2640,21 @@
 		seq_path(m, &file->f_path, "\n\t= ");
 	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
 		seq_printf(m, " heap");
 	} else if (vma->vm_start <= mm->start_stack &&
 			vma->vm_end >= mm->start_stack) {
 		seq_printf(m, " stack");
 	}
 
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops) {
+		seq_printf(m, " mm_module");
+	} else
+#endif /* CONFIG_MM_MODULES */
 	if (is_vm_hugetlb_page(vma)) {
 		check_huge_range(vma, vma->vm_start, vma->vm_end, md);
 		seq_printf(m, " huge");
 	} else {
 		check_pgd_range(vma, vma->vm_start, vma->vm_end,
 			&node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
 	}
 
@@ -2664,16 +2672,19 @@
 
 	if (md->mapcount_max > 1)
 		seq_printf(m, " mapmax=%lu", md->mapcount_max);
 
 	if (md->swapcache)
 		seq_printf(m," swapcache=%lu", md->swapcache);
 
 	if (md->active < md->pages && !is_vm_hugetlb_page(vma))
+#ifdef CONFIG_MM_MODULES
+		if (!vma->mm_module_ops)
+#endif /* CONFIG_MM_MODULES */
 		seq_printf(m," active=%lu", md->active);
 
 	if (md->writeback)
 		seq_printf(m," writeback=%lu", md->writeback);
 
 	for_each_node_state(n, N_HIGH_MEMORY)
 		if (md->node[n])
 			seq_printf(m, " N%d=%lu", n, md->node[n]);
diff -Naru8 a/mm/mlock.c b/mm/mlock.c
--- a/mm/mlock.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/mlock.c	2011-06-08 12:17:31.056000118 +0100
@@ -216,16 +216,19 @@
 	/*
 	 * filter unlockable vmas
 	 */
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
 		goto no_mlock;
 
 	if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
 			is_vm_hugetlb_page(vma) ||
+#ifdef CONFIG_MM_MODULES
+			(vma->mm_module_ops) ||
+#endif /* CONFIG_MM_MODULES */
 			vma == get_gate_vma(current->mm))) {
 
 		__mlock_vma_pages_range(vma, start, end, NULL);
 
 		/* Hide errors from mmap() and other callers */
 		return 0;
 	}
 
diff -Naru8 a/mm/mm_module_exports.c b/mm/mm_module_exports.c
--- a/mm/mm_module_exports.c	1970-01-01 01:00:00.000000000 +0100
+++ b/mm/mm_module_exports.c	2011-06-08 12:13:07.657998293 +0100
@@ -0,0 +1,100 @@
+
+#include <linux/module.h>
+
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/elf.h>
+#include <asm/pgtable.h>
+
+/* arch/x86/mm/tlb.c */
+extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long va);
+EXPORT_SYMBOL_GPL(flush_tlb_page);
+extern void flush_tlb_mm(struct mm_struct *);
+EXPORT_SYMBOL_GPL(flush_tlb_mm);
+
+/* arch/x86/mm/pgtable.c */
+extern void ___pud_free_tlb(struct mmu_gather *, pud_t *);
+extern void ___pmd_free_tlb(struct mmu_gather *, pmd_t *);
+EXPORT_SYMBOL_GPL(___pud_free_tlb);
+EXPORT_SYMBOL_GPL(___pmd_free_tlb);
+
+/* XXX Not sure how the heck _this_ is supposed to get set when building a KO... */
+#ifndef __PAGETABLE_PUD_FOLDED
+extern int __pud_alloc(struct mm_struct *, pgd_t *, unsigned long);
+EXPORT_SYMBOL_GPL(__pud_alloc);
+#endif
+
+/* mm/swap.c */
+extern void lru_add_drain(void);
+EXPORT_SYMBOL_GPL(lru_add_drain);
+
+/* mm/memory.c */
+extern void pgd_clear_bad(pgd_t *);
+EXPORT_SYMBOL_GPL(pgd_clear_bad);
+extern void pud_clear_bad(pud_t *);
+EXPORT_SYMBOL_GPL(pud_clear_bad);
+struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t pte);
+EXPORT_SYMBOL_GPL(vm_normal_page);
+extern int __pmd_alloc(struct mm_struct *, pud_t *, unsigned long);
+EXPORT_SYMBOL_GPL(__pmd_alloc);
+extern void pmd_clear_bad(pmd_t *);
+EXPORT_SYMBOL_GPL(pmd_clear_bad);
+#if defined(SPLIT_RSS_COUNTING)
+extern unsigned long get_mm_counter(struct mm_struct *mm, int member);
+EXPORT_SYMBOL_GPL(get_mm_counter);
+#endif /* defined(SPLIT_RSS_COUNTING) */
+
+/* mm/swap_state.c */
+extern void free_pages_and_swap_cache(struct page **, int);
+EXPORT_SYMBOL_GPL(free_pages_and_swap_cache);
+
+#ifdef CONFIG_MMU_NOTIFIER
+/* mm/mmu_notifier.c */
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *,
+		unsigned long, unsigned long);
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *,
+		unsigned long, unsigned long);
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
+#endif /* CONFIG_MMU_NOTIFIER */
+
+/* mm/rmap.c */
+extern int anon_vma_prepare(struct vm_area_struct *);
+EXPORT_SYMBOL_GPL(anon_vma_prepare);
+
+/* mm/bootmem.c */
+extern unsigned long max_pfn;
+EXPORT_SYMBOL_GPL(max_pfn);
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+/* arch/x86/mm/init_64.c */
+extern int vmemmap_populate(struct page *, unsigned long, int);
+EXPORT_SYMBOL_GPL(vmemmap_populate);
+#endif
+
+#if 1
+/* WTF? */
+/* arch/x86/mm/init.c */
+extern struct mmu_gather mmu_gathers;
+EXPORT_SYMBOL_GPL(mmu_gathers);
+#endif
+
+/* kernel/fork.c */
+extern void __put_task_struct(struct task_struct *);
+EXPORT_SYMBOL_GPL(__put_task_struct);
+extern rwlock_t tasklist_lock;
+EXPORT_SYMBOL_GPL(tasklist_lock);
+
+/* kernel/pid.c */
+enum pid_type;
+extern struct task_struct *get_pid_task(struct pid *, enum pid_type);
+EXPORT_SYMBOL_GPL(get_pid_task);
+extern struct task_struct *find_task_by_vpid(pid_t vnr);
+EXPORT_SYMBOL_GPL(find_task_by_vpid);
+
+/* mm/init-mm.c */
+extern struct mm_struct init_mm;
+EXPORT_SYMBOL_GPL(init_mm); /* will be removed in 2.6.26 */
+
+/* mm/mmap.c */
+EXPORT_SYMBOL_GPL(split_vma);
diff -Naru8 a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/mmap.c	2011-06-08 12:13:07.657998293 +0100
@@ -234,16 +234,22 @@
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file) {
 		fput(vma->vm_file);
 		if (vma->vm_flags & VM_EXECUTABLE)
 			removed_exe_file_vma(vma->vm_mm);
 	}
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops) {
+		BUG_ON(!vma->mm_module_ops->exit_module_vma);
+		vma->mm_module_ops->exit_module_vma(vma);
+	}
+#endif /* CONFIG_MM_MODULES */
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
 }
 
 SYSCALL_DEFINE1(brk, unsigned long, brk)
 {
 	unsigned long rlim, retval;
@@ -690,16 +696,20 @@
 {
 	/* VM_CAN_NONLINEAR may get set later by f_op->mmap() */
 	if ((vma->vm_flags ^ vm_flags) & ~VM_CAN_NONLINEAR)
 		return 0;
 	if (vma->vm_file != file)
 		return 0;
 	if (vma->vm_ops && vma->vm_ops->close)
 		return 0;
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops)
+		return 0;
+#endif /* CONFIG_MM_MODULES */
 	return 1;
 }
 
 static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
 					struct anon_vma *anon_vma2)
 {
 	return !anon_vma1 || !anon_vma2 || (anon_vma1 == anon_vma2);
 }
@@ -1970,16 +1980,21 @@
  */
 static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	      unsigned long addr, int new_below)
 {
 	struct mempolicy *pol;
 	struct vm_area_struct *new;
 	int err = -ENOMEM;
 
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops)
+		return -EINVAL;
+#endif /* CONFIG_MM_MODULES */
+
 	if (is_vm_hugetlb_page(vma) && (addr &
 					~(huge_page_mask(hstate_vma(vma)))))
 		return -EINVAL;
 
 	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 	if (!new)
 		goto out_err;
 
diff -Naru8 a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/mprotect.c	2011-06-08 12:13:07.658998352 +0100
@@ -194,16 +194,31 @@
 	}
 
 	if (end != vma->vm_end) {
 		error = split_vma(mm, vma, end, 0);
 		if (error)
 			goto fail;
 	}
 
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops) {
+		BUG_ON(!vma->mm_module_ops->change_protection);
+		error = vma->mm_module_ops->change_protection(vma, start,
+				end, newflags);
+		if (error)
+			goto fail;
+		/* mm_module_ops->change_protection is responsible for
+		 * vma flags, vm_page_prot, all pte settings, mmu_notifier,
+		 * and vm_stat accounting. When it comes back, we're done.
+		 */
+		return 0;
+	}
+#endif /* CONFIG_MM_MODULES */
+
 success:
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
 	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
 					  vm_get_page_prot(newflags));
diff -Naru8 a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/mremap.c	2011-06-08 12:13:07.658998352 +0100
@@ -265,16 +265,21 @@
 	unsigned long old_len, unsigned long new_len, unsigned long *p)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = find_vma(mm, addr);
 
 	if (!vma || vma->vm_start > addr)
 		goto Efault;
 
+#ifdef CONFIG_MM_MODULES
+	if (vma->mm_module_ops)
+		goto Einval;
+#endif /* CONFIG_MM_MODULES */
+
 	if (is_vm_hugetlb_page(vma))
 		goto Einval;
 
 	/* We can't remap across vm area boundaries */
 	if (old_len > vma->vm_end - addr)
 		goto Efault;
 
 	/* Need to be careful about a growing mapping */
diff -Naru8 a/mm/sparse.c b/mm/sparse.c
--- a/mm/sparse.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/sparse.c	2011-06-08 12:13:07.658998352 +0100
@@ -610,21 +610,42 @@
 	vmemmap_populate_print_last();
 
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
 	free_bootmem(__pa(usemap_map), size);
 }
 
+#ifdef CONFIG_PMEM_MODULES
+static inline int pmem_modules_sparse_mem_map_populate(unsigned long pnum,
+		int nid)
+{
+	struct pmem_module_struct *pmem_module = pmem_modules;
+	int success = true;
+	while (pmem_module) {
+		BUG_ON(!pmem_module->pmem_module_ops->sparse_mem_map_populate);
+		success &=
+			pmem_module->pmem_module_ops->sparse_mem_map_populate(
+					pnum, nid);
+		pmem_module = pmem_module->next;
+	}
+	return success;
+}
+#endif  /* CONFIG_PMEM_MODULES */
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 						 unsigned long nr_pages)
 {
+#ifdef CONFIG_PMEM_MODULES
+	if (!pmem_modules_sparse_mem_map_populate(pnum, nid))
+		return NULL;
+#endif /* CONFIG_PMEM_MODULES */
 	/* This will make the necessary allocations eventually. */
 	return sparse_mem_map_populate(pnum, nid);
 }
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
 	return; /* XXX: Not implemented yet */
 }
 static void free_map_bootmem(struct page *page, unsigned long nr_pages)
diff -Naru8 a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c	2011-05-19 05:06:34.000000000 +0100
+++ b/mm/swap.c	2011-06-08 12:15:32.696000202 +0100
@@ -148,16 +148,22 @@
 			__put_compound_page(page);
 		else
 			__put_single_page(page);
 	}
 }
 
 void put_page(struct page *page)
 {
+#ifdef CONFIG_PMEM_MODULES
+	if (unlikely(PagePmemModule(page))) {
+		pmem_modules_put_page(page);
+		return;
+	}
+#endif /* CONFIG_PMEM_MODULES */
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
 		__put_single_page(page);
 }
 EXPORT_SYMBOL(put_page);
 
 /**
@@ -522,16 +528,26 @@
 	int i;
 	struct pagevec pages_to_free;
 	struct zone *zone = NULL;
 	unsigned long uninitialized_var(flags);
 
 	pagevec_init(&pages_to_free, cold);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
+#ifdef CONFIG_PMEM_MODULES
+		if (unlikely(PagePmemModule(page))) {
+			if (zone) {
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				zone = NULL;
+			}
+			pmem_modules_put_page(page);
+			continue;
+		}
+#endif /* CONFIG_PMEM_MODULES */
 
 		if (unlikely(PageCompound(page))) {
 			if (zone) {
 				spin_unlock_irqrestore(&zone->lru_lock, flags);
 				zone = NULL;
 			}
 			put_compound_page(page);
 			continue;