Refactor VMMap and introduce ChunkResource #932

Open
wks opened this issue Aug 31, 2023 · 5 comments

Labels
P-normal Priority: Normal.

Comments

@wks
Collaborator

wks commented Aug 31, 2023

Status quo

A Space allocates objects. A Space relies on a PageResource to allocate contiguous pages. Currently, a PageResource asks VMMap for "contiguous chunks" when needed.

VMMap has two implementations: Map32 and Map64. They are named after the word size (32 and 64), which I think is misleading: they actually implement two completely different strategies.

  • Map32 implements a "free-list-based discontiguous chunk allocator".
  • Map64 implements a "monotonic contiguous chunk allocator".

Map32: free-list-based, discontiguous

On 32-bit machines, the address space is precious. Only a few spaces (currently only the Gen nursery; previously the VM space, too) occupy contiguous address ranges. Other spaces share the remaining part of the heap, within which all free chunks are owned by a free list, Map32::region_map. All discontiguous spaces can ask for "contiguous chunks" from this central free list.

When allocating "contiguous chunks", Map32 proceeds as follows:

  1. It grabs chunks from region_map: IntArrayFreeList.
  2. Then it links the "contiguous chunks" into a per-space linked list. The head of the linked list is held in CommonPageResource::head_discontiguous_region on behalf of the space, and the links are held in prev_link: Vec<i32> and next_link: Vec<i32>, indexed by chunk index.
  3. Then it writes to descriptor_map: Vec<SpaceDescriptor>, which is indexed by chunk. For each chunk covered by the "contiguous chunks", it writes that space's SpaceDescriptor into descriptor_map, recording which space owns the chunk.

Map32 is designed this way so that multiple discontiguous spaces can grab chunks from a free list (region_map) that manages the shared address range, and Map32 keeps track of which space owns which chunk.
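
As a rough illustration of the bookkeeping described above, here is a hedged sketch in Rust. The field names follow the current code, but IntArrayFreeList and SpaceDescriptor are reduced to placeholder types, so this is not the actual definition.

    // Placeholder stand-ins for mmtk-core's IntArrayFreeList and SpaceDescriptor.
    struct IntArrayFreeList;
    #[derive(Clone, Copy)]
    struct SpaceDescriptor(usize);

    // Roughly the per-chunk bookkeeping Map32 keeps, as described above.
    struct Map32Sketch {
        region_map: IntArrayFreeList,          // free list of chunks in the shared range
        prev_link: Vec<i32>,                   // per-chunk "previous" link of a space's list
        next_link: Vec<i32>,                   // per-chunk "next" link of a space's list
        descriptor_map: Vec<SpaceDescriptor>,  // which space owns each chunk
    }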

Note that on 32-bit systems, both MonotonicPageResource and FreeListPageResource can use this discontiguous chunk allocation strategy. In SemiSpace, for example, both CopySpace instances are discontiguous, but use MonotonicPageResource. MonotonicPageResource, when created as "discontiguous", will gradually ask for "contiguous chunks" from Map32 (in grow_discontiguous_space), and then adjust its cursor and sentinel to the newly obtained "contiguous chunks". In this way, the MonotonicPageResource allocates pages monotonically, while the underlying "chunk resource" is discontiguous.

p.s. Contiguous MonotonicPageResource on 32-bit systems cannot ask Map32 for more "contiguous chunks". Once the cursor meets the sentinel, allocation fails. Currently only the Gen nursery uses contiguous MonotonicPageResource.

Map64: monotonic, contiguous

On 64-bit machines, the address space is so big that we don't bother letting two Space instances overlap in the address space. Each space occupies a 2TB region. Because the entire 2TB region belongs to that space, Map64 simply keeps a lower bound (base) and an upper bound ("high water") for each Space. When a Space (actually its PageResource) asks for "contiguous chunks", Map64 simply bumps the upper bound ("high water"). There is never any conflict with other spaces when allocating new chunks.
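
A minimal sketch of this strategy, assuming 4MB chunks; base and high_water are names taken from the description above, but the struct and method are illustrative, not the real Map64 API.

    const CHUNK_BYTES: usize = 4 << 20; // assumed 4MB chunks

    struct SpaceBounds {
        base: usize,       // lower bound of the space's 2TB slot
        high_water: usize, // upper bound of the chunks allocated so far
    }

    impl SpaceBounds {
        // Allocate `chunks` contiguous chunks by bumping the high water mark.
        fn allocate_contiguous_chunks(&mut self, chunks: usize) -> usize {
            let start = self.high_water;
            self.high_water += chunks * CHUNK_BYTES;
            start
        }
    }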

Problem

Obviously, the distinction between Map32 and Map64 conflates several different concerns:

  1. The word size, and
  2. the size of the address space, and
  3. how we divide the address space for each space, and
  4. how we allocate chunks for each space.

It is true that the large address space of 64-bit machines motivated the idea of giving each space a 2TB region. But it is not always desirable. If a VM uses compressed pointers, the address space will be much smaller.

On 32-bit machines, giving each space a contiguous address region and allocating chunks within it monotonically, as on 64-bit, may be impractical, but it is still possible in principle. Conversely, on 64-bit machines without compressed pointers, letting multiple spaces share one address range managed by a chunk-grained free list may be unnecessary, but it is still possible, and may even be useful. For example, if a program running on a large server has other components that allocate a lot of memory, we can no longer assume each space has an independent 2TB to spare.

It should be possible to decouple those concerns mentioned above.

Proposal

We can think top-down about what we need in order to allocate memory at larger-than-page granularity.

Structure

Whole heap: Let's assume for a moment that the whole heap in an MMTk instance occupies a contiguous range of the address space, and is chunk-aligned.

Dividing heap into "partitions": (Let's use the word "partition" for now. There may be a better word for this. The idea is like managing hard disk space, partitioning the disk into partitions.) We divide the heap into multiple contiguous address ranges (aligned to chunks), each of which is called a "partition". Some partitions are "dedicated". A dedicated partition is occupied by exactly one space. Other partitions are shared. A shared partition is shared by multiple Spaces.

Allocating "super-chunks" from partitions: (Let's use "super-chunk" for now. There may be a better word for it. It means "multiple contiguous chunks".) The space asks for multiple chunks at a time (currently via the allocate_contiguous_chunks method); let's call such a group a "super-chunk". A super-chunk consists of many chunks and is contiguous. A space (more precisely, the PageResource of a space) acquires one super-chunk at a time.

  • Allocating "super-chunks" from dedicated partitions: A dedicated partition serves only one space. It can allocate chunks using bump-pointer, and does not need to synchronise with other partitions.

  • Allocating "super-chunks" from shared partitions: A shared partition is, as its name suggests, shared by several spaces. It needs synchronisation. The most important part is the data structure that keeps track of which space is using which chunks, namely a FreeList, and it must be properly synchronised. The partition allocates super-chunks by taking a contiguous range of chunks out of the FreeList. Obviously, different super-chunks allocated from the FreeList may not be adjacent.

Implementation concerns of shared (discontiguous) partitions

PageResource allocates pages instead of chunks. Each PageResource keeps track of which pages within it are allocated. But if multiple PageResources share one partition, this becomes tricky.

FreeListPageResource keeps track of which pages within it are free or used, using a FreeList. If the partition is shared, the FreeListPageResource will be discontiguous. It will need to keep track of pages from the different super-chunks acquired from the partition (currently from Map32). A FreeList has no problem handling this. In theory, it is practicable (though not space-efficient, of course) to let each FreeListPageResource have an independent FreeList (i.e. the thing called a "parent freelist" in the current code) to keep track of all free pages in that FreeListPageResource. In practice, we let multiple FreeListPageResources share one "parent" FreeList which gives each space a "head" (i.e. the two-level structure of FreeList). Different heads link through different entries in the parent FreeList. This allows each space to have a "child FreeList" which shares the underlying array of entries with its parent.

This explains why VMMap is responsible for creating FreeLists even though not all page resources are FreeListPageResources. Map32 is the owner of the "parent" FreeList in the two-level FreeList structure, so it must be responsible for creating all the "child" FreeLists. Map64, although it never handles shared partitions, still implements the create_freelist and create_parent_freelist methods required by the VMMap trait. Actually, a FreeListPageResource needs to obtain its FreeList from a central entity only if it lives in a shared partition.

On the other hand, MonotonicPageResource doesn't use FreeList to track free chunks. It uses a cursor and a sentinel to record the range of free pages. A MonotonicPageResource may sit on top of a shared partition (discontiguous). There will be a "parent FreeList" of pages for this discontiguous partition (i.e. Map32::global_page_map), but the MonotonicPageResource will simply not use that FreeList. Instead, MonotonicPageResource uses a linked list to record super-chunks it obtained from the underlying partition.
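
A hedged sketch of that behaviour, with assumed names: the page resource bump-allocates pages between a cursor and a sentinel, and refills both whenever it obtains another super-chunk from the underlying partition.

    const PAGE_BYTES: usize = 4096;

    struct MonotonicSketch {
        cursor: usize,
        sentinel: usize,
    }

    impl MonotonicSketch {
        // Try to bump-allocate `pages` pages within the current super-chunk.
        fn alloc_pages(&mut self, pages: usize) -> Option<usize> {
            let bytes = pages * PAGE_BYTES;
            if self.cursor + bytes <= self.sentinel {
                let start = self.cursor;
                self.cursor += bytes;
                Some(start)
            } else {
                None // the caller must acquire another super-chunk first
            }
        }

        // Point the cursor and sentinel at a freshly acquired super-chunk.
        fn refill(&mut self, chunk_start: usize, chunk_extent: usize) {
            self.cursor = chunk_start;
            self.sentinel = chunk_start + chunk_extent;
        }
    }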

Then how should we refactor VMMap, Map32, Map64 and page resources?

First of all, each MMTk instance should have a HeapMeta object. Yes. It is the util::heap::heap_meta::HeapMeta we currently have. It will be the ultimate authority of heap layout.

A HeapMeta shall have multiple partitions. Partitions don't overlap. Partitions are aligned to chunk boundaries (or coarser, for example to 2TB). Each partition can be "dedicated" (to one Space) or "shared" (by multiple Spaces). But even though we have the concept of partitions, we may or may not have a Rust structure to represent it (such as struct Partition { start: Address, extent: usize }).

And then, there is one ChunkResource per partition. Each ChunkResource can be DedicatedChunkResource or SharedChunkResource. (Alternatively we may call them "ContiguousChunkResource" or "DiscontiguousChunkResource".)

  • A DedicatedChunkResource is bound to a "dedicated" partition, and can be used by only one Space. It manages chunks like the current Map64. It bumps the upper bound (i.e. "high water") to get more chunks. It may optionally be initialised with a RawMemoryFreeList.
  • A SharedChunkResource is bound to one "shared" partition, and can be used by multiple Spaces. It manages chunks like the current Map32. It has a "parent" FreeList, and allows its "clients" to create "child" FreeLists.
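
A hedged sketch of the split, assuming 4MB chunks and usize in place of Address; the trait and type names come from this proposal, but the method signature and fields are illustrative, not an agreed API.

    const CHUNK_BYTES: usize = 4 << 20; // assumed 4MB chunks

    trait ChunkResource {
        // Allocate one super-chunk of `chunks` contiguous chunks, or None if exhausted.
        fn allocate_contiguous_chunks(&mut self, chunks: usize) -> Option<usize>;
    }

    // Bound to a dedicated partition; bumps a high-water mark like today's Map64.
    struct DedicatedChunkResource {
        high_water: usize,
        limit: usize,
    }

    impl ChunkResource for DedicatedChunkResource {
        fn allocate_contiguous_chunks(&mut self, chunks: usize) -> Option<usize> {
            let bytes = chunks * CHUNK_BYTES;
            if self.high_water + bytes <= self.limit {
                let start = self.high_water;
                self.high_water += bytes;
                Some(start)
            } else {
                None
            }
        }
    }

    // Bound to a shared partition; would hand out chunk runs from a shared
    // free list like today's Map32 (the free-list details are elided here).
    struct SharedChunkResource;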

Finally, each PageResource is associated with zero or one ChunkResource. ExternalPageResource doesn't need ChunkResource because the memory is allocated by the VM. FreeListPageResource and MonotonicPageResource can be freely matched against DedicatedChunkResource or SharedChunkResource. For example:

  • FreeListPageResource + DedicatedChunkResource:
    • like what ImmixSpace currently uses on 64-bit.
  • FreeListPageResource + SharedChunkResource:
    • like what ImmixSpace currently uses on 32-bit.
  • MonotonicPageResource + DedicatedChunkResource:
    • like what the CopySpace in SemiSpace currently uses on 64-bit.
    • Can be configured to let the MonotonicPageResource take over the whole partition during start-up. This will be like the nursery in GenCopy.
  • MonotonicPageResource + SharedChunkResource:
    • like what the CopySpace in SemiSpace currently uses on 32-bit.

Coupling and synchronisation between PageResource and ChunkResource

A DedicatedChunkResource is owned by a single PageResource, so the PageResource always has exclusive access to the DedicatedChunkResource.

A SharedChunkResource is shared by multiple PageResource instances, so synchronisation is necessary. One PageResource shall hold an instance of SharedChunkResourceClient (which can implement the ChunkResource trait facing the PageResource). Per-space metadata, such as the "head pointer" of the linked list of "contiguous chunks", can be held in the "client". The "client" will access the shared SharedChunkResource instance, but will require a Mutex.
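
A minimal sketch of that arrangement, under assumed names: per-space metadata lives in the client, while the shared state sits behind an Arc<Mutex<...>> and is only touched under the lock.

    use std::sync::{Arc, Mutex};

    struct SharedChunkResourceState {
        // Placeholder for the real chunk free list and descriptor bookkeeping.
        free_chunk_runs: Vec<(usize, usize)>, // (start, extent) of free runs
    }

    struct SharedChunkResourceClient {
        shared: Arc<Mutex<SharedChunkResourceState>>,
        // Per-space metadata, e.g. the head of this space's super-chunk list.
        head_super_chunk: Option<usize>,
    }

    impl SharedChunkResourceClient {
        fn allocate_contiguous_chunks(&mut self, _chunks: usize) -> Option<usize> {
            let mut state = self.shared.lock().unwrap();
            // The real code would carve `_chunks` chunks out of a free list;
            // popping a whole free run here just shows where the lock is held.
            if let Some((start, _extent)) = state.free_chunk_runs.pop() {
                self.head_super_chunk = Some(start); // update per-space metadata
                Some(start)
            } else {
                None
            }
        }
    }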

Related topics

Should we refactor the creation of spaces when creating a Plan?

Currently, spaces need a separate "resizing" step after the plan is created. That is because the address range of a discontiguous space can only be determined after the address ranges of all other spaces are determined.

We can change this by letting a central entity decide the range of all spaces at once, before any spaces are created.

Step 1: The plan describes the requirement for each space. A requirement (i.e. VMRequest) can be Extent, Fraction (like what we currently have) and IDontCare (replacing the Discontiguous we currently have).

Step 2: A chosen HeapLayoutStrategy (see below) partitions the heap address space into partitions according to the VMRequests of all spaces, labels each partition as shared or dedicated, and creates ChunkResource instances.

Step 3: Spaces create PageResources, each coupled with its ChunkResource.
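
A hedged sketch of this flow: Extent and Fraction exist today (fields simplified here), while IDontCare, PartitionPlan, and partition_heap are assumptions made purely for illustration.

    enum VMRequest {
        Extent { extent: usize },
        Fraction { frac: f32 },
        IDontCare,
    }

    struct PartitionPlan {
        start: usize,
        extent: usize,
        shared: bool, // shared by many spaces, or dedicated to one
    }

    // Step 2, sketched: a layout strategy turns per-space requests into
    // non-overlapping, chunk-aligned partitions. This stub only shows the shape:
    // it lumps everything into one shared partition.
    fn partition_heap(_requests: &[VMRequest], heap_start: usize, heap_extent: usize) -> Vec<PartitionPlan> {
        vec![PartitionPlan { start: heap_start, extent: heap_extent, shared: true }]
    }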

HeapLayoutStrategy

A HeapLayoutStrategy decides how to partition the heap address space into partitions.

Overall, there are two basic strategies.

  • Strategy for generous address spaces: Give each space a dedicated 2TB (configurable) partition.
    • Mainly used for 64-bit systems.
  • Strategy for confined address spaces: Let all spaces that "don't care" where they are placed share one large partition (size configurable). Some spaces may demand dedicated partitions, and may even demand fixed address ranges (such as the VM space).
    • Mainly used for 32-bit systems.
    • Also useful for 64-bit systems with compressed pointers.

MMTk-core should give a reasonable default (for example, "generous" on 64-bit machines, and "confined" on 32-bit machines). The VM can also choose the strategy it desires and configure it. For example,

  • The VM can choose the "generous" strategy, but set partition sizes to be smaller, such as 16GB, if it detects (at run time) that it is running on a machine with a smaller virtual address space, such as RISC-V with 39-bit addresses (Sv39).
  • The VM can choose the "confined" strategy, and limit the address range of some spaces to the lowest 32 bits (or 35 bits, depending on the VM) so that objects can be referred to by compressed pointers.
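
One possible shape for this choice, sketched under assumed names and with illustrative numbers; the real configuration API, if any, is yet to be designed.

    const TIB: usize = 1 << 40;
    const GIB: usize = 1 << 30;

    enum HeapLayoutStrategy {
        // Each space gets a dedicated partition of `partition_size` bytes
        // (2TB by default; smaller on machines like Sv39 RISC-V).
        Generous { partition_size: usize },
        // Spaces that don't care where they live share one partition; the VM
        // can place it low in the address space for compressed pointers.
        Confined { shared_start: usize, shared_extent: usize },
    }

    fn default_strategy(pointer_width_bits: usize) -> HeapLayoutStrategy {
        if pointer_width_bits >= 64 {
            HeapLayoutStrategy::Generous { partition_size: 2 * TIB }
        } else {
            // Illustrative numbers only.
            HeapLayoutStrategy::Confined { shared_start: 0x6000_0000, shared_extent: 2 * GIB }
        }
    }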

SFT and side metadata

The design of heap layout influences the design of SFT and side metadata. It should be fast to locate the SFT entry and side metadata address when given the address of an object.

SFT is a per-space dispatching table. If spaces are aligned to 2TB regions, the SFT can be implemented very efficiently using a vector where each element governs a 2TB region (SFTSpaceMap). If spaces are discontiguous, we have to allocate one entry per chunk (SFT{Dense,Sparse}ChunkMap), where SFTDenseChunkMap introduces one level of indirection, and SFTSparseChunkMap is less space-efficient.
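
To make the cost difference concrete, here is a toy illustration of the index computations (2TB = 2^41 bytes; chunk size assumed to be 4MB; the real SFT implementations differ in detail).

    const LOG_SPACE_EXTENT: usize = 41;    // 2TB per space
    const LOG_BYTES_IN_CHUNK: usize = 22;  // assumed 4MB chunks

    // With 2TB-aligned spaces, the SFT index is a single shift.
    fn sft_space_index(addr: usize) -> usize {
        addr >> LOG_SPACE_EXTENT
    }

    // With discontiguous spaces, the map must be chunk-grained instead.
    fn sft_chunk_index(addr: usize) -> usize {
        addr >> LOG_BYTES_IN_CHUNK
    }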

For both SFT and side metadata, since they are not part of heap objects, they can be allocated outside the lowest 32-bit (or 35-bit, or other sizes depending on the design of compressed pointers) address range, allowing us to dedicate that low address range to the heap.

@steveblackburn
Contributor

You've written a lot, and I'm not sure I followed it all.

A few quick comments:

  1. Let's not use the word "region". It has a very well defined meaning in memory management which is different to what you're meaning here.
  2. Let's be clear about what a space is. It is a part of the address space subject to the same GC semantics. Each space has precisely one GC semantics (e.g. copying, mark-compact, immix, etc.). There may be multiple spaces with the same semantics.
  3. Let's be clear that there are reasons for both discontiguous spaces and contiguous spaces. Although contiguous spaces are clearly address-space-inefficient, these ideas are in principle orthogonal to address space size. This seems not to be clear in the status quo (the ideas are somewhat jumbled).

@wks
Collaborator Author

wks commented Sep 4, 2023

From today's discussion, we know that different heap layouts have different implications for the design of metadata and SFT.

@wks
Collaborator Author

wks commented Sep 6, 2023

I changed the proposal and related topics sections.

I now use the term "partition" (instead of the misused term "region") for the idea of partitioning the heap space into non-overlapping, chunk-aligned address ranges (i.e. partitions). There may be a better word for it.

I also introduced the term "super-chunk" to refer to "contiguous chunks" a PageResource acquires from a partition (currently from Map32 or Map64 using the allocate_contiguous_chunks method). I think it deserves a name because it is quite tricky to manage "super-chunks", e.g. using linked lists, or using shared FreeLists.

I also changed the "heap layout strategy" part to use the terms "generous" and "confined", as @steveblackburn mentioned in the meeting.

@wks
Collaborator Author

wks commented Sep 11, 2023

In our meeting this morning, we came to the following conclusions.

The structural change makes synchronisation clearer.

Currently all data structures responsible for allocation at chunk granularity are owned by the global singleton Map32 or Map64. This is bad because everything seems to be shared and everything seems to need locking, yet we sometimes access fields without locks.

With the structural refactoring involving "partitions" and "ChunkResource", it will be like the following diagram:

[diagram: spaces, their partitions, and the corresponding ChunkResources (dedicated vs. shared)]

It is clear what is owned by one space (dedicated partitions and ContiguousChunkResource) and what is shared by multiple spaces (shared partitions and DiscontiguousChunkResource). Therefore it is clear where we should add Mutex for synchronisation (e.g. DiscontiguousChunkResourceSync), and what data structures can be accessed without locks (e.g. the base and high_water fields of ContiguousChunkResource).

But we don't need all the structural change to make it safe.

Even if we refactor the structure as in the diagram above, we still need as many Mutexes as we currently have. Currently, Map32 has a field Map32::sync which is a Mutex<()> that protects almost everything in Map32Inner, and we acquire the Mutex whenever we allocate/free any chunks from Map32 by calling mut_self_with_sync. The name is quite ironic if you ask me, because we could simply put all the mutable things inside the sync to make them safely mutable. We can eliminate the need for unsafe in Map32 by putting things inside sync, and we can do this refactoring now, without the large-scale structural change.
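
A minimal sketch of that refactoring, with assumed names: move the mutable bookkeeping inside the Mutex so mutation requires holding the lock, instead of a Mutex<()> guarding unsafe interior mutation.

    use std::sync::Mutex;

    struct Map32SyncState {
        // Placeholder for region_map, prev_link, next_link, etc.
        allocated_chunks: usize,
    }

    struct Map32Refactored {
        sync: Mutex<Map32SyncState>,
        // Truly immutable or lock-free fields can stay outside the Mutex.
    }

    impl Map32Refactored {
        fn allocate_contiguous_chunks(&self, chunks: usize) -> usize {
            let mut state = self.sync.lock().unwrap();
            state.allocated_chunks += chunks; // all mutation happens under the lock
            state.allocated_chunks
        }
    }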

We identified some unsafe use cases in Map32.

Map32::get_contiguous_region_chunks calls self.region_map.size to get the size of a group of "contiguous chunks" (which I call a "super-chunk" in this issue). It queries this information by looking up region_map: FreeList, but it does not acquire the sync: Mutex<()> when doing so. This means that if the function is called while another thread is allocating chunks, there will be a race. Fortunately, Map32::get_contiguous_region_chunks is currently only used in the following cases:

  1. When freeing pages. During the sweeping stage, only one GC worker is executing the Release work packet, so it is still race-free.
  2. When compacting a discontiguous MarkCompactSpace (See Discontiguous mark compact space support #939). It needs to linearly scan through the super-chunks ("contiguous chunks"-es) it obtained from the chunk resource (Map32). But this only happens during the forwarding and compacting stages of GC, so it is still race-free.
  3. When printing the VM map. This is for debugging purposes. The user should call it at the "right" moment in order not to race with other threads.

Although none of the above introduces actual races, it is still not safe to access a shared data structure without holding the Mutex (unless we have a &mut Mutex, in which case Rust allows access to the underlying data without locking, because &mut is evidence of unique access). We should simply require the use cases listed above to acquire the lock. In any case, it is an error to access a FreeList from multiple threads without synchronisation.
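
For reference, the &mut Mutex case mentioned above corresponds to std's Mutex::get_mut, as in this small sketch:

    use std::sync::Mutex;

    // With unique (&mut) access, the data can be read without taking the lock.
    fn size_exclusive(free_list: &mut Mutex<Vec<usize>>) -> usize {
        free_list.get_mut().unwrap().len()
    }

    // With shared (&) access, the lock must be acquired first.
    fn size_shared(free_list: &Mutex<Vec<usize>>) -> usize {
        free_list.lock().unwrap().len()
    }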

The role of some data structures can be replaced by SFT

The JikesRVM MMTk didn't have SFT. Therefore, it needed global tables to dispatch from addresses to spaces.

We introduced SFT in the Rust MMTk. We currently have several SFT implementations at different granularities for 32-bit and 64-bit systems. SFT can replace some of the data structures we currently have in Map32 and Map64.

The descriptor_map is one example. It maps from chunks (Map32) or space IDs (Map64) to SpaceDescriptors. SpaceDescriptor is used by Space::address_in_space for discontiguous spaces, and also used by some assertions. When using discontiguous spaces (currently only on 32-bit systems or when using compressed OOPs), the address space is confined, and we implement SFT as a chunk-grained map, too. We can simply look up the space by looking up the SFT directly. (Actually, we only need the space index from SFT, without getting the &dyn reference. See #945 (comment) for some ideas for refactoring SFT.) If we use SFT to look up Space from address, we can remove descriptor_map completely.

We can replace the linked list of super-chunks with a Vec (or other containers)

Currently, Map32::prev_link and Map32::next_link form linked lists through super-chunks ("contiguous chunks"-es). Since they are per-chunk shared structures, they have to be accessed with unsafe (or Atomic or Mutex if we refactor). But given that their purpose is recording the super-chunks a PageResource has obtained from a ChunkResource (Map32 or Map64), it doesn't really matter what data structure we use.

A linked list might be the right data structure for the JikesRVM MMTk because the metacircular VM implementation doesn't provide MMTk with any advanced auto-resizing data structures (such as ArrayList). But the Rust MMTk can use Vec or other data structures that depend solely on malloc.

If we use a Vec to record super-chunks, we can remove the prev_link and next_link tables completely, and the algorithm for iterating through super-chunks will be greatly simplified. We can also make it a Vec<SuperChunk> instead of Vec<Address>, and make struct SuperChunk { start: Address, extent: usize } record its own size so that we don't need to ask the shared FreeList for the size of the SuperChunk. This will remove the unsafe access pattern mentioned above.
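
A sketch of what this could look like, using usize in place of Address; the struct follows the shape suggested above, while the container and method names are illustrative.

    struct SuperChunk {
        start: usize,
        extent: usize,
    }

    struct SuperChunkList {
        super_chunks: Vec<SuperChunk>, // replaces the prev_link/next_link lists
    }

    impl SuperChunkList {
        fn record(&mut self, start: usize, extent: usize) {
            self.super_chunks.push(SuperChunk { start, extent });
        }

        // Each super-chunk records its own extent, so we no longer need to ask
        // the shared FreeList for its size.
        fn total_bytes(&self) -> usize {
            self.super_chunks.iter().map(|sc| sc.extent).sum()
        }
    }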

Debugging concerns

When doing the structural refactoring, it will be better if we have one VM that supports compressed pointers. That way, we can debug discontiguous spaces on 64-bit systems without cross-compiling for i686-unknown-linux-gnu or using -m32. Compressed pointers are the main use case of Map32 on 64-bit systems, so we can test whether it functions correctly. But we can also test Map32 by manually overriding the default VM_LAYOUT on 64-bit systems to use VMLayout::new_32bit().

Space::acquire already holds locks when allocating pages. We should refactor that to reduce locking, too.

The method Space::acquire holds a lock, CommonSpace::acquire_lock, to prevent a race condition where one thread has allocated a new chunk but has not yet initialised its metadata and SFT, while another thread attempts to allocate new pages in the same chunk and assumes the metadata and SFT have already been initialised. But when the space calls pr.get_new_pages to allocate pages, the PageResource acquires its own lock (MonotonicPageResource::sync or FreeListPageResource::sync) for synchronisation. It should be possible to remove one level of locking.

@wks
Collaborator Author

wks commented Sep 11, 2023

During today's meeting, we also asked one question:

Why does the nursery of GenCopy and GenImmix occupy a dedicated partition instead of sharing the partition with other spaces?

That is how the nursery is currently implemented. The nursery in CommonGenPlan is created with VMRequest::fixed_extent (see VMRequest::fixed_extent(args.global_args.options.get_max_nursery_bytes(), false)). Therefore, it is contiguous (and thus has its own dedicated address range, not shared with other spaces) even with Map32.

We don't know why it was designed like this. My guess is that allocating memory in a contiguous space is faster, it is easier to control the size of the nursery, and the address range is guaranteed to be available for the nursery and cannot be taken by other spaces. But those are pure speculations.
