Reflection on x86 memory virtualization

Paper Reflection:

Q: List at least one pro and one con for the software MMU and for the hardware MMU.
A: Software MMU: flexible, but slow; the shadow page table is flushed on each CR3 change. Hardware MMU: high performance, but complex (a TLB miss can amplify memory accesses by roughly 20x, which puts more stress on the TLB) and expensive to implement (it needs a bigger TLB).

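Q: What is the double paging problem and what caused it?
A: A page gets swapped out, then pulled back in, then swapped out again. Why? Two bosses are hard to please at once. The guest and meta-level (hypervisor) paging policies may clash: the hypervisor swaps a page out, then the guest OS, under memory pressure, picks that same page to swap out, so the page is faulted back in from the hypervisor's swap only to be written out again to the guest's own swap.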

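Q: What is the benefit of keeping a “hint” entry for each scanned (but unshared) page, as compared to not maintaining anything for the page?
A: The hint records a hash of the page's content as its signature, so a later page with the same hash can be recognized quickly as a likely duplicate, without a full comparison against every page or a re-scan. On a match, the hint page is rehashed to confirm it has not changed, and only then is a full comparison done.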

Questions:

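Does page sharing suffer from hash collisions? If so, how are data corruption and data leaks avoided?
Memory ballooning is essentially "the squeaky wheel gets the grease". How is the noisy neighbor problem avoided, especially in a multi-tenant cloud environment?
How do you decide the upper limit on the number of VMs to provision on a server? How do you maximize overcommitment? Is the partitioning logic dynamic or fixed?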

Note:

From the slides:

** Software vs. hardware memory virtualization

On a TLB hit, software-managed and hardware-managed TLBs behave the same. If the translation is invalid, the MMU generates a page fault and the OS performs page fault handling: fetch the page, update the page table, and resume execution.

On a TLB miss they differ. Software-managed TLB: the hardware raises an exception and traps to the OS, which looks up the page table and refills the TLB. Hardware-managed TLB: the MMU walks the page table and refills the TLB itself.
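A toy model of the two refill paths, as a minimal sketch. The names `translate`, `walk`, and the `managed_by` flag are hypothetical and exist only for illustration; real hardware does the walk in silicon or in a trap handler, and page tables are multi-level rather than a flat dictionary.

```python
PAGE_SIZE = 4096


class PageFault(Exception):
    """Raised when the page table has no valid mapping for a page."""


def walk(page_table, vpn):
    # Page table walk: look up the virtual page number in the table in RAM.
    if vpn not in page_table:
        raise PageFault(vpn)
    return page_table[vpn]


def translate(vaddr, tlb, page_table, managed_by, log):
    """Translate a virtual address, recording which path was taken in `log`."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:
        log.append("hit")                  # identical for both designs
    elif managed_by == "hw":
        tlb[vpn] = walk(page_table, vpn)   # MMU walks and refills, no trap
        log.append("hw-walk")
    else:
        log.append("trap-to-os")           # exception: software gets control
        tlb[vpn] = walk(page_table, vpn)   # OS handler refills the TLB
    return tlb[vpn] * PAGE_SIZE + offset
```

Only the software-managed path gives software a chance to run on a miss; in the hardware-managed path nothing is logged except the walk itself.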

Key point: the TLB (a hardware cache) holds a subset of the page table (a data structure in RAM).

** Difficulty in virtualizing a hardware-managed TLB

Why doesn’t a hypervisor get a chance to intercept TLB misses? Because TLB misses are handled directly by the hardware MMU: on a miss, the MMU itself walks the page table to fetch the missing translation from memory. No exception or trap is raised, so the hypervisor never gains control during the walk.

Solutions:

Shadow paging (this slide)
Paravirtualization: modify the guest OS, e.g. remove the sensitive but unprivileged instructions from the guest OS
New hardware (this slide)

shadow paging

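The VMM monitors CR3, which holds the base address of the page table for the current virtual address space.
When CR3 changes, the shadow page table needs to be re-synced.
Each guest physical address must be re-translated into a machine address.
Then the real CR3 is pointed at the shadow page table.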
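The steps above can be sketched as follows. This is a simplified illustration, not how any particular VMM is implemented: `build_shadow`, `VCPU`, and `on_guest_cr3_write` are hypothetical names, page tables are flat dictionaries, and the shadow table is rebuilt eagerly on every CR3 write, whereas a real VMM would typically fill and cache shadow entries lazily.

```python
def build_shadow(guest_page_table, pmap):
    """Compose VPN -> PPN (guest) with PPN -> MPN (VMM) into VPN -> MPN."""
    shadow = {}
    for vpn, ppn in guest_page_table.items():
        if ppn in pmap:                 # skip pages with no machine backing
            shadow[vpn] = pmap[ppn]
    return shadow


class VCPU:
    def __init__(self, pmap):
        self.pmap = pmap                # per-VM PPN -> MPN map
        self.guest_cr3 = None           # what the guest believes CR3 holds
        self.real_cr3 = None            # what the hardware actually uses

    def on_guest_cr3_write(self, guest_page_table):
        # The guest's CR3 write traps to the VMM: re-sync the shadow table,
        # then point the real CR3 at the shadow instead of the guest table.
        self.guest_cr3 = guest_page_table
        self.real_cr3 = build_shadow(guest_page_table, self.pmap)
```

For example, with `pmap = {10: 100, 11: 101}` and a guest table `{0: 10, 1: 11}`, the shadow table maps virtual page 0 straight to machine page 100, so the hardware never sees the intermediate guest physical page number.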

Recall: logical page number -> physical page number -> machine page number. There is one shadow page table and one guest page table per application, and one pmap per VM. Shadow page tables can also have a user/kernel split.

** Hardware-assisted memory virtualization

Intel Extended Page Tables (EPT), referenced by the EPT base pointer, map PPN -> MPN per VM. The hardware directly walks both the guest page table and the extended page table. The VMM no longer has to maintain a shadow page table itself; the hardware does the re-sync on its own.
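The cost of walking two tables in hardware can be estimated with a back-of-the-envelope count. This sketch assumes the worst case: every level of both tables is read from memory, with no TLB or paging-structure cache hits. The function names are hypothetical.

```python
def native_walk_refs(levels):
    """Memory references for one TLB miss without virtualization."""
    return levels


def nested_walk_refs(guest_levels, ept_levels):
    """Worst-case memory references for one TLB miss with a two-table walk."""
    # Each guest level holds a guest physical pointer: translating it costs
    # one extended-table walk, plus one read of the guest entry itself.
    per_guest_level = ept_levels + 1
    # The final guest physical address needs one more extended-table walk.
    return guest_levels * per_guest_level + ept_levels
```

With four-level tables on both sides, `nested_walk_refs(4, 4)` gives 24 references against 4 natively, which is why a TLB miss is so much more expensive and why larger TLBs help.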

** Memory management

Memory management: reclaiming, sharing, allocation.

*** Reclaiming memory

Because of overcommitment, the total memory size of all VMs exceeds the actual machine memory size.

Reclaiming requires “meta-level” decisions: which page from which VM to swap out. Making that choice well takes knowledge of the guest OS.

Solution: implicit cooperation with the guest.

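*** Ballooning

Ballooning dynamically adjusts the memory allocation of each virtual machine (VM) based on memory pressure. A balloon driver inside the guest inflates by allocating pinned pages, which the hypervisor can then reclaim; deflating hands memory back to the guest.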
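A minimal accounting sketch of inflate and deflate, assuming memory is tracked only as page counts. The classes `Guest` and `Hypervisor` and their methods are hypothetical; a real balloon driver allocates actual guest pages and reports their page numbers to the hypervisor.

```python
class Guest:
    def __init__(self, name, pages):
        self.name = name
        self.usable = pages     # pages the guest OS can still hand out
        self.balloon = 0        # pages pinned by the balloon driver

    def inflate(self, n):
        # The balloon driver allocates pinned pages inside the guest; the
        # guest OS decides what to evict using its own replacement policy.
        n = min(n, self.usable)
        self.usable -= n
        self.balloon += n
        return n                # pages the hypervisor can now reclaim

    def deflate(self, n):
        n = min(n, self.balloon)
        self.balloon -= n
        self.usable += n
        return n                # pages the hypervisor must back again


class Hypervisor:
    def __init__(self, free_machine_pages):
        self.free_machine_pages = free_machine_pages

    def reclaim(self, guest, n):
        got = guest.inflate(n)
        self.free_machine_pages += got
        return got

    def release(self, guest, n):
        n = min(n, self.free_machine_pages)   # cannot give what is not free
        back = guest.deflate(n)
        self.free_machine_pages -= back
        return back
```

The point of the design is that the hypervisor never picks the victim pages itself: it only sets the balloon size, and the guest's own policy chooses what to give up.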

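** Memory sharing

VMs that run the same OS or the same apps, or that use shared memory (shm), hold identical pages. These can be backed by a single machine page, protected by copy-on-write.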

Performance Evaluation of Intel EPT Hardware Assist: nice visualization of the software MMU and hardware MMU
Memory Resource Management in VMware ESX Server

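** “Physical” is only “physical” in quotes in the context of virtualization: what the guest treats as physical memory is itself mapped by the hypervisor. pmap: maps a “physical” page number to a machine page number.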

** Chapter 3: Reclamation Mechanisms

3.1 Page Replacement Issues: why the double paging problem arises.
3.2 Ballooning: dynamic; the balloon can also be popped, which resets the PPN -> MPN mapping.
3.3 Demand Paging: when ballooning is not possible or insufficient, the system falls back to a paging mechanism.

** Chapter 4: Sharing Memory

4.1 Transparent Page Sharing: knowledge-based redundancy detection.
4.2 Content-Based Page Sharing: content-based redundancy detection using hint frames.
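A rough sketch of content-based sharing with hint entries, under several assumptions: pages are `bytes` values in a dictionary, SHA-1 from Python's `hashlib` stands in for whatever hash the real system uses, and `PageSharer`, `scan`, `hints`, `shared`, and `remap` are hypothetical names. Copy-on-write protection of shared pages is not modeled.

```python
import hashlib


def page_hash(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()


class PageSharer:
    def __init__(self, memory):
        self.memory = memory   # page id -> bytes (stand-in for machine pages)
        self.hints = {}        # hash -> id of a scanned but unshared page
        self.shared = {}       # hash -> id of a page already shared
        self.remap = {}        # page id -> id of the page it now shares

    def scan(self, pid):
        content = self.memory[pid]
        h = page_hash(content)
        if h in self.shared:
            target = self.shared[h]
            # A full comparison guards against hash collisions.
            if target != pid and self.memory[target] == content:
                self.remap[pid] = target
            return
        hint_pid = self.hints.get(h)
        if hint_pid is not None and hint_pid != pid:
            # The hint page was never write-protected, so rehash it first.
            if page_hash(self.memory[hint_pid]) != h:
                del self.hints[h]              # stale hint: page was modified
            elif self.memory[hint_pid] == content:
                del self.hints[h]
                self.shared[h] = hint_pid      # promote the hint to shared
                self.remap[pid] = hint_pid
                return
            else:
                return                         # collision: leave unshared
        self.hints[h] = pid                    # record this page as a hint
```

A page scanned for the first time costs only a hash and a table entry. It stays writable until a second page with the same hash shows up, at which point the rehash and full comparison decide whether the two are really identical.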

Written on February 27, 2024