Emulating Persistent Memory in the Linux Kernel: Experience and How-To
This past semester I was fortunate enough to take CPSC 508 with the marvelous Margo Seltzer. CPSC 508 (for anyone who doesn’t go to UBC) is a graduate-level operating systems course, and as part of it we were all tasked with working on a research topic of our choosing. For our project, my group decided to do some research that required two things:
- emulate persistent memory
- change the page size of that emulated persistent memory
At first glance, I didn’t think that either of these goals would present any real roadblocks, but as a whole our group soon discovered that we were incredibly mistaken. In this post I hope to shed some light on this process so that others will benefit from a reduced amount of pain and anxiety (all joking aside, this really was painful).
Step 1: Emulating PMEM
This is the easy part, so here goes.
Emulating persistent memory in the Linux kernel can be achieved via the memmap kernel parameter in kernel versions 4.0+. It takes the form memmap=X!Y, where:
- X is the amount of memory you’d like to set aside for your emulated PMEM device
- Y is the physical offset for where this mapping should begin
Now, when I say physical offset, I mean the actual physical address at which the mapping starts. But you obviously can’t just go and map any region of your address space as a PMEM device. You need to do a small amount of due diligence first to determine which regions are not currently reserved. This can be done by running dmesg | grep BIOS-e820, which should give you a listing of address ranges and whether each one is usable or reserved, like this:
➜ ~ dmesg | grep BIOS-e820
[ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x0000000000057fff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Now, find a region that’s marked as ‘usable’, determine how big it is, and map a region less than or equal to that size with the memmap flag. The whole mapping has to fit inside that one usable range. The memmap flag is passed on the kernel command line, which can be set through GRUB at boot time. As an example, to map 32 GB of my DRAM as a PMEM device, using a region of usable memory that starts at 4 GB, I would set memmap=32G!4G in my GRUB config (or via kexec).
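Checking by eye that a mapping fits inside a usable range gets error-prone with 16-digit hex addresses, so here is a minimal Python sketch of that check. It assumes the dmesg output looks like the listing above; the helper names (parse_size, usable_regions, memmap_fits) are made up for this post and are not part of any kernel tooling.

```python
import re

# Matches lines such as:
# [    0.000000] BIOS-e820: [mem 0x0000000000000100-0x0000000000057fff] usable
E820_LINE = re.compile(
    r"BIOS-e820: \[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\] (.+)$"
)

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as '32G' or '4096' to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text, 0)


def usable_regions(dmesg_text):
    """Return (start, end_inclusive) pairs for every 'usable' e820 range."""
    regions = []
    for line in dmesg_text.splitlines():
        match = E820_LINE.search(line)
        if match and match.group(3).strip() == "usable":
            regions.append((int(match.group(1), 16), int(match.group(2), 16)))
    return regions


def memmap_fits(memmap_arg, regions):
    """Check that 'memmap=SIZE!OFFSET' lies entirely inside one usable range."""
    size_text, offset_text = memmap_arg.split("=", 1)[1].split("!", 1)
    size, offset = parse_size(size_text), parse_size(offset_text)
    last_byte = offset + size - 1
    return any(start <= offset and last_byte <= end for start, end in regions)
```

Feed usable_regions the text of dmesg | grep BIOS-e820, then pass the result to memmap_fits along with the flag you intend to boot with. It only checks that the range is contained in a single usable region; it says nothing about whether the kernel itself needs part of that memory.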
There are a couple of ways to verify that this mapping succeeded. The first is to just list /dev/. The device should show up there as pmemX (e.g. pmem0). Another way to verify is to run dmesg | grep user, which should list a persistent memory range.
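If you reboot a lot while experimenting, the two checks above are easy to script. This is a rough sketch: the function names are invented for this post, the pmemN naming pattern is taken from the text above, and the assumption that the relevant dmesg line contains both "user:" and "persistent" is mine and may not hold on every kernel version.

```python
import os
import re

# Simple device names such as pmem0 or pmem1.
PMEM_NAME = re.compile(r"^pmem\d+$")


def find_pmem_devices(dev_dir="/dev"):
    """Return sorted device names such as 'pmem0' found in dev_dir."""
    return sorted(name for name in os.listdir(dev_dir) if PMEM_NAME.match(name))


def persistent_ranges(dmesg_text):
    """Return dmesg lines that appear to report a user-defined persistent range."""
    return [
        line
        for line in dmesg_text.splitlines()
        if "user:" in line and "persistent" in line
    ]
```

An empty list from find_pmem_devices() after booting with the memmap flag is a good hint that the flag never reached the kernel command line.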
At this point, there are multiple options for getting things up and running on your newly emulated PMEM device. You can create an EXT4 or XFS filesystem on top of it and run existing applications on it by using libvmmalloc, or you can use the Persistent Memory Development Kit (PMDK) to write your own new applications to run on PMEM.
If you want to mess around with the page size, however, it turns out you cannot simply strap libvmmalloc onto your app and run it. There is a bit of an issue here – which brings us to the next section: changing page size.
Step 2: Changing Page Size with PMEM
Now, if for some reason you (like us) would like to/need to change the page size for an application you intend to run on PMEM, things get trickier.
Libvmmalloc, the simple solution for getting something running on PMEM, was developed to run on top of a PMEM device that was formatted in fsdax mode. This mode is the same one mentioned in the previous section, where you create a filesystem on top of your pmem. This essentially allows libvmmalloc to allocate a temporary file as a ‘VMMALLOC_POOL’ and hand that region off to jemalloc in an arena-esque allocation method. This is great if you’re okay with the default 2MB page size for PMEM devices in the Linux kernel; but what if you wanted 4K or 1G pages? What then?
For DRAM, larger pages can be acquired with tools such as hugectl, and hugepage support is built directly into the kernel. Support for different page sizes is also built into the kernel for PMEM, but the knob is really not as easy to discover. Unlike DRAM, there is no obvious “change me to change the page size” button staring you in the face. This is where some digging around had to be done.
If you have the guts and inclination, digging into the nvdimm driver allows you to start to get an inkling of how page size is chosen for a PMEM device. Essentially, there are 3 page sizes supported: 4K, 2M, and 1G. These correspond to the kernel constants PAGE_SIZE, HPAGE_PMD_SIZE, and HPAGE_PUD_SIZE respectively. What we realized is that these page sizes are implicitly controlled via the alignment of the PMEM device. If the device was aligned to 4K, you get 4K pages, and so on for 2M and 1G sizes.
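To make the link between alignment and page size a bit more concrete, here is a toy model in Python. It is not the kernel's logic; it only encodes the general rule that a region can be mapped with a given page size only if its start address is a multiple of that size and it is at least that large. The function name and constants are made up for illustration.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# The three page sizes discussed above, largest first.
SUPPORTED_PAGE_SIZES = (1 * GIB, 2 * MIB, 4 * KIB)


def largest_page_size(start, length):
    """Return the largest supported page size a region could be mapped with.

    A page size qualifies only if the start address is a multiple of it
    and the region is long enough to hold at least one such page.
    """
    for page in SUPPORTED_PAGE_SIZES:
        if start % page == 0 and length >= page:
            return page
    raise ValueError("region must be 4K-aligned and at least 4K long")
```

Under this model a 32 GB region starting at exactly 4 GB could use 1G pages, while the same region shifted by 2 MB could use at most 2M pages.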
Now that we had figured this out, all that was left to do was to change the alignment… or so we thought. Even after changing the alignment with the ndctl tool, it was not possible to get the page size we wanted in fsdax mode. We verified this by looking at the number of TLB misses, which were the same for 4K and ‘1G’ pages, suggesting this method would not work. What came as a strange surprise, however, was that when running a benchmark designed to be used with devdax mode, we did see results indicative of a page size change!
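For reference, changing the mode and alignment of an existing namespace with ndctl looks roughly like the command built below. This is a sketch rather than the exact invocation we used: the namespace name namespace0.0 is an assumption (check ndctl list on your machine), and the Python wrapper only assembles and prints the command instead of running it, since reconfiguring a namespace destroys its contents and needs root.

```python
import shlex

VALID_MODES = {"fsdax", "devdax"}
VALID_ALIGNMENTS = {"4K", "2M", "1G"}


def ndctl_reconfigure_cmd(namespace, mode, align):
    """Build an 'ndctl create-namespace' command that reconfigures a namespace."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    if align not in VALID_ALIGNMENTS:
        raise ValueError(f"unsupported alignment: {align}")
    return [
        "ndctl",
        "create-namespace",
        "--force",                   # allow reconfiguring an active namespace
        f"--reconfig={namespace}",   # modify this namespace in place
        f"--mode={mode}",
        f"--align={align}",
    ]


if __name__ == "__main__":
    # 'namespace0.0' is a placeholder; substitute the name ndctl reports.
    print(shlex.join(ndctl_reconfigure_cmd("namespace0.0", "devdax", "1G")))
```

In devdax mode the namespace shows up as a character device (something like /dev/dax0.0) rather than a /dev/pmemX block device, so there is no filesystem to create on top of it.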
This final observation led us to the last piece of the puzzle: getting our applications to run with devdax mode. The solution, as it turns out, was to hack libvmmalloc. There wasn’t a huge development effort that had to happen here, so we just went ahead and removed any functionality that relied on a filesystem being present, and replaced it with functionality that should still achieve our goal. Admittedly, this is not the most stable, but it worked most of the times we tried it. There may be more tweaks needed to get a stable version of libvmmalloc for devdax usage, however.
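To give a feel for what "no filesystem" means in practice, here is a small Python sketch of the general idea: instead of creating a temporary pool file, you open the device node and mmap it directly. This is not our libvmmalloc patch (that is C), and the function name is made up. The one real constraint it illustrates is that the mapping length has to be a multiple of the device alignment.

```python
import mmap
import os


def map_dax_region(path, length, align):
    """Map `length` bytes of `path` as shared memory, enforcing alignment.

    `path` would be a device-DAX node such as /dev/dax0.0 (an assumed name);
    `align` is the alignment the namespace was created with, in bytes.
    """
    if length <= 0 or length % align != 0:
        raise ValueError("length must be a positive multiple of the alignment")
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(
            fd,
            length,
            flags=mmap.MAP_SHARED,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    finally:
        # Python's mmap keeps its own duplicate of the descriptor,
        # so the mapping stays valid after this close.
        os.close(fd)
```

The returned object behaves like a bytearray backed by the device, which is roughly the kind of region an allocator such as jemalloc then carves up.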
Our hacked version of libvmmalloc can be found here.