Why is Rails boot so slow on macOS?

About a year ago I noticed various benchmarks such as running specs were running abnormally slow on macOS. Noah Gibbs started some basic investigation but did not come up with anything definitive.

Since then I crafted a version of Discourse Bench (used at: https://rubybench.org/) that is there to profile “performance” of local development machines.

It is a simple benchmark that measures performance of various tasks you are likely to run into in development:

  • Booting Rails
  • Running a spec
  • Visiting a very common page
  • Creating a topic
  • Converting text to HTML via our markdown library

It also contains some micro benches to help isolate culprits including:

  • PBKDF2 calculation
  • Filesystem random read access

The benchmark lives here, anyone can run it.

git clone https://github.com/SamSaffron/discourse-dev-benchmark.git
bundle
./bench

We have run the benchmark on many of our development machines and noticed a trend. The same CPU tends to perform a lot better (especially boot) on either Linux native or WSL2 (virtualized Linux on Windows) than it does on MacOS

Additionally, I took the time to run the same benchmark on a dual booting Mac that I have. Here are some examples of the benchmarks we got (abbreviated, cause we have many benches):

Booting Rails

In this benchmark we simply run rails r Post.first.

Desktop - 9900KF - nvme - Arch | 1.375 Desktop - 9900KF - nvme - WSL2 | 1.407 Desktop - Ryzen 3900X - NVME - Ubuntu | 1.417 Lenovo P1 - Gen 2 - 9850H - WSL2 - Ubuntu | 1.433 Laptop - XPS 15 - 9880HK - Linux 5.5.8 | 1.439 Lenovo P1 - Gen 2 - Arch Native - 9850H | 1.539 MacBook Pro (16-inch, 2019) - Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz | 2.51 MBP i7-3615QM, Windows WSL 2 (2.3 quad mid 2012) | 2.52 MBP i7-3615QM, MacOS Catalina (2.3 quad mid 2012) | 3.953
desc seconds variance percent
Desktop - i9-9900K Linux Mint 1.356 (±1.7%) 100.0%
Desktop - 9900KF - Asus Prime Z370 - nvme - Arch 1.375 (±1.0%) 101.4%
Desktop - 9900KF - Asus Prime Z370 - nvme - WSL2 1.407 (±2.7%) 103.76%
Desktop - Ryzen 3900X - NVME - Ubuntu 1.417 (±0.4%) 104.5%
Lenovo P1 - Gen 2 - 9850H - WSL2 - Ubuntu 1.433 (±1.2%) 105.68%
Laptop - XPS 15 - 9880HK - Linux 5.5.8 1.439 (±0.6%) 106.12%
Lenovo P1 - Gen 2 - Arch Native - 9850H 1.539 (±1.0%) 113.5%
Desktop: AMD Ryzen 3950X, Arch, NVMe, ruby 2.6.5 (rvm) 1.728 (±3.2%) 127.43%
Oryx Pro - i7-9750H CPU @ 2.60GHz, 32GB, NVMe. 1.787 (±4.5%) 131.78%
MacBook Pro (16-inch, 2019) - Intel® Core™ i9-9980HK CPU @ 2.40GHz 2.51 (±2.4%) 185.1%
MBP i7-3615QM, Windows WSL 2 (2.3 quad mid 2012) 2.52 (±0.3%) 185.84%
MBP i7-9750H @ 2.6GHz (15", 2019, macOS Catalina) 3.046 (±1.7%) 224.63%
Lenovo P1 - Gen 2 - 9850H - WSL1 - Ubuntu 3.06 (±0.9%) 225.66%
MBP i7-3615QM, MacOS Catalina (2.3 quad mid 2012) 3.953 (±0.1%) 291.52%

Observations

  • An ancient 2012 MacBook Pro with WSL2 boots Discourse almost as fast as the state of the art MacBook Pro!

  • 9880HK CPU boots about 80% faster on Linux than it does in macOS

  • WSL1 is slower on boot, but can have adequate performance if you enable bootsnap (which we do)

  • 9900K by intel is the single thread king, beating both the 3900X and 3950X by AMD

  • 3950X though on paper faster than the 3900X failed to beat any of our 3900X based computers on single thread benches despite the 2x price tag.

  • Top of the range laptops perform Rails boot very close to decked up desktops

  • We stopped benching on rbenv based systems, everyone moved to chruby or rvm cause the shims rbenv adds introduce significant delays on boot.

Running the Discourse Post spec

In this benchmark we run: bin/rspec spec/models/post_spec.rb --seed 1, it is one of our larger spec files and is pretty representative.

Desktop - i9-9900K Linux Mint | 9.907 Desktop - 9900KF - nvme - Arch | 9.994 Desktop - Ryzen 3900X - NVME - Ubuntu | 10.524 Laptop - XPS 15 - 9880HK - Linux 5.5.8 | 10.74 Lenovo P1 - Gen 2 - Arch Native - 9850H | 11.346 Desktop - 9900KF - Asus Prime Z370 - nvme - WSL2 | 11.748 Desktop: AMD Ryzen 3950X, Arch, NVMe, ruby 2.6.5 (asdf) | 11.809 Lenovo P1 - Gen 2 - 9850H - WSL2 - Ubuntu | 12.308 Oryx Pro - i7-9750H CPU @ 2.60GHz, 32GB, NVMe. | 12.557 MacBook Pro (16-inch, 2019) - Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz | 12.963 MBP i7-9750H @ 2.6GHz (15", 2019, macOS Catalina) | 16.173 Lenovo P1 - Gen 2 - 9850H - WSL1 - Ubuntu | 17.598 MBP i7-3615QM, MacOS Catalina (2.3 quad mid 2012) | 23.396 MacBook Air (13-inch, 2018) - i5 8210Y | 23.569 MBP i7-3615QM, Windows WSL 2 (2.3 quad mid 2012) | 23.738
desc seconds variance percent
Desktop - i9-9900K Linux Mint 9.907 (±0.6%) 100.0%
Desktop - 9900KF - Asus Prime Z370 - nvme - Arch 9.994 (±0.5%) 100.88%
Desktop - Ryzen 3900X - NVME - Ubuntu 10.524 (±0.4%) 106.23%
Laptop - XPS 15 - 9880HK - Linux 5.5.8 10.74 (±2.4%) 108.41%
Lenovo P1 - Gen 2 - Arch Native - 9850H 11.346 (±0.0%) 114.53%
Desktop - 9900KF - Asus Prime Z370 - nvme - WSL2 11.748 (±0.5%) 118.58%
Desktop: AMD Ryzen 3950X, Arch, NVMe, ruby 2.6.5 (asdf) 11.809 (±0.5%) 119.2%
Lenovo P1 - Gen 2 - 9850H - WSL2 - Ubuntu 12.308 (±2.0%) 124.24%
Oryx Pro - i7-9750H CPU @ 2.60GHz, 32GB, NVMe. 12.557 (±2.5%) 126.75%
MacBook Pro (16-inch, 2019) - Intel® Core™ i9-9980HK CPU @ 2.40GHz 12.963 (±1.7%) 130.85%
Mac Mini i7-8700B @ 3.2 GHz (2018, Ubuntu 19.10, USB3 SSD) 13.971 (±11.3%) 141.02%
MBP i7-9750H @ 2.6GHz (15", 2019, macOS Catalina) 16.173 (±1.0%) 163.25%
Lenovo P1 - Gen 2 - 9850H - WSL1 - Ubuntu 17.598 (±0.6%) 177.63%
MBP i7 2.6 GHz, 15’’ 2018, Catalina 16Gb RAM 20.922 (±2.1%) 211.18%
MBP i7-3615QM, MacOS Catalina (2.3 quad mid 2012) 23.396 (±1.7%) 236.16%
MBP i7-3615QM, Windows WSL 2 (2.3 quad mid 2012) 23.738 (±0.6%) 239.61%
(Docker) MBP i7-8559U (13-inch, 2018) 74.232 (±4.0%) 749.29%

Observation

  • Even if we compensate for the 1 second slower boot, performance of the 9980HK under macOS is still 10% or so slower than Linux using the same CPU

  • Again, to of the range laptops are very very close to top of the range desktops for running a single spec

  • Docker on Mac is a world of pain when it comes to developing Rails

  • WSL2 and Catalina on the same machine have similar performance running our spec file. It seems that in a steady state after the slow boot, Mac can beat virtualized Linux performance but remains slower than non virtualized Linux.

Random Read

In this benchmark we read 10 random files that are 10_000 bytes long from a directory containing 5000 files.

This benchmark really highlights the slowness of WSL1 and MacOS

Lenovo P1 - Gen 2 - Arch Native - 9850H | 15745.077 Lenovo P1 - Gen 2 - 9850H - WSL2 - Ubuntu | 14991.689 Lenovo P1 - Gen 2 - 9850H - WSL1 - Ubuntu | 1270.476
Laptop - XPS 15 - 9880HK - Linux 5.5.8 | 16956.501 Laptop XPS15, Ubuntu 18.04.4 w/KDE Plasma (2.4GHz i9-9980HK, 32GB RAM) - V2 | 15997.256 MacBook Pro (16-inch, 2019) - Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz | 3776.437

Raw data omitted cause graphs tell the story. It appears across the board, reading lots of smallish files randomly is significantly slower on macOS which may be a big reason for the much slower boot times.

Does anyone have any ideas how we could get boot and performance on MacOS improved?

Any other thoughts here?

23 Likes

Based on your conclusion that the difference comes from IO performance, maybe you could try to read from a RAMFS partition? It would probably be tricky to properly integrate in a development environment, but it’s worth a try.

8 Likes

Sam said on Twitter he tested w/ a ramdisk and saw no significant change. That makes sense to me as a 4x slowdown in random read performance on 2 machines w/SSDs is going to have to be software related.

This looks like a non-Rails-specific issue that others have encountered.

Questions still open here for me:

  1. Does an encrypted/decrypted partition make a difference?
  2. Does HFS+ or another filesystem on MacOS perform better or worse?
  3. Does a virtualized or dockerized environment perform better or worse on Sam’s benchmark?
  4. Can we wrap a profile around this benchmark to see which parts exactly are slower?
3 Likes

I think when we ran the test we only tested putting the application directory on the ramdisk, I am guessing to get any material boost we would need to keep all the gems on the ramdisk.

Agree completely that even if we manage to prove this is way faster, integrating it into a dev environment may be too difficult. One possible alternative would be to have a “non validating” mode for bootsnap where it does not stat files of validate and simply loads stuff directly from the cache?

@nateberkopec one wonderful thing about this benchmark is that we can all run it. All you need is redis and postgres installed, it takes care of the rest including configuring a standalone pg process and standalone redis process.

Perhaps you know someone who wants to get in to “performance” who could try answering these questions? Just play with knobs and re-run the bench.

I strongly suspect standalone virtualbox image will boot faster on a Mac than native Mac, which is kind of scary. I think Docker is painfully slow once you involve any cached filesystem that replicates to the host. We tried running everything inside docker, but the problem there is that it runs in a overlayfs volume which by its nature is really slow, if you could mount a native ext3fs volume to it somehow, it may be fast, but replicating to the host will be tricky.

4 Likes

Yes, for most app gems contain way more files than the actual app. Testing with the gems on the RAMdisk shouldn’t be too hard if you use bundle --path to install gems in a subdirectory.

It would also be interesting to test both HFS+ (is that still available on newer OSX?) and APFS ramdisk, if there is a notable difference between the two that could confirm that the performance difference is in the filesystem implementation.

It’s all conjecture, but it really sounds like OSX isn’t keeping files in cache as aggressively than Linux. That’s what the RAMdisk would help prove. If the file system cache works properly, the vast majority of your benchmark file access should be done in RAM if you run it twice.

If the RAMdisk is almost as slow as the real disk, then it’s not a cache problem.

1 Like

I just ran benchmark on a Macbook 12-inch (Early 2015, 1.1 GHz Intel Core M, 8GB, macOS 10.14.6). Ruby v2.6.5 via rbenv. Postgresql v12.2 installed via homebrew.

Running benchmark on 1 core
Topic View: 1.907 (±14.4%) op/s
Topic Create: 1.114 (±3.4%) op/s
Post Spec: 51.092 (±16.5%) seconds
PBKdf2 64000 sha256: 2.693 (±12.6%) op/s
Cook README.md: 36.738 (±16.8%) op/s
Discourse Boot: 10.181 (±28.1%) seconds
Random Read: 1119.328 (±308.1%) op/s

Then on the same machine, but via Vagrant(v2.2.6) & Virtualbox(v6.1.0). VM was running Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-29-generic x86_64). Assigned only 1GB of RAM and a single core. Ruby v2.6.5 again via rbenv. Postgresql v10.6 OS package.

Running benchmark on 1 core
Topic View: 2.789 (±10.1%) op/s
Topic Create: 1.822 (±4.7%) op/s
Post Spec: 39.85 (±4.0%) seconds
PBKdf2 64000 sha256: 5.55 (±23.3%) op/s
Cook README.md: 43.146 (±37.4%) op/s
Discourse Boot: 3.774 (±3.5%) seconds
Random Read: 3056.49 (±413.0%) op/s

Just a quick test, but Discourse Boot and Random Read are nearly 3x faster inside VM.

3 Likes

I ran a ruby-prof around a single run of Random Read test. Comparing macOS vs a VM on macOS:

macOS

Ubuntu VM on macOS

On the VM random_file is 22% of the time, compared to 4% on macOS. Assuming they take the same real world time on both VM and macOS then it suggests Class::IO#read is 5x slower on macOS?

3 Likes

I simplified the test to remove all the PG/Redis stuff and the thread forking. It just created the files and ran the 11 random reads. I benchmarked with benchmark-ips and this is what i got:

RANDOM_FILES = 5_000
RANDOM_FILE_SIZE = 10_000

macOS
random_read      1.712k (±10.3%) i/s -      8.550k in   5.069574s

VM on macOS
random_read      4.593k (± 8.6%) i/s -     22.984k in   5.052378s

And I tested with fewer files:

RANDOM_FILES = 50
RANDOM_FILE_SIZE = 10_000

macOS
random_read      1.866k (± 6.2%) i/s -      9.324k in   5.020224s

VM on macOS
random_read      6.176k (± 8.0%) i/s -     30.798k in   5.024956s

More testing. With tiny files the performance difference increases

macOS:

Creating 500 random files for IO tests of size 50000
random_read    593.334  (±17.4%) i/s -      2.856k in   5.010175s

Creating 500 random files for IO tests of size 5
random_read      1.651k (± 5.8%) i/s -      8.296k in   5.041559s

VM on macOS:

Creating 500 random files for IO tests of size 50000
random_read      2.580k (±25.0%) i/s -     12.250k in   5.308382s

Creating 500 random files for IO tests of size 5
random_read     10.451k (± 7.5%) i/s -     52.377k in   5.044150s

I tried to install valgrind/callgrind to profile the C code. But so far not had any success in getting that to work.

UPDATE 2020-04-01 (I cannot make any more posts as new user, so editing this)

I gave up on valgrind and tried Apple’s Instruments. I ran a Time Profile while running my benchmark-ips test. Seems like the majority of the time is being spent in open_key_argshttps://github.com/ruby/ruby/blob/d214c188e48db5049fba8049a57dd6f73b1a57b0/io.c#L10614

Drilling in more gives:

5 Likes

I profiled randomly reading files with File.read(random_file) on both macOS (10.14.6) and then repeated on a VM on the same computer: Vagrant(2.2.6), Virtualbox(6.1.0), Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-29-generic x86_64). Both are running Ruby 2.6.5.

Here is the key areas of the profile.

macOS:

Ubuntu VM on macOS:

So maybe open_key_args is running much slower on macOS?

3 Likes

On macOS it looks like a lot of time is spent in rb_cloexec_open ruby/io.c at d214c188e48db5049fba8049a57dd6f73b1a57b0 · ruby/ruby · GitHub

If I am understanding this correctly, then in the VM very little time is spent in rb_cloexec_open:

I don’t know C, so I cannot really go much further with this. But if anyone needs anymore info then please let me know.

5 Likes

@g101 your profiles are interesting. They show that most of the time is spent either in the open syscall itself, or waiting to for it (there is some busy looping if it would block).

At this stage it would be interesting to reduce the benchmark to a simple C program doing lots of fopen.

I’m starting to wonder wether open on OSX might be slow because of the extra layers of file system security, with signed binaries and all they added a few versions back (System Integrity Protection - Wikipedia), in the same way that SELinux and AppArmor, or many anti-virus on Windows can have performance overhead on file access as well (but much lower than what is witnessed here).

5 Likes

Newer Macbooks have the T2 chip so I think even encrypted APFS is meant to be fast.

I have no idea if this is relevant, but it discuss slow I/O reads on OS X using C++. And the issue seemed to come down to compiler options c++ - LLVM Clang produces extremely slow input/output on OS X - Stack Overflow

3 Likes

I’m not talking about encryption overhead, here.

I suggest that everytime you open a file (open("/foo/bar")), the kernel has to check if the process is allowed to open that path (not just unix permissions). I’m suspecting this check is slow.

This could be true with the latest introduction of the security thing which we see on Catalina. When you’re trying to open something, macOS just asks for permission. It’s possible that even when running Rails apps, it just tries every file we open with that same check.

Whenever I’m developing rails apps in docker on a Mac I always I have to mount an NFS volume to get anywhere near decent performance. Our app not only needs to load a lot of Ruby files, but also a bunch of static assets during each page load. Hitting our app’s dashboard after a cold boot can take minutes. Interesting to hear from this article that there might be something inherently slow about the Mac’s filesystem.

2 Likes

The Docker file system performance issues may not be related, as a lot of those come from the fact than docker on a mac runs in a Linux VM, and Docker for mac uses a “bridge” to allow this VM (and thus your containers) to access your local filesystem. This has a huge performance impact.

2 Likes

Right, that’s true. I’m wondering however if the MacOS filesystem contributes to the overall slowness in some capacity, considering the article mentions reading many small files is slower than when using other operating systems.

1 Like

I am not a GoLang expert but I created some code to randomly read small files. I created a folder with 10,000 files of size 100 bytes.

import (
	"fmt"
	"io/ioutil"
	"math/rand"
	"time"
)

func main() {
	rand.Seed(time.Now().UnixNano())
	start := time.Now()
	for i := 0; i < 10000; i++ {
		file := fmt.Sprint("random_files/", rand.Intn(10000))
		_, err := ioutil.ReadFile(file)
		if err != nil {
			fmt.Println("Err")
		}
	}
	t := time.Now()
	difference := t.Sub(start)
	fmt.Printf("difference = %v\n", difference)
}

On macOS Mojave 10.14.6 the results were:

user$ ./gofile 
difference = 3.300336361s
user$ ./gofile 
difference = 1.16395616s
user$ ./gofile 
difference = 693.500542ms
...
user$ ./gofile 
difference = 240.043622ms

After multiple runs the time eventually leveled off at around 240ms. I assume once all the files were cached by the OS.

Running the same code on same computer but inside a VirtualBox VM running Ubuntu 18.04 LTS, I got the results:

$ ./gofile 
difference = 137.963123ms
$ ./gofile 
difference = 129.356798ms
$ ./gofile 
difference = 130.005818ms
$ ./gofile 
difference = 133.438552ms
$ ./gofile 
difference = 115.163246ms
$ ./gofile 
...
$ ./gofile 
difference = 97.664743ms

This time leveling off around 100ms.

3 Likes

Wow so MacOS is almost 2.5x slower for that test :frowning:

2 Likes

MacOS is almost 2.5x slower for that test

Yep. So that go test proves that as suspected above the performance issue is somewhere in Apple’s code.

I doubt there’s much that can be done beside making a lot of noise about it, and hope Apple fixes it some day.

We could also try to workaround it by bundling multiple small files into one, but that is a quite tricky, and hard to justify for working around an OSX specific bug.