Neural Science and AR/VR limitations

Two years ago, after reading the glowing media reviews of upcoming VR headsets in virtually every tech publication, I chose to attend the Game Developers Conference (GDC) in San Francisco to check out the VR headsets myself. Sure enough, the trade show aisles were peppered with booths sporting Microsoft HoloLens, Oculus Rift, HTC Vive, Sony PlayStation VR and more.

I expected to be swept off my feet by a lifelike, immersive experience. I was not. The dirty little secret of first-generation VR headsets is that the user experience is overhyped by the VR companies as well as by the press. I predict that they will not be embraced by mainstream consumers. After the initial sell-in to early adopters, VR companies will have a hard time convincing the majority of customers, who expect an experience exceeding that of a modern movie theater or console-style video games. No matter what anyone else tells you, the resolution of the Oculus Rift and pretty much all new VR headsets is inadequate. The recent Magic Leap fiasco just confirms the sad state of AR/VR product quality. I doubt Apple will do much better.

How the engineers at Facebook and HTC, as well as technical editors at respected technical publications, could conclude that the current experience is acceptable is beyond me. The only explanation I can come up with is that, being engineers, they focused too much on the technical aspects of their problem and paid little attention to the perceptual limits of human vision.

I decided to perform a quick, back-of-the-envelope analysis comparing a 4K TV with a VR experience, and this is what I came up with. Typically, TV manufacturers recommend a 20-degree field of view (FOV) for the best viewing experience. Modern 4K TV sets offer about 8 million pixels, which translates to about 20,000 pixels per square degree of FOV (that is, counting both the X and Y dimensions). The Oculus Rift and HTC Vive have almost identical specs, both providing about 2.5 million pixels for a 110-degree FOV, which translates to just over 200 pixels per square degree, i.e. roughly 100 times less than a 4K TV. Human eyes are very sensitive: most of us (with the notable exception of CNET TV editors) can easily tell the difference between an HDTV and a 4K TV picture on a 65-inch screen at a viewing distance of 8 feet, which for a 4K TV is not too close. VR displays are much closer still, right in front of our eyes rather than in our pockets, where hapless phone displays live most of the time. But how much resolution does a VR display need to deliver a satisfying experience?
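The arithmetic above can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope estimate, using the rounded figures quoted in the text; it assumes a square field of view with pixels spread evenly across it, and the function name is made up for illustration.

```python
def pixels_per_square_degree(total_pixels, fov_deg):
    """Average pixel density, assuming a square field of view
    (fov_deg x fov_deg) with pixels spread evenly across it."""
    return total_pixels / (fov_deg * fov_deg)

# Rounded figures quoted in the text.
tv_density = pixels_per_square_degree(8_000_000, 20)    # 4K TV at a 20-degree FOV
vr_density = pixels_per_square_degree(2_500_000, 110)   # Rift / Vive at a 110-degree FOV
ratio = tv_density / vr_density

print(f"4K TV:      {tv_density:,.0f} pixels per square degree")
print(f"VR headset: {vr_density:,.0f} pixels per square degree")
print(f"The TV is roughly {ratio:.0f} times denser")
```

With these rounded inputs the TV comes out at 20,000 pixels per square degree and the headset at a little over 200, so the ratio lands close to 100.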

To match the visual acuity of the human eye, a VR display would need about 100,000 pixels per square degree of FOV. So while a 4K TV in a typical viewing situation is off by a factor of 5, VR displays are off by a factor of about 500. This is why the images on VR headsets seem so blurry, not even approaching the “uncanny valley” for the suspension of disbelief. This becomes even more of a problem for AR displays, which by definition must blend natural and computer-generated images. Michael Abrash, the chief scientist at Oculus, spent a considerable amount of time at the last Facebook F8 conference emphasizing the visual perception research his team is doing. Judging by their first two commercial products, I am not sure they are on the right track.
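The two shortfall factors follow directly from the densities estimated earlier. A minimal sketch, taking the 100,000 pixels per square degree target and the rounded display figures at face value (the variable names are illustrative, and a square field of view is assumed):

```python
ACUITY_TARGET = 100_000                     # pixels per square degree, as stated in the text

tv_density = 8_000_000 / (20 * 20)          # 4K TV over a 20-degree square FOV
vr_density = 2_500_000 / (110 * 110)        # VR headset over a 110-degree square FOV

tv_shortfall = ACUITY_TARGET / tv_density
vr_shortfall = ACUITY_TARGET / vr_density

print(f"4K TV falls short by a factor of {tv_shortfall:.0f}")
print(f"VR headset falls short by a factor of {vr_shortfall:.0f}")
```

The headset factor computes to a little under 500; rounding its density down to 200 pixels per square degree gives the round figure of 500.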

Could Oculus do better? I think they could. Palmer Luckey used to complain that no Apple computer was fast enough to recalculate the pixels for the Rift's 3D display, and only a small minority of PCs today have enough graphics horsepower to drive a Rift. In my view, the solution is not to increase the static display resolution (and the computing power needed to drive it) by a factor of hundreds. Rather, we need to replace the static frame buffer with a small, dynamic foveal display. Why do I think this? Basic neural science tells us that the highest density of cones in the human retina, the cells which allow us to see sharp details and color, lies within a 1.5 square mm area of the retina known as the fovea, which translates to around 1 degree of FOV. The rest is peripheral vision, used for motion detection. When our peripheral vision detects something of interest, our finely tuned ocular muscles rotate both eyes in such a way that light rays from the object of interest fall precisely on the fovea. The motion which accomplishes this task is called a saccade, and in each second of our waking life (and some of our sleeping time) we make 3 to 5 saccades, so our eyes get quite a workout. Just like our other senses, our eyes cannot be separated from our bodies. We say that they are embodied, a term that emphasizes the contrast with the disembodied nature of the computing solutions produced by engineers today.

The technology elements needed to create such dynamic foveal displays exist today, and I can imagine a number of design options. While this task is certainly challenging, it is a far more worthy approach than the present insistence of VR engineers on mindlessly increasing the resolution of static displays. This reminds me of Intel’s unsuccessful efforts some years ago to increase the clock speed of its single-core CPUs beyond 4 GHz, rather than adopting the parallel architectures which work so well today in the Nvidia and AMD GPUs powering VR displays. I believe that the first-generation VR headset companies will ultimately pay the price for their ignorance of human visual perception.

But there is hope on the horizon. The recent emergence of foveated rendering is a small step in the right direction which, alas, still keeps the expensive static display in place. The real solution is to build a moving foveal (not just foveated) display with far fewer pixels, which follows our gaze and dynamically projects all of its pixels solely onto the fovea after each saccadic motion. We could also add a second, low-resolution black-and-white display for peripheral vision, whose main job is detecting the motion of objects appearing in our field of vision as we move around our environment. Such a system would match human visual acuity using a small fraction of the transistors and energy needed for a static display. To my knowledge, not a single AR/VR headset manufacturer has thought of it. Why? The answer again may lie in the engineering mentality of Silicon Valley companies focused solely on disembodied, symbolic techniques.
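To get a feel for how small that fraction might be, here is a rough pixel budget built from the figures used earlier. It is a sketch under stated assumptions, not a design: fields of view are treated as square, the foveal display covers the 1-degree fovea, and the peripheral display is assumed (hypothetically) to run at about the density of today's headsets, 200 pixels per square degree. Gaze-tracking error, optics and the mechanics of steering the display are ignored.

```python
ACUITY_PX_PER_SQ_DEG = 100_000     # acuity target used in this article
HEADSET_FOV_DEG = 110              # headset field of view used in this article
FOVEA_FOV_DEG = 1                  # approximate foveal field of view
PERIPHERAL_PX_PER_SQ_DEG = 200     # assumption: roughly today's headset density

def square_fov_pixels(density, fov_deg):
    """Pixels needed to cover a square fov_deg x fov_deg field at a given density."""
    return density * fov_deg * fov_deg

# A static display that matches acuity across the whole field of view.
static_pixels = square_fov_pixels(ACUITY_PX_PER_SQ_DEG, HEADSET_FOV_DEG)

# A gaze-following foveal display plus a low-resolution peripheral display.
foveal_pixels = square_fov_pixels(ACUITY_PX_PER_SQ_DEG, FOVEA_FOV_DEG)
peripheral_pixels = square_fov_pixels(PERIPHERAL_PX_PER_SQ_DEG, HEADSET_FOV_DEG)
two_display_pixels = foveal_pixels + peripheral_pixels

fraction = two_display_pixels / static_pixels

print(f"Static display:      {static_pixels:,} pixels")
print(f"Foveal + peripheral: {two_display_pixels:,} pixels")
print(f"Fraction of static:  {fraction:.2%}")
```

Under these assumptions the static display needs about 1.2 billion pixels, while the two-display arrangement needs about 2.5 million, roughly 0.2% as many. A practical design would likely need a foveal patch wider than 1 degree to absorb tracking error, which would raise the budget but not change the order of magnitude much.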

The foveal display is an almost trivial application of basic neural science, applied to a low-level brain structure, i.e. the retina. The real bonanza lies in the application of neural science at the level of higher brain structures, such as the superior colliculus for sensor fusion, the hippocampus for navigation and memory, and the entorhinal cortex, which accomplishes multimodal integration and motor action planning. Importantly, motor actions include the use of our muscles to position our bodily sensors in the environment in active anticipation of expected stimuli. It turns out that a large portion of the human neocortex is devoted to real-time multimodal sensor/motor correlation, which is, in fact, one of the most important functions of the brain.

In summary, the golden age of AR/VR is ahead of us, but in order to get there we will have to include brain science and its practitioners in our arsenal of tools, technologies, and human resources.