Headsets in the thousand-dollar range are plenty good and still not selling. Take the hint. Push costs down. Cut out everything that is not strictly necessary. Less Switch, more Game Boy.
6DOF inside-out tracking is required, but you can get that from one camera and an orientation sensor. Is it easy? Nope. Is it tractable for any of the companies already making headsets? Yes, obviously. People want pick-up-and-go immersion. Lighthouses were infrastructure and Cardboard was not immersive. Proper tracking in 3D space has to Just Work.
Latency is intolerable. Visual quality, scene detail, shader complexity - these are nice back-of-the-box boasts. Instant response time is do-or-die. Some monocular 640x480 toy with rock-solid 1ms latency would feel more real than any ultrawide 4K pancake monstrosity that’s struggling to maintain 10ms.
Two innovations could make this painless.
One, complex lenses are a hack around flat lighting. Get rid of the LCD backlight and use one LED. This simplifies the ray diagram to be nearly trivial. Only the point light source needs to be far from the eye. The panel and its single lens can be right in your face. Or - each lens can be segmented. The pyramid shape of a distant point source gets smaller, and everything gets thinner. At some point the collection of tiny projectors looks like a lightfield, which is what we should pursue anyway.
Two, intermediate representation can guarantee high performance, even if the computer chokes. It is obviously trivial to throw a million colored dots at a screen. Dice up a finished frame into floating paint squares, and an absolute potato can still rotate, scale, and reproject that point-cloud, hundreds of times per second. But flat frames are meant for flat screens. Any movement at all reveals gaps behind everything. So: send point-cloud data, directly. Do “depth peeling.” Don’t do backface culling. Toss the headset a version of the scene that looks okay from anywhere inside a one-meter cube. If that takes longer for the computer to render and transmit… so what? The headset’s dinky chipset can show it more often than your godlike PC, because it’s just doing PS2-era rendering with microsecond-old head-tracking. The game could crash and you’d still be wandering through a frozen moment at 100, 200, 500 Hz.
I think the slightly more viable version of the rendering side is to send a combination of low res point cloud + low res large FOV frames for each eye with detailed (!) depth map + more detailed image streamed where the eye is focused (with movement prediction) + annotation on what features are static and which changed and where light sources are, allowing the headset to render the scene with low latency and continuously update the recieved frames based on movement with minimal noticable loss in detail, tracking stuff like shadows and making parallax with flawlessly even if the angle and position of the frame was a few degrees off.
Undoubtedly point-clouds can be beaten, and adding a single wide-FOV render is an efficient way to fill space “offscreen.” I’m just cautious about explaining this because it invites the most baffling rejections. At one point I tried explaining the separation of figuring out where stuff is, versus showing that location to you, using beads floating in a fluid simulation. Tracking the liquid and how things move within it is obviously full of computer-melting complexity. Rendering a dot, isn’t. And this brain case acted like I’d described simulating the entire ocean for free. As if the goal was plucking all future positions out of thin air, and not, y’know, remembering where it is, now.
The lowest-bullshit way is probably frustum slicing. Picture the camera surrounded by transparent spheres. Anything between two layers gets rendered onto the further one. This is more-or-less how “deep view video” works. (Worked?) Depth information can be used per-layer to create lumpen meshes or do parallax mapping. Whichever is cheaper at obscene framerates. Rendering with alpha is dirt cheap because it’s all sorted.
Point clouds (or even straight-up original geometry) might be better at nose-length distances. Separating moving parts is almost mandatory for anything attached to your hands. Using a wide-angle point render instead of doing a cube map is one of several hacks available since Fisheye Quake, and a great approach if you expect to replace things before the user can turn around.
But I do have to push back on active fake focus. Lightfields are better. Especially if we’re distilling the scene to be renderable in a hot millisecond, there’s no reason to motorize the optics and try guessing where your pupils are headed. Passive systems can provide genuine focal depth.
My suggestions are mostly about maintaining quality while limiting bandwidth requirements to the headset, wouldn’t a lightfield require a fair bit of bandwidth to keep updated?
(Another idea is to annotate moving objects with predicated trajectories)
Less than you might think, considering the small range perspectives involved. Rendering to a stack of layers or a grid of offsets technically counts. It is more information than simply transmitting a flat frame… but update rate isn’t do-or-die, if the headset itself handles perspective.
Optimizing for bandwidth would probably look more like depth-peeled layers with very approximate depth values. Maybe rendering objects independently to lumpy reliefs. The illusion only has to work for a fraction of a second, from about where you’re standing.
Alpha-blending is easy because, again, it is a set of sorted layers. The only real geometry is some crinkly concentric spheres. I wouldn’t necessarily hand-wave Silent Hill 2 levels of subtlety, with one static moment, but even uniform fog would be sliced-up along with everything else.
Reflections are handled as cutouts with stuff behind them. That part is a natural consequence of their focus on lightfield photography, but it could be faked somewhat directly by rendering. Or you could transmit environment maps and blend between those. Just remember the idea is to be orders of magnitude more efficient than rendering everything normally.
I thought the windows MR lineup filled that gap pretty well. Much cheaper than most of the other alternatives back then but it never really took off and MS has quietly dropped it.
Still $300 or $400 for a wonky platform. That’s priced better than I thought they were, but the minimum viable product is far below that, and we might need a minimal product, to improve adoption rates. The strictly necessary components could total tens of dollars… off the shelf.
Quality is irrelevant, reduce retail price.
Headsets in the thousand-dollar range are plenty good and still not selling. Take the hint. Push costs down. Cut out everything that is not strictly necessary. Less Switch, more Game Boy.
6DOF inside-out tracking is required, but you can get that from one camera and an orientation sensor. Is it easy? Nope. Is it tractable for any of the companies already making headsets? Yes, obviously. People want pick-up-and-go immersion. Lighthouses were infrastructure and Cardboard was not immersive. Proper tracking in 3D space has to Just Work.
Latency is intolerable. Visual quality, scene detail, shader complexity - these are nice back-of-the-box boasts. Instant response time is do-or-die. Some monocular 640x480 toy with rock-solid 1ms latency would feel more real than any ultrawide 4K pancake monstrosity that’s struggling to maintain 10ms.
Two innovations could make this painless.
One, complex lenses are a hack around flat lighting. Get rid of the LCD backlight and use one LED. This simplifies the ray diagram to be nearly trivial. Only the point light source needs to be far from the eye. The panel and its single lens can be right in your face. Or - each lens can be segmented. The pyramid shape of a distant point source gets smaller, and everything gets thinner. At some point the collection of tiny projectors looks like a lightfield, which is what we should pursue anyway.
Two, intermediate representation can guarantee high performance, even if the computer chokes. It is obviously trivial to throw a million colored dots at a screen. Dice up a finished frame into floating paint squares, and an absolute potato can still rotate, scale, and reproject that point-cloud, hundreds of times per second. But flat frames are meant for flat screens. Any movement at all reveals gaps behind everything. So: send point-cloud data, directly. Do “depth peeling.” Don’t do backface culling. Toss the headset a version of the scene that looks okay from anywhere inside a one-meter cube. If that takes longer for the computer to render and transmit… so what? The headset’s dinky chipset can show it more often than your godlike PC, because it’s just doing PS2-era rendering with microsecond-old head-tracking. The game could crash and you’d still be wandering through a frozen moment at 100, 200, 500 Hz.
I think the slightly more viable version of the rendering side is to send a combination of low res point cloud + low res large FOV frames for each eye with detailed (!) depth map + more detailed image streamed where the eye is focused (with movement prediction) + annotation on what features are static and which changed and where light sources are, allowing the headset to render the scene with low latency and continuously update the recieved frames based on movement with minimal noticable loss in detail, tracking stuff like shadows and making parallax with flawlessly even if the angle and position of the frame was a few degrees off.
Undoubtedly point-clouds can be beaten, and adding a single wide-FOV render is an efficient way to fill space “offscreen.” I’m just cautious about explaining this because it invites the most baffling rejections. At one point I tried explaining the separation of figuring out where stuff is, versus showing that location to you, using beads floating in a fluid simulation. Tracking the liquid and how things move within it is obviously full of computer-melting complexity. Rendering a dot, isn’t. And this brain case acted like I’d described simulating the entire ocean for free. As if the goal was plucking all future positions out of thin air, and not, y’know, remembering where it is, now.
The lowest-bullshit way is probably frustum slicing. Picture the camera surrounded by transparent spheres. Anything between two layers gets rendered onto the further one. This is more-or-less how “deep view video” works. (Worked?) Depth information can be used per-layer to create lumpen meshes or do parallax mapping. Whichever is cheaper at obscene framerates. Rendering with alpha is dirt cheap because it’s all sorted.
Point clouds (or even straight-up original geometry) might be better at nose-length distances. Separating moving parts is almost mandatory for anything attached to your hands. Using a wide-angle point render instead of doing a cube map is one of several hacks available since Fisheye Quake, and a great approach if you expect to replace things before the user can turn around.
But I do have to push back on active fake focus. Lightfields are better. Especially if we’re distilling the scene to be renderable in a hot millisecond, there’s no reason to motorize the optics and try guessing where your pupils are headed. Passive systems can provide genuine focal depth.
That last paper is from ten years ago.
My suggestions are mostly about maintaining quality while limiting bandwidth requirements to the headset, wouldn’t a lightfield require a fair bit of bandwidth to keep updated?
(Another idea is to annotate moving objects with predicated trajectories)
Less than you might think, considering the small range perspectives involved. Rendering to a stack of layers or a grid of offsets technically counts. It is more information than simply transmitting a flat frame… but update rate isn’t do-or-die, if the headset itself handles perspective.
Optimizing for bandwidth would probably look more like depth-peeled layers with very approximate depth values. Maybe rendering objects independently to lumpy reliefs. The illusion only has to work for a fraction of a second, from about where you’re standing.
How does it handle stuff like fog effects, by the way? Can it be made to work (efficiently) with reflections?
The “deep view” link has video - and interactive online demos.
Alpha-blending is easy because, again, it is a set of sorted layers. The only real geometry is some crinkly concentric spheres. I wouldn’t necessarily hand-wave Silent Hill 2 levels of subtlety, with one static moment, but even uniform fog would be sliced-up along with everything else.
Reflections are handled as cutouts with stuff behind them. That part is a natural consequence of their focus on lightfield photography, but it could be faked somewhat directly by rendering. Or you could transmit environment maps and blend between those. Just remember the idea is to be orders of magnitude more efficient than rendering everything normally.
Admittedly you can kinda see the gaps if you go looking.
I thought the windows MR lineup filled that gap pretty well. Much cheaper than most of the other alternatives back then but it never really took off and MS has quietly dropped it.
Still $300 or $400 for a wonky platform. That’s priced better than I thought they were, but the minimum viable product is far below that, and we might need a minimal product, to improve adoption rates. The strictly necessary components could total tens of dollars… off the shelf.
deleted by creator