Project Aura: Intent-Aware AR Interfaces

Sumukh Bettadapura, Rohan Sathish, Ajan Subramanian, Rajath Magaji and David Francisco

Every so often we're made to rethink what a computer is, or what it can do. Fundamentally, we believe that computers are machines that allow humans to express themselves. We can express ourselves through our work, or by messaging a friend. We can express ourselves by video calling family members, or by watching a video that we love. And we believe that the fundamental unit of all of these kinds of expression is attention. What we choose to attend to is an act of valuation, where every fixation is a micro-decision about what's worth your limited cognitive resources. And while computers have vastly expanded what we can attend to, the volume of information now competing for our attention can easily become overwhelming.

At Kubo, we've come to the realization that the necessary paradigm shift, the rethinking of the computer, lies in solving two problems.

For starters, while computers are currently a platform for attention, can we get them to truly understand what you're attending to? Can we get them to understand you in a deeper, more intuitive way, where they can mould digital spaces to you? And secondly, how do we seamlessly blend the digital with the physical, in a fluid, intentional way? How do we stop it from feeling like you're teleported to another world every time you check your phone?

Project Aura is our attempt at solving these problems. It's our vision of an affective intelligence layer that can be deployed on any computer system, one that makes the leap from requiring conscious human input to the computer implicitly understanding your intention, transforming itself to feel like an extension of you and your current cognitive state.

Our intuition here is simple, and it comes from watching people. A good teacher sees confusion on your face before you raise your hand. A barista reads your hesitation between two drinks. A friend knows when to talk and when to just be present. They're reading something deeper than your actions.

The body as a signal

Before you consciously process an event, before you form an opinion, before you decide, your body has already responded. Your pupils dilate before you know you're interested. Your skin conductance shifts before you know you're surprised. Your heart rate changes before you register stress. Your brain generates error-related potentials (ErrPs) when it encounters something unexpected. These signals are often more honest than the story you consciously construct, because they haven't been filtered through self-narrative or social performance.

We've imagined Aura to work best with AR, because you can get the computer to see what you see, which is one of the richest modalities of context that you could provide. More importantly, it lets us tap into one of the most informative windows into your internal state: your pupils.

The pupil, in particular, has a vocabulary that neuroscience has been documenting for over sixty years.1

This isn't one signal. It's insight into whether someone is scanning or deciding, confused or certain, exploring options or committing to one.

How do you know someone's interested?

This was the first question we sat with, and the first assumption we had to let go of.

The standard approach in eye tracking is dwell time. If someone looks at something for a long time, they must be interested. It's simple and measurable, and it's the basis of most gaze interaction today. But early on, something felt off. People stare at things that confuse them. They fixate on text they can't parse. They linger on things that make them uncomfortable. Dwell time tells you that someone is looking. It doesn't tell you why.

We now understand that interest has a shape in the pupil that dwell time can't see, and it was first documented over sixty years ago. Your pupils dilate for things you find interesting and constrict for things you find aversive. They dilate for curiosity, and the response is reliable enough to distinguish positive from negative engagement.2

When we ran our own analyses on egocentric data from the Visual Experience Database, we found these patterns held beautifully outside the lab. A person scanning a kitchen counter, deciding what to reach for. The gaze might rest on several items for similar durations. But the pupil tells us which of those fixations carried novelty and genuine interest.

Left: A standard gaze heatmap over a real-world scene, showing where someone looked. All fixation points seem the same. Right: The same scene with pupil-weighted overlay. Now you can see which fixations carried genuine interest and which were confusion or passive scanning. The difference is immediately visible.
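To make the contrast concrete, here's a minimal sketch, in Python, of ranking what the gaze lands on by baseline-corrected pupil dilation rather than by dwell time alone. The item names, numbers, and the simple baseline correction are illustrative assumptions, not Aura's actual pipeline.

```python
# A minimal sketch: weight each fixation by relative pupil dilation,
# not just by how long the gaze rested there.
from dataclasses import dataclass

@dataclass
class Fixation:
    target: str          # what the gaze landed on
    duration_ms: float   # dwell time
    pupil_mm: float      # mean pupil diameter during the fixation

def interest_scores(fixations, baseline_mm):
    """Rank targets by baseline-corrected dilation instead of raw dwell time."""
    scores = {}
    for f in fixations:
        # Positive = dilation above baseline (interest); negative = constriction (aversion).
        dilation = (f.pupil_mm - baseline_mm) / baseline_mm
        scores[f.target] = scores.get(f.target, 0.0) + dilation * (f.duration_ms / 1000.0)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Two items with identical dwell times can score very differently.
fixations = [
    Fixation("olive oil", duration_ms=800, pupil_mm=4.1),  # hypothetical values
    Fixation("spice jar", duration_ms=800, pupil_mm=3.4),
]
print(interest_scores(fixations, baseline_mm=3.6))
```

Dwell time alone would treat both items identically; the dilation-weighted score separates the one that sparked interest from the one that didn't.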

Imagine you're browsing: shelves, menus, a market stall. One item has a subtle warm glow with a contextual detail surfaced. Adjacent items that you merely scan remain clean and quiet.

The design implication delighted us. The system knows the difference between "I'm looking at this" and "I want this."

Can you see a decision before it's made?

This is the finding that changed everything for us.

We expected the pupil to be a good reporter, telling us only what someone just experienced, what they just felt. What we didn't expect was that it's also a remarkably good predictor. It anticipates decisions. The signal leads the action, not the other way around. Pupil dynamics track decision trajectories and precede voluntary actions.3

Consistently. In our research on the relationship between pupil dilation and visual feature diversity, we found that pupil responses don't just track what's in front of someone; they anticipate the novelty of what's coming.

Imagine a real-world moment where you’re reaching for an object, or turning toward something. The action is marked on the timeline. The pupil dilation trace begins its ramp clearly before the action occurs. The gap between the pupil signal and the action is highlighted.

Think about what this means for an interface. Current systems are reactive, where they wait for you to tap, speak, gesture, or look. By the time the system responds, you've already spent the cognitive effort of translating your intent into an input. But if the system can read the anticipatory signal, the decision forming in your body before it reaches your fingers, it can begin preparing its response before you've consciously asked for one.

This is the difference between an interface that responds to what you did and one that's ready for what you're about to do. It doesn't feel faster. It feels like it understands.
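As a rough illustration of how that readiness could work, here is a hedged sketch: a rolling window of pupil samples, a least-squares slope to catch a sustained upward ramp, and a trigger to start preparing content before the explicit action arrives. The window size and slope threshold are placeholder assumptions, not constants validated in our research.

```python
# A sketch of anticipatory-ramp detection over a stream of pupil samples.
from collections import deque

class AnticipationDetector:
    def __init__(self, window=30, slope_threshold=0.004):
        # window: number of samples (roughly 0.25 s at 120 Hz); threshold in mm per sample.
        # Both values are illustrative, not calibrated.
        self.samples = deque(maxlen=window)
        self.slope_threshold = slope_threshold

    def update(self, pupil_mm: float) -> bool:
        """Return True when the recent pupil trace shows a sustained upward ramp."""
        self.samples.append(pupil_mm)
        if len(self.samples) < self.samples.maxlen:
            return False
        n = len(self.samples)
        mean_x = (n - 1) / 2
        mean_y = sum(self.samples) / n
        num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(self.samples))
        den = sum((x - mean_x) ** 2 for x in range(n))
        return (num / den) > self.slope_threshold

detector = AnticipationDetector()
for sample in [3.50 + 0.01 * i for i in range(40)]:  # synthetic ramp
    if detector.update(sample):
        print("ramp detected: start surfacing context for the gazed-at object")
        break
```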

Imagine you're navigating a foreign street, looking at signs you can't fully read. Confusion is detected on specific characters. Translations appear anchored to those signs, subtle and spatial. As your pupil response normalises, the translations fade.

When are you exploring, and when have you decided?

There's a mode you're in when you walk into a new place and your eyes are everywhere. Scanning, sampling, building a sense of what's here. And there's a different mode, the one where you've found something and you're weighing it seriously. You know the feeling of each. We call these modes exploration and exploitation, and the pupil behaves differently in each.4

This was a pivotal insight for us, because these two modes need completely opposite things from an interface.

Imagine light ambient labels, gentle category markers, nothing pulling focus. The pupil trace below shows an elevated, variable baseline. Once you engage, the AR interface shifts: richer detail, specific context, something actionable.

A person exploring wants breadth. Options, awareness, the lay of the land. Surfacing deep detail in this state is noise. It clutters the scanning process. A person exploiting wants depth. Specifics, context, support for the decision that's forming. Staying minimal in this state feels like the system isn't paying attention.

Every interface you've ever used treats these moments the same way. Same information, same layout, same density, regardless of whether you're browsing or committing. The pupil lets you match the response to the mode, in real time, without asking.
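Here's one way that mode switch could be sketched, borrowing the tonic-baseline framing from the adaptive-gain literature cited above: a high, variable baseline plus widely dispersed gaze reads as exploration; a settled baseline with clustered gaze reads as exploitation. The thresholds are illustrative assumptions that would need per-user calibration, not the classifier we actually run.

```python
# A rough exploration-vs-exploitation sketch from baseline pupil size and gaze dispersion.
import statistics

def classify_mode(pupil_mm, gaze_points,
                  tonic_threshold_mm=4.0, dispersion_threshold=0.15):
    """pupil_mm: recent baseline samples; gaze_points: (x, y) in normalised view coords."""
    tonic = statistics.mean(pupil_mm)
    xs, ys = zip(*gaze_points)
    dispersion = statistics.pstdev(xs) + statistics.pstdev(ys)
    if tonic > tonic_threshold_mm and dispersion > dispersion_threshold:
        return "exploring"   # keep the overlay light: ambient labels, nothing pulling focus
    return "exploiting"      # surface depth: rich detail for the attended item

print(classify_mode([4.3, 4.2, 4.4], [(0.1, 0.2), (0.8, 0.7), (0.4, 0.9)]))        # exploring
print(classify_mode([3.6, 3.5, 3.6], [(0.52, 0.48), (0.55, 0.50), (0.53, 0.49)]))  # exploiting
```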

Imagine you’re walking through a market or a neighbourhood. Imagine a detail that matters, something personally relevant, context for the decision at hand.

Can you catch confusion before it becomes frustration?

Here's where the data surprised us the most.

Working with the OneStop dataset, eye-tracking from 180 subjects reading natural English text across thousands of words and over 1 million individual fixations, we found that confusion and difficulty have a signature you can detect. Fixation durations stretch. The eyes start jumping backward, regressions to earlier text, as if the brain is trying to re-parse something it missed. And the pupil follows a jittery, erratic dilation pattern that looks nothing like the smooth ramp of genuine interest. There's a window, sometimes several seconds long, where the body knows it's struggling but the conscious mind hasn't quite admitted it yet.5

Look at clean forward saccades on easy passages, then a visible cluster of regressions and lengthened fixations on a difficult section. The pupil trace below shows the corresponding dilation spike. The moment of confusion is marked in the data, seconds before the reader would consciously say "I'm lost."
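Here's a simplified sketch of how that signature could be scored from a fixation stream, combining stretched fixations, backward regressions, and pupil jitter. The weights and cut-offs are placeholders, not values fitted on the OneStop data.

```python
# A sketch of a reading-difficulty score from fixation durations, regressions, and pupil jitter.
import statistics

def confusion_score(fixations):
    """fixations: list of dicts with 'word_index', 'duration_ms', 'pupil_mm' (in reading order)."""
    durations = [f["duration_ms"] for f in fixations]
    pupils = [f["pupil_mm"] for f in fixations]
    # 1. Stretched fixations.
    long_fixations = sum(d > 350 for d in durations) / len(durations)
    # 2. Regressions: the eyes jump back to earlier words.
    regressions = sum(
        b["word_index"] < a["word_index"] for a, b in zip(fixations, fixations[1:])
    ) / max(len(fixations) - 1, 1)
    # 3. Jittery pupil: sample-to-sample variability rather than a smooth ramp.
    jitter = statistics.pstdev([b - a for a, b in zip(pupils, pupils[1:])])
    return 0.4 * long_fixations + 0.4 * regressions + 0.2 * min(jitter / 0.1, 1.0)

fluent = [{"word_index": i, "duration_ms": 210, "pupil_mm": 3.5} for i in range(10)]
confused = fluent[:5] + [
    {"word_index": 3, "duration_ms": 420, "pupil_mm": 3.8},  # regression plus long fixation
    {"word_index": 4, "duration_ms": 390, "pupil_mm": 3.6},
]
print(round(confusion_score(fluent), 2), round(confusion_score(confused), 2))
```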

The way we handle confusion in digital interfaces today is to wait for failure. You get stuck. You stop. You search for help. You parse a FAQ. By the time support arrives, the frustration has already set in.

We think the intervention should come earlier, right at the moment of onset. A recipe floating in your view detects your hesitation on an unfamiliar technique and shows a brief, three-second animation of the method, then dissolves as your comprehension resolves. Or a sign in a foreign city, where translations appear over the specific words causing trouble, not because you asked, but because your pupils said you needed it, and gently fade as your processing normalises.

Imagine you're cooking, hands occupied. A recipe floats in view. You hit an unfamiliar instruction and the confusion signature fires. A contextual animation of the technique appears right where you're looking, brief and clear. Then it dissolves.

Imagine your gaze and pupil signal indicating you're about to engage with something (the anticipatory ramp is visible in a subtle data overlay). The interface is already beginning to surface relevant context. Imagine your eyes arriving at the object, and the information is there, waiting.

What should you never interrupt?

There are moments, awe, flow, deep presence, that carry a distinct physiological signature. High sustained dilation, low gaze velocity, a kind of coherence in the signal that you can feel in the data before you even label it. A parent watching their child climb higher than they've ever climbed. A person standing quietly before something beautiful. A musician lost inside a performance.

These moments are the reason technology should exist. To protect and enrich human experience. They're also the moments most likely to be shattered by a notification.

Imagine you're in a moment that matters, watching your child or a loved one, standing before a landscape, absorbed in something real. No AR overlay at all. Just the world.

Every notification system ever built treats these moments the same as any other. A message arrives, a card appears, completely agnostic to the fact that right now, you're experiencing something that matters more than anything on a screen.

The most important design decision in intent-aware AR isn't what to show. It's what to protect. When Aura detects real, embodied and absorbed engagement, the glass goes dark. Not minimised. Not dimmed.
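A minimal sketch of that protection rule, under assumed thresholds rather than calibrated ones: when sustained dilation coincides with quiet gaze, everything is held and nothing is shown.

```python
# A sketch of notification gating during absorbed, high-dilation states.
import statistics

def is_protected_state(pupil_mm, gaze_velocity_deg_s,
                       dilation_threshold_mm=4.2, velocity_threshold=30.0):
    """Both thresholds are illustrative and would need per-user calibration."""
    sustained_dilation = min(pupil_mm) > dilation_threshold_mm
    quiet_gaze = statistics.mean(gaze_velocity_deg_s) < velocity_threshold
    return sustained_dilation and quiet_gaze

def deliver(notifications, pupil_mm, gaze_velocity_deg_s, held):
    """Hold notifications while the state lasts; release them once it passes."""
    if is_protected_state(pupil_mm, gaze_velocity_deg_s):
        held.extend(notifications)  # the glass stays dark; nothing is shown
        return []
    released, held[:] = held + notifications, []
    return released
```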

Your body can't fake awe. You can't perform flow. These signals are the most reliable precisely because they're the least controllable. An interface that respects them isn't just good design. It's an act of care.

Coming soon

We're calling it Aura because that's what it reads. The field of signals your body broadcasts continuously, encoding what you need, what moves you, and what you're about to do, whether you know it yet or not.

We're currently working on this and will be releasing some sneak peeks into what we've done so far very shortly.

We think this is a huge step towards machines that understand you. The pupil is the first signal we can capture from hardware that already exists, but the body broadcasts on many channels. Heart rate, skin conductance, EEG, micro-expressions, breathing patterns. Each one adds a dimension to the vocabulary of intent. The pupil is simply the first we can reliably start decoding.

We're not building a design language. We're building the intelligence that decides what any design language should do, at any given moment, for any given person. The prettiest interface in the world is still worthless if it appears at the wrong time. The simplest piece of information is invaluable if it arrives at exactly the right moment.

Your body already knows what you need. We're building the system that listens.

References

  1. de Gee et al., PNAS 2014; Van Slooten et al., PLOS Computational Biology 2018; Preuschoff, 't Hart, & Einhäuser, Frontiers in Neuroscience 2011. The mechanism is the locus coeruleus-norepinephrine system, which modulates pupil diameter as a direct readout of cognitive arousal and uncertainty.
  2. Hess & Polt, 1960; Kang et al., Psychological Science 2009; Partala & Surakka, 2003.
  3. de Gee et al., PNAS 2014; Einhäuser et al., Frontiers in Human Neuroscience 2010; Privitera et al., 2010.
  4. Aston-Jones & Cohen, Annual Review of Neuroscience 2005; Jepma & Nieuwenhuis, Journal of Cognitive Neuroscience 2011; Gilzenrat et al., 2010.
  5. Hollenstein et al., Scientific Data 2018; D'Mello et al., 2014.