EyesOff: How I built a screen contact detection model (ym2132.github.io)
35 points by Two_hands 4 days ago | 24 comments
  • jwcacces 3 days ago

    Man, it's going to be great when this gets adapted to make sure I'm looking at the screen during all the ads I'm required to watch, or when it compiles a report of whether or not I'm paying attention to my boss in an all-hands...

    • Two_hands 3 days ago | parent

      Others are already trying to do this: https://arxiv.org/abs/2504.06237. I haven't seen anyone take the approach I tried though, as most use cases focus on tracking the main user rather than the people around them.

    • harvey9 3 days ago | parent

      The article links to a library for that but says the licence didn't suit this project.

      https://github.com/rehg-lab/eye-contact-cnn

  • Freak_NL 3 days ago

    > It’s an application which detects people looking at your screen. The aim is to keep you safe from shoulder surfing, utilising your webcam to give you the power to prevent snoopers.

    When is this ever a problem that cannot be solved by positioning yourself with a wall behind you or going somewhere private? This feels like overkill for the stated use-case. I can imagine someone thinking they might need this to do private stuff in a public space (a coffee shop?), but they'd turn paranoid from everyone passing by just glancing around.

    Also, is this a realistic threat model anywhere? People snooping by standing behind you tend to be colleagues or totally random passers-by, not people actually interested in gleaning private information. Anything more serious than logging into your Facebook account would simply call for proper OpSec procedures (like: 'only do this in private').

    All I can think of is employee monitoring where such tools will just end up making people insecure in their workplace (and less productive, because gazing out of a window or into nothingness actually helps when you are doing work which requires pondering; and less healthy, because looking away from your screen into the distance is recommended for anyone with working eyes).

  • _menelaus a day ago

    I've done a ton of mobile gaze tracking. We never went for the most important application here: babies will preferentially look at different things on a screen if they are predisposed to autism. A screening tool is the easiest thing to make from a technical point of view and also the most useful for society. Why don't you try that? Current methods wait until the baby can talk, and this could trigger intervention a very critical year earlier.

    • Two_hands a day ago | parent

      This has been done! It’s the paper I first looked at for this task: https://github.com/rehg-lab/eye-contact-cnn

      They created this CNN for exactly that task: autism diagnosis in children. I suppose this model would work for babies too.

      Edit: ah, I see your point. In the paper they diagnose autism with eye contact, but what you're describing is a task closer to what my model does. It could definitely be augmented for such a task, we'd just need to improve the accuracy. The only issue I see is that sourcing training data might be tricky, unless I partner with some institution researching this. If you know of anyone in this field I'd be happy to speak with them.

      • _menelaus a day ago | parent

        That's great! What I'm talking about is a bit different though and might be a lot easier to deploy and work on much younger subjects:

        Put a tablet in front of a baby. Left half has images of gears and stuff, right half has images of people and faces. Does the baby look at the left or right half of the screen? This is actually pretty indicative of autism and easy to put into a foolproof app.

        The linked GitHub project records a video of an older child's face while they look at a person who is wearing a camera or something, and judges whether or not they make proper eye contact. This is thematically similar but actually really different. It requires an older kid, both for the model and the method, and is hard to actually use. Not that useful.

        Intervening when still a baby is absolutely critical.

        P.S., deciding which half of a tablet a baby is looking at is MUCH MUCH easier than gaze tracking. Make the tablet screen bright white around the edges. Turn brightness up. Use off-the-shelf iris tracking software. Locate the reflection of the iPad in the baby's iris. Is it on the right half or left half of the iris? Adjust for their position in FOV and their face pose a bit and bam that's very accurate. Full, robust gaze tracking is a million times harder, believe me.
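
        As a rough sketch of that last step (Python/OpenCV; the iris centre and radius are assumed to come from the off-the-shelf tracker, and the threshold and left/right mapping are guesses that would need the calibration mentioned above):

          import cv2
          import numpy as np

          def reflection_half(eye_gray, iris_cx, iris_cy, iris_r):
              """Which half of the iris does the bright screen reflection sit in?
              Mapping that to a half of the tablet needs a sign check (mirror
              geometry) plus the pose/FOV adjustment mentioned above."""
              # Keep only pixels inside the iris so nothing else can vote.
              mask = np.zeros_like(eye_gray)
              cv2.circle(mask, (int(iris_cx), int(iris_cy)), int(iris_r), 255, -1)
              iris_only = cv2.bitwise_and(eye_gray, eye_gray, mask=mask)

              # The white tablet shows up as the brightest blob in the iris.
              _, bright = cv2.threshold(iris_only, 200, 255, cv2.THRESH_BINARY)
              ys, xs = np.nonzero(bright)
              if len(xs) == 0:
                  return None  # blink, occlusion, or no visible reflection

              # Compare the reflection centroid to the iris centre.
              return "left" if xs.mean() < iris_cx else "right"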

        • Two_hands a day ago | parent

          That's a cool idea, thanks for sharing! It's cool to see other uses for a model I built for a completely different task.

          Are there any research papers on this type of autism diagnosis tool for babies?

          To your last point, yes I agree. Even the task I set up the model for is relatively easy compared to proper gaze tracking; I just rely on large datasets.

          I suppose you could do it in the way you say and then from that gather data to eventually build out another model.

          I'll for sure look into this, appreciate you sharing the idea!

          • _menelaus a day ago | parent

            Idk of any research, sorry, just going from memory from a few years ago. Feel free to lmk if you ever have any questions about mobile gaze tracking, I spent several years on it. Can you DM on here? Idk.

            FYI: I got funding and gathered really big mobile phone gaze datasets and trained CNN models on them that got pretty accurate. Avg err below 1cm.

            The whole thing worked like this:

            Mech Turk workers got to play a mobile memory game. A number flashed on the screen at a random point for a second while the phone took a photo. Then they had to enter it. If they entered it correctly, I assumed they were looking at that point on the screen when the photo was taken and added it to the gaze dataset. Collecting clean data like this was very important for model accuracy. The data is long gone, unfortunately. Oh, and the screen for the memory game was pure white, which was essential for a reason described below.
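
            In code terms the gating rule is basically this (a minimal sketch; the names are made up):

              def maybe_keep(photo, target_xy, shown_digit, typed_digit):
                  """Keep a gaze sample only if the worker typed the digit correctly,
                  i.e. we can assume their eyes were on target_xy when the photo fired."""
                  if typed_digit != shown_digit:
                      return None                       # discard: attention not verified
                  return {"image": photo, "gaze_label": target_xy}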

            The CNN was a cascade of several models:

            First, off the shelf stuff located the face in the image.

            A crop of the face was fed to an iris location model we trained. This estimated eye location and size.

            Crops of the eyes were fed into 2 more cycles of iris detection, taking a smaller crop and making a more accurate estimation until the irises were located and sized to within about 1 pixel. Imagine the enhance... enhance... trope.

            Then crops of the super well centered and size-normalized irises, as well as a crop of the face, were fed together into a CNN, along with metadata about the phone type.

            This CNN estimated gaze location using the labels from the memory game derived data.
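
            A structural sketch of that final stage (PyTorch, chosen here for illustration; every layer size, the 20 phone types and the 2D output are assumptions, only the wiring of iris crops + face crop + phone metadata follows the description above):

              import torch
              import torch.nn as nn

              def small_cnn(out_dim):
                  # Tiny illustrative backbone; the real models were presumably bigger.
                  return nn.Sequential(
                      nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                      nn.Linear(32 * 4 * 4, out_dim),
                  )

              class GazeHead(nn.Module):
                  """Final stage: takes the well-centred, size-normalised iris crops
                  (produced by the earlier localisation passes), the face crop and the
                  phone type, and regresses an on-screen gaze point."""
                  def __init__(self, n_phone_types=20):
                      super().__init__()
                      self.iris_enc = small_cnn(64)   # shared by left and right eye crops
                      self.face_enc = small_cnn(64)
                      self.phone_emb = nn.Embedding(n_phone_types, 16)
                      self.regress = nn.Sequential(
                          nn.Linear(64 + 64 + 64 + 16, 128), nn.ReLU(),
                          nn.Linear(128, 2),          # (x, y) gaze point on the screen
                      )

                  def forward(self, face, left_iris, right_iris, phone_id):
                      feats = torch.cat([
                          self.face_enc(face),
                          self.iris_enc(left_iris),
                          self.iris_enc(right_iris),
                          self.phone_emb(phone_id),
                      ], dim=1)
                      return self.regress(feats)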

            This worked really well, usually, in lighting the model liked. It failed unpredictably in other lighting. We tried all kinds of pre-processing to address lighting but it was always the Achilles heel.

            To my shock, I eventually realized (too late) that the CNN was learning to find the phone's reflection in the irises, and estimating the phone position relative to the gaze direction using that. So localizing the irises extremely well was crucial. Letting it know what kind of phone it was looking at, and ergo how large the reflection should appear at a certain distance, was too.

            Making a model that segments out the phone or tablet's reflection in the iris is just a very efficient shortcut to do what any actually good model will learn to do anyway, and it will remove all of the lighting variation. It's the way to make gaze tracking on mobile actually work reliably without infrared. We never had time to backtrack and do this because we ran out of funding.

            The MOST IMPORTANT thing here is to control what is on the phone screen. If the screen is half black, or dim, or has random darkly colored blobs, it will screw with where the model thinks the phone screen reflection begins and ends. HOWEVER, if your use case allows you to control what is on the screen so it always has, for instance, a bright white border, your problem is 10x easier. The baby autism screener would let you control that, for instance.

            But anyway, like I said, to make something that just determines if the baby is looking on one side of the screen or the other, you could do the following:

            1. Take maybe 1000 photos of a sampling of people watching a white tablet screen moving around in front of their face.
            2. Annotate the photos by labeling visible corners of the reflection of the tablet screen in their irises.
            3. Make a simple CNN to place these.

            If you can also make a model that locates the irises extremely well, like to within 1px, then making the gaze estimate becomes sort of trivial with that plus the tablet-reflection-in-iris finder. And I promise the iris location model is easy. We trained on about 3000-4000 images of very well labeled irises (with a circle drawn over them for the label) with a simple CNN and got really great sub-pixel accuracy in 2018. That plus some smoothing across camera frames was more than enough.
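
            For step 3 above, the corner placer could be as plain as a small regressor trained with MSE on the labelled corners (a minimal sketch; the tensors are random placeholders standing in for the ~1000 annotated photos, and partially visible corners would need a visibility mask on top):

              import torch
              import torch.nn as nn
              from torch.utils.data import DataLoader, TensorDataset

              # Placeholder data: N iris crops (1x64x64) and 4 labelled corner points
              # each, as (x, y) pairs normalised to [0, 1]. Real data comes from step 2.
              N = 1000
              crops = torch.rand(N, 1, 64, 64)
              corners = torch.rand(N, 8)

              model = nn.Sequential(
                  nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                  nn.Flatten(),
                  nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
                  nn.Linear(128, 8),              # 4 corners x (x, y)
              )

              opt = torch.optim.Adam(model.parameters(), lr=1e-3)
              loader = DataLoader(TensorDataset(crops, corners), batch_size=32, shuffle=True)

              for epoch in range(10):
                  for x, y in loader:
                      opt.zero_grad()
                      loss = nn.functional.mse_loss(model(x), y)
                      loss.backward()
                      opt.step()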

            Anyway, hope some of this helps. I know you aren't doing fine-grained gaze tracking like this but maybe something in there is useful.

            • Two_hands a day ago | parent

              Wow, this is great! You can't DM on here, but my email is in my blog post, in the footnotes.

              Do you remember the cost of Mech Turk? It was something I wanted to use for EyesOff but could never get past the cost aspect.

              I need some time to process everything you said, but the EyesOff model has pretty low accuracy at the moment. I'm sure some of these tidbits of info could help to improve the model, although my data is pretty messy in comparison. I had thought of doing more gaze tracking work for my model, but at long ranges it just breaks down completely (in my experience, happy to stand corrected if you've worked on that too).

              Regarding the baby screener, I see how this approach could be very useful. If I get the time, I'll look into it a bit more and see what I can come up with. I'll let you know once I get round to it.

              • _menelaus 4 hours ago | parent

                The cost for Mech Turk:

                We paid something like $10 per hour and people loved our tasks. We paid a bit more to make sure our tasks were completed well. The main thing was just making the data collection app as efficient as possible. If you pay twice as much but collect 4x the data in the task, you doubled your efficiency.

                Yeah, I think it's impossible to get good gaze accuracy without observing the device reflection in the eyes. And you will never, ever be able to deal well with edge cases like lighting, hair, glasses, asymmetrical faces, etc. There's just a fundamental information limit you can't overcome. Maybe you could get within 6 inches of accuracy? But mostly it would be learning face pose I assume. Trying to do gaze tracking with a webcam of someone 4 feet away and half offscreen just seems Sisyphean.

                Is EyesOff really an important application? I'm not sure many people would want to drain their battery running it. Just a rhetorical question, I don't know.

                With the baby autism screener, the difficult part is the regulatory aspect. I might have some contacts at the Mayo Clinic who would be interested in productizing something like this though, and they could be asked about it.

                If I were you, I would look at how to take a mobile photo of an iris, and artificially add the reflection of a phone screen to create a synthetic dataset (it won't look like a neat rectangle, more like a blurry fragment of one). Then train a CNN to predict the corners of the added reflection. And after that is solved, try the gaze tracking problem as an algebraic exercise. Like, think of the irises as 2 spherical mirrors. Assume their physical size. If you can locate the reflection of an object of known size in them, you should be able to work out the spatial relationships to figure out where the object being reflected is relative to the mirrors. This is hard, but is 10-100x easier than trying end-to-end gaze tracking with a single model. Also nobody in the world knows to do this, AFAIK.
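
                A sketch of that synthetic step (numpy/OpenCV; every number is arbitrary, the point is just 'paste a blurred bright quad into an iris crop and keep its corners as the label'):

                  import cv2
                  import numpy as np

                  def add_fake_screen_reflection(iris_crop, rng):
                      """Paste a blurred, warped bright quad (the 'phone screen') into an
                      iris crop and return the image plus the corner labels."""
                      h, w = iris_crop.shape[:2]

                      # A small, randomly placed and tilted quadrilateral.
                      cx, cy = rng.uniform(0.3 * w, 0.7 * w), rng.uniform(0.3 * h, 0.7 * h)
                      sw, sh = rng.uniform(0.1, 0.2) * w, rng.uniform(0.15, 0.3) * h
                      a = rng.uniform(-0.4, 0.4)
                      rect = np.array([[-sw, -sh], [sw, -sh], [sw, sh], [-sw, sh]]) / 2
                      rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
                      corners = rect @ rot.T + [cx, cy]

                      # Draw the bright quad, blur it so it reads as a corneal reflection,
                      # then blend it additively over the iris.
                      overlay = np.zeros((h, w), np.float32)
                      cv2.fillConvexPoly(overlay, corners.astype(np.int32), 1.0)
                      overlay = cv2.GaussianBlur(overlay, (0, 0), sigmaX=1.5)
                      out = np.clip(iris_crop.astype(np.float32) + 180 * overlay, 0, 255)

                      return out.astype(np.uint8), corners.astype(np.float32)

                  # Stand-in "iris" image; swap in real grayscale iris crops.
                  rng = np.random.default_rng(0)
                  iris = np.full((64, 64), 60, np.uint8)
                  img, label = add_fake_screen_reflection(iris, rng)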

                • Two_hands 2 hours ago | parent

                  Interesting, $10 per hour is pretty reasonable.

                  Ha, that's probably why I noticed the EyesOff accuracy drops so much at longer ranges. I suppose two models would do better, but atm battery drain is a big issue.

                  I'm not sure if it's important or not, but the app comes from my own problems working in public so I'm happy to continue working on it. I do want to train and deploy an optimised model, something much smaller.

                  Sounds great, once a POC gets built I'll let you know and we can see about the clinical side.

                  Thanks for the tips! I'll be sure to post something and reach out if I get round to implementing such a model.

  • dinobones 3 days ago

    So much text and not a single example, diagram, or demo.

    I'm honestly skeptical this will work at all, the FOV of most webcams is so small that it can barely capture the shoulder of someone sitting beside me, let alone their eyes.

    Then what you're basically looking for is calibration from the eye position / angle to the screen rectangle. You want to shoot a ray from each eye and see if it intersects the laptop's screen.

    This is challenging because most webcams are pretty low resolution, so each eyeball will probably be like ~20px. From these 20px, you need to estimate the eyeball->screen ray. And of course this varies with the screen size.
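
    As napkin geometry (numpy; the eye position and gaze direction in the camera frame are assumed given, which is of course the hard part, and the screen size and placement are made-up numbers):

      import numpy as np

      def gaze_hits_screen(eye_pos, gaze_dir, screen_w=0.30, screen_h=0.19):
          """Intersect an eye->gaze ray with the screen plane. Camera frame:
          origin at the webcam, x right, y down, z out toward the viewer; the
          screen is approximated as the z=0 plane hanging below the webcam.
          The 0.30 x 0.19 m dimensions are roughly a 13" laptop screen."""
          d = gaze_dir / np.linalg.norm(gaze_dir)
          if abs(d[2]) < 1e-6:
              return False                    # ray parallel to the screen plane
          t = -eye_pos[2] / d[2]              # solve eye_z + t * d_z = 0
          if t <= 0:
              return False                    # screen is behind the gaze direction
          hit = eye_pos + t * d
          return -screen_w / 2 <= hit[0] <= screen_w / 2 and 0 <= hit[1] <= screen_h

      # Someone 60 cm from the camera and 40 cm to the side, glancing at the screen:
      eye = np.array([0.4, 0.0, 0.6])
      print(gaze_hits_screen(eye, np.array([0.0, 0.1, 0.0]) - eye))   # True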

    TLDR: Decent idea, but should've done some napkin math and/or quick bounds checking first. Maybe a $5 privacy protector is better.

    Here's an idea:

    Maybe start by seeing if you can train a primary-user gaze tracker first, and how accurate you can get it with modeling and then calibration. Then, once you've solved that problem, you can use that as your upper bound of expected performance and transform the problem to detecting the gaze of people nearby instead of the primary user.

    • Two_hands 3 days ago | parent

      Sorry, I haven't gotten around to gathering examples yet. I ran the models on some example videos, which is where the accuracy stats come from.

      Perhaps I have been spoiled by the Mac webcam. I agree it won't be great on most old webcams, but I have had success with newer ones.

      I did try a calibration approach, but it's simply too fragile for in-the-wild deployment. Calibration works great if you only care about one user, but when you start looking at other people it doesn't work so well.

      Good idea, it may be more fruitful to do that. At least then for the primary user we can be much more certain.

  • xlii 3 days ago

    Assuming this works it will for sure be used for employee tracking.

    Privacy protectors solve a different problem: they prevent people from extracting information from the screen rather than merely informing you about a possible infraction.

    That being said, it's useful in the sense that if I saw anything like that in a contract it wouldn't be a red flag. It'd be a red flashing GT*O alarm ;)

    • Two_hands 3 days ago | parent

      This model supports the EyesOff application, which will prevent information extraction by either showing a popup, switching to another app, or sending a notification (you can define the behaviour in a few different ways).

      Privacy screens are still useful, and I recommend people use both EyesOff and a privacy screen. A privacy screen won't stop someone shoulder surfing from directly behind you, etc.

      There are also better ways to do this sort of task when all you care about is tracking the main user: https://arxiv.org/abs/2504.06237, https://pmc.ncbi.nlm.nih.gov/articles/PMC11019238/

  • jauco 3 days ago

    Thanks for the detailed log on what it takes to build your own model and how you prepared your own dataset. Interesting read!

    • Two_hands 3 days ago | parent

      Thanks, glad you enjoyed it.

  • IshKebab 3 days ago

    Not going to be very useful for its stated purpose because front-facing cameras generally have quite a narrow field of view.

    Interesting problem anyway. I'm surprised the accuracy is so low.

    • Two_hands 3 days ago | parent

      Yeah tbh I do recommend using this alongside a privacy screen for best protection. Privacy screens also suffer from the fact that they won’t block someone directly behind you from seeing the screen, so both methods have issues.

      Any tips on improving accuracy? A lot of it might be due to lack of diverse images + labelling errors as I did it all manually.

      • IshKebab 2 days ago | parent

        I dunno, my only idea is maybe to use traditional face detection to find the face/eyes and then do classification (assuming you aren't doing that already?).

        • Two_hands 2 days ago | parent

          Right now that's pretty much what I do. I use YuNet to get faces, crop them out and run detection. It's probably a factor of not enough data / poor model choice.
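
          For reference, that stage might look roughly like this (the .onnx path and the classifier are placeholders; only the YuNet detector and its Nx15 output layout are OpenCV's actual API):

            import cv2

            # YuNet detector bundled with OpenCV; the .onnx path is a local file you
            # download yourself, so adjust it to wherever the model lives.
            det = cv2.FaceDetectorYN.create("face_detection_yunet_2023mar.onnx", "", (320, 320))

            def looking_faces(frame, classifier):
                """Detect faces with YuNet, crop each one, and run the (placeholder)
                looking-at-screen classifier on the crop."""
                h, w = frame.shape[:2]
                det.setInputSize((w, h))
                _, faces = det.detect(frame)
                results = []
                for f in (faces if faces is not None else []):
                    # Each row: x, y, w, h, five landmark (x, y) pairs, then the score.
                    x, y, fw, fh = f[:4].astype(int)
                    crop = frame[max(y, 0):y + fh, max(x, 0):x + fw]
                    if crop.size:
                        results.append(classifier(crop))   # e.g. True if looking at screen
                return results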

  • welcome_dragon a day ago

    Am I the only one that remembers a phone several years ago that advertised a feature like this?

    I remember a guy watching a video, then looking up, and it paused, etc.

    • Two_hands a day ago | parent

      Yeah, I did see something like this, it may have been Huawei. Not sure if they use a model or sensor-based approaches though.