If you’re blind, you can now get AI-generated descriptions of photos in your Facebook feed. You can use Microsoft’s SeeingAI iPhone application to get descriptions of whatever you capture with your phone’s camera. These technologies are already widely used by many blind and low vision people. So are we really nearing a world where blind people have equal access to visual information? To answer that question, we must ask ourselves what exactly “equal” access is. Here, I argue that to begin that conversation, we must first ask ourselves what desirable access is. As a person with a visual impairment who develops AI-powered applications for people with visual impairments, I started to think about my personal experiences. What kind of information do I want to know about the visual world around me? How do I currently access the information I would like to know?
Back in June, I attended a lecture by Prof. Georgina Kleege from UC Berkeley about audio descriptions. Today, audio descriptions are added to television and films as an extra track, where a trained human describes the action taking place for blind and low vision people. They are delivered in a dry and “objective” way: “a woman is walking down the street. She smiles and turns towards a house.” This is similar to the way AI-generated descriptions of images are delivered today: “woman standing near a house, outdoors.”
Prof. Kleege showed segments of old films that included voice-over narrations of the action, delivered from the perspective of the protagonist in the scene. These narratives were not added for the explicit purpose of making the scenes accessible to blind people; they were written into the script, presumably by the same person or people who wrote the dialog and scripted the action, and performed by the actors. And they did make the scenes accessible, while also enhancing the emotions in the scene and giving all viewers a window into the narrator’s thoughts. This is similar to the experience you would get from reading a book. Prof. Kleege argued that these descriptions were preferable to standard audio descriptions: they enhance the scene for everyone and give the writer and actors an additional channel of expression. Based on the examples she showed, I completely agreed.
Of course, film actors today don’t typically narrate the action in their scenes, and people who post photos on Facebook or other sites don’t usually describe them. While I have a lot of vision myself, I occasionally need some help in understanding the contents of a photo or video on the web. The most effective way for me to get it is simply asking my husband:
“What am I missing in this photo? Why is this funny?” I find myself asking him occasionally. “She’s going for the last piece of sushi while they’re posing for the photo,” he recently explained about a photo my brother shared of himself, my sister-in-law, and my two-year-old niece (the sushi thief).
If we’re looking at a photo or video shared by a friend or family member, my husband probably knows the poster, and something about the context. Most important, he knows me. He knows what information I may have missed, whether it’s because of my vision, because of the context in the photo (maybe I don’t know who’s in it), or simply because of my idiosyncratic ignorance of popular culture. Of course, if my husband’s not around, I might ask a good friend.
This brings me back to AI for automatic descriptions of visual content. We already have AI-generated descriptions of photos on the web, and soon we will likely have AI-generated descriptions of videos too (that would be great!). But in working towards both applications, there seems to be an implicit assumption that the AI should strive to simulate a “neutral” stranger: one with no affect and no familiarity with the people or context of the photo or video (aside from perhaps knowing their names). Instead, let us stop to think about the role the AI should take. Perhaps we should strive to simulate a close friend (dare I say, my husband) who knows us and the context of our photo or video. Or perhaps we should strive to simulate the creator or producer of the media.
Thinking about how to design AI systems that know enough to simulate a friend or a content creator introduces a slew of questions, not least of which are technical challenges and privacy concerns. But we can only make progress on these new questions if we first make explicit our assumptions about what exactly we are trying to achieve. What exactly is equal access to information? Who can provide the kind of information I want, and how? So to design better AI, we must first understand our own desires for information access, and how they relate to the people in different roles in our lives.