Brain does visual processing, text prompts are a cognitive bottleneck: Xi Zeng

In a conversation with HT, Xi Zeng notes India’s importance for building any successful and diverse AI. (Official photo)


Xi Zeng, who is addressed as “Dr Zeng” in his corporate and academic roles, made a leap of faith late last year to solve a problem that none of the current artificial intelligence (AI) products seemed intent on, or capable of, solving. The former OnePlus, Oppo and ByteDance executive launched Chance AI as founder and CEO. His inspiration for building Curiosity Lens, billed as the world’s first visual AI agent, is to let people see the world and discover information without having to type: an assistant that can see the world around you, gather context and, in Zeng’s words, catch the vibe.


A few years ago, Zeng, standing before the Basílica de la Sagrada Família in Barcelona, wanted to search for more about the history of its complex architecture. He couldn’t, because typical search results prioritised ticket links and tour packages. Detailed information did filter through, but only some time later. There was too much friction, and in his opinion, being forced to stop, type and read text results ruins an intuitive moment of curiosity. Zeng holds a Ph.D. in Cognitive Science and Contemporary Art from the University of Barcelona, is a Distinguished Professor at the China Academy of Art, and is an Honorary Fellow at the Nottingham University Business School China.

Earlier this year, Chance AI reported $3 million in seed funding to build on the idea of instant visual answers, visual reasoning and no-typing prompts, all using the phone’s camera. In a conversation with HT, Xi Zeng notes India’s importance for building any successful and diverse AI, saying, “India is the ultimate testing ground because of its extreme visual and cultural density”. Usage trends from Chance AI’s users give the company a clear direction: build for visual expression, not workplace efficiency. The Chance AI app is available for Apple iPhone as well as Android phones. Edited excerpts.

Q. There seems to be an agreement that AI is moving beyond chat, but what specific capability gap in chat-based interfaces has forced this shift?

Xi Zeng: The fundamental gap is that chat is inherently “anti-human” for exploring the physical world. From an evolutionary perspective, humans process the world visually first; about 70% of our brain’s computing power is dedicated to visual processing. Language comes much later. Chat interfaces force a cognitive bottleneck, because they demand that you first clearly formulate your intent into text. But in the real world, when you see something intriguing, such as a building’s architecture, a unique outfit or a cultural artefact, you often don’t even know how to ask the right question.

If you are forced to stop and type a prompt, the intuitive moment of curiosity is already lost. The shift beyond chat is happening because Gen-Z users (who are visual natives) find typing out descriptions of the physical world incredibly inefficient.

Q. What’s the hardest unsolved problem in visual AI right now—is it data, compute efficiency, or context understanding?

Xi Zeng: It is absolutely context understanding, but more specifically, it is how we engineer that understanding. Right now, many companies are trying to make the “eyes do the thinking”—cramming perception, reasoning, and decision-making into a single massive Vision-Language Model (VLM). That leads to hallucinations and massive compute inefficiency.

The hardest problem is replicating the biological cognitive pipeline. At Chance AI, we solved this through what we call “Harness Engineering”. We separated the visual pipeline: seeing (camera), signal transmission, visual cortex processing (understanding structure/semantics), and finally the frontal lobe (decision making). Furthermore, we developed a proprietary protocol (a 100×100 compressed visual token system) that allows AI agents to communicate via images rather than translating everything into text. Preserving the “vibe” and unspoken context without translation loss is the true frontier.
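To make that separation concrete, here is a minimal Python sketch of a pipeline with the four stages Zeng names, passing a 100×100 token grid between them. Everything in it, from the stage functions to the VisualTokenGrid type to the way token ids are derived, is an illustrative assumption; the 100×100 grid size is the only detail taken from the interview, and Chance AI’s actual protocol is proprietary.

```python
from dataclasses import dataclass
from typing import List

GRID = 100  # the interview mentions a 100x100 compressed visual token grid


@dataclass
class VisualTokenGrid:
    # A fixed-size grid of discrete visual tokens that agents exchange
    # directly, instead of a lossy text description of the image.
    tokens: List[List[int]]


def see(raw_frame: bytes) -> bytes:
    # Stage 1: the camera captures a frame (pass-through in this sketch).
    return raw_frame


def encode(frame: bytes) -> VisualTokenGrid:
    # Stages 2-3: signal transmission and "visual cortex" processing.
    # A real system would run a vision encoder; here token ids are faked
    # from the raw bytes purely to keep the sketch runnable.
    ids = [[frame[(r * GRID + c) % len(frame)] for c in range(GRID)]
           for r in range(GRID)]
    return VisualTokenGrid(tokens=ids)


def decide(grid: VisualTokenGrid) -> str:
    # Stage 4: the "frontal lobe" reasons over tokens, never over prose.
    density = sum(map(sum, grid.tokens)) / (GRID * GRID)
    return f"avg token id {density:.1f}: hand off to reasoning agent"


if __name__ == "__main__":
    frame = b"\x10\x80\x40" * 4000  # stand-in for raw camera bytes
    print(decide(encode(see(frame))))
```

The point of the structure is that the reasoning stage never receives a text caption, only tokens, which is what Zeng means by avoiding translation loss.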

Q. Utility-driven AI sounds intuitive—but what does that look like in product terms? What replaces the prompt as the core interaction unit?

Xi Zeng: The camera replaces the prompt, and “seeing” replaces “asking”. In product terms, this means moving from a search box to a continuous “Live Mode”. You don’t take a photo, upload it, and wait for an answer. Instead, the Visual Agent looks at the world with you synchronously. We call this moving from Prompt to Perception. For example, if you look at a menu in a foreign language, the AI doesn’t just translate it. It understands your dietary history, knows what’s trending on local Instagram, recommends the best dish, and initiates the action to order. The interaction unit is no longer a text command; it is the continuous stream of your real-world visual context combined with your ongoing actions.
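As a rough illustration of moving from Prompt to Perception, the sketch below replaces a search box with a loop over a continuous frame stream; the agent volunteers a suggestion whenever frame and context line up. The function names, the menu scenario and the user-profile shape are hypothetical stand-ins built around Zeng’s example, not Chance AI’s API.

```python
import time
from typing import Dict, Iterator, Optional


def frame_stream() -> Iterator[str]:
    # Stand-in for a live camera feed; a real Live Mode would yield images.
    for frame in ["street_view", "menu_page_in_catalan", "dish_photo"]:
        yield frame
        time.sleep(0.1)


def perceive(frame: str, user_context: Dict[str, str]) -> Optional[str]:
    # Fuse the current frame with persistent user context and decide whether
    # to surface a suggestion; no typed prompt ever enters the loop.
    if "menu" in frame and user_context.get("diet") == "vegetarian":
        return f"[{frame}] highlighting vegetarian dishes trending locally"
    return None


if __name__ == "__main__":
    context = {"diet": "vegetarian"}  # hypothetical persistent user profile
    for frame in frame_stream():
        suggestion = perceive(frame, context)
        if suggestion:
            print(suggestion)
```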

Q. In an AI ecosystem that is dominated by OpenAI, Google, Anthropic and others, is the key to differentiation today more about model capability, product design, or distribution?

Xi Zeng: It’s about Product Design driven by a distinct business model. Giants like Google absolutely have the model capability. However, tools like Google Lens are fundamentally designed for search and transaction: identifying an item to sell it to you. That restricts their product design.

Chance AI is building a “Lifestyle Companion”. Gen Z doesn’t always want to buy something; they want to know the meaning behind it: the vibe, the culture, the history. When AI simply gives an answer, it acts as a tool. When AI helps you form a judgment and develop taste, it becomes an Agent. We are thriving in the vacuum that giants overlook because providing emotional, subjective, and cultural interpretations contradicts traditional search-and-ads business models.

Q. In the Indian context, are there specific user behaviours that make the country a strong testing ground for AI-first products, and what does localisation actually mean for AI?

Xi Zeng: India is the ultimate testing ground because of its extreme visual and cultural density. A bustling street market in New Delhi presents a level of unstructured visual data, layered with deep cultural codes, that breaks standard AI models trained purely on Western data. For AI, localisation is not translation; it is cultural consensus.

For instance, when we launched a highly localised feature in Latin America (AI palm reading), it blew up to 50,000 DAUs (daily active users) simply because we understood the local cultural vibe. In India, true AI localisation means the Visual Agent must understand the subtle difference between various regional textiles (like a Kanjeevaram versus a Banarasi saree) or the specific nuances of local street food. It’s about understanding the meaning and societal context behind the pixels, which requires hyper-local visual reasoning.

Q. What are you building at Chance AI right now, and what signals do you rely on to shortlist which internal project or data point to build on?

Xi Zeng: We are building the “Visual Agent OS”, a visual brain and operating system for the next generation of AI hardware (smart glasses, wearables, etc.). Before the hardware fully matures, we are perfecting the “brain” on mobile. The core signal we look for is irrational, high-frequency human curiosity.

We don’t build for workplace efficiency. We look at scenarios where users want to express themselves rather than solve a math problem. For example, when we noticed young female users in North America using our AI 2.8 times a day just to get feedback on their Outfits of the Day (OOTD), or users snapping 160 photos a day to organise their niche collectibles, we knew we had hit a nerve. We build for scenarios where AI acts as a companion that helps you rediscover the serendipity of the real world.


