Wouldn't it be cool if your children could play the classic game of "I Spy" while travelling, using your current location, powered by a generative AI, and an app integrated with Apple CarPlay? We recently took on this intriguing question and developed a Proof-of-Concept (PoC) app to test the feasibility, combining a multitude of technologies in an innovative solution.

Our PoC app integrates the following technologies:

Google Street View: Provides location-specific images based on location data.
Google Cloud Vision: Identifies and labels items in the given image.
OpenAI GPT-3.5: Engages in interactive "I Spy" game using the image data and user conversation.
OpenAI Whisper: Converts user voice input to text.
Google Cloud Text-to-Speech: Converts AI response to audio, enhancing user interaction.
Time Under Tension API: Manages user sessions and facilitates API chaining.
ReactJS frontend: Serves as an interactive interface to test and play the game.

With this setup, we explored the possibility of providing GPT enough context to play "I Spy" based solely on location data.

In our first test, we determined our location and travel direction.

Our second test involved seeing what a computer could perceive in the image.

For the third test, we engaged GPT in playing "I Spy" using the labels obtained, within the OpenAI Playground.

Following these promising tests, we took the next step - prototyping.

Web app - voice is the only way to communicate with GPT.

An interface to view labels sent to GPT and to change locations.

One of the standout technologies used was OpenAI Whisper, which displayed an exceptional ability to convert voice audio to text. Initially, we used ElevenLabs for the voice of GPT, but the swift depletion of credits led us to switch to the more cost-effective Google Text-to-Speech, which proved more sustainable for this application. Furthermore, we opted for GPT-3.5 over GPT-4 due to its speed, cost-effectiveness, and suitability for a game like "I Spy."

We also compared Google Cloud Vision, Microsoft Vision Studio, and AWS Recognition for the computer vision component. Google emerged as the winner, with Microsoft following closely and AWS Recognition trailing.

Naturally, some concerns arise regarding the transient nature of certain objects, such as vehicles or pedestrians, in Google Street View images. We can tackle this by instructing GPT that the game is being played in a moving car, so it should not select items that might disappear quickly. Future app development could involve predicting your location based on velocity and direction, sampling multiple images at intervals, and using common labels. Moreover, GPT proved adept at managing situations when a child chose an object not found in the labels, adeptly narrowing down the item through a series of queries.

This experiment was not just a fun exploration but also an insightful opportunity to blend various AI offerings into a single application. It shed light on the efficacy of voice-to-text and text-to-voice services, hinting at their potential use cases. Voice interaction with Language Learning Models (LLMs) could revolutionise hands-free applications, making it an ideal solution in scenarios where the user is engaged in activities like driving, cooking, or even jogging.

In conclusion, while our "I Spy" AI concept remains an experimental one, it demonstrates the potential AI holds to create interactive and engaging experiences, perhaps making the classic road trip game more immersive and captivating than ever. If you would like to try the demo, click the play button below.

Play “I SPY” demo

I Spy with My AI: Transforming Road Trips with a Touch of AI Magic

About

Method

Work

Imagine

Contact