When Two Industries Converge: 3 new capabilities Boston Dynamics is integrating into robots with Generative AI

by | Oct 28, 2023 | AI


The fusion of Generative AI with robotics is gradually unfolding a new era of autonomous and intelligent machines capable of interacting with their environment and humans in unprecedented ways. At the forefront of this transformative wave is Boston Dynamics with its robotic marvel – Spot. By integrating Generative AI and Visual Question Answering (VQA) models, Spot has been endowed with reasoning, real-time decision-making, and customizable interactive experiences. This venture is a robust demonstration of the limitless possibilities this synergy between Generative AI and robotics holds, especially in industrial domains like Engineering and Manufacturing.

Spot’s Evolution: Becoming More Than Just A Robot

Boston Dynamics took a significant leap by enriching Spot with Generative AI, notably integrating ChatGPT and VQA models. This blend enabled Spot to translate visual data from its cameras into text, which is further processed by ChatGPT to engage in meaningful interactions. A highlight of this initiative is the robot tour guide project where Spot, while strolling through Boston Dynamics’ office, could observe its surroundings, interpret visual data, and share insights about different spots interactively and engagingly with the audience: Robots That Can Chat | Boston Dynamics

Generative AI: A Catalyst for Industrial Transformation

The prowess of Generative AI extends beyond robotics into the realms of Engineering and Manufacturing, where it’s poised to revolutionize processes and operations. Its capability to optimize and accelerate processes is particularly appealing for engineering disciplines requiring high precision and efficiency (Generative AI and machine learning are engineering the future in these 9 disciplines | ZDNET). Moreover, with Generative AI, engineers can delve into extensive design explorations, analyze large datasets to enhance safety, create simulation datasets, and expedite the manufacturing processes, thus ensuring a quicker market entry of products (How Generative AI will transform manufacturing | AWS for Industries (amazon.com)).

Generative AI will soon be able to understand multiple modalities. Today, researcher combined two LLMs with VQA to "show" SPOT the domain
a three-dimensional map of specific areas within our premises, marked distinctly for Generative AI the Large Language Model (LLM) to interpret: 1 “demo_lab/balcony”; 2 “demo_lab/levers”; 3 “museum/old-spots”; 4 “museum/atlas”; 5 “lobby”; 6 “outside/entrance”. This 3D autonomy map, meticulously compiled by Spot, comes with concise descriptions for each labeled section. Utilizing Spot’s advanced localization system, we identified descriptions of nearby locations, which were then relayed to the large language model alongside other contextual data from Spot’s array of sensors. The LLM, in turn, processes this information to formulate commands like ‘say’, ‘ask’, ‘go_to’, or ‘label’, facilitating Spot’s interactive engagement and real-time decision-making in its environment, as detailed in the article. This demonstrates the seamless interaction between visual data and Generative AI, propelling Spot’s autonomous navigational and conversational capabilities to the forefront.

The journey doesn’t stop here; Generative AI is facilitating the emergence of conversational chatbots, predictive assistants, and various other tools that promise to ease our daily industrial operations – take a look at this article by Siemens on The future of generative AI in design and manufacturing (The future of generative AI in design and manufacturing – Thought Leadership (siemens.com)). These advancements are not only making processes more efficient but are also unlocking new avenues of innovation and productivity.

What are Visual Question Answering (VQA) models?

I used VQA in this article – since OpenAI has no API Access to GPT-4V yet, VQA was the best way, to provide visual inputs to the model. Visual Question Answering (VQA) models represent a captivating intersection of computer vision and natural language processing technologies, engineered to interpret visual data and provide responses to text-based queries concerning that data. These models are fed an image alongside a text question and are trained to generate a relevant answer.

For instance, given a picture of a room and asked, “How many chairs are in the room?”, a VQA model aims to analyze the image and provide an accurate answer. The underlying mechanism often involves the extraction of features from the image, understanding the context of the question, and subsequently generating a text answer based on the interplay of visual and textual cues. By bridging the gap between visual perception and language understanding, VQA models open avenues for more intuitive human-machine interactions and find applications in various fields including robotics, accessibility services for the visually impaired, and interactive customer service solutions among others.

If you interested in learning more, you can read about VQA in the original VQA paper: [1505.00468] VQA: Visual Question Answering (arxiv.org)

Architecture of the model, using several Generative AI providers for this experience.

Spot’s Journey: A Glimpse into an AI-Driven Future

Spot’s transformation is a vivid illustration of the practical applications and the future of robotics intertwined with Generative AI. It’s a testament to how robots can assume various personalities and engage in nuanced, interactive dialogues, making real-time decisions based on environmental feedback. This venture is not just a technical demonstration but a narrative of what the future holds – a world where robots and humans interact and collaborate seamlessly in an enriched, intelligent, and intuitive ecosystem.

The role of Large Language Models (LLMs) like ChatGPT is undeniably significant in this narrative, acting as the brain behind Spot’s conversational and reasoning abilities. This showcases a future where the integration of language models and Generative AI could lead to the development of autonomous, interactive, and highly engaging robotic applications across various sectors.


The meld of Generative AI with robotics as illustrated by Spot’s evolution is a stepping stone towards a future buzzing with intelligent robots capable of meaningful interactions and autonomous decision-making. The initiative by Boston Dynamics is not just a technological breakthrough but a beacon illuminating the path of digital transformation, especially in industrial domains. It beckons a future where the digital and physical realms seamlessly intertwine, paving the way for innovations that could redefine the landscape of Engineering, Manufacturing, and beyond.

Follow, for more news on Generative AI & Generative AI!