AI-Powered WhatsApp Assistant with Google Gemini Multimodal Capabilities

Version: 1.0.0 | Last Updated: 2025-05-16

Core AI Power: 8/10
Automation Level: 8/10
Integration Reach: 4 systems
Setup Simplicity: 4/10
Adaptability: 7/10

Overview

Unlock Advanced Customer Interaction with this AI Agent

This n8n workflow turns your WhatsApp channel into an AI-powered communication hub. It acts as an AI Agent that accepts four message types: text, audio voice notes, videos, and images. Using Google Gemini's multimodal models, it transcribes audio, describes video content, and analyzes images. The extracted information, along with text messages (which can be summarized), is then passed to a Langchain-powered AI Agent. This agent, equipped with conversational memory and access to Wikipedia, formulates a response, which is sent back to the user via WhatsApp.
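The routing step described above can be sketched in a few lines. This is only an illustration of the idea, not the workflow's actual node logic: the function name route_message and the Route shape are hypothetical, and the input is assumed to be a single message object in the shape the WhatsApp Cloud API delivers to webhooks (a "type" field plus a same-named object holding either the text body or the media ID).

```python
from typing import Optional, TypedDict


class Route(TypedDict):
    branch: str               # which processing branch should handle the message
    media_id: Optional[str]   # WhatsApp media ID for audio/video/image messages
    text: Optional[str]       # message body for text messages


# Media types the workflow forwards to Google Gemini.
MEDIA_TYPES = ("audio", "video", "image")


def route_message(message: dict) -> Route:
    """Pick a processing branch for one WhatsApp message object (hypothetical helper)."""
    msg_type = message.get("type", "")
    if msg_type == "text":
        body = message.get("text", {}).get("body", "")
        return {"branch": "text", "media_id": None, "text": body}
    if msg_type in MEDIA_TYPES:
        # Media messages carry an ID that must be resolved to a download URL first.
        media = message.get(msg_type, {})
        return {"branch": msg_type, "media_id": media.get("id"), "text": None}
    # Anything else (stickers, locations, ...) is outside the four handled types.
    return {"branch": "unsupported", "media_id": None, "text": None}
```

For example, a voice note would be routed to the "audio" branch together with its media ID, which the workflow then uses to fetch and download the file before sending it to Gemini.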

Key Features & Benefits

Multimodal Message Handling: Processes text, audio, video, and image messages received via WhatsApp, offering versatile interaction capabilities.
AI-Powered Transcription & Description: Utilizes Google Gemini (e.g., gemini-1.5-pro-002) to accurately transcribe audio messages, describe the content of videos, and analyze images, providing rich context to the AI.
Intelligent Text Summarization: Condenses lengthy text messages using Google Gemini, ensuring efficient processing and understanding by the AI Agent.
Conversational AI Agent: Employs a Langchain-powered agent with Google Gemini as its core Large Language Model (LLM) to understand user queries in-depth and generate contextually appropriate responses.
Knowledge Augmentation: Integrates with Wikipedia via a Langchain tool, allowing the AI Agent to fetch and incorporate factual information into its replies, enhancing response quality.
Conversational Memory: Remembers the most recent interactions within a single user session using Langchain's Window Buffer Memory, enabling more coherent and natural conversations.
Automated WhatsApp Replies: Seamlessly sends the AI-generated responses back to the originating user on WhatsApp, closing the communication loop efficiently.
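To make the memory feature above more concrete, here is a minimal sketch of what a windowed, per-session conversation buffer does. It is not Langchain's implementation: the class name WindowMemory, its methods, and the default window size are assumptions chosen for illustration. Only the idea (keep the last few turns per session key) comes from the description above.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Turn = Tuple[str, str]  # (user message, assistant reply)


class WindowMemory:
    """Keeps only the most recent `window` turns for each session (illustrative)."""

    def __init__(self, window: int = 5) -> None:
        # One bounded deque per session; old turns fall off automatically.
        self._turns: Dict[str, Deque[Turn]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def add_turn(self, session_key: str, user: str, assistant: str) -> None:
        self._turns[session_key].append((user, assistant))

    def history(self, session_key: str) -> List[Turn]:
        # Oldest retained turn first, newest last.
        return list(self._turns[session_key])


memory = WindowMemory(window=2)
key = "whatsapp-tutorial-15550001111"  # hypothetical sender number
memory.add_turn(key, "Hi", "Hello! How can I help?")
memory.add_turn(key, "What is n8n?", "A workflow automation tool.")
memory.add_turn(key, "Thanks", "You're welcome.")
```

With a window of 2, the first exchange is dropped once the third arrives, and a different session key sees none of this history. Keying the buffer by sender is what keeps conversations from different WhatsApp users separate.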

Use Cases

For B2C e-commerce: Provide 24/7 AI-driven customer support on WhatsApp, understanding product image queries, voice complaints, or video demonstrations of issues, and offering instant solutions or product information.
For B2B SaaS companies: Offer an intelligent first line of support via WhatsApp, capable of interpreting screenshots of error messages, audio explanations of problems, or short video recordings of software behavior, before escalating complex issues.
Automate interactive product demos or FAQs on WhatsApp, where users can send varied media types for queries and receive AI-generated, multimedia-aware responses.
Streamline user onboarding or feedback collection via WhatsApp, using the AI agent to guide users and understand their input regardless of whether it's text, voice, or visual.

Prerequisites

An n8n instance (Cloud or self-hosted).
WhatsApp Business API access with a configured phone number ID and credentials for n8n (e.g., Permanent Access Token).
Google Gemini API Key with access to a multimodal model like gemini-1.5-pro-002 (or equivalent) that supports text, audio, image, and video processing. Ensure your Google Cloud project is set up for Gemini API usage.
Basic familiarity with n8n expressions to adapt prompts or data mapping if needed.

Setup Instructions

Download the n8n workflow JSON file.
Import the workflow into your n8n instance.
Configure the 'WhatsApp Trigger' node: Set up a new webhook or select existing credentials for your WhatsApp Business API.
Update WhatsApp credentials: For all 'WhatsApp' nodes ('Get Audio URL', 'Get Video URL', 'Get Image URL', 'Download Audio', 'Download Video', 'Download Image', 'Respond to User'), select your configured WhatsApp API credentials.
Configure Google Gemini HTTP Request nodes ('Google Gemini Audio', 'Google Gemini Video'):
- Ensure these nodes are using a credential type that can authenticate with the Google Generative Language API (e.g., 'Google PaLM API' credential type in n8n, providing your Gemini API key).
- Verify the url parameter points to the correct Gemini model endpoint (e.g., https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro-002:generateContent).
- Confirm the jsonBody correctly formats the request with the binary data and prompt.
Configure Google Gemini Chat Model nodes ('Google Gemini Chat Model', 'Google Gemini Chat Model1', 'Google Gemini Chat Model2'):
- Select your configured Google Gemini API credentials.
- Ensure the 'Model Name' parameter is set to your desired multimodal model (e.g., models/gemini-1.5-pro-002).
Customize the 'AI Agent' node: Review and adjust the 'System Message' in the 'Options' tab to define the agent's persona and primary instructions for your specific use case. The input text for the agent is constructed in the 'Get User's Message' node.
Review 'Window Buffer Memory' node: Ensure the sessionKey is appropriate for distinguishing between different WhatsApp users (e.g., whatsapp-tutorial-{{ $json.from }}).
Test Thoroughly: Send various message types (text, audio, video, image) from a test WhatsApp account to the number connected to your n8n trigger. Observe the flow and AI responses.
Activate the workflow to make it live.
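When checking the url and jsonBody of the Google Gemini HTTP Request nodes, it can help to see what a generateContent request with inline media looks like. The sketch below builds one outside n8n. The helper name build_gemini_request, the prompt text, and the sample bytes are hypothetical; the endpoint pattern and the contents/parts/inline_data layout follow the Generative Language REST API. Authentication is left out because the workflow handles it through the n8n credential.

```python
import base64
import json
from typing import Tuple

API_BASE = "https://generativelanguage.googleapis.com/v1beta"


def build_gemini_request(
    model: str, prompt: str, media_bytes: bytes, mime_type: str
) -> Tuple[str, dict]:
    """Return (url, json_body) for a generateContent call with one inline media part."""
    url = f"{API_BASE}/models/{model}:generateContent"
    body = {
        "contents": [
            {
                "parts": [
                    # The instruction for the model, e.g. a transcription prompt.
                    {"text": prompt},
                    # The binary file, base64-encoded, with its MIME type.
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(media_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }
    return url, body


url, body = build_gemini_request(
    model="gemini-1.5-pro-002",
    prompt="Transcribe this audio message.",   # hypothetical prompt
    media_bytes=b"placeholder-audio-bytes",    # stand-in for the downloaded file
    mime_type="audio/ogg",
)
payload = json.dumps(body)  # what would be sent as the request's JSON body
```

If a node's jsonBody differs structurally from this shape (for instance, the binary data is not base64-encoded or the MIME type is missing), that is a likely place to look when a media branch fails during testing.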

Tags: AI Agent, WhatsApp, Google Gemini, Langchain, Automation, Customer Support, Multimodal AI, NLP, Chatbot
