
AI Image Captioner & Overlay Agent using Google Gemini

Version: 1.0.0 | Last Updated: 2025-05-16

Integrates with:

Google Gemini, Langchain

Overview

Unlock Automated Image Enhancement with this AI Agent

This n8n workflow is an AI Agent for image captioning and visual overlay. Given an input image, it performs the following steps:

  1. Image Analysis: Uses the vision capabilities of Google Gemini (the gemini-1.5-flash model by default) to interpret the content and context of your image.
  2. Creative Caption Generation: Based on that analysis, the Agent writes a descriptive caption with a distinct title (often with a punny twist, as its instructions request).
  3. Structured Output: A structured output parser extracts the generated title and text as separate fields, so the caption data is easy to use in later steps.
  4. Dynamic Overlay: Calculates the caption's position, font size, and a subtle background for readability, then overlays the caption onto your image using n8n's image editing tools.

This AI Agent automates the captioning part of your visual content post-processing, making your images more informative and engaging with minimal effort.
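
The Structured Output step can be pictured with a small check like the one below. This is a minimal sketch, not the workflow's actual code: it assumes the parser yields an object with the caption_title and caption_text fields, and the helper name normalizeCaption and the sample caption are made up for illustration.

```javascript
// Hypothetical helper: checks that a parsed caption has the two expected
// string fields and trims surrounding whitespace from each.
function normalizeCaption(parsed) {
  const fields = ["caption_title", "caption_text"];
  const result = {};
  for (const field of fields) {
    const value = parsed ? parsed[field] : undefined;
    if (typeof value !== "string" || value.trim() === "") {
      throw new Error(`Missing or empty field: ${field}`);
    }
    result[field] = value.trim();
  }
  return result;
}

// Example input (invented for illustration).
const caption = normalizeCaption({
  caption_title: "  Paws for Thought ",
  caption_text: "A cat sits on a stack of books.",
});
console.log(caption.caption_title); // "Paws for Thought"
```

Failing early on a missing field keeps a malformed model response from reaching the overlay step, where it would be harder to diagnose.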

Key Features & Benefits

  • AI-driven Image Understanding: Uses Google Gemini's multimodal capabilities to interpret the scene and its context, not just list the objects in it.
  • Automated Caption Crafting: Replaces manual caption writing by generating contextually relevant, creative captions, including titles.
  • Enhanced Visuals: Overlays captions on a calculated background so the text stays readable, improving the informational value and appeal of your images.
  • Customizable AI Behavior: Tailor the captioning style (e.g., tone, length, specific instructions) by adjusting the prompts fed to the Gemini model via the Langchain node.
  • Flexible Image Input: Easily integrates with various image sources, whether from a URL (as in the template) or other n8n triggers providing image binaries.
  • Dynamic Text Placement: A Code node calculates caption position and font size based on image dimensions for a polished look.
  • End-to-End Automation: From image input to the final captioned image, the entire pipeline is managed within n8n.
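
The Dynamic Text Placement idea can be sketched roughly as follows. This is an illustrative calculation only, not the template's 'Calculate Positioning' code: the function name computeCaptionLayout, the 4% font ratio, the padding rule, and the two-line assumption are all hypothetical choices.

```javascript
// Hypothetical layout helper. All ratios below are illustrative assumptions,
// not values taken from the workflow.
function computeCaptionLayout(imageWidth, imageHeight) {
  // Scale the font with image height, with a floor so small images stay legible.
  const fontSize = Math.max(12, Math.round(imageHeight * 0.04));
  const padding = Math.round(fontSize * 0.5);
  const lineCount = 2; // assume one title line plus one text line

  // Background band anchored to the bottom edge, spanning the full width.
  const bandHeight = lineCount * fontSize + (lineCount + 1) * padding;
  const bandTop = imageHeight - bandHeight;

  return {
    fontSize,
    band: { x: 0, y: bandTop, width: imageWidth, height: bandHeight },
    // First text line sits one padding in from the left edge and
    // one padding plus one font size below the top of the band.
    textX: padding,
    textY: bandTop + padding + fontSize,
  };
}

const layout = computeCaptionLayout(1200, 800);
console.log(layout);
```

Deriving every value from the image dimensions means the same logic can serve both thumbnails and large images without hand-tuned coordinates.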

Use Cases

  • B2C E-commerce: Automatically generate and overlay engaging captions on product images for social media feeds, email campaigns, or online store listings to boost visibility and click-through rates.
  • B2B SaaS: Quickly create visually descriptive images for blog posts, knowledge base articles, case studies, or feature announcements, saving significant content creation time.
  • Solopreneurs & Content Creators: Rapidly add context, branding, or witty remarks to images shared across platforms, enhancing audience engagement and message recall.
  • Accessibility: Automatically generate accessible alt-text for images, potentially with a creative or branded spin, and overlay a summarized version as the caption.
  • Marketing Agencies: Streamline the creation of captioned visuals for client campaigns, improving turnaround times and consistency.

Prerequisites

  • An n8n instance (Cloud or self-hosted).
  • A Google Cloud Project with the Vertex AI API enabled and billing set up.
  • Credentials for the Google Gemini API (e.g., a Service Account key JSON file, or an API key if your n8n credential type asks for one). Ensure the API access is configured for your n8n environment and has permissions for multimodal models like gemini-1.5-flash or gemini-pro-vision.
  • An image to be captioned. The template uses a sample URL, but this can be dynamically provided by other nodes or triggers.

Setup Instructions

  1. Download the n8n workflow JSON file.
  2. Import the workflow into your n8n instance.
  3. Configure the 'Google Gemini Chat Model' node: Select or create new credentials for 'Google Gemini API'. Depending on the credential type, this means entering an API key or supplying the Service Account JSON key obtained from your Google Cloud Project.
  4. Verify the selected model in the 'Google Gemini Chat Model' node's parameters (default is models/gemini-1.5-flash) supports vision input and meets your needs.
  5. (Optional) Customize Image Input: Modify the 'Get Image' (HTTP Request) node to point to your desired image URL, or replace this node entirely to feed images from other sources (e.g., Webhook, Read Binary File, cloud storage trigger).
  6. (Optional) Refine AI Prompt: Adjust the Text (main instruction) and Messages (system prompt/role definition) parameters within the 'Image Captioning Agent' (Langchain LLM Chain) node to fine-tune Gemini's captioning style, tone, or specific instructions (e.g., desired caption length, details to focus on, output format).
  7. (Optional) Adjust Caption Structure: If you significantly change the desired output structure from the AI (e.g., add more fields than caption_title and caption_text), update the JSON Schema Example in the 'Structured Output Parser' node accordingly.
  8. (Optional) Customize Overlay Appearance: For different caption styling (font, color, background opacity/color) or positioning, modify the parameters in the 'Apply Caption to Image' (Edit Image) node and the JavaScript logic in the 'Calculate Positioning' (Code) node. Note: Ensure the font specified (e.g., Arial) in the 'Apply Caption to Image' node is available in your n8n execution environment, especially for self-hosted instances. You might need to install fonts or use paths to custom fonts.
  9. Activate the workflow. Test by clicking 'Test workflow' (if using the manual trigger and default image URL) or by triggering your chosen input method.
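
For the overlay customization in step 8, a background color with adjustable opacity could be produced by a small helper like this. It is a sketch under assumptions: the name withOpacity is hypothetical, and whether the 'Apply Caption to Image' node accepts 8-digit #RRGGBBAA colors should be verified in your n8n version.

```javascript
// Hypothetical helper: appends an alpha channel to a 6-digit hex color,
// producing the 8-digit #RRGGBBAA form.
function withOpacity(hexColor, opacity) {
  if (!/^#[0-9a-fA-F]{6}$/.test(hexColor)) {
    throw new Error(`Expected a #RRGGBB color, got: ${hexColor}`);
  }
  // Clamp opacity to [0, 1] and convert it to a two-digit hex byte.
  const clamped = Math.min(1, Math.max(0, opacity));
  const alpha = Math.round(clamped * 255).toString(16).padStart(2, "0");
  return hexColor.toLowerCase() + alpha;
}

console.log(withOpacity("#000000", 0.5)); // "#00000080"
```

Keeping the color and the opacity as separate inputs makes it easy to experiment with how strongly the background dims the image behind the caption.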

Tags:

AI Agent, Image Captioning, Google Gemini, Computer Vision, Content Automation, Image Processing, Langchain, Multimodal AI
