Vision-Based AI Web Scraper Agent with Gemini & ScrapingBee

Version: 1.0.0 | Last Updated: 2025-05-16

Core AI Power: 7/10
Automation Level: 8/10
Integration Reach: 3 systems
Setup Simplicity: 4/10
Adaptability: 7/10

Overview

Unlock Automated Web Data Extraction with this Vision-Based AI Agent

This AI Agent, "Vision-Based AI Scraper," automates the extraction of structured data from webpages. It combines Google Gemini's vision capabilities with ScrapingBee's web scraping infrastructure, and uses Google Sheets for both input and output. The agent first tries to read information directly from a screenshot of the page. If the screenshot does not provide everything it needs, it falls back to analyzing the page's HTML, which helps capture data from complex, dynamic sites.

Key Features & Benefits

AI-Powered Vision Scraping: Uses Google Gemini 1.5 Pro to understand and extract data from webpage screenshots, which helps on sites where selector-based scraping is unreliable.
Intelligent HTML Fallback: If visual extraction isn't enough, the agent calls ScrapingBee to get the page's HTML, which is then processed for any missing information. Combining the two sources improves the completeness of the extracted data.
Structured Data Output: Uses a LangChain Structured Output Parser to turn the raw extracted data into organized JSON that follows your schema (e.g., e-commerce product details).
Google Sheets Integration: Reads a list of target URLs directly from your Google Sheet and writes the extracted, structured data back into a designated "Results" sheet.
ScrapingBee for Reliability: Employs ScrapingBee to capture full-page screenshots and fetch HTML, handling proxies and browser rendering.
Customizable Extraction: Tailor the AI's instructions (system prompt) and the expected output structure (JSON schema) to fit your specific data extraction requirements.
Token-Efficient HTML Processing: Converts HTML to Markdown before sending it to the LLM during fallback, optimizing for token usage and cost.
Workflow Tool for Modularity: The HTML fallback mechanism is built as a callable tool within the main agent, which keeps the n8n workflow modular.
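To illustrate why reducing HTML before sending it to the LLM saves tokens, here is a minimal Python sketch. It is not the conversion the workflow itself performs; it is a stand-in built on the standard library's html.parser that drops script and style content and keeps visible text, marking headings Markdown-style. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class CompactTextExtractor(HTMLParser):
    """Keeps visible text only; skips script/style and prefixes headings."""

    SKIPPED = {"script", "style", "noscript"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._prefix = ""      # Markdown-style prefix for the next text chunk

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._prefix = self.HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._prefix = ""  # an empty heading must not mark later text

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.parts.append(self._prefix + text)
            self._prefix = ""


def html_to_compact_text(html: str) -> str:
    """Return one line per visible text chunk found in the HTML."""
    parser = CompactTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

On a typical product page, most of the raw HTML is markup, scripts, and styles, so the reduced text is usually much shorter than the input while still carrying the fields the model needs.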

How it Works

Input: The workflow fetches a list of URLs from your specified Google Sheet.
Screenshot & Vision Analysis: For each URL, ScrapingBee captures a full-page screenshot. This image is sent to the Google Gemini 1.5 Pro model via the Vision-based Scraping Agent.
AI Data Extraction: Gemini attempts to extract the predefined data points (e.g., product title, price, brand) based on the visual information and your custom system prompt.
HTML Fallback (If Needed): If Gemini determines it needs more information than the screenshot provides, it can trigger an "HTMLScrapingTool". This tool uses ScrapingBee to get the page's HTML, converts it to Markdown, and feeds it back to Gemini for further analysis.
Structured Output: The extracted data is parsed into a structured JSON format according to your defined schema.
Output: The structured JSON data is then split into individual items and appended as new rows to your "Results" Google Sheet.
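The screenshot-first, HTML-second control flow described in the steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: in the actual workflow Gemini itself decides whether to call the HTML tool, whereas this sketch uses a plain missing-field check as a stand-in. The field names and both extractor callables are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical field names; the real ones come from your JSON schema.
REQUIRED_FIELDS = ("product_title", "product_price", "product_brand")


def missing_fields(record: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def extract_page(
    url: str,
    extract_from_screenshot: Callable[[str], Dict[str, str]],
    extract_from_html: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Try the screenshot first; call the HTML fallback only for gaps."""
    record = dict(extract_from_screenshot(url))
    gaps = missing_fields(record)
    if gaps:
        fallback = extract_from_html(url)
        for name in gaps:
            # Only fill what the screenshot could not provide.
            if fallback.get(name):
                record[name] = fallback[name]
    record["url"] = url
    return record
```

Passing the two extractors in as callables mirrors the modular design: the fallback stays a separate tool that is invoked only when the first pass leaves gaps, so the extra ScrapingBee request and LLM tokens are spent only on pages that need them.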

Important Legal Note: Web scraping activities are subject to legal regulations and website terms of service. Ensure you have the right to scrape the targeted websites and comply with all applicable laws (e.g., GDPR, CCPA) and ethical guidelines before using this AI Agent.

Use Cases

B2C E-commerce: Automatically extract product details (names, prices, brands, promotions) from competitor or supplier websites for market analysis and pricing strategies.
B2B SaaS: Gather information from industry blogs, news sites, or directories to identify potential leads or track market trends.
B2C E-commerce: Monitor product availability and pricing changes across multiple online stores.
B2B SaaS: Collect data from online forums or communities for sentiment analysis or product feedback.
Automate the collection of structured data from websites where traditional selector-based scraping is unreliable or too complex to maintain.

Prerequisites

An n8n instance (Cloud or self-hosted).
Google Gemini API Key. The default model, gemini-1.5-pro-latest, handles vision well but can be costly; make sure your API key has access to it.
ScrapingBee API Key (offers a free tier for initial testing).
Google Cloud Platform Service Account credentials with permissions for Google Sheets API.
A Google Sheet prepared with two sheets (tabs): one named 'List of URLs' (or similar, to be configured in the 'Google Sheets - Get list of URLs' node) and another for 'Results' (matching the structure defined in the 'Structured Output Parser' and 'Google Sheets - Create Rows' nodes). The workflow's original notes may include a link to an example Google Sheet.
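Because the 'Results' sheet has to match the structure defined in the parser, it can help to derive the expected column headers from the schema example and compare them with the sheet. The sketch below assumes the schema example is a JSON array with one sample object; the field names shown are hypothetical placeholders, not the template's actual schema.

```python
import json
from typing import Dict, List

# Hypothetical schema example; replace with your own jsonSchemaExample.
SCHEMA_EXAMPLE = """
[
  {"product_title": "Example product", "product_price": "19.99",
   "product_brand": "Example brand", "promo": true}
]
"""


def results_header(schema_example: str) -> List[str]:
    """Column names for the 'Results' sheet, in schema order."""
    items = json.loads(schema_example)
    return list(items[0].keys())


def column_mismatches(schema_example: str, sheet_columns: List[str]) -> Dict[str, List[str]]:
    """Report columns the sheet lacks and columns the schema does not know."""
    expected = results_header(schema_example)
    return {
        "missing_in_sheet": [c for c in expected if c not in sheet_columns],
        "unknown_in_schema": [c for c in sheet_columns if c not in expected],
    }
```

An empty list under both keys means the sheet header and the schema agree, so every extracted field has a column to land in.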

Setup Instructions

Download the n8n workflow JSON file.
Import the workflow into your n8n instance.
Configure the 'Google Sheets - Get list of URLs' node: Select your Google Sheets service account credential and specify the Document ID and Sheet Name for your input URLs.
Configure the 'ScrapingBee - Get page screenshot' node: Enter your ScrapingBee API Key in the 'api_key' query parameter.
Configure the 'Google Gemini Chat Model' node: Select or create your Google Gemini API credential.
In the 'Vision-based Scraping Agent' node, review and customize the systemMessage to define what data the AI should extract and how. Ensure the passthroughBinaryImages option is enabled.
In the 'Structured Output Parser' node, adjust the jsonSchemaExample to match the data structure you want Gemini to return. This should align with your system prompt and Google Sheets 'Results' columns.
Configure the 'HTML-based Scraping Tool' (Workflow Tool Node): Verify it's referencing the current workflow ID (PpFVCrTiYoa35q1m in the provided template) to enable the self-calling fallback mechanism which uses the 'HTML-Scraping Tool Trigger'.
Within the fallback mechanism (nodes connected to 'HTML-Scraping Tool Trigger'), configure the 'ScrapingBee- Get page HTML' node: Enter your ScrapingBee API Key.
Configure the 'Google Sheets - Create Rows' node: Select your Google Sheets service account credential, specify the Document ID, and ensure the 'Results' Sheet Name and column mappings accurately reflect the data fields from the 'Structured Output Parser' node.
Review all credential parameters and replace placeholders like <your_scrapingbee_apikey> with your actual keys.
(Important) Ensure legal compliance: Verify that scraping the target websites is permissible and adheres to their terms of service and relevant legal regulations.
Activate the workflow and test with a few URLs from your Google Sheet.
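For orientation, the two ScrapingBee requests configured in the steps above can be sketched as plain URLs. This is an illustration, not the workflow's exact node configuration: the source only confirms the 'api_key' query parameter, while the endpoint and the 'screenshot_full_page' parameter are assumptions based on ScrapingBee's HTTP API and should be checked against its current documentation. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against ScrapingBee's documentation.
SCRAPINGBEE_ENDPOINT = "https://app.scrapingbee.com/api/v1/"
PLACEHOLDER = "<your_scrapingbee_apikey>"


def build_request_url(api_key: str, target_url: str, full_page_screenshot: bool) -> str:
    """Build a ScrapingBee request URL for a screenshot or for page HTML."""
    if not api_key or api_key == PLACEHOLDER:
        # Catches the unreplaced template placeholder before any request is made.
        raise ValueError("Replace the ScrapingBee API key placeholder first")
    params = {"api_key": api_key, "url": target_url}
    if full_page_screenshot:
        # Assumed parameter name for full-page screenshots.
        params["screenshot_full_page"] = "true"
    return SCRAPINGBEE_ENDPOINT + "?" + urlencode(params)
```

Calling it with full_page_screenshot=True corresponds to the screenshot node, and with False to the HTML node in the fallback branch. In practice, read the key from a credential store or environment variable rather than hard-coding it.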

Tags:

AI Agent, Data Scraping, Web Scraping, Google Gemini, ScrapingBee, Automation, Google Sheets, Vision AI, E-commerce
