AI-Powered Ultimate Web Scraper Agent: Selenium & OpenAI
Overview
Unlock Advanced Web Data Extraction with this AI Agent
This AI Agent is your ultimate solution for complex web scraping tasks. It automates data collection from virtually any website, whether you have a direct URL or just a subject and a domain. It leverages Selenium to navigate websites like a human, handle JavaScript-heavy pages, and even manage sessions that require login by injecting cookies. The real power comes from its AI capabilities: OpenAI intelligently identifies the most relevant page for your query via Google Search, analyzes screenshots of web pages (thanks to GPT-4o's vision), and then precisely extracts the data points you need.
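Inside n8n, the screenshot analysis happens in the workflow's OpenAI node, but the underlying API call is easy to picture. Here is a minimal standalone sketch; the file path, prompt, and extraction targets are illustrative, not copied from the workflow:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load a page screenshot captured by Selenium (path is illustrative).
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Ask a vision-capable model to "read" the rendered page.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the product name, price, and availability shown on this page."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```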
Key Features & Benefits
- Intelligent URL Discovery: Automatically finds the target webpage using Google Search and an AI-driven relevance assessment if a direct URL isn't provided.
- Full Browser Automation: Utilizes Selenium to interact with websites, handling dynamic content, AJAX calls, and complex navigation.
- Session Management: Supports cookie injection to scrape data from pages that require user login.
- AI-Powered Visual Data Extraction: Takes screenshots of target pages and uses OpenAI's GPT-4o vision capabilities to 'read' the page and identify relevant information, even if it's not easily selectable text.
- Structured Data Output: Employs AI to extract specific, structured data points from the scraped content or image analysis, based on your defined targets.
- Anti-Scraping Evasion: Implements techniques like cleaning the webdriver signature and using custom user-agents to reduce the likelihood of being blocked (see the sketch after this list).
- Flexible Input: Can be triggered with a subject and domain for discovery, or a direct target URL.
- Comprehensive Error Handling: Includes multiple checks and session management steps to ensure robust operation.
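The workflow drives Selenium through plain HTTP Request nodes rather than a language binding. As a rough illustration of the W3C WebDriver calls involved (the exact capabilities, arguments, and cookie values in the workflow may differ), here is a Python sketch that opens a session with a custom user-agent and a flag that hides the automation signature, injects a cookie, and takes a screenshot:

```python
import requests

SELENIUM = "http://selenium_chrome:4444/wd/hub"  # default Grid URL from this workflow

# Create a Chrome session with basic anti-detection arguments.
session = requests.post(f"{SELENIUM}/session", json={
    "capabilities": {"alwaysMatch": {
        "browserName": "chrome",
        "goog:chromeOptions": {"args": [
            "--headless=new",
            "--disable-blink-features=AutomationControlled",  # hide navigator.webdriver
            "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
        ]},
    }},
}).json()
session_id = session["value"]["sessionId"]

# Navigate first: WebDriver only accepts cookies for the current domain.
requests.post(f"{SELENIUM}/session/{session_id}/url",
              json={"url": "https://example.com"})

# Inject a session cookie (name, value, and domain are illustrative).
requests.post(f"{SELENIUM}/session/{session_id}/cookie", json={
    "cookie": {"name": "sessionid", "value": "abc123", "domain": "example.com"},
})

# Capture a screenshot (base64 PNG) for the vision step, then clean up.
shot = requests.get(f"{SELENIUM}/session/{session_id}/screenshot").json()["value"]
requests.delete(f"{SELENIUM}/session/{session_id}")
```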
Use Cases
- For B2C E-commerce: Automate daily competitor price tracking and stock availability checks, even on sites behind logins or dynamic loading. Extract customer reviews and product details at scale for market analysis.
- For B2B SaaS: Gather competitive intelligence by scraping features, pricing, and customer testimonials from competitor websites. Automate lead list building by identifying and extracting contact information from company websites based on industry and keywords.
- General Business: Monitor brand mentions or specific topics across various websites. Collect data for research projects from multiple online sources without manual browsing.
Prerequisites
- An n8n instance (Cloud or self-hosted).
- OpenAI API Key with access to GPT-4o (or a comparable vision-capable model).
- A running Selenium Grid (e.g., the `selenium_chrome` Docker container) accessible by your n8n instance at `http://selenium_chrome:4444/wd/hub`. Refer to the project's GitHub for setup: https://github.com/Touxan/n8n-ultimate-scraper. A quick reachability check is sketched after this list.
- (Optional for logged-in sites) A browser extension for collecting session cookies (see project GitHub for details).
- (Optional for bypassing IP blocks) Residential proxy server configuration (see project GitHub and workflow notes).
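Before importing the workflow, it is worth confirming that n8n can actually reach the Grid. A minimal reachability check, assuming the default `selenium_chrome` hostname from the project's Docker setup:

```python
import requests

# Selenium Grid exposes a /status endpoint; "ready" means it can accept sessions.
status = requests.get("http://selenium_chrome:4444/wd/hub/status", timeout=5).json()
print("Grid ready:", status["value"]["ready"])
```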
Setup Instructions
- Download the n8n workflow JSON file.
- Import the workflow into your n8n instance.
- Ensure your Selenium Grid is running and accessible to n8n. The default URL used in the workflow is `http://selenium_chrome:4444`. Adjust the HTTP Request nodes targeting Selenium if your setup differs.
- Configure your OpenAI API Key in all 'OpenAI Chat Model', 'OpenAI' (image analysis), and 'Information Extractor' nodes. Ensure the selected model (e.g., gpt-4o) is appropriate.
- The workflow is triggered by a Webhook. Review the 'Webhook' node and the initial 'Edit Fields' node to understand the expected JSON input structure: `subject` (optional), `Url` (domain, optional if `Target Url` is provided), `Target Url` (optional), `Target data` (an array of objects with `DataName` and `description`), and `cookies` (an optional array of cookie objects). A sample request is sketched after these steps.
- If scraping sites that require login, use the recommended browser extension (see project GitHub: https://github.com/Touxan/n8n-ultimate-scraper) to obtain cookies and pass them in the webhook request.
- If using a proxy, configure it within your Selenium Docker setup and update the `--proxy-server` argument in the 'Create Selenium Session' node's JSON body, as per the workflow's sticky note and project documentation.
- Review and customize the prompts in the 'Information Extractor' nodes to fine-tune data extraction for your specific needs.
- Activate the workflow and test with a sample POST request to the webhook URL.
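For a first test, a request like the following exercises the full input structure described above. The webhook path and all field values are placeholders; only the field names come from the workflow:

```python
import requests

# Replace with your n8n instance's webhook URL for this workflow.
WEBHOOK_URL = "https://your-n8n-instance/webhook/your-webhook-path"

payload = {
    "subject": "pricing of the pro plan",  # optional search subject
    "Url": "example.com",                  # domain, optional if Target Url is set
    "Target Url": "",                      # direct page URL, optional
    "Target data": [                       # the data points to extract
        {"DataName": "price", "description": "Monthly price of the pro plan"},
        {"DataName": "features", "description": "Features included in the pro plan"},
    ],
    "cookies": [],                         # optional, for logged-in pages
}

response = requests.post(WEBHOOK_URL, json=payload, timeout=300)
print(response.status_code, response.text)
```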
Want your own unique AI agent?
Talk to us - we know how to build custom AI agents for your specific needs.
Schedule a Consultation