
AI Agent: Web Content & Product Data Extractor

Version: 1.0.0 | Last Updated: 2025-05-16

Integrates with:

OpenAI, Google Sheets, Jina.ai, Langchain

Overview

Unlock Automated Product Data Collection with this AI Agent

This n8n workflow acts as an intelligent AI Agent designed to scrape web pages, specifically product listings, and extract structured data. It leverages Jina.ai for clean content retrieval and an OpenAI-powered Langchain node for precise information extraction. Imagine automatically populating your product databases or competitor analysis sheets without manual copy-pasting.

This agent is ideal for solopreneurs and businesses needing to gather product details, competitor intelligence, or any structured information from websites efficiently.

Key Features & Benefits

  • AI-Powered Data Extraction: Leverages Langchain with OpenAI models to intelligently identify and pull specific data points (e.g., product name, price, availability, image URL, product URL) from unstructured web content based on your defined schema.
  • Simplified Web Content Retrieval: Integrates Jina.ai's reader function (r.jina.ai) to fetch clean, easily parsable content from web pages, bypassing many common scraping complexities.
  • Structured Output for Databases: Converts messy web data into a clean, JSON array format, perfect for populating databases or spreadsheets.
  • Automated Google Sheets Logging: Seamlessly appends all extracted information to a designated Google Sheet, making data access and further analysis straightforward.
  • Highly Customizable: Easily adapt the target URL for Jina.ai, fine-tune the OpenAI prompt in the Information Extractor node, and modify the data schema to scrape diverse websites and extract various types of information.
  • Boost Productivity: Eliminates tedious manual data collection and entry, freeing up valuable time for focused work.
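
The reader-based retrieval described above amounts to putting https://r.jina.ai/ in front of the page URL. A minimal Python sketch of that step, outside n8n, assuming the plain-prefix form used by the workflow's example URL; the helper name jina_reader_url is hypothetical and not part of the workflow:

```python
READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(target_url: str) -> str:
    """Return the reader URL for target_url by prefixing the reader host."""
    target_url = target_url.strip()
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("target_url must start with http:// or https://")
    if target_url.startswith(READER_PREFIX):
        # Already a reader URL; avoid prefixing it twice.
        return target_url
    return READER_PREFIX + target_url
```

Calling jina_reader_url("http://books.toscrape.com/") yields the same shape of URL that the 'Jina Fetch' node ships with.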

Use Cases

  • B2C E-commerce: Automatically scrape competitor pricing, product descriptions, and availability for market analysis.
  • B2C E-commerce: Populate new product listings in a Google Sheet by extracting data from supplier or manufacturer websites.
  • B2B SaaS: Gather feature lists, pricing tiers, or integration partners from competitor websites for competitive intelligence.
  • Market Research: Collect product details from multiple e-commerce sites to identify trends, average pricing, and stock levels.
  • Content Aggregation: Systematically extract specific data points from articles, forums, or listings to build a knowledge base.

Prerequisites

  • An n8n instance (Cloud or self-hosted).
  • OpenAI API Key with access to a suitable model (e.g., gpt-3.5-turbo, gpt-4).
  • Jina.ai API Key, if your r.jina.ai usage requires one. Configure it as an 'HTTP Header Auth' credential in n8n and select it in the 'Jina Fetch' node.
  • Google Sheets credentials configured in n8n.
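
For the Jina.ai key, a header-based credential comes down to one header name and one value. The sketch below assumes the key is sent as a bearer token in the Authorization header, which this page does not state, so check Jina.ai's documentation for the exact format. The function name jina_auth_headers is hypothetical.

```python
from typing import Optional


def jina_auth_headers(api_key: Optional[str]) -> dict[str, str]:
    """Build request headers for the reader call; empty when no key is set."""
    if not api_key:
        # Keyless use: send no auth header at all.
        return {}
    # Assumed format: a bearer token in the Authorization header.
    return {"Authorization": f"Bearer {api_key}"}
```

Under that assumption, the n8n credential would use Authorization as the header name and "Bearer " followed by your key as the value.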

Setup Instructions

  1. Download the n8n workflow JSON file.
  2. Import the workflow into your n8n instance.
  3. Configure the 'Jina Fetch' node:
    • Replace the example URL (currently https://r.jina.ai/http://books.toscrape.com/...) with your target website URL, keeping the https://r.jina.ai/ prefix in front of it.
    • Ensure your Jina.ai API Key (if needed) is set up as an 'HTTP Header Auth' credential in n8n and selected in this node.
  4. Configure the 'OpenAI Chat Model' node:
    • Select or add your OpenAI API credentials.
  5. Customize the 'Information Extractor' node:
    • The text input defaults to {{ $json.data }} from the 'Jina Fetch' node, which is usually correct.
    • Review and modify the systemPromptTemplate. The current prompt is geared towards extracting book details: "Each book should have a title, price, availability and product_url, image_url". Adapt this for your specific needs.
    • Update the inputSchema to match the exact data points (and their types) you want to extract. This schema guides the AI.
  6. Configure the 'Save to Google Sheets' node:
    • Select your Google Sheets credentials.
    • Choose your target Google 'Document ID' (Spreadsheet) and 'Sheet Name'.
    • The 'Information Extractor' outputs data with field names defined in its schema (e.g., title, price, availability, product_url, image_url).
    • The Google Sheets node will attempt to auto-map these. Ensure your Google Sheet columns are named accordingly.
    • If your sheet uses different column names (e.g., the node defaults to looking for name, link, image), adjust the 'Columns' settings in the Google Sheets node. Either rename the fields in its 'Schema' display to match the Extractor's output, or switch 'Mapping Mode' to 'Map Each Column Manually' for precise control (e.g., map your sheet's name column to {{ $json.title }}).
  7. Activate the workflow.
  8. (Optional) Refer to the Sticky Note within the workflow for a helpful YouTube tutorial link and an example Google Sheet structure provided by the original workflow author.
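
The schema and column-mapping steps above can be sketched in Python, outside n8n, to show what the renaming amounts to. The five field names come from the example prompt, and name, link, image are the sheet defaults mentioned in step 6. The price and availability column names, the helper name to_sheet_rows, and the sample values in the usage note are assumptions for illustration only.

```python
import json

# Fields named in the example prompt for the 'Information Extractor' node.
EXPECTED_FIELDS = ("title", "price", "availability", "product_url", "image_url")

# Hypothetical mapping from extractor field names to sheet column names.
COLUMN_MAP = {
    "title": "name",
    "product_url": "link",
    "image_url": "image",
    "price": "price",                # assumed column name
    "availability": "availability",  # assumed column name
}


def to_sheet_rows(extractor_output: str) -> list[dict[str, str]]:
    """Parse a JSON array of items and rename fields to sheet column names."""
    items = json.loads(extractor_output)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    rows = []
    for item in items:
        # Fail early if the model omitted a field the sheet expects.
        missing = [f for f in EXPECTED_FIELDS if f not in item]
        if missing:
            raise ValueError(f"item is missing fields: {missing}")
        rows.append({COLUMN_MAP[f]: str(item[f]) for f in EXPECTED_FIELDS})
    return rows
```

Given a made-up item such as {"title": "Example Book", "price": "10.00", "availability": "In stock", "product_url": "https://example.com/p/1", "image_url": "https://example.com/p/1.jpg"}, the function returns one row keyed by name, price, availability, link, and image. This is the same renaming that 'Map Each Column Manually' performs inside the Google Sheets node.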

Tags:

AI Agent, Data Scraping, OpenAI, Langchain, Web Automation, Google Sheets, Product Data, Information Extraction, Jina.ai
