AI Web Content Ingestion Agent: Firecrawl Scraper to Markdown & Links
Overview
Unlock AI-Ready Web Content with this AI Agent
This n8n workflow acts as an AI Web Content Ingestion Agent, designed to streamline the process of gathering and preparing web content for your AI applications. It takes a list of URLs, uses Firecrawl.dev to scrape each page, converts the HTML content into AI-friendly markdown, and extracts all associated links, along with metadata like title and description. This structured output is perfect for feeding into Large Language Models (LLMs), populating knowledge bases, or any data pipeline requiring clean web content.
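Under the hood, each page comes back from a single Firecrawl scrape call. The sketch below shows roughly what that request and the extracted fields look like; the endpoint, request body, and response fields are assumptions based on Firecrawl's v1 scrape API and may differ for your account or API version, so verify them against the Firecrawl documentation.

```typescript
// Rough sketch (TypeScript, Node 18+) of the kind of call the workflow's
// HTTP Request node makes per page. The endpoint and field names are
// assumptions based on Firecrawl's v1 scrape API; check the current docs.
async function scrapePage(url: string, apiKey: string) {
  const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // same Bearer token used in the workflow credential
    },
    body: JSON.stringify({ url, formats: ["markdown", "links"] }),
  });
  if (!res.ok) throw new Error(`Firecrawl request failed: ${res.status}`);

  const { data } = await res.json();
  return {
    title: data?.metadata?.title,             // page title
    description: data?.metadata?.description, // meta description
    content: data?.markdown,                  // AI-friendly markdown
    links: data?.links,                       // discovered hyperlinks
  };
}
```

In the workflow, these same four fields are surfaced by the 'Markdown data and Links' node described in the setup instructions below.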
Key Features & Benefits
- Automated Web Scraping: Leverages Firecrawl.dev for reliable and efficient scraping of web pages.
- HTML to Markdown Conversion: Transforms raw HTML into clean, structured markdown, ideal for AI processing.
- Comprehensive Data Extraction: Captures page title, meta description, full markdown content, and all discoverable hyperlinks.
- Batch Processing & Rate Limiting: Intelligently handles large lists of URLs with built-in batching and configurable wait times to respect API limits and server resources.
- Flexible Integration: Easily connect your existing data sources for URL input (e.g., CSV, database, Google Sheets) and pipe the extracted data to your preferred destination (e.g., Airtable, database, vector store).
- Scalable Ingestion: Designed to process URLs efficiently, with notes on adjusting batch sizes based on server memory (default 40 URLs per main batch, 10 URLs per API call batch); see the pacing sketch after this list.
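To make the default numbers concrete, here is an illustrative pacing loop roughly equivalent to what the Limit, SplitInBatches, and Wait nodes do inside the workflow. It is a sketch only; the batch sizes and wait time are the workflow defaults and should be tuned to your server memory and Firecrawl plan.

```typescript
// Illustrative pacing only: 40 URLs per main batch, API calls in sub-batches
// of 10, and a 45-second pause between sub-batches (the workflow's defaults).
// In n8n this is handled by the Limit, SplitInBatches, and Wait nodes.
const MAIN_BATCH_SIZE = 40;
const API_BATCH_SIZE = 10;
const WAIT_SECONDS = 45;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function processWithPacing(
  urls: string[],
  scrape: (url: string) => Promise<unknown>,
): Promise<void> {
  for (const mainBatch of chunk(urls, MAIN_BATCH_SIZE)) {
    for (const apiBatch of chunk(mainBatch, API_BATCH_SIZE)) {
      await Promise.all(apiBatch.map(scrape)); // up to 10 requests per sub-batch
      // Pause before the next sub-batch to respect rate limits (e.g., ~10 requests/minute).
      await new Promise((resolve) => setTimeout(resolve, WAIT_SECONDS * 1000));
    }
  }
}
```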
Use Cases
- B2C E-commerce: Gather competitor product page content in markdown for AI-powered feature comparison and market analysis.
- B2B SaaS: Scrape industry blogs and knowledge bases to build a corpus for training a specialized support chatbot.
- Content Marketers: Automate the collection of web research, converting articles into structured markdown for easier AI-driven summarization, re-purposing, or populating a content management system.
- Founders & Solopreneurs: Quickly amass and structure online information for AI analysis, competitive intelligence, or building curated datasets.
- AI Developers: Streamline the data ingestion pipeline for LLMs by feeding them clean, markdown-formatted web content.
Prerequisites
- An n8n instance (Cloud or self-hosted).
- A Firecrawl.dev account and API Key (Bearer Token).
- A list of URLs to process. You can input this via the example 'Set' node or connect to your own data source (e.g., Google Sheets, database).
- (Recommended) Credentials for your target data store if you plan to save the output automatically (e.g., Airtable API key, database credentials).
Setup Instructions
- Download the n8n workflow JSON file.
- Import the workflow into your n8n instance.
- Input URLs:
  - Modify the 'Example fields from data source' node: edit the array in the 'Page' field with your list of URLs.
  - Alternatively, replace the 'Example fields from data source' node with a node that fetches URLs from your preferred data source (e.g., Google Sheets, Airtable, a database). Ensure the field containing the URLs is named 'Page' (the expected item shape is sketched after these instructions).
- Configure Firecrawl API Key:
  - Select the 'Retrieve Page Markdown and Links' (HTTP Request) node.
  - Under 'Authentication', ensure 'Generic Credential Type' is selected and 'httpHeaderAuth' is chosen for 'Generic Auth Type'.
  - Click on 'Credential for Header Auth' and either select your existing Firecrawl credential or create a new one:
    - Name: Authorization
    - Value: Bearer YOUR_FIRECRAWL_API_KEY (replace YOUR_FIRECRAWL_API_KEY with your actual key).
- Adjust Batching & Rate Limits (Optional):
  - The '40 items at a time' (Limit) node processes URLs in main batches. Adjust this if your server has a different memory capacity (a sticky note in the workflow suggests 40 as a good starting point for some servers).
  - The '10 at a time' (SplitInBatches) node prepares smaller batches for the API calls.
  - The 'Wait' node is set to 45 seconds to respect potential Firecrawl API limits (e.g., 10 requests/minute). Adjust the 'Amount' (seconds) if needed based on your Firecrawl plan or observed API behavior.
- Configure Output:
  - The 'Connect to your own data source' (NoOp) node is a placeholder.
  - Replace it with a node that sends the extracted data (available from the 'Markdown data and Links' node: title, description, content, links) to your system of choice (e.g., Airtable 'Create Record', Google Sheets 'Append Row', a database node, etc.). A mapping sketch follows these instructions.
- Save and activate the workflow. Test with a few URLs first.
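As referenced in the 'Input URLs' step, whatever data source you connect only needs to hand the workflow one item per URL, with the URL in a field named 'Page'. A minimal sketch of that shape (the URLs are placeholders):

```typescript
// Sketch of the input items the workflow expects: one object per URL,
// with the URL stored in a field named "Page". The URLs are placeholders.
interface UrlItem {
  Page: string;
}

const urls: UrlItem[] = [
  { Page: "https://example.com/pricing" },
  { Page: "https://example.com/blog/some-article" },
];

console.log(`Prepared ${urls.length} URLs for ingestion`);
```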
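And as referenced in the 'Configure Output' step, sending the results downstream usually means flattening the four output fields into one row per page. A hypothetical mapping sketch (the destination column names are invented; match them to your own schema):

```typescript
// Hypothetical mapping from the workflow's output fields (title, description,
// content, links) to a flat row for a spreadsheet, database, or Airtable base.
// Column names on the right are examples only.
function toDestinationRow(page: {
  title: string;
  description: string;
  content: string;
  links: string[];
}) {
  return {
    Title: page.title,
    Description: page.description,
    Markdown: page.content,
    Links: page.links.join("\n"), // flatten the link list into one text cell
    IngestedAt: new Date().toISOString(),
  };
}
```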
Want your own unique AI agent?
Talk to us - we know how to build custom AI agents for your specific needs.
Schedule a Consultation