AI-Ready Content Extractor: Webpages to Markdown & Links via Firecrawl
Overview
Unlock AI-Ready Web Content with this Automation Workflow
This n8n workflow automates the extraction of clean, structured content from web pages, making it perfectly formatted for your AI applications. It takes a list of URLs, uses Firecrawl.dev to scrape each page, converts HTML to Markdown, and extracts essential metadata like title, description, and all embedded links. The process is batched and includes rate limiting to work efficiently and respectfully with APIs.
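Outside of n8n, the core of the workflow is a single API call per URL. The sketch below shows roughly what the workflow's HTTP Request node sends and which fields it pulls out of the response; it assumes Firecrawl's v1 `/scrape` endpoint and its standard response shape, so check your Firecrawl plan's API docs before relying on field names:

```python
import json
import urllib.request

FIRECRAWL_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"  # assumed v1 endpoint

def build_scrape_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request the workflow's HTTP Request node sends to Firecrawl."""
    body = json.dumps({"url": page_url, "formats": ["markdown", "links"]}).encode()
    return urllib.request.Request(
        FIRECRAWL_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def extract_fields(response_json: dict) -> dict:
    """Map a Firecrawl scrape response onto the four fields the workflow stores."""
    data = response_json.get("data", {})
    meta = data.get("metadata", {})
    return {
        "title": meta.get("title"),
        "description": meta.get("description"),
        "content": data.get("markdown"),  # page content as Markdown
        "links": data.get("links", []),   # all links found on the page
    }
```

To actually fetch a page you would pass the built request to `urllib.request.urlopen` and feed the parsed JSON into `extract_fields`.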
Key Features & Benefits
- Automated Web Scraping: Provide a list of URLs and let the workflow handle the rest.
- AI-Friendly Markdown: Converts complex HTML into clean Markdown, ideal for input into Large Language Models (LLMs).
- Comprehensive Data Extraction: Captures page title, meta description, full Markdown content, and a list of all links on the page.
- Batch Processing: Efficiently processes large lists of URLs by breaking them into manageable chunks (default 40 URLs overall, processed in sub-batches of 10).
- Built-in Rate Limiting: Includes configurable wait times (default 45s per 10 requests) and batch sizes to manage API rate limits. Adjust as needed for your specific Firecrawl plan and limits.
- Flexible Integration: Easily connect to your existing data sources for URL input (e.g., databases, CSVs) and output the structured data to your preferred storage or downstream processes.
- Robustness: The HTTP Request node for Firecrawl is configured to retry on failure, enhancing operational stability.
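The batching and rate-limiting behavior described above can be sketched in plain Python. This is an illustrative model of the workflow's Limit, SplitInBatches, and Wait nodes, not n8n code; the function and parameter names are my own:

```python
import time

def process_in_batches(urls, handler, run_limit=40, batch_size=10,
                       wait_seconds=45, sleep=time.sleep):
    """Cap one run at `run_limit` URLs, work through them in sub-batches of
    `batch_size`, and pause `wait_seconds` between sub-batches (mirroring the
    '40 items at a time', '10 at a time', and 'Wait' nodes)."""
    capped = list(urls)[:run_limit]
    results = []
    for start in range(0, len(capped), batch_size):
        batch = capped[start:start + batch_size]
        results.extend(handler(url) for url in batch)
        if start + batch_size < len(capped):  # no wait after the final sub-batch
            sleep(wait_seconds)
    return results
```

With the defaults, 25 URLs would be processed as three sub-batches (10, 10, 5) with two 45-second pauses in between.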
Use Cases
- Populating knowledge bases for RAG (Retrieval Augmented Generation) systems.
- Gathering and structuring web content for AI model training or fine-tuning.
- Automated content ingestion for AI-powered analysis or summarization tasks.
- Extracting all hyperlinks from a list of websites for SEO analysis, link checking, or building knowledge graphs.
- Converting online documentation, blog posts, or articles into Markdown for easier processing or archiving.
Prerequisites
- An n8n instance (Cloud or self-hosted).
- Firecrawl.dev API Key (obtainable from https://firecrawl.dev/).
- A list of URLs to process. These can be provided by connecting your own data source node or by editing the example data within the workflow.
Setup Instructions
- Download the n8n workflow JSON file.
- Import the workflow into your n8n instance.
- Prepare your URL list: Either connect your database/URL source to the 'Get urls from own data source' placeholder node, or modify the 'Example fields from data source' node. Ensure the field containing the URLs is named `Page` and provides one URL per item/row.
- Configure Firecrawl.dev API Access: In the 'Retrieve Page Markdown and Links' (HTTP Request) node, navigate to 'Authentication'. Select 'Generic Credential Type', then 'HTTP Header Auth' as the 'Auth Type'. Click 'Create New Credential' (or select an existing one). For a new credential, use `Authorization` as the 'Name' and `Bearer YOUR_FIRECRAWL_API_KEY` as the 'Value' (replace `YOUR_FIRECRAWL_API_KEY` with your actual API key from Firecrawl.dev).
- Review Batching & Rate Limits: The workflow processes up to 40 URLs per run (adjustable in the '40 items at a time' Limit node), in sub-batches of 10 (adjustable in the '10 at a time' SplitInBatches node). A 45-second wait (adjustable in the 'Wait' node) is applied after each sub-batch of 10 requests. Adjust these settings based on your server's memory and your Firecrawl.dev plan's rate limits (e.g., the free tier is often limited to 10 requests/minute).
- Define Data Output: Replace the 'Connect to your own data source' (NoOp placeholder) node with your desired n8n destination node (e.g., Airtable, Google Sheets, PostgreSQL) to store the extracted `title`, `description`, `content` (Markdown), and `links` fields.
- Test the workflow with a few URLs, then activate it for full processing.
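The retry-on-failure behavior enabled on the HTTP Request node can be approximated as a simple retry loop. This is a minimal sketch with hypothetical names, not n8n's internal implementation:

```python
import time

def call_with_retries(fn, max_tries=3, backoff_seconds=5, sleep=time.sleep):
    """Retry a flaky call up to `max_tries` times with a pause between
    attempts, similar to enabling 'Retry On Fail' on an n8n node."""
    last_error = None
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception as err:
            last_error = err
            if attempt < max_tries - 1:
                sleep(backoff_seconds)
    raise last_error
```

A transient HTTP 5xx or timeout would then be retried a couple of times before the item is marked as failed.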