AI-Powered API Schema Crawler & Extractor Agent (Gemini, Apify, Qdrant)

Version: 1.0.0 | Last Updated: 2025-05-16

Integrates with: Apify, Google Gemini, Qdrant, Google Sheets, Google Drive

Core AI Power: 8/10
Automation Level: 8/10
Integration Reach: 5 systems
Setup Simplicity: 4/10
Adaptability: 7/10

Overview

Unlock Automated API Discovery and Schema Generation with this AI Agent

This powerful n8n workflow acts as an AI Agent to systematically find, understand, and structure API information from the web. It's designed for solopreneurs, founders, CTOs, and Heads of Automation who need to quickly gather intelligence on third-party APIs for integration, competitive analysis, or building new services.

The agent operates in three distinct, automated stages:

Research: It scours the web using Google Search (via Apify) for potential API documentation pages based on a service name and URL you provide. Web content is scraped (Apify), and Google Gemini AI classifies pages to identify relevant API documentation. Validated content is then chunked, embedded using Gemini, and stored in a Qdrant vector database for efficient retrieval.
Extraction: The agent queries the Qdrant vector store using Gemini to identify a service's core products or solutions. It then performs further semantic searches to find API documentation specifically related to these offerings. Google Gemini's information extraction capabilities are then employed to pull out detailed API operations (endpoints, methods, descriptions) from the retrieved documents. These extracted operations are logged into a Google Sheet.
Generation: Finally, the agent fetches all extracted API operations for a service from Google Sheets, intelligently groups them by resource, and constructs a custom, structured JSON schema. This schema file is then uploaded to Google Drive for your use.
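
The grouping step of the Generation stage can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual code: the field names (resource, operation, method, url, description), the "general" fallback bucket, and the output layout are assumptions, since the exact schema format is not specified here.

```python
import json
from collections import defaultdict


def build_schema(service, operations):
    """Group flat API operation rows by resource into one nested dict."""
    grouped = defaultdict(list)
    for op in operations:
        # Assumption: rows without a resource go into a catch-all bucket.
        resource = (op.get("resource") or "general").strip().lower()
        grouped[resource].append({
            "operation": op["operation"],
            "method": op["method"].upper(),
            "url": op["url"],
            "description": op.get("description", ""),
        })
    return {
        "service": service,
        "resources": [
            {"resource": name, "operations": ops}
            for name, ops in sorted(grouped.items())
        ],
    }


# Hypothetical rows, shaped like operations logged to the Google Sheet.
rows = [
    {"resource": "Users", "operation": "List users", "method": "get", "url": "/v1/users"},
    {"resource": "users", "operation": "Create user", "method": "post", "url": "/v1/users"},
    {"resource": "Invoices", "operation": "Get invoice", "method": "get", "url": "/v1/invoices/{id}"},
]
schema = build_schema("ExampleService", rows)
print(json.dumps(schema, indent=2))
```

Normalizing the resource name before grouping keeps "Users" and "users" in the same bucket, which matters when operations are extracted from several documentation pages.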

This AI Agent empowers you to automate a typically manual and time-consuming process, delivering structured API data with speed and accuracy.

Key Features & Benefits

AI-Driven Web Research: Leverages Apify for robust web searching and scraping, combined with Google Gemini for intelligent content sifting to find relevant API documentation.
Intelligent Document Processing: Uses Google Gemini to classify web content and extract specific API operation details (endpoints, methods, descriptions, resources).
Advanced Semantic Search: Employs Google Gemini embeddings and a Qdrant vector store for highly relevant document retrieval based on semantic understanding.
Automated API Operation Extraction: Automatically identifies and structures API operations from unstructured documentation text.
Custom Schema Generation: Converts extracted API operations into a well-organized JSON schema, ready for use.
Modular & Stateful Process: Operates in three clear stages (Research, Extraction, Generation), with progress tracked via Google Sheets, allowing for robust and manageable execution.
Scalable API Discovery: Efficiently processes multiple services, building a comprehensive knowledge base of API specifications.
Extensible & Customizable: Built on n8n, allowing for easy modification and integration with other tools and systems.
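
To make the semantic search feature above more concrete, here is a small, self-contained sketch of similarity-based retrieval. It assumes cosine similarity as the distance metric and uses toy three-dimensional vectors; in the workflow itself, embeddings come from Gemini and the ranking is performed by Qdrant, so this only illustrates the idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: (chunk text, made-up embedding). Real embeddings are much longer.
chunks = [
    ("GET /v1/users returns a list of users", [0.9, 0.1, 0.0]),
    ("Our pricing starts at $10 per month", [0.0, 0.2, 0.9]),
    ("POST /v1/users creates a user", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stands in for an embedded "user API endpoints" query
results = top_k(query, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two endpoint descriptions rank above the pricing text because their vectors point in nearly the same direction as the query, regardless of vector length.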

Use Cases

For B2B SaaS: Automatically discover, document, and monitor APIs of competitor products or potential integration partners.
For Development Teams: Build and maintain a comprehensive internal or external API catalog by crawling and structuring API information from diverse sources.
For Market Researchers: Systematically gather and analyze API offerings and capabilities across specific industry sectors.
For Solopreneurs & Founders: Quickly evaluate the API capabilities of third-party services to accelerate product development and integration.
For Integration Projects: Automate the initial phase of an API integration by producing a structured overview of the available endpoints.

Prerequisites

An n8n instance (Cloud or self-hosted).
An Apify API key, configured on the HTTP Request nodes that call the Apify actors (serping~fast-google-search-results-scraper and apify~web-scraper).
A Google Cloud project with the Gemini API enabled, and credentials configured in n8n for all Google Gemini nodes (e.g., gemini-1.5-flash-latest, gemini-1.5-pro-002, text-embedding-004).
A Qdrant vector database instance (Cloud or self-hosted) accessible by n8n, with credentials configured.
A Google Workspace account with Google Sheets and Google Drive API access, with credentials configured in n8n.
A master Google Sheet (see Setup Instructions for necessary columns).
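
The two Apify actors named above are called over HTTP. The sketch below only builds the requests (it does not send them) to show the two token styles Apify accepts: a Bearer header and a token query parameter. The run-sync-get-dataset-items endpoint is one Apify API route for running an actor and returning its results; whether the workflow uses this exact route is an assumption, and the input payloads are placeholders because each actor defines its own input schema.

```python
import json
import urllib.parse
import urllib.request

APIFY_BASE = "https://api.apify.com/v2/acts"


def header_auth_request(actor_id, token, actor_input):
    """Build a POST request that passes the token in an Authorization header."""
    url = f"{APIFY_BASE}/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def query_auth_request(actor_id, token, actor_input):
    """Build a POST request that passes the token as a query parameter."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{APIFY_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder input: consult each actor's input schema for the real fields.
placeholder_input = {"placeholder": "see the actor's input schema"}

search_req = header_auth_request(
    "serping~fast-google-search-results-scraper", "YOUR_APIFY_TOKEN", placeholder_input
)
scrape_req = query_auth_request(
    "apify~web-scraper", "YOUR_APIFY_TOKEN", placeholder_input
)
print(search_req.full_url)
print(scrape_req.full_url)
```

Header authentication keeps the token out of URLs and logs, so it is generally the safer of the two where the node allows it.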

Setup Instructions

Download the n8n workflow JSON file.
Import the workflow into your n8n instance.
Master Google Sheet Setup: Create a Google Sheet with the following columns, then populate the Service, Website, and row_number columns for the services you want to process.
- Service (text)
- Website (URL)
- Stage 1 - Research (text, initially empty)
- Stage 2 - Extraction (text, initially empty)
- Stage 3 - Output File (text, initially empty)
- Output Destination (text, for filename)
- row_number (number, for unique row ID)
Google Sheets Nodes Configuration: Update all Google Sheets nodes (e.g., Get All Research, Research Pending, Append Row, Extract Result) with your Google Sheet ID and the correct Sheet Name/GID. Ensure they reference the row_number column for updates.
Apify HTTP Request Nodes:
- For 'Web Search For API Schema': Configure its 'Authentication' to 'Header Auth' and add your Apify API token (e.g., Header: Authorization, Value: Bearer YOUR_APIFY_TOKEN).
- For 'Scrape Webpage Contents': Configure its 'Authentication' to 'Query Auth' and add your Apify API token (e.g., Parameter Name: token, Value: YOUR_APIFY_TOKEN).
Google Gemini Nodes: Ensure all 'Google Gemini Chat Model' and 'Embeddings Google Gemini' nodes are configured with your active Google Cloud AI credentials. You can adjust model names (e.g., gemini-1.5-pro-002 for more complex tasks) if needed.
Qdrant Vector Store Nodes: Configure the 'Store Document Embeddings', 'Search in Relevant Docs', and 'Search in Relevant Docs1' nodes with your Qdrant instance URL and API Key. The qdrantCollection parameter is dynamically set (e.g., to api_schema_crawler_and_extractor); ensure your Qdrant instance can create/use this collection.
Google Drive Node: Configure the 'Upload to Drive' node with your Google Drive credentials and specify a target folderId where the generated JSON schemas will be saved.
Review Workflow Parameters: Examine the 'Set' nodes like 'Research Event', 'Extract Event', 'Generate Event', and 'Execution Data' to understand the data being passed between workflow runs. The collection name for Qdrant is set in these event nodes.
Initial Trigger: The workflow is designed to be triggered manually using 'When clicking ‘Test workflow’'. It will then fetch tasks from your master Google Sheet based on empty status fields for each stage.
Recursive Execution: This workflow calls itself via the 'Execute Workflow' nodes to process each stage (Research, Extract, Generate). Ensure your n8n instance can handle such recursive calls (e.g., check execution limits if self-hosting).
Activate the workflow. Monitor the master Google Sheet for real-time progress updates across the stages.
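
The way tasks are picked from the master Google Sheet, based on empty status fields, can be sketched as follows. This is an illustration under stated assumptions, not the workflow's own logic: it assumes the stages run strictly in order, so the first empty stage column determines the next task, and the status values shown ("ok", a file name) are made up.

```python
# Column names follow the master sheet described in the setup steps.
STAGE_COLUMNS = [
    ("research", "Stage 1 - Research"),
    ("extraction", "Stage 2 - Extraction"),
    ("generation", "Stage 3 - Output File"),
]


def next_stage(row):
    """Return the first stage whose status column is empty, or None if done."""
    for stage, column in STAGE_COLUMNS:
        if not str(row.get(column) or "").strip():
            return stage
    return None


def pending_tasks(rows):
    """List one task per sheet row that still has an unfinished stage."""
    tasks = []
    for row in rows:
        stage = next_stage(row)
        if stage is not None:
            tasks.append({
                "row_number": row["row_number"],
                "service": row["Service"],
                "stage": stage,
            })
    return tasks


# Hypothetical sheet contents; status values are placeholders.
sheet_rows = [
    {"row_number": 2, "Service": "ServiceA", "Stage 1 - Research": "",
     "Stage 2 - Extraction": "", "Stage 3 - Output File": ""},
    {"row_number": 3, "Service": "ServiceB", "Stage 1 - Research": "ok",
     "Stage 2 - Extraction": "", "Stage 3 - Output File": ""},
    {"row_number": 4, "Service": "ServiceC", "Stage 1 - Research": "ok",
     "Stage 2 - Extraction": "ok", "Stage 3 - Output File": "servicec.json"},
]
tasks = pending_tasks(sheet_rows)
for task in tasks:
    print(task)
```

Because progress lives in the sheet rather than in the workflow, a failed or interrupted run can be resumed by simply triggering the workflow again.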

Tags: AI Agent, API Discovery, Data Scraping, Gemini, Qdrant, Automation, Information Extraction, Apify, Google Cloud
