firecrawl/firecrawl

84,034 stars · 6,093 forks · TypeScript · agpl-3.0 · 2 views

Firecrawl

Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture.
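
To make "LLM-ready markdown" concrete, here is a minimal sketch of turning HTML into markdown using only the Python standard library. It is an illustration, not Firecrawl's converter: the class name `MarkdownSketch` and the function `html_to_markdown` are hypothetical, and only headings, paragraphs, and links are handled.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-markdown converter: headings, paragraphs, and links only."""

    SKIPPED = {"script", "style"}  # content that should never reach an LLM

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished markdown blocks
        self.current = []     # text fragments of the block being built
        self.prefix = ""      # "# ", "## ", ... while inside a heading
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.href = None      # target of the link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.current.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth -= 1
        elif tag == "a":
            self.current.append(f"]({self.href})")
            self.href = None
        elif tag in ("h1", "h2", "h3", "p"):
            # Collapse runs of whitespace, then emit the finished block.
            text = " ".join("".join(self.current).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.current, self.prefix = [], ""

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

A production converter also has to deal with lists, tables, images, and malformed markup, and dynamic pages need to be rendered in a browser before any conversion can happen.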

The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live web research, interact with pages, and execute multi-step navigation tasks. It supports distributed crawling infrastructure, enabling users to scale data collection across multiple nodes while managing concurrency and long-running jobs through asynchronous queueing. The system also integrates with agentic frameworks via standardized protocols, allowing for seamless connection to AI-powered clients and automated pipelines.
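
The idea of decoupling job submission from execution under a concurrency limit can be sketched with `asyncio`. This is an in-process illustration only, not Firecrawl's queueing system: `run_jobs` and `fake_scrape` are hypothetical names, and a real deployment would use persistent workers rather than tasks in one event loop.

```python
import asyncio


async def run_jobs(urls, handler, concurrency=2):
    """Process every URL with at most `concurrency` handlers in flight."""
    queue = asyncio.Queue()
    results = {}

    for url in urls:  # submission: enqueueing returns immediately
        queue.put_nowait(url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                results[url] = await handler(url)
            except Exception as exc:  # one failed job must not stop the worker
                results[url] = exc
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()  # wait until every queued job is marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


async def fake_scrape(url):
    """Stand-in for a real scrape call; performs no network access."""
    await asyncio.sleep(0.01)
    return f"# content of {url}"
```

Calling `asyncio.run(run_jobs(urls, fake_scrape, concurrency=2))` returns a dictionary keyed by URL. Failed jobs map to the exception they raised, so one bad page does not abort the batch.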

Beyond its core extraction capabilities, the project provides a suite of developer tools for site mapping, batch scraping, and web searching. It includes features for stateful session persistence, webhook-based notifications, and configurable crawl depth, allowing for granular control over how information is retrieved and processed.
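
Configurable crawl depth can be pictured as a breadth-first traversal that stops following links past a given depth. The sketch below runs against a hypothetical in-memory link graph (`SITE`) instead of real pages. The names `crawl`, `site_links`, `max_depth`, and `limit` are assumptions made for illustration and are not taken from the Firecrawl API.

```python
from collections import deque


def crawl(start, get_links, max_depth=2, limit=100):
    """Breadth-first traversal that stops following links past max_depth."""
    seen = {start}
    order = []
    frontier = deque([(start, 0)])
    while frontier and len(order) < limit:
        url, depth = frontier.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # the page is kept, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


# Hypothetical in-memory site used instead of real HTTP fetches.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
}


def site_links(url):
    return SITE.get(url, [])
```

With this sample graph, `crawl("/", site_links, max_depth=1)` returns `["/", "/docs", "/blog"]`. Raising `max_depth` to 2 also picks up `/docs/api` and `/blog/post-1`, while `/docs/api/v2` stays out of reach.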

The project offers comprehensive API documentation and SDKs to facilitate integration into backend services and local development environments. Users can deploy the crawling infrastructure within their own private networks or utilize managed cloud services.
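
For services that call the HTTP API directly rather than through an SDK, a request can be assembled with the standard library alone. The sketch below mirrors the `curl` call to the `/v2/map` endpoint that appears in the feature list. The helper `build_request` is a hypothetical name, and the request is only constructed, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v2"


def build_request(endpoint, payload, api_key):
    """Return a POST request for `endpoint` without sending it."""
    return urllib.request.Request(
        f"{API_BASE}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("map", {"url": "https://firecrawl.dev"}, "fc-YOUR_API_KEY")
# Sending is left to the caller, for example:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```

Keeping construction separate from sending makes the headers and payload easy to inspect in tests without network access or a real API key.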

Features

Autonomous Web Agents - Interpret natural language prompts to perform complex data gathering tasks by navigating the web and making independent decisions to locate specific information across multiple sources.
Autonomous Web Researchers - Automate complex information-gathering tasks by letting agents navigate, map, and extract data from websites without manual intervention.
Agentic Web Browsing - Equip AI agents to run live web searches and interact with pages so they can retrieve information in real time.
Web Access Interfaces - Initialize standardized command-line interfaces to expose web-browsing capabilities to local development environments or coding agents for external network communication.
Application Integration SDKs - Connect web data extraction capabilities to backend services or agent loops by utilizing software development kits and API keys for programmatic access.
Automated Workflow Generators - Route requests to specialized workflow skills to automate data collection and synthesis for creating finished artifacts like research briefs or SEO audits.
LLM-Ready Data Extractors - A data extraction engine that converts unstructured web content into clean, structured formats optimized for large language model ingestion.
LLM Data Preparation Tools - Convert unstructured web content into clean, structured formats like Markdown or JSON that can be fed directly into large language models.
LLM-Driven Data Extractors - Transforms unstructured HTML into clean, semantic markdown or structured JSON by leveraging large language models for intelligent content parsing.
Web Content Scrapers - Extract content from a single URL and convert it into structured formats like markdown to prepare raw data for use in downstream applications or processing pipelines.
Web Search APIs - Retrieve relevant links and content based on natural language queries by executing search requests with configurable result limits to gather targeted information from across the internet.
Web Data Connectors - Connect web crawling and scraping capabilities to AI agents and automation platforms by utilizing pre-built connectors and server-side tools for data ingestion.
Web Data Pipelines - Connect live web data sources to backend services and automation workflows using standardized APIs and protocols for consistent ingestion.
Browser Automation Interfaces - Scrape a page, then interact with it using AI prompts or code.
```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR_API_KEY")

result = app.scrape("https://amazon.com")
scrape_id = result.metadata.sc…
```
Website Crawlers - Discover all URLs on a website instantly.
```bash
curl -X POST 'https://api.firecrawl.dev/v2/map' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://firecrawl.dev"}'
```
Autonomous Web Crawlers - A recursive navigation service that maps site structures and traverses domains to aggregate comprehensive datasets from multiple interconnected pages.
Distributed Crawling Infrastructures - A scalable architecture for executing large-scale web data collection tasks across private or managed environments with built-in concurrency and error management.
Large-Scale Domain Crawlers - Systematically discover and index entire domains to retrieve comprehensive datasets for training, analysis, or content migration projects.
Web Crawlers - Crawl an entire website and get content from all pages.
```bash
curl -X POST 'https://api.firecrawl.dev/v2/crawl' \
  -H 'Authorization: Bearer fc-YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{ "url": "https:/…
```
Web Scraping APIs - Get LLM-ready data from any website: markdown, JSON, screenshots, and more.
```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR_API_KEY")

result = app.scrape('firecrawl.dev')
```
Autonomous Research Agents - Retrieve structured data from target URLs by providing prompts and output schemas to automate complex research tasks without manual navigation.
Headless Browser Orchestrators - Executes dynamic page rendering and interaction by controlling isolated browser instances to capture content from JavaScript-heavy web applications.
Recursive Web Crawlers - Follow links up to a specified depth to retrieve structured data from multiple pages while automatically waiting for completion to ensure all content is captured.
Batch Scrapers - Scrape multiple URLs at once:
```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR_API_KEY")

job = app.batch_scrape([
    "https://firecrawl.dev",
    "https://docs.firecrawl.dev",
    "https://firecrawl.dev/pr…
```
Batch Web Scrapers - Extract content from multiple web pages simultaneously by providing a list of URLs to convert unstructured web data into structured formats like markdown.
Crawl API Endpoints - The crawl API endpoint, a named example documented in this learning resource.
Screenshot Capture Services - The scrape screenshot capability, a named example documented in this learning resource.
Distributed Crawl Coordination - Scales web discovery across multiple nodes by partitioning URL frontiers and managing concurrency limits to ensure efficient site indexing.
Model Context Protocol Integrations - Connect AI-powered clients to web data sources using a standardized protocol to facilitate seamless tool integration and real-time information retrieval.
Asynchronous Job Queues - Manages long-running crawling and scraping tasks by decoupling request submission from execution via persistent background worker processes.
Web Data Service Integrations - Connect web scraping and data cleaning tools to automation workflows and agentic frameworks using standardized protocols to ensure consistent data ingestion across diverse development environments.
Agentic Browsing Interfaces - A programmatic layer that exposes web navigation and interaction capabilities to autonomous agents through standardized protocols and tool definitions.
Stateful Session Persistence - Maintains browser context and authentication state across multiple interactions to enable complex navigation flows and multi-step web tasks.
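
The "partitioning URL frontiers" idea from the Distributed Crawl Coordination entry can be illustrated by hashing each URL's host name to pick a node. This is one common approach, shown as a sketch under assumptions: the source does not describe how Firecrawl partitions work, and `assign_node` and `partition` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url, node_count):
    """Map a URL to a node index by hashing its host name."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partition(urls, node_count):
    """Split a URL frontier into one list per node."""
    shards = [[] for _ in range(node_count)]
    for url in urls:
        shards[assign_node(url, node_count)].append(url)
    return shards
```

Hashing the host rather than the full URL sends every page of a site to the same node, so a per-site concurrency limit can be enforced locally. The trade-off is that a single very large site cannot be spread across nodes.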