Data Extraction & Ingestion — Data & Databases
We curate 8 GitHub repositories in Data & Databases · Data Extraction & Ingestion.
- jackfrued/Python-100-Days
178,734 stars
This project is a comprehensive, day-by-day curriculum designed to guide learners through the Python programming language and its professional applications. The content spans from fundamental syntax and object-oriented design to advanced topics including database management, web development, data analysis, and machine learning. The curriculum is structured into distinct modules that cover practical software engineering practices, such as version control, containerization, and system architecture. It also provides resources for technical interview preparation and an analysis of career paths within the software development and data science ecosystems. The material is delivered through a series of structured lessons and practical exercises.
Jupyter Notebook
- papers-we-love/papers-we-love
103,417 stars
Papers We Love is a community-driven repository and learning network dedicated to the study and discussion of foundational computer science literature. It functions as a centralized educational archive, providing a structured environment where software professionals can engage with academic research to bridge the gap between theoretical concepts and practical application. The project distinguishes itself through a decentralized model of crowdsourced curation, where community members collectively maintain and categorize a vast index of technical resources. Beyond the repository itself, the initiative supports a global network of autonomous regional chapters that operate under shared governance standards to facilitate in-person knowledge sharing. This ecosystem is further supported by an extensive library of archived expert presentations and curated reading methodologies designed to improve technical literature literacy. The platform organizes its scholarly resources through a hierarchical directory structure, enabling efficient navigation and version-controlled tracking of academic content. It provides tools for discovering external research repositories, establishing contribution standards for collaborative growth, and developing community-focused applications that extend the utility of the shared knowledge base.
Shell · awesome, computer-science, meetup
- microsoft/markitdown
87,305 stars
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document structures and formatting requirements. This flexibility is supported by an integrated optical character recognition capability that ensures text recovery from embedded images during the conversion process. The system provides both a command-line interface and a programmatic library, facilitating automated batch processing and custom integration into data pipelines. To ensure consistent performance across different environments, the project supports deployment within containerized architectures that encapsulate all necessary system-level dependencies and binaries.
Python · autogen, autogen-extension, langchain
- firecrawl/firecrawl
84,034 stars
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live web research, interact with pages, and execute multi-step navigation tasks. It supports distributed crawling infrastructure, enabling users to scale data collection across multiple nodes while managing concurrency and long-running jobs through asynchronous queueing. The system also integrates with agentic frameworks via standardized protocols, allowing for seamless connection to AI-powered clients and automated pipelines. Beyond its core extraction capabilities, the project provides a suite of developer tools for site mapping, batch scraping, and web searching. It includes features for stateful session persistence, webhook-based notifications, and configurable crawl depth, allowing for granular control over how information is retrieved and processed. The project offers comprehensive API documentation and SDKs to facilitate integration into backend services and local development environments. Users can deploy the crawling infrastructure within their own private networks or utilize managed cloud services.
TypeScript · ai, ai-agents, ai-crawler
- browser-use/browser-use
78,576 stars
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows without relying on brittle selectors. The system functions as a headless browser controller, providing a programmatic interface to manage browser instances and execute granular interactions. The project distinguishes itself through its ability to translate high-level intent into specific browser primitives, supported by a serialization process that converts complex web page structures into simplified text for model processing. It includes robust support for stateful session persistence, allowing agents to maintain authenticated environments across long-running tasks. Furthermore, the framework facilitates remote browser orchestration, enabling the scaling of automation routines in cloud environments with integrated support for stealth configurations and proxy management. Beyond its core agent capabilities, the platform provides extensive tooling for structured data extraction and workflow integration. It supports a variety of model configurations and allows for the definition of custom tools to extend interaction logic. The project documentation includes quickstart guides for command-line execution and examples for integrating browser automation into broader software ecosystems.
Python · ai-agents, ai-tools, browser-automation
- netdata/netdata
77,812 stars
Netdata is a distributed observability platform designed for real-time infrastructure monitoring and performance tracking. It functions as a high-frequency agent that collects system, container, and application metrics with per-second precision, providing both local visualization and centralized aggregation across complex, multi-cloud environments. The platform distinguishes itself through edge-based intelligence, utilizing local machine learning models to automatically detect performance anomalies without requiring manual configuration or external query engines. Its architecture prioritizes local-first data persistence and secure metadata-only synchronization, ensuring that granular observability data remains on the host while essential system information is routed to a cloud-connected management plane. This hierarchical approach allows for horizontal scaling through parent-child node relationships, enabling unified monitoring and alerting across distributed infrastructure. Beyond core collection and analysis, the system supports automated troubleshooting through natural language querying and intelligent metric correlation. It features a modular data acquisition engine that employs thread-per-core execution for low-latency performance, alongside isolated external processes for heterogeneous application support. The platform includes automated service discovery, diverse deployment options, and built-in diagnostic utilities to maintain visibility and connectivity across large-scale clusters. Installation is supported through various methods including package managers, automated scripts, source compilation, and containerized orchestration.
C · ai, alerting, cncf
- infiniflow/ragflow
73,425 stars
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations. The platform distinguishes itself through deep document understanding and sophisticated knowledge orchestration. It supports complex document parsing, including the extraction of tables and images, and utilizes graph-based indexing to enhance reasoning over large document collections. Users can configure multiple recall strategies and fused re-ranking to optimize retrieval accuracy, while the system maintains context through multi-turn dialogue management and flexible tool-use frameworks. The architecture is built on a modular, containerized microservice foundation that supports both local inference engines and external language model APIs. It includes asynchronous task processing for document ingestion and indexing, ensuring system responsiveness during heavy workloads. The platform also provides a standardized interface for model abstraction, allowing for seamless integration with existing language model ecosystems. Developers can interact with the platform through a comprehensive suite of RESTful endpoints and Python client libraries, which cover the full lifecycle of agents, datasets, and knowledge graphs. The system is designed for flexible deployment, offering configurable environment settings and support for custom containerized environments to facilitate local development and infrastructure portability.
Python · agent, agentic, agentic-ai
- tesseract-ocr/tesseract
72,460 stars
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis framework that employs a hybrid approach to resolve complex structures like multi-column text and tables. It offers extensive configurability, allowing users to refine recognition accuracy through custom linguistic models, user-defined dictionaries, and specialized training pipelines. The engine supports the generation of various structured outputs, including searchable PDFs with hidden text layers, and provides hardware-accelerated math kernels to optimize inference performance. Beyond core recognition, the system includes comprehensive tooling for image pre-processing, page segmentation, and the management of modular language data. It provides C and C++ APIs alongside various language-specific wrappers, enabling integration into diverse software environments. The engine is available as pre-built binary packages or can be compiled from source using standard system compilers.
C++ · hacktoberfest, lstm, machine-learning