Search & Information Retrieval — Data & Databases
We curate 17 GitHub repositories matching Data & Databases · Search & Information Retrieval.
- EbookFoundation/free-programming-books
382,801 stars
This project is a centralized, open-access repository that serves as a structured directory for technical education and professional development. It functions as a community-driven knowledge base, aggregating high-quality learning materials to support global accessibility to computer science and software engineering resources. The platform distinguishes itself through a collaborative governance model that utilizes peer-reviewed workflows for all content additions and modifications. By leveraging structured text files and decentralized version control, the repository maintains a searchable, human-readable index that is continuously updated and categorized through community-driven metadata tagging. The collection encompasses a broad range of educational assets, including comprehensive technical literature, structured online courses, and interactive programming tutorials. Users can access resources for skill acquisition, interview preparation, and rapid syntax reference, with content organized by programming language, technical domain, and human language to facilitate self-directed study.
Python · books · education · hacktoberfest
- vinta/awesome-python
283,687 stars
This project is a comprehensive, community-curated directory that organizes a vast landscape of Python software libraries, frameworks, and tools. It serves as a centralized knowledge base designed to facilitate ecosystem navigation and accelerate developer discovery across the entire software development lifecycle. The directory distinguishes itself by providing a structured index of resources categorized by technical domain, ranging from foundational development utilities to specialized engineering fields. It covers high-level capabilities including artificial intelligence, data science, web development, and infrastructure management, allowing developers to identify vetted solutions for specific technical challenges. The project encompasses a broad capability surface, including tools for dependency management, static code analysis, and automated testing. It also catalogs resources for persistent data storage, cloud infrastructure orchestration, and interface development, providing a unified reference for building and maintaining complex software systems.
Python · awesome · collections · python
- openclaw/openclaw
211,971 stars
Openclaw is a platform for managing agent execution environments, providing the infrastructure to control agent lifecycles, session state, and workspace persistence. It features a centralized gateway that handles model loops, tool invocation, and streaming events, while supporting multi-agent routing and persistent memory management. The system is designed to normalize tool execution signatures and provide a standardized interface for cross-provider compatibility. The platform includes extensive developer tooling, such as a command-line interface for workspace management, diagnostic logging, and a plugin architecture that allows for the registration of custom tools and capabilities. It supports automated workflows through event-driven hooks, task scheduling, and integration with external services. Security is managed through execution policies, credential portability, and approval workflows for agent actions. Deployment is supported through automated infrastructure installers and containerized gateway helpers, with built-in utilities for backups and configuration management. The system provides a structured format for orchestrating multi-step workflows and includes specialized tools for browser automation and structured code patching.
TypeScript · ai · assistant · crustacean
- Significant-Gravitas/AutoGPT
181,891 stars
AutoGPT is an orchestration platform designed for building, managing, and deploying autonomous agents. It provides a visual canvas-based environment where users can assemble agents by connecting modular blocks that represent actions, data flows, and conditional logic. The platform supports the entire agent lifecycle, including task scheduling, execution monitoring, and configuration management, while offering a marketplace for discovering and sharing community-built workflows. The project includes a legacy framework for command-line agent execution and an extensible component system for developers to build custom agent capabilities. These tools allow for the integration of various language models, web search utilities, and external services such as database management, productivity platforms, and software development tools. Users can deploy the platform locally using provided installation scripts and containerization utilities or utilize the managed cloud environment.
Python · ai · artificial-intelligence · autonomous-agents
- f/prompts.chat
145,637 stars
Prompts.chat is a community-driven repository and management platform for AI prompts and agent skills. It provides a centralized interface for users to search, retrieve, and save prompts, while offering structured storage for multi-file agent skills that include documentation and supporting assets. The platform distinguishes itself through a Model Context Protocol-first API and standard REST endpoints, enabling direct integration with AI assistants, IDEs, and external automation tools. It includes generative AI capabilities to transform basic prompts into structured versions and supports granular access control through key-based and OAuth authentication. Beyond core management, the platform offers developer-focused tooling, including command-line interfaces and editor plugins to incorporate prompt workflows into software development. It also features an interactive, game-based learning environment for AI communication and provides comprehensive configuration options for white-label deployments, custom branding, and external object storage.
HTML · ai · artificial-intelligence · awesome-list
- ripienaar/free-for-dev
118,073 stars
This project is a community-maintained directory of technical resources, tools, and services that offer free tiers for developers. It serves as a centralized reference point for discovering infrastructure, software, and educational materials, helping individuals and teams minimize operational costs while building and scaling applications. The directory distinguishes itself through a collaborative, community-driven curation model that aggregates metadata about third-party services. By utilizing a hierarchical taxonomy and storing all content in version-controlled, plain-text files, the project ensures that resource discovery remains decoupled from the underlying service infrastructure, facilitating transparent and frequent updates from the community. The collection covers a broad spectrum of the software development lifecycle, including cloud infrastructure, development toolchains, security, and frontend design utilities. It provides access to managed services for identity management, continuous integration, monitoring, and data processing, enabling rapid prototyping and the integration of external APIs without the need for extensive custom backend development. The entire directory is maintained as a static, open-source repository, allowing users to browse and contribute to the index through standard version control workflows.
HTML · awesome-list · free-for-developers
- Anduin2017/HowToCook
98,028 stars
HowToCook is a structured culinary knowledge base and computational engine designed for the management and scaling of instructional cooking content. It provides a framework for organizing technical preparation procedures and ingredient data, allowing users to maintain consistent culinary standards across various meal scales. The platform distinguishes itself through a scalable recipe engine that programmatically adjusts ingredient quantities and procedural steps based on specific serving requirements. It utilizes a modular approach to documentation, breaking down complex cooking methods into discrete, reusable steps that support precise execution regardless of the preparation technique. The system includes a search-indexed retrieval interface for querying centralized culinary databases and supports full self-hosting. By deploying the application within a self-managed server environment, users maintain independent control over their data storage, service availability, and the delivery of instructional resources.
Dockerfile · chinese · cookbook · cooking
- supabase/supabase
97,908 stars
This project provides an integrated backend platform built around a relational database. It automatically generates REST and GraphQL APIs from database schemas, allowing for direct data interaction through standard requests and client libraries. The platform includes a comprehensive authentication system that manages user identity, session handling, and fine-grained access control through database-native row-level security policies. Beyond core data management, the platform offers specialized services for object storage, vector data processing for semantic search, and real-time communication features like broadcast messaging and database change subscriptions. It also supports server-side logic execution through globally distributed edge functions, database-resident functions, and a native job scheduler for automated tasks. Developers can manage the entire project lifecycle using a command-line interface and containerized local development environments. The platform supports both managed cloud services and self-hosted deployments, providing options for infrastructure control and data sovereignty.
TypeScript · ai · alternative · auth
- firecrawl/firecrawl
84,034 stars
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live web research, interact with pages, and execute multi-step navigation tasks. It supports distributed crawling infrastructure, enabling users to scale data collection across multiple nodes while managing concurrency and long-running jobs through asynchronous queueing. The system also integrates with agentic frameworks via standardized protocols, allowing for seamless connection to AI-powered clients and automated pipelines. Beyond its core extraction capabilities, the project provides a suite of developer tools for site mapping, batch scraping, and web searching. It includes features for stateful session persistence, webhook-based notifications, and configurable crawl depth, allowing for granular control over how information is retrieved and processed. The project offers comprehensive API documentation and SDKs to facilitate integration into backend services and local development environments. Users can deploy the crawling infrastructure within their own private networks or utilize managed cloud services.
TypeScript · ai · ai-agents · ai-crawler
- macrozheng/mall
82,926 stars
This project is an enterprise-grade Java framework designed for building scalable, full-stack e-commerce applications. It provides a comprehensive foundation for microservice-based distributed architectures, enabling the development of complex retail platforms that include product management, order processing, and secure user authentication. By leveraging modular service patterns and centralized API gateways, the framework supports the construction of resilient systems that decompose monolithic business logic into independent, manageable services. The platform distinguishes itself through a robust suite of infrastructure and operational tools that facilitate high-scale deployments. It features integrated support for container-orchestrated environments, event-driven message brokering, and centralized security via token-based authentication. To ensure operational visibility, the framework includes a centralized log aggregation pipeline, real-time health monitoring, and distributed system observability, allowing teams to maintain stability across complex service boundaries. Beyond its core architecture, the platform offers extensive developer tooling and data management capabilities. It supports advanced database operations, including read-write splitting, query routing, and data synchronization, alongside integration with distributed search engines and object storage systems. The development environment is further enhanced by utilities for code quality enforcement, automated entity generation, dependency management, and architectural visualization, providing a complete ecosystem for the lifecycle of enterprise-grade web applications.
Java · docker · elasticsearch · elk
- bregman-arie/devops-exercises
81,169 stars
This project is a comprehensive educational curriculum designed to build proficiency across modern infrastructure, cloud-native technologies, and systems administration. It functions as a reference library and interview preparation resource, offering a structured collection of conceptual questions, practical coding challenges, and hands-on scenarios that cover the full spectrum of software delivery and operational workflows. The repository distinguishes itself through a modular, domain-specific structure that links instructional problem statements with verified implementation examples. By employing a standardized documentation schema, it provides a predictable learning path for mastering complex technical concepts, ranging from infrastructure-as-code patterns and container orchestration to cloud platform administration and security best practices. The content spans a wide array of technical domains, including automated configuration management, distributed system monitoring, database operations, and version control. It provides deep dives into specific tooling for cloud provisioning, container networking, and service deployment, ensuring that learners can validate their technical skills through isolated, practical exercises. All instructional materials are organized into a unified taxonomy of markdown-based documents, allowing users to navigate and study specific technical topics at their own pace.
Python · ansible · aws · azure
- DopplerHQ/awesome-interview-questions
81,035 stars
This project is a comprehensive, community-sourced repository of technical interview questions and study materials. It serves as a centralized index for software engineers to prepare for technical assessments, benchmark their personal knowledge, and identify gaps in their expertise across a wide range of programming languages, frameworks, and infrastructure domains. The collection distinguishes itself by aggregating high-quality educational resources and coding challenges that span the entire software development lifecycle. It covers diverse technical areas including algorithms, data structures, design patterns, and system-specific topics such as database technologies, networking, and operating systems. By organizing these materials into a structured directory, the project facilitates professional development and helps candidates evaluate their proficiency for hiring processes.
android-interview-questions · angularjs-interview-questions · awesome
- junegunn/fzf
77,987 stars
This project is a general-purpose command-line filter that provides an interactive interface for processing standard input streams. It enables real-time fuzzy searching, data selection, and transformation, allowing users to navigate complex information or file systems directly within their terminal. By utilizing a pipe-oriented architecture, it integrates into existing shell pipelines and workflows to facilitate efficient data exploration. What distinguishes this tool is its highly extensible, event-driven design that allows for deep integration with external processes. It supports asynchronous data transformation and dynamic list reloading, enabling users to trigger shell commands or update content based on user interactions without blocking the interface. The system maintains selection identity across these updates, providing a consistent experience when managing large or streaming datasets. The project offers a comprehensive suite of features for terminal user interface development, including multi-threaded search performance, configurable preview windows, and support for various terminal multiplexers. It provides extensive customization options for visual layout, key bindings, and search logic, allowing developers to build custom selection interfaces or automate complex shell tasks. The tool is configured through environment variables and configuration files, supporting inline comments for maintainability. It is designed to be installed as a standalone command-line utility, with library integration options available for embedding its filtering capabilities into other applications.
Go · bash · cli · fish
- elastic/elasticsearch
76,163 stars
Elasticsearch is a distributed search engine and document store designed for the high-performance indexing and retrieval of massive volumes of unstructured data. It functions as a centralized analytics platform, providing a schema-flexible architecture that organizes information into searchable indices while maintaining global cluster state through a distributed consensus mechanism. The platform distinguishes itself through its integrated approach to observability, security, and advanced analytics. It combines full-text, vector, and hybrid search capabilities with machine learning-driven insights, allowing users to perform complex statistical aggregations, geospatial analysis, and automated anomaly detection. Its storage architecture supports multi-tier data lifecycles, enabling efficient data placement across hot, warm, and cold nodes to balance performance with long-term retention requirements. Beyond core search and storage, the system provides comprehensive observability tools for centralized log analysis, application performance monitoring, and infrastructure health diagnostics. It includes built-in security operations for threat detection and endpoint protection, all managed through a unified RESTful API gateway. The system is accessible via standardized REST APIs for cluster management, data ingestion, and query execution. Extensive documentation is available to guide users through API references for search, indexing, security, and cluster administration.
Java · elasticsearch · java · search-engine
- infiniflow/ragflow
73,425 stars
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations. The platform distinguishes itself through deep document understanding and sophisticated knowledge orchestration. It supports complex document parsing, including the extraction of tables and images, and utilizes graph-based indexing to enhance reasoning over large document collections. Users can configure multiple recall strategies and fused re-ranking to optimize retrieval accuracy, while the system maintains context through multi-turn dialogue management and flexible tool-use frameworks. The architecture is built on a modular, containerized microservice foundation that supports both local inference engines and external language model APIs. It includes asynchronous task processing for document ingestion and indexing, ensuring system responsiveness during heavy workloads. The platform also provides a standardized interface for model abstraction, allowing for seamless integration with existing language model ecosystems. Developers can interact with the platform through a comprehensive suite of RESTful endpoints and Python client libraries, which cover the full lifecycle of agents, datasets, and knowledge graphs. The system is designed for flexible deployment, offering configurable environment settings and support for custom containerized environments to facilitate local development and infrastructure portability.
Python · agent · agentic · agentic-ai
- redis/redis
73,096 stars
Redis is an in-memory, key-value database designed to provide sub-millisecond latency for read and write operations. It functions as a versatile data platform, serving as a distributed cache, a message broker, a NoSQL document store, and a vector database. The system utilizes an event-driven, single-threaded loop to process requests efficiently, while maintaining data durability through append-only persistence logs and asynchronous snapshotting mechanisms. What distinguishes Redis is its ability to handle complex data structures (including strings, hashes, lists, sets, and sorted sets) alongside hierarchical JSON documents and high-dimensional vector embeddings. It supports advanced operational patterns such as active-active database deployment for global distribution, real-time data streaming, and probabilistic statistics for large-scale data analysis. These capabilities are complemented by a pluggable indexing engine that enables semantic similarity matching and full-text retrieval. The platform offers a comprehensive ecosystem for managing distributed state, including master-replica replication, automated cluster management, and granular security controls like access control lists and TLS encryption. Developers can interact with the database through language-specific client libraries that support connection multiplexing and object mapping, or via a command-line interface for direct administrative tasks and scripting. Redis is deployed through standard package managers and supports both self-managed clusters and managed cloud instances. Observability is provided through integrated tools for performance analysis, slow log monitoring, and bulk data management.
C · cache · caching · database
- awesomedata/awesome-public-datasets
72,846 stars
This project is a community-maintained, open-access directory of high-quality public datasets. It serves as a centralized reference point for researchers, developers, and data scientists to locate reliable information sources across a wide spectrum of industries and scientific fields. By providing a structured index, the repository facilitates the discovery of data necessary for exploratory analysis, machine learning model training, and the development of data-intensive applications. The directory distinguishes itself through a lightweight, platform-agnostic approach to resource indexing that avoids the need for complex backend infrastructure. Content is organized using a topic-centric hierarchical taxonomy, which simplifies navigation across diverse domains ranging from climate science and economics to healthcare and computer networks. This structure is maintained through a collaborative, community-driven model where peer review and version-controlled updates ensure the ongoing accuracy and relevance of the curated links. The collection covers a broad capability surface, including specialized datasets for fields such as physics, geographic information systems, natural language processing, and time-series analysis. The repository is documented entirely through human-readable markdown files, allowing for transparent contributions and easy access to its comprehensive index of public information.
aaron-swartz · awesome-public-datasets · datasets
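
The junegunn/fzf entry above describes real-time fuzzy searching over lines read from standard input. As a rough illustration of that idea only, the sketch below implements plain subsequence matching in Python. It is not fzf's actual algorithm, which also scores and ranks matches; the function names `fuzzy_match` and `fuzzy_filter` and the sample file list are hypothetical.

```python
def fuzzy_match(query: str, candidate: str) -> bool:
    """Return True if every character of query occurs in candidate, in order."""
    # Searching a shared iterator consumes it, so each query character
    # must be found after the position of the previous match.
    chars = iter(candidate.lower())
    return all(ch in chars for ch in query.lower())


def fuzzy_filter(query: str, lines: list[str]) -> list[str]:
    """Keep matching lines, shortest first (a crude stand-in for real ranking)."""
    matches = [line for line in lines if fuzzy_match(query, line)]
    return sorted(matches, key=len)


if __name__ == "__main__":
    # Hypothetical input; a real filter would read lines from stdin.
    files = ["src/main.go", "README.md", "src/matcher.go", "docs/manual.txt"]
    print(fuzzy_filter("smg", files))
```

Sorting by length is only a placeholder for ranking; a production filter would typically weigh match positions and gaps as well.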