Search Engine (Crawler + Indexer + Ranking)
A mini search engine built from scratch that demonstrates how real-world search systems work, including web crawling, indexing, ranking, and query optimization with caching.
Links
Tech Stack
- Backend: Node.js, Express
- Database: PostgreSQL
- Queue & Workers: Redis, BullMQ
- Caching: Redis
- Frontend: React.js
- Architecture: Monolithic backend with background workers
Features
Web Crawler
- Built a scalable crawler using BullMQ workers
- Extracts links from web pages and processes them asynchronously
- Handles deduplication and prevents infinite crawling using:
  - max page limits
  - depth control
  - visited URL tracking
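The three safeguards above can be illustrated with a small in-memory sketch. It is not the project's BullMQ worker: the queue is a plain array, and `fetchLinks` is a hypothetical stand-in for "fetch a page and extract its links". The option names `maxPages` and `maxDepth` are also assumptions.

```javascript
// Minimal in-memory sketch of crawl limits; not the project's actual worker.
// `fetchLinks` is a hypothetical stand-in for "fetch page and extract links".
function crawl(seedUrls, fetchLinks, { maxPages = 100, maxDepth = 2 } = {}) {
  const visited = new Set();            // visited URL tracking (deduplication)
  const queue = seedUrls.map((url) => ({ url, depth: 0 }));
  const crawled = [];

  while (queue.length > 0 && crawled.length < maxPages) { // max page limit
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue;     // skip URLs already processed
    visited.add(url);
    crawled.push(url);

    if (depth >= maxDepth) continue;    // depth control: do not expand further
    for (const link of fetchLinks(url)) {
      if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
    }
  }
  return crawled;
}

// Usage with a tiny fake link graph that contains a cycle (a -> b -> a).
const graph = { a: ["b", "c"], b: ["a", "d"], c: [], d: ["e"], e: [] };
const pages = crawl(["a"], (url) => graph[url] ?? [], { maxPages: 10, maxDepth: 2 });
console.log(pages); // [ 'a', 'b', 'c', 'd' ]
```

Page `e` is never crawled because it sits at depth 3, and the cycle between `a` and `b` ends as soon as `a` is found in the visited set. With real workers the same checks would presumably run before a job is enqueued, with the visited set kept in shared storage rather than in process memory.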
Indexing Engine
- Designed an inverted index for fast lookups
- Processes documents to:
  - clean and normalize text
  - tokenize content
  - compute term frequency (TF)
- Stores term → document mappings for efficient querying
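As a rough illustration of the indexing steps above, the sketch below builds an in-memory inverted index. The normalization rules (lowercasing, stripping non-alphanumeric characters), the use of raw counts as TF, and the function names are all assumptions. The project stores its index in PostgreSQL, which is not shown here.

```javascript
// Sketch only: the normalization rules below are assumptions.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")   // replace punctuation with spaces
    .split(/\s+/)
    .filter(Boolean);               // drop empty tokens
}

// Raw term frequency: how many times each term occurs in one document.
function termFrequencies(tokens) {
  const tf = new Map();
  for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
  return tf;
}

// Inverted index: term -> Map(docId -> term frequency).
function buildIndex(docs) {
  const index = new Map();
  for (const { id, text } of docs) {
    for (const [term, count] of termFrequencies(tokenize(text))) {
      if (!index.has(term)) index.set(term, new Map());
      index.get(term).set(id, count);
    }
  }
  return index;
}

const index = buildIndex([
  { id: 1, text: "Redis is fast. Redis caches results!" },
  { id: 2, text: "PostgreSQL stores documents" },
]);
console.log(index.get("redis").get(1)); // 2
```

Looking up a term returns only the documents that contain it, which is what makes query-time lookups cheap compared with scanning every document.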
Search & Ranking
- Implements term-frequency-based ranking
- Ranks documents by relevance to the query
- Supports:
  - multi-keyword queries
  - result sorting by score
  - pagination
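One way the ranking described above could work is sketched below. How the project combines scores across keywords is not stated, so summing raw term frequencies per document is an assumption, as are the `page` and `pageSize` parameter names and the shape of the returned object.

```javascript
// Sketch: score = sum of raw term frequencies across query terms (an assumption).
function search(index, query, { page = 1, pageSize = 10 } = {}) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const scores = new Map(); // docId -> accumulated score

  for (const term of terms) {
    const postings = index.get(term); // Map(docId -> tf) or undefined
    if (!postings) continue;
    for (const [docId, tf] of postings) {
      scores.set(docId, (scores.get(docId) ?? 0) + tf);
    }
  }

  const ranked = [...scores.entries()]
    .map(([docId, score]) => ({ docId, score }))
    .sort((a, b) => b.score - a.score);       // highest score first

  const start = (page - 1) * pageSize;        // pagination window
  return { total: ranked.length, results: ranked.slice(start, start + pageSize) };
}

// A hand-built index: term -> Map(docId -> term frequency).
const index = new Map([
  ["redis", new Map([[1, 2], [3, 1]])],
  ["cache", new Map([[1, 1], [2, 4]])],
]);
const out = search(index, "redis cache", { page: 1, pageSize: 2 });
console.log(out.results.map((r) => r.docId)); // [ 2, 1 ]
```

Here document 2 scores 4, document 1 scores 3, and document 3 scores 1, so the first page of two results holds documents 2 and 1, and document 3 falls on page 2.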
Redis Caching
- Integrated Redis caching for search results
- Cache key:
- Reduces latency for repeated queries
- Improved response time by ~60% for repeated queries
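The caching behaviour above follows a cache-aside pattern, sketched here under several assumptions. `MemoryCache` is an in-memory stand-in with async `get`/`set` so the example runs without a Redis server. The key format in `cacheKey` is hypothetical, since the project's actual cache key is not shown in this README. The 60-second TTL is also an arbitrary choice.

```javascript
// In-memory stand-in exposing async get/set, so the sketch runs without Redis.
class MemoryCache {
  constructor() { this.store = new Map(); }
  async get(key) {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= Date.now()) return null; // miss or expired
    return entry.value;
  }
  async set(key, value, ttlSeconds) {
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

// Hypothetical key format; the project's real cache key is not shown here.
const cacheKey = (query, page) => `search:${query.trim().toLowerCase()}:${page}`;

// Cache-aside: try the cache first, fall back to the index, then store the result.
async function cachedSearch(cache, query, page, runSearch, ttlSeconds = 60) {
  const key = cacheKey(query, page);
  const hit = await cache.get(key);
  if (hit !== null) return { source: "cache", results: JSON.parse(hit) };

  const results = await runSearch(query, page);
  await cache.set(key, JSON.stringify(results), ttlSeconds);
  return { source: "index", results };
}

async function demo() {
  const cache = new MemoryCache();
  let calls = 0; // counts how often the (fake) index search actually runs
  const runSearch = async () => { calls += 1; return [{ docId: 1, score: 3 }]; };
  const first = await cachedSearch(cache, "Redis Cache", 1, runSearch);
  const second = await cachedSearch(cache, " redis cache ", 1, runSearch);
  return { first: first.source, second: second.source, calls };
}
demo().then(console.log); // { first: 'index', second: 'cache', calls: 1 }
```

Normalizing the query before building the key lets queries that differ only in case or surrounding whitespace share one cache entry. Including the page number keeps different pages of the same query from overwriting each other.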
Frontend (React)
- Minimal UI for:
  - search input
  - results display
  - pagination
- Connects to backend via REST APIs
System Architecture
Data Flow
- Seeding
  - Initial URLs are added to the queue
- Crawling
  - Worker fetches HTML → extracts links → saves document
- Indexing
  - Text is cleaned → tokenized → term frequency calculated → stored in inverted index
- Search Query
  - Query → tokenize → lookup index → rank documents → return results
- Caching
  - Results cached in Redis for faster repeated queries
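The "extracts links" part of the Crawling step could be sketched as follows. The regex-based extraction is an assumption made to keep the example dependency-free, and a real crawler would more likely use an HTML parser. The function name `extractLinks` and the choice to drop URL fragments are also illustrative.

```javascript
// Regex-based extraction is an assumption made to keep this dependency-free;
// a real crawler would more likely use an HTML parser.
function extractLinks(html, baseUrl) {
  const links = new Set(); // a Set removes duplicate URLs within one page
  const hrefPattern = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;

  for (const match of html.matchAll(hrefPattern)) {
    try {
      const url = new URL(match[1], baseUrl); // resolve relative links
      if (url.protocol !== "http:" && url.protocol !== "https:") continue;
      url.hash = "";                          // drop fragments so /docs#a and /docs dedupe
      links.add(url.href);
    } catch {
      // ignore hrefs that cannot be parsed as URLs
    }
  }
  return [...links];
}

const html = `
  <a href="/docs">Docs</a>
  <a class="x" href="https://example.com/docs#intro">Intro</a>
  <a href="mailto:me@example.com">Mail</a>
`;
console.log(extractLinks(html, "https://example.com/index.html"));
// [ 'https://example.com/docs' ]
```

Resolving every link against the page URL and stripping fragments before queueing means the visited-URL check compares canonical strings, so the same page is less likely to be crawled twice under different spellings.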
API Endpoints
Search
Example Response
Key Learnings
- Built a queue-based ingestion pipeline using BullMQ
- Implemented inverted indexing and ranking logic
- Designed a fault-tolerant crawler with limits and deduplication
- Used Redis for caching and performance optimization
- Understood real-world search engine architecture