
WebCrawler

Advanced Web Scraping & Data Extraction Tool

A high-performance, multithreaded web crawler built in C++ with advanced graph analysis capabilities. Designed for scalable data extraction with intelligent rate limiting and robust error handling.

View on GitHub

Project Overview

WebCrawler is an efficient web scraping tool designed to navigate websites and extract data quickly and accurately. Built with modern C++, it handles complex crawling scenarios while respecting robots.txt and applying configurable rate limiting.
High Performance
Multithreaded processing and intelligent request management for maximum throughput and efficiency
Respectful Crawling
Built-in robots.txt compliance and configurable rate limiting to respect server resources
Smart Parsing
Advanced HTML parsing with libcurl integration for reliable data extraction (a minimal libcurl fetch is sketched below)
Graph Analysis
Built-in graph analysis capabilities for understanding website structure and relationships
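
The Smart Parsing card above mentions libcurl. As a point of reference, here is a minimal, self-contained sketch of fetching a page body with libcurl's easy interface. It illustrates the underlying library call only; it is not the project's HttpClient and the URL is a placeholder.

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // libcurl write callback: append received bytes to a std::string.
    static size_t writeToString(char* ptr, size_t size, size_t nmemb, void* userdata) {
        auto* out = static_cast<std::string*>(userdata);
        out->append(ptr, size * nmemb);
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* handle = curl_easy_init();
        std::string body;
        if (handle) {
            curl_easy_setopt(handle, CURLOPT_URL, "https://example.com/");
            curl_easy_setopt(handle, CURLOPT_FOLLOWLOCATION, 1L);
            curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, writeToString);
            curl_easy_setopt(handle, CURLOPT_WRITEDATA, &body);
            if (curl_easy_perform(handle) == CURLE_OK) {
                std::cout << "Fetched " << body.size() << " bytes\n";
            }
            curl_easy_cleanup(handle);
        }
        curl_global_cleanup();
        return 0;
    }

Once the raw HTML is in memory like this, the crawler's parsing layer can extract links and content from it.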

Code Showcase

Here's a glimpse of the core WebCrawler implementation with configuration and URL management:
// WebCrawler constructor with configuration setup
WebCrawler::WebCrawler(const CrawlerConfig& config)
    : config_(config), httpClient_(config.httpConfig) {
    // Extract allowed domains from seed URLs
    for (const auto& seedUrl : config.seedUrls) {
        auto parsed = UrlParser::parse(seedUrl);
        if (parsed.valid) {
            allowedDomains_.insert(parsed.host);
        }
    }
}

void WebCrawler::addSeedUrl(const std::string& url) {
    auto parsed = UrlParser::parse(url);
    if (parsed.valid) {
        urlQueue_.push(url);
        allowedDomains_.insert(parsed.host);
    }
}

void WebCrawler::setCrawlCallback(CrawlCallback callback) {
    crawlCallback_ = std::move(callback);
}

void WebCrawler::start() {
    running_ = true;

    // Add seed URLs to queue
    for (const auto& seedUrl : config_.seedUrls) {
        if (visitedUrls_.find(seedUrl) == visitedUrls_.end()) {
            urlQueue_.push(seedUrl);
        }
    }
}
// Configuration structure for flexible crawler setup
struct CrawlerConfig {
    std::vector<std::string> seedUrls;
    int maxDepth = 3;
    int maxPages = 1000;
    std::chrono::milliseconds requestDelay{500};
    HttpConfig httpConfig;
    bool respectRobotsTxt = true;
};

// URL parser for domain extraction and validation
struct ParsedUrl {
    std::string protocol;
    std::string host;
    std::string path;
    bool valid = false;
};
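
As a usage sketch, the pieces above could be wired together roughly as follows. This driver is assembled from the snippets on this page, not taken from the repository, and the CrawlCallback signature is not shown here, so the callback is a generic placeholder lambda; the exact parameters live in the project's headers.

    // #include "WebCrawler.h"   // project header (name assumed)
    #include <chrono>

    int main() {
        CrawlerConfig config;
        config.seedUrls = {"https://example.com/"};
        config.maxDepth = 2;
        config.maxPages = 200;
        config.requestDelay = std::chrono::milliseconds(1000);
        config.respectRobotsTxt = true;

        WebCrawler crawler(config);

        // CrawlCallback's exact signature is defined by the project;
        // a generic lambda stands in for it here.
        crawler.setCrawlCallback([](const auto&... pageData) {
            // Handle each crawled page here
        });

        crawler.addSeedUrl("https://example.com/blog/");
        crawler.start();
        return 0;
    }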

Key Features

WebCrawler is built with enterprise-grade features for professional web scraping and data extraction:
Multithreaded Architecture: Built with C++ threading to handle multiple concurrent requests while maintaining thread safety (a minimal worker-queue sketch follows this list)

Smart Rate Limiting: Intelligent delays and request throttling to avoid overwhelming target servers and respect robots.txt

Robust Error Handling: Comprehensive error handling and timeout mechanisms for reliable crawling operations

Graph Analysis: Built-in graph structure analysis to understand website hierarchy and link relationships (a link-graph sketch also follows this list)

Memory Management: Efficient memory usage with smart pointers and RAII principles for leak-free operation
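
The first two items describe worker threads pulling URLs from a shared queue with per-request delays. Below is a minimal sketch of that pattern using only the standard library; the class and function names are illustrative, not the project's actual types.

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Minimal thread-safe URL queue: workers block until a URL arrives or the queue is closed.
    class UrlQueue {
    public:
        void push(std::string url) {
            { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(url)); }
            cv_.notify_one();
        }
        bool pop(std::string& url) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !queue_.empty() || closed_; });
            if (queue_.empty()) return false;   // closed and fully drained
            url = std::move(queue_.front());
            queue_.pop();
            return true;
        }
        void close() {
            { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
            cv_.notify_all();
        }
    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        std::queue<std::string> queue_;
        bool closed_ = false;
    };

    // Worker loop: pop a URL, fetch and parse it (omitted), then sleep for the
    // configured delay so the target server is not hammered (simple rate limit).
    void crawlWorker(UrlQueue& queue, std::chrono::milliseconds requestDelay) {
        std::string url;
        while (queue.pop(url)) {
            // fetchAndParse(url);   // HTTP request + HTML parsing would go here
            std::this_thread::sleep_for(requestDelay);
        }
    }

A condition variable lets idle workers sleep instead of spinning, and close() wakes them all so a crawl can shut down cleanly.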
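
For the graph analysis feature, one common representation is an adjacency list keyed by URL. The hypothetical LinkGraph below records page-to-page links and computes in-degrees as a rough measure of how prominently a page is linked within the site; it is a sketch under that assumption, not the project's actual implementation.

    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    // Link graph: pages are nodes, hyperlinks are directed edges.
    class LinkGraph {
    public:
        void addEdge(const std::string& from, const std::string& to) {
            adjacency_[from].insert(to);
            adjacency_.try_emplace(to);   // ensure the target node exists even if it has no outlinks yet
        }

        // In-degree per page: how many crawled pages link to each URL.
        std::unordered_map<std::string, int> inDegrees() const {
            std::unordered_map<std::string, int> degrees;
            for (const auto& [page, links] : adjacency_) {
                degrees.try_emplace(page, 0);
                for (const auto& target : links) {
                    ++degrees[target];
                }
            }
            return degrees;
        }

    private:
        std::unordered_map<std::string, std::unordered_set<std::string>> adjacency_;
    };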