
WebCrawler

Advanced Web Scraping & Data Extraction Tool

A high-performance, multithreaded web crawler built in C++ with advanced graph analysis capabilities. Designed for scalable data extraction with intelligent rate limiting and robust error handling.

View on GitHub

Project Overview

WebCrawler is an efficient web scraping tool designed to navigate websites and extract data quickly and accurately. Built with modern C++, it handles complex crawling scenarios while respecting robots.txt and applying configurable rate limiting.
High Performance
Multithreaded processing and intelligent request management for maximum throughput and efficiency
Respectful Crawling
Built-in robots.txt compliance and configurable rate limiting to respect server resources
Smart Parsing
Advanced HTML parsing with libcurl integration for reliable data extraction (a minimal libcurl fetch is sketched below)
Graph Analysis
Built-in graph analysis capabilities for understanding website structure and relationships
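
The Smart Parsing card above mentions libcurl. As a point of reference, here is a minimal, self-contained sketch of fetching a page body with libcurl's easy interface. It illustrates the underlying library call only; it is not the project's HttpClient and the URL is a placeholder.

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // libcurl write callback: append received bytes to a std::string.
    static size_t writeToString(char* ptr, size_t size, size_t nmemb, void* userdata) {
        auto* out = static_cast<std::string*>(userdata);
        out->append(ptr, size * nmemb);
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* handle = curl_easy_init();
        std::string body;
        if (handle) {
            curl_easy_setopt(handle, CURLOPT_URL, "https://example.com/");
            curl_easy_setopt(handle, CURLOPT_FOLLOWLOCATION, 1L);
            curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, writeToString);
            curl_easy_setopt(handle, CURLOPT_WRITEDATA, &body);
            if (curl_easy_perform(handle) == CURLE_OK) {
                std::cout << "Fetched " << body.size() << " bytes\n";
            }
            curl_easy_cleanup(handle);
        }
        curl_global_cleanup();
        return 0;
    }

Once the raw HTML is in memory like this, the crawler's parsing layer can extract links and content from it.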

Code Showcase

Here's a glimpse of the core WebCrawler implementation with configuration and URL management:
// WebCrawler constructor with configuration setup
WebCrawler::WebCrawler(const CrawlerConfig& config)
    : config_(config), httpClient_(config.httpConfig) {
    // Extract allowed domains from seed URLs
    for (const auto& seedUrl : config.seedUrls) {
        auto parsed = UrlParser::parse(seedUrl);
        if (parsed.valid) {
            allowedDomains_.insert(parsed.host);
        }
    }
}

void WebCrawler::addSeedUrl(const std::string& url) {
    auto parsed = UrlParser::parse(url);
    if (parsed.valid) {
        urlQueue_.push(url);
        allowedDomains_.insert(parsed.host);
    }
}

void WebCrawler::setCrawlCallback(CrawlCallback callback) {
    crawlCallback_ = std::move(callback);
}

void WebCrawler::start() {
    running_ = true;

    // Add seed URLs to queue
    for (const auto& seedUrl : config_.seedUrls) {
        if (visitedUrls_.find(seedUrl) == visitedUrls_.end()) {
            urlQueue_.push(seedUrl);
        }
    }
}
// Configuration structure for flexible crawler setup
struct CrawlerConfig {
    std::vector<std::string> seedUrls;
    int maxDepth = 3;
    int maxPages = 1000;
    std::chrono::milliseconds requestDelay{500};
    HttpConfig httpConfig;
    bool respectRobotsTxt = true;
};

// URL parser for domain extraction and validation
struct ParsedUrl {
    std::string protocol;
    std::string host;
    std::string path;
    bool valid = false;
};
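
As a usage sketch, the pieces above could be wired together roughly as follows. This driver is assembled from the snippets on this page, not taken from the repository, and the CrawlCallback signature is not shown here, so the callback is a generic placeholder lambda; the exact parameters live in the project's headers.

    // #include "WebCrawler.h"   // project header (name assumed)
    #include <chrono>

    int main() {
        CrawlerConfig config;
        config.seedUrls = {"https://example.com/"};
        config.maxDepth = 2;
        config.maxPages = 200;
        config.requestDelay = std::chrono::milliseconds(1000);
        config.respectRobotsTxt = true;

        WebCrawler crawler(config);

        // CrawlCallback's exact signature is defined by the project;
        // a generic lambda stands in for it here.
        crawler.setCrawlCallback([](const auto&... pageData) {
            // Handle each crawled page here
        });

        crawler.addSeedUrl("https://example.com/blog/");
        crawler.start();
        return 0;
    }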

Key Features

WebCrawler is built with enterprise-grade features for professional web scraping and data extraction:
Multithreaded Architecture: Built with C++ threading to handle multiple concurrent requests while maintaining thread safety (a minimal worker-queue sketch follows this list)

Smart Rate Limiting: Intelligent delays and request throttling to avoid overwhelming target servers and respect robots.txt

Robust Error Handling: Comprehensive error handling and timeout mechanisms for reliable crawling operations

Graph Analysis: Built-in graph structure analysis to understand website hierarchy and link relationships (a link-graph sketch also follows this list)

Memory Management: Efficient memory usage with smart pointers and RAII principles for leak-free operation
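
The first two items describe worker threads pulling URLs from a shared queue with per-request delays. Below is a minimal sketch of that pattern using only the standard library; the class and function names are illustrative, not the project's actual types.

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Minimal thread-safe URL queue: workers block until a URL arrives or the queue is closed.
    class UrlQueue {
    public:
        void push(std::string url) {
            { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(url)); }
            cv_.notify_one();
        }
        bool pop(std::string& url) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !queue_.empty() || closed_; });
            if (queue_.empty()) return false;   // closed and fully drained
            url = std::move(queue_.front());
            queue_.pop();
            return true;
        }
        void close() {
            { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
            cv_.notify_all();
        }
    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        std::queue<std::string> queue_;
        bool closed_ = false;
    };

    // Worker loop: pop a URL, fetch and parse it (omitted), then sleep for the
    // configured delay so the target server is not hammered (simple rate limit).
    void crawlWorker(UrlQueue& queue, std::chrono::milliseconds requestDelay) {
        std::string url;
        while (queue.pop(url)) {
            // fetchAndParse(url);   // HTTP request + HTML parsing would go here
            std::this_thread::sleep_for(requestDelay);
        }
    }

A condition variable lets idle workers sleep instead of spinning, and close() wakes them all so a crawl can shut down cleanly.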
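
For the graph analysis feature, one common representation is an adjacency list keyed by URL. The hypothetical LinkGraph below records page-to-page links and computes in-degrees as a rough measure of how prominently a page is linked within the site; it is a sketch under that assumption, not the project's actual implementation.

    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    // Link graph: pages are nodes, hyperlinks are directed edges.
    class LinkGraph {
    public:
        void addEdge(const std::string& from, const std::string& to) {
            adjacency_[from].insert(to);
            adjacency_.try_emplace(to);   // ensure the target node exists even if it has no outlinks yet
        }

        // In-degree per page: how many crawled pages link to each URL.
        std::unordered_map<std::string, int> inDegrees() const {
            std::unordered_map<std::string, int> degrees;
            for (const auto& [page, links] : adjacency_) {
                degrees.try_emplace(page, 0);
                for (const auto& target : links) {
                    ++degrees[target];
                }
            }
            return degrees;
        }

    private:
        std::unordered_map<std::string, std::unordered_set<std::string>> adjacency_;
    };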