
BotCompany Repo | #1023835 // WebScraper

JavaX fragment (include) [tags: use-pretranspiled]

Libraryless. Pure Java version available (5444L/37K).

sclass WebScraper {
  S baseURL;
  new Set<S> urlsSeen;
  new LinkedHashSet<S> linksToFollow;
  Int maxPages; // includes cached pages
  
  *(S baseAndStartURL) {
    this(baseAndStartURL, baseAndStartURL);
  }
  
  *(S *baseURL, S startURL) {
    addLink(startURL);
  }
  
  void addLinks(Iterable<S> urls) { for (S url : urls) addLink(url); }
  void addLink(S url) {
    if (!urlsSeen.contains(url) && startsWith(url, baseURL))
      linksToFollow.add(url);
  }
  
  bool step() {
    ping();
    if (maxPages != null && l(urlsSeen) >= maxPages)
      ret false with print("Maximum number of pages reached: " + maxPages + ". Queue size: " + l(linksToFollow));
    if (empty(linksToFollow)) false;
    _loadURL(popFirst(linksToFollow));
    true;
  }
  
  void _loadURL(S url) {
    urlsSeen.add(url);
    addLinks(pairsA(webScraper_getLinks(url)));
  }
  
  run {
    while (step())
      print("URLs checked: " + l(urlsSeen) + ", queue size: " + l(linksToFollow));
    print("Scraping done. " + n2(l(urlsSeen), "URL") + " checked.");
  }
}
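
For readers who do not use JavaX, the crawl loop above can be sketched in plain Java roughly as follows. This is a minimal sketch under stated assumptions, not the transpiled output: the class name `CrawlFrontierSketch`, the `linkExtractor` parameter (a hypothetical stand-in for `webScraper_getLinks`, whose behaviour is not shown in this snippet), and the in-memory demo site are all invented for illustration. The `ping()` call and the progress printing are left out.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Plain-Java sketch of the crawl loop; all names are illustrative.
class CrawlFrontierSketch {
  final String baseURL;
  final Set<String> urlsSeen = new HashSet<>();
  // LinkedHashSet: no duplicates in the queue, and insertion order is kept.
  final LinkedHashSet<String> linksToFollow = new LinkedHashSet<>();
  Integer maxPages; // null means "no limit"
  // Hypothetical stand-in for webScraper_getLinks: URL -> links found on that page.
  final Function<String, Collection<String>> linkExtractor;

  CrawlFrontierSketch(String baseURL, String startURL,
                      Function<String, Collection<String>> linkExtractor) {
    this.baseURL = baseURL;
    this.linkExtractor = linkExtractor;
    addLink(startURL);
  }

  // Queue a URL only if it is unseen and lies under the base URL.
  void addLink(String url) {
    if (url != null && !urlsSeen.contains(url) && url.startsWith(baseURL))
      linksToFollow.add(url);
  }

  // Process one queued URL; returns false when the crawl should stop.
  boolean step() {
    if (maxPages != null && urlsSeen.size() >= maxPages) return false;
    if (linksToFollow.isEmpty()) return false;
    Iterator<String> it = linksToFollow.iterator();
    String url = it.next(); // oldest queued URL
    it.remove();
    urlsSeen.add(url);
    for (String link : linkExtractor.apply(url)) addLink(link);
    return true;
  }

  void run() {
    while (step()) { /* keep going until the queue is empty or the limit is hit */ }
  }

  // Builds a crawler over a tiny in-memory "site" so no network access is needed.
  static CrawlFrontierSketch demo() {
    Map<String, List<String>> site = Map.of(
      "http://example.com/", List.of("http://example.com/a", "http://example.com/b", "http://other.org/x"),
      "http://example.com/a", List.of("http://example.com/", "http://example.com/b"),
      "http://example.com/b", List.of());
    return new CrawlFrontierSketch("http://example.com/", "http://example.com/",
      url -> site.getOrDefault(url, List.of()));
  }

  public static void main(String[] args) {
    CrawlFrontierSketch crawler = demo();
    crawler.run();
    System.out.println("URLs checked: " + crawler.urlsSeen.size());
  }
}
```

In the demo, the three `example.com` pages are visited once each, and the `other.org` link is never queued because it does not start with the base URL. Setting `maxPages` before calling `run()` stops the loop early and leaves the remaining links in `linksToFollow`.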




Snippet ID: #1023835
Snippet name: WebScraper
Eternal ID of this version: #1023835/10
Text MD5: 24cb860524028c47729c2a1985dc7bfa
Transpilation MD5: 3962b7fd5f86fbfe14276c5b2dfc9e20
Author: stefan
Category: javax / html parsing
Type: JavaX fragment (include)
Public (visible to everyone): Yes
Archived (hidden from active list): No
Created/modified: 2019-07-10 13:10:03
Source code size: 1081 bytes / 40 lines
Pitched / IR pitched: No / No
Views / Downloads: 177 / 605
Version history: 9 change(s)