sclass WebScraper {
  S baseURL;
  new Set<S> urlsSeen;
  new LinkedHashSet<S> linksToFollow;
  Int maxPages; // includes cached pages

  *(S baseAndStartURL) { this(baseAndStartURL, baseAndStartURL); }

  *(S *baseURL, S startURL) { addLink(startURL); }

  void addLinks(Iterable<S> urls) {
    for (S url : urls) addLink(url);
  }

  void addLink(S url) {
    if (!urlsSeen.contains(url) && startsWith(url, baseURL))
      linksToFollow.add(url);
  }

  bool step() {
    ping();
    if (maxPages != null && l(urlsSeen) >= maxPages)
      ret false with print("Maximum number of pages reached: " + maxPages + ". Queue size: " + l(linksToFollow));
    if (empty(linksToFollow)) false;
    _loadURL(popFirst(linksToFollow));
    true;
  }

  void _loadURL(S url) {
    urlsSeen.add(url);
    addLinks(pairsA(webScraper_getLinks(url)));
  }

  run {
    while (step())
      print("URLs checked: " + l(urlsSeen) + ", queue size: " + l(linksToFollow));
    print("Scraping done. " + n2(l(urlsSeen), "URL") + " checked.");
  }
}
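
For readers not familiar with JavaX shorthand, here is a rough plain-Java sketch of the same crawl loop as the WebScraper class above. It is not the transpiled output. The class name SimpleCrawler and the injected linkFetcher function are hypothetical: linkFetcher stands in for webScraper_getLinks, whose implementation is not shown here, so the example runs against an in-memory "site" rather than the network. The ping() call and the progress printing are left out.

```java
import java.util.*;
import java.util.function.Function;

class SimpleCrawler {
    final String baseURL;
    final Set<String> urlsSeen = new HashSet<>();
    // LinkedHashSet keeps insertion order and drops duplicate queue entries
    final LinkedHashSet<String> linksToFollow = new LinkedHashSet<>();
    Integer maxPages; // null means no limit
    // Hypothetical stand-in for webScraper_getLinks: maps a URL to the links found on it
    final Function<String, List<String>> linkFetcher;

    SimpleCrawler(String baseURL, String startURL, Function<String, List<String>> linkFetcher) {
        this.baseURL = baseURL;
        this.linkFetcher = linkFetcher;
        addLink(startURL);
    }

    // Queue a URL only if it is unvisited and lies under baseURL
    void addLink(String url) {
        if (url != null && !urlsSeen.contains(url) && url.startsWith(baseURL))
            linksToFollow.add(url);
    }

    // Process one queued URL; returns false when the limit is hit or the queue is empty
    boolean step() {
        if (maxPages != null && urlsSeen.size() >= maxPages) return false;
        if (linksToFollow.isEmpty()) return false;
        Iterator<String> it = linksToFollow.iterator();
        String url = it.next();
        it.remove(); // equivalent of popFirst on the queue
        urlsSeen.add(url);
        for (String link : linkFetcher.apply(url)) addLink(link);
        return true;
    }

    void run() {
        while (step()) { /* keep going until step() reports completion */ }
    }

    public static void main(String[] args) {
        // Tiny in-memory site used instead of real HTTP requests
        Map<String, List<String>> site = new HashMap<>();
        site.put("http://example.com/", Arrays.asList("http://example.com/a", "http://other.org/x"));
        site.put("http://example.com/a", Arrays.asList("http://example.com/", "http://example.com/b"));

        SimpleCrawler crawler = new SimpleCrawler("http://example.com/", "http://example.com/",
            url -> site.getOrDefault(url, Collections.emptyList()));
        crawler.run();
        System.out.println("URLs checked: " + crawler.urlsSeen.size()); // prints "URLs checked: 3"
    }
}
```

Passing the link extractor in as a function keeps the queue logic testable in isolation. The off-site link to other.org is never queued because it fails the baseURL prefix check.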
Snippet ID: | #1023835 |
Snippet name: | WebScraper |
Eternal ID of this version: | #1023835/10 |
Text MD5: | 24cb860524028c47729c2a1985dc7bfa |
Transpilation MD5: | 3962b7fd5f86fbfe14276c5b2dfc9e20 |
Author: | stefan |
Category: | javax / html parsing |
Type: | JavaX fragment (include) |
Public (visible to everyone): | Yes |
Archived (hidden from active list): | No |
Created/modified: | 2019-07-10 13:10:03 |
Source code size: | 1081 bytes / 40 lines |
Pitched / IR pitched: | No / No |
Views / Downloads: | 246 / 697 |
Version history: | 9 change(s) |
Referenced in: | #1034167 - Standard Classes + Interfaces (LIVE, continuation of #1003674) |