Libraryless. A Pure Java version is available (5444L/37K).
sclass WebScraper {
  S baseURL;
  new Set<S> urlsSeen;
  new LinkedHashSet<S> linksToFollow;
  Int maxPages; // includes cached pages

  *(S baseAndStartURL) { this(baseAndStartURL, baseAndStartURL); }

  *(S *baseURL, S startURL) {
    addLink(startURL);
  }

  void addLinks(Iterable<S> urls) {
    for (S url : urls) addLink(url);
  }

  void addLink(S url) {
    if (!urlsSeen.contains(url) && startsWith(url, baseURL))
      linksToFollow.add(url);
  }

  bool step() {
    ping();
    if (maxPages != null && l(urlsSeen) >= maxPages)
      ret false with print("Maximum number of pages reached: " + maxPages + ". Queue size: " + l(linksToFollow));
    if (empty(linksToFollow)) false;
    _loadURL(popFirst(linksToFollow));
    true;
  }

  void _loadURL(S url) {
    urlsSeen.add(url);
    addLinks(pairsA(webScraper_getLinks(url)));
  }

  run {
    while (step())
      print("URLs checked: " + l(urlsSeen) + ", queue size: " + l(linksToFollow));
    print("Scraping done. " + n2(l(urlsSeen), "URL") + " checked.");
  }
}
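For readers who don't use JavaX, here is a minimal plain-Java sketch of the same crawl loop: a set of seen URLs, an insertion-ordered queue of links, a base-URL prefix filter and an optional page limit. It is not the transpiled output. The class name `FrontierCrawler` and the `linkExtractor` parameter are hypothetical, and `linkExtractor` stands in for `webScraper_getLinks`, whose behavior is not shown in the snippet. The example uses an in-memory link map instead of real HTTP requests so it can run offline.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical plain-Java counterpart of the WebScraper snippet (a sketch, not the original).
class FrontierCrawler {
    final String baseURL;
    final Set<String> urlsSeen = new HashSet<>();
    // LinkedHashSet keeps discovery order and drops duplicates already in the queue.
    final LinkedHashSet<String> linksToFollow = new LinkedHashSet<>();
    Integer maxPages; // null means "no limit"
    // Assumption: maps a page URL to the links found on that page.
    final Function<String, List<String>> linkExtractor;

    FrontierCrawler(String baseURL, String startURL,
                    Function<String, List<String>> linkExtractor) {
        this.baseURL = baseURL;
        this.linkExtractor = linkExtractor;
        addLink(startURL);
    }

    // Queue a URL only if it is unvisited and lies under the base URL.
    void addLink(String url) {
        if (!urlsSeen.contains(url) && url.startsWith(baseURL))
            linksToFollow.add(url);
    }

    // Process one queued URL; returns false when the limit is hit or the queue is empty.
    boolean step() {
        if (maxPages != null && urlsSeen.size() >= maxPages) return false;
        if (linksToFollow.isEmpty()) return false;
        Iterator<String> it = linksToFollow.iterator();
        String url = it.next();
        it.remove(); // pop the oldest queued link
        urlsSeen.add(url);
        for (String link : linkExtractor.apply(url)) addLink(link);
        return true;
    }

    void run() {
        while (step()) { /* keep going until step() reports no more work */ }
    }

    public static void main(String[] args) {
        // Made-up three-page site used purely for illustration.
        Map<String, List<String>> site = new HashMap<>();
        site.put("http://example.test/",
                 List.of("http://example.test/a", "http://other.test/x"));
        site.put("http://example.test/a",
                 List.of("http://example.test/", "http://example.test/b"));

        FrontierCrawler crawler = new FrontierCrawler(
            "http://example.test/", "http://example.test/",
            url -> site.getOrDefault(url, List.of()));
        crawler.run();
        System.out.println(crawler.urlsSeen.size() + " URL(s) checked");
    }
}
```

Passing the link extractor in as a function keeps the queue logic testable without network access. A real extractor would fetch the page and parse its anchors.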
Travelled to 6 computer(s): bhatertpkbcr, mqqgnosmbjvj, pyentgdyhuwx, pzhvpgtvlbxg, tvejysmllsmz, vouqrxazstgt
| Snippet ID: | #1023835 |
| Snippet name: | WebScraper |
| Eternal ID of this version: | #1023835/10 |
| Text MD5: | 24cb860524028c47729c2a1985dc7bfa |
| Transpilation MD5: | 3962b7fd5f86fbfe14276c5b2dfc9e20 |
| Author: | stefan |
| Category: | javax / html parsing |
| Type: | JavaX fragment (include) |
| Public (visible to everyone): | Yes |
| Archived (hidden from active list): | No |
| Created/modified: | 2019-07-10 13:10:03 |
| Source code size: | 1081 bytes / 40 lines |
| Pitched / IR pitched: | No / No |
| Views / Downloads: | 447 / 953 |
| Version history: | 9 change(s) |