
BotCompany Repo | #1014153 // indexSimpleWikipedia

JavaX fragment (include)

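// One index entry: page title plus the byte offset and byte length of its <page> block in the dump file.
srecord IndexedWikiPage(S title, long start, int len) {}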

please include function iteratorFromFunction.
  
static IterableIterator<IndexedWikiPage> indexSimpleWikipedia() {
  File f = unpackSimpleWikipedia();
  final ByteCountingLineReader reader = new(bufferedFileInputStream(f, 1024*1024));
  
  ret main.<IndexedWikiPage> iteratorFromFunction(new O {
    int pages = 0; // pages indexed so far

    IndexedWikiPage get() ctex {
      long pageStart = 0;
      StringBuilder pageBuf = null;
      
      while licensed {
        long offset = reader.byteCount();
        S line = reader.readLine();
        if (line == null) break;
        line = trim(line);
        if (eq(line, "<page>")) {
          pageStart = offset;
          pageBuf = new StringBuilder;
        }
        if (pageBuf != null)
          pageBuf.append(line).append("\n");
        if (eq(line, "</page>")) {
          L<S> tok = htmlTok(str(pageBuf));
          S title = trim(htmldecode(join(contentsOfContainerTag(tok, "title"))));
          if ((++pages % 1000) == 0) {
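            // progress report every 1000 pages; 228400 is presumably the approximate total page count of the dump
            fractionDone(pages/228400.0);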
            print("Pages: " + pages + " (" + title + ")");
            sleep(1);
          }
          ret new IndexedWikiPage(title, pageStart, toInt(reader.byteCount()-pageStart));
        }
      }
      // no more lines (or the loop was stopped): report completion and release the file
      fractionDone(1);
      reader.close();
      null;
    }
  });
}
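
Each IndexedWikiPage carries a byte offset (start) and a byte length (len), so a page can be read back later without rescanning the file. The following is a minimal sketch of that lookup in plain Java rather than JavaX. The names ReadIndexedPage, readPage and demo are hypothetical and not part of this snippet, and UTF-8 encoding of the dump is an assumption.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the snippet above.
class ReadIndexedPage {

  // Reads `len` bytes starting at byte offset `start` and decodes them as UTF-8 (assumed encoding).
  static String readPage(Path dump, long start, int len) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(dump.toFile(), "r")) {
      byte[] buf = new byte[len];
      raf.seek(start);
      raf.readFully(buf); // throws EOFException if the file ends before start+len
      return new String(buf, StandardCharsets.UTF_8);
    }
  }

  // Builds a tiny stand-in dump, then reads the second page back via its byte offset and length.
  static String demo() {
    String header = "<mediawiki>\n";
    String pageA = "<page>\n<title>\u00c4</title>\n</page>\n"; // non-ASCII title: 2 bytes in UTF-8
    String pageB = "<page>\n<title>B</title>\n</page>\n";
    try {
      Path dump = Files.createTempFile("dump", ".xml");
      try {
        Files.write(dump, (header + pageA + pageB).getBytes(StandardCharsets.UTF_8));
        // Offsets are counted in bytes, not chars, matching reader.byteCount() in the indexer.
        long start = (header + pageA).getBytes(StandardCharsets.UTF_8).length;
        int len = pageB.getBytes(StandardCharsets.UTF_8).length;
        return readPage(dump, start, len);
      } finally {
        Files.deleteIfExists(dump);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.print(demo());
  }
}
```

Because the indexer records the offset before trimming the line and measures the length after the closing tag's line has been read, a slice read this way would include any leading indentation of the opening line and the final line terminator.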

Author comment

Began life as a copy of #1008067


Snippet ID: #1014153
Snippet name: indexSimpleWikipedia
Eternal ID of this version: #1014153/9
Text MD5: d3460f1311734490734180efba5af218
Author: stefan
Category: javax / a.i. / networking
Type: JavaX fragment (include)
Public (visible to everyone): Yes
Archived (hidden from active list): No
Created/modified: 2018-04-15 14:11:58
Source code size: 1390 bytes / 43 lines
Pitched / IR pitched: No / No
Views / Downloads: 367 / 392
Version history: 8 change(s)