BotCompany Repo | #1008067 // streamInSimpleWikipedia

JavaX fragment (include)

sclass WikiPage {
  S title, text;
  
  *() {}
  *(S *title, S *text) {}
}

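// Streams the pages of the unpacked Simple Wikipedia dump one at a time,
// reading the file line by line instead of loading it whole.
static IterableIterator<WikiPage> streamInSimpleWikipedia() {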
  File f = unpackSimpleWikipedia();
  final BufferedReader reader = utf8bufferedReader(f);
  please include function iteratorFromFunction.
  ret main.<WikiPage> iteratorFromFunction(new O {
    int lines = 0, pages = 0;
    StringBuilder pageBuf = null;
    
    WikiPage get() ctex {
      S line;
      while ((line = reader.readLine()) != null) {
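        line = trim(line);
        // A <page> line starts a fresh buffer; nothing is collected before the first <page>.
        if (eq(line, "<page>"))
          pageBuf = new StringBuilder;
        if (pageBuf != null)
          pageBuf.append(line).append("\n");
        // A </page> line completes the page: tokenize the buffer and extract title and text.
        if (eq(line, "</page>")) {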
          L<S> tok = htmlTok(str(pageBuf));
          S title = trim(htmldecode(join(contentsOfContainerTag(tok, "title"))));
          S text = trim(htmldecode(join(contentsOfContainerTag(tok, "text"))));
          // Report progress every 1000 pages (228400 appears to be the approximate page count of the dump).
          if ((++pages % 1000) == 0) {
            fractionDone(pages/228400.0);
            print("Pages: " + pages + " (" + title + ")");
            sleep(1);
          }
          ret new WikiPage(title, text);
        }
      }
      // End of file: mark progress as complete, close the reader, and return null to end the iteration.
      fractionDone(1);
      reader.close();
      null;
    }
  });
}
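
For readers without the JavaX helper library, the same streaming approach can be sketched in plain Java. This is an illustration under stated assumptions, not the library's implementation: `PageStreamSketch`, `Page`, `iteratorFromSupplier`, and `contentsOfTag` are hypothetical stand-ins for `WikiPage`, `iteratorFromFunction`, and `contentsOfContainerTag`. The tag extraction is deliberately naive, and HTML entity decoding and progress reporting are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

class PageStreamSketch {
  /** Hypothetical stand-in for WikiPage. */
  static final class Page {
    final String title, text;
    Page(String title, String text) { this.title = title; this.text = text; }
  }

  /** Assumed analogue of iteratorFromFunction: pulls values until the supplier returns null. */
  static <T> Iterator<T> iteratorFromSupplier(Supplier<T> supplier) {
    return new Iterator<T>() {
      private T next;       // value fetched ahead by hasNext(), if any
      private boolean done; // true once the supplier has returned null

      public boolean hasNext() {
        if (next == null && !done) {
          next = supplier.get();
          if (next == null) done = true;
        }
        return next != null;
      }

      public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T value = next;
        next = null;
        return value;
      }
    };
  }

  /** Naive extraction of the text between <tag ...> and </tag>; returns "" if absent. */
  static String contentsOfTag(String xml, String tag) {
    int open = xml.indexOf("<" + tag);
    if (open < 0) return "";
    int start = xml.indexOf('>', open);
    if (start < 0) return "";
    int end = xml.indexOf("</" + tag + ">", start);
    if (end < 0) return "";
    return xml.substring(start + 1, end).trim();
  }

  /** Reads line by line and yields one Page per <page>...</page> block. */
  static Iterator<Page> pages(BufferedReader reader) {
    return iteratorFromSupplier(() -> {
      try {
        StringBuilder buf = null;
        String line;
        while ((line = reader.readLine()) != null) {
          line = line.trim();
          if (line.equals("<page>")) buf = new StringBuilder();
          if (buf != null) buf.append(line).append('\n');
          if (line.equals("</page>") && buf != null) {
            String xml = buf.toString();
            return new Page(contentsOfTag(xml, "title"), contentsOfTag(xml, "text"));
          }
        }
        reader.close(); // end of input: release the reader and end the iteration
        return null;
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  public static void main(String[] args) {
    String dump = "<mediawiki>\n"
        + "  <page>\n    <title>Apple</title>\n"
        + "    <text xml:space=\"preserve\">A fruit.</text>\n  </page>\n"
        + "  <page>\n    <title>Bee</title>\n"
        + "    <text xml:space=\"preserve\">An insect.</text>\n  </page>\n"
        + "</mediawiki>\n";
    Iterator<Page> it = pages(new BufferedReader(new StringReader(dump)));
    while (it.hasNext()) {
      Page p = it.next();
      System.out.println(p.title + ": " + p.text);
    }
  }
}
```

Running `main` on the two-page sample prints `Apple: A fruit.` and `Bee: An insect.`. Because each page is built only when the iterator is asked for it, memory use stays at roughly one page regardless of the size of the dump.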

Author comment

Began life as a copy of #1008015

Snippet ID: #1008067
Snippet name: streamInSimpleWikipedia
Eternal ID of this version: #1008067/8
Text MD5: ebae3e08ef2f26ad05eaab4ab3c1888a
Author: stefan
Category: javax / a.i. / networking
Type: JavaX fragment (include)
Public (visible to everyone): Yes
Archived (hidden from active list): No
Created/modified: 2017-04-23 16:10:42
Source code size: 1244 bytes / 41 lines
Pitched / IR pitched: No / No
Views / Downloads: 459 / 473
Version history: 7 change(s)