Write Moby Dick, approximately

  • Here is a 1.2 MB ASCII text file containing the text of Herman Melville's Moby-Dick; or, The Whale. Your task is to write a program or function (or class, etc. -- see below) which will be given this file one character at a time, and at each step must guess the next character.


    This is a code-challenge. Your score will be


    2*L + E

    where L is the size of your submission in bytes, and E is the number of characters it guesses incorrectly. The lowest score wins.


    Further particulars


    Your submission will be a program or function (etc.) that will be called or invoked or sent data multiple times. (1215235 times to be exact.) When it is called for the nth time it will be given the nth character of whale.txt or whale2.txt and it must output its guess for the (n+1)th character. The E component of its score will be the total number of characters that it guesses incorrectly.


    Most submissions will need to store some state in between invocations, so that they can track how many times they have been called and what the previous inputs were. You can do this by writing to an external file, by using static or global variables, by submitting a class rather than a function, using a state monad, or whatever else works for your language. Your submission must include any code required to initialise its state before the first invocation.


    Your program should run deterministically, so that it always makes the same guesses given the same input (and hence always gets the same score).


    Your answer must include not only your submission, but also the code you used to calculate the E part of its score. This need not be written in the same language as your submission, and will not be counted towards its byte count. You are encouraged to make it readable.


    Regarding the interface between your submission and this score-calculating program, anything is fine, as long as your program always gives one byte of output before receiving its next byte of input. (So, for example, you can't just pass it a string containing all of the input and get a string back containing all of the output.)


    You must actually run your test program and calculate/verify your score before submitting your entry. If your submission runs too slowly for you to verify its score then it is not qualified to compete, even if you know what its score would be in principle.


    The L component of your score will be calculated according to the usual rules for code golf challenges. If your submission will contain multiple files, please take note of the rules on scoring and directory structure in that case. Any data that your code uses must be included in your L score.


    You may import existing libraries but may not load any other external files, and your code may not access the whale.txt or whale2.txt file in any way other than described above. You may not load any pre-trained neural networks or other sources of statistical data. (It's fine to use neural networks, but you have to include the weight data in your submission and count it towards your byte count.) If for some reason your language or libraries include a feature that provides some or all of the text of Moby Dick, you may not use that feature. Aside from that you can use any other built-in or library features that you like, including ones relating to text processing, prediction or compression, as long as they're part of your language or its standard libraries. For more exotic, specialised routines that include sources of statistical data, you would have to implement them yourself and include them in your byte count.


    It is likely that some submissions will include components that are themselves generated by code. If this is the case, please include in your answer the code that was used to produce them, and explain how it works. (As long as this code is not needed to run your submission it will not be included in your byte count.)


    For historical reasons, there are two versions of the file, and you may use either of them in an answer. In whale2.txt (linked above) the text is not wrapped, so newlines appear only at the end of paragraphs. In the original whale.txt the text is wrapped to a width of 74 characters, so you have to predict the end of each line as well as predicting the text. This makes the challenge more fiddly, so whale2.txt is recommended for new answers. Both files are the same size, 1215236 bytes.




    To summarise, all answers should include the following things:



    • Your submission itself. (The code, plus any data files it uses - these can be links if they're large.)

    • An explanation of how your code works. Please explain the I/O method as well as how it predicts the next character. The explanation of your algorithm is important, and good explanations will earn bounties from me.

    • The code you used to evaluate your score. (If this is identical to a previous answer you can just link to it.)

    • Any code you used to generate your submission, along with an explanation of that code. This includes code that you used to optimise parameters, generate data files etc. (This doesn't count towards your byte count but should be included in your answer.)


    Leaderboard




    (Interactive Stack Snippet omitted: it fetches answers via the Stack Exchange API and renders the leaderboard and winners-by-language tables.)




    Bounties


    From time to time I'll offer bounties to encourage different approaches.


    The first one, 50 points, was awarded to A. Rex for the best-scoring answer at the time.


    The second, 100 points, was also awarded to A. Rex, for the same answer, because they added a very good explanation to their existing answer.


    The next bounty, 200 points, will be awarded to either



    • A competitive answer that uses a new technique. (This will be based on my subjective judgment since it's my rep that goes into the bounty, but you can trust me to be fair. Note that your answer needs to contain sufficient explanation for me to understand how it works!) Such an answer needn't take the top score, it just needs to do reasonably well compared to existing answers. I'm particularly keen to see solutions based on recurrent neural networks, but I'll award the bounty to anything that seems different enough from the Markov models that dominate the current top scores.


    Or:



    • Anyone else who beats A. Rex's top score (currently 444444), using any method.


    Once the 200 point bounty is claimed I will most likely offer a 400 point one, updating the requirements accordingly.


    Comments are not for extended discussion; this conversation has been moved to chat.

    https://xkcd.com/1960/ appears to be a reference to this challenge!

    I thought of compressing this... but it is so long that my computer crashed *shrug*

    Ok, so do we need to predict the n+1th possible character for the entire text file, or just the part that starts from Chapter 1?

    @Razetime you must attempt to guess the (n+1)th character for every n from 1 to 1215235. So that includes all the facts about whales and so on before the start of chapter 1. (I thought about cropping that out of the file, so it would start with "Call me Ishmael", but felt it was better to include every word that Melville wrote.)

  • ///, 2*1 + 1020874 = 1020876



     


    Prints a space.
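    To see where the E = 1020874 in the header comes from, a scorer for this 1-byte baseline can be sketched in a few lines (run here on a short sample string; on whale2.txt it reproduces the error count above):

```python
# Score the constant predictor that always guesses a space.
def score_constant_guess(text, guess=" "):
    # The first character is given, never guessed; errors are counted
    # over the remaining characters.
    return sum(1 for c in text[1:] if c != guess)

print(score_constant_guess("Call me Ishmael."))  # 13
```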



    That is some extremely smart reward hacking! You must be an AGI ;)

  • Perl, 2·70525 + 326508 = 467558



    Predictor



    $m=($u=1<<32)-1;open B,B;@e=unpack"C*",join"",<B>;$e=2903392593;sub u{int($_[0]+($_[1]-$_[0])*pop)}sub o{$m&(pop()<<8)+pop}sub g{($h,%m,@b,$s,$E)[email protected]_;if($d eq$h){($l,$u)=(u($l,$u,$L),u($l,$u,$U));$u=o(256,$u-1),$l=o($l),$e=o([email protected],$e)until($l^($u-1))>>24}$M{"@c"}{$h}++-++$C{"@c"}[email protected] [email protected]=($h,@[email protected]);@[email protected][0..19][email protected]>20;@[email protected];for(@p,$L=0){$c="@c";last if" "ne [email protected] [email protected]<2 and$E>99;$m{$_}+=$M{$c}{$_}/$C{$c}for sort keys%{$M{$c}};$E+=$C{$c}}$s>5.393*$m{$_}or($s+=$m{$_},[email protected],$_)for sort{$m{$b}<=>$m{$a}}sort keys%m;$e>=u($l,$u,$U=$L+$m{$_}/$s)?$L=$U:return$d=$_ for [email protected]}


    To run this program, you need this file here, which must be named B. (You can change this filename in the second instance of the character B above.) See below for how to generate this file.



    The program uses a combination of Markov models essentially as in this answer by user2699, but with a few small modifications. This produces a distribution for the next character. We use information theory to decide whether to accept an error or spend bits of storage in B encoding hints (and if so, how). We use arithmetic coding to optimally store fractional bits from the model.



    The program is 582 bytes long (including an unnecessary final newline) and the binary file B is 69942 bytes long, so under the rules for scoring multiple files, we score L as 582 + 69942 + 1 = 70525.



    The program almost certainly requires a 64-bit (little-endian?) architecture. It takes approximately 2.5 minutes to run on an m5.large instance on Amazon EC2.



    Test code



    # Golfed submission
    require "submission.pl";

    use strict; use warnings; use autodie;

    # Scoring length of multiple files adds 1 penalty
    my $length = (-s "submission.pl") + (-s "B") + 1;

    # Read input
    open my $IN, "<", "whale2.txt";
    my $input = do { local $/; <$IN> };

    # Run test harness
    my $errors = 0;
    for my $i ( 0 .. length($input)-2 ) {
        my $current = substr $input, $i, 1;
        my $decoded = g( $current );

        my $correct = substr $input, $i+1, 1;
        my $error_here = 0 + ($correct ne $decoded);
        $errors += $error_here;
    }

    # Output score
    my $score = 2 * $length + $errors;
    print <<EOF;
    length $length
    errors $errors
    score $score
    EOF


    The test harness assumes the submission is in the file submission.pl, but this can easily be changed in the second line.



    Text comparison



    "And did none of ye see it before?" cried Ahab, hailing the perched men all around him.\\"I saw him almost that same instant, sir, that Captain 
    "And wid note of te fee bt seaore cried Ahab, aasling the turshed aen inl atound him. \"' daw him wsoost thot some instant, wer, that Saptain
    "And _id no_e of _e _ee _t _e_ore__ cried Ahab, _a_ling the __r_hed _en __l a_ound him._\"_ _aw him ___ost th_t s_me instant, __r, that _aptain

    Ahab did, and I cried out," said Tashtego.\\"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I
    Ahab aid ind I woued tut, said tashtego, \"No, the same instant, tot the same -tow nhe woubloon ws mane. alte ieserved the seubloon ior te, I
    Ahab _id_ _nd I ___ed _ut,_ said _ashtego__\"No_ the same instant_ _ot the same_-_o_ _he _oubloon _s m_ne_ __te _eserved the __ubloon _or _e_ I

    only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows! There again!--there again!" he cr
    gnly towe of ye sould have tersed the shite Whale aisst Ihere ihe blows! -there she blows! -there she blows! Ahere arains -mhere again! ce cr
    _nly_ _o_e of ye _ould have ___sed the _hite Whale _i_st_ _here _he blows!_-there she blows!_-there she blows! _here a_ain__-_here again!_ _e cr


    This sample (chosen in another answer) occurs rather late in the text, so the model is quite developed by this point. Remember that the model is augmented by 70 kilobytes of "hints" that directly help it guess the characters; it is not driven simply by the short snippet of code above.
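    The underscore rows in the comparison above can be generated mechanically. A small helper (illustrative only, not part of the submission) that keeps correct guesses and blanks out errors:

```python
def mask_errors(correct, guessed):
    # Keep characters the predictor got right; replace errors with "_".
    return "".join(c if c == g else "_" for c, g in zip(correct, guessed))

print(mask_errors("There she blows!", "Ahere ihe blows!"))  # _here _he blows!
```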



    Generating hints



    The following program accepts the exact submission code above (on standard input) and generates the exact B file above (on standard output):



    @S=split"",join"",<>;eval join"",@S[0..15,64..122],'open W,"whale2.txt";($n,@W)=split"",join"",<W>;for$X([email protected]){($h,$n,%m,@b,$s,$E)=($n,$W[$X]);',@S[256..338],'U=0)',@S[343..522],'for([email protected]){$U=($L=$U)+$m{$_}/$s;if($_ eq$n)',@S[160..195],'X<128||print(pack C,$l>>24),',@S[195..217,235..255],'}}'


    It takes approximately as long to run as the submission, as it performs similar computations.



    Explanation



    In this section, we will attempt to describe what this solution does in sufficient detail that you could "try it at home" yourself. The main technique that differentiates this answer from the others is described a few sections down as the "rewind" mechanism, but before we get there, we need to set up the basics.



    Model



    The basic ingredient of the solution is a language model. For our purposes, a model is something that takes some amount of English text and returns a probability distribution on the next character. When we use the model, the English text will be some (correct) prefix of Moby Dick. Please note that the desired output is a distribution, and not just a single guess for the most likely character.



    In our case, we essentially use the model in this answer by user2699. We didn't use the model from the highest-scoring answer (other than our own) by Anders Kaseorg precisely because we were unable to extract a distribution rather than a single best guess. In theory, that answer computes a weighted geometric mean, but we got somewhat poor results when we interpreted that too literally. We "stole" a model from another answer because our "secret sauce" isn't the model but rather the overall approach. If someone has a "better" model, then they should be able to get better results using the rest of our techniques.



    As a remark, most compression methods such as Lempel-Ziv can be seen as being a "language model" in this way, though one might have to squint a little bit. (It's particularly tricky for something that does a Burrows-Wheeler transform!) Also, note that the model by user2699 is a modification of a Markov model; essentially nothing else is competitive for this challenge or perhaps even modeling text in general.
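    As an illustration of what "a model returning a distribution" means here (a toy sketch, not user2699's actual code), an order-k Markov model can be written as:

```python
from collections import Counter, defaultdict

class MarkovModel:
    """Order-k character model: P(next char | last k chars).
    A toy distribution-returning model, not user2699's code."""
    def __init__(self, k=2):
        self.k = k
        self.counts = defaultdict(Counter)  # context -> next-char counts
        self.history = ""

    def feed(self, ch):
        # Record ch as a successor of the current k-character context.
        self.counts[self.history[-self.k:]][ch] += 1
        self.history += ch

    def distribution(self):
        # Empirical distribution over next characters for the current context.
        c = self.counts[self.history[-self.k:]]
        total = sum(c.values())
        return {ch: n / total for ch, n in c.items()} if total else {}

m = MarkovModel(k=2)
for ch in "abcabcab":
    m.feed(ch)
print(m.distribution())  # {'c': 1.0}
```

The important point is the return type: a full distribution, which the arithmetic coder below needs, rather than a single argmax guess.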



    Overall architecture



    For the purposes of understanding, it's nice to break up the overall architecture into several pieces. From the highest-level perspective, there needs to be a little bit of state management code. This isn't particularly interesting, but for completeness we want to stress that at every point the program is asked for the next guess, it has available to it a correct prefix of Moby Dick. We do not use our past incorrect guesses in any way. For efficiency's sake, the language model can probably reuse its state from the first N characters to compute its state for the first (N+1) characters, but in principle, it could recompute things from scratch every time it is invoked.



    Let's set this basic "driver" of the program aside and peek inside the part that guesses the next character. It helps conceptually to separate three parts: the language model (discussed above), a "hints" file, and an "interpreter". At each step, the interpreter will ask the language model for a distribution for the next character and possibly read some information from the hints file. Then it will combine these parts into a guess. Precisely what information is in the hints file as well as how it is used will be explained later, but for now it helps to keep these parts separate mentally. Note that implementation-wise, the hints file is literally a separate (binary) file but it could have been a string or something stored inside the program. As an approximation, we pretend that the model and interpreter are pretty short, so we can approximate that as "L ≈ 0".



    If one is using a standard compression method such as bzip2 as in this answer, the "hints" file corresponds to the compressed file. The "interpreter" corresponds to the decompressor, while the "language model" is a bit implicit (as mentioned above).



    Why use a hint file?



    Let's pick a simple example to analyze further. Suppose that the text is N characters long and well-approximated by a model wherein every character is (independently) the letter E with probability slightly less than a half, T similarly with probability slightly less than a half, and A with probability 1/1000 = 0.1%. Let's assume no other characters are possible; in any case, the A is pretty similar to the case of a previously unseen character out of the blue.



    If we operated in the L ≈ 0 regime (as most, but not all, of the other answers to this question do), there's no better strategy for the interpreter than to pick one of E and T. On average, it will get about half of the characters correct. So E ≈ N/2 and the score ≈ N/2 also. However, if we use a compression strategy, then we can compress to a little more than one bit per character. Because L is counted in bytes, we get L ≈ N/8 and thus score ≈ N/4, twice as good as the previous strategy.
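    Concretely, with N = 1215235 the back-of-envelope comparison between the two strategies is:

```python
# Back-of-envelope score comparison for the toy E/T/A model.
N = 1215235

guess_only = N / 2        # L ≈ 0, but wrong about half the time
compressed = 2 * (N / 8)  # ~1 bit/char of hints, essentially no errors

print(round(guess_only), round(compressed))  # 607618 303809
```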



    Achieving this rate of a little more than one bit per character for this model is slightly nontrivial, but one method is arithmetic coding.



    Arithmetic coding



    As is commonly known, an encoding is a way of representing some data using bits/bytes. For example, ASCII is a 7 bit/character encoding of English text and related characters, and it is the encoding of the original Moby Dick file under consideration. If some letters are more common than others, then a fixed-width encoding like ASCII is not optimal. In such a situation, many people reach for Huffman coding. This is optimal if you want a fixed (prefix-free) code with an integer number of bits per character.



    However, arithmetic coding is even better. Roughly speaking, it is able to use "fractional" bits to encode information. There are many guides to arithmetic coding available online. We'll skip the details here (especially of the practical implementation, which can be a little tricky from a programming perspective) because of the other resources available online, but if someone complains, maybe this section can be fleshed out more.



    If one has text actually generated by a known language model, then arithmetic coding provides an essentially-optimal encoding of text from that model. In some sense, this "solves" the compression problem for that model. (Thus in practice, the main issue is that the model isn't known, and some models are better than others at modeling human text.) If it was not allowed to make errors in this contest, then in the language of the previous section, one way to produce a solution to this challenge would have been to use an arithmetic encoder to generate a "hints" file from the language model and then use an arithmetic decoder as the "interpreter".



    In this essentially-optimal encoding, we end up spending -log_2(p) bits for a character with probability p, and the overall bit-rate of the encoding is the Shannon entropy. This means that a character with probability near 1/2 takes about one bit to encode, while one with probability 1/1000 takes about 10 bits (because 2^10 is roughly 1000).
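    These bit costs are easy to check directly (a quick numeric illustration, not part of the submission):

```python
import math

# Bit cost of arithmetically coding a character of probability p.
def bits(p):
    return -math.log2(p)

print(bits(0.5))    # exactly 1 bit
print(bits(0.001))  # ≈ 9.97 bits, since 2**10 ≈ 1000
```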



    But the scoring metric for this challenge was well-chosen to avoid compression as the optimal strategy. We'll have to figure out some way to make some errors as a tradeoff for getting a shorter hints file. For example, one strategy one might try is a simple branching strategy: we generally try to use arithmetic encoding when we can, but if the probability distribution from the model is "bad" in some way we just guess the most likely character and don't try encoding it.



    Why make errors?



    Let's analyze the example from before to motivate why we might want to make errors "intentionally". If we use arithmetic coding to encode the correct character, we will spend roughly one bit in the case of an E or T, but about ten bits in the case of an A.



    Overall, this is a pretty good encoding, spending a little over a bit per character even though there are three possibilities; basically, the A is fairly unlikely and we don't end up spending its corresponding ten bits too often. However, wouldn't it be nice if we could just make an error instead in the case of an A? After all, the metric for the problem considers 1 byte = 8 bits of length to be equivalent to 2 errors; thus it seems like one should prefer an error instead of spending more than 8/2 = 4 bits on a character. Spending more than a byte to save one error definitely sounds suboptimal!



    The "rewind" mechanism



    This section describes the main clever aspect of this solution, which is a way to handle incorrect guesses at no cost in length.



    For the simple example we've been analyzing, the rewind mechanism is particularly straightforward. The interpreter reads one bit from the hints file. If it is a 0, it guesses E. If it is a 1, it guesses T. The next time it is called, it sees what the correct character is. If the hint file is set up well, we can ensure that in the case of an E or T, the interpreter guesses correctly. But what about A? The idea of the rewind mechanism is to simply not code A at all. More precisely, if the interpreter later learns that the correct character was an A, it metaphorically "rewinds the tape": it returns the bit it read previously. The bit it read does intend to code E or T, but not now; it will be used later. In this simple example, this basically means that it keeps guessing the same character (E or T) until it gets it right; then it reads another bit and keeps going.



    The encoding for this hints file is very simple: turn all of the Es into 0 bits and Ts into 1 bits, all while ignoring As entirely. By the analysis at the end of the previous section, this scheme makes some errors but reduces the score overall by not encoding any of the As. As a smaller effect, it actually saves on the length of the hints file as well, because we end up using exactly one bit for each E and T, instead of slightly more than a bit.
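    The whole encode/decode cycle for this toy E/T/A setting can be simulated directly (a sketch of the idea, not the Perl implementation):

```python
# Hints encode only E (bit 0) and T (bit 1); A is never coded and
# always costs exactly one error.
def encode_hints(text):
    return [{'E': 0, 'T': 1}[c] for c in text if c != 'A']

def decode_with_rewind(text, hints):
    errors, i, guesses = 0, 0, []
    for correct in text:
        guess = 'E' if hints[i] == 0 else 'T'
        guesses.append(guess)
        if guess == correct:
            i += 1        # bit consumed
        else:
            errors += 1   # "rewind": the same bit will be reused
    return guesses, errors

hints = encode_hints("ETAAET")
guesses, errors = decode_with_rewind("ETAAET", hints)
print(hints, guesses, errors)
```

Running this on "ETAAET" produces the hints [0, 1, 0, 1]: each A costs one error, the bit read at that step is reused for the next E or T, and every E and T is guessed correctly.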



    A little theorem



    How do we decide when to make an error? Suppose our model gives us a probability distribution P for the next character. We will separate the possible characters into two classes: coded and not coded. If the correct character is not coded, then we will end up using the "rewind" mechanism to accept an error at no cost in length. If the correct character is coded, then we will use some other distribution Q to encode it using arithmetic coding.



    But what distribution Q should we choose? It's not too hard to see that the coded characters should all have higher probability (in P) than the not coded characters. Also, the distribution Q should only include the coded characters; after all, we're not coding the other ones, so we shouldn't be "spending" entropy on them. It's a little trickier to see that the probability distribution Q should be proportional to P on the coded characters. Putting these observations together means that we should code the most-likely characters but possibly not the less-likely characters, and that Q is simply P rescaled on the coded characters.



    It furthermore turns out that there's a cool theorem regarding which "cutoff" one should pick for the coding characters: you should code a character as long as it is at least 1/5.393 as likely as the other coded characters combined. This "explains" the appearance of the seemingly random constant 5.393 nearer the end of the program above. The number 1/5.393 ≈ 0.18542 is the solution to the equation -p log(16) - p log p + (1+p) log(1+p) = 0.
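    The constant can be verified numerically; this snippet (an independent check, not part of the submission) bisects for the root of the equation:

```python
import math

# LHS of -p*log(16) - p*log(p) + (1+p)*log(1+p) = 0, taking all logs
# in base 2 so that log(16) = 4. The root p* satisfies 1/p* ≈ 5.393.
def f(p):
    return -4 * p - p * math.log2(p) + (1 + p) * math.log2(1 + p)

def cutoff():
    lo, hi = 0.1, 0.3        # f(0.1) > 0 > f(0.3)
    for _ in range(60):      # bisect for the root p*
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 1 / lo

print(cutoff())  # ≈ 5.393
```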



    Perhaps it's a reasonable idea to write out this procedure in code. This snippet is in C++:



    #include <queue>
    #include <unordered_map>
    #include <utility>
    using namespace std;

    // Assume the model is computed elsewhere;
    // CHARS is the alphabet of possible characters.
    unordered_map<char, double> model;

    // Transform P to Q: admit characters in decreasing probability
    // until the cutoff condition fails, then rescale.
    unordered_map<char, double> code;
    priority_queue<pair<double, char>> pq;
    for (char c : CHARS)
        pq.push(make_pair(model[c], c));
    double s = 0, p;
    while (!pq.empty()) {
        char c = pq.top().second;
        pq.pop();
        p = model[c];
        if (s > 5.393 * p)      // the cutoff from the theorem
            break;
        code[c] = p;            // c becomes a coded character
        s += p;
    }
    for (auto& kv : code)       // rescale so Q sums to 1
        kv.second /= s;


    Putting it all together



    The previous section is unfortunately a little technical, but if we put all of the other pieces together, the structure is as follows. Whenever the program is asked to predict the next character after a given correct character:




    1. Add the correct character to the known correct prefix of Moby Dick.

    2. Update the (Markov) model of the text.

    3. The secret sauce: If the previous guess was incorrect, rewind the state of the arithmetic decoder to its state before the previous guess!

    4. Ask the Markov model to predict a probability distribution P for the next character.

    5. Transform P to Q using the subroutine from the previous section.

    6. Ask the arithmetic decoder to decode a character from the remainder of the hints file, according to the distribution Q.

    7. Guess the resulting character.



    The encoding of the hints file operates similarly. In that case, the program knows what the correct next character is. If it is a character that should be coded, then of course one should use the arithmetic encoder on it; but if it is a not coded character, it just doesn't update the state of the arithmetic encoder.



    If you understand the information-theoretic background like probability distributions, entropy, compression, and arithmetic coding but tried and failed to understand this post (except why the theorem is true), let us know and we can try to clear things up. Thanks for reading!


    Wow, impressive answer. I assume there is additional code needed to generate the `B` file? If so, please can you include that in your answer?

    Excellent! The first (and so far the only) answer to break the 500k score barrier.

    "tersed the shite whale" omg I'm crying

    Since no new answers were posted during the bounty period, I'm awarding it to your answer, as both the best scoring and the most sophisticated approach. If you ever have time, I would *really* appreciate a more in depth explanation of how this answer works, i.e. what exactly is the algorithm?

    @Nathaniel: I added an explanation to this post. Let me know if you think it's detailed enough to reproduce the solution yourself.

    I awarded the bounty for the explanation, with apologies for the slow time scale. I also replied to you in the chat room you created.

    I'd be interested in hearing more details about that theorem some time. Is it something in the literature, or is it something you proved yourself? Does it depend on properties of the input text? Where does that log(16) come from? (Is it related to the constant 2 in the scoring scheme?)

    @Nathaniel: Thank you very much for the second bounty. We (my collaborator and I) proved the theorem ourselves. I don't know of anything similar in the literature. Our attempt to write it up instead led to the discovery that this approach is suboptimal anyway and we should do something else: https://codegolf.stackexchange.com/a/182145/49643 Neither approach has been written up in full yet, very sadly.

    The log(16) does come from the scoring scheme. It's a little confusing because *twice* the length of the program in *bytes* is being added to the number of errors. As mentioned in the writeup, a naive calculation would thus suggest that "one should prefer an error instead of spending more than 8/2 = 4 bits on a character". Thus "16" is simply the number of possibilities that fits into these "4 bits" per error tradeoff. If you take all of the logs in base 2, then the equation has the same meaning but of course log(16) is just 4 exactly.

    Do let me know if you do end up writing something up from this - it'd be really nice to see! Information theory is one of my research interests.

    Congrats on the gold badge!

  • Node.js, 2*224 + 524279 = 524727



    Please refer to the change log at the end of this post for score updates.



    A function taking and returning a byte.



    a=[...l='14210100'],m={},s={},b={}
    f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32


    It consists of a simple PPM model that looks at the last 8 characters to predict the next one.



    We trust a pattern of length L when we have encountered it at least T[L] times, where T is an array of arbitrary thresholds: [1,1,2,1,2,3,5,2]. Furthermore, we always trust a pattern whose first character matches [A-Z '"(].



    We select the longest trusted pattern and return the prediction with the highest score associated to this pattern at the time of the call.
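    For readers untangling the one-liner, the mechanism can be sketched in readable (hypothetical, non-golfed) Python; the threshold array and the always-trusted character class are the ones described above, but the variable names and exact bookkeeping are invented:

```python
# Illustrative sketch only: names and the update rule are simplified from
# the golfed source; T and the [A-Z '"(] rule are the ones described above.
T = [1, 1, 2, 1, 2, 3, 5, 2]           # T[L-1]: required sightings for length L
ALWAYS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ \'"('
counts = {}                            # pattern -> {next char: count}

def update(context, nxt):
    """Record nxt as a successor of every suffix of context up to length 8."""
    for L in range(1, 9):
        pat = context[-L:]
        if len(pat) == L:
            bucket = counts.setdefault(pat, {})
            bucket[nxt] = bucket.get(nxt, 0) + 1

def predict(context):
    """Answer with the best successor of the longest trusted pattern."""
    for L in range(8, 0, -1):
        pat = context[-L:]
        if len(pat) != L or pat not in counts:
            continue
        seen = sum(counts[pat].values())
        if seen >= T[L - 1] or pat[0] in ALWAYS:
            return max(counts[pat], key=counts[pat].get)
    return ' '  # default guess when nothing is trusted
```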



    Notes




    • This is obviously not optimized for speed, but it runs in about 15 seconds on my laptop.


    • If we were allowed to repeat the process several times in a row without resetting the model, the number of errors would converge to ~268000 after 5 iterations.


    • The current success rate of the prediction function is ~56.8%. As noticed by @immibis in the comments, if bad and correct guesses are mixed together, the result is barely readable at all.



      For instance, this snippet near the end of the book:



      Here be it said, that this pertinacious pursuit of one particular whale,[LF]
      continued through day into night, and through night into day, is a thing[LF]
      by no means unprecedented in the South sea fishery.


      becomes:



      "e e be it said, that thes woacangtyous sarsuet of tie oort cular thale[LF][LF]
      orsinued toeough tir on e togh and sheough toght an o ters af t shin[LF][LF]
      be to means insrocedented tn hhe sputh Sevsaonh ry,


      By replacing bad guesses with underscores, we have a better idea of what the function got right:



      _e_e be it said, that th_s _____n___ous __rsu_t of __e __rt_cular _hale_[LF]
      _o__inued t__ough ___ _n__ __gh__ and _h_ough __ght _n_o ____ __ _ _hin_[LF]
      b_ _o means _n_r_cedented _n _he __uth _e_____h_ry_


      NB: The above example was created with a previous version of the code, working on the first version of the input file.




    Test code



    /**
    The prediction function f() and its variables.
    */
    a=[...l='14210100'],m={},s={},b={}
    f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32

    /**
    A closure containing the test code and computing E.
    It takes f as input.
    (f can't see any of the variables defined in this scope.)
    */
    ;
    (f => {
        const fs = require('fs');

        let data = fs.readFileSync('whale2.txt'),
            len = data.length,
            err = 0;

        console.time('ElapsedTime');

        data.forEach((c, i) => {
            i % 100000 || console.log((i * 100 / len).toFixed(1) + '%');

            if(i < len - 1 && f(c) != data[i + 1]) {
                err++;
            }
        })

        console.log('E = ' + err);
        console.timeEnd('ElapsedTime');
    })(f)





    Change log




    • 524727 - saved 19644 points by switching to whale2.txt (challenge update)

    • 544371 - saved 327 points by forcing patterns starting with a capital letter, a quote, a double-quote or an opening parenthesis to be also always trusted

    • 544698 - saved 2119 points by forcing patterns starting with a space to be always trusted

    • 546817 - saved 47 points by adjusting the thresholds and golfing the prediction function

    • 546864 - saved 1496 points by extending the maximum pattern length to 8 characters

    • 548360 - saved 6239 points by introducing the notion of trusted patterns, with thresholds depending on their length

    • 554599 - saved 1030 points by improving the line feed prediction

    • 555629 - saved 22 points by golfing the prediction function

    • 555651 - saved 40 points by golfing the prediction function

    • 555691 - initial score


    For the curious, no, this does not produce anything like Moby Dick. It's a whole lot of `sidg tlanses,oeth to, shuld hottut tild aoersors Ch, th! Sa, yr! Sheu arinning whales aut ihe e sl he traaty of rrsf tg homn Bho dla tiasot a shab sor ty, af etoors tnd hocket sh bts ait mtubb tiddin tis aeewnrs, dnhost maundy cnd sner aiwt d boelh cheugh -aaieiyns aasiyns taaeiins! th, tla`. It does manage to get a few complete words sometimes. Like `whales`.

    @immibis The title of the challenge was chosen wisely. This is Moby Dick, _approximately_. :-)

    It's good to include your old score, crossed out using `` tags, so we can see that you've improved it.

    @Nathaniel There's been many updates, so that would be barely readable and not really informative. I've added a change log instead with short explanations about the improvements.

    @Arnauld thanks, that's helpful

    I think your program is actually doing a perfect translation into Gaelic.

    The sample of the produced output includes a comma followed by *a character other than a space.* Would the score improve if you assumed that most punctuation (apart from the apostrophe) will always be followed by a space?

    @Draco18s It's hard to tell whether this comma was a good or a bad guess. If it was a bad guess, the prediction function may have legitimately tried to put a letter after _whatever-other-letter-that-was-actually-there-instead-of-the-comma_ once it received it.

    @Arnauld That's a good point. I was just curious what the result *might* be.

    @Draco18s I've just tried it (for the comma only) and it resulted in a loss of 92 points. Commas are also quite often followed by line feeds, quotes or double-quotes, so it's probably better to just let the model decide.

    Ah, in which case I'd say "some form of white space." But oh well. (Followed by quotes...oh, at the *end* of someone speaking. Ok, white space and more punctuation! :D)

    @immibis - considering the output, maybe save this code for the inevitable 'call of cthulu' repeat.

    This answer is currently the most upvoted, but it doesn't have the best score. Still a cool answer of course.

    I've updated the question with a version of the file where the text is not wrapped - you can probably save some points by using that

    @Nathaniel Thanks for the heads up. Updated.

    @immibis Stop this! You're awaking the old ones!

    Nitpicking, but shouldn't there be an underscore at the "*By replacing bad guesses with underscores, we have a better idea of what the function got right:*" part for the second `[LF]`s?

    @KevinCruijssen The second [LF] is a correct guess. The first one is a bad guess ([LF] instead of _g_ and _comma_, respectively).

    Nice solution! Could you save a few bytes (4?) by aliasing 'slice'?

    @Beska, if you mean Irish then no, no it's not! :p

    Full result: https://gist.github.com/v1993/b94bf26868d6e21e613ea01cca43e2cf. Best viewed in Google Translate from "Detect language"!

  • Python 3, 2·267 + 510193 = 510727





    Predictor



    def p():
        d={};s=b''
        while 1:
            p={0:1};r=range(len(s)+1)
            for i in r:
                for c,n in d.setdefault(s[:i],{}).items():p[c]=p.get(c,1)*n**b'\1\6\f\36AcWuvY_v`\270~\333~'[i]
            c=yield max(sorted(p),key=p.get)
            for i in r:e=d[s[:i]];e[c]=e.get(c,1)+1
            s=b'%c'%c+s[:15]


    This uses a weighted Bayesian combination of the order 0, …, 16 Markov models, with weights [1, 6, 12, 30, 65, 99, 87, 117, 118, 89, 95, 118, 96, 184, 126, 219, 126].
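    In other words, the combination is a weighted geometric mixture. A hypothetical, readable sketch of just that mixing step (not the golfed source; missing counts default to 1, as in the submission):

```python
def mix(counts_by_order, weights):
    """Weighted geometric combination: for each candidate character, multiply
    its per-order count raised to that order's weight (missing counts -> 1)."""
    scores = {}
    for counts, w in zip(counts_by_order, weights):
        for ch, n in counts.items():
            scores[ch] = scores.get(ch, 1) * n ** w
    # sorted() makes tie-breaking deterministic, as in the submission
    return max(sorted(scores), key=scores.get)
```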



    The result is not very sensitive to the selection of these weights, but I optimized them because I could, using the same late acceptance hill-climbing algorithm that I used in my answer to “Put together a Senate majority”, where each candidate mutation is just a ±1 increment to a single weight.
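    A minimal sketch of late acceptance hill-climbing (hypothetical helper names and parameters; in the actual run the cost function was the resulting error count and the vector was the 17 weights, with the same ±1 single-weight mutation):

```python
import random

def lahc(cost, x0, history_len=50, iters=5000, seed=0):
    """Late acceptance hill-climbing: accept a candidate if it is no worse
    than the cost recorded history_len iterations ago, or than the current
    cost. The mutation is a +/-1 increment to one randomly chosen entry."""
    rng = random.Random(seed)
    x, c = list(x0), cost(x0)
    hist = [c] * history_len
    for k in range(iters):
        y = list(x)
        y[rng.randrange(len(y))] += rng.choice((-1, 1))
        cy = cost(y)
        if cy <= hist[k % history_len] or cy <= c:
            x, c = y, cy
        hist[k % history_len] = c  # record the cost of the current solution
    return x, c
```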



    Test code



    with open('whale2.txt', 'rb') as f:
        g = p()
        wrong = 0
        a = next(g)
        for b in f.read():
            wrong += a != b
            a = g.send(b)
    print(wrong)

    The right tool for the job. Great score. Nice one.

    Possible clarification: `b"\0\3\6\r\34'&-20'\22!P\n[\26"` is the ascii representation of the weights, where small non-printable values are escaped in octal.

    I've updated the question with a version of the file where the text is not wrapped - you might try re-running your code on that (it might do slightly better)

    I tried to run it on TIO, but it doesn't work. Do you know what went wrong?

    @user202729 You need `sys.stdin.buffer`.

    @Nathaniel Yeah, it does better on `whale2.txt` by over 22000.

    By the way, how are the weights chosen?

    @Nathaniel I tried a few different things before settling on the same late acceptance hill-climbing algorithm that I used in my answer to “Put together a Senate majority”, where each candidate mutation is just a ±1 increment to a single weight.

    Thanks for the pointer to late acceptance hill-climbing (LAHC), which is an interesting technique that I didn't know about until now. What text did you use to optimise the weights, given that you were unable to use `whale.txt` or `whale2.txt` by the conditions of the puzzle?

    @Simon There is no such condition (the weights are fully included in the byte count of the predictor).

    Thanks for that explanation - if you could edit a synopsis into the question it would be great. (The experience with my previous challenge Paint Starry Night is that these optimisation procedures are the most interesting part of the answers, so it's much better if answers include the code used to do that, and an explanation of it. I included a rule in both challenges saying that they should.)

    Must `s` really be `bytes`, can't it be `str` (saving the `b` prefixes)? Or a tuple, initialized with `s=()` and updated with `s=c,*s[:15]`.

    @StefanPochmann `bytes` is a memory efficiency hack—this already uses 3 GB of memory, and while it would only be 5% worse with `str` or 30% worse with `tuple`, I’m happy at this stage to spend a handful of bytes from the sixth significant digit of my score to make the code a bit less painful to run.

    I don't speak Python but if I understood correctly then you do a linear mixing of the models which basically calculates a weighted arithmetic average of the models. I suggest switching to a weighted geometric mean. That's the main difference between the first PAQ versions and the current ones and greatly improved prediction for several reasons. See section "Adaptive model weighting" vs "Neural-network mixing". Neural network mixing is actually a single layer neural network with softmax activation trained on cross entropy simplified to binary.

    @Christoph My model combination actually is a weighted geometric mean. But PAQ’s averaging in the logistic domain is slightly different—I’ll have to see if that’s better.

  • Python 3, 2*279+592920=593478 2*250 + 592467 = 592967 2 * 271 + 592084 = 592626 2 * 278 + 592059 = 592615 2 * 285 + 586660 = 587230 2 * 320 + 585161 = 585801 2 * 339 + 585050 = 585728





    d=m={}
    s=1
    w,v='',0
    def f(c):
        global w,m,v,s,d
        if w not in m:m[w]={}
        u=m[w];u[c]=c in u and 1+u[c]or 1;v+=1;q=n=' ';w=w*s+c;s=c!=n
        if w in m:_,n=max((m[w][k],k)for k in m[w])
        elif s-1:n=d in'nedtfo'and't'or'a'
        elif'-'==c:n=c
        elif"'"==c:n='s'
        elif'/'<c<':':n='.'
        if v>4*(n!=q)+66:n='\n'
        if s:d=c
        if c<q:w=w[:-1]+q;v=s=0
        return n


    Try it online!



    A function using global variables. Learns as it goes, building a model at the word level: given what it's seen so far in this word, what's the most common next character? As more input comes in it learns common words from the text pretty well, and it also learns the most common character to start the next word.



    For example:




    • If what it's seen so far is 'Captai' it predicts an "n"

    • If it's "Captain" it predicts a space

    • If it's the start of a word, and the last word was "Captain" it predicts an 'A'

    • If the word so far is 'A' it predicts an 'h' (and then 'a' and 'b'; similarly for 'C').



    It doesn't do very well at the start, but by the end there are large parts of actual words coming out. The fallback option is a space, and after a single space it's an "a", unless the preceding letter was one of "nedtfo", a digit, or a hyphen or apostrophe. It also aggressively predicts line breaks after 71 characters, or if a space was expected after 66. Both of those were just tuned to the data ("t" is far more common after a space, but has more often already been predicted, so "a" is a better guess outside those six special cases).
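    A hypothetical, readable rendering of that word-level logic (names invented; the real submission also handles digits, hyphens, apostrophes and the line-break rule described above):

```python
model = {}  # "word prefix seen so far" -> {next character: count}

def train(prefix, nxt):
    """Record that nxt followed this word prefix."""
    bucket = model.setdefault(prefix, {})
    bucket[nxt] = bucket.get(nxt, 0) + 1

def guess(prefix, last_letter_of_prev_word):
    if prefix in model:
        # Most common next character for this prefix so far.
        return max(model[prefix], key=model[prefix].get)
    if prefix == '':
        # Start of a word: "t" after one of the six special endings, else "a".
        return 't' if last_letter_of_prev_word in 'nedtfo' else 'a'
    return ' '  # unknown prefix: fall back to assuming the word is over
```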



    Learning what word pairs went together and preseeding the mapping turned out not to be worthwhile.






    It ends up with text like this:



    nl tneund his    I woi tis tnlost ahet toie tn tant  wod, ihet taptain Ahab ses
    snd t
    oeed Sft aoid thshtego Io, fhe soie tn tant tot the soie ahe sewbtoon
    swn tagd aoths eatmved fhe sewbtoon wor ta I sfey aote of totsonld nive betse
    d ahe
    hate Whale iorst Ihe e ioi beaos! -there soi beaos! -there soi beaos!


    which corresponds to this part of the input:




    all around him.



    "I saw him almost that same instant, sir, that Captain Ahab did, and I
    cried out," said Tashtego.



    "Not the same instant; not the same--no, the doubloon is mine, Fate
    reserved the doubloon for me. I only; none of ye could have raised the
    White Whale first. There she blows!--there she blows!--there she blows!




    You can see where proper nouns in particular come out pretty well, but the ends of words are mostly right as well. When it's seen "dou" it expects "doubt", but once the "l" appears it gets "doubloon".



    If you run it a second time with the same model it just built it immediately gets another 92k correct (51.7%->59.3%), but it's always just under 60% from the second iteration on.






    The measuring code is in the TIO link, or here's a slightly better version:



    total = 0
    right = 0
    p = ''
    with open('whale.txt') as fp:
        with open('guess.txt', 'w') as dest:
            for l in fp.readlines():
                for c in l:
                    last = c
                    if p == c: right += 1
                    n = f(c)
                    p = n
                    total += 1
                    dest.write(n)
                    if total % 10000 == 0:
                        print('{} / {} E={}\r'.format(right, total, total-right), end='')
    print('{} / {}: E={}'.format(right, total, total - right))


    guess.txt has the guessed output at the end.


    This is an excellent approach!

    too much ;)

    +1 because this approach reminded me of the LZW compression algorithm.

  • C++, score: 2*132 + 865821 = 866085



    Thanks to @Quentin for saving 217 bytes!





    int f(int c){return c-10?"t \n 2  sS \n  -  08........       huaoRooe oioaoheu thpih eEA \n   neo    enueee neue hteht e"[c-32]:10;}


    A very simple solution that, given a character, just outputs the character that most frequently appears after the input character.
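    The lookup string can be generated offline from the text; a sketch of that preprocessing step (hypothetical helper, assuming the text is loaded as a Python string):

```python
from collections import Counter

def build_table(text):
    """For every character, find the character that most often follows it."""
    follow = {}
    for cur, nxt in zip(text, text[1:]):
        follow.setdefault(cur, Counter())[nxt] += 1
    return {cur: cnt.most_common(1)[0][0] for cur, cnt in follow.items()}
```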



    Verify the score with:



    #include <iostream>
    #include <fstream>

    int f(int c);

    int main()
    {
        std::ifstream file;
        file.open("whale2.txt");

        if (!file.is_open())
            return 1;

        char p_ch, ch;
        file >> std::noskipws >> p_ch;
        int incorrect = 0;
        while (file >> std::noskipws >> ch)
        {
            if (f(p_ch) != ch)
                ++incorrect;
            p_ch = ch;
        }

        file.close();

        std::cout << incorrect;
    }


    Edit: Using whale2.txt gives a better score.


    You can translate this array into a string literal and inline it directly in place of `L` to save a bunch of characters :)

    @Quentin Thanks! Now I'm wondering why I didn't think of that in the first place...

    -10 score with `c=` instead of `return`.

  • Python, 2*516 + 521122 = 522154



    Algorithm:



    Yet another Python submission; this algorithm calculates the most likely next letter by looking at sequences of length 1,...,l.
    The sum of probabilities is used, and there are a few tricks to get better results.





    from collections import Counter as C, defaultdict as D
    R,l=range,10
    s,n='',[D(C) for _ in R(l+1)]
    def A(c):
        global s;s+=c;
        if len(s)<=l:return ' '
        P=D(lambda:0)
        for L in R(1,l+1):
            w=''.join(s[-L-1:-1]);n[L][w].update([c]);w=''.join(s[-L:])
            try:
                q,z=n[L][w].most_common(1)[0];x=sum(list(n[L][w].values()))
            except IndexError:continue
            p=z/x
            if x<3:p*=1/(3-x)
            P[q]+=p
        if not P:return ' '
        return max(P.items(),key=lambda i:i[1])[0]
    import this, codecs as d
    [A(c) for c in d.decode(this.s, 'rot-13')]


    Results:



    Mostly gibberish, although you can see it picks up on the occasional phrase, such as "Father Mapple".



    errors: 521122
    TRAINING:
    result: tetlsnowleof the won -opes aIther Mapple,woneltnsinkeap hsd lnd the thth a shoey,aeidorsbine ao
    actual: ntal knobs of the man-ropes, Father Mapple cast a look upwards, and then with a truly sailor-like bu
    FINAL:
    result: mnd wnd round ahe ind tveryaonsracting th ards the sol ens-ike aeock tolblescn the sgis of thet t
    actual: und and round, then, and ever contracting towards the button-like black bubble at the axis of that s


    Test code:



    Pretty simple, outputs some examples of the text at different points.
    Uses whale2.txt, as this avoids some extra logic to calculate newlines.





    from minified import A

    def score(predict, text):
        errors = 0
        newtext = []
        for i, (actual, current) in enumerate(zip(text[1:], text[:-1])):
            next = predict(current)
            errors += (actual != next)
            newtext.append(next)
            if (i % (len(text) // 100) == 0):
                print('.', end='', flush=True)
        return errors, ''.join(newtext)

    t = open('whale2.txt')
    text = t.read()
    err2, text2 = score(A, text)
    print('errors:', err2)
    print("TRAINING:")
    print(text2[100000:100100].replace('\n', '\\n'))
    print(text[100001:100101].replace('\n', '\\n'))
    print("FINAL:")
    print(text2[121400:1215500].replace('\n', '\\n'))
    print(text[121401:1215501].replace('\n', '\\n'))

    Welcome to the site! This is a fantastic first submission. :)

    @DJMcMayhem, Thanks for the welcome. I've enjoyed watching for a while now, and this is the first contest to catch my attention for an entry.

  • C (gcc), 679787 652892



    84 76 bytes, 679619 652740 incorrect guesses





    p[128][128][128][128];a,b,c,d;g(h){p[a][b][c][d]=h;h=p[a=b][b=c][c=d][d=h];}


    Try it online!



    Update: ~27000 points off with updated file, 16 points (8 bytes) with a better-golfed function.



    Explanation



    The way this works is that as the code runs through the text, it memorizes the last character that terminated any given 4-character sequence, and returns that value. Somewhat similar to Arnauld's approach above, but relies on the inherent likelihood of two given 4-character sequences terminating the same way.



    De-golfed:



    p[128][128][128][128];
    a,b,c,d;
    g(h){
        p[a][b][c][d]=h;         // Memorize the last character.
        h=p[a=b][b=c][c=d][d=h]; // Read the guess. We save several
                                 // bytes with the assignments inside indices.
    }

    ... TIO link is useless. So the function returns the value of the last assignment?

    Let me edit the answer with an explanation, then :)

    @Rogem I've added a de-golfed version (which I did because I couldn't follow it either) - hopefully this doesn't bother you but please roll back if desired.

    @AdamDavis on most implementations of C, all global variables start at zero. It's undefined behavior, so it's only used in code-golf.

    @NieDzejkob Ah, you're right, thanks! "ANSI-C requires that all uninitialized static/global variables have to be initialized with 0."

    @AdamDavis woah, it's not UB? TIL!

    @AdamDavis It's beautiful.

  • sh+bzip2, 2*364106 = 728212


    2*381249 + 0 = 762498


    dd if=$0 bs=1 skip=49|bunzip2&exec cat>/dev/null

    followed by the bzip2-compressed whale2.txt with the first byte missing


    Ignores its input; outputs the correct answer. This provides a baseline on one end; daniero provides a baseline on the other end.


    Builder script:


    #!/bin/sh
    if [ $# -ne 3 ]
    then
        echo "Usage $0 gen.sh datafile output.sh"
        exit 1
    fi

    cat $1 > $3
    dd ibs=1 if=$2 skip=1 | bzip2 -9 >> $3
    chmod +x $3

    I/O test harness (tcc; cut off first line for gcc). This test harness can be used by anybody on a suitable platform that submits a complete program that expects read/write I/O. It uses byte-at-a-time I/O to avoid cheating. Child program must flush output after every byte to avoid blocking.


    #!/usr/bin/tcc -run
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    int main(int argc, char **argv)
    {
        volatile int result;
        int readfd[2];
        int writefd[2];
        int cppid;
        int bytecount;
        char c1, c2, c3;
        if (argc != 2) {
            printf("write X approximately -- service host\n");
            printf("Usage: %s serviceprocessbinary < source.txt\n", argv[0]);
            return 1;
        }
        /* Start service process */
        if (pipe(readfd)) {
            perror("pipe()");
            return 3;
        }
        if (pipe(writefd)) {
            perror("pipe()");
            return 3;
        }
        result = 0;
        if (!(cppid = vfork())) {
            char *argtable[3];
            argtable[0] = argv[1];
            argtable[1] = NULL;
            dup2(readfd[0], 0);
            dup2(writefd[1], 1);
            close(readfd[1]);
            close(writefd[0]);
            close(readfd[0]);
            close(writefd[1]);
            execvp(argv[1], argtable);
            if (errno == ENOEXEC) {
                argtable[0] = "/bin/sh";
                argtable[1] = argv[1];
                argtable[2] = NULL;
                /* old standard -- what isn't an executable
                 * can be exec'd as a /bin/sh script */
                execvp("/bin/sh", argtable);
                result = ENOEXEC;
            } else {
                result = errno;
            }
            _exit(3);
        } else if (cppid < 0) {
            perror("vfork()");
            return 3;
        }
        if (result) {
            errno = result;
            perror("execvp()");
            return 3;
        }
        close(readfd[0]);
        close(writefd[1]);
        /* check results */
        read(0, &c2, 1);
        bytecount = 1;
        errno = 0;
        while (read(0, &c1, 1) > 0) {
            write(readfd[1], &c2, 1);
            if (read(writefd[0], &c3, 1) <= 0) {
                printf("%d errors (%d bytes)\n", result, bytecount);
                if (errno == 0)
                    fprintf(stderr, "pipe: unexpected EOF\n");
                else
                    perror("pipe");
                return 3;
            }
            if (c3 != c1)
                ++result;
            c2 = c1;
            ++bytecount;
        }
        printf("%d errors (%d bytes)\n", result, bytecount);
        return 0;
    }

    Can you explain how the I/O works in this submission?

    @Nathaniel: You start the thing and send bytes to it. Or don't bother because you already see the answer in front of you.

    Well ok but how is this achieved by your code? I'm only asking because I'm curious.

    @Nathaniel: dd opens the file given by $0 and skips the first 37 characters; and the rest is sent to bunzip2.

    I think what he's asking is: how does this not violate the `but may not load any other external files, and your code may not access the whale.txt file in any way other than described above.` clause?

    @Rogem The compressed data is put after what is shown here, and the code access itself.

    `skip=37` should probably be `skip=39` (the length of the code plus the newline after it). What's the purpose of `&exec cat`?

    The question says: *"Your submission will be a program or function (etc.) that will be called or invoked multiple times. When it is called for the `nth` time it will be given the nth character of `whale.txt` or `whale2.txt` and it must output its guess for the `(n+1)th` character."* -- How is this requirement accomplished? The code displays the whole text of `whale.txt` every time it is executed.

    @axiac "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input."

    @axiac I guess to ignore the latter part.

    I don't understand @user202729's answer to axiac's question. If it always gives the complete text, how do you know which bytes of the text to ignore? If we can choose to ignore or not-ignore output bytes at will we might as well have a program that just outputs "abcdefghijklmnopqrstuvwxyz" regardless of input.

    @Vincent ... ignore what? I don't understand your question. No, literally "anything is fine". The I/O format of this answer is different from other answers. It's possible to use a stream of characters as input, as long as for each byte you read, you output one byte. (KevinCruissen's suggestion, Java related)

    Yes I was talking about the output. As I understand it, every time it is executed it outputs the whole text of Moby Dick. The first time it does this, we know we should only consider the first character of that output as the 'real' output and ignore the rest. The second time we know we should only look at the second character of the output etc. But how do we know? Or do I misread what the program does and does it really only output one byte each time?

    @Vincent The I/O format is flexible, it doesn't force the program to output one byte each time, just that it must output at least one byte after read a byte. In this case the program read 0 byte and output about a million byte, 0 < 1000000 so it's fine.

    @axiac: Yeah, looks like I typoed it when rekeying it after testing. I'm not interested in maintaining it now that the predictors are starting to get good.

    @Vincent I thought quite a bit about this at the time it was posted, and decided it's OK, I/O wise. If I wanted to, I could write a test suite that sends one byte to this program via STDIN (which it would ignore), gets one byte from it via STDOUT, and so on, which would satisfy the requirements. At least, I think that would work. Strictly speaking, such a test script should also be included in the answer, but since it's no longer competitive I think we can let that slide.

    @Nathaniel you wrote in the question: *"Your submission will be a program or function (etc.) that will be called or invoked multiple times. (`1215235` times to be exact.) When it is called for the nth time it will be given the nth character of `whale.txt` or `whale2.txt` and it must output its guess for the (n+1)th character."*. This answer clearly does not follow this requirement.

    @axiac: Here's the expected test harness. On writing it I found I had a bug to fix.

    @axiac given the test harness, I'm happy to regard sending the program one byte from STDIN as "calling or invoking" it. The important thing is that the program returns one byte of output after each byte of input, which this one does actually do, when run through the test harness. As the question says, "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input."

    I am actually surprised it saves so little... With a purely-textual file I would have expected about a x10 compression rate, not x4...

    @sillyfly: Yea. I expected about x6. Shrug.

    Have you tried killing some entropy in the (compressed) file? If (say) replacing `'z'` with `' '` saves more bytes than it costs in errors... (you have to save 0.5 bytes for every error you generate). In theory this would be best done by the compression algorithm being told the cost function and being permitted to pick the "wrong" value if it costs more than X bytes to put the right value in.

    @Yakk ... that involves some information theory... which is covered in @{A.Rex}'s solution. And we're not sure how the compression algorithm works, too.

    @user202729: The algorithm is approximately compressing words. Not really, but bzip2 compression rarely splits a word if the word occurs frequently.

  • C++, 2·62829 + 318786 = 444444



    To run this program, you need this file here, which must be named C.



    The program uses the same combination of Markov models as in our earlier answer. As before, this combination is essentially the model from this answer by user2699, but with a few small modifications.



    Since this answer uses the exact same model as before, the improvement is a better information-theoretic mechanism than the "rewind" mechanism described earlier. This allows it to make fewer errors while also having a smaller combined length. The program itself isn't golfed very much because it is not the major contributor to the score.
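    In miniature, the underlying mechanism looks like the following toy floating-point arithmetic coder over a fixed two-symbol distribution (purely illustrative: the submission's coder is integer-based, adapts its probabilities per character, and reads its bits from the file C):

```python
# Toy arithmetic coder (floating point, fixed probabilities) -- illustrates
# how a side-channel number plus a probability model reproduces an exact
# symbol sequence.
def encode(symbols, probs):
    """Narrow [lo, hi) once per symbol; return a number in the final interval."""
    lo, hi = 0.0, 1.0
    for s in symbols:
        span = hi - lo
        cum = 0.0
        for sym, p in probs:
            if sym == s:
                lo, hi = lo + cum * span, lo + (cum + p) * span
                break
            cum += p
    return (lo + hi) / 2

def decode(x, n, probs):
    """Recover n symbols by re-running the same interval narrowing."""
    out = []
    lo, hi = 0.0, 1.0
    for _ in range(n):
        span = hi - lo
        cum = 0.0
        for sym, p in probs:
            nlo, nhi = lo + cum * span, lo + (cum + p) * span
            if nlo <= x < nhi:
                out.append(sym)
                lo, hi = nlo, nhi
                break
            cum += p
    return out
```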



    The program is 2167 bytes long (including all the tabs for indentation and lots of other unnecessary characters, but before the test code), and the binary file C is 60661 bytes long, so under the rules for scoring multiple files, we score L as 2167 + 60661 + 1 = 62829.



    The program takes approximately 8 minutes to run on an m5.4xlarge instance on Amazon EC2 and uses a little over 16 GB of memory. (This excessive memory usage isn't necessary -- we just didn't optimize that either.)



    #include <map>
    #include <queue>
    #include <vector>
    using namespace std;

    FILE *in;
    unsigned int a, b = -1, c, d;
    string s, t;
    double l, h = 1, x[128][129], y[129], m[128];
    map<string, int> N;
    map<string, double[128]> M;
    int G, S;

    int f(int C)
    {
        int i, j;
        for (i = 0; i <= 20 && i <= S; i++) {
            t = s.substr(S - i);
            N[t]++;
            M[t][C]++;
        }
        s += C;
        S++;

        for (i = 0; i < 128; i++)
            m[i] = 0;

        int E = 0;
        for (i = 20; i >= 0; i--) {
            if (i > S)
                continue;
            t = s.substr(S - i);
            if (i <= 2 && E >= 100 && (i == 0 || t[0] != ' '))
                break;
            if (M.find(t) == M.end())
                continue;
            for (j = 0; j < 128; j++) {
                m[j] += M[t][j] / N[t];
            }
            E += N[t];
        }

        double r = 0;
        for (i = 0; i < 128; i++)
            r += m[i];
        for (i = 0; i < 128; i++)
            m[i] = m[i] / r;

        if (!in) {
            in = fopen("C", "r");
            for (i = 0; i < 4; i++)
                c = c << 8 | getc(in);
        } else {
            l = x[C][G]
                + (l - y[G]) * (x[C][G + 1] - x[C][G]) / (y[G + 1] - y[G]);
            h = x[C][G]
                + (h - y[G]) * (x[C][G + 1] - x[C][G]) / (y[G + 1] - y[G]);
        }

        priority_queue<pair<double, int>> q;
        for (i = 0; i < 128; i++) {
            q.push(make_pair(m[i], i));
        }

        int n = 0;
        double s = 0;
        while (q.size()) {
            i = q.top().second;
            q.pop();
            if (m[i] < s / (n + 15))
                break;
            s += m[i];
            n++;
        }

        r = 0;
        for (i = 0; i < 128; i++) {
            y[i + 1] = m[i] - s / (n + 15);
            if (y[i + 1] < 0)
                y[i + 1] = 0;
            r += y[i + 1];
        }
        for (i = 0; i < 128; i++)
            y[i + 1] /= r;

        for (i = 0; i < 128; i++) {
            r = 0;
            for (j = 0; j < 128; j++) {
                x[i][j + 1] = y[j + 1];
                if (i == j)
                    x[i][j + 1] *= 16;
                r += x[i][j + 1];
            }
            for (j = 0; j < 128; j++)
                x[i][j + 1] /= r;
            x[i][0] = 0;
            for (j = 0; j < 128; j++)
                x[i][j + 1] += x[i][j];
        }

        y[0] = 0;
        for (i = 0; i < 128; i++)
            y[i + 1] += y[i];

        for (G = 0; G < 128; G++) {
            if (y[G + 1] <= l)
                continue;
            if (y[G + 1] < h) {
                d = a + (b - a) * ((h - y[G + 1]) / (h - l));
                if (c <= d) {
                    b = d;
                    l = y[G + 1];
                } else {
                    a = d + 1;
                    h = y[G + 1];
                }
                while ((a ^ b) < (1 << 24)) {
                    a = a << 8;
                    b = b << 8 | 255;
                    c = c << 8 | getc(in);
                }
            }
            if (h <= y[G + 1])
                return G;
        }
    }
    // End submission here. Test code follows.
    int main()
    {
        FILE *moby = fopen("whale2.txt", "r");

        int E = 0;
        int c = getc(moby);
        while (c != EOF) {
            int guess = f(c);
            c = getc(moby);
            if (c != guess)
                E++;
        }

        printf("E=\t%d\n", E);

        return 0;
    }

License under CC-BY-SA with attribution


Content dated before 7/24/2021 11:53 AM