NovelGen 2019

I generated a novel using an algorithm for National Novel Generation Month.  Here’s the result: The Journey of the Book.

Previous Attempts

I used to try to write a novel every year during the month of November.  In the last few years, with my free time vastly decreased, I've started doing NaNoGenMo instead (generate a novel in one month).  The goal is to encourage experiments with all the interesting text generation technologies that have come about.

A few years ago I mostly used a hand-crafted Markov chain to generate text, drawing from a number of sources (Shakespeare, The King James Bible, Moby Dick, Lord of the Rings and H.P. Lovecraft).  I also used some of the python text analysis libraries to analyze sentences and keep the good ones.  However, I mostly ended up using simple regex matching (reasonable length, starting with a capital letter, ending with a period) to strip out the nonsense.
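For reference, the whole pipeline was roughly that simple.  Here's an illustrative Python sketch of the idea (the function names and length thresholds are made up for this post, not pulled from my original script):

```python
import random
import re

def build_markov_chain(text, order=2):
    """Map each word-tuple of length `order` to the words that follow it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

def generate_sentence(chain, max_words=40):
    """Walk the chain from a random starting key until a period or the word cap."""
    key = random.choice(list(chain.keys()))
    out = list(key)
    while len(out) < max_words and key in chain:
        nxt = random.choice(chain[key])
        out.append(nxt)
        key = tuple(out[-len(key):])
        if nxt.endswith('.'):
            break
    return ' '.join(out)

# Keep only sentences of reasonable length that start with a capital and end with a period.
SENTENCE_FILTER = re.compile(r'^[A-Z][^.]{20,200}\.$')

def keep_good_sentences(candidates):
    return [s for s in candidates if SENTENCE_FILTER.match(s)]
```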

GPT-2

GPT-2 was one of the big announcements in the world of text generation this year.  It was released with a ton of fanfare, including an ominous footnote that the most powerful model would not be released, for the public's safety.  The argument was something like: the resultant text would be too realistic, and the onslaught of fake news would be too much for folks to keep up with.  When it was first released, quite a few interesting tools were built to let you play around with it (https://talktotransformer.com/).

One of the most useful features of GPT-2 is that it can take a passage of text as input and continue generating text that matches its theme and voice.  I'd never run the full thing before, so the first step was setting up gpt-2 locally.  I also utilized this fork of the official gpt-2 library, which contains some extra helpful scripts.

Here are some guides that were pretty useful:

https://minimaxir.com/2019/09/howto-gpt2/

https://medium.com/@ngwaifoong92/beginners-guide-to-retrain-gpt-2-117m-to-generate-custom-text-content-8bb5363d8b7f

Out of the box, gpt-2 comes with models trained on raw internet text.  Most of this is what you'd expect - personal blogs, snippets of news, and quite a bit of political opinion.  Samples generated from these models sound about the same (an angry uncle on Facebook posting 9/11 conspiracies).
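For a sense of what that involves, here's a minimal sketch of sampling from the stock model.  I'm using the gpt-2-simple package here purely for brevity (an assumption on my part - the fork I actually used drives the same steps through its own scripts):

```python
import gpt_2_simple as gpt2

# Download the smallest released model if it isn't already cached locally.
# (It was labeled 117M in 2019; the weights are now distributed as 124M.)
gpt2.download_gpt2(model_name="124M")

# Start a TensorFlow session and load the stock model -- no fine-tuning yet.
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, model_name="124M")

# Generate a short unconditional sample; expect generic internet prose.
gpt2.generate(sess, model_name="124M", length=200, temperature=0.7)
```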

Training the Model

The key to generating interesting narrative text with GPT-2 is training a new model.  I grabbed some useful text from Project Gutenberg: Shakespeare's Sonnets and Macbeth, The King James Bible (Genesis, Psalms, Revelation), Huck Finn, Lovecraft and a large collection of travel journals from the 19th century.

Training a model was simple enough, but it took quite a bit of time.  Each iteration could take 30-60 seconds, and a good model takes hundreds (or thousands) of iterations.  For my final model, I let my laptop churn away all day.
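The fine-tuning step itself looks roughly like the following, again sketched with gpt-2-simple (the corpus filename, run name, and step count are placeholders; the fork I used does the same thing through a training script):

```python
import gpt_2_simple as gpt2

# Plain-text corpus assembled from the Project Gutenberg sources above.
CORPUS = "training_corpus.txt"  # placeholder filename

sess = gpt2.start_tf_sess()

# Fine-tune the small model on the corpus.  Assumes the base model was
# already downloaded as in the earlier snippet.  At 30-60 seconds per
# step, 1000 steps is most of a day on a laptop.
gpt2.finetune(sess,
              dataset=CORPUS,
              model_name="124M",
              steps=1000,
              run_name="travel_journals",
              print_every=25,     # log the loss periodically
              sample_every=250,   # print a sample to sanity-check progress
              save_every=250)     # checkpoint so a crash doesn't lose the run
```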

High Hopes

Originally I had some big ambitions for the structure of the novel.  I generated a handful of unique models (poetry, travel journals, religious text, colloquial Americana).  Then I put together a quick python script to recursively build a tree structure of branching themes.  The idea was to generate a core passage (the root), then use it to spin off branches using randomly selected models.  In turn, those nodes would use their core passages to spin off more branches.
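The tree itself was the easy part to picture.  A rough sketch of what I had in mind looks something like this, where generate_with_model is a hypothetical stand-in for the actual (slow) GPT-2 call:

```python
import random
from dataclasses import dataclass, field

MODELS = ["poetry", "travel_journals", "religious_text", "colloquial_americana"]

@dataclass
class Node:
    """One passage in the branching narrative tree."""
    model: str
    passage: str
    children: list = field(default_factory=list)

def generate_with_model(model_name, prompt):
    """Stand-in for the GPT-2 call (10+ minutes per page-length chunk)."""
    raise NotImplementedError

def grow(node, depth, branches=2):
    """Recursively spin off child passages using randomly selected models."""
    if depth == 0:
        return
    for _ in range(branches):
        model = random.choice(MODELS)
        child = Node(model=model, passage=generate_with_model(model, node.passage))
        node.children.append(child)
        grow(child, depth - 1, branches)

# root = Node(model="travel_journals", passage=generate_with_model("travel_journals", seed_text))
# grow(root, depth=3)
```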

The problem came about mostly from the limitations of the python scripts.  For one, most of the required libraries (tensorflow, etc.) didn't match the expected versions, so running the scripts spit out a ton of warnings.  The generate_with_input script was also pretty slow; it would take 10+ minutes to spit out a page-length chunk of text.  Building a framework to hold the tree in memory, run the gpt-2 scripts, and ingest and parse the text wasn't a quick hack project.  Plus, if the thing had to run for hours, it was bound to freeze up or crash due to a bug.  The true solution would be to serialize the tree to disk and be able to resume where it left off.  Not something I felt like coding.

I ended up simply writing a wrapper that would take input text and send the generated text to output, which I could then pipe to disk.  I could also bump up the number of samples created, so larger sections of text would be generated.  The input for each section was an interesting passage I found near the closing paragraphs of the preceding section.  In the end, I ran the generator about 10 times, spitting out roughly 5-10k words each time, resulting in a novel of 74k words.
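A sketch of that final workflow, once more using gpt-2-simple for illustration rather than the fork's scripts (the run name, seed passage, and output path are placeholders):

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="travel_journals")  # load the fine-tuned checkpoint

# Seed the next section with an interesting passage pulled from near the
# closing paragraphs of the preceding section's output.
seed_passage = "We are now in this most remote mountain range..."  # illustrative

gpt2.generate_to_file(sess,
                      run_name="travel_journals",
                      destination_path="section_03.txt",
                      prefix=seed_passage,
                      length=1023,      # max tokens per sample
                      nsamples=10,      # more samples -> a larger section
                      temperature=0.8)
```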

Cleanup

The raw output was half decent.  Large sections were filled with readable sentences, correct punctuation, and even reasonable flow and transitions between paragraphs.  Parts even felt like they were lifted verbatim from the source text, but searching the inputs proved they were original.  There were some sections where strange punctuation (headers, footers, indentations, missing parentheses) had resulted in distorted text, so I cleaned these up.  I also broke the full text into logical sections (chapters and parts), doing a quick skim through.

Beyond that, there were only two changes I made to the text itself.  The first was due to one of the sources (Huck Finn).  There were some paragraphs where the N-word was heavily repeated.  Of course, the word fit with the sentences, being generated from the original dialogue.  But I felt it was a distraction, especially since it was repeated so heavily (18 times).  I ended up replacing it with the word "man", which mostly leaves the feel of those sentences intact.

The second was the title itself: The Journey of the Book.

Probably the largest source of content was the set of 19th-century travel journals, many of them to exotic places (the Himalayas, the Middle East, South America).  So much of the generated text is a journey: climbing over rocky mountains, drifting down mystical rivers, visiting the folk of strange tribal villages.  Near the end, there's talk of a fabled book, but the references are fleeting and strange.  The whole project has been a journey to generate a book, so the title fit.

Select Passages

I haven't read the whole thing, but here are a few sections that stood out:

Part 1, Chapter 4

We are now in this most remote mountain range, with the very rocks of wherever they may be; but where they shall remain is bound to be very hard. I shall have the satisfaction of being able to descend to it again, without having to face any difficulties whatsoever, unless my own mind and spirit become possessed by a terrible fear of things to come. There is no other path to be found; it may be an ever changing world.

Part 1, Chapter 5

For the gate at our door was full of broken bricks and fallen stones; there was no way out. The soldiers stood here and searched our feet. When the sun came out, and the moon came down, our path lay ahead. In our hearts our enemies cried aloud at the sight of the city, "What are these little stones and bricks, now falling to the ground?" "No, only the one that will go through us, the one that will die," cried out the soldiers, "at last, will we die together." Then they said in unison at our feet, "We will die together."

Part 1, Chapter 8

“I mean it ain't pretty quiet. It's pretty quiet in a lot of these places, too. But there was a lot of singing, a lot of swearing, and shouting, and a lot of the children were shooting at each other, too. And a lot of the folks that came to talk to the news were singing and shouting. They had the boys of the neighbourhood shooting at everybody, and they had the men of the neighbourhood screaming and throwing down in a frenzy. They seemed to take the best of everything that was said, and then turned it around so everybody could hear a few words and shout at them.”

Part 2, Chapter 1

I felt like I was going to land; that was not a chance to look in front of anything I might see, so I went to sitting on the stick, and turned on the other half of the stick and looked up at the stars; and then I rolled back on my blanket, which was the only thing I was looking at, and I said, with a little pause, to the stars, "Here's the next day's sunrise; don't you know your way to it till I tell you I'm going to. I'll be your guide when I go to night, and you shouldn't leave me. I want something for breakfast, and one-a-thousandth of it; and now I'll make up it. I want a book, and a book-case, and a book-case for the stars, and so on"

Part 2, Chapter 6

When my life was in danger of being destroyed by my own wickedness, the only way out was to go and get the head.  The way out was a terrible one, for it was a most terrible experience.
But we could not lose the Head either.  Every night he would go down there and kill me. This was it, the only way out.  Then we would go back up to the river. And after that I knew all about the heads. And every night I would stay up there and watch them, and they would shoot.  One morning I was standing on the bridge, and one one of the little men would say to me:
My head's dead!
And I cried and cried, and I cried and cried, and there was no sound but the water and the river, and the dogs shooting.

Part 3, Chapter 5

We have got to make a choice; the choice will be ours, and we can have our way.
There are two sides to the whole thing.
I can go to heaven by going to hell.  I can have my way by going to hell. But the devil will always be with me--if I can go to the last place, it will be my chance to go to the first place which is my home."

Part 3, Chapter 6

"It all got to be so hot it was too much; we didn't get to sleep that well. But I said, as soon as I got to it, it didn't matter if we slept; it was always a fine night, and we did sleep, and sleep. But I couldn't stand the cold; I couldn't stand the hot. And when I got up to bed, it was too late to get by, so I was lying, and I said: "When can I see the sky again?"
And she said: "When I see the sky again."
I said: "It is all fine now."

Some Larger Observations

GPT-2 is very impressive at constructing individual sentences and matching the tone and voice of a passage.  It does have a few quirks, though.

Often it will latch onto repetition in the source text and produce strange loops.  You can see the "scope" at which the algorithm breaks down: on a word-by-word basis the sentence structure is correct, but the repetition is out of control.

You wake up, old boy, let's have a good time! I'm awake! I'm awake!
What a riddle I had the last time!--a riddle! A riddle was a riddle like--a riddle--a riddle a riddle.
Go! let's have a good time, now, and try to be right, you won't get left, you won't; just try to be right; you`n't a good riddle, is it? It was like a--a riddle a riddle before it got the right place; and yet, as good it gets, so is the right way, right?There it is again--a riddle a riddle; a riddle where it gets the right place first. You'd never know that as soon as that riddle got where it couldn't go again.
When you wake up, I know you. I know you, all right, and I know we got you at the right place, but how? I know how you got there; I know how you got there. I know how you got here, and I know you.

This alludes to the larger issue: the algorithm has no knowledge or real intelligence.  It's simply spitting out text that statistically matches the source text (at a word-by-word and syntax level).  It has no knowledge of what words mean.  Large sections of text contradict themselves, and gpt-2 would be none the wiser.  This can lend a whimsical, Seussian quality to the output, but for writing something rigorous or logically intact, it's no better than a Markov chain.

Ultimately, experimenting with gpt-2 can be fun, and you can certainly utilize it to generate a novel in a few hours.  The novel may even have some decent passages.  The entire thing will be nonsense.

As many others have said, the true horizon of artificial intelligence is building a framework that can support logical thought, that can digest all the pattern matching produced by the machine learning algorithms.

Here's to next year.