The site generator is written in Rust, and uses the tokio runtime for async.
I like to think of the design pillars for the site as:
Over the next several blog posts, I’ll talk a bit about the design, as well as what I’ve learned from the process of making this site.
The binary for this tool is actually one that allows for multiple commands. This is useful because it is easy for me to add other “mains” to my program without creating a new binary each time.
Here’s the associated --help text for the binary:
usage: honyaku [<options>...] <command> [<args>...]

A static site generator for translating Japanese media.

commands:
  analyze    Analyze the provided Japanese text
  generate   Generates HTML code for the translated media
  lookup     Lookup a word in the Honyaku dictionary
  parse      Parse a provided Japanese text string (debug util)

options:
  --help     show this help documentation
  --version  show version information
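To make the “multiple mains” idea a bit more concrete, here’s a minimal sketch of what that kind of dispatch can look like. This is not honyaku’s actual code, and the per-command functions are hypothetical stand-ins; it just shows why adding a new command doesn’t require a new binary.

use std::env;
use std::process::ExitCode;

// A minimal sketch of a multi-command binary (not honyaku's actual code):
// dispatch on the first positional argument to a per-command "main".
fn main() -> ExitCode {
    let mut args = env::args().skip(1);
    match args.next().as_deref() {
        Some("analyze") => analyze_main(args),
        Some("generate") => generate_main(args),
        Some("lookup") => lookup_main(args),
        Some("parse") => parse_main(args),
        _ => {
            eprintln!("usage: honyaku [<options>...] <command> [<args>...]");
            ExitCode::FAILURE
        }
    }
}

// Hypothetical stubs standing in for the real per-command entry points.
fn analyze_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }
fn generate_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }
fn lookup_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }
fn parse_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }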
For the purpose of most of these posts, I will be talking about generate. Here is the associated --help text for this command:
usage: honyaku generate [<options>...]

Generates the static site for honyaku.space.

options:
  --help          show this help documentation
  --input=VALUE   the source root for the site configuration (required)
  --output=VALUE  the target directory to generate into (required)
  --publish       whether or not to mark pending pages as published
The binary is data-driven (for the most part), meaning that changes to how something is translated don’t require the binary to be recompiled.
The basic structure of this data is roughly like this:
.
├── articles
│   ├── index.yaml
│   └── updates
│       ├── index.yaml
│       └── 0.1.0
│           ├── index.md
│           ├── index.yaml
│           └── ...
├── manga
│   ├── index.yaml
│   └── azumangadaioh
│       ├── index.yaml
│       └── 001
│           ├── index.yaml
│           └── 001
│               ├── index.yaml
│               ├── 001
│               │   ├── index.yaml
│               │   └── comic.tiff
│               └── ...
└── ...
So it’s basically a directory tree of YAML files.
An Aside:
I’m well aware of the issues with YAML that make folks not a fan of it. I’ve employed a few strategies, which we will talk about in a later post, that make YAML workable for me.
Is it perfect? No.
But it’s easy, and for the kinds of content I have, it’s the best general structured markup language.
The YAML file instructs the generation process for how to deal with a given directory. For instance, a page like honyaku.space/manga needs to be processed as an intermediate page that links to other pages.
This is accomplished with the following YAML:
title:
  eng: "Manga"
  jpn: "漫画"
data: !Selector
The only exception is the root directory (honyaku.space), which is treated specially and therefore does not have a YAML file detailing it.
data Property
The data property is what tells honyaku what kind of a page to generate.
In YAML, ! allows you to tag a value, and this can be used to mark up data in the case of an enum that could have multiple values. So what this says is that the data for the page is of type HonyakuEntryData::Selector, and that it’s default-constructed (i.e. no custom properties).
The full data type looks like this:
#[derive(Clone,Debug,Deserialize)]
pub enum HonyakuEntryData {
    Article(HonyakuArticleData),
    Manga(HonyakuMangaData),
    Picture(HonyakuPictureData),
    Selector(HonyakuSelectorData),
    Siren(HonyakuSirenData),
    Video(HonyakuVideoData),
}
Each of the types stored in the enum contains the type’s specific properties. Here are the properties for Selector:
#[derive(Clone,Debug,Deserialize)]
#[serde(deny_unknown_fields)]
pub struct HonyakuSelectorData {
    #[serde(default)]
    pub anki: bool,
    #[serde(default)]
    pub cover: Option<String>,
}
- anki: Tells honyaku that this directory should have an Anki deck that contains all of the vocabulary for all of the words found in all of the child pages under this directory.
- cover: Allows us to override the path for the image used for the selector element in the parent. By default it looks for a path named cover.jpg. So in this case the source directory must also have a cover.jpg file.
title Property
The title property has two keys, eng and jpn. This is a Translation object, and it tells us what the original Japanese text is, as well as the manually-curated translation.
This is probably the most widely-used type in the entire project. Here, it’s used for the title of the page.
Technically-speaking, there are more possible properties on this type. But for the time being we’re going to ignore that. We’ll talk about the other properties in another blog post.
The key thing to note here is that title is not an element of data. This means that each directory we recurse into must have a title, and that a title contains both the Japanese and English text.
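To tie the YAML and the Rust types together, here’s a minimal sketch of how an index.yaml like the one above could be deserialized. The post doesn’t show the top-level entry type, so Entry and Translation here are my assumptions, and I’m guessing at a serde_yaml-style setup; the real honyaku code may differ.

use serde::Deserialize;

// Assumed shapes for illustration; only HonyakuSelectorData's fields come
// from the post, the rest is a guess at how the pieces could fit together.
#[derive(Debug, Deserialize)]
struct Translation {
    eng: String,
    jpn: String,
}

#[derive(Debug, Deserialize)]
struct Entry {
    title: Translation,
    data: HonyakuEntryData,
}

#[derive(Debug, Deserialize)]
enum HonyakuEntryData {
    Selector(HonyakuSelectorData),
    // ... the other variants from the post are elided here.
}

#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct HonyakuSelectorData {
    #[serde(default)]
    anki: bool,
    #[serde(default)]
    cover: Option<String>,
}

fn main() -> Result<(), serde_yaml::Error> {
    // The `!Selector` tag picks the enum variant; omitted fields fall back
    // to their serde defaults.
    let yaml = r#"
title:
  eng: "Manga"
  jpn: "漫画"
data: !Selector
  anki: true
"#;
    let entry: Entry = serde_yaml::from_str(yaml)?;
    println!("{entry:?}");
    Ok(())
}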
Readers with a keen eye may have noticed that the Japanese text doesn’t contain any markup, and yet somehow the page is able to display furigana (the small characters over-top of the kanji).
This is possible thanks to the JapaneseAnalyzer type: a type which takes Japanese text, and spits out structured data around the words in the text (complete with how to read them). I sound like a broken record, but again this is a topic for another time.
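Purely as a guess at the shape of that interface (the post doesn’t show it), it might look something like this:

/// A sketch (not the real honyaku type) of the structured data a
/// JapaneseAnalyzer might hand back for a run of Japanese text.
pub struct AnalyzedWord {
    pub surface: String, // the word as it appears in the text, e.g. "漫画"
    pub reading: String, // the reading used for furigana, e.g. "まんが"
}

/// The analyzer itself would wrap a dictionary/tokenizer; only the shape
/// of the call matters for this post.
pub trait JapaneseAnalysis {
    fn analyze(&self, text: &str) -> Vec<AnalyzedWord>;
}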
The Generator type helps walk the tree and generate the pages.
One of these is generated per page that we’re processing, and it holds on to all of the important state (such as where we are in the directory tree, how to write a new file, as well as collecting information from the children for things like Anki decks and Atom feed events).
It’s purposefully separated from the input data, however. It’s kind of the driver that moves things along, but it doesn’t have information about what it’s currently generating. These were kept separate so that I could programmatically recurse into a directory without a YAML file, should the need arise.
You can think of this as just a fancy type that holds common data and paths for the current directory that we are processing. It has some helper functions that make publishing content easier as well, such as:
- write_to_file: Publishes a file from the source tree to the destination tree (taking into consideration the proper paths to resolve).
- script_to_file: Similar to write_to_file, except it runs a minimizer on the script data before publishing it.
- styles_to_file: Similar to write_to_file, except it runs a minimizer on the styles data before publishing it.
- image: Publishes an image to the current destination tree from the source tree, and then returns an object to reference the image. (Images are quite complicated, so this will probably be a whole separate blog post.)
- create_html: Returns an object for writing HTML to; when the writer is finished, a minimizer is run on the HTML before pushing to disk.
- traverse_into: Constructs a new generator that clones common information from the source generator, and then traverses into the directory provided in both the source and destination trees.
As well as a few other functions for other, less-obvious actions.
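To make that list a little more concrete, here’s a stripped-down sketch of the shape of such a type. This is not honyaku’s real Generator (which also carries the dictionary, CLI flags, and the collected Anki/Atom state); it only shows the source/destination bookkeeping behind helpers like write_to_file and traverse_into.

use std::path::{Path, PathBuf};

// A stripped-down sketch, not the real Generator: just the matching
// source/destination paths that the helpers above resolve against.
#[derive(Clone)]
pub struct Generator {
    source_dir: PathBuf,
    destination_dir: PathBuf,
}

impl Generator {
    /// Publish a file from the source tree to the destination tree,
    /// resolving the same relative path on both sides.
    pub async fn write_to_file(&self, relative: &Path) -> std::io::Result<()> {
        let from = self.source_dir.join(relative);
        let to = self.destination_dir.join(relative);
        if let Some(parent) = to.parent() {
            tokio::fs::create_dir_all(parent).await?;
        }
        tokio::fs::copy(&from, &to).await?;
        Ok(())
    }

    /// Clone the common state and descend into `dir` in both trees.
    pub fn traverse_into(&self, dir: &str) -> Generator {
        Generator {
            source_dir: self.source_dir.join(dir),
            destination_dir: self.destination_dir.join(dir),
        }
    }
}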
Finally, with that all out of the way, I can show you main. This is where we start processing the home index.html (honyaku.space).
let start = Instant::now();

// Create a new site generator instance for building honyaku.
// Note: Some of these types are constructed commonly in the parent function.
// All commands use the `dictionary` and CLI `flags`, for instance.
// You only need to know roughly what these are, the names should suffice.
let mut generator = Generator::new(GeneratorConfig {
    // Categories for the Atom Feed on this page (inherited by the children).
    categories: vec![
        AtomCategory::from("Japanese"),
        AtomCategory::from("Translation"),
        AtomCategory::from("English"),
    ],
    dictionary,         // The dictionary, constructed in the parent.
    flags: cmd.flags(), // From the CLI, constructed in the parent.
})?;

// Write all of the static data (needs to be done before writing pages).
// Note: Files omitted for brevity, but this publishes images/scripts/styles.
// Basically, anything that's not data-driven and is roughly global.
tokio::try_join!(
    generator.write_to_file(...),
    generator.script_to_file(...),
    generator.styles_to_file(...),
)?;

// Process the current page as a selector.
//
// This function contains all of the common logic that most of the other similar
// `!Selector` pages use. It's called manually here to start the process. But it
// will call it automatically on child items as needed after this point.
//
// It recurses into the children and dynamically dispatches to handlers based on
// the `data` type of the content. Past this point things are data-driven.
let report = process_selector_view(
    &generator,
    /*parents=*/ Arc::new(Vec::new()),         // No parents, it's the root.
    /*notes=*/ Some(include_str!("index.md")), // The text in ノート on home.
).await?;

// Print a nice message to explain the final status.
println!();
println!("==== Site Generation Complete! ====");
println!("Generation Duration  : {:?}", start.elapsed());
println!("Published Content    : {}", report.published);
println!("Translation Successes: {}", report.ok);
println!("Translation Warnings : {}", report.warnings);
println!("Translation Errors   : {}", report.errors);
For example, when I run this today, I see:
==== Site Generation Complete! ====
Generation Duration  : 312.299619ms
Published Content    : 53
Translation Successes: 1113
Translation Warnings : 0
Translation Errors   : 0
The time (~312ms) is only possible due to the high amount of caching and general work avoidance that Honyaku does. It basically wants very much to NOT do something it’s been asked to do.
If something will definitely change, it will have to do work. Sometimes, it will do some of the work until it gets far enough to check some other data to see if it can skip the rest and throw what it’s done away.
This has the added benefit that I only need to rsync what has actually changed when I go to publish the site on the server. But the main reason for this optimization is keeping the generation times low.
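As one example of what this kind of work avoidance can look like (a sketch of a common strategy, not necessarily what Honyaku actually does): only write a file when its contents differ from what’s already on disk, so unchanged outputs keep their timestamps and rsync skips them.

use std::path::Path;

/// A sketch of a simple skip-if-unchanged check: read the existing file and
/// only write when the new contents actually differ. Unchanged outputs keep
/// their mtime, which also keeps rsync quiet later.
async fn write_if_changed(path: &Path, contents: &[u8]) -> std::io::Result<bool> {
    match tokio::fs::read(path).await {
        Ok(existing) if existing == contents => Ok(false), // nothing to do
        _ => {
            tokio::fs::write(path, contents).await?;
            Ok(true) // work was actually done
        }
    }
}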
It definitely takes longer if it finds that it has work to do. For instance, here is how long it takes if it finds that it has to publish a high-resolution image (it needs to build different thumbnail sizes and re-encode):
==== Site Generation Complete! ====
Generation Duration  : 23.332598908s
Published Content    : 53
Translation Successes: 1113
Translation Warnings : 0
Translation Errors   : 0
We’ll talk about more of these work-avoidance strategies in a later post.
So that’s an introduction to how this site works!
I’m not sure how often I’ll do these posts, but I think they’re kind of fun, and they get me to question why I have whatever design I have for the logic - because I have to explain why I have it. So I think it’s worth doing.
Here’s a brief summary of what we talked about today:
- honyaku is a binary that can generate Honyaku.
- There’s a type called Translation that allows us to define a line of translated text. Here, it’s used for the title of a page.
- Honyaku avoids doing work wherever it can, partly so I only need to rsync what changed, but mostly to make generation fast.