The site generator is written in Rust, and uses the tokio runtime for async.
I like to think of the design pillars for the site as:
Over the next several blog posts, I’ll talk a bit about the design, as well as what I’ve learned from the process of making this site.
The binary for this tool is actually one that allows for multiple commands. This is useful because it is easy for me to add other “mains” to my program without creating a new binary each time.
Here’s the associated --help text for the binary:
usage: honyaku [<options>...] <command> [<args>...]

A static site generator for translating Japanese media.

commands:
  analyze    Analyze the provided Japanese text
  generate   Generates HTML code for the translated media
  lookup     Lookup a word in the Honyaku dictionary
  parse      Parse a provided Japanese text string (debug util)

options:
  --help     show this help documentation
  --version  show version information
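To make the “multiple mains” idea a bit more concrete, here’s a minimal sketch of what that kind of dispatch can look like. This is not honyaku’s actual code, and the per-command functions are hypothetical stand-ins; it just shows why adding a new command doesn’t require a new binary.

use std::env;
use std::process::ExitCode;

// A minimal sketch of a multi-command binary (not honyaku's actual code):
// dispatch on the first positional argument to a per-command "main".
fn main() -> ExitCode {
    let mut args = env::args().skip(1);
    match args.next().as_deref() {
        Some("analyze") => analyze_main(args),
        Some("generate") => generate_main(args),
        Some("lookup") => lookup_main(args),
        Some("parse") => parse_main(args),
        _ => {
            eprintln!("usage: honyaku [<options>...] <command> [<args>...]");
            ExitCode::FAILURE
        }
    }
}

// Hypothetical stubs standing in for the real per-command entry points.
fn analyze_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }
fn generate_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }
fn lookup_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }
fn parse_main(_args: impl Iterator<Item = String>) -> ExitCode { ExitCode::SUCCESS }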
For the purpose of most of these posts, I will be talking about generate. Here is the associated --help text for this command:
usage: honyaku generate [<options>...]

Generates the static site for honyaku.space.

options:
  --help          show this help documentation
  --input=VALUE   the source root for the site configuration (required)
  --output=VALUE  the target directory to generate into (required)
  --publish       whether or not to mark pending pages as published
The binary is data-driven (for the most part), meaning that changes to how something is translated don’t require the binary to be recompiled.
The basic structure of this data is roughly like this:
.
├── articles
│   ├── index.yaml
│   └── updates
│       ├── index.yaml
│       └── 0.1.0
│           ├── index.md
│           ├── index.yaml
│           └── ...
├── manga
│   ├── index.yaml
│   └── azumangadaioh
│       ├── index.yaml
│       └── 001
│           ├── index.yaml
│           └── 001
│               ├── index.yaml
│               ├── 001
│               │   ├── index.yaml
│               │   └── comic.tiff
│               └── ...
└── ...
So it’s basically a directory tree of YAML files.
An Aside:
I’m well aware of the issues with YAML that make folks not a fan of it. I’ve employed a few strategies, which we will talk about in a later post, that make YAML workable for me.
Is it perfect? No.
But it’s easy, and for the kinds of content I have, it’s the best general structured markup language.
The YAML file instructs the generation process for how to deal with a given directory. For instance, a page like honyaku.space/manga needs to be processed as an intermediate page that links to other pages.
This is accomplished with the following YAML:
title:
  eng: "Manga"
  jpn: "漫画"
data: !Selector
The only exception is the root directory (honyaku.space), which is treated specially and therefore does not have a YAML file detailing it.
data Property
The data property is what tells honyaku what kind of a page to generate.
In YAML, ! allows you to tag a value, and this can be used to mark up data in the case of an enum that could have multiple values. So what this says is that the data for the page is of type HonyakuEntryData::Selector, and that it’s default-constructed (i.e. no custom properties).
The full data type looks like this:
#[derive(Clone,Debug,Deserialize)]
pub enum HonyakuEntryData {
    Article(HonyakuArticleData),
    Manga(HonyakuMangaData),
    Picture(HonyakuPictureData),
    Selector(HonyakuSelectorData),
    Siren(HonyakuSirenData),
    Video(HonyakuVideoData),
}
Each of the types stored in the enum contains the type’s specific properties. Here are the properties for Selector:
#[derive(Clone,Debug,Deserialize)]
#[serde(deny_unknown_fields)]
pub struct HonyakuSelectorData {
    #[serde(default)]
    pub anki: bool,
    #[serde(default)]
    pub cover: Option<String>,
}
- anki: Tells honyaku that this directory should have an Anki deck that contains all of the vocabulary for all of the words found in all of the child pages under this directory.
- cover: Allows us to override the path for the image used for the selector element in the parent. By default it looks for a path named cover.jpg. So in this case the source directory must also have a cover.jpg file.
title Property
The title property has two keys, eng and jpn. This is a Translation object, and it tells us what the original Japanese text is, as well as the manually-curated translation.
This is probably the most widely-used type in the entire project. Here, it’s used for the title of the page.
Technically-speaking, there are more possible properties on this type. But for the time being we’re going to ignore that. We’ll talk about the other properties in another blog post.
The key thing to note here is that title is not an element of data. This means that each directory we recurse into must have a title, and that a title contains both the Japanese and English text.
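To tie the YAML and the Rust types together, here’s a minimal sketch of how an index.yaml like the one above could be deserialized. The post doesn’t show the top-level entry type, so Entry and Translation here are my assumptions, and I’m guessing at a serde_yaml-style setup; the real honyaku code may differ.

use serde::Deserialize;

// Assumed shapes for illustration; only HonyakuSelectorData's fields come
// from the post, the rest is a guess at how the pieces could fit together.
#[derive(Debug, Deserialize)]
struct Translation {
    eng: String,
    jpn: String,
}

#[derive(Debug, Deserialize)]
struct Entry {
    title: Translation,
    data: HonyakuEntryData,
}

#[derive(Debug, Deserialize)]
enum HonyakuEntryData {
    Selector(HonyakuSelectorData),
    // ... the other variants from the post are elided here.
}

#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct HonyakuSelectorData {
    #[serde(default)]
    anki: bool,
    #[serde(default)]
    cover: Option<String>,
}

fn main() -> Result<(), serde_yaml::Error> {
    // The `!Selector` tag picks the enum variant; omitted fields fall back
    // to their serde defaults.
    let yaml = r#"
title:
  eng: "Manga"
  jpn: "漫画"
data: !Selector
  anki: true
"#;
    let entry: Entry = serde_yaml::from_str(yaml)?;
    println!("{entry:?}");
    Ok(())
}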
Readers with a keen eye may have noticed that the Japanese text doesn’t contain any markup, and yet somehow the page is able to display furigana (the small characters over-top of the kanji).
This is possible thanks to the JapaneseAnalyzer type: a type which takes Japanese text, and spits out structured data around the words in the text (complete with how to read them). I sound like a broken record, but again this is a topic for another time.
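Purely as a guess at the shape of that interface (the post doesn’t show it), it might look something like this:

/// A sketch (not the real honyaku type) of the structured data a
/// JapaneseAnalyzer might hand back for a run of Japanese text.
pub struct AnalyzedWord {
    pub surface: String, // the word as it appears in the text, e.g. "漫画"
    pub reading: String, // the reading used for furigana, e.g. "まんが"
}

/// The analyzer itself would wrap a dictionary/tokenizer; only the shape
/// of the call matters for this post.
pub trait JapaneseAnalysis {
    fn analyze(&self, text: &str) -> Vec<AnalyzedWord>;
}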
The Generator type helps walk the tree and generate the pages.
One of these is generated per page that we’re processing, and it holds on to all of the important state (such as where we are in the directory tree, how to write a new file, as well as collecting information from the children for things like Anki decks and Atom feed events).
It’s purposefully separated from the input data, however. It’s kind of the driver that moves things along, but it doesn’t have information about what it’s currently generating. These were kept separate so that I could programmatically recurse into a directory without a YAML file, should the need arise.
You can think of this as just a fancy type that holds common data and paths for the current directory that we are processing. It has some helper functions that make publishing content easier as well, such as:
- write_to_file: Publishes a file from the source tree to the destination tree (taking into consideration the proper paths to resolve).
- script_to_file: Similar to write_to_file, except it runs a minimizer on the script data before publishing it.
- styles_to_file: Similar to write_to_file, except it runs a minimizer on the styles data before publishing it.
- image: Publishes an image to the current destination tree from the source tree, and then returns an object to reference the image. (Images are quite complicated, so this will probably be a whole separate blog post.)
- create_html: Returns an object for writing HTML to; when the writer is finished, a minimizer is run on the HTML before pushing to disk.
- traverse_into: Constructs a new generator that clones common information from the source generator, and then traverses into the directory provided in both the source and destination trees.
As well as a few other functions for other, less-obvious actions.
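To make that list a little more concrete, here’s a stripped-down sketch of the shape of such a type. This is not honyaku’s real Generator (which also carries the dictionary, CLI flags, and the collected Anki/Atom state); it only shows the source/destination bookkeeping behind helpers like write_to_file and traverse_into.

use std::path::{Path, PathBuf};

// A stripped-down sketch, not the real Generator: just the matching
// source/destination paths that the helpers above resolve against.
#[derive(Clone)]
pub struct Generator {
    source_dir: PathBuf,
    destination_dir: PathBuf,
}

impl Generator {
    /// Publish a file from the source tree to the destination tree,
    /// resolving the same relative path on both sides.
    pub async fn write_to_file(&self, relative: &Path) -> std::io::Result<()> {
        let from = self.source_dir.join(relative);
        let to = self.destination_dir.join(relative);
        if let Some(parent) = to.parent() {
            tokio::fs::create_dir_all(parent).await?;
        }
        tokio::fs::copy(&from, &to).await?;
        Ok(())
    }

    /// Clone the common state and descend into `dir` in both trees.
    pub fn traverse_into(&self, dir: &str) -> Generator {
        Generator {
            source_dir: self.source_dir.join(dir),
            destination_dir: self.destination_dir.join(dir),
        }
    }
}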
Finally, with that all out of the way, I can show you main. This is where we start processing the home index.html (honyaku.space).
let start = Instant::now();

// Create a new site generator instance for building honyaku.
// Note: Some of these types are constructed commonly in the parent function.
// All commands use the `dictionary` and CLI `flags`, for instance.
// You only need to know roughly what these are, the names should suffice.
let mut generator = Generator::new(GeneratorConfig {
    // Categories for the Atom Feed on this page (inherited by the children).
    categories: vec![
        AtomCategory::from("Japanese"),
        AtomCategory::from("Translation"),
        AtomCategory::from("English"),
    ],
    dictionary,         // The dictionary, constructed in the parent.
    flags: cmd.flags(), // From the CLI, constructed in the parent.
})?;

// Write all of the static data (needs to be done before writing pages).
// Note: Files omitted for brevity, but this publishes images/scripts/styles.
// Basically, anything that's not data-driven and is roughly global.
tokio::try_join!(
    generator.write_to_file(...),
    generator.script_to_file(...),
    generator.styles_to_file(...),
)?;

// Process the current page as a selector.
//
// This function contains all of the common logic that most of the other similar
// `!Selector` pages use. It's called manually here to start the process. But it
// will call it automatically on child items as needed after this point.
//
// It recurses into the children and dynamically dispatches to handlers based on
// the `data` type of the content. Past this point things are data-driven.
let report = process_selector_view(
    &generator,
    /*parents=*/ Arc::new(Vec::new()),         // No parents, it's the root.
    /*notes=*/ Some(include_str!("index.md")), // The text in ノート on home.
).await?;

// Print a nice message to explain the final status.
println!();
println!("==== Site Generation Complete! ====");
println!("Generation Duration  : {:?}", start.elapsed());
println!("Published Content    : {}", report.published);
println!("Translation Successes: {}", report.ok);
println!("Translation Warnings : {}", report.warnings);
println!("Translation Errors   : {}", report.errors);
For example, when I run this today, I see:
==== Site Generation Complete! ====
Generation Duration  : 312.299619ms
Published Content    : 53
Translation Successes: 1113
Translation Warnings : 0
Translation Errors   : 0
The time (~312ms) is only possible due to the high amount of caching and general work avoidance that Honyaku does. It basically wants very much to NOT do something it’s been asked to do.
If something will definitely change, it will have to do work. Sometimes, it will do some of the work until it gets far enough to check some other data to see if it can skip the rest and throw what it’s done away.
This has the added benefit that I only need to rsync what has actually changed when I go to publish the site on the server. But the main reason for this optimization is keeping the generation times low.
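As one example of what this kind of work avoidance can look like (a sketch of a common strategy, not necessarily what Honyaku actually does): only write a file when its contents differ from what’s already on disk, so unchanged outputs keep their timestamps and rsync skips them.

use std::path::Path;

/// A sketch of a simple skip-if-unchanged check: read the existing file and
/// only write when the new contents actually differ. Unchanged outputs keep
/// their mtime, which also keeps rsync quiet later.
async fn write_if_changed(path: &Path, contents: &[u8]) -> std::io::Result<bool> {
    match tokio::fs::read(path).await {
        Ok(existing) if existing == contents => Ok(false), // nothing to do
        _ => {
            tokio::fs::write(path, contents).await?;
            Ok(true) // work was actually done
        }
    }
}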
It definitely takes longer if it finds that it has work to do. For instance, here is how long it takes if it finds that it has to publish a high-resolution image (it needs to build different thumbnail sizes and re-encode):
==== Site Generation Complete! ====
Generation Duration  : 23.332598908s
Published Content    : 53
Translation Successes: 1113
Translation Warnings : 0
Translation Errors   : 0
We’ll talk about more of these work-avoidance strategies in a later post.
So that’s an introduction to how this site works!
I’m not sure how often I’ll do these posts, but I think they’re kind of fun, and they get me to question why I have whatever design I have for the logic - because I have to explain why I have it. So I think it’s worth doing.
Here’s a brief summary of what we talked about today:
- honyaku is a binary that can generate Honyaku.
- There’s a type called Translation that allows us to define a line of translated text. Here, it’s used for the title of a page.
- Honyaku avoids doing work wherever it can, partly so I only need to rsync what changed, but mostly to make generation fast.