Cartas a la Familia: A Lesson in Internationalization

Until late 2019, the Center for Digital Research in the Humanities (CDRH) had no multilingual sites. Although we have created and maintain dozens of sites with content spanning disciplines, the only projects that even came close were the Omaha & Ponca Digital Dictionary and The Good Person: Excerpts from the Yorùbá Proverb Treasury. Though both of these have content in multiple languages, the overall websites (navigation, about pages, etc.) are English only. With the launch of Cartas a la Familia, or Family Letters, the CDRH has taken its first step toward a truly multilingual site with a project that is in both Spanish and English. We are excited that we have built some of the infrastructure needed to create more multilingual sites, and we hope that we have the opportunity to create such projects. If we learned anything from the process of creating Cartas a la Familia, it’s that the hardest part of a multilingual site is not the technology or the language; it’s us. More on that later, but first, the technology!

The landing page for Cartas a la Familia

The Technical Side

Cartas a la Familia was the product of a collaboration between Dr. Isabel Velázquez from UNL’s Department of Modern Languages and Literatures, the Shanahan family of Davey, NE, the Center for Digital Research in the Humanities, and nearly a dozen graduate and undergraduate students. The digital project contains letters spanning decades between family members corresponding between Mexico, Colorado, and Nebraska, photographs from the family’s collection, teaching tools, and analysis about the experience of immigrating to a new community.

It was critical that this project be available in both Spanish and English. Students were already hard at work transcribing and translating the letters so that each would be readable in both languages. Meanwhile, the CDRH dev team was contemplating a few questions:

1. Given that the website would be powered by existing CDRH software, Orchid, how could we alter it to support multiple languages?
2. How could we use our CDRH API and schema, designed for metadata in English, to provide search capabilities?

Certainly, the technologies behind the scenes for question 1 (Ruby on Rails) and 2 (Elasticsearch) have the ability to accommodate many languages, so it was mostly a question of how we would take advantage of those abilities while keeping in mind our own self-imposed limitations for things like schema complexity and number of fields. Let’s look at the questions more closely.

Ruby on Rails: Orchid

Several years ago, the dev team at the CDRH recognized the need to create some sort of template for websites based around images or documents. Many of our websites had the same needs: searching, browsing, and general concerns like responsive design and accessibility. In response, we created Orchid, a Ruby on Rails engine that spins up a new site connected to our API, provides a customizable search and browse interface, and sets up a basic URL structure. We use it to power most of our modern sites, such as The Willa Cather Archive. The great news for us in terms of supporting multiple languages is that Ruby on Rails offers internationalization (I18n) features.

[Screenshot: search results on the Willa Cather Archive, with sidebar, individual results, pagination, and sorting.] Nearly all of the elements of the search on the Willa Cather Archive, seen here, are provided by Orchid as default functionality. This means that new projects do not need to worry about pagination, filtering, sorting, and basic display.

Rails’ I18n features are pretty straightforward. Instead of writing words in some HTML (for example, <h1>Title</h1>), you use a reference to a place to find the words (<h1><%= t "header.title" %></h1>). Then you just have to provide what to display there for each language. For example, instead of writing “Alphabetically (A-Z)” on the sort selection button, now the code there reads search.sort.alpha_asc and the appropriate line is selected from either the corresponding English or Spanish mapping, depending on the website’s selected language. Here is a snippet from en.yml and es.yml which provides the reference for search.sort.alpha_asc:

en:
  search:
    sort:
      alpha_asc: "Alphabetically (A-Z)"

es:
  search:
    sort:
      alpha_asc: "Por orden alfabético (A-Z)"
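
To make the lookup concrete, here is a minimal sketch in plain Ruby that mimics how a dotted key resolves against the two locale files. It is an illustration, not how Rails implements I18n internally; the lookup helper is hypothetical, and in a Rails view you would simply call t as shown earlier.

```ruby
require "yaml"

# Locale data mirroring the en.yml and es.yml snippets above.
LOCALES = YAML.safe_load(<<~YAML)
  en:
    search:
      sort:
        alpha_asc: "Alphabetically (A-Z)"
  es:
    search:
      sort:
        alpha_asc: "Por orden alfabético (A-Z)"
YAML

# Hypothetical helper: split the dotted key and walk the nested hash
# for the given locale. Returns nil when a key along the path is missing.
def lookup(locale, key)
  LOCALES.dig(locale.to_s, *key.split("."))
end

puts lookup(:en, "search.sort.alpha_asc")
puts lookup(:es, "search.sort.alpha_asc")
```

The same key is used everywhere in the code; only the locale passed in decides which string comes back.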

It didn’t take terribly long for us to go through all the existing text and labels and organize them into an English reference file. Then a student translated them into Spanish to create a matching file. We wrote a little bit of code to select the default locale and provide toggling between English and Spanish and TADA! With the click of a button the titles, navigation, and other basic elements of the site displayed in English or Spanish, depending on your language selection.
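
The locale selection and toggle logic can be sketched roughly as follows. This is not the code from Cartas a la Familia: the helper names are hypothetical, and treating Spanish as the default is an assumption made for illustration.

```ruby
# Locales the site supports. The default is an assumption for this sketch.
AVAILABLE_LOCALES = %w[es en].freeze
DEFAULT_LOCALE = "es"

# Hypothetical helper: use the requested locale when the site supports it,
# otherwise fall back to the default (covers nil and unknown values).
def choose_locale(requested)
  candidate = requested.to_s.downcase
  AVAILABLE_LOCALES.include?(candidate) ? candidate : DEFAULT_LOCALE
end

# Hypothetical helper: the language toggle offers whichever supported
# locale is not currently active.
def toggle_locale(current)
  (AVAILABLE_LOCALES - [choose_locale(current)]).first
end

puts choose_locale("EN")
puts choose_locale(nil)
puts toggle_locale("es")
```

In a Rails application, the chosen value would typically be assigned to I18n.locale in a controller callback for each request.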

The label and text replacement method described above is less ideal for longer texts or anything that has dynamic behavior (such as pulling in search results). Once again, for analysis and content sections, Rails’ I18n saved the day with templating. A typical page in our application might be about.html.erb, where the extensions mean: first resolve the “erb” (embedded Ruby) markup in the file, then read the result as HTML. With I18n hooked up, Rails will first look for a file that specifies the requested language. Family Letters uses about.es.html.erb and about.en.html.erb, two files that should be functionally the same except for the language of the text.
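
The lookup order for localized templates can be approximated with a few lines of plain Ruby. This is a simplified sketch: the real Rails resolver also takes formats, handlers, and variants into account, and resolve_template is a hypothetical helper rather than a Rails method.

```ruby
# Hypothetical helper: given a page name, a locale, and the template files
# that exist, prefer the locale-specific file and fall back to the generic
# one. Returns nil when neither file exists.
def resolve_template(page, locale, available_files)
  candidates = ["#{page}.#{locale}.html.erb", "#{page}.html.erb"]
  candidates.find { |name| available_files.include?(name) }
end

# Illustrative file list; index.html.erb stands in for a page that has
# no localized variants.
FILES = ["about.en.html.erb", "about.es.html.erb", "index.html.erb"].freeze

puts resolve_template("about", "es", FILES)
puts resolve_template("index", "es", FILES)
```

Pages that have been translated get a file per language, while anything without a localized variant keeps working through the fallback.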

We did a little bit of work to customize the website from Orchid’s default so that it could send a search request in English, Spanish, or both languages, but then it was time to think more about the API. How would we actually search in either language?

Elasticsearch: The CDRH API

Elasticsearch is the search engine that powers our CDRH API, which searches and filters document content such as titles, dates, text, associated people, and more. Much like Ruby on Rails, Elasticsearch provides some out-of-the-box language options. Elasticsearch is great at bringing back relevant results even when they are not exact hits. For example, if you search for “fish,” Elasticsearch will also bring back results for “fishes,” “fishing,” “fished,” and other versions of the word through something known as “stemming.” Elasticsearch lets you specify a language analyzer for a text field. We ended up adding a field to our API schema that analyzes Spanish language text so that it would have similar searching capabilities, in terms of stemming and other tools, as the existing (English) text field. So far so good!
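
As a rough illustration, an index mapping with one English-analyzed and one Spanish-analyzed text field might look like the sketch below, built here as a Ruby hash and printed as JSON. The field names text and text_es are hypothetical, since the actual CDRH API schema is not shown in this post. The "english" and "spanish" values are Elasticsearch's built-in language analyzers, and the overall shape assumes a recent Elasticsearch version without mapping types.

```ruby
require "json"

# Hypothetical field names; the real CDRH API schema may differ.
# Each text field is paired with a built-in language analyzer so that
# stemming follows the rules of that language.
MAPPING = {
  mappings: {
    properties: {
      text:    { type: "text", analyzer: "english" },
      text_es: { type: "text", analyzer: "spanish" }
    }
  }
}.freeze

# Print the JSON body that would be sent when creating the index.
puts JSON.pretty_generate(MAPPING)
```

A search in Spanish would then target the Spanish-analyzed field, a search in English the English one, and a search in both languages would query both fields.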

The next consideration was how to represent terms like people’s names, category, format, etc. Currently, our API schema has fields to represent that information, and they are generally treated as exact-match fields. You might use one as a filter, for instance “return any documents where the category is ‘explore’.” The tricky part is that a particular keyword for a document is a single concept represented by words or phrases in two languages, so how do we display that to the user?

One possibility is to use two fields to represent the same information. For example, you could have category_en with “explore” and category_es with “explora.” This is, essentially, duplication and feels dangerous from a programmer’s and metadata enthusiast’s perspective of making sure data is maintainable. Additionally, it complicates the logic Orchid needs to use to request results from a particular keyword, since it would change depending on the current language. After a few minutes of thinking, more problems emerge. What if a user has selected categoría = explora but then switches the site language to English? Though Orchid and the API would probably be able to figure out that categoría is asking for category_en or category_es, neither would have any way of knowing what the appropriate English equivalent of explora is to return the same results for the site’s updated language selection. No, using two fields for the same information (albeit in different languages) didn’t seem like the right solution.

The option we ultimately pursued was to keep the API entirely in English and translate using the same types of locale files in Orchid as mentioned above. Now in the Rails app, we have translations for the filter and browse pages that look something like this:

subcategory:
    Document: Documento
    Envelope: Sobre
    Letter: Carta
    Note: Nota Manuscrita

This isn’t an ideal system. For one thing, it is assuming English as the default language and then translating to Spanish, even though the site’s actual primary language is Spanish. For another, problems arise if there is one word we would use to describe something that might be described in several ways in Spanish, because the one-to-one nature of the above translations does not support something like “Document” mapping to multiple terms in Spanish. Fortunately, that didn’t end up being an issue for this project, but it could very easily become an issue for a future project.
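
The approach can be sketched in plain Ruby as follows. The mapping mirrors the subcategory translations above, while display_label, filter_param, and the shape of the filter hash are hypothetical names invented for this illustration, not Orchid's actual code.

```ruby
# English API value => Spanish display label, as in the locale snippet above.
SUBCATEGORY_ES = {
  "Document" => "Documento",
  "Envelope" => "Sobre",
  "Letter"   => "Carta",
  "Note"     => "Nota Manuscrita"
}.freeze

# Hypothetical helper: the label shown to the user for an English API value.
# Falls back to the English value when no Spanish translation exists.
def display_label(api_value, locale)
  return api_value unless locale.to_s == "es"
  SUBCATEGORY_ES.fetch(api_value, api_value)
end

# Hypothetical helper: the filter sent to the API always carries the English
# value, so switching the site language changes the label but not the query.
def filter_param(api_value)
  { "subcategory" => api_value }
end

puts display_label("Letter", :es)
puts display_label("Letter", :en)
puts filter_param("Letter")
```

Because each English key holds exactly one Spanish string, the one-to-one limitation described above is visible in the data structure itself: supporting several Spanish terms for one English value would require a different shape.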

There may be other options available to us, as well, which will require further research and possibly larger changes to our API structure. At the time that we were working on Cartas a la Familia, we unfortunately didn’t have time to step back and really reconsider our structures from the ground up, but we have been applying for grants which would allow us to do so. Assuming we get the funding and personnel, comparing solutions other institutions have used for faceting keywords in multiple languages and determining the best course for ourselves is going to be one of the big tickets on the TODO list.

The Real Problem

Of course, multiple languages will always present some new obstacles when creating a site, regardless of your current infrastructure and team. If you want to make changes to content in one place, you’re going to have to make sure that change is translated wherever it appears. Even if programmers have adequate language skills to identify and make those types of edits, it takes a highly skilled individual to craft the translations. These are workflow problems, but at least for our team more issues are introduced when moving away from languages prevalent in our region of the world, like Spanish. I’m afraid that though I would feel confident enough with several European languages to make edits and identify incorrect captions or search result problems, I couldn’t handle common languages like Arabic or Mandarin. I would have a similar problem with Omaha-Ponca, with the additional difficulty that I doubt Elasticsearch provides search stemming in that language.

The above language difficulties, however, still dance around our biggest problem while developing Cartas a la Familia. This blog post was spurred by a discussion on Twitter about the lack of multilingual digital humanities projects in the United States, and the common excuses used to explain the deficit and avoid creating more. While discussing our work on Cartas a la Familia, Karin (@nirak), the CDRH designer and development manager, reflected: “The technical part isn’t actually all that hard, it turns out.” And I would agree. Most of the technology we use supports dozens of languages; the only reason we weren’t ready for a multi-language site was that we hadn’t already been using those features. And why not? Our real problem is that our default assumption is that we’re building English language sites and tools for English language users, and that non-English languages are an edge case. Even this blog post itself is written in English, assuming an English audience!

So here’s the part where I can roll out excuses of my own about why our infrastructure itself wasn’t built out of the box to support multiple languages, many of which are pretty damning: we rarely have had projects where languages come up (why not?), none of our tech team is fluent in a second language (maybe we should be / maybe our team should be different), we are doing our best to juggle a lot of things with a small team and short time (but languages weren’t a priority, in that case)…no matter which excuse we pick, it’s a problem.

The good news is, thanks to Dr. Velázquez and Cartas a la Familia, the CDRH development team is starting to take steps to address this deficit. Language considerations are going to be central if we are given the opportunity to rethink our publishing infrastructure from the ground up with a grant. Even if we can’t rethink it entirely, going forward we are giving a lot more thought to which changes will advance or inhibit the language support we’ve already added to Orchid. In a few years, Dr. Ng’ang’a Wahu-Muchiri’s The Ardhi Initiative will likely test the CDRH’s infrastructure with multiple African and European languages, both in terms of documents which need to be searchable with multiple translations, and potentially in terms of the overall site languages. Changing the overall composition of a small tech team is a long-term goal, one meant to ensure that assumptions like the one we made of “English site, English contents, English audience” are not continually repeated. Cartas a la Familia was a good lesson in my own shortsightedness. I look forward to tackling more projects spanning multiple regions / cultures, and to continuing my work with more thoughtfulness regarding how my design choices can support the global DH community from the beginning.