Five years ago, I discovered a tragic shortage of free online Chinese>Hungarian dictionaries. I soon decided that it was my mission to fill this gap. There was just one small problem I had to overcome: I don’t really speak Chinese.
Here, then, is what I learned about the translation industry as I created the Chinese-Hungarian Dictionary and Corpus (CHDICT), an online open-source, collaboratively edited Chinese>Hungarian dictionary.[1]
What Is CHDICT?
It’s any and all of the following:
- A downloadable text file with an open public license, where each line contains one Chinese word and its meaning in Hungarian.
- A website where you can look up Chinese or Hungarian words or find the meaning of hand-drawn Chinese characters.
- The translation into Hungarian of the English senses in CC-CEDICT, an established Chinese-English dictionary, and the inspiration for my own project.[2]
- An online collaboration space where people add, improve, and correct dictionary entries.
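To make the "one word per line" idea concrete, here is a minimal sketch of reading such a file. It assumes that CHDICT's data file follows the CC-CEDICT line layout (traditional form, simplified form, pinyin in square brackets, then senses separated by slashes), which this article does not spell out. The `Entry` type, the `parse_line` helper, and the sample line are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    senses: List[str]


# Assumed layout (CC-CEDICT style): Traditional Simplified [pin1 yin1] /sense 1/sense 2/
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_line(line: str) -> Optional[Entry]:
    """Parse one dictionary line; return None for blanks, comments, and malformed lines."""
    if not line.strip() or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, body = match.groups()
    # Senses are separated by slashes; drop any empty fragments.
    senses = [sense for sense in body.split("/") if sense]
    return Entry(traditional, simplified, pinyin, senses)


# Hypothetical sample line, written for this sketch rather than copied from the data file.
sample = "詞典 词典 [ci2 dian3] /szótár/"
print(parse_line(sample))
```

Because every entry lives on its own line, the whole dictionary can be loaded by running `parse_line` over the file and discarding the `None` results.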
I compiled the core portion of CHDICT between February 2015 and May 2017, by translating around 11,000 entries from the venerable CC-CEDICT and the Chinese>German Dictionary HanDeDict, another free online collaborative dictionary.[3] My main objective during this initial phase was to translate the English section of CC-CEDICT, focusing on the most frequently used Chinese words, plus those lesser-known terms that are included in the Hanyu Shuiping Kaoshi, the People’s Republic of China’s official Chinese Proficiency Test.
In a subsequent phase, I used machine learning to extract bilingual vocabulary from three million Chinese>Hungarian movie subtitles. In this way I was able to manually review and add another 4,000 entries to CHDICT in only a few months. Over the same period, the dictionary began to attract contributors, who added or amended hundreds of entries on their own.
I want to use this opportunity to share what the process of developing a dictionary for niche language combinations has taught me about professional translation.
Everything You Translate Is Another Translator’s Input
A dictionary is typically an input that you use as a translator, and not the output of your work. Translating a dictionary turns that relationship upside down. As I thought more about this, I realized that most of our output as translators is a direct input into the work of other translators. That’s clearly the case when you build a shared glossary. But even when you just commit segments into a translation memory, you’re growing a resource that another translator will probably use later. From this angle, translating a dictionary no longer seems particularly special.
And here’s the thing: every translation that’s stored digitally will eventually be vacuumed up by a machine translation (MT) engine. Certainly that will happen to every piece of content published on the internet. You can rest assured, however, that even highly confidential internal-use-only texts will find their way into the inner parameter space of a highly confidential, internal-use-only MT engine.
Clarity about Intellectual Property and Copyright Is Super Important
I couldn’t have created CHDICT, or at least not in the way that I did, were it not for CC-CEDICT’s public license. Intriguingly, that license was not always in place. The original CEDICT operated without any explicit license until 2007, when the individual running the project simply disappeared. That was a really tough spot to be in for the people who wanted to carry the torch! Things were eventually sorted out, though, and those who took over the dictionary added “CC” to highlight the now-official Creative Commons license.
But what sort of intellectual property (IP) is a dictionary? Language is a common good; nobody owns the words in its lexicon. Think of it like this: nobody owns gravity, but the physicist who writes a book about gravity does own the book’s IP.
While modern dictionary-making has always been a commercial enterprise in the U.K. and the U.S., it’s typically an academic pursuit elsewhere. In both cases, it’s clear that the dictionary stands as a work on its own, and its IP is jealously guarded by the holders of private licenses.
What is the IP in the case of a translation, whether it’s a translation of a dictionary or any other content? The fact that this question was never properly resolved represents a major disadvantage to the translation industry. The active conversations on this subject from 10–15 years ago simply subsided. And now that it’s become clear that every translation is somebody else’s input, we’re in a situation where MT services can fill millions of translation requests per day and not pay a cent to the creators of the sentence pairs on which the MT engines are trained.
A Public License Is the Best Guarantee for a Work’s Survival in the Digital Age
Before the digital revolution, humanity’s knowledge was stored in libraries in the form of physical books. Today, humanity’s knowledge is stored digitally, online.
Online digital content has very different survival patterns from printed books. If you stop maintaining your server, the content is gone (think link rot). On the other hand, copies long outlive the original (think Google never forgets). The best strategy for the long-term survival of any digital work, then, is a copy-free public license.
If I get run over by a bus, replicas of CHDICT’s data file will still be stored in many places, and eventually someone else will revive the project and republish the dictionary.
Contrast that with the thousands of out-of-print books that the publisher has no incentive to reprint. Or with a non-public digital work whose copyright holder gets run over by a bus.
I witnessed a unicorn kind of event while working on CHDICT: the open-sourcing of Taiwan’s “official” online Chinese dictionary, MOEDICT.[4] That came about as a result of Taiwan’s 2014 Sunflower Student Movement, during which young protesters upended the island’s political landscape by temporarily seizing control of the national legislature. Urged by the students occupying Parliament, the government created a digital ministry, appointed an anarchist hacker to lead it, and started accommodating regular “g0v” hackathons. One of these hackathons examined MOEDICT. The results convinced officials that the online dictionary was a public good (i.e., not a restricted-access site) and that it should be published under an open license. Now, don’t tell me dictionaries are boring stuff!
Linguists Will Become Natural Language Processing Wizards
Okay, I’m obviously a geek, and one obsessed with CAT tools. I coded my own dictionary translation environment to make translating CC-CEDICT entries extremely efficient. To see what this environment looks like, see Figure 1.
The functionality is slightly different from a regular CAT tool, but, yes, the goal is to make all the items on the left in Figure 1 turn green. And yes, the shortcut to confirm an entry is Ctrl+Enter.
I’m not suggesting that all linguists will be coding their own CAT tools in the future. But the real underlying work regarding CHDICT, particularly in the second phase of its development, was about mobilizing natural language processing techniques to tease lexical information out of heaps of bilingual and even monolingual data. The details would be tedious to recount here, but if you’re really into this stuff, I wrote an article about it.[5]
My situation was that I only knew my domain (the Chinese lexicon) superficially, so I had to do lots of research to understand what my source words meant. That’s not unlike what happens in a normal translation setting. At the outset, we’re not experts in the domain of a particular text, so we do research to make sure we get it right. Then we reproduce the text’s message and intent in the target language (and between us, we have an extremely good command of that target language). In other words, Translation = Research + Mastery of the Target Language.
Natural language processing techniques give you superpowers for the research part. It’s like your brain is augmented to access the information encoded in millions of segment pairs. Actual understanding, and artful expression, are the human privilege in this mix.
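For a feel of how word pairs can be teased out of segment pairs, here is a toy sketch. It is not the method used for CHDICT, which this article describes only as machine learning over subtitles. It swaps in a much simpler, well-known statistic, the Dice coefficient over co-occurrence counts, and it assumes the segments are already aligned and tokenized. The `dice_scores` function and the three sample segments are invented for illustration.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Score (source_word, target_word) candidates with the Dice coefficient.

    pairs: iterable of (source_tokens, target_tokens), one per aligned segment.
    Dice = 2 * segments containing both words / (segments with source + segments with target).
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in pairs:
        # Count each word at most once per segment.
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


# Hand-made, pre-tokenized toy segments (hypothetical data, not from the subtitle corpus).
segments = [
    (["我", "喝", "茶"], ["teát", "iszom"]),
    (["我", "喝", "水"], ["vizet", "iszom"]),
    (["他", "買", "茶"], ["teát", "vesz"]),
]

scores = dice_scores(segments)
candidates = [t for (s, t) in scores if s == "茶"]
best_for_tea = max(candidates, key=lambda t: scores[("茶", t)])
print(best_for_tea, scores[("茶", best_for_tea)])
```

A score like this only produces candidates. In the spirit of the paragraph above, a human still has to review each suggested pair before it becomes a dictionary entry.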
Tools Encode the Social Dynamics of Collaboration
You would normally think of a dictionary’s users as those who look up words in it. However, because CHDICT is a collaborative, open dictionary, the smaller number of users who add to it and edit the entries are equally important.
How that contribution and collaboration happens is directly driven by the website. For example, English Wikipedia is a free-for-all: anyone can make an edit with a single click, and the change is immediately published. (This method apparently works with a very large pool of contributors.) CC-CEDICT, on the other hand, uses a pretty rigid process: only registered users can submit edits and new entries. Edits and new submissions get queued up and only go live if someone from a small, closed circle of reviewers processes them. This seems to work fine for CC-CEDICT, although the dictionary’s been getting less activity in recent years.[6]
For CHDICT, I knew I could count on a much smaller pool of contributors. I chose not to add many obstacles before someone could start contributing, but there is still a mechanism in place that drives changes through a four-eyes review process for quality.
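A four-eyes rule can be expressed in very little code, which is part of why tools end up encoding social rules so easily. The sketch below is not CHDICT's actual implementation, and the article does not say whether unreviewed changes are visible to readers. It only assumes the core idea: a change counts as verified once someone other than its author has approved it. The `Change` class, its field names, and the user names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    entry: str                         # headword the change applies to
    author: str                        # user who submitted the change
    approved_by: Optional[str] = None  # second pair of eyes, once there is one

    def approve(self, reviewer: str) -> None:
        """Record an approval; the reviewer must be someone other than the author."""
        if reviewer == self.author:
            raise PermissionError("authors cannot approve their own changes")
        self.approved_by = reviewer

    @property
    def is_verified(self) -> bool:
        """True once a second person has approved the change."""
        return self.approved_by is not None


change = Change(entry="茶", author="anna")
change.approve("bela")
print(change.is_verified)
```

Loosening or tightening a single condition here, such as who may call `approve`, is the difference between the free-for-all model and the closed-circle model described above.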
The meta-story here is that the forms of engagement, authority, power relations, and overall social dynamics are all coded into the tools we use to collaborate. Since translation has become a fully collaborative online process, that gives tools such as CHDICT immense power to shape the industry’s dynamics. As such, tool developers would be well advised to wield this power wisely and ethically.
Some of the Best Things Don’t Have a Business Model
CHDICT has been online for over two years now, and I have a pretty good idea about its usage statistics. It has over 100 monthly users who spend an average of 20 minutes on the site for each visit. Roughly 2,000 words are looked up each week. These statistics are in line with my original expectations, which I based on the usage statistics of HanDeDict and on a rough estimate of Hungarian speakers who are professionally interested in Chinese.
CHDICT is open to all free of charge, so it’s not a money-maker, but that’s fine with me. Most of the best things in life don’t come with a viable business model. I only wish that, as a society, we could figure out a way to better compensate, in particular, amazing translators of truly complex texts. I mean the likes of well-known literary translators like Emily Wilson or Ken Liu, but also the hundreds of highly skilled people who translate valuable works that don’t make it to the bestseller lists.
Remember, if you have any ideas and/or suggestions regarding helpful resources or tools you would like to see featured, please e-mail Jost Zetzsche at firstname.lastname@example.org.
1. You can access the Chinese-Hungarian Dictionary and Corpus (CHDICT) at https://chdict.zydeo.net/en.
2. You can access the Chinese-English Dictionary (CC-CEDICT) at https://cc-cedict.org/editor/editor.php.
3. You can access the Chinese>German dictionary HanDeDict at https://handedict.zydeo.net/de.
4. You can access MOEDICT, a free online dictionary provided by Taiwan’s Ministry of Education, at www.moedict.tw.
5. Ugray, Gábor. “Etudes in Chinese-Hungarian Corpus-Based Lexical Acquisition,” https://chdict.zydeo.net/files/ug-mszny-2018-final.pdf.
6. Ugray, Gábor. “CC-CEDICT Contributions Follow Zipf’s Law,” http://bit.ly/CC-CEDICT-contributions.
Gábor Ugray is co-founder of memoQ, a leading collaborative translation environment and translation management system. He is now memoQ’s head of innovation. When he is not busy testing new product ideas, he blogs at jealousmarkup.xyz and tweets as @twilliability. He developed CHDICT, an open-source, collaboratively edited Chinese>Hungarian dictionary and corpus.