Me and Regex

What is regex?

Regex, short for regular expressions, is not a programming language but a notation for describing text patterns. For translators, one of its main uses is quality checking of translated texts and other documents. According to Riccardo Schiaffino, RegEx "is a search-and-replace function on steroids. Regular expressions can assist our translation work by allowing us to search, replace, and filter text in ways that would otherwise be impossible in our software tools." (https://www.ata-chronicle.online/highlights/regular-expressions-an-introduction-for-translators/)

If you are a linguist or have some affinity for languages, you will pick up regex quickly after some trial and error. :)

For us translators, regex is important because CAT tools use regular expressions for segmentation and auto-translation rules.

Below are my first attempts at creating some basic rules that can be used for Hungarian translations.

RegEx for English-to-Hungarian Translations

Example 1: Hungarian (or other) names with more than one space between them

Regular Expression: [a-záéúőóüö.](\s\s+)[A-ZÁÉÚŐÓÜÖ] 

Explanation: This regex looks for two or more whitespace characters that come after a lowercase letter or a period and before a capital letter, where the letters can be Hungarian accented characters or common Latin characters. It is designed particularly for checking Hungarian and English proper names that have two or more components. Note that extra spaces between regular (lowercase) words are not picked up, because the character after the spaces must be a capital letter.


I first checked the regular expression on regex101.com.

 

It picked up all the extra spaces between the names, regardless of whether the names had two or more elements or a period between them. (I just realized that this regex can also be used to check for extra spaces between sentences: it flags two or more spaces after a sentence-ending period when the next sentence starts with a capital letter, including the Hungarian ones in the pattern, which is super helpful and definitely broadens its usage!)


I added the regex in Trados, and with the Verify option it gave me warnings for extra spaces. (Please note that I had some formatting issues with how the texts were displayed and split into segments in Trados, but this is just another confirmation that the regular expression also picks up extra spaces between anything that ends with a lowercase letter or a period and anything that starts with a capital letter.)
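
The same check can also be tried outside a CAT tool. Below is a minimal Python sketch, assuming Python's built-in re module behaves like regex101 and Trados for this particular pattern; the function name find_extra_spaces and the sample sentence are made up for illustration and are not part of any tool.

```python
import re

# The pattern from Example 1, unchanged.
PATTERN = re.compile(r"[a-záéúőóüö.](\s\s+)[A-ZÁÉÚŐÓÜÖ]")


def find_extra_spaces(text):
    """Return (start, end) offsets of each whitespace run the pattern flags.

    Hypothetical helper: group 1 of the pattern is the run of two or
    more whitespace characters, so span(1) points at the extra spaces.
    """
    return [match.span(1) for match in PATTERN.finditer(text)]


# Made-up sample: two names and one sentence boundary with double
# spaces, plus a double space between lowercase words ("jó  barátok").
sample = "Kovács  János és Mary  Ann Smith találkoztak.  Ők jó  barátok."

for start, end in find_extra_spaces(sample):
    # Show the flagged spaces with one character of context on each side.
    print(start, end, repr(sample[start - 1:end + 1]))
```

In this sample the double spaces inside the two names and after the period are flagged, while the double space between the lowercase words is not. One thing worth checking for your own texts: the character classes as written do not list í, ű, Í, or Ű, so names that need those letters around the gap would not be flagged unless they are added.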

Example 2: Replacing English and other quotation marks with Hungarian (lower and upper) quotation marks

Regular Expression: ("|'|<|>|‘|“)(.*)("|'|<|>|’|”) 

Substitution: „$2” 

Explanation: It's common to see English upper quotation marks left in translated texts, simply because translators have no direct way to type the Hungarian ones, but the English marks are considered incorrect in Hungarian text. This expression looks for text that starts or ends with anything other than the Hungarian lower and upper quotation marks, including ", ', ‘, ’, “, ”, <, and >. The replacement makes the text start with a lower quotation mark („) and end with an upper quotation mark (”).

Note: The French quotation marks (« and ») were not included because Hungarian uses them, too.

I checked the regular expression on regex101.com. It picked up all the wrong quotation marks and left the Hungarian and French ones alone. The substitution replaced them all with Hungarian quotation marks.


In Trados, I used the Replace option and entered the regex and the substitution. Again, it picked up all the wrong quotation marks, and with Replace or Replace All I could change all of them to the Hungarian ones.
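
The replacement can be tried in a script as well. This is a minimal Python sketch under a few assumptions: Python's re module writes the second group as \g<2> where Trados and regex101 use $2, and the function name to_hungarian_quotes and the sample segments are made up for illustration.

```python
import re

# The pattern from Example 2, unchanged. Triple quotes are used only
# because the pattern itself contains both " and '.
PATTERN = re.compile(r'''("|'|<|>|‘|“)(.*)("|'|<|>|’|”)''')

# Python's spelling of the „$2” substitution used in Trados.
REPLACEMENT = r"„\g<2>”"


def to_hungarian_quotes(segment):
    """Hypothetical helper: wrap the quoted text in „ and ”."""
    return PATTERN.sub(REPLACEMENT, segment)


# Made-up sample segments.
print(to_hungarian_quotes('Azt mondta: "Jó napot".'))
print(to_hungarian_quotes("“Hello”"))
print(to_hungarian_quotes("«idézet»"))  # French marks are left alone
```

One behavior to watch for: because (.*) is greedy, a segment with two separate quotations is treated as one long quotation running from the first opening mark to the last closing mark, so the marks in the middle stay as they were. If your segments often contain more than one quotation, a lazy (.*?) is one possible variation to test.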


Have fun!

Author: Annamaria Szvoboda, October 11, 2020