The role of CAT tools in Patent Translation
Degree thesis in CAT Laboratory
Serena CUSCIANNA
Università del Salento, A.A. 2018/2019
3. COMPUTER-AIDED TRANSLATION
The advent of computational linguistics marked an important milestone in the evolution of translation activity. The combination of language and information technology led to the introduction of software tools that could be used by a professional translator. This chapter introduces Computer-Aided Translation (CAT) and the core components of the software that enables its implementation, reducing translation time and costs. Finally, a general overview of the software currently available on the market is presented, focusing on SDL Trados Studio, as it was the CAT tool I used during my internship experience.
3.1 CAT tools: definition, advantages, and drawbacks
Before delving deeper into the concept of Computer-Aided Translation, it is necessary to shed some light on the terminology. It is possible to develop a taxonomy (Naldi, 2014, pp. 8-9) based on the amount of work done by the computer, from a higher degree to a lower degree of mechanization:
- Fully Automated Machine Translation (FAMT): the translation is carried out entirely by the computer without any human support. However, human support may be required during the final result review phase (i.e., in the post-editing phase);
- Human Aided Machine Translation (HAMT): the translation is performed automatically by the computer but the human contribution is fundamental both for the preparation of the machine translation task (pre-editing) and for the final revision of the result (post-editing and proofreading);
- Machine Aided Human Translation (MAHT): translators carry out both translation and supervision, but are assisted in both phases by appropriate tools;
- Human Translation (HT): translators perform all the tasks (translation, supervision, and proofreading).
The conceptually important difference between MAHT and HAMT is, however, increasingly subtle from a practical point of view, as many software tools classified as MAHT suggest a translation (i.e., they act as machine translation mechanisms) if the translation memory does not contain a usable term or expression. In both cases, these tools are most useful for documents containing several repetitive structures and a stable and recurrent terminology. Typically, this is the case for technical documents, e.g. patents. Therefore, CAT is to be considered suited to technical translation rather than to literary translation (Naldi, 2014, p. 10). Hence, CAT lies between MAHT and HAMT. However, MAHT systems represent the majority of CAT systems. This is why, for the sake of simplicity, it is often said that CAT corresponds to MAHT.
Over the years, several scholars have tried to define a CAT tool. Almost thirty years ago, Sager (1994, p. 326) defined it as “a translation strategy whereby translators use computer programs to perform part of the process of translation”. At the dawn of the new century, Craciunescu, Gerding-Salas, & Stringer-O’Keeffe (2004, p. 7) tried to give a more detailed explanation:
In practice, computer-assisted translation is a complex process involving specific tools and technology adaptable to the needs of the translator, who is involved in the whole process and not just in the editing stage. The computer becomes a workstation where the translator has access to a variety of texts, tools, and programs: for example, monolingual and bilingual dictionaries, parallel texts, translated texts in a variety of source and target languages, and terminology databases. Each translator can create a personal work environment and transform it according to the needs of the specific task. Thus computer-assisted translation gives the translator on-the-spot flexibility and freedom of movement, together with immediate access to an astonishing range of up-to-date information. The result is an enormous saving of time.
With technological development, definitions have also become more and more specific. For example, according to Volk & Jekat (2010), a CAT tool is a piece of software that allows the translator to create, use and maintain multilingual lexicon databases and text databases.
The ideal target audience (TA) for CAT tools encompasses professional translators (ranging from freelancers to employees of large companies) and translation agencies, or Language Service Providers (LSPs). However, why is it particularly advantageous for a translator to use CAT tools?
CAT tools increase the consistency and quality of translations, thus avoiding multiple translations for the same text in the source language (SL). A CAT application independently manages the formatting and layout of the majority of electronic file formats, extracting only the textual content, on which the translator can focus all his or her attention. Moreover, with a CAT tool, it is possible to manage the different translation workflow phases more efficiently and accurately, allowing translators, reviewers, proofreaders and project managers to collaborate profitably. One can work on many texts and translation projects at the same time, either independently or in collaboration with other professionals (if this feature is provided by the CAT tool). The intrinsic characteristics of a CAT tool also make it possible for past translations to become an important asset for future use. For example, by organizing projects by domain or client, a higher level of consistency in the work will be achieved. As a result, the advantages of CAT systems may be summarized in three points:
- Increased productivity: CAT tools speed up translation activity by retrieving portions of text already translated and interlingual terminology equivalences already identified. In this way, the translator’s activity is limited, in most cases, to checking the suggestion obtained, without reformulating or retyping the translation (Lecci & Di Bello, 2012);
- Increased quality: the system proposes the use of the same translation for identical portions of text, ensuring consistency and higher quality at the phraseological and terminological level, both within the same document and between projects. Nevertheless, it is still important to consider the context to determine whether it is necessary to reject or rework the suggestions received by the software (Lecci & Di Bello, 2012);
- Increased earning potential, as a result of the aforementioned aspects: translation agencies tend to demand considerable price reductions when a translation is processed using a CAT tool. Yet, as Vallianatou (2005, p. 5) rightly points out, «Price reductions, when requested, are indeed annoying, but if the reduction is within acceptable limits, the productivity enhancement is not canceled by the lower rate». Completing a translation ahead of schedule means being able to accept more jobs and, thus, potentially earn more money.
Although CAT systems have so far been described in terms of positive features, thus shaping them as the object of desire for a translator, they also have limitations. Firstly, with non-scientific texts, the translator may require less stringency in the association of two portions of text, due to the flexibility typical of literary translations. Besides, only electronic documents can be used with CAT tools. Any hardcopy source texts must first be scanned and processed with appropriate Optical Character Recognition (OCR) programs. Other drawbacks are the difficulty of translating text contained in images (which can be handled directly on vector images with advanced CAT tools, or otherwise must be processed in advance with OCR software), the incompatibility with some digital formats (from a few to several formats, depending on the adopted CAT tool), the impossibility of working on protected files, and the impossibility of detecting and reporting language mistakes in source documents.
3.2 Core components
Over the years, CAT software has become more and more sophisticated, gradually evolving into powerful tools that are now used by the majority of LSPs. The main subsystems constituting a CAT tool are:
- one or more Translation Memories (TMs): the core element of every CAT tool. A TM is a bilingual database in which translations are typically stored as sentence pairs, technically named “segments”;
- one or more Terminology Databases, also known as Term Bases (TBs) or glossaries: multilingual databases in which terminology is stored and managed;
- aligner: component able to create a TM from a previously translated source text and its corresponding target text, so that the resulting alignment pairs (once properly verified) may be used as reference material;
- translation workflow manager: set of components designed to facilitate and optimize the translation pipeline;
- translation editor: writing environment to create and to edit translations, usually provided with advanced filtering capabilities allowing the user to search amongst segments depending on several criteria;
- Machine Translation (MT) module: component that allows the user to call MT engines, either embedded into the CAT tool or provided by third parties.
This dissertation, and specifically the next subsections, will focus on the analysis and description of TM and TB usage patterns in patent translation.
3.2.1 Translation memory (TM)
TMs are one of the most valuable aids for a translator in CAT. Although the first research on the subject appeared in the 1980s, it was only in the 1990s that TMs began to be marketed (García, 2014). TMs exploit the concept of interoperability, that is to say «the ability of computer systems or software to exchange and make use of information» (Lexico, n.d.).
A translation memory is an updatable bilingual database containing segments in the source language (SL) associated with segments in the target language (TL). «Segments are usually understood to correspond to sentences or other more or less easily distinguishable text portions, such as titles» (Somers, 2003, p. 34). Each CAT software automatically segments the source texts according to predefined, and possibly user-modifiable, segmentation rules. Usually, punctuation marks are used as segment delimiters. Each pair of associated segments forms a translation unit (TU). When translating, the TM is automatically consulted to check whether or not identical or similar source segments to those to be translated are present in the memory (Lecci & Di Bello, 2012, p. 15). In the case of total or partial correspondence, the results are presented as “translation suggestions”. But before discussing how the translator can manage such suggestions, it is necessary to take a step back to explain how to create a TM.
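The segmentation and storage steps described above can be illustrated with a minimal sketch. It assumes one hypothetical segmentation rule (a segment ends at a full stop, exclamation mark or question mark followed by whitespace) and models the TM as a plain dictionary; neither reflects how any specific CAT tool is implemented, and a rule this simple would wrongly split abbreviations such as "e.g. patents".

```python
import re

# Hypothetical segmentation rule: a segment ends at ".", "!" or "?"
# followed by whitespace. Real tools use richer, user-editable rules.
SEGMENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def segment(text):
    """Split a text into segments using punctuation marks as delimiters."""
    return [s.strip() for s in SEGMENT_BOUNDARY.split(text) if s.strip()]


# Assumption: the TM is a dictionary mapping a source segment to its
# target segment. Each key/value pair plays the role of a translation unit.
tm = {}


def store_tu(source_segment, target_segment):
    """Record a translation unit (TU) in the TM."""
    tm[source_segment] = target_segment


source_text = "The device comprises a housing. The housing contains a sensor."
segments = segment(source_text)
store_tu(segments[0], "Il dispositivo comprende un alloggiamento.")
```

After this runs, `segments` holds two source segments and the TM contains one TU for the first of them.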
Somers (2003, p. 33) indicates three different ways of building a TM: building it up as you go, importing it from elsewhere, or creating it from a parallel text. The simplest, but also the most time-consuming, way to obtain a TM is to create a new one from scratch and populate it as you work on documents. Another equally simple way is to import a database from elsewhere. For example, customers often provide the translator with their TM to ensure a higher degree of consistency between projects. The fact that developers have agreed on a common interchange format facilitates the exchange of TMs between customer and translator, or between two or more translators. The last and technically most complex way is to automatically populate a TM from existing translations. In this case, it is useful to use alignment tools, which match up the source text and the translation segment by segment into translation pairs.
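As a rough illustration of the third method, the sketch below pairs source and target segments by position. This is a deliberately naive assumption: it only works when both texts contain the same number of segments in the same order, whereas real aligners must also cope with sentences that were merged, split or reordered in translation. The function name `align` is hypothetical.

```python
def align(source_segments, target_segments):
    """Pair segments by position to form candidate translation units.

    Assumes a strict one-to-one correspondence; otherwise the pairs
    would be wrong, so the function refuses to guess.
    """
    if len(source_segments) != len(target_segments):
        raise ValueError("segment counts differ; manual alignment needed")
    return list(zip(source_segments, target_segments))


pairs = align(
    ["The device comprises a housing.", "The housing contains a sensor."],
    ["Il dispositivo comprende un alloggiamento.",
     "L'alloggiamento contiene un sensore."],
)
```

Even in this simplified form, the resulting pairs should still be checked by a human before being used as reference material.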
In standard CAT system terminology, the concept of coincidence between segments is referred to as matching. There are different types of matching that are generally indicated with a percentage:
- if a source segment to be translated is identical to one contained in the translation memory, it is called exact/perfect match (100%);
- if the CAT system detects a full correspondence also in the previous and/or next segment at the same time, then it will be called context match (101%). In some CAT systems, such as Trados, it is sufficient that only the previous segment matches. As a general rule, 100% and 101% matches can be accepted by the translator without further intervention;
- in the case of a partial correspondence, it will be referred to as a fuzzy match. These matches are identified based on a minimum (and user-modifiable) match threshold indicating the degree of similarity with the segment being translated. The similarity varies according to the differences in segment content. Thus, fuzzy matches must be edited by the translator. In the most advanced CAT tools, fuzzy matches between 98% and 99% (i.e., those concerning slight punctuation differences only) can be automatically corrected by the CAT tool, if the user activates the corresponding option. How each CAT tool calculates the percentage of fuzzy matches is quite complex and often not fully disclosed in commercial systems, for proprietary reasons. However, when comparing two or more CAT tools, the difference in match retrieval techniques determines how efficient each tool is;
- if the TM does not suggest any translation, it is called no match. In this case, the translator must type a translation of the segment from scratch, which will be added to the memory in the form of a new TU (Lecci & Di Bello, 2012, p. 16).
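The four match types listed above can be sketched as a small classifier. Since commercial tools do not fully disclose how they score fuzzy matches, the example substitutes the similarity ratio of Python's standard `difflib.SequenceMatcher`; the 70% default threshold, the function names and the TM layout (a list of dictionaries with `source`, `target` and `previous` keys) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Percentage similarity; 100 is reserved for identical strings."""
    if a == b:
        return 100
    return min(99, int(SequenceMatcher(None, a, b).ratio() * 100))


def best_match(segment, previous_segment, tm, threshold=70):
    """Return (match type, score, suggested target) for a source segment."""
    best_score, best_tu = 0, None
    for tu in tm:
        score = similarity(segment, tu["source"])
        # Context match: identical segment AND identical preceding segment.
        if (score == 100 and previous_segment is not None
                and tu.get("previous") == previous_segment):
            score = 101
        if score > best_score:
            best_score, best_tu = score, tu
    if best_tu is None or best_score < threshold:
        return ("no match", 0, None)
    if best_score == 101:
        kind = "context match"
    elif best_score == 100:
        kind = "exact match"
    else:
        kind = "fuzzy match"
    return (kind, best_score, best_tu["target"])
```

Following the simpler convention mentioned above, the sketch checks only the preceding segment to decide on a context match. Raising or lowering `threshold` changes how many partial correspondences are offered as fuzzy matches.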
CAT tools automatically detect repetitions and propagate the corresponding target segment within the translation document as soon as it is entered at the first occurrence. As has already been pointed out, the main advantages of using TMs include speeding up the translation process and increasing the degree of textual consistency.
The advantages of using TM systems were already recognized by a study conducted by Elina Lagoudaki (2006), entitled «Translation Memories Survey 2006: Users’ perceptions around TM use». The aforementioned survey, based on a sample of 874 professional translators from 54 countries, showed that 82.5% of individuals used TM systems, mainly with technical texts. Besides, 96% of the users used a TM tool for the translation task, followed by terminology management tasks (51%) and quality assurance checks (47%). All the findings of that survey helped to form a clearer picture of the relationship between translation professionals and TM systems. However, a small percentage of respondents (17.5%) said they did not use any TM system at all. Among the reasons for not using TM systems were the drawbacks outlined in the previous section of this dissertation. Besides those, however, it should be considered that translating a text segment by segment reduces translation flexibility and fosters the tendency to translate literally. However, as will be explained later, in the case of patent translation, this is not a problem. Finally, the self-propagation of segments can lead to the reproduction of the same mistake (Naldi, 2014, p. 51).
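Repetition propagation can be pictured with the following sketch, in which confirming one segment copies its translation to every identical source segment that is still untranslated. The function name `propagate` and the parallel-list data layout are hypothetical and chosen only for brevity.

```python
def propagate(segments, translations, index, target):
    """Confirm `target` for segments[index] and copy it to every identical,
    still untranslated source segment.

    Returns the indices that were filled in automatically.
    """
    translations[index] = target
    filled = []
    for i, seg in enumerate(segments):
        if i != index and seg == segments[index] and translations[i] is None:
            translations[i] = target
            filled.append(i)
    return filled


segments = ["Claim 1.", "A device.", "Claim 1."]
translations = [None, None, None]
filled = propagate(segments, translations, 0, "Rivendicazione 1.")
```

The same mechanism that saves typing also copies any error in the confirmed target to every repetition, which is why propagated segments still deserve a review.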
3.2.2 Termbase (TB)
Identifying equivalents for specialized terms is a major part of any translation project. Each subject field, such as engineering, life science, law, etc., has significant amounts of domain-specific terminology. The search for reliable terminological equivalence is a very time-consuming process, as the translator has to find a target term that is not only semantically equivalent to the source one but also used with a frequency as similar as possible. By using terminology tools, translators may increase productivity and improve the consistency of their translations, avoiding customer dissatisfaction and misunderstandings.
From a historical point of view, before the termbases as we understand them today, in the 1960s there were term banks, that is to say large-scale collections of electronic term records. With the advent of the first desktop computers in the 1980s, terminology management systems began to be marketed as CAT tools (Bowker, 2003, pp. 50-51). However, according to Bowker (2003, pp. 51-52), “term bank” and “termbase” refer to two similar but different concepts:
One common difference between term banks and termbases is that the former strive to complete detailed records in order to meet the needs of a wide range of users. In contrast, the records in termbases are generally for the personal use of the translator who creates them; therefore, these records are frequently less detailed and may contain only those pieces of information that translators find useful or relevant to their needs.
A TB is an updatable multilingual database that contains terms related to a specific domain, organized as a series of records. Each record includes an SL term, called source term, and its equivalents, called target term(s), in one or more TLs. Each record can be enriched with additional information. Originally, users had to choose from a pre-defined set of fields. In contrast, most contemporary TB tools have adopted a more customizable entry structure, which allows the translator to add definitions, contexts, usage notes, images, audio clips, etc. Most TBs can operate as stand-alone applications; however, many contemporary systems can also be integrated with other more complex software. Furthermore, a TB can be used in monolingual form for checking terminology in an SL text, for example, to check terminology consistency in user manuals or product datasheets. Alternatively, it can be used in bilingual form for checking terminology in the TL text. When translating, the TB is automatically consulted to check whether or not identical or similar source terms are available. Any suggestions are automatically displayed to the translator while typing, and he or she can decide whether or not to include them in the translation (Lecci & Di Bello, 2012, p. 16). This type of specialized retrieval feature is known as automatic terminology lookup. If other terms considered relevant but not present in the associated TB are detected during translation, the translator can add them to the TB in real time.
As with TMs, TBs too are developed in standard formats so that they can be shared between translators and accessed more easily. «This option can help to ensure consistency on projects where several translators may be working on different parts of a long document» (Bowker, 2003, p. 59). It is often the clients themselves who provide a TB to the translator so that he or she will adopt the requested in-house terminology. Nevertheless, it is worth pointing out that TBs are often not exploited to their full potential. The deadlines demanded in the localization industry are often tight and prevent the development of detailed glossaries. In addition, terminology can change rapidly, resulting in a TB being considered obsolete. Therefore, users often tend to treat TBs «as disposable items, rather than as long-standing records» (Bowker, 2003, p. 52).
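A minimal sketch of automatic terminology lookup follows. The record layout (a dictionary keyed by source term, with one entry per target language plus optional notes) and the sample terms are assumptions made for illustration. Only identical terms are matched here, whereas the lookup described above may also retrieve similar terms.

```python
import re

# Hypothetical termbase: source term -> record with target-language
# equivalents and optional extra fields such as a usage note.
termbase = {
    "housing": {"it": "alloggiamento", "note": "mechanical context"},
    "prior art": {"it": "tecnica nota"},
}


def lookup_terms(segment, termbase, target_lang):
    """Return (source term, target term) pairs for terms found in a segment."""
    hits = []
    lowered = segment.lower()
    for term, record in termbase.items():
        # Whole-word, case-insensitive match of the source term.
        found = re.search(r"\b" + re.escape(term) + r"\b", lowered)
        if found and target_lang in record:
            hits.append((term, record[target_lang]))
    return hits


suggestions = lookup_terms("The housing differs from the prior art.", termbase, "it")
```

Adding a term in real time then amounts to inserting a new record, for example `termbase["sensor"] = {"it": "sensore"}`.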
3.3 Market analysis
Although the majority of translators use CAT systems regularly, some language professionals are still reluctant to incorporate them into their workflows (Smartcat, 2019). However, a limited understanding of how CAT tools work and the fear of automation making humans redundant should not lead to an underestimation of the enormous positive impact of CAT systems on business development. Although they all serve the same purpose, the market offers a wide range of CAT tools. Among the main players are SDL Trados Studio, MemoQ, Wordfast (Classic & Pro), Déjà Vu and Across. Many agencies expect translators to have working knowledge of and/or a license for at least one of the above-mentioned systems. In addition to the above, there are a number of alternatives. There is no perfect CAT tool, and users have to choose depending on their technical and professional needs (i.e., on which CAT tools are required by the agencies they work for), their budget for the possible purchase of a software license, their level of IT skills, and the functions they will actually use.
A distinction should be made based on CAT tool type, whether desktop or cloud-based. According to Garcia (2014, p. 79), indeed, computer processing power and connectivity have been crucial in the evolution of CAT.
The difference in scope between current CAT systems and those in the 1990s can be better understood within the framework of two trends: cloud computing, where remote (internet) displaced local (hard drive) storage and processing; and Web 2.0, with users playing a more active role in web exchanges.
Desktop CAT tools are downloadable and installable on the user’s computer(s), while cloud-based CAT tools can be accessed via the Web with a browser. The Web-based systems have their advantages. «Where teams of translators are involved, a segment just entered by one can be almost instantly reused by all. […] Management tasks can also be simplified and automated» (García, 2014, p. 80). However, over time, the translators themselves have not been enthusiastic about this type of system. The resistance is presumably focused on the very raison d’être of Web-based systems: remote administration and resource control (García, 2014, p. 80).
Another criterion to be taken into account in the analysis of a CAT system is the type of distribution or software license granted. Several license types exist, amongst which we can recall: freeware, always free; shareware, free for a trial period and then paid; payware, without a free demo version and to be paid for before the first use. A decade ago, CAT systems were very expensive tools developed almost exclusively for professional translators. Today, potential users have become more numerous and varied, so the cost of licensing has fallen. However, cost is not one of the main factors that drive the user to choose one CAT tool rather than another, as demonstrated by Tabor (2019) in a survey on CAT tool use. Taking into consideration only the two criteria listed so far (i.e., system type and license type), Figure 3 shows some examples of the most popular CAT tools today (Source: Smartcat, www.smartcat.ai (2019)).
Figure 4 is a histogram that clearly demonstrates that desktop tools are among the most used CAT tools (Source: Jared Tabor, www.ProZ.com (2019)).
Other key aspects not to be overlooked when deciding which CAT tool to work with include system requirements (hardware/software), supported languages, file formats accepted as input, TM and TB format, and compatibility with other CAT systems.
3.3.1 SDL Trados Studio
The data presented in the previous section (see section 3.3) clearly show that SDL Trados Studio is the most widely used CAT tool. Trados was founded as a Language Service Provider (LSP) in 1984. Throughout the 1980s and 1990s, it became increasingly popular in the translation market, until it was acquired by the multilanguage provider SDL International in 2005, thus creating the now well-known SDL Trados Studio. In the following years, the software underwent considerable changes (García, 2014, p. 78). Today, the latest version is SDL Trados Studio 2019 (from now on, Trados).
With Trados, users can translate and revise documents in a wide range of formats, carry out quality checks on target texts, create and manage complex translation projects and create and manage translation memories (Lecci & Di Bello, 2012, p. 13). This CAT tool is distributed in three versions (Starter, Freelance, and Professional) with substantial differences in price, performance, and capacity. For example, comparing the most basic version, the Starter, with the top-of-the-range version, the Professional Single User, the price rises from €99 to €2,495. When it comes to performance, with the Starter version users can work on five languages simultaneously, while with the Professional version there are no limits. The same applies to the maximum number of TUs stored in a TM: for the Starter version the limit is 5000 TUs, while for the Professional version it is unlimited.
Trados presents an integrated environment in which different tools can be found for translation, proofreading and project management by a translator or project manager. Thus, all functionality is accessible from a single interface that can be managed through different “views” focusing on different tools and utilities (Lecci & Di Bello, 2012, p. 18). Trados allows the user to follow two different workflows: the single-file translation workflow and the project package translation workflow. In the first case, a single document is worked on, while in the second case two or more documents, organized in “packages”, are worked on at the same time. The four basic views are “Home view”, “Projects view”, “Editor view”, and “Translation Memories view”. When a document to be translated or a translation project is opened, two other views are added, “Files view” and “Reports view”. Together, they form the “Navigation Pane”. At the top of the interface is the “Application Ribbon”, which is contextual to the operation carried out within Trados. This means that its content will be different depending on the view the user is working in. Figure 5 shows a screenshot of the standard user interface in SDL Trados Studio 2017 once a project has been opened. It shows, in red, the Application Ribbon and, in green, the Navigation Pane.
The functions of each view may be summarized as follows:
- the “Home view” enables translators to access the Open Document or New Project command;
- the “Projects view” enables translators to view and work with projects, as well as to view the details of the projects uploaded in the program;
- the “Files view” is where translators work with project files. «In the Files view, translators can open files for translation, open files for review, perform batch processing on files, and also view word counts and translation progress for those files» (Kurniawati, Rahajeng, Kristanto, & Kastuhandani, 2016, p. 94);
- the “Reports view” shows the reports of some of the operations performed automatically by the program during the creation of a project, the imports of any related TM, and the amount of translation performed/to be performed (Lecci & Di Bello, 2012, p. 26);
- the “Editor view” is where documents and projects are translated and reviewed;
- the “Translation Memories view” is where translators create and manage translation memories.
As proposed by Melby (1998), each workflow in SDL Trados may be divided into stages: before the translation, during the translation, and after the translation. The “Before the translation” stage is a preparatory phase, in which the translator opens the document or project he or she has to work on, and creates a TM and TB (if necessary) or associates an existing TM and TB. In the “During the translation” stage, the translator starts his or her work by translating the document. «The translation editor overview has a table layout, where the source text is presented on the left-hand side and the translation is on the right-hand side» (Kurniawati, Rahajeng, Kristanto, & Kastuhandani, 2016, p. 98). The suggestions provided by the TM associated with the active project appear in the “Translation results” window in the upper part. The translator, therefore, has two options: type the text in the right column or accept the TM’s suggestions. Once sure of the translation of the segment, the translator will have to confirm the segment. If Trados finds any differences in the formatting or the elements of the TU, it will notify the user with an error message. Otherwise, the TU will be recorded in the TM (if this option has been set). As translation advances, the translator may encounter self-propagated segments and various types of matches. The types of text that can be processed with Trados are numerous, each with specific characteristics and constituent elements. Equally numerous are the functionalities offered by Trados to manage these elements (for example, tags, placeables, list of variables, etc.). Figure 6 shows an “Editor view” of the completed translation of patent EP 3 155 494 B1 with the “Results Window” opened on the translation suggestions of TU no. 72 (selected and highlighted in blue) from the TM associated with the project.
The third and last stage (i.e., the “After the translation” stage) «includes how to verify the text after the translation work has been done and how to edit the errors in the translation» (Kurniawati, Rahajeng, Kristanto, & Kastuhandani, 2016, p. 100). In particular, it consists of conducting a quality check that comprises both a spell check and the verification of different aspects, such as punctuation, numbers or terminology. At the end of the translation process, it is necessary to finalize the project. Finalization updates the TMs used and generates the final target files in the original formats. Finalization is performed with the Batch Task Sequence called “Finalize”.