Introduction to Computational Linguistics

A few years ago, amid unfulfilled hopes that the pandemic would end and new initiatives by InnovaLang in translation innovation, we started a discussion with experts in artificial intelligence, machine translation, computer-assisted translation (CAT and beyond), and computational linguistics. We resume the conversation on the last of these with Marco Tomatis, an expert in applied linguistics and linguistic engineering who works at the University of Turin.

That discussion aimed to connect these areas of research and work, laying theoretical and conceptual foundations useful for developing our Machine Translation (MT) Engine. It also sought to confirm the substantial absence of “syncretism” among artificial intelligence, machine translation, computer-assisted translation systems, and computational linguistics and, from the perspective of academic research, to develop a point of convergence that could be formalized as a new theoretical starting point for automating translation processes.

Marco, tell us something about yourself first!

I graduated in Modern Foreign Languages and Literatures from the University of Turin back in 1997, after a technical diploma in electronics. After a brief work experience in machine translation at the Dima Group and a more substantial one at the Regional Ethnographic Linguistic Center of Turin, where I dealt with the digitization and filtering of original sound material, I obtained my Ph.D. in “Linguistics, Applied Linguistics, and Linguistic Engineering” from the University of Turin in 2005.

Following this, I had the opportunity to take an active part in research projects across the diverse world of Natural Language Processing (NLP), from the development of corpora to the encoding of texts according to TEI standards. I also served as an adjunct professor of “English Language,” “Applied Computer Science for Multimedia Communication,” and “General Linguistics” at the University of Turin, as well as “Computational Linguistics” at the University of International Studies (Unint) of Rome.

How did you acquire these skills?

These diverse skills, all belonging to the world of linguistics and natural language processing, are the result of passion and of experience gained in the field over the years. It’s worth noting how advances in computing power and data storage capacity have inevitably changed the theoretical and practical approach to the more delicate and problematic aspects of designing NLP systems.

For example, from the late ’90s to today, I have observed an evolution in the approach to machine translation, marked by a gradual shift from rule-based models operating at the various levels of linguistic structure to models focused on the stochastic analysis of translation data.

This evolution has also involved the programming languages used: over this period, we saw the great success of Prolog, whose logic-based setup was soon replaced by an approach more closely linked to “regular expressions,” a symbolic notation for describing character sequences that spread from the Unix operating environment and is now commonly implemented in every area of computing involved in human-machine interaction.
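
As an editorial illustration (not part of Marco’s answer), here is a minimal sketch of what a regular expression looks like in practice, using Python’s standard `re` module. The pattern, the function name `find_tion_words`, and the sample sentence are our own assumptions, chosen only to show how a short symbolic pattern can describe a whole family of character sequences.

```python
import re

# Hypothetical pattern: any word ending in "tion".
#   \b   word boundary
#   \w+  one or more word characters
TION_WORDS = re.compile(r"\b\w+tion\b", re.IGNORECASE)


def find_tion_words(text):
    """Return every word in `text` that ends in 'tion', in order of appearance."""
    return TION_WORDS.findall(text)


sample = "Machine translation and tokenization both rely on pattern recognition."
print(find_tion_words(sample))
# ['translation', 'tokenization', 'recognition']
```

A rule-based system in the Prolog tradition would describe such words through explicit logical rules; the regular expression captures the same surface pattern in a single line.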

How would you introduce computational linguistics to someone unfamiliar with it but working in a linguistic environment?

The main difficulty in introducing computational linguistics lies in the fact that it is a hybrid and multidisciplinary field that requires in-depth knowledge of linguistics (particularly the analysis of structure at all levels), statistics, and computer science (operating systems and programming languages) to be properly mastered.

Unfortunately, in Italy, the humanities and the hard sciences struggle to integrate and communicate with each other, often because their theoretical foundations diverge and cannot get past certain rigid schemas traditionally imposed by each discipline.

My experience, on the contrary, has taught me that the points of contact between the disciplines involved are significantly greater than one might believe, but a radical change in perspective is necessary: in this sense, Noam Chomsky’s insights are the most evident example.

Therefore, my suggestion for those interested in taking their first steps in this field is to be guided on a path that, in addition to providing a solid theoretical foundation, offers a practical approach to solving elementary (though absolutely essential) problems such as the process of tokenizing an electronic text.
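
To give an idea of what such a first exercise might look like, here is a minimal tokenization sketch in Python (an editorial illustration, not Marco’s own method). The pattern and the function name `tokenize` are assumptions; real tokenizers handle many more cases, such as abbreviations, numbers with separators, and URLs.

```python
import re

# Assumed, deliberately simple rule:
#   - a token is a run of word characters, optionally followed by
#     apostrophe + word characters (so "isn't" stays in one piece),
#   - or a single character that is neither a word character nor whitespace
#     (punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(text):
    """Split `text` into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("It's a hybrid field, isn't it?"))
# ["It's", 'a', 'hybrid', 'field', ',', "isn't", 'it', '?']
```

Even this toy version forces a decision that is linguistic rather than technical: whether “isn't” counts as one token or two.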

What aspects of this discipline do you find most interesting?

Computational linguistics presents captivating challenges, primarily on the linguistic level: there are still problematic areas of linguistic analysis that could find a solution precisely through the use of automatic systems, which by their nature force one to take a clear stance on categorization.

Closely related to this aspect is the potential of models based on the stochastic approach (a mathematical approach to estimating the probabilities of random events) to “guess” the nature of a term unknown to the system simply by quantifying how that term occurs within the portion of text under examination. Since different approaches can produce different results, I find it highly interesting to look for the best balance between processing natural language through rules and handling it statistically, which requires building a database as large and complete as possible.
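
The idea of “guessing” an unknown term from counts can be sketched as follows (an editorial illustration under strong simplifying assumptions, not the model Marco describes). The tiny tagged corpus, the tag names, and the functions `train` and `guess_tags` are all invented for this example: known words are looked up in a lexicon, and an unknown word receives the tag that most often followed the previous tag in the training data.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus, invented purely for illustration.
TAGGED = [
    [("the", "DET"), ("engine", "NOUN"), ("translates", "VERB"),
     ("the", "DET"), ("text", "NOUN")],
    [("a", "DET"), ("linguist", "NOUN"), ("reads", "VERB"),
     ("a", "DET"), ("corpus", "NOUN")],
    [("the", "DET"), ("old", "ADJ"), ("system", "NOUN"), ("fails", "VERB")],
]


def train(sentences):
    """Build a word -> tag lexicon and count tag-to-tag transitions."""
    lexicon = {}
    transitions = defaultdict(Counter)
    for sentence in sentences:
        prev = "START"
        for word, tag in sentence:
            lexicon[word] = tag  # simplification: one tag per word
            transitions[prev][tag] += 1
            prev = tag
    return lexicon, transitions


def guess_tags(words, lexicon, transitions):
    """Tag known words from the lexicon; guess unknown ones from counts."""
    tags = []
    prev = "START"
    for word in words:
        if word in lexicon:
            tag = lexicon[word]
        else:
            following = transitions[prev]
            # Most frequent tag seen after `prev`, if any was observed.
            tag = following.most_common(1)[0][0] if following else "UNKNOWN"
        tags.append(tag)
        prev = tag
    return tags


lexicon, transitions = train(TAGGED)
print(guess_tags(["the", "tokenizer", "reads", "a", "corpus"], lexicon, transitions))
# ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']
```

The word “tokenizer” never appears in the toy corpus, yet it is tagged as a noun because a noun followed a determiner four times out of five. With a larger database the counts become more reliable, which is where the balance between rules and statistics comes into play.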

From this perspective, the possibility of improving individual disciplines by leveraging the potential of integrated research represents an interesting and impactful challenge.

What are its possible fields of application?

Natural language processing has countless fields of application. To mention only the best known, they range from systems that support humanistic research and improve the usability of digital texts in electronic libraries (TEI encoding) to speech recognition and synthesis systems, the increasingly widespread “chatbots” that provide automatic user support for a given service, integrated and individual e-learning platforms, and machine translation and computer-assisted translation systems.

Do you find a significant gap between the academic approach and practical applications?

Unfortunately, I have noticed a certain disconnect between the traditional approach to problems prevalent in the academic world and the decidedly more pragmatic one that characterizes applied solutions: with a few exceptions, academic research generally struggles to respond to the private sector’s demands with innovative solutions capable of solving concrete problems quickly.

Do you have a funny anecdote to share about your activity in this field?

Even before I graduated, a professor (now retired for several years), who later supervised my thesis, used to call me “The computer man” because my interests went far beyond the classical boundaries of linguistics: when I presented my thesis project on the automatic creation of an English-Italian machine dictionary, she realized too late that what I was doing had nothing to do with lexicography in the strict sense…

Thank you, Marco!
