How Computers Learn to Understand Human Language

Konstantin Vorontsov, a scientist with a doctorate in physical and mathematical sciences and a professor at the Russian Academy of Sciences, specializes in data analysis, artificial intelligence, and machine learning.

Searching by Meaning, Not Just Words

In 2010, I defended my doctoral thesis, which was purely theoretical and mathematical. Although I had worked on applied tasks, I wanted to dedicate my research to something more socially significant. Eventually, I turned to text analysis and became interested in topic modeling and text clustering.

How can we understand the topics within a small collection of texts? Topic models of this kind are used in search engines, among other applications. Over time, I got involved in exploratory search tasks. How can we search by meaning, not just keywords? How can we use such powerful search methods for self-education, beyond the typical specific queries on Yandex or Google?

Humanities Help Technicians Understand Texts

My next step involved working with natural language understanding, not in a broad sense, but in a very pragmatic one. My colleagues and I recently established a new lab at the AI Institute at Moscow State University, where we focus on detecting text manipulation, propaganda, and everything related to deception and misinformation.

This work intersects with fields like linguistics, psychology, media linguistics, and political linguistics, even touching on history. For us, propaganda is an object of study, not a practice. For example, when we see something involving reinterpretation of World War II, we need historians on our team to create labeled samples for these topics.

Our main goal is to formalize knowledge from the humanities by developing training datasets. We aim to build technologies that can automatically detect manipulation in streams of hundreds or thousands of daily messages from news media and social media. But we can only create these technologies by embedding the expert knowledge of our humanities colleagues into the models we build, and we do this through labeling.

We are now creating professional labeling tools. Our experts—journalists or media linguists—can label meanings. What constitutes manipulation? What is the psychological or emotional impact of a text on a person? What in the text might prompt someone to change their viewpoint or evoke emotion or action? These are the challenges we’re currently focused on.

Detecting Manipulation in Texts

When we annotate text to identify propaganda techniques or speech manipulations, linguists can spot these elements, especially experienced ones who have studied these language phenomena for a long time. A manipulative fragment has a beginning and an end; it may be a single sentence or a group of sentences. Does it exert an influence that pushes readers to change their attitude toward a subject mentioned in the text? That subject is the manipulation target.

For instance, there might be one segment that is the manipulation itself and another that is the manipulation target. These segments are connected. Highlighting, linking, and annotating text segments is the core operation of labeling, similar to how images are segmented and classified.
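One way to picture such an annotation is as a set of labeled character spans plus links between them. The sketch below is only an illustration under assumptions: the class names `Span` and `Link`, the labels "manipulation" and "target", the relation name "aims_at", and the sample sentence are all invented here and do not describe the lab's actual labeling tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled text segment, stored as character offsets."""
    start: int   # offset of the first character (inclusive)
    end: int     # offset just past the last character (exclusive)
    label: str   # hypothetical label, e.g. "manipulation" or "target"


@dataclass(frozen=True)
class Link:
    """A directed connection between two annotated segments."""
    source: Span
    target: Span
    relation: str  # hypothetical relation name


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of phrase and wrap it as a labeled span."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


def span_text(text: str, span: Span) -> str:
    """Return the characters covered by a span."""
    return text[span.start:span.end]


# Invented example sentence, used only to show the data structure.
text = "Only a fool would trust the new policy."
manipulation = find_span(text, "Only a fool would trust", "manipulation")
target = find_span(text, "the new policy", "target")
link = Link(source=manipulation, target=target, relation="aims_at")

print(span_text(text, link.source), "->", span_text(text, link.target))
```

Storing offsets rather than copied strings keeps each annotation tied to an exact position in the original text, which matters when the same phrase occurs more than once.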

Words Are Vectors, But They Need Context Adjustments

How did we arrive here? Over the past 10 years, technology has evolved so dramatically that we’re now solving tasks that were deemed unthinkable even five years ago. Detecting manipulation and propaganda in text was unapproachable back then; achieving a 60% quality level didn’t seem valuable. Yet attempts were always made. What changed?

Models emerged that could convert words into vectors reflecting their meaning. These vectors capture meaning, but each word gets a single vector for the entire text collection or language, which isn’t ideal. Linguists know that a word’s meaning heavily depends on its immediate surroundings and context. What sentence was it in? What was the overall message?
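The limitation of one vector per word can be shown with a toy lookup table. This is a minimal sketch with made-up three-dimensional vectors (real models learn vectors with hundreds of dimensions from data); the words, the numbers, and the function names are assumptions chosen for illustration.

```python
import math

# Toy vectors invented for illustration: dimension 0 loosely stands for
# "nature", dimension 2 for "finance". Real embeddings are learned from text.
STATIC_VECTORS = {
    "river": [0.9, 0.1, 0.0],
    "money": [0.0, 0.1, 0.9],
    "bank":  [0.5, 0.1, 0.5],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed(words):
    """Static lookup: every occurrence of a word gets the same vector."""
    return [STATIC_VECTORS[w] for w in words]


# "bank" receives an identical vector in both contexts, so it stays
# equally close to "river" and to "money" whatever the sentence is about.
bank_in_nature_context = embed(["river", "bank"])[1]
bank_in_finance_context = embed(["money", "bank"])[1]

print(bank_in_nature_context == bank_in_finance_context)
print(cosine(STATIC_VECTORS["bank"], STATIC_VECTORS["river"]))
print(cosine(STATIC_VECTORS["bank"], STATIC_VECTORS["money"]))
```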

Thus, a word’s meaning vector should adjust with each use, depending on its context. And that’s exactly what transformer models do, allowing word representations to be context-sensitive, not universal across the language.
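The idea of adjusting a vector by its context can be sketched with a heavily simplified attention step: each word's new vector is a weighted average of all vectors in the sentence, with weights derived from dot-product similarity. This is an assumption-laden toy, not a real transformer: it has no learned query, key, or value projections, no multiple heads or layers, and the vectors and the `scale` parameter are invented for illustration.

```python
import math

# Toy vectors invented for illustration (dimension 0 ~ "nature",
# dimension 2 ~ "finance"); real models learn these from data.
VECTORS = {
    "river": [0.9, 0.1, 0.0],
    "money": [0.0, 0.1, 0.9],
    "bank":  [0.5, 0.1, 0.5],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(words, index, scale=5.0):
    """Return a context-adjusted vector for words[index].

    The word "attends" to every word in the sentence (itself included):
    similarity scores become weights, and the output is the weighted
    average of all the sentence's vectors. `scale` is a hypothetical
    knob that sharpens or flattens the weights.
    """
    vecs = [VECTORS[w] for w in words]
    weights = softmax([scale * dot(vecs[index], v) for v in vecs])
    dim = len(vecs[index])
    return [sum(w * v[d] for w, v in zip(weights, vecs)) for d in range(dim)]


bank_near_river = contextualize(["river", "bank"], index=1)
bank_near_money = contextualize(["money", "bank"], index=1)

print(bank_near_river)  # shifted toward the "nature" dimension
print(bank_near_money)  # shifted toward the "finance" dimension
```

The same word now ends up with different vectors in different sentences, which is the property the static lookup lacks.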
