
Sequential models for language processing and their application to identity validation processes

Natural Language Processing (NLP) is a field of artificial intelligence that aims to provide machines with the ability to read, understand and interpret human language. In recent years, due to advances in deep learning and the development of sequential models, NLP has taken on an important role in different industries. Among the most interesting applications, currently under research and development, are processes of identity validation from written and spoken language. In this document we present these concepts and illustrate applications for identity validation.

Machine learning is a set of algorithms with the ability to analyze data, learn rules or patterns from it, and then make predictions for a specific task (e.g., deciding whether two faces or two fingerprints belong to the same person, or whether a document is authentic). Machine learning has driven today's advances in artificial intelligence. Instead of hand-coding software routines with a specific set of instructions and rules to perform a particular task, the machine is automatically "trained" using large amounts of data by algorithms that give it the ability to learn how to perform the task [MANRIQUE2020-2].

Deep learning can be considered a subset of machine learning derived from advances in algorithms and hardware that allow more complex models to be trained using much larger volumes of data. Deep learning works with artificial neural networks, which are designed to mimic the neural interactions in the human brain that govern learning and memorization processes. Until recently, the size of the artificial neural networks that could be built was limited by computational power, and their applications were therefore limited. However, various advances at the algorithmic, software and hardware levels have made it possible to build larger and more sophisticated neural network models. Virtual assistants such as Alexa or Cortana, self-driving cars, and translation systems are some examples of AI applications and systems that run on deep artificial neural networks.

Artificial Neural Networks 

Artificial neural networks are composed of layers of nodes, just as the human brain is composed of neurons. 

Figure 1. Artificial Neural Network 

Figure 1 shows the structure of a conventional artificial neural network being trained to classify images of cows and horses. Each image is processed so that it is represented as a vector of pixels of defined size. These vectors enter the network through the "input layer" and propagate to the so-called "hidden layers". As shown in the figure, the nodes within each layer are connected to those of the adjacent layers. The connection lines between nodes represent the propagation of information and are weighted by parameters called "weights": a connection with a larger weight exerts more influence on the node it feeds. Each node in a hidden layer combines its inputs and, through an activation function, passes an output to the next layer. The final layer compiles the weighted inputs to produce the result of the classification task (is it a horse or a cow?).
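
The forward propagation just described can be sketched with a toy network. The weights below are hand-set purely for illustration (a real network learns them from labeled data, as explained next), and the 3-pixel "image" is a placeholder for a real pixel vector:

```python
import math

def sigmoid(z):
    # Activation function: squashes a weighted sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, output_weights):
    # Each hidden node takes a weighted sum of the input vector and applies
    # the activation; the output node does the same over the hidden values.
    hidden = [sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)))
              for w in hidden_weights]
    return sigmoid(sum(w_i * h_i for w_i, h_i in zip(output_weights, hidden)))

# Toy 3-pixel "image", two hidden nodes, hand-set weights (in practice the
# weights are learned from thousands of labeled images).
x = [0.8, 0.1, 0.5]
hidden_weights = [[0.4, -0.2, 0.1], [-0.3, 0.8, 0.5]]
output_weights = [1.2, -0.7]

score = forward(x, hidden_weights, output_weights)
print("cow" if score > 0.5 else "horse")
```

The single output score plays the role of the final layer in Figure 1: values above 0.5 are read as one class and values below as the other.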

How is an artificial neural network constructed? Training algorithms must be applied to obtain the network parameters (weights). These algorithms are fed with labeled data in order to "learn" the appropriate parameters; in the example of Figure 1, the data are thousands of images of cows and horses. The more data available, the better the chances of obtaining a network that is accurate in the classification task.

The network is said to be deeper as the number of hidden layers and nodes grows. The descriptive capacity of the network increases with the number of layers; however, large amounts of data are then necessary to perform appropriate training.

Natural Language Processing and Sequential Models 

Natural language processing (NLP) is a form of artificial intelligence that aims to give machines the ability to read, understand and interpret human language. For computers, this is extremely difficult to achieve due to the large amount of unstructured data and the absence of real-world context or intent. 

In recent years, due to advances in deep learning, natural language processing has taken on an important role in different industries. Search engines use NLP strategies to generate relevant results based on similar search behavior or on user intent identified from previous searches. Advances in NLP have also enabled the development of language translation systems; current machine translators reach accuracy levels that, in some domains, are comparable to those of human translators. Automatic sentiment and polarity analysis techniques, widely used in marketing and advertising, have also been enhanced by new deep neural network architectures. Finally, advances in spoken language generation have enabled the development of virtual assistants such as Alexa or Cortana.

To build advanced NLP applications, new architectures and algorithms have been developed to enhance the processing of sequential data such as written and spoken language. The so-called "sequential models" make it possible to model the order and progression of language, which establish the dependencies between words. Among the sequential models, the so-called "Recurrent Neural Networks" (RNN) became popular due to their effectiveness. Recurrent Neural Networks were created because classical architectures and earlier deep neural networks share several limitations:

  • They cannot handle sequential data. 
  • They consider only the current input. 
  • They cannot memorize previous inputs. 

RNNs are ideal for handling language data (written or spoken) because such data are essentially sequences of words related to each other. The probability of occurrence of a word depends on the previous, and even the following, words.

Fig 2. General architecture of an RNN 

Figure 2 shows the typical diagram of an RNN, specialized in labeling the words of a sentence of length t according to their type (e.g., verb, noun). At each time step, one word of the sentence is fed into the RNN. In this architecture, the current output ŷt depends on the previous inputs and activations: at any time step, the network combines the current input xt with the activation produced at the previous step, which received xt-1 as input. In the example in the figure, the word "arming" could be ambiguous (verb or noun) for other network architectures, but not for an RNN, which in this case takes into account the output of the previous steps.
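
The recurrent step can be sketched as follows. Dimensions, word vectors and weights are toy values chosen only to show how the hidden state is threaded through the sequence, not a trained model:

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One recurrent step: the new hidden state mixes the current input x_t
    # with the previous hidden state h_prev, so earlier words in the
    # sentence influence the output produced for the current word.
    return [math.tanh(sum(w * v for w, v in zip(W_x[j], x_t)) +
                      sum(w * v for w, v in zip(W_h[j], h_prev)) + b[j])
            for j in range(len(b))]

# Toy 2-dimensional "word vectors" for a 3-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_x = [[0.5, -0.3], [0.2, 0.8]]   # input-to-hidden weights
W_h = [[0.1, 0.4], [-0.2, 0.3]]   # hidden-to-hidden weights
b = [0.0, 0.0]

h = [0.0, 0.0]                    # initial hidden state
for x_t in sentence:
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h)  # the final state carries information from the whole sequence
```

In a tagging network like the one in Figure 2, a per-word output ŷt would be computed from each intermediate h; here only the state update itself is shown.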

In some cases, the dependencies between words may span longer distances. For example, in the sentence "The students, who were in the middle of an exam, managed to concentrate despite the noise", identifying the correct conjugated form of the verb "to manage" depends not on the immediately preceding words ("an exam") but on words much further back in the sequence ("The students"). These long-range dependencies can be modeled using RNNs with memory gates. The so-called Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures add gates to the recurrent unit that regulate the flow of information, keeping the past information that is relevant.
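
The gating idea can be sketched with a scalar GRU-style step. The parameter values below are hypothetical and chosen only to illustrate how an update gate near zero carries the previous state forward, which is what preserves long-range dependencies such as "The students ... managed":

```python
import math

def sigmoid(z):
    # Maps any real number into (0, 1), the range a gate needs.
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    # One scalar GRU-style step. The update gate z decides how much of the
    # candidate state to mix in; the reset gate r decides how much of the
    # previous state to use when proposing that candidate. When z is close
    # to 0, h_prev is carried forward almost unchanged, so information can
    # survive many time steps.
    z = sigmoid(p["wz_x"] * x_t + p["wz_h"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr_x"] * x_t + p["wr_h"] * h_prev + p["br"])   # reset gate
    h_tilde = math.tanh(p["wh_x"] * x_t + p["wh_h"] * (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde

# A large negative bias pushes the update gate toward 0, so the state is kept:
params = {"wz_x": 0.0, "wz_h": 0.0, "bz": -10.0,
          "wr_x": 0.5, "wr_h": 0.5, "br": 0.0,
          "wh_x": 1.0, "wh_h": 1.0, "bh": 0.0}
h = gru_step(1.0, 0.7, params)
print(round(h, 3))  # stays very close to the previous state 0.7
```

In a trained LSTM or GRU these gate parameters are learned, so the network itself decides, word by word, what to remember and what to overwrite.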

RNN applications for identity validation 

One of the areas where RNNs have been evaluated is identity validation: identity expressed in the way a person uses written or spoken language. Stylometry is an area of research that aims to identify linguistic characteristics (words, syntactic and semantic structures) found in a written text. These characteristics are usually specific to the person and can define his or her linguistic style [DAEL2013].

The first stylometric investigations were carried out on long written texts produced by professional writers (e.g., novels, poems) [JUOLA2006]. The results of the computational models constructed suggested that style was unique and varied little over time. For example, [SALEH2014] reported accuracies above 97% using books written by 10 different authors. In digital environments, and in particular in social networks, authorship attribution is much more challenging due to the enormous number of users and the much shorter text samples.

RNNs have proven to be robust methods for this task [BEVENDORFF2020] and in general provide greater descriptive capability than other types of models. Figure 3 shows, in general terms, an RNN architecture for identifying the author of a particular text. This architecture is known as "Many to One", since the input is many words while the output is a single value representing the possible author.

Fig 3. “Many to One” RNN architecture for author identification 
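
The "Many to One" readout can be sketched as follows. Author names, word vectors and all weights here are hypothetical placeholders; a real model would learn the weights from labeled texts of the candidate authors:

```python
import math

def step(x_t, h_prev, W_x, W_h):
    # Recurrent step: mixes the current word vector with the running state.
    return [math.tanh(sum(w * v for w, v in zip(W_x[j], x_t)) +
                      sum(w * v for w, v in zip(W_h[j], h_prev)))
            for j in range(len(W_h))]

def predict_author(text_vectors, W_x, W_h, W_out, authors):
    # "Many to One": the whole text is read word by word, but only the
    # final hidden state is mapped to one score per candidate author.
    h = [0.0] * len(W_h)
    for x_t in text_vectors:
        h = step(x_t, h, W_x, W_h)
    scores = [sum(w * v for w, v in zip(row, h)) for row in W_out]
    return authors[max(range(len(scores)), key=scores.__getitem__)]

authors = ["author_A", "author_B", "author_C"]  # hypothetical candidates
W_x = [[0.6, -0.1], [0.2, 0.7]]                 # input-to-hidden weights
W_h = [[0.1, 0.3], [-0.2, 0.2]]                 # hidden-to-hidden weights
W_out = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]  # one output row per author
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # toy word vectors

print(predict_author(text, W_x, W_h, W_out, authors))
```

The key design point is that intermediate outputs are discarded: unlike the word-tagging network of Figure 2, only the last hidden state, which summarizes the whole text, feeds the author scores.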

Despite major advances with LSTMs, recent research findings have shown the difficulty of using linguistic style as a method of identity validation [KALA2018, BEVENDORFF2020]. Among the main factors are: 

  1. An author’s style changes depending on the topic he or she speaks or writes about (e.g., politics vs. soccer). 
  2. An author’s style depends on the setting: the language used in social networks is different from that used in forums and Q&A platforms. 
  3. Accurate identification models require text samples of considerable length, which are generally cumbersome to acquire. 

Although stylometry is not currently viable as an identification factor, other identity validation methods derived from text production in digital environments have emerged. Keystroke dynamics recognition is another form of identity validation in which RNNs have also been employed [LU2020, THE2013, LICHAO2017]. Keystroke recognition has been defined as the process of measuring and evaluating a person's typing rhythm on digital devices such as computer keyboards, cell phones and, in general, touch-screen devices.
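
The timing measurements such systems consume can be sketched as follows. The event format and feature names are illustrative assumptions, not a specific library's API; the two features shown (hold time and flight time) are classic keystroke dynamics measurements:

```python
def keystroke_features(events):
    # events: list of (key, press_time, release_time) in milliseconds.
    # Hold time: how long each key stays pressed.
    # Flight time: gap between releasing one key and pressing the next.
    # Sequences of such features are what the recurrent models consume.
    hold = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return hold, flight

# Toy recording of a user typing "hola".
events = [("h", 0, 90), ("o", 150, 230), ("l", 300, 370), ("a", 430, 520)]
hold, flight = keystroke_features(events)
print(hold)    # [90, 80, 70, 90]
print(flight)  # [60, 70, 60]
```

Each user's characteristic rhythm shows up as a distinctive distribution of these timings, which is why a sequence model can compare a fresh typing sample against a stored user model.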

According to the results reported in [LICHAO2017] and [LU2020], accuracies range between 84% and 95%, depending mainly on the number of users against which the comparison is made and on the length of the input sequences used to build the user model.

Conclusion 

Advances in NLP driven by deep learning and represented in so-called sequential models have enabled the development of many language-based applications. Virtual assistants and translators are examples of such advances. Their application for identity validation processes, while still under development and research, is promising and there are advances that exploit stylometry and typing dynamics. 

Rubén Manrique 

Translated by: Anasol Monguí

Bibliography 

[THE2013] Teh, P.S., Teoh, A.B., Yue, S. (2013). A survey of keystroke dynamics biometrics. The Scientific World Journal, 2013:408280. doi:10.1155/2013/408280 

[DAEL2013] Daelemans, W. (2013). Explanation in Computational Stylometry. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing, CICLing 2013. Lecture Notes in Computer Science, vol. 7817. Springer, Berlin, Heidelberg. 

[BEVENDORFF2020] Bevendorff, J. et al. (2020). Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection. 

[JUOLA2006] Juola, P., Sofko, J., Brennan, P. (2006). A Prototype for Authorship Attribution Studies. Literary and Linguistic Computing, 21(2), pp. 169-178. 

[MANRIQUE2020] Manrique, R. (2020). Identidad digital basada en texto: estrategias, avances y limitaciones. 

[MANRIQUE2020-2] Manrique, R. (2020). Inteligencia artificial y su impacto en verificación de identidad. 

[KALA2018] Sundararajan, K. (2018). Analysis of Stylometry as a Cognitive Biometric Trait. 

[PAWEL2016] Kobojek, P., Saeed, K. (2016). Application of Recurrent Neural Networks for User Verification based on Keystroke Dynamics. 

[LICHAO2017] Sun, L. et al. (2017). Sequential Keystroke Behavioral Biometrics for Mobile User Identification via Multi-view Deep Learning. 

[LU2020] Lu, X., Zhang, S., Hui, P., Lio, P. (2020). Continuous authentication by free-text keystroke based on CNN and RNN. Computers & Security.