Natural Language Processing (NLP) is about computers understanding and deriving meaning from human language. Framed this way, the challenge seems extremely difficult. The average 20-year-old native speaker of American English knows 42,000 words (ranging from 27,000 for the lowest 5% to 52,000 for the highest 5%) and thousands of grammatical constructions. We need this large body of linguistic knowledge to communicate in a professional context or to write books and articles, and we spend many years getting good at it. In everyday life, however, our needs are lower: a vocabulary of 3,000 words is enough to cover around 95% of common texts such as news items, blogs, and tweets, and to learn new words from context. This makes meaning extraction easier for computers, especially for "simple" tasks like summarization, relationship extraction, and topic segmentation.
Historically, research on NLP started with the IBM-Georgetown demonstration in 1954, which exhibited automatic translation (in a very rudimentary form) from Russian to English. From the 1940s through the 1960s, NLP researchers focused mainly on machine translation, and it is remarkable how much was achieved in grammar and lexicon building with very limited computational resources.
NLP started to grow much faster with the spread of the World Wide Web in the 1990s, and it boomed in the early 2010s when neural network models began to be adopted.
Nowadays, commercial NLP-based products exist in healthcare, media, finance, and human resources. In this article, we will cover the basic principles of NLP, or simply how to get a computer to understand unstructured text and extract data from it.
First, let's say a few words about the specifics of the data used in NLP. Textual data is generated from conversations, tweets, declarations, and posts and comments on Facebook; it is arbitrary and ambiguous. These pieces of machine-readable text are collected in a dataset called a text corpus (pl. corpora), the term usually used in linguistics and NLP. The content of a corpus is defined by the specification of the NLP task. For topic segmentation, corpora contain a balanced mix of texts from different topic areas: religion, science, science fiction, detective fiction, and so on. The Brown corpus is an example: it consists of 500 English text samples (about 1 million words) tagged with roughly 80 different tags. For machine translation, corpora contain texts in multiple languages. Another example is the Switchboard corpus, collected for speech recognition in telephone settings; it contains conversational speech, with matching transcripts, from speakers of both sexes and a wide range of dialects across the United States.
The first step in the pipeline is to split the working document into units. In text analysis, the unit is called a token: a string of symbols that represents text. Usually, one token corresponds to one word. Splitting into tokens starts with sentence segmentation. A machine can easily separate sentences using punctuation marks (modern NLP pipelines use more complex techniques for unformatted documents). Then we split apart words wherever there is a space or a punctuation mark between them; this is called tokenization.
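The two steps above can be sketched with plain regular expressions. This is a deliberately simplified illustration, not what production pipelines do; real tokenizers (such as NLTK's) handle abbreviations, contractions, and unformatted text with trained models.

```python
import re

def sentence_segment(text):
    # Split on ., !, or ? followed by whitespace (a simplified heuristic).
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP is fun. It helps computers read!"
for sentence in sentence_segment(text):
    print(tokenize(sentence))
```

Running this prints each sentence's tokens, with the punctuation marks kept as separate tokens.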
The next step is to predict the part of speech of each token and tag it. Parts of speech (e.g., verbs, nouns, adjectives) are important because they indicate how a word functions within the context of a sentence. NLTK is one of the most widely used tools for part-of-speech tagging.
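To show the idea without a trained model, here is a toy tagger built from a tiny hand-made lexicon plus a suffix fallback. The lexicon and tag set here are illustrative assumptions; NLTK's `pos_tag` does the real work with a statistical model trained on tagged corpora.

```python
# Tiny illustrative lexicon mapping words to Penn Treebank-style tags.
LEXICON = {"the": "DT", "a": "DT", "cat": "NN", "dog": "NN",
           "sat": "VBD", "on": "IN", "mat": "NN"}

def tag(tokens):
    tagged = []
    for tok in tokens:
        pos = LEXICON.get(tok.lower())
        if pos is None:
            # Crude fallback: "-ing" suggests a gerund, otherwise guess noun.
            pos = "VBG" if tok.endswith("ing") else "NN"
        tagged.append((tok, pos))
    return tagged

print(tag(["The", "cat", "sat", "on", "the", "mat"]))
```

A real tagger also uses the neighboring words to disambiguate (e.g., "book" as noun vs. verb), which a pure lexicon lookup cannot do.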
Since in most languages words appear in different forms (e.g., be, is, was, are, or apple, apples), it is very helpful to find the base form, or lemma, of each word in the sentence. This is called lemmatization, and it is often used to reduce the number of distinct word forms. Lemmatization can be performed with a table that maps words to their lemmas based on their parts of speech (WordNet provides such data for English).
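The table-based approach described above can be sketched in a few lines. The table here is a hand-made stand-in for a real resource like WordNet, and the part-of-speech labels are illustrative.

```python
# Minimal lemma table keyed by (word, part of speech).
# In practice this data would come from a resource like WordNet.
LEMMA_TABLE = {
    ("is", "VERB"): "be", ("was", "VERB"): "be", ("are", "VERB"): "be",
    ("apples", "NOUN"): "apple",
}

def lemmatize(word, pos):
    # Fall back to the (lowercased) word itself when it is not in the table.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("was", "VERB"))     # -> be
print(lemmatize("apples", "NOUN"))  # -> apple
```

Note that the part of speech matters: lemmatizing "saw" as a verb should give "see", but as a noun it stays "saw", which is why tagging comes before lemmatization.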
There are also words you may want to filter out before statistical analysis, like "the", "and", and "is". These are called stop words, and because of their high frequency in texts they produce a lot of noise.
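Filtering stop words is a simple set-membership test. The stop-word list below is a small illustrative sample; libraries like NLTK ship curated lists of a hundred or more entries per language.

```python
# A tiny sample stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "movie", "is", "good"]))  # -> ['movie', 'good']
```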
These steps are the basis for extracting deeper levels of understanding from your data: morphological, syntactic, and semantic. Depending on the NLP task, one or several models are then used for further analysis. We will consider several of them.
Text classification is a very popular technique used for sentiment analysis of reviews, message polarity, reactions, and so on. For example, a simple classifier can compute the relative frequencies of positive words ("good", "awesome") and negative words ("bland", "horrible") in order to separate negative reviews from positive ones.
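The frequency-counting approach just described can be sketched as a lexicon-based classifier. The two word lists are tiny illustrative samples; real sentiment lexicons contain thousands of scored entries.

```python
# Illustrative sentiment lexicons (real ones are far larger).
POSITIVE = {"good", "awesome", "great"}
NEGATIVE = {"bland", "horrible", "bad"}

def sentiment(tokens):
    # Compare counts of positive vs. negative words in the review.
    pos = sum(t.lower() in POSITIVE for t in tokens)
    neg = sum(t.lower() in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(sentiment("the food was awesome and the service was good".split()))  # -> positive
```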
The most commercially successful text classification model is the one used for spam filtering: a Naïve Bayes classifier uses a corpus of spam and non-spam emails to estimate, from word frequencies, the probability that a given word appears in spam.
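A minimal Naïve Bayes sketch, assuming a toy four-email corpus and equal class priors, looks like this. It compares the log-probability of a message's words under the spam model vs. the ham model, with Laplace smoothing so unseen words do not zero out a score.

```python
import math
from collections import Counter

# Toy training corpus (assumption: two spam and two non-spam emails).
spam = ["win money now", "free money offer"]
ham  = ["meeting at noon", "project update attached"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(msg, counts, total):
    # Add-one (Laplace) smoothing avoids zero probability for unseen words.
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in msg.split())

def classify(msg):
    # Equal class priors assumed, so we compare likelihoods directly.
    spam_score = log_likelihood(msg, spam_counts, spam_total)
    ham_score = log_likelihood(msg, ham_counts, ham_total)
    return "spam" if spam_score > ham_score else "ham"

print(classify("free money"))  # -> spam
```

Despite its "naïve" independence assumption between words, this model works remarkably well on real email at scale.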
Effective language models should take into account the context in which words occur; such an approach can detect sentiment much better than simple word-counting classifiers. The "bag-of-words" technique evaluates the frequency with which words occur, alone and alongside other words, within a document. The model then explores relationships between documents that contain the same mixture of individual words. This can be very useful for topic segmentation, because texts from a specific discipline usually contain specific vocabulary.
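In the bag-of-words representation, each document becomes a vector of word counts over a shared vocabulary, so documents with the same mixture of words end up with similar vectors. A minimal sketch:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Shared vocabulary over the whole corpus, in a fixed (sorted) order.
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bag_of_words(d) for d in docs]
print(vocab)
print(vectors)
```

Note that word order is discarded entirely; only the mixture of words survives, which is exactly why the next refinement (n-grams) exists.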
The context in which words appear plays a huge role in conveying meaning. Extensions of the bag-of-words model consider co-occurrences of phrases in texts. This is called n-gram analysis (n is simply the number of consecutive words in each phrase to scan), and it is very effective at extracting meaning from texts.
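Extracting n-grams is just sliding a window of size n over the token list. With n = 2 (bigrams), a phrase like "new york" is kept together, which a plain bag of words would split apart:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york city is big".split()
print(ngrams(tokens, 2))
# -> [('new', 'york'), ('york', 'city'), ('city', 'is'), ('is', 'big')]
```

The bigram ('new', 'york') now appears as a single countable unit, so the model can distinguish "New York" from other uses of "new".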
Nowadays, neural networks play an increasingly significant role in text processing. Almost all of the big challenges of NLP are now addressed with neural networks: machine translation, summarization, paraphrasing, question answering, dialogue, and more. Alexa's speech recognition and Google Translate are just two examples of such applications. Still, existing NLP models have many limitations: it is very difficult for a machine to recognize sarcasm in sentiment analysis, and natural language changes over time (new words appear, and the meanings of existing words shift).