Laws and Preprocessing of Text

To complete this practicum module, you need the following:

· Jupyter Notebook

· Python 3

· NLTK

In information retrieval systems, text preprocessing improves the quality of retrieval from a corpus. In this practicum, we will implement several text preprocessing steps and investigate several text laws.

Exercise 1 | Text Preprocessing

1. First, we import the Python libraries needed to preprocess the text.
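A minimal sketch of the imports, assuming the set of libraries the later steps rely on (the original notebook may load more or fewer):

```python
# Libraries used throughout this practicum (assumed set)
from collections import Counter

import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
```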

2. Then, read the text files: load all texts in the Wikipedia abstracts (for reading a file without newlines, see https://stackoverflow.com/questions/12330522/how-to-read-a-file-without-newlines).
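Following the approach in the linked Stack Overflow answer, a sketch of the reading step; the filename wiki_abstracts.txt is a placeholder for your local copy of the abstracts:

```python
# Read the Wikipedia abstracts, one abstract per element, without trailing newlines
# (the filename is a placeholder; adjust it to your local file)
with open('wiki_abstracts.txt', encoding='utf-8') as f:
    set_data = f.read().splitlines()
```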

3. Check the contents of set_data by displaying its first two elements, one element per line.
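A sketch of this check:

```python
# Display the first two abstracts, one per line
print('\n'.join(set_data[:2]))
```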

Then, we write code to separate the individual words that make up the sentences in the file read above.

a. Tokenization: convert text into tokens with no punctuation.

1. We use NLTK’s RegexpTokenizer to write the tokenization code.
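A sketch of the tokenization, assuming the common \w+ pattern that keeps only alphanumeric tokens (the original pattern may differ):

```python
# Split each abstract into word tokens, dropping punctuation
tokenizer = RegexpTokenizer(r'\w+')
set_data_tokenized = [tokenizer.tokenize(text) for text in set_data]
```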

2. Check the contents of set_data_tokenized by displaying its first two elements.

In this step we display 10 data elements after tokenization. The output is:

b. Stopword removal

1. Write code to remove stopwords from the tokens in set_data_tokenized.

We must download the NLTK stopword corpus for the program to run properly.
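A sketch of the stopword filter, assuming NLTK’s English stopword list since the abstracts are in English:

```python
# Download the stopword corpus once, then drop stopwords from every abstract
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
set_data_without_stopword = [
    [token for token in tokens if token.lower() not in stop_words]
    for tokens in set_data_tokenized
]
```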

2. Check the contents of set_data_without_stopword by displaying its first two elements. Then we print the data after stopword removal.

This code prints the first 100 data items. The result is:

c. Normalization (using Porter Stemmer)

1. Write code that reduces the inflected words from the stopword-removal results to their base form using a Porter stemmer.
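A sketch of the stemming step; the name base_data_words matches the check in step 2:

```python
# Reduce every token to its stem with the Porter algorithm
stemmer = PorterStemmer()
base_data_words = [
    [stemmer.stem(token) for token in tokens]
    for tokens in set_data_without_stopword
]
```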

2. Check the contents of base_data_words by displaying its first two elements. Then we print the data after normalization.

This code prints the first 100 data items. The result is:

Exercise 2 | Text Laws

a. Zipf’s law

1. Print the unique terms with their frequencies, plot them on a log-log graph (hint: Matplotlib’s loglog), and observe Zipf’s law; a sketch of these steps follows step 4 below.

The output (truncated by the notebook to the last 5000 lines) is:

2. We can see the frequency of each word in the data as follows:

3. We can see the rank of each word in our data as follows:

4. Create the plot as a log-log graph.
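A sketch covering steps 2 to 4: count term frequencies, rank them, and plot rank against frequency on log-log axes:

```python
# Count how often each stemmed term occurs in the whole corpus
term_freq = Counter(token for tokens in base_data_words for token in tokens)

# Rank terms from most to least frequent
ranked = term_freq.most_common()
ranks = range(1, len(ranked) + 1)
freqs = [freq for _, freq in ranked]

# Zipf's law predicts a roughly straight line on log-log axes
plt.loglog(ranks, freqs)
plt.xlabel('log rank')
plt.ylabel('log frequency')
plt.title("Zipf's Law")
plt.show()
```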

Zipf’s law counts the frequency of occurrence of each word in the txt file, treating every word as one frequency unit: a word enclosed in quotation marks (“ ”) counts as one word, and likewise a hyphenated word counts as one word. Such cases can distort the frequency counts, which is why the words in the txt file must first go through the preprocessing stage. In the resulting table of word-frequency ranks, the word “quot” occurs most frequently (likely residue of HTML &quot; entities in the abstracts), followed by the word “the” with the second-highest frequency, while at the other end the words “bioactiv” and “總統選舉” have the lowest frequency, appearing only once each. Visualizing the full ranking of word occurrences therefore produces a downward-sloping curve.

To see both the rank and the frequency of occurrence of each word, the two are plotted on a log-log graph, where the x-axis shows the log of a word’s rank and the y-axis the log of its frequency of occurrence. The resulting curve is roughly linear but still looks rough, meaning the ranked word frequencies do not follow the expected straight line exactly. Even so, this experiment shows that the astrack.wiki data follows the distribution predicted by Zipf’s law.

b. Benford’s law

1. Plot the distribution of the first digit of the frequencies obtained and observe Benford’s law. Try again while neglecting the one-digit frequencies (frequencies less than 10), and check whether the law still applies. First, we load the library for Benford’s law.

2. Then we install the benfordslaw package, initialize it with alpha = 0.05, print the fit results, and plot them with the plot title “Benfords Laws”.
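A sketch assuming the benfordslaw package’s fit/plot interface (pip install benfordslaw); the term frequencies from the Zipf step supply the numerical data:

```python
import numpy as np
from benfordslaw import benfordslaw

# Fit Benford's law to the first digits of the term frequencies
bl = benfordslaw(alpha=0.05)
freq_values = np.array(list(term_freq.values()))
bl.fit(freq_values)
bl.plot(title='Benfords Laws')

# Try again while neglecting the one-digit frequencies (counts below 10)
bl.fit(freq_values[freq_values >= 10])
bl.plot(title='Benfords Laws (frequencies >= 10)')
```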

The output is :

Benford’s law estimates the frequency with which each number appears in a set of numerical data. If the data set was generated without any manipulation, the observed frequencies should match the frequencies Benford’s law expects; otherwise the data may indicate fraud. Benford’s law concerns the first (leading) digit of numerical values, with expected frequencies for the digits 1 through 9, and it is therefore also used to detect anomalies in a data set. Looking at the results above, the observed distribution deviates from Benford’s prediction. Only the leading digit 2 comes close: its bar sits almost exactly on the red Benford’s-law marker. The frequency of values starting with digit 1 exceeds the red marker, so the appearance of leading digit 1 can be considered an anomaly in the astrack.wiki data we use. We can conclude that, apart from the leading digit 2, the proportions of the frequencies do not follow the expectations of Benford’s law.

c. Heap’s law

Plot the growth of the vocabulary as you go through the collection and observe Heap’s law. Try to fit the law to your graph and report the best-fitting 𝑘 and 𝑏 constants.
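A sketch of the vocabulary-growth plot with the fitted Heap’s curve V(n) = k·n^b overlaid; the trials below rerun the hypothetical plot_heaps helper with different k and b:

```python
# Record the number of distinct terms seen after each token in the collection
all_tokens = [token for tokens in base_data_words for token in tokens]
seen = set()
vocab_growth = []
for token in all_tokens:
    seen.add(token)
    vocab_growth.append(len(seen))

def plot_heaps(k, b):
    """Plot observed vocabulary growth against the Heap's-law curve k * n**b."""
    n = range(1, len(all_tokens) + 1)
    plt.plot(n, vocab_growth, label='Pertumbuhan Kosakata pada Hukum Heap')
    plt.plot(n, [k * i**b for i in n], label=f'k={k}, b={b}')
    plt.xlabel('text size (words)')
    plt.ylabel('vocabulary size')
    plt.title("Heap's Law")
    plt.legend()
    plt.show()

plot_heaps(k=40, b=0.59)  # trial 1; trials 2-4 below vary k and b
```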

1. First we try k = 40 and b = 0.59; the plot label is “Pertumbuhan Kosakata pada Hukum Heap” (vocabulary growth under Heap’s law) and the title is “Heap’s Law”.

The output is :

2. We try again with k = 100 and b = 0.60, keeping the same plot label and title.

The output is :

3. We try again with k = 100 and b = 0.5, keeping the same plot label and title.

The output is:

4. We try again with k = 50 and b = 0.40, keeping the same plot label and title.

The output is :

Heap’s law is used to see how the vocabulary grows across a data set: it is an empirical law describing the number of distinct words the data contains. The values of k and b are free parameters determined empirically, with k typically in the range 10–100 and b in the range 0.4 < b < 0.7. In the experiments above we have to find suitable values of k and b; the fit is considered accurate when the fitted curve closely follows the observed one. Based on these experiments, k = 100 and b = 0.6 from experiment 2 are the best values, because they produce a curve that matches the data better than the other combinations. The x-axis represents the size of the text (in words) while the y-axis represents the vocabulary size, i.e. the growth of distinct words in the data. Through Heap’s law we can thus see the vocabulary-growth graph of the data; in the second experiment the vocabulary grows to about 80,000 words.

Assignment

Propose an additional text preprocessing stage for the code you created in Exercise 1, based on your analysis of the preprocessing results you obtained there.

After running the preprocessing stages and applying the preprocessed data to Zipf’s, Benford’s, and Heap’s laws, we propose adding a case-folding stage and replacing stemming/normalization with lemmatization. The reason is that stemming is usually a crude heuristic process that chops off the ends of words and often strips their suffixes. Lemmatization, in contrast, maps words to their correct base forms using a vocabulary and morphological analysis, so it takes more account of a word’s context and of the relationships between words in the data.

The proposed stages, case folding and lemmatization, still use the tokenized output from the previous preprocessing stage; the pipeline then continues with case folding -> stopword removal -> lemmatization, and once the data has passed through these stages it is used to apply the text laws.

1. First, we print the data after tokenization.

2. Next, write the case-folding code.
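A sketch of case folding on the tokenized data (the variable name set_data_casefolded is a placeholder):

```python
# Case folding: lower-case every token so 'The' and 'the' count as one term
set_data_casefolded = [
    [token.lower() for token in tokens]
    for tokens in set_data_tokenized
]
```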

3. Then, display 10 data elements after case folding.

4. Do stopword removal again, then print the data after stopword removal; the proposed lemmatization step follows (see the sketch below).
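A sketch of the renewed stopword removal followed by the proposed lemmatization, assuming NLTK’s WordNetLemmatizer (variable names are placeholders):

```python
from nltk.stem import WordNetLemmatizer

# WordNet data is needed once for lemmatization
nltk.download('wordnet')

# Stopword removal on the case-folded tokens
set_data_clean = [
    [token for token in tokens if token not in stop_words]
    for tokens in set_data_casefolded
]

# Lemmatization: map each token to its dictionary base form
lemmatizer = WordNetLemmatizer()
set_data_lemmatized = [
    [lemmatizer.lemmatize(token) for token in tokens]
    for tokens in set_data_clean
]
```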

5. Repeat the observations of Zipf’s law, Benford’s law, and Heap’s law using your proposed preprocessing stages.

6. Apply Zipf’s law.

7. We can see the frequency of each word.

8. We can also see the rank of each word.

9. Create the log-log graph.

As before, Zipf’s law counts each word’s frequency of occurrence in the txt file, which is why the data must first pass through the preprocessing stage. In the new rank table, the word “quot” is again the most frequent, now followed by “may” with the second-highest frequency, while the words “副總統選舉” and “aethilla” have the lowest frequency, appearing only once each. Visualizing the full ranking of word occurrences again produces a downward-sloping curve.

Plotting rank against frequency on a log-log graph once more yields a roughly linear but still rough curve, confirming that even after the proposed preprocessing the astrack.wiki data follows the distribution of Zipf’s law.

The conclusion of this practicum is that we have learned the general preprocessing stages of information retrieval and how to apply text laws such as Zipf’s, Benford’s, and Heap’s. To get good results, careful attention must be paid to the preprocessing stage before applying the chosen text law: as this practicum shows, rerunning the preprocessing stage and re-applying the laws changes the highest-frequency word under Zipf’s law as well as the vocabulary growth under Heap’s law.
