This is the official documentation of the Python package indepth. It is currently a work in progress; please check back later for the completed documentation.

Table of contents

Overview
Installation
remov_punct
wrdCnt
vocabSize
commonWrd
realWrds
MostSimilarSent
sbjVrbAgreement
modalRuleError
PrpDonot
VrbTenseAgreementError
a_an_error
motionVerbs
coherentWrds
hyponymPolysem_cnt
concretMeaningPOS
buildFeatures

Overview

This package is a collection of functions that enable the user to perform non-trivial tasks in the realm of natural language processing. Non-trivial tasks, in the context of this package, are tasks that entail sourcing scientific research papers and building one's own engineering solutions based on them. This package is in a constant state of evolution, and the author intends to add only those functions that solve non-trivial problems and are not available out of the box at the time of publication. Please read the contribution guidelines for more details about contributing to this package.

Functions 1 to 16 below aid in generating features from text to compare writing styles. Given a gold-standard text, these features help identify the text that most closely matches it. This package helps build the features but lets the user decide on the best algorithm for measuring the similarity of the texts.


remov_punct(s)

Input: string
Output: string
This function removes all punctuation marks and special characters (listed in the source code below) from the text. It takes a string as input and returns the same string without those characters.

source code
def remov_punct(withpunct):
    # characters to strip from the input string
    punctuations = set(['!','(',')','-','[',']','{','}',';',':',',','<','>','.','/','?','@','#','$','%','^','&','*','_','~',"\\"])
    without_punct = ""
    for char in withpunct:
        # keep only the characters that are not in the punctuation set
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

#example
remov_punct("The comma, disappears along with the period here.")
#output : The comma disappears along with the period here

source code interpretation

This function takes the input string and inspects one character at a time. If the character is a member of the set punctuations, then it is skipped. The final output is just the same as the input without the characters that are present in the set.
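For comparison, the same character-by-character filtering can be sketched with Python's built-in str.translate, which deletes every character mapped to None. This is an illustrative alternative, not part of the package; the punctuation list mirrors the set used in the source code above.

```python
# Illustrative alternative to remov_punct using str.translate.
# The characters below mirror the package's punctuation set ("\\" is a backslash).
PUNCTUATIONS = "!()-[]{};:,<>./?@#$%^&*_~\\"

def remov_punct_translate(withpunct):
    # str.maketrans("", "", chars) builds a table that deletes every char in chars
    return withpunct.translate(str.maketrans("", "", PUNCTUATIONS))

remov_punct_translate("The comma, disappears along with the period here.")
# 'The comma disappears along with the period here'
```

The translate-based version avoids the explicit loop and string concatenation, but both produce the same result.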


realWrds(s)

Input: string
Output: int
This function counts the number of real English words in the text. Words are looked up in the words list of nltk.corpus, as in the source code below.

source code
#invoke library
from nltk.corpus import words

#Function
def realWrds(s):
    # strip punctuation before tokenizing
    withoutpunct = remov_punct(s)
    wrds = withoutpunct.split(' ')
    # count the unique tokens that also appear in the nltk words corpus
    realwordCnt = len(set(wrds).intersection(set(words.words())))
    return realwordCnt

#example    
realWrds("There are 2 words here")
#output
2

source code interpretation

This function takes the input string and removes its punctuation by using the function remov_punct. The text is then split into words using a space separator. These words are matched against the words list of nltk.corpus to count the number of matching words. This function will be improved in future versions by looking into better thesaurus options. The example above shows the limitation of this function.
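The matching step, and its limitation, can be illustrated without downloading any nltk data by substituting a small hand-made word list (the toy lexicon below is hypothetical and merely stands in for words.words()):

```python
# Small stand-in for nltk's words.words(), used only for illustration.
TOY_LEXICON = {"there", "are", "words", "here"}

def real_words_toy(s):
    # tokenize on spaces, then count the unique tokens found in the lexicon
    tokens = s.split(' ')
    return len(set(tokens) & TOY_LEXICON)

real_words_toy("There are 2 words here")
# 3 -- "There" misses because the intersection is case-sensitive,
# and "2" is not in the lexicon
```

The case-sensitive set intersection is the same mechanism the package uses, which is why capitalized words such as "There" can fail to match.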


commonWrd(s)

Input: String
Output: Int
This function computes the common word score feature. The score is inspired by the Coh-Metrix index Familiarity referenced in the paper cited below. It assumes that using commonly used words makes a text easier for the reader to understand. The Brown corpus is used as a proxy for identifying commonly used words.

source code
#invoke libraries
from nltk.corpus import brown
from nltk import FreqDist

#Function
def commonWrd(s):
    sumWrdFreq = 0
    # frequency distribution of all (lower-cased) words in the Brown corpus
    all_text = brown.words()
    wrdfreq = FreqDist([w.lower() for w in all_text])
    # unique non-stopwords in the input string
    wrdlist = list(set(dropStopWrds(s)))
    for wrd in wrdlist:
        sumWrdFreq += wrdfreq[wrd.lower()]
    # average Brown-corpus frequency across the unique non-stopwords
    commonWrds_Score = sumWrdFreq / float(len(wrdlist))
    return commonWrds_Score
#example
commonWrd("How common are the words in this text. Let us find out with this example")
#output : 
392.75

source code interpretation

To compute the common word score, the frequency of each word in the Brown corpus is first computed and stored in the frequency distribution wrdfreq. Secondly, the stop words in the string passed to the function are suppressed. Thirdly, the frequency of each remaining word is looked up in wrdfreq. Finally, these counts are averaged over the number of unique non-stopwords in the string.
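The helper dropStopWrds used above is defined elsewhere in the package. A minimal sketch of what it presumably does is shown below; the stopword set here is a small hard-coded stand-in (an assumption for illustration), whereas the real helper presumably uses a full stopword list such as nltk's.

```python
# Illustrative sketch of a dropStopWrds-style helper. The stopword set is a
# small hypothetical stand-in, not the package's actual list.
STOPWORDS = {"how", "are", "the", "in", "this", "let", "us", "out", "with"}

def drop_stop_words_sketch(s):
    # tokenize on spaces and drop tokens whose lower-cased form is a stopword
    return [w for w in s.split(' ') if w.lower() not in STOPWORDS]

drop_stop_words_sketch("How common are the words in this text")
# ['common', 'words', 'text']
```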

Reference

McNamara, D.S., Crossley, S.A. & Roscoe, R. Natural language processing in an intelligent writing strategy tutoring system. Behav Res 45, 499–515 (2013). https://doi.org/10.3758/s13428-012-0258-1


vocabSize(s)

Input: String
Output: Int
This function computes the size of vocabulary. This score is inspired by Coh-Metrix index Lexical diversity referenced in the paper cited below.

source code
#function to measure the size of vocabulary after stop word suppression
def vocabSize(s):
    # strip punctuation, then drop stopwords
    withoutpunct = remov_punct(s)
    withoutStopwords = dropStopWrds(withoutpunct)
    # vocabulary size = number of unique remaining words
    vocabLen = len(set(withoutStopwords))
    return vocabLen
#example
vocabSize("The vocabulary size of this string is five")
#output : 
5

source code interpretation

The stopwords in the input string are suppressed, and the number of unique remaining words is counted. Note that unreal words are also counted as part of the vocabulary, which is how this function differs from realWrds.
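The counting step, including the fact that made-up words are counted, can be illustrated standalone (the stopword set below is a small hypothetical stand-in for the one dropStopWrds uses):

```python
# Hypothetical stopword set standing in for the package's dropStopWrds helper.
STOPWORDS = {"the", "of", "this", "is"}

def vocab_size_toy(s):
    # drop stopwords, then count unique remaining tokens
    tokens = [w for w in s.split(' ') if w.lower() not in STOPWORDS]
    return len(set(tokens))

vocab_size_toy("The vocabulary blargh of this blargh string is five")
# 4 -- {vocabulary, blargh, string, five}; the made-up word "blargh"
# counts toward the vocabulary even though it is not a real word
```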

Reference

McNamara, D.S., Crossley, S.A. & Roscoe, R. Natural language processing in an intelligent writing strategy tutoring system. Behav Res 45, 499–515 (2013). https://doi.org/10.3758/s13428-012-0258-1