NLP and text mining

Basic Structure of NLP

Chat message / text message -> NLP <-> Knowledge base
                               |
                               +-> Data Storage

Lexical ambiguity - a single word can carry different meanings; the intended meaning depends on the sentence it appears in.

eg: "She goes to the bank." Here "bank" can refer to a financial institution or a river bank.

Syntactical ambiguity - the grammatical structure of a sentence allows more than one interpretation.

eg: "The chicken is ready to eat." Either the chicken is ready for us to eat, or the chicken is ready for its own meal.

Referential ambiguity - a pronoun or reference can point to more than one possible antecedent.

eg: "The boy told his father about the theft. He was very upset." Here "He" could refer to either the boy or the father.

Tokenization

1) Breaks a complex sentence into words
2) Understands the importance of each word with respect to the sentence
3) Produces a structural description of the input sentence

code :

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # run once if the tokenizer data is missing
AI = "AI is changing every industry. AI systems learn from data."   # sample text (assumed); use any text of your own
AI_tokens = word_tokenize(AI)

Frequent words :

from nltk.probability import FreqDist

# count how often each word appears (lower-cased so that case does not matter)
fdist = FreqDist()
for words in AI_tokens:
    fdist[words.lower()] += 1
fdist

# display the top 10 most frequent words

fdist_top10 = fdist.most_common(10)

# display the number of paragraphs (blocks separated by blank lines)

from nltk.tokenize import blankline_tokenize
AI_blank = blankline_tokenize(AI)
len(AI_blank)   # number of paragraphs

Bigram - two consecutive tokens
Trigram - three consecutive tokens

from nltk.util import bigrams, trigrams, ngrams

string = "I'm passionate to work with data. Also more passionate to share my learning with others"

quotes_tokens = nltk.word_tokenize(string)

quotes_bigram = list(nltk.bigrams(quotes_tokens))
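
The same pattern works for trigrams and general n-grams; a minimal sketch reusing the quotes_tokens list from above:

quotes_trigram = list(nltk.trigrams(quotes_tokens))   # three consecutive tokens
quotes_ngram = list(nltk.ngrams(quotes_tokens, 4))    # n consecutive tokens, here n = 4
print(quotes_trigram[:2])
print(quotes_ngram[:2])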

Stemming:

Normalizes a word into its base form by removing prefixes or suffixes.

Eg : Affected, Affection, Affectionate will be stemmed to Affect

code :

from nltk.stem import PorterStemmer

pst = PorterStemmer()

pst.stem("having")   # returns 'have'

word=["give","giving","given","gave"]

The Porter stemmer produces the following output for the above list:

give-> give
giving->give
given->give
gave-> gave

whereas the Lancaster stemmer is more aggressive and produces the following:

from nltk.stem import LancasterStemmer

give-> giv
giving->giv
given->giv
gave-> gav
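
A minimal sketch of the loop that produces the two output lists above, applying both stemmers to the same word list:

from nltk.stem import PorterStemmer, LancasterStemmer

pst = PorterStemmer()
lst = LancasterStemmer()

for w in word:   # word = ["give", "giving", "given", "gave"]
    print(w, "->", pst.stem(w), "|", lst.stem(w))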

We also have the Snowball stemmer for language-specific stemming.
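
A short sketch with the SnowballStemmer, which takes the target language as an argument:

from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)   # languages supported by the Snowball stemmer
sbs = SnowballStemmer("english")
print(sbs.stem("giving"))          # give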

Lemmatization

Similar to stemming, but it returns a proper dictionary root word (lemma).

eg : gone and goes will map to go

code:

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # run once if the WordNet data is missing
word_lem = WordNetLemmatizer()

for words in word:   # reuse the word list from the stemming example
    print(words + ":" + word_lem.lemmatize(words))
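
By default lemmatize() treats every word as a noun; passing the part of speech gives better results for verbs. A small sketch:

print(word_lem.lemmatize("gone", pos="v"))   # go
print(word_lem.lemmatize("goes", pos="v"))   # go
print(word_lem.lemmatize("corpora"))         # corpus (noun is the default POS)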

POS tags : Part of speech. Specifies whether a word is a noun, a verb and so on.

STOP word :

Stop words are needed to form grammatical sentences, but they usually carry little meaning for NLP tasks, so they are often removed.

eg: above, below

code :

from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the stop word list is missing
stopwords.words("english")    # displays the stop words for the English language
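
A minimal sketch that filters the stop words out of the AI_tokens list from earlier:

stop_words = set(stopwords.words("english"))
filtered_tokens = [w for w in AI_tokens if w.lower() not in stop_words]
print(filtered_tokens)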


# use a regular expression to remove punctuation (. , ? ! etc.) and digits

import re
punctuation = re.compile(r'[,.|?!:;0-9]')

post_punctuation = []
for words in AI_tokens:
    word = punctuation.sub("", words)
    if len(word) > 0:
        post_punctuation.append(word)

POS :

Each word gets a tag describing its part of speech: noun, verb, singular, plural and so on.

eg: "The dog killed the bat". POS tagging of this sentence gives:

The - determiner
dog - noun
killed - verb
the - determiner
bat - noun

code :

sent="Timmith is natural when it comes to drawing"
sent_tokens=word_tokenize(sent)

for token in sent_tokens :
    print(nltk.pos_tag([token])) // will tag each word with POS
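
Tagging the whole token list in one call is usually better, since the tagger can then use neighbouring words as context (a small follow-up to the loop above):

print(nltk.pos_tag(sent_tokens))   # tags the full sentence in one pass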

Named Entity Recognition (NER):

It is a way of classifying nouns into entity types such as person, movie, organization and so on. Sometimes the classification is wrong, so a validation layer can be built on top of a knowledge graph such as the Google Knowledge Graph, IBM Watson or Wikipedia.

code :

from nltk import ne_chunk

NE_sent = "The US President stays in the White House"

NE_tokens = word_tokenize(NE_sent)   # tokenize the sentence
NE_tags = nltk.pos_tag(NE_tokens)    # POS-tag the tokens
NE_NER = ne_chunk(NE_tags)           # identify named entities from the POS tags
print(NE_NER)

Syntax :

Rules, principles and processes used to form sentences. With these rules we can build a syntax tree: the syntactic structure of a sentence, defining where each part of the sentence should go.

NLTK needs Ghostscript to render syntax trees as images (for example inline in a Jupyter notebook).
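
A minimal sketch of building a syntax tree from a toy grammar (the grammar rules below are made up purely for illustration). Printing the tree works as plain text; Ghostscript is only needed when the tree is rendered as an image.

code :

import nltk

# toy grammar, assumed only for this illustration
grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> DT NN
  VP -> VB NP
  DT -> 'the'
  NN -> 'dog' | 'bat'
  VB -> 'killed'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog killed the bat".split()):
    print(tree)   # text form of the syntax tree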

Chunking :

Grouping individual tokens into larger phrases (chunks), for example noun phrases.

eg: "we caught the black panther" will be chunked, using the POS tags, as

we - pronoun (a one-word noun phrase)
caught - verb
the - determiner     |
black - adjective    | => noun phrase (NP)
panther - noun       |


Code:

new = "we caught the blank panther"
new_token=nltk.pos_tag(word_tokenize(new))

grammar_np=r"NP: {<DT>?<JJ>*<NN>}"
chunk_parser=nltk.RegexParser(grammar_np)

chunk_result=chunk_parser.parse(new_tokens) // without ghost script it will throw error because of absence of ghost script but when u look at end you will see list that has done chunking


Implementing a machine learning algorithm :

import pandas as pd
import numpy as np

import os
from sklearn.feature_extraction.text import CountVectorizer

print(os.listdir(nltk.data.find("corpora")))    # list the corpora available to NLTK

from nltk.corpus import movie_reviews           # pick the movie review corpus

print(movie_reviews.categories())               # prints the two categories: ['neg', 'pos']

neg_rev = movie_reviews.fileids('neg')          # file ids of the 1000 negative reviews

rev = movie_reviews.words(neg_rev[0])           # peek at the words of one review file
rev_list = []
for rev in neg_rev:
    rev_text_neg = movie_reviews.words(rev)
    review_one_string = " ".join(rev_text_neg)
    review_one_string = review_one_string.replace(' ,', ',')    # remove the space before a comma
    review_one_string = review_one_string.replace(' .', '.')    # remove the space before a full stop
    rev_list.append(review_one_string)

# The same steps need to be done for the positive review category as well (appending its reviews to rev_list)

neg_target = np.zeros((1000,), dtype=int)   # label 0 for negative reviews
pos_target = np.ones((1000,), dtype=int)    # label 1 for positive reviews

target_list = []
for neg_tar in neg_target:
    target_list.append(neg_tar)
for pos_tar in pos_target:
    target_list.append(pos_tar)


y = pd.Series(target_list)

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True, stop_words='english', min_df=2)
x_count_vect = count_vect.fit_transform(rev_list)

x_count_vect.shape   # prints (2000, 16228): 2000 reviews, 16228 features

x_names = count_vect.get_feature_names_out()   # use get_feature_names() on older scikit-learn versions

x_count_vect = pd.DataFrame(x_count_vect.toarray(), columns=x_names)



from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix 


x_train_cv,x_test_cv,y_train_cv,y_test_cv=train_test_split(x_count_vect,y,test_size=0.25,random_state=5)

x_train_cv.shape
(1500,16228)

x_test_cv.shape
(500,16228)

from sklearn.naive_bayes import GaussianNB

gnb=GaussianNB()

y_pred_gnb=gnb.fit(x_train_cv,y_train_cv).predict(x_test_cv)

from sklearn.naive_bayes import MultinomialNB

clf_cv = MultinomialNB()

clf_cv.fit(x_train_cv, y_train_cv)

y_pred_cv = clf_cv.predict(x_test_cv)

print(metrics.accuracy_score(y_test_cv, y_pred_cv))   # if this prints 1.0, suspect overfitting or data leakage

score_clf_cv = confusion_matrix(y_test_cv, y_pred_cv)   # confusion matrix of actual vs predicted labels
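
A short sketch for inspecting the result, assuming the variables above; classification_report adds per-class precision and recall:

print(score_clf_cv)                                           # rows = actual class, columns = predicted class
print(metrics.classification_report(y_test_cv, y_pred_cv))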


