Basic Structure of NLP
Chat message / text message -> NLP <-> Knowledge base -> Data storage
Lexical ambiguity - a single word can have different meanings; the intended meaning depends on the sentence it is used in.
e.g.: "She goes to the bank." Bank can refer to a financial institution or a river bank.
Syntactical ambiguity - the grammatical structure allows the same sentence to be read in different ways.
e.g.: "The chicken is ready to eat." Either the chicken is ready for us to eat, or the chicken is ready to have its own meal.
Referential ambiguity - a pronoun can refer to more than one thing mentioned earlier.
e.g.: "The boy told his father about the theft. He was very upset." Here "He" could refer to either the boy or the father.
Tokenization
1) Breaks a complex sentence into words
2) Understands the importance of each word with respect to the sentence
3) Produces a structural description of the input sentence
code :
from nltk.tokenize import word_tokenize
AI_tokens = word_tokenize(AI)
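The AI variable is never defined in these notes; a minimal runnable sketch, assuming AI is just any block of sample text:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models, needed once
AI = "NLP helps machines understand human language. It powers chatbots and search.\n\nThese notes walk through the basics with NLTK."  # placeholder text
AI_tokens = word_tokenize(AI)
print(AI_tokens)       # the individual word and punctuation tokens
print(len(AI_tokens))  # how many tokens the text contains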
Frequent words :
from nltk.probability import FreqDist
fdist = FreqDist()
# count the occurrences of each word (lower-cased)
for words in AI_tokens:
    fdist[words.lower()] += 1
fdist
# display the top 10 most frequent words
fdist_top10 = fdist.most_common(10)
# split the text into paragraphs (separated by blank lines)
from nltk.tokenize import blankline_tokenize
AI_blank = blankline_tokenize(AI)
len(AI_blank)  # number of paragraphs
Bigram - two consecutive tokens
Trigram - three consecutive tokens
from nltk.util import bigrams, trigrams, ngrams
string = "I'm passionate to work with data. Also more passionate to share my learning with others"
quotes_tokens = nltk.word_tokenize(string)
quotes_bigrams = list(nltk.bigrams(quotes_tokens))
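The same pattern gives trigrams and arbitrary n-grams; a small sketch reusing quotes_tokens from above:
quotes_trigrams = list(trigrams(quotes_tokens))  # all triples of consecutive tokens
quotes_4grams = list(ngrams(quotes_tokens, 4))   # n-grams for any n, here n = 4
print(quotes_bigrams[:3])
print(quotes_trigrams[:3])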
Stemming:
Normalize a word into its base form. It works by removing prefixes or suffixes from words.
e.g.: Affected, Affection, Affectionate will be stemmed to Affect
code :
from nltk.stem import PorterStemmer
pst = PorterStemmer()
pst.stem("having")  # this will print 'have'
words_to_stem = ["give", "giving", "given", "gave"]
Porter stemmer will provide the following output for the above list:
give -> give
giving -> give
given -> given
gave -> gave
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
Lancaster stemmer output for the same list:
give -> giv
giving -> giv
given -> giv
gave -> gav
For language-specific stemming there is the SnowballStemmer.
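A small sketch of the language-specific SnowballStemmer (assumed usage; the supported languages are listed on the class itself):
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)  # tuple of supported languages, e.g. 'english', 'german', ...
sbst = SnowballStemmer("english")
print(sbst.stem("having"))        # have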
Lemmatization
Similar to stemming, but it maps the word to a proper root word (a real dictionary word).
e.g.: gone and goes will map to go
code:
from nltk.stem import wordnet
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()
for words in words_to_stem:  # the same word list used for stemming above
    print(words + ":" + word_lem.lemmatize(words))
POS tags: Part of Speech. Specifies whether a word is a noun, a verb, and so on.
Stop words:
Stop words are useful for forming a sentence, but they do not carry much meaning for NLP.
e.g.: above, below
code :
from nltk.corpus import stopwords
stopwords.words("english")  # will display the stop words for the English language
# use a regular expression to remove punctuation such as . , etc.
import re
punctuation = re.compile(r'[,.|?!:;0-9]')
post_punctuation = []
for words in AI_tokens:
    word = punctuation.sub("", words)
    if len(word) > 0:
        post_punctuation.append(word)
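A short sketch tying the two steps together: strip punctuation first, then drop the English stop words (reuses AI_tokens from the tokenization section):
stop_words = set(stopwords.words("english"))
clean_tokens = []
for words in AI_tokens:
    word = punctuation.sub("", words)                      # remove punctuation and digits
    if len(word) > 0 and word.lower() not in stop_words:   # keep only meaningful words
        clean_tokens.append(word)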
POS :
Tags and the definition of each tag, such as noun, plural, singular and so on.
eg : "The dog killed the bat" . Here this sentence is derived from POS as
The - determiner
dog - noun
killed - verb
the - determiner
bat - noun
code :
sent="Timmith is natural when it comes to drawing"
sent_tokens=word_tokenize(sent)
for token in sent_tokens :
print(nltk.pos_tag([token])) // will tag each word with POS
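pos_tag is normally called on the whole token list in one go, since it uses the surrounding words as context; a minimal sketch reusing sent_tokens from above:
print(nltk.pos_tag(sent_tokens))  # list of (word, tag) pairs for the full sentence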
Named Entity Recognition (NER):
A way to classify a noun as a person, movie, organization, etc. It can sometimes classify wrongly, so a validation layer is often built on top using a knowledge graph such as the Google Knowledge Graph, IBM Watson or Wikipedia.
code :
from nltk import ne_chunk
NE_sent = "The US President stays in the White House"
NE_tokens = word_tokenize(NE_sent)  # tokenize the sentence
NE_tags = nltk.pos_tag(NE_tokens)   # POS-tag the tokens
NE_NER = ne_chunk(NE_tags)          # identify named entities with ne_chunk
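ne_chunk returns an nltk Tree; a small sketch (a helper loop assumed here, not from the notes) that pulls the recognised entities out of it:
for subtree in NE_NER:
    if hasattr(subtree, "label"):                             # named entities come back as nested subtrees
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)                  # prints the entity type and its text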
Syntax :
Rules, principles and processes used to form a sentence. With these rules we can build a syntax tree: the syntactic structure of a sentence, defining where each part of the sentence should go.
To draw the syntax tree (e.g. inside a notebook), NLTK needs Ghostscript installed.
Chunking :
Picking up individual pieces of information and grouping the tokens into bigger chunks (for example, noun phrases).
eg : "we caught the blank panther" . will be classified from POS as
we - Determiner - noun PP
caught - verb
the - Determiner |
blank - adjective | => noun PP
panther - noun |
Code:
new = "we caught the blank panther"
new_token=nltk.pos_tag(word_tokenize(new))
grammar_np=r"NP: {<DT>?<JJ>*<NN>}"
chunk_parser=nltk.RegexParser(grammar_np)
chunk_result=chunk_parser.parse(new_tokens) // without ghost script it will throw error because of absence of ghost script but when u look at end you will see list that has done chunking
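Printing the result avoids the Ghostscript dependency, because chunk_result is a plain nltk Tree; a sketch:
print(chunk_result)    # text form of the tree; (NP ...) groups mark the chunked noun phrases
# chunk_result.draw()  # the graphical rendering is the part that needs Ghostscript / a GUI window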
Implementing a machine learning algorithm:
import os
import nltk
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
print(os.listdir(str(nltk.data.find("corpora"))))  # list the whole bunch of available corpora
from nltk.corpus import movie_reviews  # pick the movie_reviews corpus
print(movie_reviews.categories())  # print the two categories: pos (positive) and neg (negative)
neg_rev = movie_reviews.fileids('neg')  # file ids of the negative reviews
rev = nltk.corpus.movie_reviews.words("/usr/tmp/pochoe.csv")  # words of a single file, given its id/path
rev_list = []
for rev in neg_rev:
    rev_text_neg = movie_reviews.words(rev)                    # words of one negative review
    review_one_string = " ".join(rev_text_neg)                 # join the words back into a single string
    review_one_string = review_one_string.replace(" ,", ",")   # remove the space before each comma
    review_one_string = review_one_string.replace(" .", ".")   # remove the space before each full stop
    rev_list.append(review_one_string)
# the same steps need to be done for the positive review category as well
neg_target = np.zeros((1000,), dtype=int)  # label 0 for the 1000 negative reviews
pos_target = np.ones((1000,), dtype=int)   # label 1 for the 1000 positive reviews
target_list = []
for neg_tar in neg_target:
    target_list.append(neg_tar)
for pos_tar in pos_target:
    target_list.append(pos_tar)
y = pd.Series(target_list)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(lowercase=True, stop_words='english', min_df=2)
x_count_vect = count_vect.fit_transform(rev_list)
x_count_vect.shape  # will print (2000, 16228)
x_names = count_vect.get_feature_names()  # get_feature_names_out() in newer scikit-learn
x_count_vect = pd.DataFrame(x_count_vect.toarray(), columns=x_names)
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
x_train_cv,x_test_cv,y_train_cv,y_test_cv=train_test_split(x_count_vect,y,test_size=0.25,random_state=5)
x_train_cv.shape
(1500,16228)
x_test_cv.shape
(500,16228)
from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()
y_pred_gnb=gnb.fit(x_train_cv,y_train_cv).predict(x_test_cv)
from sklearn.naive_bayes import MultinomialNB
clf_cv = MultinomialNB()
clf_cv.fit(x_train_cv, y_train_cv)
y_pred_cv = clf_cv.predict(x_test_cv)
print(metrics.accuracy_score(y_test_cv, y_pred_cv))  # an accuracy of 1.0 here would suggest overfitting
score_clf_cv = confusion_matrix(y_test_cv, y_pred_cv)  # confusion matrix of actual vs predicted labels
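The GaussianNB predictions from above are never scored; a short sketch comparing the two models, reusing the variables defined in this section:
print("GaussianNB accuracy:", metrics.accuracy_score(y_test_cv, y_pred_gnb))
print("MultinomialNB accuracy:", metrics.accuracy_score(y_test_cv, y_pred_cv))
print(confusion_matrix(y_test_cv, y_pred_cv))  # rows = actual class, columns = predicted class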