NLP and text mining

Basic Structure of NLP

Chat message / text message -> NLP <-> Knowledge base
                               |
                               +-> Data Storage

Lexical ambiguity - a single word can carry different meanings; the intended meaning depends on the sentence it appears in.

eg: "She goes to the bank." Here "bank" can refer to a financial institution or a river bank.

Syntactical ambiguity - the grammatical structure of a sentence allows more than one interpretation.

eg: "The chicken is ready to eat." Either the chicken is ready for us to eat, or the chicken is ready for its own meal.

Referential ambiguity - a pronoun or reference can point to more than one possible antecedent.

eg: "The boy told his father about the theft. He was very upset." Here "He" could refer to either the boy or the father.

Tokenization

1) Breaks a complex sentence into words
2) Understands the importance of each word with respect to the sentence
3) Produces a structural description of the input sentence

code :

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # run once if the tokenizer data is missing
AI = "AI is changing every industry. AI systems learn from data."   # sample text (assumed); use any text of your own
AI_tokens = word_tokenize(AI)

Frequent words :

from nltk.probability import FreqDist

# count how often each word appears (lower-cased so that case does not matter)
fdist = FreqDist()
for words in AI_tokens:
    fdist[words.lower()] += 1
fdist

# display the top 10 most frequent words

fdist_top10 = fdist.most_common(10)

# display the number of paragraphs (blocks separated by blank lines)

from nltk.tokenize import blankline_tokenize
AI_blank = blankline_tokenize(AI)
len(AI_blank)   # number of paragraphs

Bigram - two consecutive tokens
Trigram - three consecutive tokens

from nltk.util import bigrams, trigrams, ngrams

string = "I'm passionate to work with data. Also more passionate to share my learning with others"

quotes_tokens = nltk.word_tokenize(string)

quotes_bigram = list(nltk.bigrams(quotes_tokens))
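
The same pattern works for trigrams and general n-grams; a minimal sketch reusing the quotes_tokens list from above:

quotes_trigram = list(nltk.trigrams(quotes_tokens))   # three consecutive tokens
quotes_ngram = list(nltk.ngrams(quotes_tokens, 4))    # n consecutive tokens, here n = 4
print(quotes_trigram[:2])
print(quotes_ngram[:2])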

Stemming:

Normalizes a word into its base form by removing prefixes or suffixes.

Eg : Affected, Affection, Affectionate will be stemmed to Affect

code :

from nltk.stem import PorterStemmer

pst = PorterStemmer()

pst.stem("having")   # returns 'have'

word=["give","giving","given","gave"]

The Porter stemmer produces the following output for the above list:

give-> give
giving->give
given->give
gave-> gave

whereas the Lancaster stemmer is more aggressive and produces the following:

from nltk.stem import LancasterStemmer

give-> giv
giving->giv
given->giv
gave-> gav
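
A minimal sketch of the loop that produces the two output lists above, applying both stemmers to the same word list:

from nltk.stem import PorterStemmer, LancasterStemmer

pst = PorterStemmer()
lst = LancasterStemmer()

for w in word:   # word = ["give", "giving", "given", "gave"]
    print(w, "->", pst.stem(w), "|", lst.stem(w))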

We also have the Snowball stemmer for language-specific stemming.
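
A short sketch with the SnowballStemmer, which takes the target language as an argument:

from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)   # languages supported by the Snowball stemmer
sbs = SnowballStemmer("english")
print(sbs.stem("giving"))          # give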

Lemmatization

Similar to stemming, but it returns a proper dictionary root word (lemma).

eg : gone and goes will map to go

code:

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # run once if the WordNet data is missing
word_lem = WordNetLemmatizer()

for words in word:   # reuse the word list from the stemming example
    print(words + ":" + word_lem.lemmatize(words))
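
By default lemmatize() treats every word as a noun; passing the part of speech gives better results for verbs. A small sketch:

print(word_lem.lemmatize("gone", pos="v"))   # go
print(word_lem.lemmatize("goes", pos="v"))   # go
print(word_lem.lemmatize("corpora"))         # corpus (noun is the default POS)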

POS tags : Part of speech. Specifies whether a word is a noun, a verb and so on.

STOP word :

Stop words are needed to form grammatical sentences, but they usually carry little meaning for NLP tasks, so they are often removed.

eg: above, below

code :

from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the stop word list is missing
stopwords.words("english")    # displays the stop words for the English language
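
A minimal sketch that filters the stop words out of the AI_tokens list from earlier:

stop_words = set(stopwords.words("english"))
filtered_tokens = [w for w in AI_tokens if w.lower() not in stop_words]
print(filtered_tokens)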


# use a regular expression to remove punctuation (. , ? ! etc.) and digits

import re
punctuation = re.compile(r'[,.|?!:;0-9]')

post_punctuation = []
for words in AI_tokens:
    word = punctuation.sub("", words)
    if len(word) > 0:
        post_punctuation.append(word)

POS :

Each word gets a tag describing its part of speech: noun, verb, singular, plural and so on.

eg: "The dog killed the bat". POS tagging of this sentence gives:

The - determiner
dog - noun
killed - verb
the - determiner
bat - noun

code :

sent="Timmith is natural when it comes to drawing"
sent_tokens=word_tokenize(sent)

for token in sent_tokens :
    print(nltk.pos_tag([token])) // will tag each word with POS
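
Tagging the whole token list in one call is usually better, since the tagger can then use neighbouring words as context (a small follow-up to the loop above):

print(nltk.pos_tag(sent_tokens))   # tags the full sentence in one pass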

Named Entity Recognition (NER):

It is a way of classifying nouns into entity types such as person, movie, organization and so on. Sometimes the classification is wrong, so a validation layer can be built on top of a knowledge graph such as the Google Knowledge Graph, IBM Watson or Wikipedia.

code :

from nltk import ne_chunk

NE_sent = "The US President stays in the White House"

NE_tokens = word_tokenize(NE_sent)   # tokenize the sentence
NE_tags = nltk.pos_tag(NE_tokens)    # POS-tag the tokens
NE_NER = ne_chunk(NE_tags)           # identify named entities from the POS tags
print(NE_NER)

Syntax :

Rules, principles and processes used to form sentences. With these rules we can build a syntax tree: the syntactic structure of a sentence, defining where each part of the sentence should go.

NLTK needs Ghostscript to render syntax trees as images (for example inline in a Jupyter notebook).
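
A minimal sketch of building a syntax tree from a toy grammar (the grammar rules below are made up purely for illustration). Printing the tree works as plain text; Ghostscript is only needed when the tree is rendered as an image.

code :

import nltk

# toy grammar, assumed only for this illustration
grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> DT NN
  VP -> VB NP
  DT -> 'the'
  NN -> 'dog' | 'bat'
  VB -> 'killed'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog killed the bat".split()):
    print(tree)   # text form of the syntax tree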

Chunking :

Grouping individual tokens into larger phrases (chunks), for example noun phrases.

eg: "we caught the black panther" will be chunked, using the POS tags, as

we - pronoun (a one-word noun phrase)
caught - verb
the - determiner     |
black - adjective    | => noun phrase (NP)
panther - noun       |


Code:

new = "we caught the blank panther"
new_token=nltk.pos_tag(word_tokenize(new))

grammar_np=r"NP: {<DT>?<JJ>*<NN>}"
chunk_parser=nltk.RegexParser(grammar_np)

chunk_result=chunk_parser.parse(new_tokens) // without ghost script it will throw error because of absence of ghost script but when u look at end you will see list that has done chunking


Implementing a machine learning algorithm :

import pandas as pd
import numpy as np

import os
from sklearn.feature_extraction.text import CountVectorizer

print(os.listdir(nltk.data.find("corpora")))    # list the corpora available to NLTK

from nltk.corpus import movie_reviews           # pick the movie review corpus

print(movie_reviews.categories())               # prints the two categories: ['neg', 'pos']

neg_rev = movie_reviews.fileids('neg')          # file ids of the 1000 negative reviews

rev = movie_reviews.words(neg_rev[0])           # peek at the words of one review file
rev_list = []
for rev in neg_rev:
    rev_text_neg = movie_reviews.words(rev)
    review_one_string = " ".join(rev_text_neg)
    review_one_string = review_one_string.replace(' ,', ',')    # remove the space before a comma
    review_one_string = review_one_string.replace(' .', '.')    # remove the space before a full stop
    rev_list.append(review_one_string)

# The same steps need to be done for the positive review category as well (appending its reviews to rev_list)

neg_target = np.zeros((1000,), dtype=int)   # label 0 for negative reviews
pos_target = np.ones((1000,), dtype=int)    # label 1 for positive reviews

target_list = []
for neg_tar in neg_target:
    target_list.append(neg_tar)
for pos_tar in pos_target:
    target_list.append(pos_tar)


y = pd.Series(target_list)

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True, stop_words='english', min_df=2)
x_count_vect = count_vect.fit_transform(rev_list)

x_count_vect.shape   # prints (2000, 16228): 2000 reviews, 16228 features

x_names = count_vect.get_feature_names_out()   # use get_feature_names() on older scikit-learn versions

x_count_vect = pd.DataFrame(x_count_vect.toarray(), columns=x_names)



from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix 


x_train_cv,x_test_cv,y_train_cv,y_test_cv=train_test_split(x_count_vect,y,test_size=0.25,random_state=5)

x_train_cv.shape
(1500,16228)

x_test_cv.shape
(500,16228)

from sklearn.naive_bayes import GaussianNB

gnb=GaussianNB()

y_pred_gnb=gnb.fit(x_train_cv,y_train_cv).predict(x_test_cv)

from sklearn.naive_bayes import MultinomialNB

clf_cv = MultinomialNB()

clf_cv.fit(x_train_cv, y_train_cv)

y_pred_cv = clf_cv.predict(x_test_cv)

print(metrics.accuracy_score(y_test_cv, y_pred_cv))   # if this prints 1.0, suspect overfitting or data leakage

score_clf_cv = confusion_matrix(y_test_cv, y_pred_cv)   # confusion matrix of actual vs predicted labels
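
A short sketch for inspecting the result, assuming the variables above; classification_report adds per-class precision and recall:

print(score_clf_cv)                                           # rows = actual class, columns = predicted class
print(metrics.classification_report(y_test_cv, y_pred_cv))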


