MINING CAPSTONE TASK 1
In this task I used Python and the toolkits provided to obtain the
results. They proved highly useful for getting an overview of the topics
discussed in the reviews.
Specifically, I used the gensim and sklearn packages for the topic
extraction process.
TASK 1.1 TOPIC MINING OF ALL REVIEWS
To understand what the reviewers were talking about, I used the LDA topic
model to extract 10 topics from all the restaurant reviews in the data.
To vectorize the review data I applied TfidfVectorizer.
The transformation produced results that were linear, and I used IDF
reweighting with the n-gram range set to 1 to 2. As a result, the extracted
terms consist of either one or two words.
To visualize the data, I used D3.
To represent the topic models effectively, I chose word cloud
visualization, using different font sizes to represent the significance of each
term in any given topic model.
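As an illustration of how term significance could be turned into font sizes, the sketch below maps each term's weight linearly onto a pixel range. The linear scaling, the function name font_sizes, and the 12 to 48 pixel range are assumptions for the example; the actual D3 word cloud may have used a different scale.

```python
def font_sizes(term_weights, min_px=12, max_px=48):
    """Map (term, weight) pairs to font sizes in pixels by linear scaling."""
    weights = [weight for _, weight in term_weights]
    low, high = min(weights), max(weights)
    span = high - low

    sizes = {}
    for term, weight in term_weights:
        # If all weights are equal, give every term the largest size.
        scale = (weight - low) / span if span else 1.0
        sizes[term] = round(min_px + scale * (max_px - min_px))
    return sizes


# Hypothetical weights for one topic.
print(font_sizes([("pizza", 3.0), ("crust", 1.0), ("cheese", 2.0)]))
```

The resulting term-to-size mapping could then be exported, for example as JSON, for the visualization layer to consume.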
Some observations from the data are as follows:
From the visualized data it can be seen that topics 6, 7 and 9 strongly
emphasize a specific cuisine or food, such as pizza or Chinese food. At the
same time, topics such as 1, 4 and 9 talk about foods or drinks such as fish
chips, chicken and “food drinks”.
The representation shows that the comments towards the restaurants are
mostly positive, as indicated by topics 0 and 2.
Topic 3 shows that time is a very important topic that comes up when
customers review a restaurant.
Graphical representation of the topics mined from the raw restaurant data
TASK 1.2 TOPIC MINING OF POSITIVE AND NEGATIVE REVIEWS
In exploring the topic distribution for subsets of the reviews I obtained
certain results. Specifically, this task required observations on the subsets
of positive reviews and negative reviews.
For the positive reviews I used reviews with star number >= 4, while
for the negative reviews I used reviews with star number =