Name: Textual Classification for Sentiment Detection. Brand Reputation Analysis on the Web using Natural Language Processing and Machine Learning
Price: 27.95 EUR
Availability: InStock
Author: Mike Nkongolo
ISBN: 978-3-668-70168-7

Cloud computing makes it possible to build scalable machine learning systems for processing massive amounts of complex data, whether structured or unstructured, real-time or historical: the so-called Big Data. Public cloud computing platforms are available, for instance Amazon EC2, EMR, and Google Compute Engine. More importantly, open source APIs and libraries have also been developed for ease of programming on the cloud, for instance Cascading, Storm, Scalding, Apache Spark and Trackur. Meanwhile, computational intelligence approaches, such as evolutionary computation, immune-inspired approaches, and swarm intelligence, are also employed to develop scalable machine learning and data analytics tools.

In this project, we present the sentiment-focused web crawling problem and design a sentiment-focused web crawler framework for faster discovery and retrieval of sentiment-bearing content on the Web. We have developed a computational framework to perform automated reputation analysis on the Web using Natural Language Processing and Machine Learning. This paper introduces the framework and tests its performance on automated sentiment analysis for brand reputation. In addition, we propose different strategies for predicting the polarity scores of web pages.

Experiments have shown that our proposed framework is more efficient than existing frameworks. Reputation analysis is a useful application for organizations that want to know people's opinions about their products and services.

Our approach consists of four parts. In the first part, the framework performs Web crawling based on the query specified by the user. In the second part, the framework locates relevant information within textual data using Entity Recognition. In the third part, relevant information is recorded in the database for feature extraction/engineering and classification. Lastly, the framework displays the data for reputation analysis. In the training phase, we used data provided by the marketing team of the University of the Witwatersrand, emoticons, a subset of the SentiStrength lexicon and the ClueWeb09 dataset. Each domain was labelled accordingly (positive, negative or neutral), with equal numbers of plain-text examples per polarity. In the test phase, the classifier predicted the polarity of real-time data. We used accuracy as the evaluation metric to measure how precisely our classifier performed.

Excerpt

Abstract

Cloud computing makes it possible to build scalable machine learning systems for processing massive amounts of complex data, whether structured or unstructured, real-time or historical: the so-called Big Data. Public cloud computing platforms are available, for instance Amazon EC2, EMR, and Google Compute Engine. More importantly, open source APIs and libraries have also been developed for ease of programming on the cloud, for instance Cascading, Storm, Scalding, Apache Spark and Trackur. Meanwhile, computational intelligence approaches, such as evolutionary computation, immune-inspired approaches, and swarm intelligence, are also employed to develop scalable machine learning and data analytics tools. In this project, we present the sentiment-focused web crawling problem and design a sentiment-focused web crawler framework for faster discovery and retrieval of sentiment-bearing content on the Web. We have developed a computational framework to perform automated reputation analysis on the Web using Natural Language Processing and Machine Learning. This paper introduces the framework and tests its performance on automated sentiment analysis for brand reputation. In addition, we propose different strategies for predicting the polarity scores of web pages. Experiments have shown that our proposed framework is more efficient than existing frameworks. Reputation analysis is a useful application for organizations that want to know people's opinions about their products and services. Our approach consists of four parts. In the first part, the framework performs Web crawling based on the query specified by the user. In the second part, the framework locates relevant information within textual data using Entity Recognition. In the third part, relevant information is recorded in the database for feature extraction/engineering and classification. Lastly, the framework displays the data for reputation analysis. In the training phase, we used data provided by the marketing team of the University of the Witwatersrand, emoticons, a subset of the SentiStrength lexicon and the ClueWeb09 dataset. Each domain was labelled accordingly (positive, negative or neutral), with equal numbers of plain-text examples per polarity. In the test phase, the classifier predicted the polarity of real-time data. We used accuracy as the evaluation metric to measure how precisely our classifier performed. Additionally, we included negation detection to improve the accuracy of the classifier, and we observed better results in both the training and test stages.

Keywords: Sentiment Detection, Reputation Analysis, Web Crawling.

Résumé

The Internet is growing at lightning speed, and the data stored on it is vast. Its growth has created an enormous source of data. This data can contain sentiment information, notably about how people think about different issues. Nowadays, people's opinions play a leading role in industry. That is why companies large and small are studying automatic methods for retrieving the information they need from the large volumes of data stored on the Web. Automated corporate reputation analysis is an effective way of addressing this kind of problem. It automatically determines how keywords, terms, or user-generated content may harm a brand name, product, or company mentioned in a text. Automated reputation analysis relies on sentiment detection, which involves advanced methods such as machine learning and natural language processing to capture polarity (positive, negative or neutral) from plain text. This research focuses on Web crawling for the automated analysis of corporate reputation on the Internet. An automated reputation analysis is carried out on the University of the Witwatersrand to study its popularity on the Internet. There is a wide range of fields for which information can be retrieved. This research studies sentiment about Wits using publicly available data. It presents a new perspective on focused Web crawling. We have proposed a Web-oriented crawling system to facilitate the rapid discovery of sentiment content. The proposed system can be applied generically to collect, process, and display the reputation of different brands and companies in real time. This study also describes tools that enable the development of technologies supporting text processing, in order to speed up sentiment detection for corporate/brand reputation analysis. With this in mind, we propose an application that simulates the proposed system. This system is clearly defined to perform focused Web crawling.

Keywords: Sentiment Detection, Reputation Analysis, Web Crawling.

Acknowledgements

Firstly, I would like to thank my supervisor, Professor Turgay Celik, for his

advice and guidance throughout the research and writing process.


Contents

Abstract
Résumé
Acknowledgements
Introduction
1.1 Aims and Objectives of the Research
1.2 System Architecture
1.2.1 Schematic Description of the Architecture
1.3 Literature Review
1.4 Textual Data Retrieval
1.5 Sentiment Analysis, NLP and Machine Learning
1.5.1 N-gram
1.5.2 Bag-of-Words
1.5.3 Autoencoders
1.5.4 Learning Algorithms
1.5.5 Neural Networks (NNs)
1.5.6 Named Entity Recognition (NER)
1.6 Evaluating the System
1.6.1 Evaluating Coverage
1.6.2 Evaluating Accuracy
1.6.3 F-Measure
1.6.4 Accuracy
1.7 Content Extraction
1.7.1 Word Extraction
1.7.2 Training Phase
1.7.3 Emoticons
1.7.4 Our Training Approach
1.7.5 Training
1.7.6 Analysis
1.8 Graphical User Interface
1.9 Results
1.10 Testing
1.10.1 Completeness of ANN (Accuracy)
1.11 Empirical Testing of ANN
1.12 Discussion
1.13 Conclusion

List of Figures

1.1 The platform hierarchy.
1.2 Reputation Mining (Morinaga et al., 2002).
1.3 The architecture of the semantic content analysis framework (Musto et al., 2015).
1.4 COBRA architecture (Spangler et al., 2009).
1.5 Data collection and labelling.
1.6 N-gram model.
1.7 BoW model.
1.8 Autoencoder.
1.9 Supervised Classification.
1.10 Basic structure of Neural Networks.
1.11 Named Entity Recognition.
1.12 A text extraction sample by BoilerPipe.
1.13 The pre-processing step of a Web page using Stanford POS tagger.
1.14 A sample of the Wits marketing dataset.
1.15 Emoticons and their variations (Read, 2005).
1.16 Graph of positive features (x axis) vs. the cost (y axis).
1.17 Graph of negative features (x axis) vs. the cost function (y axis).
1.18 Graph of cost vs. the number of iterations.
1.19 Home page of the web-based application. This figure shows the initial screen of the application. On this page, the user is required to perform reputation analysis.
1.20 Log-in screen of the web-based application. Once the user has agreed to perform reputation analysis, they have access to the log-in screen.
1.21 Results for: "Wits University".
1.22 Results for: "Wits students protest".
1.23 Learning process of ANN. Features (x axis) and Time (y axis).
1.24 Negative Features and the Bag-of-words performance.

List of Tables

1.1 Binary Confusion Matrix.
1.2 BoilerPipe extraction performance.
1.3 The words-list.
1.4 List of English Stop Words.
1.5 Evaluation metrics for ANN.
1.6 Confusion Matrix approximation.
1.7 The Bag-of-Words performance.
1.8 The meanings of colors for the GUI.
1.9 Predicting the Web page polarity.
1.10 Polarity prediction using SentiStrength.
1.11 The Bag-of-Words performance.
1.12 Confusion Matrix for ANN.
1.13 Approximation of the ANN Confusion Matrix. The empirical error approximation ranges from 0.0569% to 0.13%.
1.14 Performance of ANN for Testing and Training.

Nomenclature

$(x^{(i)}, y^{(i)})$  The $i$-th training example

$h_{W,b}(x)$  Output of our hypothesis on input $x$, using parameters $W$, $b$ (this should be a vector of the same dimension as the target value)

Learning rate (a real number)

The parameter of the encoder (recognition model)

The parameter of the decoder (generative model)

Specific hypothesis function parameter (a real number)

$a_i^{(l)}$  Activation/output of unit $i$ in layer $l$ of the network ($a^{(1)} = x$)

$g(\cdot)$  Neural network activation function (returns a real number)

Sentimentality (s: sentences with positivity, s: sentences with negativity and s: sentences with neutrality)

The average of sentimentality scores of sentences (S is the set of sentences in page p, where SS - s - 3))

$W_{ij}^{(l)}$  The parameter associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l + 1$

$x$  Input features for a training example (real-valued)

$y$  Output/target values

$z_i^{(l)}$  Total weighted sum of inputs to unit $i$ in layer $l$ ($a^{(l)} = f(z^{(l)})$)

An upper case, boldface letter is a matrix; an upper case, light (non-boldface) letter is a set. A lower case, boldface letter is a vector.

Dedication

Dedicated to my supervisor, Prof. Turgay Celik and my father Prof. Jean-

Paul Nkongolo Mukendi.

Epigraph

The more you try, the luckier

you get.

Cédric Villani

Chapter 1

Introduction

Knowing the reputation of your own products or competitors' products is im-

portant for marketing and customer relationship management. Questionnaire

surveys are conducted for this purpose, and open questions are generally used

in the hope of gaining valuable information about corporate/brand reputa-

tions. It is very costly to gather and analyze the large volume of high quality

survey data, which is necessary for meaningful brand reputation analysis.

One approach which promises to reduce costs in this regard is to automati-

cally extract opinions about specific products from the Internet. The purpose

of this research is to provide a framework for automatically collecting and

analyzing opinions for reputation analysis on the Internet.

Organizations and companies need to know how the public feels about, and judges, their products or services. Traditionally, this has meant running opinion polls or surveying a target group. With the spread of Internet usage, a significant repository of textual opinions and reviews has been created. The most popular sources are C-Net, IMDB, Amazon, Rotten Tomatoes, Twitter and Facebook (Zeinalipour-Yazti et al., 2004). Our research concentrates on brand reputation analysis, with Wits (the University of the Witwatersrand) as a case study.

The availability of these textual opinions has changed the information-gathering process. It is now possible to read the opinions and experiences of hundreds of people about almost every existing product. Reading through all this information to decide whether a product or service is good or bad is time-consuming. Moreover, drawing an inference (positive, negative or neutral) from conflicting opinions is very difficult. Reputation analysis addresses this by automatically extracting opinions and sentiments from online sources and classifying them as positive, negative or neutral.

1.1 Aims and Objectives of the Research

The information gathered from such a brand reputation mining activity can be used by organizations to:

· compare their performance against competitors;
· assess specific marketing strategies;
· gauge how a particular product or service is received in the market.

The successful conduct of such brand reputation mining entails three broad challenges:

· the identification and collection of mentions on Web media;
· the application of data mining techniques to the gathered information in order to determine the sentiment associated with the opinions that individuals or groups express in those mentions;
· the display of the results.

The objectives of this research were:

· implementing a real-time reputation analysis platform;
· developing efficient algorithms to address the problem of reputation analysis;
· implementing techniques for studying the reputation of Wits University on the Internet;
· evaluating the proposed framework systematically.

1.2 System Architecture

The World Wide Web has grown tremendously over the past few years, and the number of web pages and the amount of content available are now enormous.

To access this information, a system is needed that scans available websites and picks out the ones relevant to the user's interests. The systems most commonly used for this purpose today are crawlers. The proposed framework integrates a database and a crawler.
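As a rough illustration of how a crawler and a store of relevant pages can work together, the sketch below implements a best-first (focused) crawl. It is a minimal sketch under stated assumptions, not the framework's actual crawler: the function name `focused_crawl` and the caller-supplied hooks `fetch` and `relevance` are hypothetical, and a plain dictionary stands in for the database.

```python
import heapq


def focused_crawl(seeds, fetch, relevance, max_pages=100):
    """Best-first crawl: links found on relevant pages are visited first.

    fetch(url) -> (text, links) and relevance(text) -> float are supplied
    by the caller; both are hypothetical hooks, not APIs from the source.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(0.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    stored = {}  # stands in for the database of relevant pages
    while frontier and len(stored) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = relevance(text)
        if score > 0:
            stored[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return stored
```

Passing `fetch` and `relevance` in as arguments keeps the sketch independent of any particular HTTP client or scoring model, so it can be exercised against an in-memory set of pages.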

1.2.1 Schematic Description of the Architecture

Figure 1.1 illustrates the automation of brand reputation analysis on the Internet. To create the dataset used for this research, text was taken from the links provided by the crawler, which then populated the database with relevant information. Before textual data was stored in the database, a named entity recognition step checked its validity. This step (corporate/brand name recognition) detected texts that include a corporate/brand name, using the keywords specified in the query. This phase is related to natural language processing and machine learning. The database contains four tables (label, sentiments, crawler and raw text). We then extracted the observations/sentences that include the keywords and classified each of them according to its polarity (positive, negative or neutral). This was achieved through iterative feature extraction and feature engineering. To perform the sentiment detection, feature extraction used NLP techniques such as N-gram and bag-of-words (Tang et al., 2014).

The feature engineering used autoencoders to reconstruct a model of the extracted observations/sentences based on the keywords. This model was then used by the classifier. Here, we considered the process of converting words to vectors, for which several methods exist. We considered a very simple implementation, proposed by Lebret and Collobert (2015), which uses autoencoders to jointly learn representations for words and phrases. ANN and SentiStrength were used to classify the extracted observations/sentences (Vural et al., 2013). SentiStrength is a lexicon-based sentiment analysis library: given a short piece of text written in English, it generates a positive/negative and a neutral sentiment score for each word in the text. The results of the classification step were then displayed using a Web-based interface.
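To make the N-gram and bag-of-words feature extraction step more concrete, here is a minimal sketch in plain Python. It assumes a simple lowercase regular-expression tokenizer and counts unigrams and bigrams; the function names `tokenize`, `ngrams`, and `bag_of_words` are illustrative and do not come from the framework itself.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(sentences, n_max=2):
    """Build a count matrix over all n-grams up to length n_max.

    Returns (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in sentences[i].
    """
    counts = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        grams = [g for n in range(1, n_max + 1) for g in ngrams(tokens, n)]
        counts.append(Counter(grams))
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix
```

Including bigrams lets a feature such as "not great" appear as its own column, which is one simple way to expose negation to a downstream classifier.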

We opted for Artificial Neural Networks because they produce good results in complex domains and are suitable for both discrete and continuous data, particularly the latter (Ikonomakis et al., 2005). SentiStrength captures the inherent characteristics of the textual data better and minimizes the upper bound on the generalization error. Its ability to learn can be independent of the dimensionality of the feature space (global minima vs. local minima) (Ikonomakis et al., 2005).

Figure 1.1: The platform hierarchy.

This framework focuses mainly on implementing the components that make up the final robust and accurate reputation analysis system. These are the main tasks completed in this research:

· Dataset collection: This involves collecting the required textual datasets using a crawler.
· Feature extraction/engineering: This task includes designing and extracting a novel, discriminative set of features from the input texts in the datasets. This is probably the most important step in the research, as it directly affects the eventual sentiment detection rates of the system. These features are gradient-based spatio-temporal features, as well as deep learning-based features.
· Classifier selection, design, and implementation: This task consists of selecting, designing, and implementing an appropriate classifier for the system. The classifier was trained on the extracted features. The chosen classifier is an ANN.
· Query browser: The query browser takes the user's request/query. The crawler then uses the keywords in this request/query as a reference for extracting textual data while crawling the Internet.
· Reporting and Visualization Module: After the textual data has been classified, this module visualizes it. Each crawled web page is assigned a classification specifying its polarity: positivity is marked in yellow, negativity in red and neutrality in green.
· Data Labelling: At this level, the user is given the opportunity to read the content of a web page in order to test the accuracy of the classifier. The user can read the textual content of the page, judge its polarity, and then compare that judgement with the polarity given by the classifier.
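The colour convention of the visualization module and the manual check described under Data Labelling can be expressed compactly. The sketch below is illustrative only: `POLARITY_COLOURS`, `colour_for`, and `agreement` are hypothetical names, not part of the framework described here.

```python
# Colour convention stated in the text: yellow = positive,
# red = negative, green = neutral.
POLARITY_COLOURS = {"positive": "yellow", "negative": "red", "neutral": "green"}


def colour_for(polarity):
    """Return the display colour for a polarity label (case-insensitive)."""
    return POLARITY_COLOURS[polarity.lower()]


def agreement(user_labels, predicted_labels):
    """Fraction of pages where the user's label matches the classifier's."""
    if len(user_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not user_labels:
        return 0.0
    matches = sum(u == p for u, p in zip(user_labels, predicted_labels))
    return matches / len(user_labels)
```

The `agreement` value is simply accuracy measured against the user's own reading of each page.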

1.3 Literature Review

Morinaga et al. (2002) implemented a system that performs reputation analysis on the Internet. The system has one component for opinion extraction and another for text extraction. The first is essentially a question-answering system for an application-specific question. The second has four basic functions: extracting characteristic words, extracting co-occurring words, extracting and analyzing sentences, and correspondence analysis. To connect these two parts, the authors labelled the opinions, which allowed them to treat the task as a supervised learning problem. Morinaga et al. (2002) used real data to demonstrate that the system lets users capture crucial knowledge about the reputation of the products of interest and reduces the cost of collecting and analyzing opinions. Their system can also be applied well beyond industrial products, for example to events, individuals, government, services, and companies.

The purpose of the research presented by Morinaga et al. (2002) was to determine the reputation of a company/brand by carefully studying online opinions. Not all text collected on the Internet about a product is useful; what matters are opinions that describe an individual's experience with the product. The difference between the work of Morinaga et al. (2002) and the system presented in this research lies in the data mining approach. Morinaga et al. (2002) use sentiment analysis, but their system has no topic discovery element, whereas our framework includes one (implemented as a classification approach). The strategy of Morinaga et al. (2002) was to encode a text with the label A as 1 and a text with any other label as 0. A set D of texts can then be regarded as a binary sequence.

They denoted the subset of D consisting of the texts that contain a word or phrase w as E(w), and the remaining texts as D - E(w). Let I(E(w)) and I(D - E(w)) denote the information-theoretic complexity of E(w) and D - E(w), respectively. In general, the complexity of any binary sequence x is written I(x) and can be computed using stochastic complexity. Figure 1.2 shows the reputation and extraction system flow used by Morinaga et al. (2002).
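One common way to use these quantities is to score a word w by how much splitting D into E(w) and D - E(w) reduces the total complexity, that is, I(D) - (I(E(w)) + I(D - E(w))). The sketch below illustrates this idea under a loudly stated simplification: it approximates I(x) by the empirical-entropy term m·H(m1/m), where m is the sequence length and m1 the number of ones, and leaves out the lower-order terms of the full stochastic complexity. The function names are hypothetical, and this is not claimed to be the exact formula used by Morinaga et al. (2002).

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with parameter p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def complexity(labels):
    """Approximate I(x) for a binary sequence as m * H(m1 / m).

    Assumption: only the leading entropy term is used; lower-order
    terms of stochastic complexity are ignored in this sketch.
    """
    m = len(labels)
    if m == 0:
        return 0.0
    return m * binary_entropy(sum(labels) / m)


def complexity_reduction(texts, labels, word):
    """I(D) - (I(E(w)) + I(D - E(w))) for a given word w."""
    contains = [word in text.lower().split() for text in texts]
    with_word = [y for y, c in zip(labels, contains) if c]
    without_word = [y for y, c in zip(labels, contains) if not c]
    return complexity(labels) - (complexity(with_word) + complexity(without_word))
```

A word that separates the 1-labelled texts from the 0-labelled ones well yields a large reduction, so ranking words by this score surfaces terms characteristic of the target label.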

The system supports opinion extraction and the analysis of the extracted opinions. The user inputs product names into the system, and the opinion retrieval function uses a search engine to retrieve web pages that include these names. All phrases that express opinions about these products are then retrieved and fed into an opinion database (see Figure 1.2). The text mining component, which is particularly crucial, accepts as input an analysis condition specifying the target category and outputs the mining results.

Figure 1.2: Reputation Mining (Morinaga et al., 2002).

Musto et al. (2015) created CrowdPulse, a domain-agnostic system for the textual analysis of social streams. The system performs social media analysis and makes use of algorithms for semantic processing. It was implemented to detect the most dangerous zones of Italian territory according to the content posted on social media. It has also been deployed to monitor the state of the city of L'Aquila after the earthquake of April 2009, where it demonstrated the effectiveness of applying such technology in an innovative way. CrowdPulse presents itself as a real-time framework for the semantic analysis of social streams. The platform is analysis-oriented: each analysis is carried out by running extraction heuristics followed by processing steps. In a typical case, a user selects the social network to which the heuristics should be applied and then specifies the type of processing to perform on the content. The platform aims to extract, analyze, aggregate and classify the large amount of data/information that matters to users. It is important to emphasize that the system is entirely independent of the domain, so that content from different domains can be aggregated.

The architecture of the system is shown in Figure 1.3, and each part of the system is briefly described below.

Figure 1.3: The architecture of the semantic content analysis framework (Musto et al., 2015).

· Social Extractor: populates a relational database with information obtained through social network APIs. The database is updated in real time and is driven by extraction heuristics (for example, collecting all tweets containing a specific hashtag, posts/tweets from different places, and all messages crawled).
· Semantic Tagger: categorizes each piece of content. In this step, tools such as Tag.me and DBpedia Spotlight are applied (Musto et al., 2015).
· Sentiment Analyzer: assigns a polarity to each piece of content. Here, Musto et al. (2015) created a lexicon-based approach that uses vocabularies to associate a polarity (positive, negative or neutral) with the content.
· Domain-specific processing: produces the results required for each specific scenario. It incorporates a variety of data mining and machine learning techniques.
· Analytics Console: produces outputs, providing data visualization widgets.
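To illustrate what a lexicon-based polarity assignment of the kind performed by the Sentiment Analyzer can look like, here is a minimal sketch. The tiny `LEXICON` and its scores are invented for illustration; they are not the vocabulary used by Musto et al. (2015) or by SentiStrength, and the function name `polarity` is likewise hypothetical.

```python
import re

# Hypothetical toy lexicon: word -> signed sentiment score.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}


def polarity(text, lexicon=LEXICON):
    """Sum the lexicon scores of the words in text and map the sign
    of the total to a polarity label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A plain sum like this ignores negation ("not good") and intensity modifiers, which is why practical lexicon methods usually add rules on top of the word scores.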

Once the extraction processes have been triggered, all the content is processed semantically. This step is encapsulated, and its output is stored locally. The information is then aggregated and presented to the user via an interactive interface updated in real time. The aggregation of data and the type of widgets presented depend largely on the analysis and the results the user wants: in some cases it may be important to show the sentiment of the population on a pie chart, or to track how sentiment evolves over time, while in other cases the user may want to plot all the geographic content on a map to analyze how certain topics spread, possibly across several domains. The range of possible analyses is very large.

Similarly, Spangler et al. (2009) described COBRA (Corporate Brand and Reputation Analysis), an integrated brand and reputation analysis solution that mines CGM (Consumer Generated Media) content for insight. They implemented a platform that monitors and gives feedback on the reputation of a brand. The main purpose of the COBRA platform is to detect the product categories, topics, issues and brands that need to be monitored. To this end, the platform relies heavily on keyword-based queries attached to a brand in order to extract sufficient data (see Figure 1.4). This method can be very demanding in terms of bandwidth and data storage. The system we propose differs in that its primary objective remains the analysis of the reputation of companies/brands. As a result, the data we collect is less ambiguous (non-significant data is ignored), and the data collection is optimized.

Figure 1.4: COBRA architecture (Spangler et al., 2009).

COBRA is built from three components, as shown in Figure 1.4. First, a generic ETL (Extract, Transform, Load) engine continuously ingests CGM content in structured or unstructured form and loads it into an information warehouse (left side of Figure 1.4). Next, an analytics engine lets users build the analytical models used to semantically tag the brands, topics, and issues in the different content sources (top right in Figure 1.4). Finally, an alerting system uses the tagged data to generate brand image and reputation alerts (bottom right in Figure 1.4). COBRA embodies an array of analytical techniques to identify and monitor brand image and reputation alerts. It identifies alerts through an approach that progressively filters data through four stages:

· Extended keyword-based queries: this component retrieves information from identified data sources on the Internet and collects content that includes brand name matches. For multiple brands, the user's request must contain the brand names and any possible variants. The main purpose is to collect enough information about the entities to be analyzed, including brands and companies, since the later analysis phases can process and filter the data. We argue, however, that such requests should not be so loose that they extract a whole range of insignificant information, which can affect the results of the analysis.

· Snippets: a text collection method that analyzes content available on the Internet. Much of that content is irrelevant, and a single document may cover several topics. COBRA therefore produces results based on the query stored in the relational database. To collect the textual content that mentions specific brands, Spangler et al. (2009) used the Java regular expression syntax. They also reduced the total size of the data that users retrieve by focusing on the text segments relevant to the subject instead of on entire documents.

· Analytical modeling: COBRA also uses a variety of analytical tools to collect and identify the brands or the problems/topics. Users begin by identifying the main models, namely the brand and company names, which they then report to COBRA. Subsequently, the basic models and filtering models are constructed from the users' domain knowledge (for instance, knowledge about the candy industry and its brands) or from knowledge that the system generates through text exploration.

· Orthogonal filtering: in addition to the first three filtering steps (keyword queries, snippet generation and analytical modeling of keywords), COBRA applies a distinctive filtering method, called orthogonal filtering, which identifies important alerts.
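The snippet stage described above can be illustrated with a short sketch. COBRA itself is reported to use Java regular expressions; the Python version below is only an analogous illustration, and the function name `extract_snippets`, the `window` parameter and the sample brand "Acme" are hypothetical.

```python
import re

def extract_snippets(text, brand_pattern, window=40):
    """Return the text around each brand match, up to `window` characters per side."""
    snippets = []
    for match in re.finditer(brand_pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end].strip())
    return snippets

# Hypothetical document: only the segment near the brand mention is kept.
document = ("Long unrelated introduction. I think Acme chocolate bars are great. "
            "Unrelated closing remarks follow here.")
print(extract_snippets(document, r"\bacme\b", window=20))
```

Keeping only a window around each match is what reduces the stored volume compared with retaining whole documents.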

1.4 Textual Data Retrieval

We first defined an entry page for textual data retrieval. A web page usually contains URLs of other web pages, so we retrieved these URLs from the current page and added all of them to the crawling queue. Next, we crawled another page from the queue and repeated the same process recursively. The crawling scheme can therefore be viewed as a depth-first search or a breadth-first traversal. As long as we could access the Internet and analyze web pages, we could crawl a website.

Following this, we extracted the body content of the web page crawled and

applied Entity Recognition to eliminate irrelevant information. Lastly, the

relevant information was recorded in the database for later processing (see

Figure 1.5).

Figure 1.5: Data collection and labelling.
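The queue-based crawl described in this section can be sketched as a breadth-first traversal. To keep the example self-contained, the web is replaced by a small in-memory link graph; `crawl`, `get_links`, `max_pages` and `toy_web` are hypothetical names, and a real crawler would fetch and parse pages instead.

```python
from collections import deque

def crawl(entry_url, get_links, max_pages=100):
    """Visit pages breadth-first; get_links(url) returns the URLs found on a page."""
    queue = deque([entry_url])
    seen = {entry_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:        # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for real web pages (assumption for illustration).
toy_web = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": []}
print(crawl("A", lambda url: toy_web.get(url, [])))
```

Replacing `popleft()` with `pop()` would turn the same loop into a depth-first crawl.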

1.5 Sentiment Analysis, NLP and Machine Learning

The sentiment analysis process classified the polarity of the retrieved text at different levels: an attempt is made to determine whether the opinion expressed in a text is positive, negative or neutral.

1.5.1 N-gram

In our research, feature extraction used N-gram/bag-of-words techniques that are prominent in modern Natural Language Processing. An N-gram is simply a contiguous sequence of N items (for example, words) as they appear in a text (see Figure 1.6). The intuition is that N represents the number of items aggregated in each sequence. Parsing must be applied in order to obtain the syntactic paths. Nowadays, parsers are available for many languages, although not for all of them. For English or Spanish, however, there is a plethora of parsers (Volcani and Fogel 2006).

Figure 1.6: N-gram model.
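As a minimal sketch of the idea, word N-grams can be produced by sliding a window of length N over the token list. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the pass rate is positive".split()
print(ngrams(tokens, 2))   # bigrams
print(ngrams(tokens, 3))   # trigrams
```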

1.5.2 Bag-of-Words

The bag-of-words model is a representation used in Natural Language Processing. This model represents a text as a multi-set of its words, disregarding grammar and word order but keeping multiplicity. The bag-of-words model can also be used in computer vision (Pang et al. 2002). It is usually combined with classification algorithms in which the frequency of each word is used as a feature for training a classifier (Deep Learning). In our research, the bag-of-words model was used as a tool for generating frequencies. After transforming the text into a bag-of-words, we performed extensive calculations to measure the textual characteristics. One of the most used features was the frequency of words, that is, the number of times a term appears in the text (see Figure 1.7).


Figure 1.7: BoW model.
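A bag-of-words is essentially a table of term frequencies. The sketch below assumes a simple lowercase tokenizer based on a regular expression, which is an illustrative choice and not necessarily the tokenization used in the framework.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each lowercase token to the number of times it occurs; order is discarded."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

bow = bag_of_words("Wits students and Wits staff")
print(bow.most_common(2))
```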

1.5.3 Autoencoders

In machine learning, documents are usually represented as a bag-of-words (BoW), which reduces a piece of text of arbitrary length to a fixed-length vector. Despite its simplicity, BoW remains the dominant representation in many applications, including text classification (Koncz and Paralic 2011). There has also been a large body of work dedicated to learning useful representations for textual data. By exploiting the co-occurrence pattern of words, one can learn a low-dimensional vector that forms a compact and meaningful representation of a document.

The new representation is often found useful for subsequent tasks such as topic visualization and information retrieval. Autoencoders have attracted a lot of attention in recent years as a building block of Deep Learning (Mescheder et al. 2017). In our framework, the autoencoder acts as the feature learning method by reconstructing inputs with respect to a given loss function. We implemented an autoencoder neural network whose hidden layer was taken as the learned feature. While it is often trivial to obtain good reconstructions with plain autoencoders, much effort has been devoted to regularization in order to prevent overfitting.


Figure 1.8: Autoencoder.

An autoencoder always consists of two parts (see Figure 1.8), the encoder and the decoder, which can be defined as transitions $\phi$ and $\psi$ such that:

$$\phi : \mathcal{X} \rightarrow \mathcal{F}, \qquad \psi : \mathcal{F} \rightarrow \mathcal{X} \tag{1.1}$$

In the simplest case, where there is one hidden layer, the encoder stage of an autoencoder takes the input of the generated features $x \in \mathbb{R}^{d} = \mathcal{X}$ and maps it to $z \in \mathbb{R}^{p} = \mathcal{F}$: $z = \sigma(Wx + b)$, where $z$ is usually referred to as the latent representation. Here, $\sigma$ is an element-wise activation function such as a sigmoid function, $W$ is a weight matrix and $b$ is a bias vector. After that, the decoder stage of the autoencoder maps $z$ to the reconstruction $x'$ of the same shape as $x$:

$$x' = \sigma'(W'z + b') \tag{1.2}$$

where $\sigma'$, $W'$ and $b'$ for the decoder may differ in general from the corresponding $\sigma$, $W$, $b$ for the encoder, depending on the design of the autoencoder. Autoencoders are unsupervised learning models, as discussed in the following subsection 1.5.4.
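The encoder and decoder stages can be sketched as two affine maps followed by a sigmoid. The example below is an untrained forward pass only, in plain Python; the dimensions, the random initialization and all function names are assumptions for illustration, and no claim is made that the framework was implemented this way.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine_sigmoid(weights, bias, vector):
    """Compute sigmoid(W v + b), with W given as a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, p = 6, 3                                          # input and latent sizes (illustrative)
W_enc, b_enc = random_matrix(p, d, rng), [0.0] * p   # encoder parameters
W_dec, b_dec = random_matrix(d, p, rng), [0.0] * d   # decoder parameters

x = [1.0, 0.0, 2.0, 0.0, 1.0, 0.0]                   # e.g. a small bag-of-words vector
z = affine_sigmoid(W_enc, b_enc, x)                  # latent representation
x_rec = affine_sigmoid(W_dec, b_dec, z)              # reconstruction, same shape as x
loss = sum((a - b) ** 2 for a, b in zip(x, x_rec))   # squared reconstruction error
print(len(z), len(x_rec), loss)
```

Training would adjust both parameter sets to reduce `loss`; the hidden vector `z` is what would then serve as the learned feature.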

1.5.4 Learning Algorithms

In supervised learning, we have a set of examples that consists of input-output pairs. The desired predictor is a function that maps an input to a relevant output or label. This set of examples is divided into two distinct subsets, a training set and a test set (Pang et al. 2002). The training set is a collection of correctly labeled examples, while the test set contains unseen or new input data that is labeled by the predictor (see Figure 1.9). In unsupervised learning, however, the predictor is a function that detects patterns in the input data even though no explicit feedback is provided. There is no training set, and no labeled data is involved. The function groups all inputs into several sub-groups based on their common patterns. This task is called clustering (Pang et al. 2002). Semi-supervised learning is useful when there are a few labeled examples and many more unlabeled ones. The algorithm generates an appropriate predictor to label new data using knowledge of both the labeled data and the clustering function.

Figure 1.9: Supervised Classification.

1.5.5 Neural Networks (NNs)

Neural networks are generally regarded as a major segment of the supervised learning discipline. As with all types of classification in supervised learning, the goal of a neural network is to use labelled data to train the network so that it can classify any new instances of data as either expected or unexpected (Furnkranz et al. 1998). This labelled data consists of a vector X, corresponding to the various features of a particular object or environment and their values. A vector Y is also included, which, in a binary classification problem, classifies each observation as either expected or unexpected, based on the values of the vector X at that position (Furnkranz et al. 1998). An example of such a problem is text classification, where positive, negative and neutral observations are labelled differently in the Y vector, with different combinations of features in the X vector. After learning a parameter vector W that multiplies with X to minimize the classification error on already-labelled data, the program is able to create labels for new data, and thus to determine whether new instances of textual data are positive, negative or neutral.


Figure 1.10: Basic structure of Neural Networks.

Features are input to the system as a layer of nodes (see Figure 1.10). In order to deal with non-linear relationships in the data, the values obtained in the input layer of nodes are converted, using a set of parameters w, into a new set of features known as the hidden layer (Furnkranz et al. 1998). This process can be repeated for a number of hidden layers; however, networks with one or two hidden layers are most commonly used because of their lower computation times. Finally, at the end of the hidden layers, the features are once again converted into a final output value, which classifies the observation as positive, negative or neutral. On the first run-through of this algorithm, randomized parameters are used, which are unlikely to give accurate results (Furnkranz et al. 1998). Thus, through a process known as back-propagation, labelled training data is used to iteratively update each set of parameters until the algorithm has a set reliable enough to make accurate classifications. To update the parameters, a cost function is developed, which is based on the mean error of the observations in relation to the current classification function (Furnkranz et al. 1998). Hence, the algorithm is progressively updated until it is deemed suitable to tackle any new data of the same nature given to it. The value of the cost function eventually converges to a minimum, at which point the learning process is complete. In this research, the neural network algorithm is implemented by propagating the values of the activation function g(S) from the input layer to the output layer.

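$$u_{0}^{(t)} = 1, \quad t \in [1, 2] \tag{1.3}$$

$$u^{(1)} = X^{(i)} \tag{1.4}$$

$$u^{(t)} = g(S^{(t)}), \quad t \in [2, 3] \tag{1.5}$$

where

$$S_{j}^{(t)} = \sum_{i} \theta_{j,i}^{(t-1)} u_{i}^{(t-1)} \tag{1.6}$$

and

$$g(S) = \frac{1}{1 + e^{-S}} \tag{1.7}$$

Here, $u_{i}^{(t)}$ represents the activation value of the $i$-th node of the $t$-th layer and $g(S)$ represents the activation function. $\theta$ is a specific hypothesis function parameter with a numerical value ($\theta \in \mathbb{R}$).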
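The propagation of activation values from the input layer to the output layer can be sketched in a few lines. This is a hedged illustration, not the thesis code: a bias unit fixed at 1 is prepended to each layer before the weighted sum, and the names `forward` and `thetas`, as well as the weight values, are assumptions.

```python
import math

def g(s):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, thetas):
    """Propagate x through the layers.

    thetas[k] holds one row of weights per node of the next layer;
    column 0 of each row multiplies the bias unit.
    """
    u = list(x)                                   # activations of the input layer
    for theta in thetas:
        u_with_bias = [1.0] + u                   # bias unit fixed at 1
        s = [sum(w * a for w, a in zip(row, u_with_bias)) for row in theta]
        u = [g(v) for v in s]                     # activations of the next layer
    return u

# Illustrative weights: 2 inputs -> 2 hidden nodes -> 1 output.
theta_1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
theta_2 = [[0.0, 1.0, 1.0]]
print(forward([3.0, -1.0], [theta_1, theta_2]))
```

With all-zero first-layer weights both hidden activations equal 0.5, so the output is g(0.5 + 0.5) = g(1).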

1.5.6 Named Entity Recognition (NER)

NER segments text into entities: it seeks to locate and categorize different entities such as names of people, organizations, locations, time expressions, quantities, monetary values, percentages, etc. (see Figure 1.11). In our research, Named Entity Recognition detected and classified company/brand names within textual content.

Figure 1.11: Named Entity Recognition.

1.6 Evaluating the System

The performance of the system relies on the sentiment analysis. The efficiency of sentiment analysis applications was calculated through experiments on test data. For binary classifiers there are different metrics to measure performance, and the testing phase can be split up into the following phases (Wang et al. 2012).

1.6.1 Evaluating Coverage

The coverage of the system refers to the percentage of the total observations for which the system is able to make a classification (Gonçalves et al. 2013). A high coverage value is desired to make sure that every feature used by the system can have a classification generated for it. The formula used, expressed in terms of precision, is shown below (Gonçalves et al. 2013):

$$\mathit{Precision} = \frac{TP}{TP + FP} \tag{1.8}$$

By using this metric, we are able to determine the reliability of the system in terms of making classifications.

1.6.2 Evaluating Accuracy

In order to evaluate accuracy, we used the metrics of precision and recall. We need a baseline to use as a reference (ground truth). As false/true negatives/positives relate to the predictions, we must first establish a set of data points whose classification is known ahead of time. Then, the predicted outcomes are compared to the known cases. A false positive is the case in which the model predicts something that is known not to be so. A false negative is the case in which the model predicts something not to be so when the known classification is that it is so. Precision measures how many of the positive classifications made were correct. Recall compares the number of correct positive classifications made to the total number that could have been made based on the dataset. Both a high coverage and a high recall are desirable for the classification system. The formula for calculating the recall is shown below (Gonçalves et al. 2013):

$$\mathit{Recall} = \frac{TP}{TP + FN} \tag{1.9}$$

where "TP" stands for true positives, "FP" for false positives and "FN" for false negatives. High recall means that the classifier correctly predicts most of the Web pages.


1.6.3 F-Measure

The F-measure, also called the balanced F-score, is the harmonic mean of precision and recall and is calculated as follows (Hatzivassiloglou and McKeown 1997):

$$F = 2 \cdot \frac{\mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{1.10}$$

In this formula, precision and recall are weighted evenly.

1.6.4 Accuracy

Another statistical metric of how well a binary classifier performs on test data is accuracy (Hatzivassiloglou and McKeown 1997). To calculate the accuracy, both the true positive and the true negative values among the total number of examined cases are considered. The accuracy formula is:

$$\mathit{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1.11}$$

We have a classification algorithm $f(x|\theta)$. After training, the prediction for $x$ is the positive polarity if $f(x|\theta) \geq \tau$, at a certain level $\tau$. We suggest that $f(x|\theta) \in [0, 1]$ embodies a probabilistic space: if $x$ has a positive polarity, this implies that $\bar{P}(+|x) \equiv f(x|\theta)$. We say that $x$ has a negative polarity if $f(x|\theta) < \tau$, and $\bar{P}(-|x) \equiv 1 - f(x|\theta)$. Then, based on the true label of $x$, we consider four cases in our confusion matrix and count the number of their occurrences: True Positive (TP), False Negative (FN), False Positive (FP) and True Negative (TN). For neural networks, we focused mainly on the classification of positive and negative polarity using binary classification. In this case, FN and FP represent misclassifications.

                        Positive (Actual)       Negative (Actual)
Positive (Predicted)    True Positive (TP)      False Positive (FP)
Negative (Predicted)    False Negative (FN)     True Negative (TN)

Table 1.1: Binary Confusion Matrix.
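The four cells of the confusion matrix can be counted by comparing each classifier score with a threshold. The sketch below assumes scores in [0, 1] and labels coded as 1 (positive) and 0 (negative); the name `tau`, the function name and the sample scores are hypothetical.

```python
def confusion_counts(scores, labels, tau=0.5):
    """Count TP, FP, FN and TN when a score >= tau is predicted as positive."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, label in zip(scores, labels):
        predicted_positive = score >= tau
        if predicted_positive and label == 1:
            counts["TP"] += 1
        elif predicted_positive and label == 0:
            counts["FP"] += 1
        elif not predicted_positive and label == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical scores and true labels, for illustration only.
scores = [0.9, 0.6, 0.4, 0.2, 0.7]
labels = [1, 0, 1, 0, 1]
print(confusion_counts(scores, labels))
```

Raising `tau` trades false positives for false negatives, which is why the threshold has to be chosen on labelled data.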


1.7 Content Extraction

In the cleanup stage, we extracted the appropriate content, and irrelevant information such as advertising and menu bars was removed from the main content. To perform this extraction and filtering, we used a tool named BoilerPipe, which enabled the framework to retrieve the body content from web pages. It offers five options for text collection. Among them, CanolaExtractor is considered the most efficient because it surpassed the extraction performance of the other four on a larger number of pages (see Table 1.2). The text extracted from a Web page by BoilerPipe's CanolaExtractor is presented in Figure 1.12.

BoilerPipe option            Size of features extracted
KeepEverythingExtractor      600 kB
ArticleSentencesExtractor    650 kB
NumWordsRulesExtractor       567 kB
CanolaExtractor              1 MB
LargestContentExtractor      700 kB

Table 1.2: BoilerPipe extraction performance.

1.7.1 Word Extraction

We also incorporated the Stanford Core NLP library, which is used extensively for text segmentation of the extracted page content. Stanford Core NLP is a set of Natural Language Processing tools; from a raw English text input, this library can provide the base forms of words, their parts of speech and the structural markup of the text.

Our platform first sends the text to the Stanford POS tagger for pre-processing steps such as sentence segmentation, tokenization of all sentences and tagging of tokens (see Figure 1.13).

In order to extract the Web page words-list, our framework builds an array including all stemmed tokens (the words) along with their frequencies and part-of-speech tags (see Table 1.3).

After constructing the words-list, all words tagged as verbs or adjectives are sent to the classifier in order to extract their polarity scores. The total scores of all positive, negative and neutral words are calculated, and the classifier predicts the Web page polarity from the highest score. If the classifier correctly predicts the polarity of the Web page, the page is added to the database. Our classifier goes through all real-time/testing datasets and predicts the polarity of the extracted sentences/observations.

NLP Stanford Core Library (version 1.2.0), http://nlp.stanford.edu/software/corenlp.shtml.

Figure 1.12: A text extraction sample by BoilerPipe.

Tokens    Word    Frequency    Tag
bought    buy                  VBD
speedy                         VBZ
...

Table 1.3: The words-list.

The author bases his analysis on the contents of the following website: http://www.bbc.co.uk/programmes/profiles/N8TcrLGxrf6dYzLZP1zhQj/meet-the-candidates.


Figure 1.13: The pre-processing step of a Web page using Stanford POS

tagger.

One of the strengths of this method is that it can be expanded further by considering more out-of-domain datasets as well as by adding additional sentiment lexicons.

Stop words are words that do not carry enough significance to be used in search queries. In this research, these words were filtered out from search queries because they returned vast amounts of unnecessary information. Table 1.4 shows a list of the English stop words that were ignored.

a, about, above, across, after, afterwards

again, against, all, almost, alone, along

already, also, although, always, am, among

amongst, amoungst, amount, an, and, another

any, anyhow, anyone, anything, anyway, anywhere

...

Table 1.4: List of English Stop Words.
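Filtering stop words from a query amounts to a set-membership test. The sketch below uses only the first entries of Table 1.4 as its stop-word set; the function name and the sample query are illustrative assumptions.

```python
# Subset of the stop words listed in Table 1.4 (the full list is longer).
STOP_WORDS = {
    "a", "about", "above", "across", "after", "afterwards", "again", "against",
    "all", "almost", "alone", "along", "already", "also", "although", "always",
    "am", "among", "amongst", "an", "and", "another", "any",
}

def remove_stop_words(query):
    """Return the query terms, lowercased, with stop words removed."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("All about Wits and another university"))
```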

1.7.2 Training Phase

In the training phase, the classifier takes a lexicon and a Web page words-list as input and computes the polarity of the Web page by querying the sentiment values of all its adjectives and verbs.
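A lexicon lookup over the words-list can be sketched as follows. Several details are assumptions made for illustration: the lexicon is a plain dictionary with hypothetical scores, scores are weighted by word frequency, words missing from the lexicon count as neutral, and verbs and adjectives are recognized by the Penn Treebank tag prefixes VB and JJ.

```python
def page_polarity(words_list, lexicon):
    """Sum lexicon scores over verbs and adjectives and return the dominant polarity.

    words_list: iterable of (word, frequency, pos_tag) entries.
    lexicon: dict mapping a word to a signed sentiment score.
    """
    totals = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    for word, frequency, tag in words_list:
        if not tag.startswith(("VB", "JJ")):      # keep verbs and adjectives only
            continue
        score = lexicon.get(word, 0)              # unknown words are treated as neutral
        if score > 0:
            totals["positive"] += score * frequency
        elif score < 0:
            totals["negative"] += -score * frequency
        else:
            totals["neutral"] += frequency
    label = max(totals, key=totals.get)           # ties resolve to the first key listed
    return label, totals

# Hypothetical words-list and lexicon entries.
words_list = [("buy", 2, "VBD"), ("enjoy", 3, "VBP"), ("boring", 1, "JJ"), ("class", 4, "NN")]
lexicon = {"enjoy": 2, "boring": -2}
print(page_polarity(words_list, lexicon))
```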


Figure 1.14: A sample of the Wits marketing dataset.

1.7.3 Emoticons

To easily classify the polarity of a text/message, it is useful to focus on the emoticons it contains. We can define emoticons as representations of happy or sad feelings. To specify the polarity of the emoticons, we considered an entire group of common emoticons. Emoticons have been used in combination with other techniques to build a set of training data. Figure 1.15 shows a sample of the emoticons that we used to train the classification algorithm.

Figure 1.15: Emoticons and their variations (Read 2005).

1.7.4 Our Training Approach

The entire textual dataset was broken down into individual observations using Python code, with the most commonly appearing words being identified and recorded. The number of common words chosen can be increased or decreased depending on the size of the dataset, or if the results generated are not accurate enough. The logic behind this is that, if the polarity of an observation is to be predicted, the observation should contain at least some of the words commonly associated with the polarity of previously examined observations. After building this list of observations, training and testing datasets were built from the obtained textual data. A multidimensional array was built for the dataset, using the previously obtained common words as features and each record in the dataset as a separate row in the array. The text file containing the observations was read one observation at a time, using the % tags to separate observations. Each record was then compared, word by word, to the list of common words generated previously. If any of the words in the observation matched a common word, this was recorded in the multidimensional dataset array by incrementing the corresponding value in the array. Entire datasets of textual data were read in using this method, with a corresponding targets array being labelled "0" or "1", depending on which observation was currently being read. The finalized array was then split into a training and a testing dataset for the purposes of the neural network method. This particular system uses a network with four layers in total (two of them hidden), so three sets of random weights were first generated for the transitions between layers. Forward propagation is first applied to the network, using the following activation function (Goodfellow et al. 2016):

$$h(x) = \frac{1}{1 + e^{-W^{T}x}} \tag{1.12}$$

where W is the weight vector and x is the value of the input for that node. As a result, hypotheses are generated for each node in the hidden layers (to be used as features for the next layer), as well as a hypothesis for the output in the final layer. Backpropagation was then applied in order to update the weights between the output layer and the second hidden layer, between the second and first hidden layers, and between the first hidden layer and the input layer, respectively. This is done to improve the accuracy of hypothesis generation, as described by Goodfellow et al. (2016). Following this, the parameter with which to split the data into positive/negative was obtained by testing the previously obtained factors on a combined portion of the dataset with positive/negative labels. This test was repeated, each time updating the parameter value, until the optimal separation was found.
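The construction of the feature array from the most common words, as described in this subsection, can be sketched as follows. The literal "%" delimiter, the whitespace tokenization, the function name `build_features` and the sample text are assumptions; the original Python code is not reproduced here.

```python
from collections import Counter

def build_features(raw_text, n_common=2):
    """Split raw_text into observations on '%' and count common-word occurrences.

    Returns the list of common words (the features) and one row of counts
    per observation.
    """
    observations = [chunk.split() for chunk in raw_text.lower().split("%") if chunk.strip()]
    counts = Counter(word for obs in observations for word in obs)
    common = [word for word, _ in counts.most_common(n_common)]
    column = {word: j for j, word in enumerate(common)}
    matrix = [[0] * len(common) for _ in observations]
    for i, obs in enumerate(observations):
        for word in obs:
            if word in column:                    # word matches a common word
                matrix[i][column[word]] += 1      # increment the corresponding cell
    return common, matrix

raw = "good class good % bad class % good lecturer"
common, matrix = build_features(raw, n_common=2)
print(common)
print(matrix)
```

A parallel list of 0/1 targets, one per row, would complete the dataset before it is split into training and testing parts.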

1.7.5 Training

The implementation created for the intake and preprocessing of the dataset was found to run quickly enough to produce results in a reasonable amount of time, given the amount of data collected for the tests. It is expected that the more data is added to the dataset, the longer the code will take to generate results.

The ANN algorithm, when evaluated in terms of true positives, true negatives, false positives and false negatives, was found to produce approximately 11% false positives and 2.7% false negatives, as shown in Table 1.5 below. The ANN classified the observations using a binary polarity with an empirical error of 0.1379 (13.79%).

                        Positive (Actual)      Negative (Actual)
Positive (Predicted)    22 (15.172413%)        16 (11.034482%)
Negative (Predicted)    4 (2.7586206%)         103 (71.034482%)

Table 1.5: Evaluation metrics for ANN.
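The metrics of Section 1.6 can be recomputed directly from the four counts of Table 1.5. The sketch below is a plain transcription of formulas (1.8) to (1.11); the function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure, accuracy and error from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error": 1 - accuracy,
    }

# Counts taken from Table 1.5.
for name, value in metrics(tp=22, fp=16, fn=4, tn=103).items():
    print(f"{name}: {value:.4f}")
```

With these counts the error is 20/145, approximately 0.1379, which agrees with the empirical error value reported in the text.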

In terms of approximations, the range of the ANN's confusion matrix is

given in Table 1.6.

                        Positive (Actual)    Negative (Actual)
Positive (Predicted)    20 to 33             4 to 18
Negative (Predicted)    4 to 16              89 to 103

Table 1.6: Confusion Matrix approximation.

Empirical error = 0.13 to 0.17 (13% to 17%)

As we can see from the confusion matrix approximation, up to 33 positive observations are predicted as positive, and up to 103 negative observations are classified as negative. In contrast, at most 34 observations are misclassified. An ANN is an adaptive algorithm that changes its inner structure based on the information passing through it. Learning in an ANN therefore means that a processing unit can update its inputs/outputs in response to changes in its environment. For training, we used training samples with unique features (words, sentences), and for testing we used testing samples with other unique features.

Figure 1.16: Graph of positive features (x axis) Vs the Cost (y axis).

Figure 1.17: Graph of Negative features (x axis) Vs the Cost function (y

axis).

From the graphs of the ANN shown in Figure 1.16 and Figure 1.17, we note that there is a strong correlation between the volume of the dataset and the classifier's predictions. For a big dataset with many observations, the algorithm decreases the value of the cost function; if the volume of the dataset is reduced, the value of the cost function increases. The cost function of the ANN reached a minimum, showing that the algorithm was able to converge correctly. The graph in Figure 1.18 shows the value of the cost function decreasing.

Figure 1.18: Graph of Cost vs. The Number of Iterations.

1.7.6 Analysis

From the results obtained with the ANN implementation, it can be seen that most of the data was classified correctly. This is likely due to the textual polarity data not having too much overlap, so that the boundary value was able to separate the classes properly. Because the feature data is varied, with features potentially changing as the dataset gets larger, this classification method could lose prediction accuracy as the sample size increases, due to more observations potentially overlapping. However, a larger dataset could instead increase the accuracy of hypothesis generation, as the algorithm would have more data to work with and the boundary between positive and negative could be better understood.

The quality of the data is thus important in this respect. In addition, for the neural network implementation, it was found that the algorithm misclassified only a minor number of observations, with an empirical error of 0.13 to 0.17 (13% to 17%), which corresponds to a correct classification rate of roughly 83% to 87%; true negatives alone account for 71% of the observations. The vast majority of observations were thus classified correctly. The algorithm was able to handle the complex relationships between the various features, making use of the hidden layers to account for these relationships. It was largely successful in classifying each observation as either positive or negative. Once again, a larger dataset could potentially increase the output accuracy of this algorithm. The results of the bag-of-words model are shown in Table 1.7, with the keyword Wits having the highest occurrence within the textual dataset.

With the number of iterations.

Words         Occurrence
Wits          200
University    119
Students
Research
2016
School
South
Student
African
Academic

Table 1.7: The Bag-of-Words performance.

1.8 Graphical User Interface

The data visualization used in this work is discussed in this section. To visualize the data, we incorporated our framework into Trackur and created searches for the terms that a user wants to track, such as brand names and corporate terms. The Trackur API allows developers to store structured data in their databases. In this research, Trackur was used as a crawler, and it thus allowed the platform to use tens/millions of crawled text items. This data was extracted from social media and other sources of information. Our design presents the information in an understandable way, which makes it possible to interpret the results produced by the system. We simply implemented a principle of structural organization that defines categories of information by function or importance. Color allowed us to classify the crawled web pages with respect to their polarity. Table 1.8 lists the colors and how they are interpreted by the GUI framework.

Trackur is a social media monitoring tool for individuals up through large companies and agencies; with this API, new items are found in almost real time.

Colors    Meaning
Red       Negative polarity
Yellow    Positive polarity
Green     Neutral polarity

Table 1.8: The meanings of colors for the GUI.

"NA" indicates that the textual content of a certain Web page is not available. "Date" indicates the date on which a web page was written. "Sentiment" indicates the polarity prediction. "Source" identifies the source of the web pages. A snippet is a small text segment around a specified keyword. Figure 1.21 displays the result for the query "Wits University". Similarly, Figure 1.22 displays the result for the query "Wits students protest".

Figure 1.19: Home page of the web-based application. This figure shows the initial screen of the application. On this page, the user is asked to perform reputation analysis.

The user name and password are given to the user by the system administrator. The backend MySQL database contains a table for all users that stores their credentials (user names and passwords). If both the user name and password match, the user is logged into the system. Otherwise, the user is presented with an error message and is asked to re-enter their credentials.

The author refers to the following website: http://reputationanalysis.wifeo.com/

Figure 1.20: Log-in screen of the web-based application. Once the user has agreed to perform reputation analysis, they have access to the log-in screen.

Figure 1.21: Results for:"Wits University".

The polarity of the entire Web page was displayed in the final reputation

analysis application.


Figure 1.22: Results for:"Wits students protest".

1.9 Results

In this section, we discuss the use of ANN and SentiStrength for sentiment detection in reputation analysis. Our study was aimed at investigating the use of textual data in Web mining for reputation analysis using the aforementioned classifiers. Since our proposed framework is supervised, given a Web page, the method first counts the number of positive, negative and neutral observations. If the number of neutral observations $s_{neu}$ is larger than the number of positive observations $s_{pos}$ and of negative observations $s_{neg}$, the Web page $p$ is considered neutral. If the number of positive observations is larger than the number of negative and neutral observations, the Web page is considered positive; otherwise it is considered negative (see Table 1.9).

Condition                                              Prediction
$(s_{neu} > s_{pos}) \wedge (s_{neu} > s_{neg})$       $p$ = neutral
$(s_{pos} > s_{neg}) \wedge (s_{pos} > s_{neu})$       $p$ = positive
$(s_{neg} > s_{pos}) \wedge (s_{neg} > s_{neu})$       $p$ = negative
(s = s) (s > s)                                        p
(s = s) (s > s)                                        p
(s = s) (s > s)                                        p
(s = s) (s = s)                                        p

Table 1.9: Predicting the Web page polarity.
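The page-level decision can be sketched from the prose rule given in this section. Note the assumption: ties between counts fall through to the final "otherwise negative" branch, as the prose states; any more specific tie handling in Table 1.9 is not reproduced. The function and argument names are hypothetical.

```python
def predict_page(s_pos, s_neg, s_neu):
    """Predict a page's polarity from its counts of positive, negative and neutral observations."""
    if s_neu > s_pos and s_neu > s_neg:
        return "neutral"
    if s_pos > s_neg and s_pos > s_neu:
        return "positive"
    return "negative"                 # all remaining cases, including ties

print(predict_page(s_pos=5, s_neg=2, s_neu=1))
print(predict_page(s_pos=1, s_neg=2, s_neu=5))
```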

We also compare our results with the results obtained in Wang and Araki (2008). One of the more interesting services available on the crawler used is insights for data. The crawler/API provides a basic "search" feature but can also include some very detailed filters. For instance, we can look at the data from various perspectives.

1.10 Testing

To evaluate our classifiers, we applied them to a real-time domain to predict the polarity of the crawled Web pages. For each crawled Web page, its content is first extracted and stored, and its polarity is then computed.

Table 1.10 shows the classification of sentences using SentiStrength.

| Sentence | Highest polarity | Compound |
|---|---|---|
| Make sure you prepared:) or : you will fail | neg: 0.314 | -0.3594 |
| The pass rate is positive | neu: 0.1 | 0.00 |
| I really enjoyed this class | pos: 0.545 | 0.5563 |
| I dislike this class, it is boring | neu: 0.505, neg: 0.495 | -0.5994 |
| I like you | neu: 0.286, pos: 0.714 | 0.3612 |

Table 1.10: Polarity prediction using SentiStrength.

Additionally, we show the keywords detected for reputation prediction (Table 1.11).

Since the core keyword is "Student(s)", one can say that the textual dataset contained information about student protests in South Africa. Nevertheless, we can draw very different meanings from the same results.
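Keyword occurrences of this kind can be obtained with a simple bag-of-words count. The sketch below is illustrative only: the tokenizer (runs of letters and digits, case preserved so that "Students" and "Student" stay separate) and the function name `keyword_counts` are assumptions, not the framework's actual preprocessing.

```python
import re
from collections import Counter


def keyword_counts(documents, top_n=10):
    # Bag-of-words: count every token across all documents, ignoring order.
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[A-Za-z0-9]+", text))
    # Return the top_n (word, occurrence) pairs, most frequent first.
    return counts.most_common(top_n)
```

Applied to the crawled pages, the most frequent tokens give a first impression of the topic, although, as noted above, the same counts can support different interpretations.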

1.10.1 Completeness of ANN (Accuracy)

In this subsection we report the average accuracy for comparability with earlier results in text classification. Finally, we summarize the average for the learning algorithm. Table 1.12 shows the confusion matrix for the ANN, with an empirical error of 0.073 (about 7.3%).


| Words | Occurrence |
|---|---|
| Students | 113 |
| Student | |
| University | |
| Universities | |
| Protest | |
| CPUT | |
| South | |
| 2016 | |
| FeesMustFall | |
| Violence | |

Table 1.11: The Bag-of-Words performance.

| | Positive (Actual) | Negative (Actual) |
|---|---|---|
| Positive (Predicted) | 89 (72.36%) | 2 (1.63%) |
| Negative (Predicted) | 7 (5.69%) | 25 (20.33%) |

Table 1.12: Confusion Matrix for ANN.
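The accuracy and empirical error follow directly from the four cells of the confusion matrix. The sketch below recomputes them from the reported counts; the function name `confusion_metrics` is hypothetical, and precision and recall are included only as additional illustrations that the source does not report.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    # tp: predicted positive, actually positive; fp: predicted positive, actually negative
    # fn: predicted negative, actually positive; tn: predicted negative, actually negative
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "total": total,
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts taken from the ANN confusion matrix.
ann = confusion_metrics(tp=89, fp=2, fn=7, tn=25)
```

With these counts, 114 of the 123 test observations are classified correctly, which gives the 92.68% testing accuracy reported for the ANN and an error of about 0.073 as a fraction (roughly 7.3%).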

1.11 Empirical Testing of ANN

The issue of conducting computational experiments has been addressed since the late 70's (Chen et al. 1999). Empirical testing of algorithms has been the focus of research in a variety of contexts. One of the major limitations of ANN is the learning process, which is relatively slow: the implementation took slightly longer to arrive at results because backpropagation has to be performed repeatedly (see Figure 1.23). The framework presented by Wang and Araki (2008) achieved 70% for opinion sentence classification. The proposed framework achieved an accuracy of 92%, which surpasses the accuracy reported by Wang and Araki (2008).

1.12 Discussion

Results from this experiment demonstrate the limiting factor of ANN, which suffered from significant slowdowns with larger datasets. An increasing amount of data to process, as well as additional features, exponentially increased the processing time required for the learning algorithm to finish. However, since the algorithm tries to find an exact fit for the textual data, the accuracy of the hypotheses generated is likely to be high. Thus, anyone considering this algorithm may have to weigh the tradeoff between processing time and accuracy.

| | Positive (Actual) | Negative (Actual) |
|---|---|---|
| Positive (Predicted) | 83 to 90 | 1 to 10 |
| Negative (Predicted) | 4 to 15 | 22 to 30 |

Table 1.13: Approximation of the ANN Confusion Matrix. The empirical error approximation ranges from 0.0569 (5.69%) up to 0.13 (13%).

| ANN | | | Accuracy |
|---|---|---|---|
| Training | 103 | | 86.20% |
| Testing | | | 92.68% |
| Average | 55.5 | 5.5 | 89.44% |

Table 1.14: Performance of ANN for Testing and Training.

Figure 1.23: Learning process of ANN. Features (x axis) and Time (y axis).

Figure 1.24: Negative Features and the Bag-of-words performance.

It can thus be concluded that using both ANN and SentiStrength algorithms for the purpose of reputation analysis is a viable proposition. The ANN is able to learn a set of parameters in order to compare the text in a given observation to the obtained words and thus decide whether to classify the observation as positive or negative. The SentiStrength algorithm can effectively determine the polarity of a sentence. The compound score is computed by summing the lexicon valence scores of each word in the sentence, and is then normalized to lie between -1 (most extreme negative) and +1 (most extreme positive) (Araujo et al. 2016). Therefore, such fast and efficient classification methods can be implemented and incorporated in a reputation analysis framework.
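The sum-then-normalize step for the compound score can be sketched as follows. The source does not state the normalization formula, so this sketch assumes the form x / sqrt(x^2 + alpha) with alpha = 15, which is the normalization used by the VADER lexicon tool; the tiny lexicon and the function names are hypothetical and serve only to make the example runnable.

```python
import math

# Hypothetical valence lexicon, for illustration only.
HYPOTHETICAL_LEXICON = {"enjoyed": 2.0, "like": 1.5, "boring": -1.5, "dislike": -2.0}


def compound_score(valences, alpha: float = 15.0) -> float:
    # Sum the word valences, then squash the sum into the open interval (-1, +1).
    total = sum(valences)
    return total / math.sqrt(total * total + alpha)


def sentence_compound(sentence: str, lexicon=HYPOTHETICAL_LEXICON) -> float:
    # Look up each lower-cased word; words missing from the lexicon contribute nothing.
    words = sentence.lower().split()
    return compound_score([lexicon[w] for w in words if w in lexicon])
```

A sentence with no lexicon words scores 0.0, and the score approaches -1 or +1 only as the summed valence grows large in magnitude.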

1.13 Conclusion

The Internet is growing at lightning speed and the data stored therein is vast. This growth makes it an enormous source of data, especially on how people feel about different issues. Nowadays, the opinions of people play a crucial role in industry, so industries large and small are studying automatic approaches to retrieve the information they need from large volumes of data on the Internet. Reputation analysis is an effective method to deal with this problem. Reputation analysis automatically determines how different keywords, terms, topics or user-generated content may harm a brand name, product or company that is mentioned. It relies on sentiment detection, which uses methods such as machine learning and natural language processing to capture polarity (positive, negative, or neutral), with or without its strength, from plain text. This research focuses on Web mining for reputation analysis. A reputation analysis is performed on the University of the Witwatersrand to study its popularity on the Internet. There exists a wide range of fields for which information can be retrieved. This research investigated sentiments about Wits from publicly available data. The system can be used to retrieve, process and display the reputation of different brands and corporations. This study also describes tools that enable the development of technologies that support text processing to speed up sentiment detection in reputation analysis. In this perspective, we offer an application of how the proposed framework works for a clearly defined system such as focused web crawling. Our work differs from the work presented by Wang and Araki (2008): Wang and Araki (2008) utilized unsupervised techniques, whereas our framework applied supervised algorithms.

References

[Ackoff 1989] Russell L Ackoff. From data to wisdom. Journal of applied

systems analysis, 16(1):3-9, 1989.

[Aggarwal and Zhai 2012] Charu C Aggarwal and ChengXiang Zhai. Mining

text data. Springer Science & Business Media, 2012.

[Araujo et al. 2016] Matheus Araujo, Julio Reis, Adriano Pereira, and Fabri-

cio Benevenuto. An evaluation of machine translation for multilingual

sentence-level sentiment analysis. In Proceedings of the 31st Annual ACM

Symposium on Applied Computing, pages 1140-1145. ACM, 2016.

[Asghar et al. 2014] Muhammad Zubair Asghar, Aurangzeb Khan, Shakeel

Ahmad, and Fazal Masud Kundi. A review of feature extraction in sentiment analysis. Journal of Basic and Applied Scientific Research, 4(3):181-186, 2014.

[Blanco and Moldovan 2011] Eduardo Blanco and Dan I Moldovan. Some

issues on detecting negation from text. In FLAIRS Conference, pages

228-233, 2011.

[Chang and Lin 2011] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a li-

brary for support vector machines. ACM transactions on intelligent sys-

tems and technology (TIST), 2(3):27, 2011.

[Chen et al. 1999] Chun-Hung Chen, S David Wu, and Liyi Dai. Ordinal

comparison of heuristic algorithms using stochastic optimization. IEEE

Transactions on Robotics and Automation, 15(1):44-56, 1999.

[Cooley et al. 1997] Robert Cooley, Bamshad Mobasher, and Jaideep Srivas-

tava. Web mining: Information and pattern discovery on the world wide

web. In Tools with Artificial Intelligence, 1997. Proceedings., Ninth IEEE

International Conference on, pages 558-567. IEEE, 1997.

[Dadvar et al. 2011] Maral Dadvar, Claudia Hauff, and Franciska MG

de Jong. Scope of negation detection in sentiment analysis. 2011.


[Drake 2003] Miriam Drake. Encyclopedia of library and information science,

volume 1. CRC Press, 2003.

[Durant and Smith 2006] Kathleen T Durant and Michael D Smith. Mining

sentiment classification from political web logs. In Proceedings of Workshop

on Web Mining and Web Usage Analysis of the 12th ACM SIGKDD Inter-

national Conference on Knowledge Discovery and Data Mining (WebKDD-

2006), Philadelphia, PA, 2006.

[Frakes and Baeza-Yates 1992] William B Frakes and Ricardo Baeza-Yates.

Information retrieval: data structures and algorithms. 1992.

[Freitag 1998] Dayne Freitag. Information extraction from html: Application

of a general machine learning approach. In AAAI/IAAI, pages 517-523,

1998.

[Fu et al. 2012] Tianjun Fu, Ahmed Abbasi, Daniel Zeng, and Hsinchun

Chen. Sentimental spidering: leveraging opinion information in focused

crawlers. ACM Transactions on Information Systems (TOIS), 30(4):24,

2012.

[Furnkranz et al. 1998] Johannes Furnkranz, Tom Mitchell, Ellen Riloff,

et al. A case study in using linguistic phrases for text categorization on

the www. In Working Notes of the AAAI/ICML, Workshop on Learning

for Text Categorization, pages 5-12, 1998.

[Godsay 2015] Manasee Godsay. The process of sentiment analysis: a study.

International Journal of Computer Applications, 126(7), 2015.

[Gonçalves et al. 2013] Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. Comparing and combining sentiment analysis methods. In Proceedings of the first ACM conference on Online social networks, pages 27-38. ACM, 2013.

[Goodfellow et al. 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning (2016). Book in preparation for MIT Press. URL: http://www.deeplearningbook.org, 2016.

[Hatzivassiloglou and McKeown 1997] Vasileios Hatzivassiloglou and Kath-

leen R McKeown. Predicting the semantic orientation of adjectives. In

Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pages 174-181. Association for Computational Linguistics, 1997.


[Ikonomakis et al. 2005] M Ikonomakis, Sotiris Kotsiantis, and V Tampakas.

Text classification using machine learning techniques. WSEAS transactions on computers, 4(8):966-974, 2005.

[Kantor 1994] Paul B Kantor. Information retrieval techniques. Annual review of information science and technology, 29:53-90, 1994.

[Keim 2002] Daniel A Keim. Information visualization and visual data mining. IEEE transactions on Visualization and Computer Graphics, 8(1):1-8, 2002.

[Kennedy and Inkpen 2006] Alistair Kennedy and Diana Inkpen. Sentiment

classification of movie reviews using contextual valence shifters. Computational intelligence, 22(2):110-125, 2006.

[Koncz and Paralic 2011] Peter Koncz and Jan Paralic. An approach to fea-

ture selection for sentiment analysis. In Intelligent Engineering Systems

(INES), 2011 15th IEEE International Conference on, pages 357-362.

IEEE, 2011.

[Konstantinova et al. 2011] Natalia Konstantinova, Sheila CM De Sousa,

and JA Sheila. Annotating negation and speculation: the case of the

review domain. In RANLP student research workshop, pages 139-144,

2011.

[Koppel and Schler 2006] Moshe Koppel and Jonathan Schler. The importance of neutral examples for learning sentiment. Computational Intelligence, 22(2):100-109, 2006.

[Kosala and Blockeel 2000] Raymond Kosala and Hendrik Blockeel. Web mining research: A survey. ACM Sigkdd Explorations Newsletter, 2(1):1-15, 2000.

[Kucuktunc et al. 2012] Onur Kucuktunc, B Barla Cambazoglu, Ingmar We-

ber, and Hakan Ferhatosmanoglu. A large-scale sentiment analysis for ya-

hoo! answers. In Proceedings of the fifth ACM international conference on

Web search and data mining, pages 633-642. ACM, 2012.

[Kumar et al. 1999] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Extracting large-scale knowledge bases from the web. In VLDB, volume 99, pages 639-650, 1999.

[Larsen and Aone 1999] Bjornar Larsen and Chinatsu Aone. Fast and effec-

tive text mining using linear-time document clustering. In Proceedings of


the fifth ACM SIGKDD international conference on Knowledge discovery

and data mining, pages 16-22. ACM, 1999.

[Lebret and Collobert 2015] Rémi Lebret and Ronan Collobert. "The sum of its parts": Joint learning of word and phrase representations with autoencoders. arXiv preprint arXiv:1506.05703, 2015.

[Martin et al. 2006] Olivier Martin, Irene Kotsia, Benoit Macq, and Ioannis

Pitas. The enterface'05 audio-visual emotion database. In Data Engi-

neering Workshops, 2006. Proceedings. 22nd International Conference on,

pages 8-8. IEEE, 2006.

[Mescheder et al. 2017] Lars Mescheder, Sebastian Nowozin, and Andreas

Geiger. Adversarial variational bayes: Unifying variational autoencoders

and generative adversarial networks. arXiv preprint arXiv:1701.04722,

2017.

[Morinaga et al. 2002] Satoshi Morinaga, Kenji Yamanishi, Kenji Tateishi,

and Toshikazu Fukushima. Mining product reputations on the web. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 341-349. ACM, 2002.

[Mullen and Collier 2004] Tony Mullen and Nigel Collier. Sentiment anal-

ysis using support vector machines with diverse information sources. In

EMNLP, volume 4, pages 412-418, 2004.

[Musto et al. 2015] Cataldo Musto, Giovanni Semeraro, Pasquale Lops, and

Marco de Gemmis. Crowdpulse: A framework for real-time semantic analysis of social streams. Information Systems, 54:127-146, 2015.

[Nasukawa and Yi 2003] Tetsuya Nasukawa and Jeonghee Yi. Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd international conference on Knowledge capture, pages 70-77. ACM, 2003.

[Nicola 2013] Raluca Georgeta Nicola. Categorization and visualization of

Twitter data. PhD thesis, Technical University of Dresden, 2013.

[Nkongolo 2017] Mike Nkongolo. A Web-Based Prototype Course Recom-

mender System using Apache Mahout. GRIN Verlag, 2017.

[O'Leary 2013] Daniel E O'Leary. Artificial intelligence and big data. IEEE

Intelligent Systems, 28(2):96-99, 2013.


[Pang et al. 2002] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.

Thumbs up?: sentiment classification using machine learning techniques.

In Proceedings of the ACL-02 conference on Empirical methods in natural

language processing-Volume 10, pages 79-86. Association for Computational Linguistics, 2002.

[Pyle 1999] Dorian Pyle. Data preparation for data mining, volume 1. Morgan Kaufmann, 1999.

[Read 2005] Jonathon Read. Using emoticons to reduce dependency in ma-

chine learning techniques for sentiment classification. In Proceedings of the

ACL student research workshop, pages 43-48. Association for Computational Linguistics, 2005.

[Routray et al. 2013] Preeti Routray, Chinmaya Kumar Swain, and Smita Praya Mishra. A survey on sentiment analysis. International Journal of Computer Applications, 76(10), 2013.

[Russell et al. 1995] Stuart Russell and Peter Norvig. Artificial Intelligence: A modern approach. Prentice-Hall, Englewood Cliffs, 25:27, 1995.

[Saif et al. 2012] Hassan Saif, Yulan He, and Harith Alani. Semantic sentiment analysis of twitter. The Semantic Web-ISWC 2012, pages 508-524, 2012.

[Scott and Matwin 1999] Sam Scott and Stan Matwin. Feature engineering

for text classification. In ICML, volume 99, pages 379-388, 1999.

[Sebastiani 2002] Fabrizio Sebastiani. Machine learning in automated text

categorization. ACM computing surveys (CSUR), 34(1):1-47, 2002.

[Spangler et al. 2009] Scott Spangler, Ying Chen, Larry Proctor, Ana

Lelescu, Amit Behal, Bin He, Thomas D Griffin, Anna Liu, Brad Wade,

and Trevor Davis. Cobra-mining web for corporate brand and reputation analysis. Web Intelligence and Agent Systems: An International Journal, 7(3):243-254, 2009.

[Stuart and Majewski 2015] Keith Douglas Stuart and Maciej Majewski. Intelligent opinion mining and sentiment analysis using artificial neural networks. In International Conference on Neural Information Processing, pages 103-110. Springer, 2015.


[Taboada et al. 2008] Maite Taboada, Kimberly Voll, and Julian Brooke.

Extracting sentiment as a function of discourse structure and topicality.

Simon Fraser University School of Computing Science Technical Report,

2008.

[Tang et al. 2014] Duyu Tang, Furu Wei, Bing Qin, Ting Liu, and Ming

Zhou. Coooolll: A deep learning system for twitter sentiment classification.

In Proceedings of the 8th International Workshop on Semantic Evaluation

(SemEval 2014), pages 208-212, 2014.

[Thelwall et al. 2012] Mike Thelwall, Kevan Buckley, and Georgios Pal-

toglou. Sentiment strength detection for the social web. Journal of the

Association for Information Science and Technology, 63(1):163-173, 2012.

[Turney 2002] Peter D Turney.

Thumbs up or thumbs down?: semantic

orientation applied to unsupervised classification of reviews. In Proceedings

of the 40th annual meeting on association for computational linguistics,

pages 417-424. Association for Computational Linguistics, 2002.

[Unwin 2000] Antony Unwin. Visualisation for data mining. In International

Conference on Data Mining, Visualization and Statistical System, Séoul, Korea, 2000.

[Volcani and Fogel 2006] Yanon Volcani and David Fogel.

System and

method for determining and controlling the impact of text, November 14

2006. US Patent 7,136,877.

[Vural et al. 2013] A Gural Vural, B Barla Cambazoglu, Pinar Senkul, and

Z Ozge Tokgoz. A framework for sentiment analysis in turkish: Applica-

tion to polarity detection of movie reviews in turkish. In Computer and

Information Sciences III, pages 437-445. Springer, 2013.

[Wallace et al. 2012] Byron C Wallace, Issa J Dahabreh, Thomas A Trikali-

nos, Joseph Lau, Paul Trow, Christopher H Schmid, et al. Closing the gap

between methodologists and end-users: R as a computational back-end. J

Stat Softw, 49(5):1-15, 2012.

[Wang and Araki 2008] Guangwei Wang and Kenji Araki. A graphic rep-

utation analysis system for mining japanese weblog based on both un-

structured and structured information. In Advanced Information Network-

ing and Applications-Workshops, 2008. AINAW 2008. 22nd International

Conference on, pages 1240-1245. IEEE, 2008.


[Wang et al. 2012] Hao Wang, Dogan Can, Abe Kazemzadeh, François Bar,

and Shrikanth Narayanan. A system for real-time twitter sentiment anal-

ysis of 2012 us presidential election cycle. In Proceedings of the ACL 2012

System Demonstrations, pages 115-120. Association for Computational

Linguistics, 2012.

[Zeinalipour-Yazti et al. 2004] Demetrios Zeinalipour-Yazti, Vana Kalogeraki, and Dimitrios Gunopulos. Information retrieval techniques for peer-to-peer networks. Computing in Science & Engineering, 6(4):20-26, 2004.

[Zhang et al. 2003] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data

preparation for data mining. Applied Artificial Intelligence, 17(5-6):375-381, 2003.

[Zhang et al. 2011] Shu Zhang, Wenjie Jia, Yingju Xia, Yao Meng, and Hao

Yo. Product features extraction and categorization in chinese reviews.

In The Sixth International Multi-Conference on Computing in the Global

Information Technology, ICCGI, 2011.

[Ziegler and Skubacz 2012] Cai-Nicolas Ziegler and Michal Skubacz.

To-

wards automated reputation and brand monitoring on the web. In Mining

for Strategic Competitive Intelligence, pages 109-119. Springer, 2012.


Details

Title: Textual Classification for Sentiment Detection. Brand Reputation Analysis on the Web using Natural Language Processing and Machine Learning
College: University of the Witwatersrand
Course: Machine learning - Artificial Intelligence - Big Data - Natural Language Processing
Author: Mike Nkongolo (Author)
Year: 2018
Pages: 54
Catalog Number: V419732
ISBN (eBook): 9783668701670
ISBN (Book): 9783668701687
File size: 2546 KB
Language: English
Tags: textual classification sentiment detection brand reputation analysis natural language processing machine learning