
Behavior of Users Talking about Pathologies and Diseases on Twitter

by Dennis Salcedo (Author), Alejandro León (Author)

Project Report, 2015, 4 pages

Computer Science - Applied Computer Science


1 Abstract

With the amount of data available on social networks, new methodologies for the analysis of information are needed. Some methods allow users to combine different types of data in order to extract relevant information. In this context, the present paper shows the application of a model, via a platform, that groups information generated by Twitter users, thus facilitating the detection of trends and data related to particular pathologies. To implement the model, an analysis tool based on the Levenshtein distance was developed. It determines how many edit operations are required to convert a text into each of the following texts: "gripa" (flu), "dolor de cabeza" (headache), "dolor de estomago" (stomachache), "fiebre" (fever) and "tos" (cough), for tweets from the area of Bogotá.

Among the information collected, identifiable patterns emerged for each one of the texts; there was a clear relationship with the concepts and a contrast with the Levenshtein distance and sentiment analysis. It was found that the proposed model makes it possible to use Twitter to obtain information and, eventually, to apply artificial intelligence and data mining techniques to predict the behavior of these pathologies.

2 Introduction

Social networks are important because of their users' opinions on diverse topics (Martos E Carrión 2010 and Soumen C 2003). Gathering, processing and analyzing those opinions is an important factor in decision making. The study of mass opinion through social networks has therefore emerged as a key methodology in modern sciences (Linto C Freeman 2006), such as psychology (Daniel T. Gilbert, Susan T. Fizke, Gardner L 1998) and economics (Ana S Garcia 2003), among many others, because it has an impact on the content generated by users (Robin B, Jonathan G, Andreas H, Robert J 2001).

Mass analysis can be carried out through web mining (George C 2001), defined as the application of data mining techniques to uncover patterns on the web, together with gatekeeping functions (Shoemaker, P.J 2009), defined as the process of shaping and discarding information in messages. This means that analyzing social networks requires a network (Ron S 2006 and Fisterra 2014) containing patterns or text about specific pathologies.

Therefore, an analyzer based on the Levenshtein distance was needed, linked to grouping, relationship and sentiment diagrams that show how the information behaves. The result gave a close approximation of the people who tweeted about the pathologies with a negative attitude.

Analyzing the information and structure of tweets about pathologies is relevant to the study of data mining and big data, because it makes it possible to determine how many of the users posting a tweet are actually sick.

3 Obtaining the information

The information is collected using a Python script that uses the Twitter and json libraries. Note that in order to use the Twitter API, a Twitter application must be created in the website's developers area (Zeokat 2013). In order to gather tweets associated with a city, they need to be linked to a city code by using the coordinate platform GeoPlanet (Willi S 2010). In this case the code lines are:

illustration not visible in this excerpt

This way, it is possible to find all the tweets in the city of Bogotá that were retweeted on the same day.
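
Since the original code lines are not visible here, the following is only a minimal sketch of the filtering step, not the authors' script. It assumes the tweets were already downloaded as JSON objects, one per line, with the fields `created_at`, `text` and `retweet_count` in the Twitter API v1.1 layout. The function name `retweeted_on` and the sample tweets are hypothetical and made up for illustration; no request to the Twitter API is made.

```python
import json
from datetime import datetime, date

# Assumed layout of the "created_at" field (Twitter API v1.1 style).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def retweeted_on(raw_lines, day):
    """Return the texts of tweets created on `day` that were retweeted at least once."""
    selected = []
    for line in raw_lines:
        tweet = json.loads(line)
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        if created.date() == day and tweet.get("retweet_count", 0) > 0:
            selected.append(tweet["text"])
    return selected


# Made-up sample records, only to show the expected input shape.
sample = [
    '{"created_at": "Mon Mar 02 14:05:00 +0000 2015", "text": "tengo gripa", "retweet_count": 2}',
    '{"created_at": "Mon Mar 02 18:30:00 +0000 2015", "text": "buen dia", "retweet_count": 0}',
    '{"created_at": "Tue Mar 03 09:00:00 +0000 2015", "text": "que tos", "retweet_count": 1}',
]

print(retweeted_on(sample, date(2015, 3, 2)))
```

Restricting the tweets to Bogotá would happen earlier, at collection time, through the city code mentioned above; that step is left out because it depends on API details not shown in this excerpt.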

4 Applying the Levenshtein Distance

The Levenshtein distance is the minimum number of single-character operations (insertion, deletion, or substitution) needed to transform one string into another (Vladimir I Levenshtein 1965). It was used because of the simplicity of the algorithm; the following rough example shows how it behaves on the information:

illustration not visible in this excerpt
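
As a complement, the distance itself can be sketched in a few lines of Python. This is the standard dynamic-programming formulation, given here under the assumption that the analyzer computes the plain, unweighted distance; the function name `levenshtein` is chosen for illustration and is not taken from the authors' tool.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("gripe", "gripa"))  # one substitution: e -> a
```

Keeping only two rows of the table is enough here, because each cell depends only on the current and the previous row.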

The analyzer removes information such as special characters and extra blank spaces from the string. Additionally, it extracts a portion of the string on which to perform the analysis against the desired patterns; the resulting string is built with a maximum of 4 words (Table 1).
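
The cleaning step could look roughly like the sketch below. It assumes that "special characters" means everything except letters (including Spanish accented letters) and that the retained portion is simply the first four words; the text does not say which portion the analyzer keeps, so that choice, like the function name `clean_tweet`, is an assumption.

```python
import re


def clean_tweet(text, max_words=4):
    """Lowercase the text, drop special characters and extra spaces,
    and keep at most `max_words` words."""
    lowered = text.lower()
    # Replace anything that is not a letter or whitespace with a space.
    letters_only = re.sub(r"[^a-záéíóúüñ\s]", " ", lowered)
    # split() without arguments also collapses repeated blank spaces.
    words = letters_only.split()
    # Assumption: the analyzed portion is the first `max_words` words.
    return " ".join(words[:max_words])


print(clean_tweet("Tengo   GRIPA!!! y fiebre, no voy a clase"))
```

The cleaned string can then be compared against a pattern such as "gripa" with the Levenshtein distance.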

A snapshot of the corpus is:

illustration not visible in this excerpt

After cleaning, the following information is obtained (Table 1):

illustration not visible in this excerpt

Table 1: Structure of the corpus with the pathology flu.

The number of operations needed to obtain the desired pattern is similar across the corpus (Fig. 1).

illustration not visible in this excerpt

Fig. 1 Levenshtein distance applied to the corpus that contains flu.

Therefore, the relevant information that shows the pattern associated with the pathology is retained in order to improve the filter applied by the analyzer.


