UNIVERSIDAD DE CHILE
FACULTAD DE CIENCIAS FÍSICAS Y MATEMÁTICAS
DEPARTAMENTO DE INGENIERÍA INDUSTRIAL

DETECCIÓN DE MIEMBROS CLAVE EN UNA COMUNIDAD VIRTUAL DE PRÁCTICA MEDIANTE ANÁLISIS DE REDES SOCIALES Y MINERÍA DE DATOS AVANZADA

THESIS SUBMITTED FOR THE DEGREE OF MAGÍSTER EN GESTIÓN DE OPERACIONES
REPORT SUBMITTED FOR THE PROFESSIONAL TITLE OF INGENIERO CIVIL INDUSTRIAL

HÉCTOR IGNACIO ÁLVAREZ GÓMEZ

ADVISOR: SEBASTIÁN A. RÍOS PÉREZ

COMMITTEE MEMBERS:
FELIPE I. AGUILERA VALENZUELA
GASTÓN A. L’HUILLIER CHAPARRO
LUIS A. GUERRERO BLANCO

SANTIAGO, CHILE
OCTOBER 2010




This thesis is dedicated to all those who believe that with willpower things can be achieved. Nothing is impossible to do, only more or less complicated. Always look ahead, for there will always be a solution. And if you cannot find it, rest assured that there will always be a helping hand ready to assist.


Acknowledgments

Well, it is always difficult to begin the acknowledgments, especially when there are so many people whom I appreciate and who must be mentioned here. I will try to keep this as orderly as possible so that it is easy to read.

First of all, I want to thank my parents for their constant support and affection, for always being there when I needed them, and for giving me the space I needed to step back a little and move forward in leaps. Without their great patience, nothing that has happened to me would have been possible. To my brother Héctor Matías, who helped me develop my thesis without realizing it. Questions that seem so simple to answer when they are asked, but whose answers are in fact not trivial, helped me resolve pending points in my work. I think you have a talent there that you should polish, since that is a fundamental part of being an Engineer. I am very proud of how you are developing in your degree; keep up that discipline and commitment, and you will go far.

Now I would like to thank Andrea, my girlfriend and now fiancée, for those pushes given at just the right moments, and for the scoldings that shook me up and made me move forward. Thank you for the motivation you gave me, especially in the final stretch; if not for that, I might still be stuck in my work. Thank you for being the complement I was always looking for, and for all the company and support you have given me during these 5 years, which will turn into many more. I love you very much.

To my committee, thank you very much for your support and for putting up with the changes that arose during the work; thank you for reacting well to them. To Sebastián Ríos, my advisor, for introducing me to an area as interesting as it is current. This work opened up a world to me that I could not get to know during my time at the school, because it was not part of the content of any course. The desire to go deeper into the subject is there, so I hope not to stray too far from the area. To Felipe Aguilera, my co-advisor, thank you for your comments during the preliminary presentations, for helping me in the initial phase of data collection, and for giving me access to the community that was the fundamental pillar of my work. To Gastón L’Huillier, fellow student and research colleague, thank you very much for your contribution both to the publications we wrote together and to day-to-day life in the office. I will never forget your good spirits and positive attitude. Finally, to Luis Guerrero, thank you very much for your patience and for responding to everything I asked of you at the end of the thesis. Thank you also for the focus of your questions, which made me see the work from another perspective.

During my time at the University, I have met many people who, in one way or another, are part of who I am today. Even if they do not read this, I would like to thank professors Raúl Uribe, Jaime Gonzales, Juan Paulo Wiff, Francisco Santamaría, Eduardo Olguín and Juan Velásquez, who appeared at key moments in my degree. Their way of being makes them professors I will not easily forget. Thank you very much for your teaching.


To my freshman-year friends, companions in epic battles (academic and non-academic) and marathon study sessions. We went through a lot together in Plan Común. Although our paths later diverged, I still remember you with great affection, and I hope we never lose touch and that you never run out of rock. Special mention to Murgas and la Viole: always stay true to your paths. I am very proud of you both.

I cannot forget all those who were in the DOCODE room, my office for this whole year. It was very pleasant to work there; I doubt I will find such an agreeable atmosphere anywhere else. To Edumerlo, my companion in social network work, I am very proud of your work on your Memoria; you deserve it. Gastón, who was always in the room, making it almost unnecessary to have a key. Pato Moya, who gave the room a different twist (it is always good to work with people from other degree programs). Gabriel, Gerardo and Felipe, you are next; you have already shown it with your previous work, so this final phase will be nothing more than a formality. To all of you, thank you very much.

On the academic side, I must thank Vicente for trusting me from the beginning to be the teaching assistant for his course. It has been a great pleasure to work during these 4 years as the teaching assistant for Conta, which led to other teaching positions. To Fernando Ordóñez, for the faith he had in Álvaro and me when he accepted us as his teaching assistants; I hope we have lived up to it during these three semesters. And finally, to Richard Weber, for the confidence he showed in choosing me as a lecturer. It is a very important step for me, and I am enormously grateful for the opportunity. To Julie Lagos, the pillar of the MGO, thank you very much for your efforts and infinite patience, without which the deadlines could not have been met.

Finally, my thanks to all the people I have come to know at this university: classmates, fellow master's students, lunch companions, boletineros, seishines, the assistants I have worked with, and fellow teaching assistants. To my students, both in auxiliary classes and in lectures: even if it does not show, believe me that I keep you very much in mind, and class after class I try to contribute a little more than just explaining a couple of formulas on a slide or a whiteboard; anyone can do that. Always remember that behind the numbers there are people, and that empathy and teamwork are not just things that go on a résumé; they should be part of life itself.

A great chapter in my life closes today and, like life itself, a new one automatically opens with even greater challenges. I hope that all those I hold dear will still be there, giving me the valuable support that has helped me overcome every challenge I have set myself. You are all great.

Héctor I. Álvarez Gómez
October 2010


Executive Summary

The Internet is currently used for many different purposes, among them Social Networks. Their main objective is to connect people as much as possible, regardless of their geographical location. The interaction between users generates a culture of exchange, creating a Community within the Social Network.

There are different types of Communities, among them Communities of Practice, defined as communities where the interaction between users is based on the need to learn about a specific technical area. When the interactions between community members take place over the Internet, they are called Virtual Communities of Practice (VCoP). In this type of community, it is important that the purpose for which it was created is fulfilled over time by its members.

For administrators, it is not easy to identify the members whose contributions are important for the community, because of the number of users and posts generated every day. These key-members are generally discovered with Social Network Analysis (SNA) techniques, but these techniques do not consider the content the members provide. Conversely, data mining techniques applied to documents can measure the content developed in the community, but they do not consider the interactions between users. A methodology that combines both approaches is therefore needed in order to find key-members in terms of content.

In this thesis, a hybrid system was developed that combines content processed by Data Mining with Social Network Analysis techniques to find key-members. The main idea is to obtain a graph representation of the network that considers the concepts or topics contained in a post.

For this purpose, three configurations were defined, according to whom a member is replying to when posting. In addition, two content filters were applied, yielding the concepts and topics discussed in the community. The Degree and HITS algorithms were used to find key-members, making it possible to define two types: motivators, who attract other members to participate, and repliers, who answer the community's questions.

The results show that key-members were found for all network configurations, with a precision equal to or greater than 70% per algorithm. The administrators provided lists of key-members before and after seeing the results, and an increase in the number of key-members was observed, showing that applying SNA techniques helps to improve the search for key-members.

Applying content filters does not yield significant improvements. However, including the content will be helpful for future work, where thematic networks can be generated according to the concepts or topics of the community.


Summary

Nowadays, the Internet is used for many different purposes, among them Social Networks. The main objective of Social Networks is to connect people as much as possible, regardless of their geographical location. The interaction between users generates a culture of sharing, creating a Community within the network itself.

Communities can be classified in different ways. Among them are Communities of Practice: communities where interaction among members is driven by the need to learn about a specific topic. When the interaction between community members takes place over the Internet, the community is called a Virtual Community of Practice (VCoP). In this kind of community, it is important that the purpose for which the community was created is fulfilled over time through the members' contributions.

For administrators, it is not easy to identify the members whose contributions are important for the community, because of the number of users and posts generated every day. These key-members are commonly discovered with techniques such as Social Network Analysis (SNA), but these techniques consider only member participation and not the content that members contribute. On the other hand, Text Mining techniques applied to documents can measure the content developed in the community, but they do not consider the interaction between members. It is therefore necessary to develop a methodology that combines both approaches in order to find key-members in terms of content.

In this thesis, a hybrid approach is developed that combines the content contribution obtained by Text Mining with SNA for key-member discovery. The main idea is to obtain a graph representation of the community network that considers the concepts or topics contained in members' posts.

Three network configurations were defined, according to the assumption made about whom a member is replying to when posting. In addition, two content filters were applied, yielding the concepts and topics discussed in the community. The Degree and HITS algorithms were applied for SNA key-member discovery, finding two kinds of key-members: motivators, who encourage other members to participate, and repliers, who answer the questions of the community.

The results show that it is possible to find key-members when the content is included in the graph representation. In addition, a precision equal to or greater than 70% was obtained for each network configuration and ranking algorithm. Administrators produced lists of key-members before and after reviewing the results, and an increase in the number of key-members could be observed, showing that SNA techniques help to improve key-member discovery.

Applying content filtering does not yield significant improvements. However, including the content will be useful for future work, where thematic networks will be built from the community's concepts or topics.
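
The three network configurations mentioned in this summary can be illustrated with a minimal sketch. It assumes, based only on the algorithm names listed later in the front matter (Creator Reply, Last Reply, All Previous Reply), that a reply is linked to the thread creator, to the immediately preceding post, or to every earlier author. The function name `build_edges`, the mode labels and the sample thread are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def build_edges(thread, mode):
    """Count directed (replier, target) edges for one thread.

    `thread` lists post authors in chronological order; thread[0] is the
    thread creator. `mode` selects the assumption about who is replied to.
    """
    edges = Counter()
    for i in range(1, len(thread)):
        author = thread[i]
        if mode == "creator":          # every reply answers the thread creator
            targets = [thread[0]]
        elif mode == "last_reply":     # every reply answers the previous post
            targets = [thread[i - 1]]
        elif mode == "all_previous":   # every reply answers all earlier authors
            targets = list(dict.fromkeys(thread[:i]))  # unique, order kept
        else:
            raise ValueError(f"unknown mode: {mode}")
        for target in targets:
            if target != author:       # ignore self-replies
                edges[(author, target)] += 1
    return edges


# Hypothetical thread: ana opens it, then ben, cat and ben again post.
thread = ["ana", "ben", "cat", "ben"]
creator_edges = build_edges(thread, "creator")
last_reply_edges = build_edges(thread, "last_reply")
all_previous_edges = build_edges(thread, "all_previous")
```

The same thread yields a star around the creator in the first mode, a chain in the second, and the densest graph in the third, which is why the choice of configuration matters for any ranking computed afterwards.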


Contents

Acknowledgments (Agradecimientos)
Executive Summary (Resumen)
Summary
Contents
List of Tables
List of Figures

1 INTRODUCTION
  1.1 Key-member discovery problem
  1.2 Objectives
    1.2.1 General Objective
    1.2.2 Specific Objectives
    1.2.3 Expected Results
  1.3 Methodology
  1.4 Thesis Structure

2 PREVIOUS WORK
  2.1 Social Network Analysis
    2.1.1 Graph representation of a network
    2.1.2 Metrics used in Social Network Analysis
    2.1.3 Social Network Analysis Applications
    2.1.4 Social Network Analysis on Virtual Communities of Practice (VCoP)
  2.2 Text Mining for content reduction

3 METHODOLOGY
  3.1 Data Selection
  3.2 Preprocessing Data
    3.2.1 Text Processing
    3.2.2 Concept based text mining
    3.2.3 Using Latent Dirichlet Allocation for Topic classification
  3.3 Network Configuration
    3.3.1 Concept-based & Topic-based Network Filtering
    3.3.2 Network Construction
    3.3.3 Network Visualization
  3.4 Social Network Analysis key-member discovery
  3.5 Analysis and Evaluation

4 APPLICATION
  4.1 A real Virtual Community of Practice
  4.2 SNA-KDD application
    4.2.1 Data Selection
    4.2.2 Text Processing
    4.2.3 Network Configuration and key-member discovery
  4.3 Results and Discussion
    4.3.1 Topics obtained
    4.3.2 Resulted Networks
    4.3.3 Key-members Discovered
    4.3.4 Key-members detection algorithm comparison
    4.3.5 Filter algorithm

5 CONCLUSIONS AND FUTURE WORK
  5.1 SNA-KDD methodology
  5.2 Content Filtering
  5.3 Density reduction
  5.4 Key-members discovered
  5.5 Future Work
    5.5.1 Content contribution
    5.5.2 Concepts approach
    5.5.3 Thematic networks
    5.5.4 SNA and Topic extraction computational tool

REFERENCES

Appendix A SNA-KDD Results
  A.1 Key-member obtained
  A.2 Key-member Database
  A.3 Topic extracted


List of Tables

Table 3.1 List of goals and membership values of a singular post
Table 4.1 Plexilandia activity
Table 4.2 Plexilandia general measures
Table 4.3 Database Evolution through SNA-KDD development
Table 4.4 Topics obtained
Table 4.5 Topic meanings
Table 4.6 2009 Densities
Table 4.7 Creator Oriented Motivators Key-members
Table 4.8 Creator Oriented Repliers Key-members
Table 4.9 Administrators’ Key-members
Table 4.10 Global Key-members precision
Table 4.11 Type A Key-members precision
Table 4.12 Type B Key-members precision
Table 4.13 Type C Key-members precision
Table 4.14 Administrator Key-members enhancements
Table 4.15 Global Key-members precision enhancement
Table 4.16 Kendall τ coefficient for rank algorithm
Table 4.17 Kendall τ coefficient for filter algorithm compared with Counting graph
Table A.1 Reply Oriented Motivators Key-members
Table A.2 Reply Oriented Repliers Key-members
Table A.3 All Previous Oriented Motivators Key-members
Table A.4 All Previous Oriented Repliers Key-members
Table A.5 Topic Extracted


List of Figures

Figure 1.1 Key-member Discovery Process
Figure 2.1 Directed and Undirected graph examples
Figure 2.2 Complex Network: A School Network
Figure 3.1 SNA-KDD Process
Figure 3.2 Thread Post Sequence
Figure 3.3 Networks Configurations
Figure 4.1 Initial Database
Figure 4.2 Key-member Database
Figure 4.3 Creator Oriented Graphs
Figure 4.4 Reply Oriented Graphs
Figure 4.5 All previous Oriented Graphs
Figure 4.6 2009 densities comparison
Figure 4.7 Density reduction
Figure 4.8 2009 Creator Oriented Density
Figure 4.9 2009 Reply Oriented Density
Figure 4.10 2009 All Previous Oriented Density
Figure 4.11 Motivators Creator reply ranks Scatter plot
Figure 4.12 Repliers Creator oriented ranks Scatter plot
Figure 5.1 Kamada-Kawai Visualization for 2009 Creator oriented networks
Figure A.1 Key-member Final Database


List of Algorithms

Algorithm 1 Initialize Semantic Weights Matrix
Algorithm 2 Creator Reply Network
Algorithm 3 Last Reply Network
Algorithm 4 All Previous Reply Network


Chapter 1

Introduction

This chapter presents the general background and purpose of this thesis, followed by its general and specific objectives. The methodology used to develop the thesis is then discussed. Finally, the thesis structure is presented, with a brief introduction to each chapter.

1.1 Key-member discovery problem

Nowadays, people are more connected thanks to the Internet. Internet users have experienced a proliferation of web services and applications that are used daily and are becoming more popular, such as e-business (a way to buy and sell products on-line), on-line games (role-playing games, interactive games), streaming media (like YouTube¹ and Grooveshark²), instant messaging software (like G-talk or MSN) and on-line social networks (like Facebook³, LinkedIn⁴ and MySpace⁵). In On-Line Social Networks (OSN), users can interact and communicate with each other: making new friends or contacts, re-establishing ties with old classmates, discussing different topics, and organizing events, among other features. As a result, people are not only more connected today than in the last decade; their communication is also more fluid and quick.

In Social Networks, continuous interaction between members creates a sense of belonging. This benefits the network itself, increasing its activity in terms of sharing experiences, knowledge, or opinions, all of which takes place in a community context. This exchange not only reinforces existing ties between users but also creates new ones, growing the connections between them and maintaining or enhancing their health. In a traditional community (one that develops in a real-life context), members need both physical and temporal coordination in order to interact and share with each other. When a community interacts through a computer-mediated space, it is called a Virtual Community [26]. In this type of community, members from different places around the world can interact in a friendly environment where physical coordination and temporal synchronization are replaced by virtual interaction through a web page.

¹ http://www.youtube.com
² http://www.grooveshark.com
³ http://www.facebook.com
⁴ http://www.linkedin.com
⁵ http://www.myspace.com

There are different kinds of Virtual Communities. Kim et al. [28] organize Social Web Communities by describing the kinds of users, the uses, and the features needed for every type of community. Wenger [58] identified three different Virtual Communities (VC), depending on the objective pursued: access to information (VC of Interest), completing a particular objective (VC of Purpose), or knowledge about a specific topic, skill or profession (VC of Practice). Specifically, that work defines Virtual Communities of Practice (VCoP) as groups of people who share knowledge about specific topics and deepen their own knowledge and expertise by interacting through a friendly interface.

On-Line Social Networks can contain VCoPs. In fact, communities begin their life within an existing Social Network and the collaboration between its users. When their interactions are brought together by shared interests, goals, needs, or practices, a community appears and the users who participate become its members [9]. The participation of community members invites new users to take part, improving the quality of the conversations between them, which in turn contributes to the creation of new community members.

VCoPs are commonly related to professional organizations, academic communities, and even groups of artists who want to improve their techniques by learning from colleagues. This means that the main objective of a VCoP is to share information about specific topics, and members are expected to collaborate by proposing discussions around these main topics. Sometimes, however, the topics raised by members do not match those of the community: either users propose topics unrelated to the desired ones, increasing the number of topics, or they simply do not know how to use the virtual space in which the community develops, and the discussion deviates from the community's main topics [20]. The task of VCoP administrators is therefore to check that the topics generated by members contribute to the community's main topics.

There are, however, members whose contributions are in line with the community's main topics. Moreover, some members lead conversations toward these topics and motivate others to participate in the discussion. Administrators want to know which members contribute to the healthy development of the community in the terms described above. In other words, administrators want to know who the key-members of the community are.

There are several definitions of what a key-member represents: the most participative member [49], the member who answers other members' questions [35], or the member who encourages other members to participate [5]. In other words, key-members participate more than other users, keep the community active, and help less experienced members by replying to their questions. All of these definitions, however, characterize key-members according to their participation, and do not take into consideration how meaningful their contributions are for the community. Therefore, with this approach it only matters whether a member replies in a certain topic, not whether the post answers the original question or contributes to the general discussion.

Once key-members are identified, they can help other community members to develop their skills, increase their knowledge, make discussions easier to understand, and generate participation. Without them, knowledge generation stalls and member activity can decrease to the point of causing the death of the community [49]. In other words, they are the members who keep the community alive and running. There are also communities where key-members are paid to develop content. Therefore, if administrators identify key-members incorrectly, their decisions will not have the expected impact on the community. For that reason, it is important to recognize who the key-members are and to have a methodology to find them.

To detect them, administrators’ time and effort is needed, because they have to review whichcontent contributes to the community. If the community is small in terms of members and posts,administrators can find the key-members manually, but generally this is not the case. The amountof register users in OSNs’ is increasing year by year. For example, Twitter had 475.000 members byFeb. 2008 while it had 7.038.000 members by Feb. 2009, which means 1382% of growth. This growthhas been experienced in many others OSNs’, so one of the great challenges for administrators is todiscover key-members in large communities, where the manual inspection is not possible. Therefore,an automatic key-member detection system is needed.

To face the scalability problem, administrators commonly use techniques such as SocialNetwork Analysis (SNA), which deliver graph representation of the structure of the network andpatterns of interaction between members [57], helping to the visualization of the community, andalso to recognize their behaviour, like who are the relevant members by algorithms which measurestheir participation in the community.

Other approach used is Text Mining (TM), which is the application of Data Mining in textsdata. In this case, community members’ posts are used as input for the TM algorithms, resultingin patterns of the content presenting in the community. The patterns would be helpful if theadministrators wanted to know which are the top topics of the community in a given moment, andcompared with the expected main topics.

Both methods analyze different aspects of the community. On the one hand, SNA can mea-sure the activity of the community, but not the activity content. On the other hand, TM candiscover the relevant content for the community, but not the relationship between the members.Applied separated, only is possible to study the participation of members, or the content generatedin the community (in global terms). But administrators also need to measure how much a mem-ber contribute to communitys’ Main topics, and this methods can not evaluate this aspect of thecommunity if they are applied alone.

For that reason, the hypothesis of this work is that combining SNA with Text Mining methods will improve the understanding of VCoPs, in terms of the relations between members including the content that they share. This will help administrators to discover the key-members according to their participation and how aligned they are with the Main topics, improving not only the process of finding them, but also the understanding of the network structure and of the members' behaviour within it.

1.2 Objectives

1.2.1 General Objective

The main objective of this thesis is to design and develop a hybrid approach, using SNA and Text Mining, in order to enhance key-member discovery in Virtual Communities of Practice.

1.2.2 Specific Objectives

1. To characterize VCoPs, in order to build a graph representation.

2. To use the different graph configurations in order to evaluate which configuration is best for extracting experts in a VCoP.

3. To improve the traditional SNA approach with KDD.

4. To evaluate the proposed approach in a real world Virtual Community of Practice.

1.2.3 Expected Results

Among the results expected from this thesis are:

• A methodology for key-member discovery that includes the interaction between members and the content generated by them.

• A classification of community content.

• A database that contains information about members, the discussions held in the community, and the structure of interactions.

• Graph representations of the community.

• An algorithm that builds a graph representation of a Virtual Community of Practice, considering both SNA-only and SNA with Text Mining configurations.

• The application of Social Network Analysis to each graph or network structure in order to discover key-members.

• The key-members of the community identified by different methods. It is also expected that these key-members change, in favor of the real key-members, when content is included in the algorithms.

• As content is included in the discovery methodology, it is expected that not only will key-members be found, but also that the interactions will be filtered, which could result in a less dense graph representation.

1.3 Methodology

The general idea of the key-member discovery process is shown in Figure 1.1. First, the required data from the VCoP is collected and processed to obtain a graph representation of the network. Then, SNA is applied in order to find key-members. With these results, administrators have the information to make decisions about community development.

Figure 1.1: Key-member Discovery Process.

The methodology used for the development of this thesis is structured in the following steps:

1. Related work

For the accomplishment of this thesis, it is necessary to have knowledge of Social Network Analysis and its applications to Virtual Communities of Practice. Text Mining methods for content extraction are also required in order to determine the relevance of the content generated by community members. For that reason, the state of the art of both SNA and TM will be reviewed to establish the most appropriate methods for fulfilling the main purpose of the thesis.

2. Graph definition and configuration

The community graph representation depends on its definition, in other words, on the definition of both nodes and arcs. How the graph is configured in terms of links between nodes is also relevant for further experiments, because the links represent the interaction between community members. Key-member discovery algorithms will be applied to these graphs.

3. Social Network Analysis definition and previous work

Social Network Analysis will be reviewed, specifically the state of the art of SNA applied to OSNs, VCs, and VCoPs. This review will consider not only key-member detection works, but also other issues that can be solved with SNA, such as clusters, structure analysis, and broker detection, among others.

4. Application and Evaluation of the Proposed Model

Text Mining methods will be applied to a real-world VCoP, resulting in a new representation of the members' content. Then, different graph representations will be defined and built according to the community data and to how replies between members are considered. Finally, SNA will be applied to the different network configurations in order to find the community key-members.

5. Results analysis and conclusions

After the SNA and TM methods are applied, different results are presented: an improvement of the visualization, a database structure to store all the information about key-members, a content classification according to the TM methods, and a new method to find key-members. This method is evaluated in two ways: first, by comparing its results with the key-members identified by administrators, and second, through a benchmark of how much the results obtained by each SNA method differ.

1.4 Thesis Structure

In the next chapter, related work is presented: the state of the art in Social Network Analysis, its applications in Virtual Communities of Practice for key-member detection, and Text Mining for topic extraction and content reduction. The main idea of this chapter is to establish that current approaches do not consider the content that community members develop.

In Chapter 3, the main contribution of this thesis is presented, following the SNA-KDD process: the community structure and data required for this work; the Text Mining approach for community content reduction; the network configuration, according to how replies in the community are defined; network filtering, considering the results of the Text Mining methods; the network construction algorithm, which explains how to build the graph with and without the filter; network visualization methods, to provide a graphical understanding; the SNA methods applied to find key-members in the different network configurations; and finally how the results will be analyzed and evaluated.

Then, in Chapter 4, an experiment on a real-life VCoP is presented. Here, the VCoP is described in terms of content, users, and main topics. The text processing method for the required content representation and the evaluation method are also presented. Then, the main results for both the traditional and the proposed system are presented and analyzed according to the evaluation criteria previously introduced, including the benchmark between different combinations of graph and detection methods.

Finally, in Chapter 5 the main conclusions are presented, including our main findings and contributions, as well as future work and lines of research.

Chapter 2

Previous Work

In this chapter, the state of the art and previous work are reviewed. First, the Social Network Analysis approach and its applications are presented. Second, research on SNA in Virtual Communities of Practice is reviewed, including key-member discovery approaches. Finally, Text Mining techniques for content reduction are described, covering both semi-automatic and automatic techniques.

2.1 Social Network Analysis

Social Network Analysis [57] helps to understand relationships in a given community by analyzing a graph representation of it. It focuses on the study of ties between people, groups of people, organizations, and even countries. Combined, these ties form a network, which is the object of analysis. The main goal of social network analysis is detecting and interpreting patterns of social ties among actors [12].

The main concepts related to SNA are: first, there are interdependencies between actors, because their actions in the network affect others; second, the ties (or linkages) between actors represent a transfer of a certain resource; and third, network models conceptualize structures as patterns of relations among actors. In other words, SNA analyzes the set of actors and the linkages among them that make up the network, instead of analyzing the behaviour of each actor individually, as other approaches do.

Network representation is a primary task, because the patterns that SNA finds depend on the interactions defined. In this step, the actors, their relationships, and how to represent them are established. Then, techniques to find network patterns are applied. Which technique is used depends on which aspect of the network is studied:

• Cohesion: Who is related to whom, whether there are sub-networks, how strong they are, and what happens if some of these sub-networks are removed from the network.

• Brokerage: How information is transported in the community, and who collaborates in this task, both those who generate the information and those who act as a bridge to deliver it to others.

• Ranking: Who the most important, most pointed-to, or most popular actors are, and how this popularity affects the development of the network.

• Roles: Beyond finding the most popular actors, sometimes members perform a specific task in the network. So, it is necessary to identify the role that actors assume when they interact with others.

One of the main benefits of using SNA is that visualization improves the analysis and comprehension of the network [44, 62], compared with statistical analysis. For example, Melo et al. [11] show that the application of SNA facilitates data comprehension and is better than the application of statistics such as the box score. They use NBA data as an example, because the amount of data generated by this league is too large for classic statistics to predict a team's success over the full season.

2.1.1 Graph representation of a network

In general, a network is defined as follows: a set A of a actors and a set R of r relations between actors. Then, the graph G that represents a network is a pair G(N, E), where N = {n_1, ..., n_a} corresponds to the nodes and E = {e_1, ..., e_r} to the edges or arcs of the network. In the case of SNA, nodes are the actors of the network and edges or arcs are the ties between them.

Figure 2.1: Graph examples.

There are two kinds of ties: directed and undirected. When edges are used, the relationship is undirected. For example, if a interacts with b (a and b being actors of the network), then the edge e_1 = (a, b) = (b, a) represents their interaction, and is drawn in the graph as a line between a and b. In contrast, arcs represent a directed relationship. If a starts an interaction with b, then the arc a_1 = (a, b) represents this tie, and is drawn as an arrow from a to b. Figure 2.1 illustrates both possible configurations of the network.
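
The difference between edges and arcs can be sketched in a few lines of Python. This is only an illustration under assumed names (`edge`, `arc`, and the actors `a`, `b`, `c` are not part of the thesis): an undirected edge is stored as an unordered pair, while a directed arc keeps the order of its endpoints.

```python
# Minimal sketch: undirected edges vs. directed arcs (all names are illustrative).

def edge(a, b):
    """Undirected tie: (a, b) and (b, a) are the same edge."""
    return frozenset((a, b))

def arc(a, b):
    """Directed tie: a starts the interaction with b."""
    return (a, b)

nodes = {"a", "b", "c"}

undirected = {edge("a", "b"), edge("b", "c")}
directed = {arc("a", "b"), arc("c", "b")}

same_edge = edge("a", "b") == edge("b", "a")  # order does not matter
same_arc = arc("a", "b") == arc("b", "a")     # order matters
```

With this representation, asking whether `b` and `a` are tied gives the same answer as asking about `a` and `b` in the undirected case, but not in the directed one.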

There are more kinds of configurations, depending on what is to be studied with the network. For example, to represent a school, the actors would be students and teachers, and two relationships could be whether a student is in a certain teacher's class and which students are friends. Figure 2.2 shows this network: the blue squares are teachers, the red circles are students, the continuous lines are the “class relationship” and the broken lines are the “friendship relationship”. The “class relationship” is directed because a student is in the class of a given teacher, and the “friendship relationship” is undirected because friendship is, generally, a mutual relation.

Figure 2.2: A School Network.
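
A network with two actor types and two relation types, like the school example, could be stored as follows. The actors and ties below are invented for illustration and do not reproduce Figure 2.2; the helper names `students_of` and `friends_of` are also assumptions.

```python
# Sketch of the school example: two actor types and two relation types.
# All actors and ties below are made up for illustration.

actors = {
    "T1": "teacher",
    "S1": "student",
    "S2": "student",
    "S3": "student",
}

# "class" relation: directed arc (student -> teacher).
class_arcs = {("S1", "T1"), ("S2", "T1")}

# "friendship" relation: undirected edge, stored as a frozenset.
friend_edges = {frozenset(("S1", "S2")), frozenset(("S2", "S3"))}

def students_of(teacher):
    """Students with a class arc pointing to the given teacher."""
    return sorted(s for (s, t) in class_arcs if t == teacher)

def friends_of(student):
    """Actors sharing a friendship edge with the given student."""
    return sorted(
        other
        for e in friend_edges if student in e
        for other in e if other != student
    )
```

Keeping each relation in its own collection makes it possible to analyze the class ties and the friendship ties separately or together.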

2.1.2 Metrics used in Social Network Analysis

The classical SNA metrics are presented below:

• Degree: The basic measure in Social Networks. In general terms, it counts the total number of arcs and/or edges that a node has. If the network is directed, two sets are defined. The first is D+(i), i ∈ N, which corresponds to all arcs starting from node i, and the second is D−(j), j ∈ N, which represents all arcs ending at node j. Then, the Out-degree of a node i is defined as the number of arcs e ∈ E where e ∈ D+(i), while the In-degree of a node j is the number of arcs where e ∈ D−(j).

• Centrality: In its basic, degree-based form, it is defined as the ratio between the degree of a node and the maximum degree of any node in the network. Closeness centrality of a node i ∈ N is the ratio between the number of nodes reachable from i and the sum of the distances between i and those nodes. Betweenness centrality of a node i is the proportion between the number of shortest paths between pairs of other nodes that include i and the total number of shortest paths in the network.

• Core: Filters the network according to the degree of nodes. A k-core is a sub-network which contains only nodes with a degree greater than or equal to k (counted within that sub-network).
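
As a rough illustration of these definitions, the sketch below computes degree, degree-based centrality, closeness, and a k-core on a toy directed network. The network and function names are assumptions made for this example; betweenness is left out to keep the sketch short, and the k-core treats arcs as undirected and counts each neighbour once.

```python
from collections import deque

# Toy directed reply network; an arc (u, v) means "u replied to v".
arcs = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a"), ("c", "a")]
nodes = sorted({n for arc in arcs for n in arc})

def out_degree(i):
    """Number of arcs starting from node i."""
    return sum(1 for (u, _) in arcs if u == i)

def in_degree(j):
    """Number of arcs ending at node j."""
    return sum(1 for (_, v) in arcs if v == j)

def degree_centrality(i):
    """Degree of i divided by the maximum degree in the network."""
    degree = {n: in_degree(n) + out_degree(n) for n in nodes}
    return degree[i] / max(degree.values())

def closeness(i):
    """Number of nodes reachable from i divided by the sum of their distances."""
    dist = {i: 0}
    queue = deque([i])
    while queue:  # breadth-first search following arc direction
        u = queue.popleft()
        for (x, y) in arcs:
            if x == u and y not in dist:
                dist[y] = dist[u] + 1
                queue.append(y)
    reachable = len(dist) - 1
    total = sum(dist.values())
    return reachable / total if total else 0.0

def k_core(k):
    """Nodes left after repeatedly removing nodes with fewer than k neighbours."""
    neighbours = {n: set() for n in nodes}
    for (u, v) in arcs:
        neighbours[u].add(v)
        neighbours[v].add(u)
    alive = set(nodes)
    changed = True
    while changed:
        changed = False
        for n in list(alive):
            if len(neighbours[n] & alive) < k:
                alive.remove(n)
                changed = True
    return alive
```

In this toy network, node `b` receives the most replies and therefore has the highest degree, while node `d` is dropped from the 2-core because it is tied to only one other node.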

2.1.3 Social Network Analysis Applications

Some general research has been presented previously. Chakrabarti et al. [7] present a survey of Graph Mining, explaining the patterns and algorithms that can be used on a network. Getoor et al. [22] describe the benefits of using Link Mining for SNA and present a taxonomy of common Link Mining tasks. According to related work, SNA applications can be classified into the following categories.

1. Network behaviour

SNA is useful to understand different behaviours in the network. In the case of user behaviour, Musial et al. [40] present the most common measures for understanding users, such as Node Degree, Prestige, and Centrality, which were described in Section 2.1.2. On the other hand, to describe the behaviour of the network itself, Mislove et al. [39] study and analyze the structure of different large-scale on-line social networks. They describe their structure in terms of users, links, and groups, as explained in Section 2.1. They also confirm the existence of phenomena such as Small-world (networks that have a small diameter and exhibit high clustering) and Power-law (the probability that a node has degree k is proportional to k^(−γ), for large k and γ > 1). Depending on what administrators need to study, one of these approaches would be used.

With a more applied approach, Wang et al. [56] studied the correlation between electronic Word of Mouth (e-WOM) and product sales prediction in a cell phone discussion board from an SNA perspective. They use the interaction between users to see whether the conversations affect the decision to buy a cell phone. Pfeil et al. [46, 45] analyzed the behaviour of a community of older people to see how they seek and give support to each other.

2. Community properties

Sometimes, ties among a group of members are stronger than with the rest, forming a sub-community. That is why one of the objectives of SNA applications is to find these sub-communities inside the network. Fortunato [17] presents a survey of sub-community detection in graphs. Covering different domains, such as biological and social networks, it explains in depth the characteristics of a graph and the algorithms to find communities in it. Also, Cocciolo et al. [10] present how to discover community behaviour in an on-line document repository.

With an interesting approach, Alberich et al. [1] use SNA on the Marvel Universe, a world of comic book characters. They studied the relations between the characters and found that the Marvel Universe satisfies the Small-world and Power-law properties, and also discovered a high level of interaction between good characters, forming a community. In addition, Gleiser [23] explained that the community formed by the superheroes confirms why “good always wins”: the villains do not have a community and commonly fight in isolation, in contrast to the collaborative work of the heroes.

3. Users behaviour

To understand users' behaviour and why they belong to a certain community, Kwon et al. [33] investigate why a user chooses to stay in or leave a Social Network. They found that perceived usefulness and perceived ease of use are reasons why users stay in the community. Also, individuals who have higher social identity, altruism, and telepresence are more likely to participate in these communities. However, issues like the contents of networks or the goals that users pursue are not included in the analysis, so it does not apply to every community.

Regarding members of the community, role understanding and key-member discovery are very similar. In role discovery, SNA is applied to understand the different classifications of users. For example, Yelupula et al. [60] use e-mail information to understand the roles inside a company. These roles were compared with the organizational structure of the enterprise, and the results were significant, with high levels of accuracy for each cluster found. Key-member discovery, on the other hand, aims to find the most important users of a Social Network: it ranks users but does not classify them.

Another study was carried out by Kumar et al. [31]. They present a segmentation of the network into three regions: isolated users, who have entered the network but never interact with the rest of the users or members (they are merely witnesses of the interactions); sub-networks which interact almost only among themselves; and a giant well-connected core which does not need key-members to persist over time.

Key-members help to keep the community active, but there is another kind of member that also serves this purpose. These are middle-man members (also known as brokers), who facilitate the transfer of information between members. Kossinets et al. [30] demonstrate how much information is lost if brokers are deleted from the network. The number of these brokers affects the network, because if there are many, the network will not suffer a loss of information. More about SNA for key-member discovery is explained in the next section, detailing how the approach is useful for Virtual Communities of Practice.
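
The Power-law property mentioned under Network behaviour above can be made concrete with a small numerical sketch. It builds an exact degree distribution P(k) proportional to k^(−γ) and recovers γ from the slope of log P(k) against log k. The values γ = 2.5 and a maximum degree of 100 are arbitrary assumptions; on real degree data, a least-squares fit on a log-log plot is only a rough check.

```python
import math

def power_law_pmf(gamma, k_max):
    """P(k) proportional to k**(-gamma) for k = 1..k_max, normalised to sum to 1."""
    weights = [k ** (-gamma) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_log_slope(pmf):
    """Least-squares slope of log P(k) against log k."""
    xs = [math.log(k) for k in range(1, len(pmf) + 1)]
    ys = [math.log(p) for p in pmf]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

pmf = power_law_pmf(gamma=2.5, k_max=100)
slope = log_log_slope(pmf)  # approximately -gamma for an exact power law
```

Because the distribution is an exact power law, the points lie on a straight line in log-log scale and the slope equals −γ up to rounding.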

2.1.4 Social Network Analysis on Virtual Communities of Practice (VCoP)

For VCoPs it is very important to generate, store, and keep the knowledge resulting from members' interaction. The success of a VCoP depends on a governance mechanism [49] and on the participation of key members (also called leaders [5] or core members [49]). Likewise, the goal of every VCoP member is to learn specific knowledge from the community. Therefore, the content of the posts that are interesting for members must be considered in the analysis.

As explained in Section 1.1, there are many definitions of what a key member is: the most participative member [49], the member who answers other members' questions [35], or the member who encourages other members to participate [5]. However, none of these definitions take into consideration the content or the meaning of their interactions (posts, replies, etc.). Moreover, these approaches do not measure the contribution that members make to the Main topics when they interact with other members. Interaction is usually measured by whether user A replies to a post of user B; whether A successfully answers the question or not is not taken into account. Those approaches just consider participation, even if the post is about a topic totally different from the thread or post to which it replies. Therefore, we hypothesize that defining key members this way leads to an incomplete analysis of their behaviour in the community.

In the approach of this thesis, a key member is defined as a member who participates (asking or answering) according to a specific purpose of the VCoP. The more aligned a member is with the VCoP's purposes, the greater the degree of importance that member has. This way, key members are obtained by combining post content with SNA techniques in a single process.

Key-member discovery is a very important task for administrators, because these members are the ones that keep the community alive. They share their experiences and knowledge, create tutorials, develop videos on a subject to help non-expert members, etc. Administrators or community owners often pay these experts to develop content for the community, since they know that this content will have a high impact on the community, producing more interaction between members and helping to attract new members.

In small communities, administrators or owners know almost all members and their participation, because the quantity of both members and posts can be checked manually. Therefore, they know who is who in the community. However, in bigger communities, where thousands of members publish thousands of posts daily, this task becomes unmanageable. In general, administrators do not have time to read every post, or the number of posts makes it impossible for a human administrator to analyze them.

Since a Social Network can contain a community, it is very common to apply SNA to VCoPs. As explained in the previous section, the result is a graphical representation which helps to find the community core, sub-communities, network clusters, peripheral members, etc. Key members belong to the community's core; therefore, algorithms that discover the core should be applied, such as HITS [29] or the measures described in Section 2.1.2, like degree or centrality [40].
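
To give an idea of how HITS ranks members, the following is a minimal sketch of its hub and authority iteration on a small reply network. The member names, the arc direction (replier to original author), and the fixed number of iterations are assumptions made for illustration, not the configuration used in this thesis.

```python
import math

def hits(arcs, iterations=50):
    """Hub and authority scores by repeated updates over arcs (u -> v)."""
    nodes = sorted({n for arc in arcs for n in arc})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of n: sum of hub scores of the nodes pointing to n.
        auth = {n: sum(hub[u] for (u, v) in arcs if v == n) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub of n: sum of authority scores of the nodes n points to.
        hub = {n: sum(auth[v] for (u, v) in arcs if u == n) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Hypothetical reply arcs: (u, v) means "u replied to v".
reply_arcs = [("ana", "eva"), ("bob", "eva"), ("cal", "eva"), ("ana", "bob")]
hub, auth = hits(reply_arcs)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

With this arc direction, a high authority score corresponds to a member who receives replies from active repliers, and a high hub score to a member who replies to well-replied members. Reversing the arcs would swap the two interpretations.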

The same issues treated in SNA are interesting to solve in VCoPs, for example, community detection [17], moderation management [20], and member analysis [15]. Some cases of SNA in VCoPs are as follows:

1. Members behaviour

Toral et al. [54] use SNA in a VCoP to analyze the role of brokers. As explained before, these members are the link between askers and repliers. They used a graph representation of the community based on the replies between members, where the value of an arc is a measure of the members' interaction. They found how the brokers evolve through time and how valuable they are for the community. They do not consider the content generated by the community and only use SNA in the traditional way.

Lin et al. [34] focus on understanding which factors or perceptions encourage a member to participate and share knowledge in a professional VC. They calculated ratios such as reciprocity, compatibility, and loyalty. This study is relevant because it establishes some qualities that define a key-member.

Fang et al. [16] perform a statistical study to establish why a member maintains a knowledge-sharing intention over time. They conducted a survey which measures three research streams related to members' intentions. It not only measures the trust between members, but also shows that trust between members and managers influences the community. This approach is very important, because a healthy community depends on the knowledge-sharing intention of its members, especially of key-members.

Chen et al. [8] carry out work similar to that of Fang et al. [16], but focused on explaining why members give knowledge to or receive knowledge from other community members. Among the factors, interpersonal trust and knowledge-sharing self-efficacy are the most relevant for understanding contributions in a community. Both works emphasize that trust between members is important for a healthy community. Therefore, key-members play a relevant role in knowledge contribution, motivating other members to participate and frequently bringing up new key-members.

In addition, [13] presents a marketing approach to determine the influence that Virtual Communities have on customers' consumption decisions. The authors identified six categories of members according to their interest and participation in the community, highlighting the core members, who are the most frequent visitors and the ones who spend the most time sharing their knowledge and participating in different threads.

2. Key-members discovery approaches

The expert detection approach focuses on extracting members' interactions and recognizing the most participative members with algorithms like betweenness [50], centrality, or HITS [29].

Liu et al. [35] use a community based on a Q&A discussion board and define an expert as a person who has answered similar questions in the past. Zhang et al. [61] work with a web forum and define an expert as a person whose knowledge matches the words obtained from a query browser. They used different ranking algorithms to recognize the experts and then classify them into expertise levels. A pending issue in this work is that they cannot measure the expertise of the members: they know who is an expert, but not how expert he or she is.

Fu et al. [19] use e-mail data as a VCoP and create the network based on who replies to whom and whether the reply is direct, a copy, or a background copy. This configuration is helpful as a reference for the network configuration of the present work. Campbell et al. [6] also use e-mail data, and extract the quality of the corpus to determine whether a user is an expert or not. They then build an expert graph and apply HITS to find the experts in this graph. This work has a similar approach to the present work, except that the amount of data they use is smaller and it is not strictly a VCoP.

Ehrlich et al. [15] use a work network and establish the relationship between members as “who knows who”. They then determine three levels of expert: the person who knows the most people, the person who has the most knowledge, and the person who is a bridge between others (in other words, the broker).

Amatriain et al. [2] take another approach. First, they recognize the experts of the community, and then use this information in a recommendation system. This work is relevant for understanding how important it is to know who the experts are in order to make better enhancements for the community.

Expertise is measured either by member participation or by user-generated content only, but the two are not combined. Both alternatives present difficulties: ignoring the content created by users could result in flooders, trolls, or spammers being identified as experts and, worse, in the real community experts not being considered. On the other hand, ignoring the interactions and considering only the content as an isolated factor could identify a non-participative member as an expert. A hybrid methodology that considers the content of the members' interactions would not only greatly improve key-member detection, but also offer new features that neither approach can offer separately.

2.2 Text Mining for content reduction

As described in the previous section, several techniques have been proposed to extract key members [42], classify users according to their relevance within the community [47, 60], and discover and describe the resulting sub-communities [32], among other applications. However, all these approaches leave aside the meaning of the relationships among users. Therefore, an analysis based only on replies to mails or posts is not a good indicator of the strength or weakness of relationships. It is necessary to incorporate the content, either the e-mail corpus or the post replies. Web Text Mining [55] is useful for finding patterns in web text content. In particular for this work, as members generate content in posts, patterns can be used as a reduction of the community's content.

In order to obtain patterns which reduce community content, two approaches are presented below. The first is presented by Ríos et al. in [52], who used a Concept-based Text Mining approach to extract the goal accomplishment of a VCoP. The objective of a VCoP is to generate knowledge through members' interaction. This knowledge is classified by administrators or community owners into a set of goals. Each goal is related to a set of terms, which are scored according to how important they are for the goal. This way, the contribution of a post to knowledge development is measured by a set of goal scores.
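
The idea of measuring a post by a set of goal scores can be illustrated with a simplified sketch. The goals, terms, and weights below are invented, and the averaging rule is a simple stand-in chosen for this example rather than the scoring used in [52].

```python
# Hypothetical goals and term weights; in the described approach these
# would be defined by the community administrators or owners.
goal_terms = {
    "hardware": {"amplifier": 1.0, "resistor": 0.8, "solder": 0.5},
    "software": {"firmware": 1.0, "compile": 0.7},
}

def goal_scores(post):
    """Score a post against each goal as the average term weight of its words."""
    words = post.lower().split()
    if not words:
        return {goal: 0.0 for goal in goal_terms}
    return {
        goal: sum(terms.get(w, 0.0) for w in words) / len(words)
        for goal, terms in goal_terms.items()
    }

scores = goal_scores("solder the resistor before the amplifier")
```

Each post is thus reduced to one number per goal, so a post that mentions only hardware terms scores above zero for the hardware goal and zero for the software goal.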

The other approach is presented by Blei et al. [4], who use a probabilistic model in order to find underlying topics. Basically, given a number of topics, they estimate the probability that a word belongs to each of them. Applications of this approach are presented by McCallum et al. [37, 38]. They determine roles and topics in text-based Social Networks, where topics depend on the interaction between members. Usually, link structure is used to find sub-communities but, as with key-member discovery, including the content improves the precision of the discovery. Pathak et al. [43] include topic extraction, which is used to study relationships between members. This results in a sub-community for each topic. Also, member roles were discovered from their interactions in the community. However, both approaches were applied to e-mail data, which may not have a clear purpose or goal to accomplish.
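
The notion of a word belonging to each topic with some probability can be illustrated with Bayes' rule. The sketch below does not perform topic model inference: the two topics, their word distributions, and the uniform prior are hard-coded assumptions, standing in for values that a model such as the one in [4] would estimate from the posts.

```python
# Hypothetical word distributions for two topics, P(word | topic).
# In a topic model these would be estimated from the posts; here they are
# hard-coded so the sketch stays self-contained.
word_given_topic = {
    "audio":  {"amplifier": 0.5, "speaker": 0.4, "compile": 0.1},
    "coding": {"amplifier": 0.1, "speaker": 0.1, "compile": 0.8},
}
topic_prior = {"audio": 0.5, "coding": 0.5}

def topic_posterior(word):
    """P(topic | word) by Bayes' rule from P(word | topic) and P(topic)."""
    joint = {
        t: word_given_topic[t].get(word, 0.0) * topic_prior[t]
        for t in topic_prior
    }
    total = sum(joint.values())
    if total == 0:
        return {t: 0.0 for t in topic_prior}
    return {t: p / total for t, p in joint.items()}

posterior = topic_posterior("compile")
```

Here the word "compile" is assigned mostly to the coding topic, while a word that appears in neither distribution receives zero probability for both.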


Chapter 3

Proposed methodology for Key-member Discovery

In this chapter, the main contribution of this thesis is presented. A methodology to solve the key-member discovery problem is described using an SNA-KDD approach, which includes text preprocessing and content reduction with LDA [4] and Concept-based text mining [52], graph topologies and construction, incorporation of community content, SNA expert detection algorithms, and evaluation of the results.

As explained in Section 1.1, the purpose of the key-member discovery problem is to find the relevant members of a VCoP. A key-member is defined as a person who answers other members' questions or who encourages and promotes community participation. Section 2 presents approaches that consider only members' participation to find them, but this is not sufficient, because the content they post may not contribute to the community. It is necessary to incorporate community content into the members' interaction, because it improves the relation by filtering out replies that do not contribute to the community's main topics, and it also measures how meaningful members' comments are.

To explain the methodology applied to combine these approaches, an adaptation of Knowledge Discovery in Databases (also known as KDD) is used, called SNA-KDD. The idea is to use the KDD steps and incorporate text processing and SNA into the process. Figure 3.1 illustrates the modifications to the KDD approach: first, the preprocessing step is carried out with Text Mining techniques; in the transformation step, the data is used to configure a network in order to obtain a graph representation of the community; finally, the data mining step is replaced by an SNA step, in which patterns are extracted from the graph to discover the key-members of the configuration. All of these steps are explained next.


Chapter 3. METHODOLOGY

Figure 3.1: SNA-KDD Process.

3.1 Data Selection

VCoPs are usually supported by forum systems (such as vBulletin, phpBB, etc.). The forum is the virtual place in which members interact with each other and generate knowledge. A forum has categories in which different topics are discussed. For example, a general forum may have categories such as Sports, Movies, Music, etc., which are not related to one another. In VCoPs, on the contrary, the categories are related to the main purposes of the practice that members are interested in developing. Each conversation in the VCoP is arranged in a thread, generally started by a member's question, and every member can participate in a thread by replying with a post. In other words, the members' interaction in a VCoP is represented by the posts in the different threads.

As the objective of this thesis is to discover the key-members of a VCoP, data about the members is necessary. Data such as nicknames or user IDs are used to identify members, to know with whom each one is interacting, and to associate the content with the correct member. Another relevant source of data is the community content, represented by the members' posts. As with other web content, a post may contain text, images, hyperlinks, videos, etc. For the purposes of the present work, only the text of the community posts is used. All the data related to a post is necessary: the thread and category it belongs to, the date of the post, who posted it, and its text.

3.2 Preprocessing Data

In order to use content for key-member discovery, members' posts will be used, but it may not be possible to use them directly. In terms of the message itself, forums sometimes include quotes, so a member's content may be replicated in other members' posts. In this case, a member could be classified as a key-member only because of the content that he or she is quoting. A first filter is therefore to identify the quotes and delete them from the post, keeping only the newly generated content. There are also posts that do not represent a contribution to the community, such as spam, trolling or flood posts. These messages have to be detected and ignored in the later analysis that compares members' replies.

From the point of view of the words in a post, misspellings and acronyms make it difficult to compare a pair of posts. Forums also contain terms that are not proper words, such as emoticons or expressions like *laughs*, hahaha, LOL, ROFL or XD.

To solve this problem, stemming and stopword filtering are applied. Stemming reduces each word to its root: for example, conjugated verbs were replaced by their infinitive and plural words by their singular. In addition, all expressions that do not represent a contribution to the content were replaced with the word "useless", and other elements such as hyperlinks, images or forum tags were replaced by the word "misc". Once posts were stemmed, every stopword, such as articles, pronouns and adverbs, together with "misc" and "useless", was deleted from the text. The result is a filtered post containing only words that could be useful for further analysis.
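The filtering steps above can be sketched as follows. This is only an illustration under several assumptions that are not part of the thesis: quotes are taken to be BBCode-style `[quote]...[/quote]` blocks, the stopword list and the noise pattern are tiny hypothetical examples, and stemming is left out entirely (a real pipeline would plug a language-specific stemmer in before the stopword filter).

```python
import re

# Hypothetical, tiny lists for illustration only; a real setup would use
# much larger, language-specific resources.
STOPWORDS = {"the", "a", "an", "is", "it", "to", "and", "of", "i", "you"}
MARKERS = {"misc", "useless"}
NOISE = re.compile(r"^(lol|rofl|xd|(ha){2,}h?|\*\w+\*)$", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
QUOTE = re.compile(r"\[quote.*?\].*?\[/quote\]", re.IGNORECASE | re.DOTALL)


def clean_post(text):
    """Return the list of useful tokens left in a post after filtering."""
    text = QUOTE.sub(" ", text)      # drop quoted content (BBCode assumed)
    text = URL.sub(" misc ", text)   # hyperlinks become the marker "misc"
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?()\"'")
        if NOISE.match(token):
            token = "useless"        # emoticon-like expressions
        # stopwords and both markers are removed at the end
        if token and token not in STOPWORDS and token not in MARKERS:
            tokens.append(token)
    return tokens
```

For instance, a post that quotes another member, laughs and links to a page keeps only its own content words; the quoted text never reaches the later comparison between posts.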

Even after stemming and stopword filtering, the number of useful words may be too high for a word-to-word comparison. It is not necessary to use every single word to determine whether a member is replying to a question; instead, the topics or concepts mentioned in the posts can be used to study members' behaviour.

When the content of replies is included, it is possible to obtain a better graph representation. If a reply is closely aligned with the VCoP's main topics, it is a positive interaction and should be kept; otherwise, it should not be considered for further analysis. As a result, once content reduction is applied and included in the members' interaction, the resulting network is a filtered version of the original one that keeps only meaningful relationships.

3.2.1 Text Processing

To represent the text data for text processing, the following notation is introduced. Let V be the vector of all the different words that define the vocabulary used in the community after the preprocessing step. We refer to a word v as a basic unit of discrete data, indexed by {1, ..., |V|}. A post message $p_j$ is then a sequence of a subset of $S_j$ words from V, where $|p_j| = S_j$. To compose the post, let $w_{ij}$ be defined as

$$w_{ij} = \begin{cases} 1 & \text{if word } v_i \in p_j \\ 0 & \text{otherwise} \end{cases} \tag{3.1}$$

with $\sum_{i=1}^{|V|} w_{ij} = S_j$. Finally, a corpus is defined as a collection of |P| post messages, denoted by $C = (p_1, \ldots, p_{|P|})$.

A vectorial representation of the corpus of posts is given by TF-IDF $= (m_{ij})$, $i \in \{1, \ldots, |V|\}$ and $j \in \{1, \ldots, |P|\}$, where $m_{ij}$ is a weight that expresses how important a given word is, relative to the others, in a post. The weights $m_{ij}$ considered in this research are an adaptation of the tf-idf term [53] (term frequency times inverse document frequency), defined by

$$m_{ij} = \frac{f_{ij}}{\sum_{k=1}^{|V|} f_{kj}} \times \log\left(\frac{|C|}{1 + n_i}\right) \tag{3.2}$$

where $f_{ij}$ is the frequency of the i-th word in the j-th post and $n_i$ is the number of posts containing word i. The tf-idf term is a weighted representation of the importance of a given word in a post that belongs to the corpus. The term frequency (TF) indicates the weight of each word in a post, while the inverse document frequency (IDF) states whether the word is frequent or uncommon across posts, setting a lower or higher weight respectively. Because posts were stemmed and their stopwords eliminated, some posts may end up without words. To avoid an undefined value of the tf-idf, the IDF was adapted as shown in Equation 3.2.
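A minimal sketch of Equation 3.2 in plain Python follows. The function name `tf_idf` and the input layout (one list of tokens per post) are assumptions made for illustration; rows of the returned matrix are words and columns are posts, matching the indices i and j of the equation. Posts left empty by the filtering simply get a column of zeros.

```python
import math
from collections import Counter


def tf_idf(posts):
    """posts: list of token lists. Returns (vocabulary, matrix m[i][j])."""
    vocab = sorted({w for p in posts for w in p})
    n_posts = len(posts)                      # |C|
    # n_i: number of posts that contain word i
    doc_freq = Counter(w for p in posts for w in set(p))
    matrix = [[0.0] * n_posts for _ in vocab]
    for j, post in enumerate(posts):
        counts = Counter(post)
        total = sum(counts.values())          # sum over k of f_kj
        if total == 0:
            continue                          # empty post: column stays at zero
        for i, word in enumerate(vocab):
            tf = counts[word] / total
            idf = math.log(n_posts / (1 + doc_freq[word]))
            matrix[i][j] = tf * idf
    return vocab, matrix
```

Note that with this form of the IDF a word present in every post receives a slightly negative weight, since $|C| / (1 + n_i) < 1$ in that case.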

3.2.2 Concept based text mining

Fuzzy Logic for Conceptual Classification

The following approach is based on the thesis of Sebastián Ríos [51]. Some definitions are needed first. The values of Linguistic Variables (LV) are not numbers but words or sentences in natural language. These variables are more complex but less precise. Let u be an LV; we can obtain a set of terms T(u) that covers its universe of discourse U, e.g. T(temperature) = {cold, nice, hot} or T(pressure) = {high, ok, low}.

A Fuzzy Relation (×) is a representation of the membership value between spaces of objects. Let A = {a1, . . . , an} and B = {b1, . . . , bm} be two sets of objects; then a fuzzy relation A × B is defined by Equation 3.3, and the membership value between ai and bj is represented as ai × bj = µij.

µ : A×B → µij ∈ [0, 1] (3.3)

A Fuzzy Composition (⊗) is a rule to compose fuzzy relations. Let A, B and C be sets of objects, Q(A, B) and R(B, C) fuzzy relations on A × B and B × C respectively, and µQ and µR the membership functions of Q and R. The fuzzy composition is then defined by Equation 3.4, where ∨ and ∧ are the compositional rules and ◦ represents the composition of the fuzzy relations.

$$\mu_{[A \times B] \otimes [B \times C]} = \mu_{Q \circ R}(a, c) = \bigvee_{b \in B} \{\mu_Q(a, b) \wedge \mu_R(b, c)\} \tag{3.4}$$

In order to use LVs for conceptual classification, we assume that a post can be represented by a fuzzy relation [Concepts × Posts], also written [C × P]. This is a matrix in which each row is a concept and each column is a post. To obtain this matrix, the relation can be rewritten in the more convenient form of Equation 3.5 [36]. In this expression, "Terms" are the words that can be used to define a concept, and "P" refers to the set of posts.

[Concepts× Posts] = [Concepts× Terms]⊗ [Terms× Posts] (3.5)

As defined above, let |P| be the total number of posts in the whole VCoP, |V| the total number of different words among all posts, and K the total number of concepts defined for the VCoP site. We can then characterize the matrix [Concepts × P] by its membership function, shown in Equation 3.6, where µC×P = µ[C×T]⊗[T×P] represents the membership function of the fuzzy composition in Equation 3.5 and the membership values are in [0, 1]. In other words, it expresses how much a post of P belongs to a concept of C.

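$$\mu_{C \times P}(x, z) = \begin{pmatrix} \mu_{1,1} & \mu_{1,2} & \cdots & \mu_{1,|P|} \\ \mu_{2,1} & \mu_{2,2} & \cdots & \mu_{2,|P|} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{K,1} & \mu_{K,2} & \cdots & \mu_{K,|P|} \end{pmatrix} \tag{3.6}$$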

There are several alternatives for performing the fuzzy composition; [41] presents a study comparing six different reasoning models. To decide which one to use in this thesis, two aspects were considered.

On the one hand, the fact that a concept appears in a post does not imply that all the terms related to it were mentioned, because what is measured is the degree to which a post expresses a concept. Moreover, a subset of the terms should be sufficient to express the meaning of a concept [51]. Hence, if some terms are not present in a post, the degree to which it expresses a concept should not be altered.

On the other hand, the membership degree is measured through a fuzzy relation between concepts and posts, as defined above, which means that the membership value lies in the range [0, 1]. Any value over 1 obtained from the equations has no interpretation. For that reason, values over 1 are assigned a membership value of one. This is the reason for using the compositional rule of Equation 3.7.

$$\mu_{Q \circ R}(c, p) = \min\left\{1, \sum_{t \in T} \mu_Q(c, t) \cdot \mu_R(t, p)\right\} \tag{3.7}$$

Here Q(C, T) and R(T, P) are the fuzzy relations [Concepts × Terms] and [Terms × Posts], which share the set of Terms, and µQ(c, t), with c ∈ C and t ∈ T, and µR(t, p), with t ∈ T and p ∈ P, are the membership functions of Q and R respectively. Compared with Equation 3.4, ∨ is the limited sum defined by min(1, ·) and ∧ is the algebraic product (a ∗ b).
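The compositional rule can be illustrated with a short NumPy sketch. It assumes that the limited sum runs over all shared terms, so the composition reduces to an ordinary matrix product clipped at 1; the function name `fuzzy_compose` and the example values are hypothetical.

```python
import numpy as np


def fuzzy_compose(concepts_terms, terms_posts):
    """Compose [Concepts x Terms] with [Terms x Posts].

    The algebraic product plays the role of the "and" rule and the
    limited sum min(1, .) plays the role of the "or" rule.
    """
    q = np.asarray(concepts_terms, dtype=float)
    r = np.asarray(terms_posts, dtype=float)
    return np.minimum(1.0, q @ r)


# Two concepts, two terms, two posts (illustrative values only).
Q = [[1.0, 0.8],
     [0.0, 0.5]]
R = [[0.6, 0.0],
     [0.7, 0.2]]
memberships = fuzzy_compose(Q, R)
```

In this example the first concept reaches 0.6 + 0.56 = 1.16 for the first post before clipping, so its membership value is capped at 1, while a missing term only contributes zero to the sum and does not lower the value obtained from the terms that are present.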

Identification and Definition of Concepts

In order to apply the above proposal, the relevant concepts for the study must first be identified. It is important to remark that the purpose is not to obtain a conceptual classification for information retrieval, which may include thousands of concepts and terms in order to retrieve all relevant documents regardless of the keywords used in the user's query. What is required are concepts that describe the visitors' alignment with the community purposes. To this end, the knowledge of experts is used to identify the most interesting concepts for describing visitors' behavior in the web site.

Then, we used a thesaurus and dictionaries to extract the terms that define the relevant concepts, i.e. to express every concept as a list of terms (assuming that a concept is an LV). We used synonyms, quasi-synonyms, antonyms, etc.

Afterwards, we need to define the membership values of the fuzzy relations [Concepts × Terms] and [Terms × P]. For the matrix [Terms × P], Equation 3.2 was used: the weighted relative frequency of each word in a post gives its membership value.

Defining the values of [Concepts × Terms] is more complex. We performed this operation by asking the expert to assign the degree to which a term represents a concept. To do so, he compared two terms at a time and gave a value between 0 and 1. For example, a synonym can receive a value near 1; a quasi-synonym may receive a value between 0.65 and 1; an antonym can be set to 0, etc. This is an indirect method with one expert.

Finally, we obtained the fuzzy relation µG×P(x, z) by applying Equation 3.4. Table 3.1 presents a column of the matrix µG×P(x, z), which represents the goal classification of a single post from the VCoP. From this table we can say that the post has a strong relation with goals 1 and 2, and almost no relation with goals 3, 4 and 5.

Table 3.1: List of goals and membership values of a single post

Goals     µG×P
Goal 1    0.88
Goal 2    0.72
Goal 3    0
Goal 4    0.12
Goal 5    0.01
...       ...

On the other hand, it is also possible to apply an automatic fuzzification approach. The problem with such approaches is that they are usually designed to include all possible concepts. They take all the words, or the most repeated words, in the corpus as concepts, and then use a thesaurus and an algorithm for automatic fuzzification. Because these algorithms incorporate all possible concepts, the results are very complex to understand. Most of these approaches are used for information retrieval, where hundreds of concepts with hundreds or thousands of terms are needed to retrieve all relevant documents when the user enters a query. This is not our case: our main goal is to perform a semantic filtering of the network, not to retrieve documents based on a query.

3.2.3 Using Latent Dirichlet Allocation for Topic classification

A topic model can be considered as a generative probabilistic model that relates documents andwords through variables which represent the main topics inferred from the text itself. In this context,a document can be considered as a mixture of topics, represented by probability distributions whichcan generate the words in a document given these topics. The inferring process of the latentvariables, or topics, is the key component of this model, whose main objective is to learn from textdata the distribution of the underlying topics in a given corpus of text documents.

A prominent topic model is Latent Dirichlet Allocation (LDA) [3, 4, 25]. LDA is a Bayesian model in which the latent topics of documents are inferred from probability distributions estimated over the training dataset. The key idea of LDA is that every document of the corpus has a probability distribution over a set of topics (T), where every topic is modeled as a probability distribution over a subset of words (vi ∈ V). The parameters of these multinomial distributions are drawn from Dirichlet distributions.

The advantage of this method over the concept-based approach is that the topics do not need to be defined beforehand, because they are discovered by the algorithm. Expert opinion is needed only to provide a description of each discovered topic.

As described by [4], the latent Dirichlet allocation model can be represented as a probabilisticgenerative process described by the following sequence of events:


For a given post p:

1. The words that appear in a post are independent events. To represent this, let S ∼ Poisson(ξ) be the number of words in a given post. The s-th word of the post is represented by $w_s$.

2. Let β be the distribution over words for each topic and θ a multinomial distribution over topics for each post, where θ ∼ Dir(α). The Dirichlet distribution is used in Bayesian statistics to estimate the hidden parameters of a categorical distribution [21], which in this case are the topics.

3. Then, for each word $w_s \in p$, let $z_s$ ∼ Multinomial(θ) be the topic assigned to that word, where θ holds the probability of each topic in post p.

4. Finally, a word $w_s$ is chosen from $p(w_s | z_s, \beta)$, which is a multinomial probability conditioned on the topic $z_s$.

The final set of topics T is built from the k topics, each described by its top n words; both k and n must be defined a priori in the experimental setup.
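The four generative steps can be sketched with NumPy as a sampling routine. This only simulates the generative story; it does not perform the inference that LDA requires in practice. The function name `generate_post`, the parameter shapes and the example values are assumptions made for illustration.

```python
import numpy as np


def generate_post(alpha, beta, xi, rng):
    """Sample one post following the four generative steps.

    alpha: Dirichlet parameter, shape (k,)
    beta:  topic-word probabilities, shape (k, vocabulary size), rows sum to 1
    xi:    Poisson mean of the post length
    """
    n_topics, vocab_size = beta.shape
    length = rng.poisson(xi)                              # step 1: S ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                          # step 2: theta ~ Dir(alpha)
    topics = rng.choice(n_topics, size=length, p=theta)   # step 3: z_s ~ Mult(theta)
    # step 4: each word is drawn from the word distribution of its topic
    words = np.array(
        [rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words


rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
beta = np.array([[0.5, 0.5, 0.0],     # topic 0 never emits word 2
                 [0.0, 0.2, 0.8]])    # topic 1 never emits word 0
theta, topics, words = generate_post(alpha, beta, xi=6.0, rng=rng)
```

Words are returned as vocabulary indices; every sampled word necessarily has non-zero probability under the topic it was assigned to.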

For LDA, given the smoothing parameters β and α and a topic mixture θ, the idea is to determine the joint probability distribution that generates, from a set of topics T, a post composed of a set of S words w ($\mathbf{p} = (w_1, \ldots, w_S)$):

$$p(\theta, \mathbf{z}, \mathbf{p} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{s=1}^{S} p(z_s \mid \theta)\, p(w_s \mid z_s, \beta) \tag{3.8}$$

where $p(z_s \mid \theta)$ is simply $\theta_i$ for the unique topic i such that $z_s^i = 1$. A final expression can be deduced by integrating Equation 3.8 over the random variable θ and summing over the topics z ∈ T. Given this, the marginal distribution of a message can be defined as follows:

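$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{s=1}^{S} \sum_{z_s \in T} p(z_s \mid \theta)\, p(w_s \mid z_s, \beta) \right) d\theta \tag{3.9}$$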

The final goal of LDA is to estimate the distributions described above in order to build a generative model for a given corpus of messages. Several methods have been developed for making inference over these probability distributions, such as variational expectation-maximization [4], a variational discrete approximation of Equation 3.9 used empirically by [59], and a Gibbs sampling Markov chain Monte Carlo model [24], which has been efficiently implemented and applied by [48].

Both approaches are helpful for reducing the content and for measuring how much a post contributes to the community. To obtain these measures, an algorithm that scores the community posts according to their concept or topic contribution is implemented. Algorithm 1 presents the pseudo-code for post filtering, which takes the filtered posts as input and produces a topic or concept score for each post. When LDA is applied, the set of parameters {k, n, α, β} is determined according to the size of the vocabulary and the number of documents used (in this case, the number of posts) [48]. Afterwards, the TF-IDF matrix is multiplied by the topic or concept matrix (called the Semantic Matrix) in order to clean up the overall vectorial representation of the corpus.

The output of this algorithm is a topic or concept score for each post. These scores measure how much each topic or concept is contained in a post. The interaction between members can now be evaluated by comparing their topic or concept scores and how similar they are.

Algorithm 1 Initialize Semantic Weights Matrix
Input: V (Vocabulary)
Input: P (Filtered Posts)
Input: k (Number of Topics or Concepts)
Output: Semantic Weights Matrix SWM[|P|, k]
  TF-IDF[|P|, |V|] ← weights of Eq. 3.2
  SM[k, |V|] ← Build SM (semantic matrix) by classifying according to Topics or Concepts
  SWM[|P|, k] ← TF-IDF ⊗ SM^T
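As a rough illustration of the last step of Algorithm 1, the sketch below treats ⊗ as an ordinary matrix product between a posts-by-words TF-IDF matrix and the transposed semantic matrix. That reading is an assumption; for the concept-based case the clipped composition of Equation 3.7 could be used instead. The function name and the numbers are hypothetical.

```python
import numpy as np


def semantic_weights(tfidf_posts_by_words, semantic_matrix):
    """SWM[|P|, k] = TF-IDF[|P|, |V|] times SM[k, |V|] transposed."""
    tfidf = np.asarray(tfidf_posts_by_words, dtype=float)
    sm = np.asarray(semantic_matrix, dtype=float)
    if tfidf.shape[1] != sm.shape[1]:
        raise ValueError("both matrices must share the vocabulary dimension")
    return tfidf @ sm.T


# Two posts, three words, two topics or concepts (illustrative values only).
tfidf = [[0.2, 0.0, 0.1],
         [0.0, 0.3, 0.0]]
sm = [[1.0, 0.0, 0.5],
      [0.0, 1.0, 0.0]]
swm = semantic_weights(tfidf, sm)   # one row of k scores per post
```

Each row of the result is the score vector of one post, which is what the later network filtering compares between a post and the post it replies to.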

3.3 Network Configuration

To build the social network, the members' interaction must be taken into consideration. In general, a member's activity is tracked through his or her participation in the forum, and participation occurs when a member posts in the community. Because the activity of the VCoP is described by the members' participation, the network is configured as follows: nodes are the VCoP members, and arcs represent the interaction between them. How to link the members and how to measure their interactions to complete the network is our main concern.

There are two kinds of forums. In Directed Forums, it is clear to whom a member is replying, and posts are arranged according to which member is being replied to and the time at which they were posted. In Undirected Forums, it is not possible to identify to whom a member is replying, and posts are arranged only according to the time at which they were posted. Figure 3.2 illustrates both forum classifications.

For Undirected Forums, it is necessary to make assumptions about which member is being replied to. In this thesis, three network representations of a VCoP are defined, according to the following reply schemas:

1. Creator reply Network: When a member creates a thread, every reply is related to him/her. This network representation is the least dense one (density is measured in terms of the number of arcs in the network).


Figure 3.2: An example of thread post sequence.

2. Last reply Network: Every reply in a thread is a response to the last post. This network representation has an intermediate density.

3. All previous reply Network: Every reply in a thread is a response to all the posts that are already in that thread. This network representation is the densest one.

Figure 3.3 presents these three forum reply representations. Arcs represent members' replies and nodes represent the members who made the posts. In a traditional approach, the weight of an arc is a simple counter of how many times a given member replies to another.
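The three reply schemas and the traditional counter weights can be sketched as follows. The data layout (a thread as the chronological list of its post authors) and the arc direction (from the replier to the member being replied to) are assumptions made for this illustration, as are the function names.

```python
from collections import Counter


def reply_pairs(authors, schema):
    """Yield (replier, replied_to) pairs for one thread.

    authors: post authors in chronological order; authors[0] created the thread.
    schema:  "creator", "last" or "all_previous".
    """
    for pos in range(1, len(authors)):
        if schema == "creator":
            targets = [authors[0]]
        elif schema == "last":
            targets = [authors[pos - 1]]
        elif schema == "all_previous":
            targets = authors[:pos]
        else:
            raise ValueError(f"unknown schema: {schema}")
        for target in targets:
            if target != authors[pos]:    # ignore self-replies
                yield authors[pos], target


def count_weights(threads, schema):
    """Traditional weighting: number of replies from one member to another."""
    weights = Counter()
    for authors in threads:
        weights.update(reply_pairs(authors, schema))
    return weights


thread = ["ana", "ben", "cat", "ben"]
creator = count_weights([thread], "creator")
last = count_weights([thread], "last")
all_prev = count_weights([thread], "all_previous")
```

For this single thread the three schemas produce 2, 3 and 4 distinct arcs respectively, which matches the ordering of densities described in the list above.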

Figure 3.3: Three different network configurations which represent a given thread interaction.

In order to consider members' replies according to the community purpose (for any of these configurations), and to filter out noisy posts, both concept-based and topic-based message reduction are performed.

3.3.1 Concept-based & Topic-based Network Filtering

Previous work [52] provides a method to evaluate the accomplishment of community goals. In this work we use this approach to classify the members' posts according to the VCoP's goals. Each goal is defined by a set of terms, which are keywords or statements in natural language.


The idea is to compare two members' posts with a distance measure $d_m$, which is the cosine of their score vectors. If its value is over a certain threshold θ, an interaction between them is considered to exist. We support the idea that this helps to avoid irrelevant interactions. For example, in a VCoP with k concepts (or topics), for a thread t, let $P_j^t$ be a post of user j that is a reply to post $P_i^t$ of user i. The distance between them is calculated with Equation 3.10.

$$d_m(P_i^t, P_j^t) = \frac{\sum_k g_{ik}^t\, g_{jk}^t}{\sqrt{\sum_k (g_{ik}^t)^2 \sum_k (g_{jk}^t)^2}} \tag{3.10}$$

Here $g_{ik}^t$ is the score of concept (or topic) k in the post of user i in thread t, calculated by Algorithm 1, explained in Section 3.2. Clearly, the distance exists only if $P_j^t$ is a reply to $P_i^t$ under one of the three schemas defined above. The weight of arc $a_{i,j}$ is then calculated according to Equation 3.11.

$$a_{i,j} = \begin{cases} 1 & \text{if } d_m(P_i^t, P_j^t) \geq \theta \\ 0 & \text{otherwise} \end{cases} \tag{3.11}$$

We used this criterion in all three configurations previously described (Creator reply, Last reply and All previous reply).
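Equations 3.10 and 3.11 translate almost directly into code. In the sketch below the function names are hypothetical, and returning 0.0 when one of the score vectors is all zeros is an assumption, since the equation is undefined in that case.

```python
import math


def post_similarity(scores_i, scores_j):
    """Cosine of two score vectors (Equation 3.10); 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(scores_i, scores_j))
    norm = math.sqrt(sum(a * a for a in scores_i) * sum(b * b for b in scores_j))
    return dot / norm if norm > 0 else 0.0


def arc_weight(scores_i, scores_j, theta):
    """Binary arc weight (Equation 3.11)."""
    return 1 if post_similarity(scores_i, scores_j) >= theta else 0


# A reply that stays on the same concept is kept, an off-topic one is dropped.
kept = arc_weight([0.9, 0.1], [0.8, 0.3], theta=0.8)
dropped = arc_weight([1.0, 0.0], [0.5, 0.5], theta=0.8)
```

Because the scores are non-negative, the measure lies in [0, 1], with larger values meaning more similar posts; this is why arcs are kept when the value is at or above θ.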

3.3.2 Network Construction

Now that posts are scored, the next step is to construct the graph filtered with the Concept-based or LDA approach. Algorithm 2 presents the pseudo-code showing how the graph Gc = (N, A) is built for the Creator reply network. Equation 3.10 is used to compare post scores. The result is a graph filtered by content, in which members reply to thread creators.

A similar algorithm is used for the other orientations. The only change is the post used in the comparison. In the Creator reply network algorithm, it is the post that creates the thread. The Last reply network, on the other hand, uses the last post of the thread processed by the algorithm. Algorithm 3 presents the construction of this network.

In the case of the All previous reply network, each post is compared with every post already present in the thread. Algorithm 4 presents the pseudo-code for this last network topology. To build a non-filtered graph, the algorithms are the same, with the difference that Equation 3.10 is not used.

Regarding arc weights, the approach of this thesis is only interested in whether an appropriate interaction exists between two members, not in how many interactions exist between them. In all these pseudo-codes the result is a graph with arc weights equal to 1, even if two members have more than one pair of posts over θ.


Algorithm 2 Creator Reply Network
Input: {V, P, k, Users}
Output: Network Gc = (N, A)
  Build SWM according to Algorithm 1
  Initialize N = {}, A = {}
  for each thread t ∈ P do
    i ← t.creator
    N ← N ∪ i
    for each j ∈ {t.replies}, i ≠ j do
      if d_m(P_i^t, P_j^t) ≥ θ then
        N ← N ∪ j
        a_{i,j} ← 1
        A ← A ∪ a_{i,j}
      end if
    end for
  end for

Algorithm 3 Last Reply Network
Input: {V, P, k, Users}
Output: Network Gc = (N, A)
  Build SWM according to Algorithm 1
  Initialize N = {}, A = {}
  for each thread t ∈ P do
    i ← t.creator
    N ← N ∪ i
    for each j ∈ {t.replies}, i ≠ j do
      if d_m(P_i^t, P_j^t) ≥ θ then
        N ← N ∪ j
        a_{i,j} ← 1
        A ← A ∪ a_{i,j}
      end if
      i ← j
    end for
  end for


The function of the threshold θ is to filter out replies whose content scores are not similar to those of the post being replied to. For that reason, and as a first approach, a strict filter threshold of 0.8 is used in this thesis. In this way, only very similar interactions are considered in the filtered graphs. Determining which value of θ is most appropriate for filtering the network is proposed as further work.

Algorithm 4 All Previous Reply Network
Input: {V, P, k, Users}
Output: Network Gc = (N, A)
  Build SWM according to Algorithm 1
  Initialize N = {}, A = {}, Prev = {}
  for each thread t ∈ P do
    Prev = {}
    i ← t.creator
    N ← N ∪ i
    Prev ← Prev ∪ i
    for each j ∈ {t.replies}, i ≠ j do
      for each k ∈ Prev do
        if d_m(P_k^t, P_j^t) ≥ θ then
          N ← N ∪ j
          a_{k,j} ← 1
          A ← A ∪ a_{k,j}
        end if
      end for
      Prev ← Prev ∪ j
    end for
  end for
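The three construction algorithms differ only in which earlier posts a reply is compared with, so they can be sketched as one Python function with a `topology` switch. The data layout (each thread as a list of `(member, scores)` tuples in posting order), the function names and the choice of adding both endpoints of every accepted arc to the node set are assumptions of this sketch, not the author's implementation.

```python
import math


def cosine(u, v):
    """Equation 3.10; returns 0.0 for an all-zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def build_filtered_network(threads, topology="creator", theta=0.8):
    """Return (nodes, arcs); arcs is a set of (replied_to, replier) pairs.

    threads: list of threads, each a list of (member, scores) tuples in
    posting order, the first tuple being the thread creator's post.
    """
    nodes, arcs = set(), set()
    for thread in threads:
        if not thread:
            continue
        nodes.add(thread[0][0])
        for pos in range(1, len(thread)):
            member, scores = thread[pos]
            if topology == "creator":
                candidates = thread[:1]
            elif topology == "last":
                candidates = thread[pos - 1:pos]
            elif topology == "all_previous":
                candidates = thread[:pos]
            else:
                raise ValueError(f"unknown topology: {topology}")
            for other, other_scores in candidates:
                if other != member and cosine(other_scores, scores) >= theta:
                    nodes.update((other, member))
                    arcs.add((other, member))   # a set keeps every weight at 1
    return nodes, arcs


thread = [("ana", [1.0, 0.0]), ("ben", [0.9, 0.1]),
          ("cat", [0.0, 1.0]), ("dan", [0.1, 0.9])]
creator_net = build_filtered_network([thread], "creator")
last_net = build_filtered_network([thread], "last")
```

In this example the creator-reply network keeps only the arc between ana and ben, while the last-reply network also links cat and dan, whose posts share the second concept even though neither is similar to the opening post.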

3.3.3 Network Visualization

There are many techniques for presenting a network. The most common is the Circular visualization, in which the nodes are arranged around a circle and the arcs between them are drawn inside it.

Other approaches focus on the aesthetics of the graph: avoiding crossings between edges or arcs, minimizing the distance between nodes, controlling the separation of non-incident edges or arcs and the angle formed by two incident edges or arcs at a vertex, preserving the symmetry of the visualization, or minimizing the drawing area.

Force-directed methods for graph visualization define a system of forces that act on nodes and arcs or edges, adjusting the position of the nodes depending on the distance r between a pair of them, the attraction force $f_a$ and the repulsive force $f_r$.

Eades [14] presents a first approach that replaces arcs with springs of a specific natural length and places springs of infinite natural length between nonadjacent nodes. The attraction force is then defined by Equation 3.12, where $c_a$ is the attraction factor, and the repulsive force by Equation 3.13, where $c_r$ is the repulsion factor. The system begins with a random distribution and then iterates until the changes in node positions are small enough.

$$f_a = c_a \log(r) \tag{3.12}$$

$$f_r = \frac{c_r}{r^2} \tag{3.13}$$

Fruchterman and Reingold [18] include the number of nodes and restrict the drawing area by defining a parameter $k = c\sqrt{\text{area} / \#\text{nodes}}$, where c is found experimentally. In this case, $f_a$ and $f_r$ are calculated by Equations 3.14 and 3.15 respectively.

$$f_a = \frac{r^2}{k} \tag{3.14}$$

$$f_r = -\frac{k^2}{r} \tag{3.15}$$

Finally, Kamada and Kawai [27] consider not only the geometrical distance but also the graph distance, in terms of the shortest path between a pair of nodes. They then minimize the difference between both distances through the system's energy function. This function is defined by Equation 3.16, where k is a constant, $p_u$ and $p_v$ are the positions of nodes u and v ∈ G(N, E), and d(u, v) is the length of the shortest path between nodes u and v.

$$E_s = \sum_{u, v \in G(N, E)} k \left( |p_u - p_v| - d(u, v) \right)^2 \tag{3.16}$$
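The energy function can be evaluated with a few lines of Python. The sketch below only computes the energy of a given layout and does not minimize it. It assumes an undirected graph stored as an adjacency dictionary, hop counts as the graph distance, a sum over unordered pairs, and that unreachable pairs are skipped; all names are hypothetical.

```python
import math
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search distances (in hops) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def layout_energy(adjacency, positions, k=1.0):
    """Energy of Equation 3.16, summed over unordered pairs of nodes."""
    nodes = sorted(adjacency)
    energy = 0.0
    for a, u in enumerate(nodes):
        dist = hop_distances(adjacency, u)
        for v in nodes[a + 1:]:
            if v not in dist:
                continue                  # unreachable pair: skipped here
            gap = math.dist(positions[u], positions[v])
            energy += k * (gap - dist[v]) ** 2
    return energy


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
straight = layout_energy(path, {"a": (0, 0), "b": (1, 0), "c": (2, 0)})
bent = layout_energy(path, {"a": (0, 0), "b": (1, 0), "c": (1, 1)})
```

A path of three nodes drawn on a straight line with unit spacing has zero energy, because every geometric distance equals the corresponding graph distance; bending the path brings the end nodes closer than their graph distance and raises the energy.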

In the case of the Eades approach, the system finds its equilibrium by iterating over the node positions. In contrast, Fruchterman-Reingold and Kamada-Kawai solve complex systems of equations to find the equilibrium. The running time depends on the number of nodes and arcs in the network's graph, because both the iterative process and the number of equations depend on these two factors.

For this thesis, the Circular visualization and the Kamada-Kawai approach are used to present the resulting networks of the VCoP. The reason is to show how dense the different network configurations are (in the case of the Circular visualization) and to present the interaction in the community (in the case of Kamada-Kawai).


3.4 Social Network Analysis key-member discovery

Once the graphs are built according to Section 3.3, it is possible to apply SNA to them. Section 2 explained the different applications of SNA; each of them can be applied to these graphs, but only key-member discovery is relevant for the scope of this thesis. Thanks to the incorporation of member-generated content, the participation represented in the graphs corresponds to a similarity between the posts of askers and repliers.

To find key-members, the members' degree is measured. As explained in Section 2.1.2, the degree corresponds to the number of arcs that a member has. Sometimes this is sufficient; if not, in-degree and out-degree are used. There are different techniques for finding the most relevant members of the community, such as centrality and degree (or in-degree or out-degree), which are very common in SNA [57], or HITS [29].

HITS is a technique originally used to rank web pages according to their link structure. When users search web pages only by keywords, the results may miss relevant pages and instead present pages that contain the keywords but no relevant information. Kleinberg [29] established that if a page has relevant information, other pages will reference it (through hyperlinks), giving this page authority over those pages. There are also pages that do not contain the information themselves but link to pages that do, acting as a hub between the user and the required information.

Therefore, this approach focuses on the link structure of a set of pages and finds the best authorities and hubs among them: a good hub is a web site that points to good authorities and, vice versa, a good authority is pointed to by good hubs.

The iterative algorithm defines, for each page p ∈ P (where P is a set of pages), an authority value (a_p) and a hub value (h_p). At initialization, all pages have the same score for both authority and hub. Then, in each iteration these values change according to Equations 3.17 and 3.18: the authority of a page depends on the hub values of the pages that point to it, and the hub value depends on the authority of the pages that the page itself points to.

\[ a_p = \sum_{q : (q,p) \in E} h_q \tag{3.17} \]

\[ h_p = \sum_{q : (p,q) \in E} a_q \tag{3.18} \]

After this re-evaluation, the vectors of authority and hub values are normalized. The algorithm stops when there is no significant variation between consecutive iterations, that is, when the difference between the vectors is less than or equal to ε. The result is a ranking of the set of pages by authority and by hub value.
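The iteration described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the thesis: it assumes Euclidean normalization, measures the difference between iterations as the largest absolute change of any score, and computes hub values from the authority values updated in the same iteration (as in Kleinberg's original formulation). The function names and the sample arcs are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a score dictionary to unit Euclidean norm (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {n: (x / norm if norm > 0 else 0.0) for n, x in vec.items()}


def hits(nodes, edges, eps=1e-6, max_iter=100):
    """Iterate Equations 3.17 and 3.18 until scores change by at most eps."""
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Eq. 3.17: authority of p is the sum of hub values of nodes pointing to p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # Eq. 3.18: hub of p is the sum of authority values of nodes p points to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        new_auth, new_hub = _normalize(new_auth), _normalize(new_hub)
        change = max(
            max(abs(new_auth[n] - auth[n]) for n in nodes),
            max(abs(new_hub[n] - hub[n]) for n in nodes),
        )
        auth, hub = new_auth, new_hub
        if change <= eps:
            break
    return auth, hub


# Hypothetical arcs (source points to target)
nodes = ["a", "b", "c"]
edges = [("b", "a"), ("c", "a"), ("c", "b")]
auth, hub = hits(nodes, edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this small example node a receives the most arcs from good hubs and ranks first as authority, while node c, which points to both other nodes, ranks first as hub.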


In a VCoP context, where pointing represents a reply, a good hub is a member who replies to good askers, and a good authority is an asker who is replied to by good hubs. With this approach, the key-members appear separately in the two rankings: top hubs are replier key-members, who generate content according to the community's purposes or goals, and top authorities are motivator key-members, who encourage or attract other members to participate and, consequently, lead experts to reply to their questions.

As two kinds of key-members were defined, the output of HITS is useful to identify them, unlike other techniques that only order members by their participation without distinguishing the kind of contribution (motivating or replying) each key-member makes. The result of this technique is a ranking of members, ordered by their participation in the community and taking into account their content contribution to it.

3.5 Analysis and Evaluation

The previous tasks offer different points of analysis, beginning with data preprocessing. The first analysis concerns the meaning of the resulting LDA topics. As with K-Means, every topic extracted with LDA has to be analyzed to understand what it represents. Community administrators can explain the meaning of the set of words that represents a topic, which yields a characterization of each extracted topic. In contrast, the Concept based approach does not require this analysis, because administrators decide beforehand which main concepts they want to measure in the community.

In the graph construction step, different topologies can be analyzed. The content-filtered graph has two objectives: the first is to include community content in the graph structure in order to compare interactions, and the second is to reduce the graph. Filtering graphs implies that some member interactions will not be considered when building the network, eliminating arcs that traditional SNA would keep. Consequently, topologies can be compared in terms of number of members, number of arcs and graph density.

The evaluation step is relevant for the algorithms that find key-members. The comparison between algorithms helps to understand their respective results. Two kinds of key-members were defined in the previous section, motivator and replier key-members, depending on whether a member is highly pointed to or points to many others, respectively. Therefore, the algorithm results for both types are different. In the case of motivators, In-degree and Authority will rank them, and for repliers, Out-degree and Hub will be used. Degree and HITS will also be compared to evaluate how similar their results are.

It is also important to have the administrators' point of view, because they know the community, its members and their contributions. The precision of the algorithms will be evaluated against the administrators' key-members, and this evaluation serves two purposes: on the one hand, to see how similar the algorithm results are to the administrators' selection and, on the other, to understand the administrators' criteria for considering some members as key-members.


Chapter 4

Application in a real VCoP

In this chapter, the application of the thesis to a real VCoP is presented, following the steps of the methodology explained in Section 3. First, the experimental VCoP¹ is described, followed by the basic database and how it is completed during the development of the steps. Then, the results of each step are presented and discussed, together with their preliminary improvements.

4.1 A real Virtual Community of Practice

Plexilandia² is a VCoP formed by a group of people who have come together around the building of music effects, amplifiers and audio equipment (in a "do it yourself" style). It was created with the purpose of sharing common experiences in the construction of plexies³. Today, Plexilandia has more than 2500 members after almost 8 years of existence. Throughout these years they have been sharing and discussing their knowledge about building their own plexies and effects. There are also other related topics, such as luthiery, professional audio and buying/selling parts.

Although they have a basic community information web page, most of the members' interactions take place on the discussion forum. Table 4.1 presents the activity in the different categories of the forum from the beginning of the community until 2010.

¹ Work developed by Felipe Aguilera and Sebastián Ríos.
² http://www.plexilandia.cl
³ "Plexi" is the nickname given to Marshall amp heads model 1959 that have the clear perspex (a.k.a. plexiglass) fascia to the control panel with a gold backing sheet showing through, as opposed to the metal plates of the later models.

In the beginning there was only one administrator. Today, due to the growth of the community, this task is performed by several administrators (in 2008 there were 5 administrators). In fact, the amount of information generated weekly makes it impossible to leave the administration task to just one person. Nowadays, an administrator has the following tasks:

• Re-classification of posts: members frequently post a message in the wrong forum category. For example, buy and sell advertisements should be placed in the "general forum", but newcomers place them in other sections. Therefore, administrators have to move the post into the right category.

• Member moderation: administrators must ensure that members use the forums to discuss topics related to the community and use appropriate language. This task is not as frequent as re-classification, because other active members help to detect these situations, facilitating the administrators' work.

• Participation: although the community's knowledge is distributed among all its members, some members have a greater degree of knowledge or expertise about some topics. For diverse reasons (being community founders, experts in an area, etc.), administrators are important knowledge generators. Therefore, administrators are active participants in most discussions: they motivate discussions, create new threads and create new categories.

Table 4.1: Plexilandia activity

Forum         2002  2003  2004   2005   2006   2007   2008  2009  2010  TOTAL
Amplifiers     392  2165  2884   3940   3444   3361   2398  1252   525  20362
Effects        184  1432  3362   3718   4268   5995   4738  2317   731  26745
Luthier         34   388   849   1373   1340   2140    926   699   452   8201
General         76   403   855   1200   2880   5472   3737  1655   666  16944
Pro Audio        -     -     -      -      -    342    624   396   132   1494
Synthesizers     -     -     -      -      -      -      -   104    65    169
TOTAL          686  4388  7950  10231  11932  17310  12423  6423  2571  73914

In its first six years of life, this community underwent a great sustained growth in members' contributions, reaching a peak in 2007. Since that year, community participation has decreased steadily, so the community owners want measures that help to enhance the community and improve members' participation.

The vision that administrators and experts have of the community is based mostly on their experience and the time they have spent participating in it. They also have some basic and global measures, for example, the total number of posts, connected members, etc. However, they do not have information about members' browsing behavior, the quality of members' content or how these members contribute to the community's purposes.

To characterize the community, Table 4.2 summarizes general measures obtained throughout the life of the forum.


Table 4.2: Plexilandia general measures

Measures        Total
Users            2857
Categories          6
Total Threads   10678
Total Posts     73914

4.2 SNA-KDD application

A database is involved in every step of the SNA-KDD process, and it is completed with relevant information for key-member discovery every time a step is fulfilled. Table 4.3 shows which table is updated in each step of the SNA-KDD process.

Table 4.3: Database Evolution through SNA-KDD development

Steps (columns): Data Selection, Text Processing, Network Configuration, SNA

users x
post x
concepts x
terms x
time x
topics x
words x
CB post score x
LDA post score x
graph x
orientation x
filter x
ranking x
scores x

In the Data Selection step, all information about members and their posts is inserted in tables users and posts. Also, as concepts are defined previously by administrators, tables concepts and terms contain all related information.

Then LDA is run, and the extracted topics, the words that compose these topics and their probabilities are inserted in tables topics and words. With both topics and concepts defined, the content reduction algorithm of the Text Processing step is applied, obtaining the content scores of the posts. Tables CB post score and LDA post score contain the scores for each post.

The Network Configuration step creates graphs with their different topologies, and metadata about them is inserted in tables graph, orientation and filter, whose combination describes a graph. After this, the SNA step is applied to the graphs, obtaining member ranks for each topology, which are inserted in table scores, while table ranking holds the information about the ranking algorithms used.

4.2.1 Data Selection

To obtain the data, it was first necessary to review the forum database. The required data was dispersed across different tables of the original database, so it was extracted to build a small, summarized database. The structure of the initial database is illustrated in Figure 4.1. The first tables hold the data of the users and the posts in the community. Then, two tables related to the concepts are included: one contains the concept metadata (concepts) and the other holds the terms that represent each concept, with their corresponding scores (terms).

Figure 4.1: Preliminary database.

4.2.2 Text Processing

Both text processing algorithms are executed as explained in Section 3.2. There are differences between the two approaches. In the Concept based case, the concepts are already detailed in the database, whereas for LDA an algorithm is first run over the community texts in order to obtain the topics. After that, table topics is built, which has the same structure as terms, that is, a table with the words that compose a topic and their respective probabilities (instead of the term score).


The next step of text processing is to obtain the scores of each post for both the LDA and CB algorithms with the fuzzy classification explained in Section 3.2.2. Once post scores have been calculated, some posts will have a score equal to zero for every concept or topic. These posts have to be deleted, because their comparison with any other post will be null. Ignoring these posts beforehand also saves execution time in the network configuration step. In summary, tables CB post score and LDA post score hold only posts with at least one topic or concept score different from zero.
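The removal of posts without any score can be sketched as follows. The example assumes that post scores are held in a dictionary that maps a post identifier to its list of topic or concept scores; the function name `drop_empty_posts` and the sample values are hypothetical.

```python
def drop_empty_posts(post_scores):
    """Keep only posts with at least one non-zero topic or concept score."""
    return {
        post_id: scores
        for post_id, scores in post_scores.items()
        if any(score != 0 for score in scores)
    }


# Hypothetical scores of three posts over three topics
scores = {
    101: [0.0, 0.4, 0.0],
    102: [0.0, 0.0, 0.0],  # no topic matched: the post is discarded
    103: [0.7, 0.1, 0.0],
}
kept = drop_empty_posts(scores)
print(sorted(kept))
```

Applying this filter before building the networks avoids comparing posts whose similarity with any other post would be zero.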

4.2.3 Network Configuration and key-member discovery

Several variables modify the configuration of a graph.

1. Time: One dimension not mentioned before is time. Depending on the time period to be analyzed, it is possible to have monthly, annual or historic networks. It is also possible to have graphs for other specific periods, such as half-years or whatever period the administrators want. However, building a network for a specific time period is not straightforward, because topic and concept scores differ depending on the posts that are considered, affecting measures like TF-IDF. This means that the post score tables must hold, for each post, as many scores as periods to be analyzed.

2. Graph Filtering: Besides the traditional non-filtered graph, the other two configurations correspond to graphs filtered by topics or by concepts, according to algorithm 1.

3. Interaction topology: According to the assumption about "to whom is the member replying?", three possible configurations appear: Creator reply (algorithm 2), Last reply (algorithm 3) and All previous reply (algorithm 4).

To configure the network and obtain the graph representation, all three variables have to be decided. In this thesis, monthly and annual networks have been built and, so that they can be compared, the threshold is set to 0.8 for both filters, as explained in Section 3.3.2. Then, for each interaction representation, the result is a graph with the members who posted in a specific period of time and have an interaction greater than or equal to the filter threshold. Graphs are saved and table graph is created, which holds the number of nodes, the number of arcs and the density of each graph. Tables orientation and filter are also created to describe these items, which define the graph topologies.
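The three interaction topologies can be illustrated with a short sketch. It is not a transcription of algorithms 2 to 4: it ignores content filtering, represents a thread only by the ordered list of its post authors, directs each arc from the replier to the member being replied to, and skips replies of a member to himself or herself. All of these choices, the function names and the sample thread are assumptions made for the example.

```python
def creator_reply(authors):
    """Every replier points to the thread creator."""
    creator = authors[0]
    return [(a, creator) for a in authors[1:] if a != creator]


def last_reply(authors):
    """Every replier points to the author of the immediately preceding post."""
    return [(cur, prev) for prev, cur in zip(authors, authors[1:]) if cur != prev]


def all_previous_reply(authors):
    """Every replier points to the authors of all earlier posts in the thread."""
    arcs = []
    for i, cur in enumerate(authors):
        for prev in authors[:i]:
            if prev != cur:
                arcs.append((cur, prev))
    return arcs


# Hypothetical thread: ana opens it, then ben, carla and ben again reply
thread = ["ana", "ben", "carla", "ben"]
print(creator_reply(thread))
print(last_reply(thread))
print(all_previous_reply(thread))
```

For the same thread, All previous reply produces the largest number of arcs, which is consistent with the higher density this orientation shows before filtering.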

Once the graphs are built, the next step is to find the key-members. HITS, in-degree, out-degree and centrality are applied to each temporal, filtered and orientation topology, obtaining a score for every node of the graph.

The last table, scores, is created here; it contains the member ranking score for each topology. Due to the combination of variables involved in a specific member score, this table contains a huge amount of data. For a single period of time, there are two ranking approaches (degree and HITS), and each of them has two measures (In-degree and Out-degree, Hub and Authority). These rankings are also applied to the three network topologies, each filtered by Concept based, filtered by LDA or not filtered. In other words, a member has 2 × 2 × 3 × 3 = 36 different ranking scores in a single period of time.
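The number of scores per member and period can be checked by enumerating the combinations of measure, orientation and filter. The labels below are taken from the text; the variable names are only illustrative.

```python
from itertools import product

measures = ["In-degree", "Out-degree", "Hub", "Authority"]
orientations = ["Creator reply", "Last reply", "All previous reply"]
filters = ["Non-filtered", "Concept based", "LDA"]

# One ranking score per (measure, orientation, filter) for a member and period
score_keys = list(product(measures, orientations, filters))
print(len(score_keys))
```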

With this structure, information can be presented by moving across different combinations of dimensions. When rank scores are sorted from high to low, the most valued members appear first. The potential motivator and replier key-members of a specific time period and topology are then the members with the highest rank scores.

4.3 Results and Discussion

As described in Section 1.2.3, one of the expected results is the key-member discovery database, which administrators can browse to see the key-members for different times, topologies or ranking algorithms. The database tables are shown in Figure 4.2. Appendix A.2 presents the ER database model.

Figure 4.2: Key-members discovery database

4.3.1 Topics obtained

The application of LDA to the text content resulted in 50 topics with 100 words each and their respective probabilities. Table 4.4 presents the first 20 words of Topics 6, 43 and 3 obtained by LDA.


Table 4.4: Topics obtained

Cable connections        Components buy/sell requests   Classic questions
Word        Probability  Word         Probability       Word          Probability
cable       0.114900     mand         0.071717          cos           0.042431
jack        0.047225     vend         0.040419          electronica   0.038946
pata        0.036512     precio       0.036566          hac           0.032695
entr        0.032616     mail         0.034919          com           0.032640
negar       0.023726     pag          0.031418          bas           0.031284
conectado   0.018432     cos          0.021299          aprend        0.029846
bateria     0.016160     pedir        0.019564          gust          0.021798
sal         0.015985     envio        0.018711          conocer       0.020249
conectar    0.014786     avis         0.016152          construir     0.016155
conecta     0.013912     traer        0.015593          empezar       0.015353
circuit     0.012564     interesa     0.015563          trabaj        0.015049
lado        0.011315     necesit      0.014299          proyect       0.013749
plug        0.011090     contact      0.014122          busc          0.011564
enchuf      0.010890     msn          0.011563          hay           0.010070
switch      0.009517     interesado   0.011034          harto         0.009656
conexion    0.007919     cotiz        0.004239          informacion   0.009628
extremo     0.007894     habl         0.004239          libro         0.009102
rojo        0.007195     import       0.004209          estudio       0.008964
out         0.006970     via          0.004121          diseo         0.008909
cen         0.006945     respond      0.004062          mat           0.007581

The results of LDA were presented to the administrators⁴, who provided a meaning for many of the topics obtained. Only 9 of the 50 topics were considered "unhelpful" for understanding the community, and Topic 34 was described as "many badly written words". Table 4.5 shows 7 of the 40 successful topics as a sample of the meanings found with LDA. The rest of the extracted topics, with their respective meanings, are presented in Appendix A.3. The words of each resulting topic were used to construct the graphs filtered by LDA.

Table 4.5: Topic meanings

Topic Id   Meaning
03         Classic questions of new members
06         Cable connections
13         Electronic supply stores
23         Images and videos of amplifier construction improvements
25         Places where to buy
35         Community coexistence norms
43         Components buy and sell requests

⁴ Work developed by Felipe Aguilera and Sebastián Ríos.


4.3.2 Resulted Networks

Monthly and annual graph representations of the community were built, but only the results for 2009 are presented in this section.

The graphs are compared according to their interaction representation. The results are arranged by filter technique (non-filtered, Concept Based and LDA), and circular visualization of the graph was chosen in order to show the changes in members' interaction when the different filters are applied.

Figure 4.3 illustrates the resulting graphs for the Creator reply network. The changes under Concept based and LDA filtering are very significant. This phenomenon is explained by the topology itself. A member creates a thread because he or she needs an answer to a specific problem. This problem is related to the community concepts to a certain degree, or has a certain probability of containing the topics. Therefore, only replies whose content agrees with the concepts or topics of the original post are considered.

Figure 4.4 shows the Last reply network. In this case, no significant change is observed. Threads contain so many posts that it is possible to see, within the same thread, discussions that are not aligned with the original post. As this network measures the interaction between consecutive posts, these posts can be strongly related to each other without being aligned with the original post. In conclusion, this configuration measures how similar consecutive posts are, without considering whether their content agrees with the original post.

Finally, Figure 4.5 presents the graphs oriented by All previous reply. In this case, when a member replies in a certain thread, the assumption is that he or she is replying to every member who posted earlier in the thread. The total number of arcs between nodes is greater than in the other configurations, but the filters remain effective, and a considerable decrease in network density can be observed.

Figure 4.3: Creator reply Networks

Figure 4.4: Last reply Networks

Figure 4.5: All previous reply Networks

Density is the ratio between the number of arcs that compose the graph and the number of all potential arcs that could be built in it. Equation 4.1 represents the density of a graph G(N,E). For the present analysis, Equation 4.1 is adapted so that each graph can be compared with the original graph. In other words, if a graph G(N,E) is filtered, the density of the filtered graph G_f(N_f, E_f) is calculated with Equation 4.2. Table 4.6 shows the results for each network configuration for the year 2009.

\[ d = \frac{|E|}{|N| \, (|N| - 1)} \tag{4.1} \]

The interaction representation affects the graph density. In fact, Figure 4.6 shows that, except for LDA filtered graphs, All previous reply is denser than the other configurations. This exception could be explained because LDA filters with more topics, so two non-consecutive posts tend to be less similar than consecutive ones, which are exactly what the Last reply orientation considers.

\[ d_f = \frac{|E_f|}{|N| \, (|N| - 1)} \tag{4.2} \]
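Equations 4.1 and 4.2 share the same denominator, so a single helper can compute both densities. The sketch below assumes directed graphs without self-loops; the function name `density` and the node and arc counts are hypothetical values chosen only for illustration.

```python
def density(num_arcs, num_nodes):
    """Arcs divided by the number of possible directed arcs, |N| (|N| - 1)."""
    if num_nodes < 2:
        return 0.0
    return num_arcs / (num_nodes * (num_nodes - 1))


# Hypothetical network: 100 members, 590 arcs before and 270 arcs after filtering
n = 100
d = density(590, n)    # Equation 4.1 on the original graph
d_f = density(270, n)  # Equation 4.2: filtered arcs over the ORIGINAL node count
print(round(d, 3), round(d_f, 3))
```

Keeping the node count of the original graph in the denominator is what makes the filtered density directly comparable with the non-filtered one.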


Table 4.6: 2009 Densities

               Counting   Concept Based   LDA
Creator        0.019      0.013           0.011
Last Reply     0.025      0.025           0.030
All Previous   0.059      0.027           0.019

Figure 4.6: 2009 densities comparison

The visualizations in Figures 4.3, 4.4 and 4.5 give a preliminary idea of the graph density. The filter techniques effectively reduce the interaction between members, keeping only the relations that are meaningful for the community. Figure 4.7 compares the filtered graphs with the traditional approach for each reply topology, to evaluate how much the graphs are reduced after applying each filtering algorithm. The reduction is strongest in the All previous approach, where density dropped from 0.059 to 0.027 with Concept based and to 0.019 with LDA (Table 4.6). Whether Concept based or LDA yields the lower density for a given representation is not relevant; the main result is that filtered graphs have a lower density than the original graph.

After this general visualization, it is important to evaluate whether the same behaviour appears in each network configuration. Monthly graph densities throughout 2009 were used to compare the behaviour of the filters. Figure 4.8 displays the density evolution for Creator reply graphs. It shows months of high activity that coincide with the summer vacations. Most interestingly, this behaviour is common to the three configurations. The density reduction achieved is considerable: both filters present a reduction from 2% to 0.05% or less. In this case, the LDA filter is stronger than Concept Based.

Figure 4.9 illustrates the density evolution for Last reply graphs. In this case it is interesting to see how both Concept based and LDA preserve the behaviour of the original graph, meaning that, during the year, direct replies and the concepts or topics treated are very similar. The LDA filter is again stronger than Concept based. In addition, both filters achieve their purpose of reducing the interactions.


Figure 4.7: Density reduction

Finally, Figure 4.10 presents the evolution of the density throughout 2009 for All previous reply graphs. Here, the density reduction is clearly noticeable. Both filters eliminate most of the community interaction, which, combined with the previous figures, means that replies are oriented to the original post or to the previous reply rather than to all the posts of the thread.

Figure 4.8: 2009 Creator Oriented Density

Figure 4.9: 2009 Reply Oriented Density

Figure 4.10: 2009 All Previous Oriented Density

It is important to highlight that the application of content filter algorithms yields more than a reduction of the density. As described in Section 1.2.3, community behaviour is very similar among all filters, peaks are even reached in the same periods and, in the case of the Last reply orientation, the behaviour of the Concept Based network is the same as that of the original network but with a lower density. It is also possible to conclude that, at the same threshold θ for content filtering, Concept Based is better for Reply oriented networks, while LDA is advisable for Creator oriented networks in terms of density reduction. For All Previous oriented networks it is not clear which filter is better, because both perform well in magnitude.

4.3.3 Key-members Discovered

Once the graphs are obtained, the proposed algorithms and traditional SNA techniques are applied to discover key-members. The HITS [29] and degree [57] algorithms were chosen. The reason for using these algorithms is that both reflect in their results the two approaches that define a key-member. While Authority and In-degree consider the replies received by a member, Hub and Out-degree use the replies generated by a member, meaning that key-members will be people who motivate participation and people whose replies agree with the community's purposes, respectively.


As a sample of the key-members discovered, Tables 4.7 and 4.8 show the Top 10 motivator and replier key-members for the Creator Oriented graph, respectively. Members' nicknames were replaced by their user IDs. Tables for the Reply and All Previous orientations are shown in Appendix A.1.

Table 4.7: Creator Oriented Motivators Key-members

In-Degree                                  Authority
Counting     Concept Based  LDA            Counting     Concept Based  LDA
User100314   User101696     User100314     User101651   User101696     User100314
User101651   User677        User101452     User100314   User677        User101452
User101696   User101651     User677        User101696   User101651     User101730
User100039   User100439     User101730     User101452   User100439     User677
User101452   User100314     User101112     User100039   User100314     User101378
User677      User101378     User101378     User101376   User101378     User101112
User101376   User100010     User101327     User677      User100010     User101696
User36       User100074     User793        User36       User100074     User101651
User100505   User100992     User101696     User100505   User100992     User101327
User100074   User101376     User101651     User100074   User101376     User793

Table 4.8: Creator Oriented Repliers Key-members

Out-Degree                                 Hub
Counting     Concept Based  LDA            Counting     Concept Based  LDA
User101696   User100010     User101696     User101696   User101696     User101378
User101697   User101696     User100439     User101696   User677        User100314
User100010   User101697     User101697     User101651   User101651     User101730
User1        User1          User101651     User100439   User100439     User677
User100439   User101651     User100314     User100314   User100314     User101696
User161      User100439     User36         User101378   User101378     User101452
User101697   User161        User100010     User100992   User100010     User101651
User36       User100314     User32         User100074   User100074     User7
User101376   User32         User1          User100010   User100992     User100010
User32       User677        User677        User1        User101376     User677

Now that key-members are identified for each network, the precision of the detection has to be evaluated. For this evaluation, administrators were asked who the key-members are from their perspective. They not only provided a list of key-members but also classified them from A to C according to their relevance for the community, A being the most important key-members and C the least. Table 4.9 shows the administrators' key-members. It is important to emphasize that administrators consider key-members of the same category to be equally relevant, so the order within a category does not represent any internal relevance.

Table 4.9: Administrators' Key-members

User           Classification
User1          A
User100885     A
User3          A
User85         A
User36         B
User100439     B
User100074     B
User677        B
User100505     B
User10145274   B
User101792     B
User161        B
User671        C
User759        C
User32         C
User7          C
User127        C
User733        C
User136        C
User28         C
User658        C
User22         C

Comparing the key-members automatically detected with the administrators' key-members has two objectives: first, to evaluate the precision of the algorithms from the administrators' point of view and, second, to understand which aspects administrators consider when they classify users as key-members. The precision is measured with the classical Information Retrieval ratio. From the set of K key-members, let tp be the number of administrators' key-members that an algorithm detected. Then, precision p is calculated with Equation 4.3.

\[ p = \frac{tp}{|K|} \tag{4.3} \]
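Equation 4.3 can be illustrated with a short sketch. It assumes that K is the set of key-members named by the administrators, a reading that is consistent with the 25% change per member reported for the four Type A key-members; the function name `precision` and the user identifiers are hypothetical and do not reproduce the thesis data.

```python
def precision(detected, admin_key_members):
    """Equation 4.3: share of the administrators' key-members that were detected."""
    key_members = set(admin_key_members)
    tp = len(key_members & set(detected))
    return tp / len(key_members)


# Hypothetical example with four administrator key-members
admin = {"u1", "u2", "u3", "u4"}
detected = ["u2", "u9", "u4", "u7"]
p = precision(detected, admin)
print(p)
```

With four members in K, finding one more or one fewer of them changes the precision by 0.25.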

As key-members are classified in three categories, four types of precision were calculated. Table 4.10 presents the results of the global key-member precision (how many of the key-members detected really are key-members) for each combination of filter and orientation. The results are promising: except for the LDA - Creator combination, the precision of the algorithms is at least 50% for both motivators and repliers, which means that at least half of the key-members detected are also considered key-members by the administrators. The All Previous orientation presents the best results, which is explained because motivator key-members ask questions that are relevant for the community, and the effect of a good question is measured by how many people participate in the discussion. From the point of view of the replies, the All Previous orientation also presents good results. As experts, replier key-members give highly valued answers, so a reply to all previous replies will have a high value.


Table 4.10: Global Key-members precision

                      Creator   Reply   Previous
Counting In-Degree    54.5%     63.6%   68.2%
Counting Authority    54.5%     63.6%   68.2%
CB In-Degree          59.1%     63.6%   68.2%
CB Authority          59.1%     59.1%   68.2%
LDA In-Degree         40.9%     63.6%   63.6%
LDA Authority         40.9%     63.6%   63.6%
Counting Out-Degree   72.7%     59.1%   68.2%
Counting Hub          54.5%     63.6%   63.6%
CB Out-Degree         68.2%     63.6%   63.6%
CB Hub                50.0%     59.1%   68.2%
LDA Out-Degree        59.1%     59.1%   63.6%
LDA Hub               50.0%     63.6%   63.6%

The difference between Degree and HITS is not noticeable in the case of motivators, where both have the same precision in all but one configuration (CB filter, Reply orientation). Repliers, on the contrary, present a less clear result: Out-Degree and Hub do not always have the same precision for the same filter-orientation combination. Despite these results, the variation is not significant and could be explained by the simple exclusion of a member in one of the two algorithms.

The following precision measures are related to the type of key-member found. As administrators have their own key-member classification, it was necessary to use it for the detected key-members, assuming that the rank algorithms behave similarly in that sense. The results for the precision of Type A detection are presented in Table 4.11. Because there are only four Type A key-members, a variation of one member implies an increase or decrease of 25 percentage points in the precision of an algorithm, so a single increment in precision is not that significant. In general, the detection of replier key-members has better results than the detection of motivators. This helps to understand what kind of key-member Type A is: a member is classified as "Type A" because their contribution is very helpful for the community and, from the administrators' point of view, this contribution consists of good answers in community discussions. The Creator Out-Degree combination has the best precision because it counts how many good answers are posted by a member.

Either way, in every configuration except LDA - Creator for motivators, the algorithms are capable of finding at least one of the Type A members, which is a good result considering how few of them there are.

The precision of Type B key-member detection is shown in Table 4.12. The precision is more homogeneous than for Type A, probably because it is easier to find a key-member in a bigger set of candidates, since the same member does not need to be found by every algorithm. The results are similar for Degree and HITS, and present high values for each combination. The filters only have an effect in the Creator orientation, meaning that member contributions are usually not related to the original post. In fact, the similarity in precision could be interpreted as Type B being both a motivator and a replier key-member, almost like a broker who on occasions gives good answers and on others contributes by asking questions that continue the discussion or by bringing in Type A members who answer the original post.

Table 4.11: Type A Key-members precision
Filter    Algorithm    Creator  Reply   Previous
Counting  In-Degree    25.0%    50.0%   50.0%
Counting  Authority    25.0%    50.0%   50.0%
CB        In-Degree    50.0%    25.0%   50.0%
CB        Authority    50.0%    25.0%   50.0%
LDA       In-Degree     0.0%    50.0%   25.0%
LDA       Authority     0.0%    50.0%   25.0%
Counting  Out-Degree   75.0%    25.0%   50.0%
Counting  Hub          25.0%    50.0%   50.0%
CB        Out-Degree   75.0%    50.0%   50.0%
CB        Hub          50.0%    25.0%   50.0%
LDA       Out-Degree   50.0%    25.0%   25.0%
LDA       Hub          25.0%    50.0%   25.0%

Table 4.12: Type B Key-members precision
Filter    Algorithm    Creator  Reply   Previous
Counting  In-Degree    75.0%    75.0%   75.0%
Counting  Authority    75.0%    75.0%   75.0%
CB        In-Degree    62.5%    75.0%   75.0%
CB        Authority    62.5%    75.0%   75.0%
LDA       In-Degree    37.5%    75.0%   75.0%
LDA       Authority    37.5%    75.0%   75.0%
Counting  Out-Degree   75.0%    75.0%   75.0%
Counting  Hub          75.0%    75.0%   75.0%
CB        Out-Degree   75.0%    75.0%   75.0%
CB        Hub          50.0%    75.0%   75.0%
LDA       Out-Degree   50.0%    75.0%   75.0%
LDA       Hub          50.0%    75.0%   75.0%

The results for Type C key-members do not present an easy-to-understand pattern. Table 4.13 presents the results, whose precision is lower than that shown previously. A first interpretation is that the algorithms are not capable of finding members of this type, but the results presented before demonstrate that they can, so a second approach is to interpret what type of key-member the administrators are defining. As the analysis has been made from an administrator's point of view, an explanation could be that key-members of this type were not active during the period studied: either they are historic key-members who participated in the community some time ago and whose participation is now moderate, or they simply did not participate much in the period analyzed. For that reason administrators recognize them as key-members, but the algorithms do not.


Besides, the behaviour across algorithms, in terms of precision range, is similar. Comparing Degree and HITS, no meaningful difference appears in their precision. The All Previous orientation, however, presents the best results for this type of key-member, finding in general 30% of them. As in the previous analysis, the reason is the larger number of interactions that this configuration presents, which gives the opportunity to consider the same contribution multiple times.

Table 4.13: Type C Key-members precision
Filter    Algorithm    Creator  Reply   Previous
Counting  In-Degree    10.0%    20.0%   30.0%
Counting  Authority    10.0%    20.0%   30.0%
CB        In-Degree    20.0%    30.0%   30.0%
CB        Authority    20.0%    20.0%   30.0%
LDA       In-Degree    20.0%    20.0%   30.0%
LDA       Authority    20.0%    20.0%   30.0%
Counting  Out-Degree   30.0%    20.0%   30.0%
Counting  Hub          10.0%    20.0%   20.0%
CB        Out-Degree   20.0%    20.0%   20.0%
CB        Hub          10.0%    20.0%   30.0%
LDA       Out-Degree   30.0%    20.0%   30.0%
LDA       Hub          20.0%    20.0%   30.0%

The previous analysis considers only the key-members that administrators could remember. After the algorithms were applied, the results were presented to the administrators, who were asked to recognize key-members among the members returned by the algorithms, in order to discover those who had been ignored in their previous review. The improvements are reflected not only in quantity, but also in algorithm precision. Table 4.14 presents a comparison between the number of administrator key-members before and after the review of the algorithm results. The increase in Global key-members produced by the administrators' review is notable: from the previous key-members, eight of Type B and C were re-classified as Type A, and nine of Type C were re-classified as Type B. Also, twelve Type A, nine Type B and seven Type C members were discovered by the algorithms and added to the set of administrator key-members.

Table 4.14: Administrator Key-members enhancements
          Admin  Algorithm
Type A    4      26
Type B    8      16
Type C    10     7
Global    22     49

Considering this new set of key-members, precision was calculated again. Table 4.15 shows the improvements in the Global precision for each graph configuration. Content filters do not have a better precision than the SNA approach; they find a similar proportion of key-members. However, that does not mean that the algorithms find the same key-members. Two factors affect this result: first, administrators collected the new key-members by reviewing all algorithm results together, and second, precision was calculated by considering how many of the 49 administrator key-members are in the top-25 rank of each configuration. Both factors help to understand the results presented in Table 4.15 and to realize that these numbers can be similar without representing the same members.
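The second factor can be illustrated with a small sketch of a top-k precision, in which the denominator is the size of the ranked list rather than the size of the administrators' set. The function name `precision_at_k` and the toy data are hypothetical; the only detail taken from the text is the cut-off of 25. Reading the ratio this way is consistent with Table 4.15, whose values are all multiples of 4%.

```python
def precision_at_k(ranking, key_members, k=25):
    """Share of the top-k ranked members that belong to key_members.

    Assumption: the ranking is ordered from most to least relevant.
    """
    top_k = ranking[:k]
    if not top_k:
        raise ValueError("the ranking is empty")
    found = sum(1 for member in top_k if member in key_members)
    return found / len(top_k)


# Hypothetical toy data: two rankings with the same precision
# but different key-members in their top positions.
key_members = {"u1", "u2", "u3", "u4", "u5", "u6"}
ranking_a = ["u1", "u2", "u9", "u3", "u8"]
ranking_b = ["u4", "u7", "u5", "u6", "u8"]

p_a = precision_at_k(ranking_a, key_members, k=4)  # u1, u2, u3 found
p_b = precision_at_k(ranking_b, key_members, k=4)  # u4, u5, u6 found
```

Both rankings reach the same value although they share no key-member in the top four, which is the situation the paragraph warns about.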

Table 4.15: Global Key-members precision enhancement
Filter    Algorithm    Creator  Reply  Previous
Counting  In-Degree    84%      84%    84%
Counting  Authority    100%     80%    88%
CB        In-Degree    100%     80%    88%
CB        Authority    84%      84%    84%
LDA       In-Degree    48%      80%    72%
LDA       Authority    48%      80%    72%
Counting  Out-Degree   84%      80%    76%
Counting  Hub          84%      80%    84%
CB        Out-Degree   84%      84%    80%
CB        Hub          84%      84%    84%
LDA       Out-Degree   72%      84%    84%
LDA       Hub          68%      80%    80%

Apparently, it is not possible to establish that Content filtering improves the detection of key-members with this approach; at least, its performance is similar to traditional SNA. The value of the threshold θ has to be reviewed, as well as the application of Content filtering to find sub-networks that cannot be found by traditional SNA.

4.3.4 Key-members detection algorithm comparison

Section 4.3.3 shows that the results obtained by Degree and HITS are similar in terms of precision. It is then natural to think that it does not matter which algorithm is chosen to detect key-members, because both give similar results. But even though both algorithms work with node degrees, the difference lies in how they use them. While Degree only counts the number of arcs and where they point, HITS uses the arcs in an iterative process whose output is a normalized rank score for the members.
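The contrast between counting arcs and iterating over them can be sketched as follows. This is a minimal illustration and not the implementation used in the thesis: arcs are assumed to go from the replying member to the member being replied to, HITS is written as a plain power iteration with Euclidean normalization and a fixed number of iterations, and all names and data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def degree_scores(arcs):
    """Count incoming and outgoing arcs per member."""
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for src, dst in set(arcs):
        out_degree[src] += 1
        in_degree[dst] += 1
    return dict(in_degree), dict(out_degree)


def _normalize(scores):
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(arcs, iterations=50):
    """Return (authority, hub) scores after a fixed number of iterations."""
    arcs = set(arcs)
    nodes = {node for arc in arcs for node in arc}
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the members pointing at a node.
        authority = {node: 0.0 for node in nodes}
        for src, dst in arcs:
            authority[dst] += hub[src]
        authority = _normalize(authority)
        # Hub: sum of the authority scores of the members a node points at.
        hub = {node: 0.0 for node in nodes}
        for src, dst in arcs:
            hub[src] += authority[dst]
        hub = _normalize(hub)
    return authority, hub


# Hypothetical toy network: (replier, replied-to member).
arcs = [("b", "a"), ("c", "a"), ("d", "a"), ("d", "b")]
in_degree, out_degree = degree_scores(arcs)
authority, hub = hits(arcs)
```

In this toy network both approaches agree on the top motivator (`a`) and the top replier (`d`); on larger networks the iterative scores can reorder members that have the same degree.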

In this section, the administrators' key-members are not necessary, because it is the behaviour of the algorithms that is measured. Also, even though the order in which key-members appear was not relevant so far, it is considered here, because both algorithms arrange members according to the relevance calculated from their degrees.

To compare both algorithms, for each reply topology and content filter, In-Degree is compared with Authority for motivator key-members, and Out-Degree with Hub for replier key-members. The Kendall tau rank correlation coefficient is used for the analysis. This statistic measures how similar two rankings are according to how the elements are arranged in them. Let X and Y be two comparable rankings of n elements, and let (x1, y1), (x2, y2), ..., (xn, yn) be the pairs of positions that each element obtains in the two rankings. A pair of observations (xi, yi) and (xj, yj) is concordant if the ranks of both elements agree, i.e. xi > xj and yi > yj, or xi < xj and yi < yj. Otherwise, the pair is discordant. The Kendall tau correlation coefficient is defined by equation 4.4.

$$\tau = \frac{(\text{\# of concordant pairs}) - (\text{\# of discordant pairs})}{\frac{1}{2}\,n(n-1)} \qquad (4.4)$$

This coefficient lies in the range [−1, 1], where 1 means that the rankings are identical and −1 that one ranking is the inverse of the other. In addition to this coefficient, a significance test was performed in which the null hypothesis is that the rankings are independent. The expected result is that Degree and HITS are similar for each combination of topology, filter and kind of key-member, meaning that the coefficient will be close to 1 and the null hypothesis of ranking independence will be rejected.
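Equation 4.4 can be computed directly by counting pairs. The sketch below is a straightforward illustration with a hypothetical function name; it assumes rankings without ties (tied pairs would count as neither concordant nor discordant) and does not include the significance test mentioned above.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau of equation 4.4 for two rankings of the same elements.

    x[i] and y[i] are the positions of element i in each ranking.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1   # the rankings disagree on the pair
    n = len(x)
    return (concordant - discordant) / (0.5 * n * (n - 1))


identical = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
reversed_ranks = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])  # 5 concordant, 1 discordant
```

The pairwise loop is quadratic in the number of members, which is acceptable for a sketch; a statistics library would normally be used on full rankings.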

Table 4.16: Kendall τ coefficient for rank algorithm
           Motivators Key-members               Repliers Key-members
           Counting  Concept Based  LDA         Counting  Concept Based  LDA
Creator    0.6179    0.8884         0.8830      0.14450   0.6358         0.7558
Reply      0.7374    0.9222         0.8347      0.4898    0.4159         0.4725
Previous   0.6036    0.8622         0.8258      0.4272    0.3125         0.5008

Table 4.16 shows the Kendall τ coefficient between Degree and HITS for both motivator and replier key-members. The results are significant: the null hypothesis was rejected for every combination, which means that the two rankings are dependent. This first conclusion is logical, because both Degree and HITS use the same network. The coefficients show that both algorithms arrange members very similarly in the case of motivator key-members. For replier key-members, the algorithms behave in the same way in the filtered Creator networks, but in the Reply and All Previous orientations the correlation is not that clear.

Graphically, the results are more evident. As a sample, scatter plots of the ranks given by Degree and HITS were made. Figure 4.11 shows, for each filter, the scatter plot of In-Degree against Authority for Creator-oriented networks, from a motivator key-member point of view. It is clear that there is a correlation between both ranks, which confirms the results given by Kendall τ. The more elements lie on the diagonal, the more similar the rankings are.

Figure 4.12 shows the scatter plots for replier key-members in the Creator reply network. In this case, the Counting graph does not present a specific order, unlike the filtered graphs, which show great similarity. This figure helps to emphasize a main purpose of this work. Filtered networks ignore interactions that do not match the community concepts or topics. In this case, by deleting these interactions, an order of relevance of members appears that was not clear in the traditional network. For the other cases, where the correlation is not high, this only means that the algorithms are not so similar; it does not make it possible to judge which one is better.


Figure 4.11: Motivators Creator reply rank Scatter plot

Figure 4.12: Repliers Creator oriented rank Scatter plot

4.3.5 Filter algorithm

Section 4.3.3 presents the results obtained by comparing with the administrators' key-members. But on occasions, key-members are unknown even to administrators. In this case, both traditional SNA and the proposed SNA and Text Mining approach are helpful for discovering key-members.

Compared with the traditional approach, filters only consider valuable contributions, so any information lost comes from posts that do not contribute to the community, such as trolling, flooding, spam or other malicious posts. The resulting key-members therefore have to be different, because fewer relations are considered. However, key-members are expected to appear whether the network is filtered or not, because participation is a factor in the algorithms. So, as mentioned in Section 1.2.3, the other expected result is that the ranking of key-members changes in favor of the real key-members.

To evaluate the changes caused by network filtering, Kendall τ was used to obtain the correlation between filtered and traditional graphs. Table 4.17 contains the correlations between the filtered graphs and the traditional approach (Counting graph) for motivator and replier key-members. The correlations are low, which means that members are rearranged according to the content they generate. Besides, the correlation indicates that there is a relation between both graphs, which is true, because the filtered graphs derive from the Counting graph, and traditional SNA has over the years provided a preliminary solution to the key-member discovery problem.

Table 4.17: Kendall τ coefficient for filter algorithm compared with Counting graph
                Creator                 Reply                   Previous
                In-Degree   Authority   In-Degree   Authority   In-Degree   Authority
Concept-Based   0.2755      0.2642      0.2915      0.3389      0.4052      0.4383
LDA             0.2291      0.1932      0.4373      0.4127      0.3312      0.3820
                Out-Degree  Hub         Out-Degree  Hub         Out-Degree  Hub
Concept-Based   0.2832      0.2184      0.3012      0.3392      0.2842      0.4051
LDA             0.2593      0.1726      0.3409      0.4278      0.2271      0.2961

Summarizing, there is an improvement when text content filtering is applied to community relationships: besides discovering the members who ask more questions or are replied to more often, it captures the content that these members produce in the community, presenting key-members not only in terms of participation, but also in terms of contribution to the community.


Chapter 5

Conclusions and Future Work

The key-member discovery problem in VCoPs is commonly solved with SNA techniques. The present work suggests that applying only this approach leads to a wrong judgment, because the content generated in the community is not considered.

For that reason, the main objective of this thesis was to incorporate community content through Web Text Mining techniques, such as Concept Based and LDA, in order to improve the discovery of key-members. As presented in section 4.3, the main objective was accomplished by following the specific objectives stated in section 1.2.2. Each objective was fulfilled, and its contribution to this thesis is presented in the following list.

1. In Section 3.3, the elements that compose the community's network were determined. Also, three topologies were designed according to how and to whom a member replies in the community.

2. Concept Based (section 3.2.2) and LDA (section 3.2.3) algorithms were implemented in order to measure the content generated by members. With different approaches, both algorithms give a score to each post of the community, related to the concepts that administrators defined or to the topics extracted from the community.

3. An algorithm that includes community content was developed and explained in section 3.3.1. Three graphs resulted: the traditional graph and the filtered graphs built with Concept Based and LDA.

4. As presented in section 4, SNA was applied as detailed in section 3.4. Degree and HITS were applied, resulting in two kinds of key-members: motivator and replier key-members. The two kinds were analyzed separately.

5. In section 4.3, the results are shown, together with the improvements generated by using content to discover key-members. These improvements are detailed in the following sections.


5.1 SNA-KDD methodology

In order to have a scheme that explains the novel approach proposed in this thesis, KDD was adapted into the SNA-KDD methodology, which includes an SNA step, used in this thesis for key-member discovery, as explained in section 3. This approach will be helpful as a basis for later work.

Also, the network configuration was established. The three topologies, based on assumptions about how a user replies in a thread, are complementary, depending on what administrators need. The Creator reply network is helpful for finding key-members who motivate participation. The Last reply network yields the members whose replies match the immediately preceding post, measuring their capacity for direct interaction. Finally, the All Previous reply network helps to find members whose replies match the community purposes, because each reply is considered a global reply to the thread.
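The three topologies can be sketched as three ways of turning one thread into arcs. The code below is an illustration under assumptions that are not stated in this section: a thread is given as the list of post authors in posting order, arcs go from the replying member to the member being replied to, self-replies are ignored, and repeated interactions between the same pair collapse into one arc. The function and member names are hypothetical.

```python
def build_arcs(thread, topology):
    """Return the set of (replier, replied-to) arcs for one thread.

    thread: authors in posting order; thread[0] is the thread creator.
    topology: "creator", "last_reply" or "all_previous".
    """
    arcs = set()
    for position in range(1, len(thread)):
        replier = thread[position]
        if topology == "creator":
            targets = [thread[0]]             # every reply answers the creator
        elif topology == "last_reply":
            targets = [thread[position - 1]]  # reply answers the previous post
        elif topology == "all_previous":
            targets = thread[:position]       # reply answers all earlier posts
        else:
            raise ValueError(f"unknown topology: {topology}")
        for target in targets:
            if target != replier:             # assumption: no self-loops
                arcs.add((replier, target))
    return arcs


# Hypothetical thread: ana opens it, ben and carla reply, ben replies again.
thread = ["ana", "ben", "carla", "ben"]
creator_arcs = build_arcs(thread, "creator")
last_reply_arcs = build_arcs(thread, "last_reply")
all_previous_arcs = build_arcs(thread, "all_previous")
```

The same thread yields two, three and four arcs respectively, which shows why the All Previous network carries the largest number of interactions.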

As the idea was to present a work structure for SNA that includes community content, the idea for future work is to add other features that SNA can provide, such as finding sub-communities or brokers, as discussed in 2.1, among others treated in [57].

5.2 Content Filtering

Concept Based and LDA were successfully implemented as content filters¹. In the case of LDA, the topics extracted were recognized by administrators as relevant topics for the community. Then, the filters were applied to the community, obtaining a reduced graph that keeps only the interactions related to the community purposes or to the topics extracted, as explained in section 3.3.1.

When the filters are used, the graph contains only highly valued interactions. This result is a great improvement over the SNA state of the art, because it proves that it is possible to obtain a graph representation that includes not only the interactions, but also the quality of the community discussions, improving the later application of any SNA technique.

The interaction threshold used in this work was 0.8, meaning that the community was heavily filtered. The filtering is more noticeable with LDA than with Concept Based, with LDA having the lowest density in the majority of cases. Which threshold is better for each filter was not discussed in this work, because the idea was to evaluate the quality of the filter. In future work, the value of the threshold has to be studied in order to obtain the best filtering of the community.
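The effect of the threshold on density can be sketched as follows. Only the value 0.8 comes from the text; everything else is an assumption made for illustration: each interaction is taken to carry a content score between 0 and 1, an interaction is kept when its score is greater than or equal to the threshold, and density is the usual ratio of arcs to possible arcs in a directed graph without self-loops.

```python
def filter_arcs(scored_interactions, theta=0.8):
    """Keep the arcs whose content score reaches the threshold theta."""
    return {(src, dst) for src, dst, score in scored_interactions
            if score >= theta}


def density(arcs, n_members):
    """Arcs divided by the n(n-1) possible arcs of a directed graph."""
    if n_members < 2:
        return 0.0
    return len(arcs) / (n_members * (n_members - 1))


# Hypothetical scored interactions: (replier, replied-to member, score).
interactions = [
    ("ben", "ana", 0.95),
    ("carla", "ana", 0.40),
    ("carla", "ben", 0.85),
    ("dan", "ana", 0.10),
    ("dan", "carla", 0.79),
]

counting_arcs = filter_arcs(interactions, theta=0.0)  # traditional graph
filtered_arcs = filter_arcs(interactions, theta=0.8)  # filtered graph

counting_density = density(counting_arcs, n_members=4)
filtered_density = density(filtered_arcs, n_members=4)
```

Repeating the last lines with other values of `theta` gives a simple way to observe how quickly the graph thins out as the threshold grows.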

¹ LDA was implemented by Gaston L'Huillier.


5.3 Density reduction

With the implementation of content filters, network density was reduced. In section 4.3.2, annual networks were shown, demonstrating that the novel approach filters the content and reduces the number of interactions. On the other hand, the Circular visualization illustrates how much the interactions decrease, which means that other types of visualization could be used to help understand the community behaviour graphically. Figure 5.1 displays the 2009 Creator orientation networks with the Kamada-Kawai visualization.

Figure 5.1: Kamada-Kawai Visualization for 2009 Creator oriented networks

Visualization would help administrators to reinforce the results obtained by SNA, and also to verify their own judgment when classifying users as key-members. In addition, this presentation could serve as a preliminary analysis of the network, helping for example to discover sub-communities by changing the color of each node depending on the sub-community it belongs to. Summarizing, visualization aims to bring a view of the network that is impossible to obtain without applying filters, because of the huge number of interactions between members.

5.4 Key-members discovered

Traditional SNA finds key-members in terms of their participation in the community. Two definitions of key-members were presented, according to the relations that arcs represent. The first type, called motivator key-members, corresponds to members whose posts are replied to by others, motivating and generating participation. The second type, called replier key-members, are members who reply more than others.

Degree and HITS were used to find them, with In-Degree and Authority as the ranks for motivators, and Out-Degree and Hub as the ranks for repliers. The first conclusion is that both algorithms find key-members with the same precision, but this does not mean that they are interchangeable, because the key-members found by each of them are not the same.


As shown in section 4.3.5, no loss of information was found when filters were applied, only a rearrangement of the ranking, meaning that content-filtered networks can be used to discover key-members. The precision decreases a little, but rather than a decrease in detection quality, this reflects a new way of presenting key-members. The reason is that this kind of graph considers only participation that contributes to the community, so the key-members found correspond to members whose posts match the community purposes or the topics treated.

Finding this new kind of key-member helps administrators to improve their point of view, by differentiating two definitions of key-members and by presenting to them members who are not considered key-members but in fact are, enhancing the list of the community's key-members. With the filtered graph, motivator key-members correspond to members whose posts fulfill community concepts or topics and motivate the participation of other members in terms of post content. On the other hand, replier key-members are members whose replies match both the post being replied to and the content desired by the community.

The results were shown to administrators in order to obtain their evaluation. The outcome was an increase in the number of administrator key-members, which implies a better precision for each algorithm and graph configuration. It is possible to conclude that both the SNA and the Content Filtering approaches present similar results when finding key-members, at least in terms of proportion. Therefore, it is important to find out how Content Filtering could represent a real improvement for the key-member discovery problem.

5.5 Future Work

The present work is a starting point for different work areas, because it considers many different approaches. In this section, guidelines for future work are presented, focused on the content of the community, new representations of a network and the development of SNA applications.

5.5.1 Content contribution

At this moment, community content is used only as an interaction filter. If a reply does not match the post it replies to, the interaction is deleted. Moreover, if an interaction is valid and another valid interaction appears between the same members, it does not count, because all that matters is whether an interaction exists between two members.

The next step for this approach is to consider the contribution of the contents generated by a member. Incorporating an interaction weight could improve key-member detection even more, because it would consider not only the members who participate and contribute, but also the quality of all their contributions.


Later research has to discuss how to incorporate the quality of contributions, how to measure multiple contributions, and which rank algorithms to use. One possibility is a weighted-HITS approach, but the value and range of the arc weights have to be discussed first.
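One way such a weighted variant could look is sketched below. It is purely hypothetical and is not a method proposed or evaluated in the thesis: the weight of an arc is taken to be the number of valid interactions between two members, which is only one of the options left open above, and the update rule is the standard HITS iteration with each contribution multiplied by the arc weight.

```python
from collections import Counter
from math import sqrt


def _normalize(scores):
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def weighted_hits(weights, iterations=50):
    """HITS where each arc contributes in proportion to its weight.

    weights: dict mapping (replier, replied-to member) to a positive weight.
    """
    nodes = {node for arc in weights for node in arc}
    hub = dict.fromkeys(nodes, 1.0)
    authority = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        authority = dict.fromkeys(nodes, 0.0)
        for (src, dst), weight in weights.items():
            authority[dst] += weight * hub[src]
        authority = _normalize(authority)
        hub = dict.fromkeys(nodes, 0.0)
        for (src, dst), weight in weights.items():
            hub[src] += weight * authority[dst]
        hub = _normalize(hub)
    return authority, hub


# Hypothetical valid interactions: b replied to a three times, c to d once.
valid_interactions = [("b", "a"), ("b", "a"), ("b", "a"), ("c", "d")]
weights = Counter(valid_interactions)  # assumption: weight = interaction count
authority, hub = weighted_hits(weights)
```

In the unweighted graph `a` and `d` each receive a single arc and cannot be told apart; with counts as weights, `a` ranks above `d`. Whether counts, content scores or some bounded combination make the best weights is exactly the open question stated above.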

Also, the value of the threshold θ has to be evaluated. In this thesis, θ has the value 0.8, representing strong filtering. In future work, different values of θ have to be applied to filter content, and the quality of the resulting graphs evaluated.

5.5.2 Concepts approach

Concept Based is used to extract the value of a post according to previously established concepts, defined by administrators. On occasions, administrators may not include all the concepts that are treated in the community. The community also evolves through time, changing its concepts too, and administrators may not capture these changes.

LDA extracts the topics of the community, so it is possible to obtain the actual topics treated by members. The work of administrators is then to recognize what these topics mean. In the future, how to use automatic topic extraction systems has to be evaluated, and also how to analyze topic evolution through time in a VCoP.

Also, the concepts approach has more applications than key-member discovery. If administrators define concepts of bad behaviour, such as "trolling" or "spam", this could help to moderate the community by catching the key-members of these concepts. It could also be helpful for studying the health of the community in terms of how many junk posts it contains, and for analyzing the evolution of these concepts through time.

The evolution of purposes through time in a VCoP was studied in [52]. In that work, a review of the community was carried out, but the evolution of members through time was not analyzed. If the same idea is used to study the behaviour of members through time, it could be possible to research issues such as the evolution of members from common member to key-member, the evolution of member participation, and member churn detection.

5.5.3 Thematic networks

In this thesis, all concepts or topics are compared to measure the global contribution of a post, but it is possible to isolate concepts in order to obtain a network that includes only the highly valued posts of a specific concept or topic, creating a new kind of network representation, defined as a Thematic network.

With this network configuration, the key-members found are related to specific concepts or topics. With this approach, administrators' efforts to enhance the community will be more focused, because the question of "who knows what" will be answered.
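A Thematic network could be built with the same filtering idea, applied to the score of a single concept or topic instead of a global score. The sketch below is hypothetical: it assumes that every interaction carries one score per topic, reuses the 0.8 threshold only as a default, and ranks motivators by In-Degree inside each thematic network. Topic names and data are invented.

```python
from collections import Counter


def thematic_arcs(scored_interactions, topic, theta=0.8):
    """Keep the arcs whose score for one specific topic reaches theta."""
    return {(src, dst) for src, dst, scores in scored_interactions
            if scores.get(topic, 0.0) >= theta}


def in_degree_ranking(arcs):
    """Members ordered by number of incoming arcs (ties broken by name)."""
    counts = Counter(dst for _, dst in arcs)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [member for member, _ in ordered]


# Hypothetical interactions: (replier, replied-to member, scores per topic).
interactions = [
    ("ben", "ana", {"topic_1": 0.90, "topic_2": 0.10}),
    ("carla", "ana", {"topic_1": 0.85, "topic_2": 0.20}),
    ("ana", "dan", {"topic_1": 0.10, "topic_2": 0.95}),
    ("ben", "dan", {"topic_1": 0.30, "topic_2": 0.90}),
    ("carla", "ben", {"topic_1": 0.20, "topic_2": 0.40}),
]

topic_1_arcs = thematic_arcs(interactions, "topic_1")
topic_2_arcs = thematic_arcs(interactions, "topic_2")
topic_1_motivators = in_degree_ranking(topic_1_arcs)
topic_2_motivators = in_degree_ranking(topic_2_arcs)
```

Each topic produces its own ranking, so a different member leads each thematic network in this toy example.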


5.5.4 SNA and Topic extraction computational tool

Every algorithm used in this thesis was programmed only to obtain the needed data. There is no optimization of the code or efficiency in the algorithms implemented, and even graphic interfaces were omitted in favor of batch processes that produce all the network configurations needed. A benchmark of the software used is therefore not possible. A database was modeled and used to store and process the data, which could be useful for all later work, but a computational tool that does all the previous data processing is needed. Among the features that such a tool should have are:

• A multidimensional database with the needed information, as shown in section 4.3.

• A step-by-step framework that extracts the topics, measures the concept and topic post scores, configures the filtered and non-filtered networks (with the option of varying the threshold), uses the SNA features and presents a benchmark between algorithms.

• An end-user OLAP tool that presents the evolution of concepts, topics and user behaviour through time in the community.

• Graphical visualization of the networks and of the results obtained by SNA.

• A repository to store the resulting networks.

• A report generator for the experiments carried out.

This software will make it easier and quicker to obtain results, and will establish a standard for the analysis of VCoPs that will benefit later research.


Conferences and Workshops

The authors would like to thank the continuous support of Instituto Sistemas Complejos de Ingeniería (ICM: P-05-004-F, CONICYT: FBO16); Initiation into Research Funding (FONDECYT), project code 11090188, entitled "Semantic Web Mining Techniques to Study Enhancements of Virtual Communities"; and the Web Intelligence Research Group (wi.dii.uchile.cl).

[1] H. I. Alvarez, S. Ríos, F. Aguilera, G. L'Huillier. Enhancing SNA with a Concept-based Text Mining Approach to discover key members on a VCoP. In KES '10: 14th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems. Cardiff, Wales, England, 2010.

[2] G. L'Huillier, H. I. Alvarez, S. Ríos, F. Aguilera. Topic-Based Social Network Analysis for Virtual Communities of Interests in the Dark Web. ISI-KDD '10: ACM SIGKDD Workshop on Intelligence and Security Informatics 2010. Washington DC, USA, 2010.


REFERENCES

[1] R. Alberich, J. Miro-Julia, and F. Rossello. Marvel universe looks almost like a real social network. http://arxiv.org/abs/cond-mat/0202174, February 2002.

[2] Xavier Amatriain, Neal Lathia, Josep M. Pujol, Haewoon Kwak, and Nuria Oliver. The wisdom of the few: a collaborative filtering approach based on expert opinions from the web. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 532–539, Boston, MA, USA, 2009. ACM.

[3] Istvan Bíró, David Siklosi, Jacint Szabo, and Andras A. Benczur. Linked latent dirichlet allocation in web spam filtering. pages 37–40, 2009.

[4] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. The Journal of Machine Learning …, Jan 2003.

[5] A. Bourhis, L. Dube, R. Jacob, et al. The success of virtual communities of practice: The leadership factor. The Electronic Journal of Knowledge Management, 3(1):23–34, 2005.

[6] Christopher S. Campbell, Paul P. Maglio, Alex Cozzi, and Byron Dom. Expertise identification using email communications. In Proceedings of the 12th international conference on Information and knowledge management, pages 528–531, New Orleans, LA, USA, 2003. ACM.

[7] Deepayan Chakrabarti and Christos Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1):2, 2006.

[8] Chih-Jou Chen and Shiu-Wan Hung. To give or to receive? Factors influencing members' knowledge sharing and community promotion in professional virtual communities. Information & Management, 47(4):226–236, May 2010.

[9] Chao-Min Chiu, Meng-Hsiang Hsu, and Eric T.G. Wang. Understanding knowledge sharing in virtual communities: An integration of social capital and social cognitive theories. Decision Support Systems, 42(3):1872–1888, December 2006.

[10] Anthony Cocciolo, Hui Soo Chae, and Gary Natriello. Using social network analysis to highlight an emerging online community of practice. In Proceedings of the 8th international conference on Computer supported collaborative learning, pages 148–152, New Brunswick, New Jersey, USA, 2007. International Society of the Learning Sciences.

59

Page 73: DETECCION DE MIEMBROS CLAVE EN UNA COMUNIDAD … · de Datos y las t ecnicas de An aisis de Redes Sociales para encontrar miembros clave. La idea principal es obtener una representaci

REFERENCES

[11] Pedro O.S. Vaz de Melo, Virgilio A.F. Almeida, and Antonio A.F. Loureiro. Can complexnetwork metrics predict the behavior of NBA teams? In Proceeding of the 14th ACM SIGKDDinternational conference on Knowledge discovery and data mining, pages 695–703, Las Vegas,Nevada, USA, 2008. ACM.

[12] Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj. Exploratory Social Network Analysiswith Pajek. Cambridge University Press, 2004.

[13] Kristine de Valck, Gerrit H. van Bruggen, and Berend Wierenga. Virtual communities: Amarketing perspective. Decision Support Systems, 47(3):185–203, June 2009.

[14] Peter Eades. A heuristic for graph drawing. Congressus Numerantium, 42:149–160, 1984.

[15] Kate Ehrlich, Ching-Yung Lin, and Vicky Griffiths-Fisher. Searching for experts in the en-terprise: combining text and social network analysis. In Proceedings of the 2007 internationalACM conference on Supporting group work, pages 117–126, Sanibel Island, Florida, USA, 2007.ACM.

[16] Yu-Hui Fang and Chao-Min Chiu. In justice we trust: Exploring knowledge-sharing continuanceintentions in virtual communities of practice. Computers in Human Behavior, 26(2):235–246,March 2010.

[17] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, February2010.

[18] Thomas M. J. Fruchterman and Edward M. Reingold. Graph drawing by force-directed place-ment. Software: Practice and Experience, 21(11):1129–1164, 1991.

[19] Yupeng Fu, Rongjing Xiang, Yiqun Liu, Min Zhang, and Shaoping Ma. Finding experts usingsocial network analysis. In Proceedings of the IEEE/WIC/ACM International Conference onWeb Intelligence, pages 77–80. IEEE Computer Society, 2007.

[20] Joaquın Gairın-Sallan, David Rodrıguez-Gomez, and Carme Armengol-Asparo. Who exactlyis the moderator? a consideration of online knowledge management network moderation ineducational organisations. Computers & Education, 55(1):304–312, August 2010.

[21] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis,Second Edition (Chapman & Hall/CRC Texts in Statistical Science). Chapman & Hall, 2edition, July 2003.

[22] Lise Getoor and Christopher P. Diehl. Link mining: a survey. SIGKDD Explor. Newsl.,7(2):3–12, 2005.

[23] Pablo M Gleiser. How to become a superhero. Journal of Statistical Mechanics: Theory andExperiment, 2007(09):P09020–P09020, 2007.

[24] T Griffiths. Finding scientific topics. Number 101, pages 5228–5235, 2004.

[25] G Heinrich. Parameter estimation for text analysis. Technical report, 2004.

60

Page 74: DETECCION DE MIEMBROS CLAVE EN UNA COMUNIDAD … · de Datos y las t ecnicas de An aisis de Redes Sociales para encontrar miembros clave. La idea principal es obtener una representaci

REFERENCES

[26] John Hagel III and Arthur G. Armstrong. Net gain: expanding markets through virtual com-munities. Harvard Business School Press, 1997.

[27] T. Kamada and S. Kawai. An algorithm for drawing general undirected graphs. Inf. Process.Lett., 31(1):7–15, April 1989.

[28] Won Kim, Ok-Ran Jeong, and Sang-Won Lee. On social web sites. Information Systems,35(2):215–236, April 2010.

[29] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632,1999.

[30] Gueorgi Kossinets. Effects of missing data in social networks. Social Networks, 28(3):247–268,July 2006.

[31] Ravi Kumar, Jasmine Novak, and Andrew Tomkins. Structure and evolution of online socialnetworks. In Proceedings of the 12th ACM SIGKDD international conference on Knowledgediscovery and data mining, pages 611–617, Philadelphia, PA, USA, 2006. ACM.

[32] Haewoon Kwak, Yoonchan Choi, Young-Ho Eom, Hawoong Jeong, and Sue Moon. Miningcommunities in networks: a solution for consistency and its evaluation. pages 301–314, 2009.

[33] Ohbyung Kwon and Yixing Wen. An empirical study of the factors affecting social networkservice use. Computers in Human Behavior, 26(2):254–263, March 2010.

[34] Ming-Ji James Lin, Shiu-Wan Hung, and Chih-Jou Chen. Fostering the determinants of knowl-edge sharing in professional virtual communities. Computers in Human Behavior, 25(4):929–939, July 2009.

[35] Xiaoyong Liu, W. Bruce Croft, and Matthew Koll. Finding experts in community-basedquestion-answering services. In Proceedings of the 14th ACM international conference on In-formation and knowledge management, pages 315–316, Bremen, Germany, 2005. ACM.

[36] Stanley Loh, Jose Palazzo M. de Oliveira, and Mauricio A. Gameiro. Knowledge discovery intexts for constructing decision support systems. Applied Intelligence, 18(3):357–366, May 2003.

[37] Andrew McCallum, Andres Corrada-Emmanuel, and Xuerui Wang. Topic and role discov-ery in social networks. In Proceedings of the 19th international joint conference on Artificialintelligence, pages 786–791, Edinburgh, Scotland, 2005. Morgan Kaufmann Publishers Inc.

[38] Andrew McCallum, Xuerui Wang, and Andres Corrada-Emmanuel. Topic and role discovery insocial networks with experiments on enron and academic email. J. Artif. Int. Res., 30(1):249–272, 2007.

[39] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhat-tacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACMSIGCOMM conference on Internet measurement, pages 29–42, San Diego, California, USA,2007. ACM.

61

Page 75: DETECCION DE MIEMBROS CLAVE EN UNA COMUNIDAD … · de Datos y las t ecnicas de An aisis de Redes Sociales para encontrar miembros clave. La idea principal es obtener una representaci

REFERENCES

[40] Katarzyna Musia l, Przemys law Kazienko, and Piotr Brodka. User position measures in socialnetworks. In Proceedings of the 3rd Workshop on Social Network Mining and Analysis, pages1–9, Paris, France, 2009. ACM.

[41] H Nakanishi, IB Turksen, and M Sugeno. A review and comparison of six reasoning methods.Fuzzy Sets and Systems, Jan 1993.

[42] Robert D Nolker and Lina Zhou. Social computing and weighting to identify member rolesin online communities. Web Intelligence, IEEE / WIC / ACM International Conference on,0:87–93, 2005.

[43] Nishith Pathak, Colin Delong, Arindam Banerjee, and Kendrick Erickson. Social topic modelsfor community extraction. Aug 2008.

[44] Adam Perer and Ben Shneiderman. Integrating statistics and visualization: case studies ofgaining clarity during exploratory data analysis. In Proceeding of the twenty-sixth annualSIGCHI conference on Human factors in computing systems, pages 265–274, Florence, Italy,2008. ACM.

[45] Ulrike Pfeil. Online support communities for older people: investigating network patterns andcharacteristics of social support. SIGACCESS Access. Comput., (89):35–41, 2007.

[46] Ulrike Pfeil and Panayiotis Zaphiris. Investigating social network patterns within an empathiconline community for older people. Computers in Human Behavior, 25(5):1139–1155, Septem-ber 2009.

[47] Ulrike Pfeil and Panayiotis Zaphiris. Investigating social network patterns within an empathiconline community for older people. Computers in Human Behavior, 25(5):1139–1155, Septem-ber 2009.

[48] X H Phang and CT Nguyen. Gibbslda++ (http://gibbslda.sourceforge.net/), 2008.

[49] Gilbert Probst and Stefano Borzillo. Why communities of practice succeed and why they fail.European Management Journal, 26(5):335–347, October 2008.

[50] Indra Rajasingh, Bharati Rajan, and Florence Isido D. Betweeness-Centrality of grid networks.In Computer Technology and Development, International Conference on, volume 1, pages 407–410, Los Alamitos, CA, USA, 2009. IEEE Computer Society.

[51] Sebastian Rıos. A Study on Web Mining Techniques for Off-Line Enhancements of Web Sites.PhD thesis, University of Tokyo, 2007.

[52] Sebastian Rıos, Felipe Aguilera, and Luis Guerrero. Virtual communities of practices purposeevolution analysis using a Concept-Based mining approach. In Knowledge-Based and IntelligentInformation and Engineering Systems, pages 480–489. 2009.

[53] G Salton, A Wong, and C S Yang. A vector space model for automatic indexing. Commun.ACM, Vol. 18(11):613–620, 1975.

62

Page 76: DETECCION DE MIEMBROS CLAVE EN UNA COMUNIDAD … · de Datos y las t ecnicas de An aisis de Redes Sociales para encontrar miembros clave. La idea principal es obtener una representaci

REFERENCES

[54] S.L. Toral, M.R. Martınez-Torres, and F. Barrero. Analysis of virtual communities supportingOSS projects using social network analysis. Information and Software Technology, 52(3):296–303, March 2010.

[55] J.D. Velasquez and V. Palade. Adaptive Web Sites: A Knowledge Extraction from Web DataApproach. IOS Press, 2008.

[56] Jyun-Cheng Wang, Chui-Chen Chiu, and Jr jing Tang. The correlation study of eWOM andproduct sales predictions through SNA perspectives: an exploratory investigation by taiwan’scellular phone market. In Proceedings of the 7th international conference on Electronic com-merce, pages 666–673, Xi’an, China, 2005. ACM.

[57] S Wasserman and K Faust. Social Network Analysis: Methods and Applications. 1994.

[58] Etienne Wenger, Richard Arnold McDermott, and William Snyder. Cultivating communitiesof practice. Harvard Business Press, 2002.

[59] Dongshan Xing and Mark Girolami. Employing latent dirichlet allocation for fraud detectionin telecommunications. Pattern Recognition Letters, Vol. 28(13):1727–1734, 2007.

[60] K. Yelupula and Srini Ramaswamy. Social network analysis for email classification. In Pro-ceedings of the 46th Annual Southeast Regional Conference on XX, pages 469–474, Auburn,Alabama, 2008. ACM.

[61] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities:structure and algorithms. In Proceedings of the 16th international conference on World WideWeb, pages 221–230, Banff, Alberta, Canada, 2007. ACM.

[62] Bin Zhu, Stephanie Watts, and Hsinchun Chen. Visualizing social network concepts. DecisionSupport Systems, 49(2):151–161, May 2010.

63

Page 77: DETECCION DE MIEMBROS CLAVE EN UNA COMUNIDAD … · de Datos y las t ecnicas de An aisis de Redes Sociales para encontrar miembros clave. La idea principal es obtener una representaci

Appendix


Appendix A

SNA-KDD Results

The following results complement those shown and discussed in Section 4.3.

A.1 Key-members obtained

The top 10 motivators and repliers are shown in the following tables. Nicknames were replaced by community user IDs.

Table A.1: Reply Oriented Motivators Key-members

In-Degree                              Authority
Counting    Concept Based  LDA         Counting    Concept Based  LDA
User101696  User101696     User101696  User101696  User101696     User101696
User1       User1          User1       User1       User1          User1
User100010  User100010     User101697  User101697  User101651     User101697
User101697  User101651     User100010  User100010  User100010     User100010
User36      User100439     User101651  User36      User100439     User101651
User101651  User101697     User36      User101651  User101697     User36
User100439  User36         User100439  User100439  User36         User100439
User101376  User161        User677     User101376  User161        User677
User100314  User101376     User100314  User100314  User101376     User100314
User100074  User100314     User100074  User100074  User100314     User161


Table A.2: Reply Oriented Repliers Key-members

Out-Degree                             Hub
Counting    Concept Based  LDA         Counting    Concept Based  LDA
User101696  User101696     User101696  User101696  User101696     User101696
User101697  User101697     User101697  User1       User1          User1
User1       User101651     User100439  User101697  User101651     User101697
User100439  User100010     User100010  User100010  User100010     User100010
User100010  User1          User1       User36      User100439     User101651
User101651  User100439     User101651  User101651  User101697     User36
User161     User101376     User161     User100439  User36         User100439
User101376  User32         User677     User101376  User101376     User677
User32      User100314     User100314  User100314  User161        User100314
User100314  User36         User32      User100074  User100314     User161

Table A.3: All Previous Oriented Motivators Key-members

In-Degree                              Authority
Counting    Concept Based  LDA         Counting    Concept Based  LDA
User101696  User101696     User101696  User101696  User101696     User101696
User1       User100010     User36      User1       User100010     User36
User100010  User1          User100314  User100010  User1          User101651
User36      User101651     User101651  User36      User101651     User100314
User101697  User101697     User1       User101697  User101697     User1
User101651  User101376     User161     User101651  User101376     User101697
User100439  User161        User100010  User100439  User161        User161
User101376  User677        User101697  User101376  User677        User100010
User100314  User32         User100439  User100314  User32         User100439
User161     User100074     User101452  User161     User100074     User101452

Table A.4: All Previous Oriented Repliers Key-members

Out-Degree                             Hub
Counting    Concept Based  LDA         Counting    Concept Based  LDA
User101696  User101696     User101696  User101696  User101696     User101696
User101697  User101651     User101697  User1       User100010     User36
User100010  User101697     User101651  User36      User1          User101651
User1       User100010     User677     User100010  User101651     User100314
User36      User1          User36      User101697  User101697     User1
User101651  User101376     User1       User101651  User101376     User161
User101376  User100439     User161     User100439  User161        User101697
User161     User161        User100314  User101376  User677        User100010
User100439  User36         User101452  User100314  User32         User100439
User32      User101452     User100010  User161     User100074     User101452
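
The rankings in Tables A.1 to A.4 rest on four scores: in-degree and authority for motivators, and out-degree and hub for repliers. As a rough illustration of how such scores can be computed, the following sketch derives weighted degrees and runs a plain HITS power iteration. It is not the implementation used in this work. It assumes that each edge points from the member who replies to the member who is replied to, and that the edge weight stands in for whichever network configuration is used (Counting, Concept Based or LDA). The member names, weights, and the function names `degree_and_hits` and `top_k` are hypothetical.

```python
import math


def _normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(x * x for x in scores.values()))
    if norm == 0.0:
        return scores
    return {n: x / norm for n, x in scores.items()}


def degree_and_hits(edges, iterations=50):
    """Return (in_degree, out_degree, hub, authority) dicts.

    edges: iterable of (source, target, weight) tuples. The source is
    assumed to be the replier and the target the member being replied
    to (an assumption of this sketch, not taken from the thesis).
    """
    edges = list(edges)
    nodes = sorted({u for u, _, _ in edges} | {v for _, v, _ in edges})

    # Weighted degrees: replies written (out) and replies received (in).
    in_deg = dict.fromkeys(nodes, 0.0)
    out_deg = dict.fromkeys(nodes, 0.0)
    for u, v, w in edges:
        out_deg[u] += w
        in_deg[v] += w

    # HITS: authorities are pointed at by good hubs, hubs point at
    # good authorities. Alternate both updates and renormalize.
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        new_auth = dict.fromkeys(nodes, 0.0)
        for u, v, w in edges:
            new_auth[v] += w * hub[u]
        auth = _normalize(new_auth)

        new_hub = dict.fromkeys(nodes, 0.0)
        for u, v, w in edges:
            new_hub[u] += w * auth[v]
        hub = _normalize(new_hub)
    return in_deg, out_deg, hub, auth


def top_k(scores, k=10):
    """Names of the k highest-scoring members (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical toy network: (replier, member replied to, weight).
edges = [
    ("b", "a", 1.0),
    ("c", "a", 1.0),
    ("d", "a", 1.0),
    ("a", "b", 1.0),
    ("c", "b", 1.0),
]
in_deg, out_deg, hub, auth = degree_and_hits(edges)
print("motivators by in-degree:", top_k(in_deg, 3))
print("motivators by authority:", top_k(auth, 3))
print("repliers by out-degree:", top_k(out_deg, 3))
print("repliers by hub:", top_k(hub, 3))
```

Under these assumptions, switching between network configurations would only change the edge weights passed to `degree_and_hits`; the ranking step stays the same.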


A.2 Key-member Database

Figure A.1 displays the resulting database after the key-member discovery process.

Figure A.1: Key-member Final Database.


A.3 Topics extracted

Table A.5: Topic Extracted

Topic Id  Meaning
0         distortion
1         not helpful
2         effects sheets printing
3         classical new users questions
4         help asking
5         weld techniques
6         cable connections
7         not helpful
8         boxes connection and construction
9         not helpful
10        failures explanation
11        congrats
12        guitar construction woods
13        electronic supplies stores
14        references to English books and general texts
15        modifications to JCM Marshall amplifier
16        electronic components
17        tube amplifier
18        previous posts replies conversations
19        not helpful
20        handmade brands
21        not helpful
22        plexilandia website
23        images and videos of construction advances
24        modulation effects
25        places where to buy
26        sound differences factors
27        not helpful
28        failure detection process
29        transistors adjustment
30        band sound appreciation
31        transformers for tube amplifiers
32        tube amplifiers rectifiers
33        not helpful
34        many badly written words
35        community coexistence norms
36        effects boxes and amplifiers plateholder
37        effects schemes
38        different effects
39        effects switches
40        not helpful
41        plexi-meeting
42        not helpful
43        components buy and sale requests
44        guitars brand and models
45        distortion stages
46        software and hardware for sound applications
47        brands vs prices opinion and recommendation
48        couple problems in distortion effects
49        acoustic isolation
