
Journal cross-citation analysis for validation and improvement of journal-based subject classification in bibliometric research

Lin Zhang • Frizo Janssens • Liming Liang • Wolfgang Glanzel

Received: 11 May 2009 / Published online: 17 February 2010

© Akadémiai Kiadó, Budapest, Hungary 2010

Abstract The objective of this study is to use a clustering algorithm based on journal

cross-citation to validate and to improve the journal-based subject classification schemes.

The cognitive structure based on the clustering is visualized by the journal cross-citation

network and three kinds of representative journals in each cluster among the communication

network have been detected and analyzed. As an existing reference system the

15-field subject classification by Glanzel and Schubert (Scientometrics 56:55–73, 2003)

has been compared with the clustering structure.

Keywords Journal cross-citation · Cluster analysis · Mapping of science · Subject classification

Introduction

Among the scientific aggregations, journals play a vital role in the spread of information

within and between disciplines. As early as in the 1960s, Price (1965) suggested that

journals would be the appropriate units of analysis and that aggregated mutual citations

among journals might reveal the disciplinary and finer-grained delineations. There is an

L. Zhang · F. Janssens · W. Glanzel (✉)
Centre for R&D Monitoring, Department of MSI, K. U. Leuven, Leuven, Belgium
e-mail: [email protected]

L. Zhang · L. Liang
WISE Lab, Dalian University of Technology, Dalian, China

F. Janssens
K. U. Leuven, ESAT-SCD, Leuven, Belgium

L. Liang
Institute for Science, Technology, and Society, Henan Normal University, Xinxiang, China

W. Glanzel
Hungarian Academy of Sciences, IRPS, Budapest, Hungary


Scientometrics (2010) 82:687–706
DOI 10.1007/s11192-010-0180-1


inherent challenge for structuring these scientific aggregations, since it may well reflect the

mosaic of cognitive knowledge (Carpenter and Narin 1973). The problem of how to

delineate the journal sets consistently has been a major concern of scientometric

researchers since the field's early days (Narin et al. 1972; Narin 1976; Leydesdorff 2002).

Along with the development of computerised scientometrics, mapping of science plays

an increasing role in information science. Recently, the progress in visualization techniques

has added the ability to visualize knowledge domains (e.g., Borner et al. 2003).

However, most of the published journal-based maps have typically been focused on small

or single disciplines (e.g., McCain 1991, 1998; Morris and McCain 1998; Ding et al. 2000;

Tsay et al. 2003). Garfield (1998) stated that the new techniques of visualization make it

possible to generate global science maps and to identify emerging research fronts. A few

more recent works have tried to map journals on a larger scale. Bassecoulard and Zitt

(1999) produced a hierarchical journal structure using 1993 JCR (Journal Citation Reports)

data from 32 disciplines. Leydesdorff (2004a, b) used the 2001 JCR data to map the

journals from the SCI and SSCI on the basis of a Pearson correlation of citing counts as the

edge weights.

Subject classification schemes have a long tradition in library and information science

and management. Most of these classification systems are based on longstanding practice

and experience and the intellectual interpretation of the literature’s content. Although more

recent schemes also include computerised components, we will call them “intellectual”

schemes to distinguish them from machine-based solutions that are, to the largest

possible extent, independent of assignment by human judgement and expertise.

There are several existing ‘‘intellectual’’ classification schemes used in bibliometrics,

such as the 22-field broad classification scheme of the Essential Science Indicators database

and the 240+ subject categories of the Journal Citation Reports database.

Glanzel and Schubert (2003) and Boyack et al. (2005) each proposed 12 subject

areas, though their results differ from each other. The question arises as to how far it is

feasible to validate and further adjust the existing subject classification schemes by relying on

computerised techniques.

The main objective of this research is twofold: first, we study the cognitive structure

based on journal cross-citation cluster analysis; then we compare the cluster structure with

traditional “intellectual” subject-classification schemes.

Several methodological approaches are possible: Both co-citation and bibliographic

coupling have to cope with methodological problems. This has been reported, for instance,

by Hicks (1987) in the context of co-citation analysis and by Janssens et al. (2008) and

Jarneving (2005) with regard to bibliographic coupling. One solution is to combine these

techniques with other methods such as lexical-based approaches (Braam et al. 1991;

Janssens 2007; Janssens et al. 2008), or to make use of direct reference-citation links

among pre-defined units as, for instance, journal cross-citations. The cross-citation method

certainly has advantages, such as the possibility of analyzing the direction of information

flows among the units under study (Zhang et al. 2009). Leydesdorff (2006) and Leydesdorff

and Rafols (2008) have used journal cross-citation matrices to visualise the structure of

science and its dynamics. In contrast to the latter studies, our method is not based on the

JCR. We calculate citations on a paper-by-paper basis and then assign individual papers to

the journals in which they have been published. This offers four important opportunities:

(1) Selection of document types,

(2) Choice of pre-defined publication periods and citation windows,


(3) Use of keywords extracted from titles and abstracts to characterise the cognitive

composition of the clusters and

(4) Combination of the citation-based classification with other methods (e.g., with text

mining).

The 15-field subject classification scheme developed by Glanzel and Schubert (2003) is

used as ‘‘control structure’’ to compare the obtained results with an ‘‘intellectual’’ subject

classification.

Data

The data have been collected from the Web of Science (WoS) of Thomson Scientific (part of

Thomson Reuters) for the period 2002–2006. Only papers of the types article, note, letter

and review were taken into account. Citations have been counted from the publication

year until 2006. The complete database has been indexed, and all terms extracted from titles,

abstracts and keywords have been used for ‘‘labelling’’ the obtained clusters.

All journals have been checked for continuity. Journals that changed name, were merged

or were split up were identified and unified, and journals that were not covered in the entire

period have been omitted. Furthermore, in order to build meaningful and statistically

reliable measures, only journals that had published at least 50 papers and whose

references and citations summed to at least 30 during 2002–2006 were considered. This left

8,305 journals. Most of the subsequent analyses were performed

in Java and MATLAB.

Methods

Our analysis is conducted in the following five steps.

(1) Clustering data into journal groupings and evaluating the obtained clusters.

(2) Labelling clusters using most relevant terms.

(3) Studying the cognitive structure based on cross-citation cluster analysis.

(4) Comparing the cluster structure with an existing ‘‘intellectual’’ classification scheme.

(5) Detecting ‘‘migration’’ of journals to improve the ‘‘intellectual’’ classification

scheme.

For the cluster analysis we use the agglomerative hierarchical clustering algorithm with

Ward’s method (Jain and Dubes 1988). This is a hard clustering algorithm, which means

that each individual journal is assigned to exactly one cluster.

Textual content was indexed with the Jakarta Lucene platform (Hatcher and Gospodnetic

2004) and encoded in the Vector Space Model using the TF-IDF weighting scheme (Baeza-

Yates and Ribeiro-Neto 1999). Stop words were neglected during indexing and the Porter

stemmer was applied to all remaining terms from titles, abstracts, and keyword fields. The

most relevant terms in each cluster were collected for labelling the clusters.
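
The labelling step can be illustrated with a minimal Python sketch. It is not the Lucene-based pipeline used in the study: the function name `top_terms_per_cluster`, the toy documents, and the choice to score a term by summing its TF-IDF weights over all documents of a cluster are assumptions made for illustration. The input is assumed to be already stop-word-filtered and stemmed.

```python
import math
from collections import Counter, defaultdict


def top_terms_per_cluster(docs, labels, k=3):
    """Return the k highest-scoring terms for each cluster label.

    docs   : list of token lists (assumed already stemmed, stop words removed)
    labels : cluster label of each document
    Scoring (an assumption): sum over the cluster's documents of tf * ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for term, tf in Counter(doc).items():
            scores[label][term] += tf * math.log(n_docs / df[term])
    return {label: [t for t, _ in c.most_common(k)] for label, c in scores.items()}


# Hypothetical toy corpus: two geoscience and two oncology documents.
docs = [
    ["rock", "basin", "rock"],
    ["rock", "sediment"],
    ["tumor", "cancer", "tumor"],
    ["cancer", "tumor"],
]
labels = [0, 0, 1, 1]
print(top_terms_per_cluster(docs, labels, k=2))
```

Summing weights over a cluster favours terms that are both frequent inside the cluster and rare elsewhere; other aggregation rules (for example, the mean weight per journal) would be equally plausible readings of "most relevant terms".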

The comparison of cluster structure with the existing classification system is based on

multidimensional scaling and the Jaccard index.

Several measures were applied to detect the important or representative journals in each

cluster, namely, the journal strong links, journal entropies, and a modified version of

Google’s PageRank algorithm (Brin and Page 1998). The first two indicators were also

used to compare clusters with respect to their communication characteristics.


(1) Symmetrised link strength between journal i and j (SLij)

$$\mathrm{SL}_{ij} = \frac{a_{ij} + a_{ji}}{\sqrt{(TC_i + TR_i)\,(TC_j + TR_j)}} \qquad (1)$$

where i and j denote journals, TCi the total number of citations of journal i, TRi the total

number of references of journal i and aij the number of citations journal i receives from

journal j. This indicator measures the strength of cross-citation links between two journals in the

symmetric matrix. It can be considered as a measure of ‘‘centrality’’ of journals.
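
As a worked illustration of Eq. 1, the following Python sketch computes the symmetrised link strength from a small cross-citation matrix. The function name and the toy matrix are hypothetical, and the totals TCi and TRi are assumed here to be the row and column sums of the matrix itself (that is, only citations within the journal set are counted), which the paper does not state explicitly.

```python
import math


def symmetrised_link_strength(a, i, j):
    """SL_ij of Eq. 1, where a[i][j] = citations journal i receives from journal j."""
    n = len(a)

    def total(k):
        tc = sum(a[k][m] for m in range(n))  # citations received by journal k
        tr = sum(a[m][k] for m in range(n))  # references given by journal k
        return tc + tr

    denom = math.sqrt(total(i) * total(j))
    return (a[i][j] + a[j][i]) / denom if denom else 0.0


# Hypothetical 3-journal cross-citation matrix.
a = [
    [0, 2, 1],
    [4, 0, 0],
    [1, 3, 0],
]
print(round(symmetrised_link_strength(a, 0, 1), 4))
```

The measure is symmetric by construction, so `symmetrised_link_strength(a, 0, 1)` and `symmetrised_link_strength(a, 1, 0)` return the same value.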

(2) Journal citation entropy according to symmetrised links (ELi)

$$\mathrm{EL}_i = -\sum_{j=1}^{n} \frac{a_{ij} + a_{ji}}{TC_i + TR_i}\,\log_2\!\left(\frac{a_{ij} + a_{ji}}{TC_i + TR_i}\right) \qquad (2)$$

This indicator measures how far references/citations are spread among other journals

but, unlike the previous indicator, it does not measure centrality. Therefore, indicators (1)

and (2) complement each other.
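
Eq. 2 can be sketched in Python as follows. The function name and the toy matrix are hypothetical; as an assumption, TCi + TRi is taken to be the sum of all symmetrised links of journal i within the matrix, and terms with zero links are skipped (the usual convention 0 · log 0 = 0).

```python
import math


def link_entropy(a, i):
    """EL_i of Eq. 2, where a[i][j] = citations journal i receives from journal j."""
    n = len(a)
    links = [a[i][j] + a[j][i] for j in range(n)]  # symmetrised links of journal i
    total = sum(links)                             # assumed equal to TC_i + TR_i
    entropy = 0.0
    for w in links:
        if w > 0:
            p = w / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical matrix: journal 0 exchanges citations evenly with journals 1 and 2,
# whereas journals 1 and 2 are linked to journal 0 only.
a = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
print(link_entropy(a, 0), link_entropy(a, 1))
```

In this toy case journal 0 spreads its links evenly over two partners and reaches an entropy of one bit, while a journal linked to a single partner has entropy zero regardless of how strong that link is.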

(3) PageRank algorithm (Brin and Page 1998)

$$\mathrm{PR}_i = \frac{1-\alpha}{n} + \alpha \sum_{j} \mathrm{PR}_j\,\frac{a_{ij}/P_i}{\sum_{k} a_{kj}/P_k} \qquad (3)$$

where PRi is the PageRank of journal i, α is a scalar between 0 and 1 (α = 0.9 in our

implementation), n is the number of journals, and Pi is the number of papers published by

journal i, both in the period under study. The PageRank of a journal can be understood here

as the probability that a random reader will be reading that journal, when he/she randomly

and with equal probability looks up cited references to other journals (different from the

current one), but once in a while randomly picks another journal from the library (cluster).

In general, the journals ranking highest represent their cluster in an adequate manner

(Janssens et al. 2009).
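
A minimal power-iteration sketch of Eq. 3 in Python is given here for illustration only. The function name, the iteration limits and the toy data are assumptions; the diagonal of the matrix is assumed to be zero (self-citations excluded), and journals that cite nobody are simply skipped, so their rank mass is not redistributed.

```python
def modified_pagerank(a, papers, alpha=0.9, max_iter=500, tol=1e-12):
    """Iterate Eq. 3, where a[i][j] = citations journal i receives from journal j
    and papers[i] = number of papers published by journal i."""
    n = len(a)
    # Normaliser for each citing journal j: sum_k a[k][j] / P_k.
    col = [sum(a[k][j] / papers[k] for k in range(n)) for j in range(n)]
    pr = [1.0 / n] * n
    for _ in range(max_iter):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                if col[j] > 0:  # skip journals without outgoing references
                    s += pr[j] * (a[i][j] / papers[i]) / col[j]
            new.append((1 - alpha) / n + alpha * s)
        delta = max(abs(x - y) for x, y in zip(new, pr))
        pr = new
        if delta < tol:
            break
    return pr


# Hypothetical example: journal 0 receives most of the citations.
a = [
    [0, 5, 5],
    [1, 0, 0],
    [1, 0, 0],
]
papers = [10, 10, 10]
ranks = modified_pagerank(a, papers)
print([round(r, 3) for r in ranks])
```

With α = 0.9 the iteration contracts by a factor of at most 0.9 per step, so a few hundred iterations are ample for a small matrix; a production implementation on 8,305 journals would more likely use sparse matrix operations.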

Results

Clustering data into journal groupings and making evaluations of the cluster analysis

Our analysis is based on the symmetric journal cross-citation matrix. As the

similarity measure between journals, we opted for “second-order” journal cross-citation

similarities for clustering, taking into account that two journals might be highly cited by a

third one. By ‘‘second-order similarities’’ we mean that citation links between a journal and

all other journals were used as input for another step of pairwise similarity calculation. The

second-order similarities are found by calculating the cosine of the angle between pairs of


vectors containing all symmetric journal cross-citation values between the two respective

journals and all other journals (Janssens 2007). On the basis of these similarity measures

we applied the agglomerative hierarchical cluster algorithm on the basis of Ward’s method

(e.g., Jain and Dubes 1988).
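
The idea of second-order similarity can be sketched in Python as follows. The function name and the toy matrix are hypothetical, and as a simplifying assumption each journal's vector is taken to be its full row of the symmetric first-order matrix. The clustering step itself (Ward's method) is not shown.

```python
import math


def second_order_similarity(s):
    """Cosine similarity between the rows of a symmetric first-order matrix s."""
    n = len(s)
    norms = [math.sqrt(sum(v * v for v in row)) for row in s]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] and norms[j]:
                dot = sum(x * y for x, y in zip(s[i], s[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Hypothetical case: journals 0 and 1 never cite each other,
# but both are strongly linked to journal 2.
s = [
    [0, 0, 3],
    [0, 0, 3],
    [3, 3, 0],
]
sim = second_order_similarity(s)
print(sim[0][1], sim[0][2])
```

Journals 0 and 1 have no direct link, yet their second-order similarity is maximal because their link profiles are identical. This is the situation of two journals related through a third one that first-order similarities would miss.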

In a first step we determined the optimal number of clusters, which depends on different

factors (the similarity measure, data representation, validation method, etc.). This was done

by comparing the quality of clustering solutions with various numbers of

clusters, where we mainly relied on mean Silhouette curves.

We finally chose 15 clusters, which promises to provide stable results (Fig. 2).

This also allows us to compare and validate the existing 15-field classification scheme

developed at SOOI in Leuven. Figure 1 shows the cluster dendrogram for cross-citation

hierarchical clustering, which has been cut off at 15 clusters on the left-hand side, yielding

a relatively well-separated cluster scheme.

Aware of the difficulty of determination of the optimal cut-off point on a dendrogram

(Jain and Dubes 1988), we complemented this method with the mean Silhouette curves.

Silhouette values range from −1 to +1 and compare a document's similarity to the documents in its

own cluster with its similarity to documents in other clusters (Rousseeuw 1987). In particular, S(i) is defined

as follows:

$$S(i) = \frac{\min_j B(i, C_j) - W(i)}{\max\left[\min_j B(i, C_j),\, W(i)\right]} \qquad (4)$$

where W(i) is the average distance from document i to all other documents within its

cluster, and B(i, Cj) is the average distance from document i to all documents in another

cluster Cj. Since the calculation of Silhouette values is based on distances, different

distance measures might provide different results. Figure 2 compares the performance of

cross-citation clustering for various numbers of clusters based on the cosine similarity of

cross-citation. However, from this figure it is difficult to determine a clear locally optimal

number of clusters. Considering one of our main objectives, the comparison of the results

with the SOOI classification, 15 proved an appropriate choice for the number of clusters.
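
Eq. 4 and the mean Silhouette value used to compare clustering solutions can be sketched in Python as follows. The function names and the one-dimensional toy data are hypothetical; returning 0 for singleton clusters or for a single-cluster solution is an assumed convention, not something stated in the paper.

```python
def silhouette_value(dist, labels, i):
    """S(i) of Eq. 4 for item i, given a distance matrix and cluster labels."""
    own = labels[i]
    by_cluster = {}
    for k, label in enumerate(labels):
        if k != i:
            by_cluster.setdefault(label, []).append(dist[i][k])
    same = by_cluster.pop(own, [])
    if not same or not by_cluster:
        return 0.0  # assumed convention: singleton cluster or only one cluster
    w = sum(same) / len(same)                               # W(i)
    b = min(sum(d) / len(d) for d in by_cluster.values())   # min_j B(i, C_j)
    denom = max(b, w)
    return (b - w) / denom if denom else 0.0


def mean_silhouette(dist, labels):
    """Average S(i) over all items, as plotted in a mean Silhouette curve."""
    n = len(labels)
    return sum(silhouette_value(dist, labels, i) for i in range(n)) / n


# Hypothetical items on a line: two tight groups far apart.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
print(mean_silhouette(dist, [0, 0, 1, 1]), mean_silhouette(dist, [0, 1, 0, 1]))
```

Evaluating `mean_silhouette` for cuts of the dendrogram at different numbers of clusters gives one point per solution on a curve such as the one in Fig. 2.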

Fig. 1 Cluster dendrogram for cross-citation hierarchical clustering of 8,305 journals (cut off at 15 clusters on the left-hand side)


The quality of a specific clustering scheme can be further visualized in a Silhouette

plot, which expresses the contrast between intra- and inter-cluster similarities. Figure 3

presents the evaluation of the 15-cluster cross-citation solution, based on Silhouette values from cross-citation (left)

and linearly combined citation-text (right) distances. The further the Silhouette

profile of a cluster lies to the right of the vertical line at the value 0, the more coherent the

cluster is. Accordingly, the Silhouette plots substantiate that the results of this clustering

analysis are acceptable except for cluster #1. This cluster represents a somewhat heterogeneous,

less consistent field.

Labelling clusters using most relevant terms

In order to describe the cognitive characteristics of individual clusters, we labelled each

cluster using the best TF-IDF terms on the basis of a textual approach. In particular, these

Fig. 2 Silhouette curves based on the cosine similarity of cross-citation for different numbers of clusters

Fig. 3 Evaluation of the cross-citation clustering solution with 15 clusters by Silhouette plot based on citations (left) and citation-text combined distances (right)


Table 1 The 50 best TF-IDF terms describing the 15 cross-citation clusters

1 finit;nonlinear;firm;web;busi;fuzzi;graph;custom;logic;motion;machin;nois;ltd;internet;asymptot;schedul;compani;price;robot;organiz;polit;stochast;fault;crack;polynomi;librari;traffic;elast;veloc;iter;neural;court;vehicl;job;circuit;concret;corpor;text;semant;queri;sensor;ey;execut;node;wireless;video;trade;student;team;compress;

2 rock;basin;sediment;miner;fault;isotop;ma;tecton;sea;metamorph;volcan;cretac;assemblag;mantl;faci;ocean;magma;lake;sedimentari;southern;climat;fossil;melt;granit;magmat;miocen;northern;belt;marin;crust;basalt;continent;stratigraph;or;pb;geolog;geochem;river;eastern;jurass;ic;quartz;triassic;sr;deform;fauna;holocen;crustal;zircon;seismic;

3 receptor;dna;cancer;tumor;rat;mice;mutat;genom;transcript;il;immun;enzym;peptid;kinas;inhibitor;apoptosi;chromosom;therapi;antibodi;molecul;infect;mrna;rna;antigen;bone;muscl;mous;metabol;vitro;ca2;vivo;therapeut;assai;prolifer;liver;lipid;inflammatori;neuron;cytokin;mutant;transplant;phenotyp;ligand;brain;amino;serum;phosphoryl;lymphocyt;polymorph;syndrom;

4 polym;ion;catalyst;thermal;solvent;bond;crystal;adsorpt;soil;hydrogen;film;aqueou;molecul;atom;nmr;polymer;ligand;methyl;poli;spectroscopi;electrod;cu;ltd;fuel;reactor;chromatographi;spectra;ms;veloc;toxic;coat;gel;column;solubl;copper;dry;blend;mol;salt;cation;pollut;chemistri;wast;enzym;copolym;surfact;emiss;cd;catalyt;bi;

5 polit;war;urban;gender;parti;reform;discours;democraci;democrat;economi;capit;contemporari;civil;narr;essai;ethnic;liber;immigr;sociolog;british;german;rural;land;geographi;labor;china;union;labour;coloni;ideolog;religi;elect;welfar;privat;moral;militari;foreign;sector;employ;race;trade;nineteenth;african;actor;write;citizen;racial;america;crisi;africa;

6 infect;dog;viru;vaccin;hiv;cow;hors;milk;pcr;bacteria;pathogen;antibodi;parasit;bacteri;pig;dna;cat;cattl;immun;coli;enzym;diet;breed;calv;antibiot;assai;genom;viral;antigen;serum;sheep;herd;farm;egg;sp;therapi;genotyp;bovin;microbi;dairi;malaria;antimicrobi;fed;veterinari;bird;chicken;pneumonia;vitro;tuberculosi;virul;

7 cancer;tumor;breast;surgeri;carcinoma;surgic;ct;lesion;therapi;resect;malign;tumour;recurr;chemotherapi;arteri;mri;mr;lung;postop;radiotherapi;node;aneurysm;metastas;histolog;biopsi;bone;prostat;liver;invas;neck;pet;lymph;underw;brain;median;metastat;colorect;patholog;pain;endoscop;tomographi;injuri;nerv;lymphoma;symptom;cervic;hospit;preoper;laparoscop;pancreat;

8 algebra;theorem;manifold;finit;infin;let;omega;invari;polynomi;singular;compact;inequ;lambda;asymptot;ellipt;conjectur;bar;proof;convex;hyperbol;lie;phi;nonlinear;banach;topolog;epsilon;metric;sigma;infinit;curvatur;cohomolog;symmetr;semigroup;holomorph;hilbert;bundl;subgroup;commut;integ;graph;abelian;automorph;riemannian;isomorph;norm;prime;dirichlet;geometr;eigenvalu;parabol;

9 soil;seed;crop;forest;leaf;cultivar;seedl;shoot;ha;wheat;fruit;rice;fertil;germin;flower;irrig;land;veget;weed;season;dry;maiz;grain;agricultur;harvest;nitrogen;nutrient;farm;biomass;genotyp;transgen;pathogen;arabidopsi;grown;co2;inocul;pollen;pine;wood;potato;genom;drought;infect;fungi;dna;grass;barlei;cultiv;trait;farmer;

10 therapi;diabet;hospit;pain;nurs;cancer;physician;injuri;arteri;chronic;coronari;infect;renal;syndrom;symptom;men;mortal;hypertens;surgeri;serum;cardiac;muscl;pregnanc;cardiovascular;bone;infant;obes;insulin;ci;smoke;hiv;pulmonari;surgic;myocardi;questionnair;liver;elderli;rat;student;ventricular;fractur;metabol;intak;nutrit;hepat;transplant;lung;cholesterol;glucos;knee;

11 habitat;fish;forest;egg;genu;predat;sea;nest;season;larva;ecolog;reproduct;prei;bird;lake;sp;island;river;taxa;breed;veget;seed;mate;sediment;southern;diet;ecosystem;soil;marin;juvenil;biomass;phylogenet;forag;parasit;coastal;insect;larval;genera;ocean;winter;northern;fisheri;nutrient;summer;nov;eastern;landscap;coloni;atlant;flower;

12 film;alloi;laser;quantum;crystal;atom;ion;steel;thermal;beam;coat;glass;si;grain;spin;microstructur;silicon;nm;dope;scatter;powder;ceram;diffract;corros;fabric;dielectr;deform;lattic;excit;photon;emiss;microscopi;nonlinear;fiber;cu;fe;crack;voltag;neutron;anneal;sinter;ni;spectra;polym;sensor;semiconductor;spectroscopi;hydrogen;weld;oscil;


terms were extracted from the title, abstract and keywords of the individual documents

published in the journals. The top 50 terms of each cluster are presented in Table 1.

Through these terms, we could roughly grasp the topics covered in each cluster.

We obtain the following structure: biosciences (cluster #3), neurosciences and

psychology (#13 and #14), two clinical medicine clusters (#7 and #10), agriculture and

environment (#9), biology (#6 and #11), geosciences and space sciences (#2), chemistry

(#4), physics (#12), mathematics (#8), economics (#15) and one further social sciences

cluster (#5). The terms characterising cluster #1 confirm the observation in the Silhouette

plots (Fig. 3), as there are many terms from different fields without apparently consistent

characteristics. Therefore, cluster #1 is indeed a heterogeneous cluster of lesser quality.

Although the most relevant terms already provided a recognisable description for the

cognitive characteristics of each cluster, we noticed that there were considerable overlaps

between some pairs of clusters. We found several cluster pairs between which the share of

overlapping terms exceeds 20%: cluster #3 and #6, with terms related to Bioscience,

Biology and Biomedical Science; cluster #7 and #10, between which there are many

common terms representing Medical Science; cluster #5 and #15, which share many terms

focusing on Social Science.
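
The share of overlapping terms between two clusters can be computed along the following lines. This Python sketch is illustrative only: the paper does not define the share precisely, so dividing the number of common terms by the size of the shorter list is an assumption, and the two five-term lists are small excerpts from Table 1 rather than the full 50-term lists.

```python
def overlap_share(terms_a, terms_b):
    """Share of terms common to both lists, relative to the shorter list (assumed definition)."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Excerpts (not the full lists) from clusters #7 and #10 in Table 1.
cluster_7 = ["cancer", "tumor", "therapi", "surgeri", "arteri"]
cluster_10 = ["therapi", "diabet", "hospit", "pain", "cancer"]
print(overlap_share(cluster_7, cluster_10))
```

Because both full lists in Table 1 contain 50 terms, the choice of denominator does not matter there; it only matters for lists of unequal length.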

Studying the cognitive structure based on cross-citation cluster analysis

Studying the hierarchical and network structure of cross-citation clusters

The visualisation of the network structure based on cross-citation links among journals is

presented in Fig. 4 (visualized by Pajek; Batagelj and Mrvar 2002). For measuring the

citation link strength between clusters, a normalised similarity is applied based on formula

(1), where now i and j denote cluster i and cluster j. Here, intra-cluster ‘‘self-citations’’ are

counted only once. All clusters are represented by the three most relevant TF-IDF terms.

The overall structure presented in the network is in line with the architecture in the cluster

dendrogram (Fig. 1), and also agrees with the analysis of the most characteristic terms.

Across the whole network, the strongest link is found between clusters #3 and #10,

focussing on biosciences and clinical medical science, respectively. Some other obvious

links also stand out from the network, for instance, clusters #3 and #13, clusters #3 and #6,

clusters #4 and #12, clusters #7 and #10, and clusters #13 and #14. A distinct group related to the

Table 1 continued

13 neuron;brain;rat;receptor;cognit;motor;cortex;stroke;cerebr;sleep;neural;cortic;epilepsi;seizur;nerv;mice;injuri;stimulu;deficit;muscl;stimuli;lesion;neurolog;pain;dementia;syndrom;parkinson;eeg;symptom;alzheim;synapt;spinal;axon;ms;sensori;nervou;dopamin;hippocamp;neuropsycholog;schizophrenia;frontal;hippocampu;sclerosi;therapi;chronic;auditori;pd;cue;nucleu;emot;

14 student;school;psycholog;cognit;teacher;adolesc;mental;emot;child;anxieti;symptom;gender;psychiatr;abus;attitud;interview;skill;mother;disabl;sexual;item;alcohol;cope;teach;belief;violenc;word;schizophrenia;suicid;instruct;youth;profession;questionnair;english;classroom;peer;academ;men;therapi;discours;development;offend;client;aggress;verbal;speech;satisfact;mood;linguist;colleg;

15 price;firm;trade;economi;wage;incom;tax;capit;invest;monetari;labor;welfar;financi;bank;sector;employ;household;inflat;privat;game;incent;reform;polit;unemploy;worker;insur;stock;foreign;poverti;asset;forecast;labour;fiscal;export;profit;inequ;school;union;agricultur;shock;monei;earn;macroeconom;volatil;busi;job;currenc;panel;credit;alloc;


life and medical sciences, consisting of cluster #3, #10, #13, #6 and #7, contrasts strongly

with the rest of the structure, which is closely conjoint. Another interesting phenomenon is

found concerning cluster #13 (neuro- and behavioural science) and #14 (psychology),

where cluster #14 dissociates from the life science clusters and is, in fact, closer to the

social science group. This effect is apparently caused by the strong social orientation of the

subject categories psychology and psychiatry.

Detecting and comparing “important journals” of each cluster among the cross-citation network

Within the entire cross-citation structure, some journals play a significant role in the

communication network. These journals could be identified based on different indicators

considering the centrality and significance within the cluster. We compared three kinds of

journal rankings according to the indicators introduced at the outset of the paper.

The first indicator (cf. Eq. 1) is based on the high frequency of strong links with other

journals. This measure (SL) can be considered an indicator of ‘‘centrality’’. Journals

strongly linked with numerous other journals can therefore be regarded as important nodes

in the cross-citation network. We ranked all journals in each cluster according to their

number of ‘‘strong links’’. The top five journals in each cluster are presented in Table 2.

Unlike the first approach, citation entropy describes how far citations are spread

among other journals. This expresses diversity rather than “centrality”. While the first

approach reflects a strong influence from/on other representatives of a social network, the

second one merely measures the ‘‘diffusion’’ of contacts regardless of their intensity. Thus,

entropy reflects the ‘‘width’’, SL the ‘‘depth’’ of social contacts.

Table 3 shows the top journals with high entropies in each cluster. These journals are

not necessarily ‘‘active’’ nodes in the cross-citation network, but spread information over

and/or collect information from a variety of other entities.

Clearly, Tables 2 and 3 provide different types of information about the role of journals

in the cross-citation network. Consequently, the content of the two tables differs considerably.

Fig. 4 Network structure of cross-citation clusters represented by the three most relevant TF-IDF terms (thickness of lines is proportional to the strength of citation links, the size of the nodes is proportional to the number of journals)


Centrality according to the first indicator also implies a rather thematic focus, which is

why we find many general and multidisciplinary journals in Table 3 (e.g., Science, Nature,

Ann NY Acad Sci), while Table 2 is dominated by more specialised periodicals.

American Economic Review is the only journal that appears in both groups. This phenomenon

is in line with the results of Zhang et al. (2009). Therefore, “central” actors

contribute to forming coherent sub-clusters in the network, and act as “cores” in

these clusters. By contrast, the “high-entropy” journals are mainly active in extending their

communication network, and act as important nodes facilitating and spreading the

information flows across the whole cross-citation network.

Table 2 Top five journals of each cluster based on the number of strong links

Cluster 1 Cluster 2 Cluster 3

1. LECT NOTES COMPUT SC 1. EARTH PLANET SC LETT 1. J AM ACAD DERMATOL

2. OPHTHALMOLOGY 2. TECTONOPHYSICS 2. J IMMUNOL

3. AM J OPHTHALMOL 3. PALAEOGEOGR PALAEOCL 3. BLOOD

4. IEEE J SEL AREA COMM 4. LITHOS 4. J BIOL CHEM

5. COLUMBIA LAW REV 5. PRECAMBRIAN RES 5. ARTH RHEUM/AR C RES

Cluster 4 Cluster 5 Cluster 6

1. INT J HEAT MASS TRAN 1. ENVIRON PLANN A 1. APPL ENVIRON MICROB

2. INORG CHEM 2. PROG HUM GEOG 2. J AM VET MED ASSOC

3. J AM CHEM SOC 3. AM J POLIT SCI 3. J CLIN MICROBIOL

4. TETRAHEDRON LETT 4. J POLIT 4. BIOL REPROD

5. ANAL CHEM 5. URBAN STUD 5. AIDS

Cluster 7 Cluster 8 Cluster 9

1. LARYNGOSCOPE 1. J DIFFER EQUATIONS 1. PLANT PHYSIOL

2. J NEUROSURG 2. J MATH ANAL APPL 2. SOIL SCI SOC AM J

3. RADIOLOGY 3. INVENT MATH 3. FOREST ECOL MANAG

4. AM J ROENTGENOL 4. J ALGEBRA 4. PLANT J

5. J NUCL MED 5. DUKE MATH J 5. THEOR APPL GENET

Cluster 10 Cluster 11 Cluster 12

1. J BONE JOINT SURG AM 1. MAR ECOL-PROG SER 1. PHYS REV D

2. J UROLOGY 2. AQUACULTURE 2. PHYS LETT B

3. CIRCULATION 3. J ECON ENTOMOL 3. PHYS REV B

4. KIDNEY INT 4. J GEOPHYSICAL RES-OCEANS 4. PHYS REV LETT

5. AM J SPORT MED 5. J PHYS OCEANOGR 5. APPL PHYS LETT

Cluster 13 Cluster 14 Cluster 15

1. J NEUROSCI 1. J PERS SOC PSYCHOL 1. AM ECON REV

2. NEUROLOGY 2. CHILD DEV 2. AM J AGR ECON

3. NEUROIMAGE 3. PERS SOC PSYCHOL B 3. ECONOMETRICA

4. NEUROPSYCHOLOGIA 4. J EDUC PSYCHOL 4. J MONETARY ECON

5. VISION RES 5. EXCEPT CHILDREN 5. J INT ECON


Google’s PageRank algorithm (Brin and Page 1998) computes the status of a Web page

based on a combination of the number of hyperlinks that point to the page and the status of

the pages that the hyperlinks originate from. By taking into account both the popularity and

the prestige factor of status, Google avoided assigning high ranks to popular but otherwise

not-relevant Web pages (Bollen et al. 2006). In this study, we applied a modified version of

the PageRank algorithm, in which the number of citations is taken into account, normalized by

the number of published papers (see Eq. 3). We re-ranked all the journals in each cluster

according to their modified PageRank values and present the most representative journals

in Table 4.

Table 3 Top five journals of each cluster based on citation entropy

Cluster 1 Cluster 2 Cluster 3

1. ZOOTAXA 1. PROG NAT SCI 1. SCIENCE

2. P IEEE 2. CHINESE SCI BULL 2. BRAZ J MED BIOL RES

3. J ENVIRON MANAGE 3. CURR SCI INDIA 3. NATURE

4. COMPUT METH PROG BIO 4. EARTH-SCI REV 4. MED HYPOTHESES

5. CRIT REV ORAL BIOL M 5. APPL GEOCHEM 5. MICROSC RES TECHNIQ

Cluster 4 Cluster 5 Cluster 6

1. TOXICOL LETT 1. SOC SCI MED 1. TRENDS BIOTECHNOL

2. TOXICOLOGY 2. TRANSPORT RES REC 2. APMIS

3. J PHARM PHARMACOL 3. AM BEHAV SCI 3. CURR OPIN BIOTECH

4. ANAL BIOCHEM 4. ANNU REV SOCIOL 4. PATHOL BIOL

5. J BIOCHEM BIOPH METH 5. ANNU REV ANTHROPOL 5. J BIOTECHNOL

Cluster 7 Cluster 8 Cluster 9

1. J CLIN PATHOL 1. LECT NOTES MATH 1. AFR J BIOTECHNOL

2. LANCET ONCOL 2. B AM MATH SOC 2. TRANSGENIC RES

3. ANTICANCER RES 3. CR MATH 3. CRIT REV PLANT SCI

4. CANCER 4. ACTA APPL MATH 4. ADV AGRON

5. EUR J CANCER 5. T AM MATH SOC 5. CAN J BOT

Cluster 10 Cluster 11 Cluster 12

1. LANCET 1. AM MUS NOVIT 1. J MICROSC-OXFORD

2. NEW ENGL J MED 2. NATURWISSENSCHAFTEN 2. J CENT SOUTH UNIV T

3. JAMA-J AM MED ASSOC 3. BIOL REV 3. MAT SCI ENG C-BIO S

4. J KOREAN MED SCI 4. COMP BIOCHEM PHYS B 4. CURR OPIN SOLID ST M

5. ARCH MED RES 5. COMP BIOCHEM PHYS A 5. J PHYS IV

Cluster 13 Cluster 14 Cluster 15

1. ANN NY ACAD SCI 1. J PSYCHOSOM RES 1. J ECON PERSPECT

2. LIFE SCI 2. ANNU REV PSYCHOL 2. J ECON LIT

3. CELL TISSUE RES 3. PSYCHOL BULL 3. WORLD DEV

4. NEUROSCI LETT 4. AM PSYCHOL 4. AM ECON REV

5. EUR J PHARMACOL 5. PSYCHOSOM MED 5. ECON J


The PageRank expresses a third aspect of social contacts. In particular, it adds the

dimension of ‘‘prominence’’ to the picture. Frequent contacts with prominent members of

the network increase the score. Therefore, we expect results different from the previous

ones. However, we also found some coincidence between the top journals based on PageRank and the other two groups of important journals. In principle, there is no large overlap among the three groups, but coincidences could be revealed in several clusters. “Central” and “high-PageRank” journals overlap, for instance, in clusters #5, #6, #8, #9, #10 and #15; “high-entropy” and “high-PageRank” journals overlap in clusters #2, #5, #10, #14 and #15.

Table 4 Top five journals of each cluster based on a modified PageRank algorithm

Cluster 1: 1. ACM COMPUT SURV; 2. ADMIN SCI QUART; 3. J FINANC; 4. MIS QUART; 5. ACAD MANAGE J
Cluster 2: 1. REV MINERAL GEOCHEM; 2. EARTH-SCI REV; 3. CONTRIB MINERAL PETR; 4. J PETROL; 5. GEOLOGY
Cluster 3: 1. ANNU REV IMMUNOL; 2. NAT REV IMMUNOL; 3. NAT IMMUNOL; 4. CELL; 5. ANNU REV BIOCHEM
Cluster 4: 1. CHEM REV; 2. PROG POLYM SCI; 3. PROG ENERG COMBUST; 4. ANNU REV FLUID MECH; 5. ACCOUNTS CHEM RES
Cluster 5: 1. AM POLIT SCI REV; 2. AM SOCIOL REV; 3. WORLD POLIT; 4. ANNU REV SOCIOL; 5. AM J POLIT SCI
Cluster 6: 1. CLIN MICROBIOL REV; 2. MICROBIOL MOL BIOL R; 3. ANNU REV MICROBIOL; 4. AIDS; 5. J ACQ IMMUN DEF SYND
Cluster 7: 1. CA-CANCER J CLIN; 2. J NATL CANCER I; 3. J CLIN ONCOL; 4. SEMIN NUCL MED; 5. ANN SURG ONCOL
Cluster 8: 1. J AM MATH SOC; 2. ACTA MATH-DJURSHOLM; 3. ANN MATH; 4. INVENT MATH; 5. DUKE MATH J
Cluster 9: 1. ANNU REV PLANT BIOL; 2. PLANT CELL; 3. CURR OPIN PLANT BIOL; 4. PLANT J; 5. TRENDS PLANT SCI
Cluster 10: 1. CIRCULATION; 2. JAMA-J AM MED ASSOC; 3. MILBANK Q; 4. J AM COLL CARDIOL; 5. NEW ENGL J MED
Cluster 11: 1. ANNU REV ECOL EVOL S; 2. SYSTEMATIC BIOL; 3. ANNU REV ENTOMOL; 4. OCEANOGR MAR BIOL; 5. TRENDS ECOL EVOL
Cluster 12: 1. REV MOD PHYS; 2. MAT SCI ENG R; 3. ANNU REV NUCL PART S; 4. PHYS REP; 5. NAT MATER
Cluster 13: 1. ANNU REV NEUROSCI; 2. NAT REV NEUROSCI; 3. NEURON; 4. NAT NEUROSCI; 5. PROG NEUROBIOL
Cluster 14: 1. ANNU REV PSYCHOL; 2. PSYCHOL BULL; 3. REV EDUC RES; 4. PSYCHOL REV; 5. AM EDUC RES J
Cluster 15: 1. Q J ECON; 2. J ECON LIT; 3. J POLIT ECON; 4. ECONOMETRICA; 5. REV ECON STUD


Comparison of each cluster based on different communication characteristics

Besides identifying ‘‘central’’ journals, strong links also reflect the affinities between

journals in each cluster of the communication network. Table 5 compares the ratios of

strong links within each cluster (SL means strong links). The ratio of strong links in cluster

i is calculated as the number of strong links inside cluster i divided by the total number of

possible symmetric links within the same cluster. The highest ratio of strong links is found

in cluster #2 (Geosciences & Space sciences), and the lowest is in cluster #3 (Biosciences).

The three clusters #2, #15 and #8 appearing at the top in Table 5 are also among the ‘‘best’’

clusters according to their Silhouette values (cf. Fig. 3). The two social science clusters are

interesting cases, as the ratio of strong links is much higher in cluster #15 (economics) than

in #5 (social and political sciences).
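As a worked illustration of the ratio defined above, the following sketch assumes that the number of possible symmetric links among n journals is n(n - 1)/2 and that the journal counts listed in Table 6 are the ones underlying Table 5. Under these assumptions, the published ratios of clusters #2 and #15 are reproduced.

```python
def strong_link_ratio(n_strong_links, n_journals):
    """Ratio of observed strong links to all possible symmetric links in a cluster."""
    if n_journals < 2:
        raise ValueError("a cluster needs at least two journals to contain a link")
    # Each unordered pair of journals can hold one symmetric link.
    possible_links = n_journals * (n_journals - 1) // 2
    return n_strong_links / possible_links


# Cluster #2: 207 strong links (Table 5), 162 journals (Table 6).
ratio_cluster_2 = strong_link_ratio(207, 162)
# Cluster #15: 136 strong links (Table 5), 173 journals (Table 6).
ratio_cluster_15 = strong_link_ratio(136, 173)

print(round(ratio_cluster_2, 4))   # 0.0159
print(round(ratio_cluster_15, 4))  # 0.0091
```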

As the clustering algorithm is based on cross-citation similarities, it is quite plausible

that most of the strong links can be found within the clusters. However, we also find more

than 9% of all strong links among different clusters, which could be called ‘‘foreign strong

links’’. These links indicate the information flow among individual journals that are

assigned to different categories. The analysis of ‘‘foreign strong links’’ could help us to

detect the journals which are important nodes in the communication network for

exchanging information among different clusters, since they can be considered the interfaces between different subject categories. As extreme cases, the American Journal of Physical Anthropology, which is assigned to cluster #3 (Biosciences), has nine strong links in total, five of which are “foreign strong links” with journals in cluster #5 (Social sciences), cluster #10 (Clinical medicine) and cluster #11 (Biology). The Journal of Medicinal Chemistry, which belongs to cluster #4 (Chemistry), has seven strong links, five of which are with journals in cluster #3 (Biosciences). These examples reveal aspects of the interdisciplinarity of journals in these categories. We found 12 strong links between journals in

Table 5 Comparison of strong links within each cluster

Cluster #   Number of SL   Ratio of SL
2           207            0.0159
15          136            0.0091
8           135            0.0081
9           221            0.0073
7           249            0.0069
6           304            0.0048
11          381            0.0045
13          170            0.0035
12          423            0.0034
14          522            0.0024
4           834            0.0019
10          612            0.0015
5           248            0.0014
1           1807           0.0012
3           339            0.0010

Table 6 Average citation entropy of journals in each cluster

Cluster #   Number of journals   Average entropy
3           830                  7.50
13          308                  7.23
10          891                  7.08
7           269                  7.02
6           356                  6.64
4           925                  6.22
14          586                  6.19
11          412                  6.14
8           183                  6.10
15          173                  5.93
9           246                  5.88
12          498                  5.84
2           162                  5.65
1           1542                 5.38
5           442                  5.37


cluster #5 (Social sciences) and cluster #3 (Biosciences), which might indicate the

underlying trend of information flows between the corresponding fields of science and

social sciences.
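The detection of “foreign strong links” described above can be sketched as follows. Given a list of strong links and the cluster assignment of each journal, the sketch returns the share of links that cross cluster borders and the number of foreign links per journal. The journal names, links and assignments are hypothetical toy data.

```python
from collections import Counter


def foreign_link_stats(strong_links, cluster_of):
    """Return the share of foreign strong links and their count per journal.

    strong_links: list of (journal_a, journal_b) pairs, one per symmetric link.
    cluster_of: dict mapping each journal to its cluster number.
    """
    # A link is "foreign" when its two journals sit in different clusters.
    foreign = [(a, b) for a, b in strong_links if cluster_of[a] != cluster_of[b]]
    share = len(foreign) / len(strong_links) if strong_links else 0.0

    per_journal = Counter()
    for a, b in foreign:
        per_journal[a] += 1
        per_journal[b] += 1
    return share, per_journal


# Hypothetical toy data: four journals in two clusters.
toy_links = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "D")]
toy_clusters = {"A": 3, "B": 3, "C": 5, "D": 5}
toy_share, toy_foreign_counts = foreign_link_stats(toy_links, toy_clusters)
print(toy_share, dict(toy_foreign_counts))
```

Journals with a high foreign count relative to their total number of strong links are the candidates for interfaces between subject categories.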

In Table 6, we present the average entropy of journals in each cluster, which shows the

overall degree of information spreading in each cluster. Cluster #3 (Biosciences) has the highest average entropy but one of the lowest SL ratios, which again illustrates that these two measures can be considered two sides of the same coin: its links are far-ranging, but most of these relationships are rather loose. In terms of average entropy, it is followed by the neuroscience and psychology cluster (#13), two clinical medicine clusters (#10 and #7) and the biology cluster (#6). On the other hand, cluster #2 (Geosciences & Space sciences) can be regarded as the opposite case: its links are not far-ranging but of relatively strong affinity. The comparison between rankings in

Tables 5 and 6 thus indicates the diversity of communication characteristics of each

cluster.
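For illustration, averaging an entropy measure per cluster could look like the sketch below. The citation entropy used in this study is defined earlier in the paper; here it is merely assumed to be the Shannon entropy (in bits) of the distribution of a journal's cross-citation links over its partner journals. The base of the logarithm and the toy data are assumptions.

```python
import math
from collections import defaultdict


def citation_entropy(link_counts):
    """Shannon entropy (base 2, an assumption) of a journal's link distribution."""
    total = sum(link_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in link_counts if c > 0)


def average_entropy_by_cluster(entropy_of, cluster_of):
    """Average the journal entropies within each cluster."""
    grouped = defaultdict(list)
    for journal, value in entropy_of.items():
        grouped[cluster_of[journal]].append(value)
    return {cluster: sum(values) / len(values) for cluster, values in grouped.items()}


# Hypothetical toy data: link counts of three journals with their partners.
toy_link_counts = {"A": [1, 1, 1, 1], "B": [8, 0, 0, 0], "C": [2, 2]}
toy_assignment = {"A": 3, "B": 2, "C": 2}
toy_entropies = {j: citation_entropy(c) for j, c in toy_link_counts.items()}
toy_averages = average_entropy_by_cluster(toy_entropies, toy_assignment)
print(toy_averages)
```

A journal whose links are spread evenly over many partners (journal A) has a high entropy, whereas a journal citing a single partner (journal B) has an entropy of zero.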

Comparison of cluster structure and existing ‘‘intellectual’’ classification

The above-mentioned SOOI classification scheme comprises 15 major fields, including 12

science fields and three fields in the social sciences and humanities (see Table 7). This

scheme is updated annually so that the comparison between the cross-citation clustering

and SOOI fields is based on valid assignments and settings. The concordance between the

15 major fields and the cross-citation clusters hereby allows a direct comparison of the two

systems. The results of the cluster analysis may help to improve and fine-tune the

classification scheme developed at SOOI.

We first visualize the cross-citation network among SOOI fields using Pajek (Fig. 5).

The abbreviations of SOOI fields in Fig. 5 are as follows. AGRI = Agriculture & Envi-

ronment; BIOL = Biology; BIOS = Biosciences; BIOM = Biomedical research; CLI1 =

Clinical and experimental medicine I; CLI2 = Clinical and experimental medicine II;

NEUR = Neuroscience & Behaviour; CHEM = Chemistry; PHYS = Physics; GEOS =

Table 7 SOOI 15-field subject classification scheme

Field #   SOOI field
1         AGRICULTURE & ENVIRONMENT
2         BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL)
3         BIOSCIENCES (GENERAL, CELLULAR & SUBCELLULAR BIOLOGY; GENETICS)
4         BIOMEDICAL RESEARCH
5         CLINICAL AND EXPERIMENTAL MEDICINE I (GENERAL & INTERNAL MEDICINE)
6         CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES)
7         NEUROSCIENCE & BEHAVIOR
8         CHEMISTRY
9         PHYSICS
10        GEOSCIENCES & SPACE SCIENCES
11        ENGINEERING
12        MATHEMATICS
13        SOCIAL SCIENCES I (GENERAL, REGIONAL & COMMUNITY ISSUES)
14        SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)
15        ARTS & HUMANITIES


Geosciences & Space sciences; ENGN = Engineering; MATH = Mathematics; SOC1 =

Social sciences I; SOC2 = Social sciences II; HUMA = Arts & Humanities. Since the SOOI scheme is not a partition, i.e., some journals are assigned to more than one field, more cross-citations (thicker lines) can be observed among fields (see Fig. 5) than among clusters (cf. Fig. 4). The three fields related to social sciences and humanities are relatively separated from the rest of the fields. Nevertheless, “Social sciences I” has more and closer links with some of the bioscience, neuroscience and medical science fields.

Fig. 5 Cross-citation networks among SOOI fields represented by the three most important TF-IDF terms

Fig. 6 Three-dimensional MDS map visualising distances between the centroids of the 15 clusters and 15 SOOI fields


Another comparison between the cluster structure and SOOI classification scheme is

based on the centroids of the two systems. The centroid of a cluster or field is defined as the

linear combination of all documents in it and is thus a vector in the same vector space. For

each cluster and each field, the centroid is calculated and the MDS of pairwise distances

between all centroids is shown in Fig. 6. This map gives a first impression of the similarities and differences between the nodes of the two schemes.
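The centroid comparison can be sketched as follows. The sketch takes the centroid to be the arithmetic mean of the member vectors (one particular linear combination), uses Euclidean distances between centroids, and embeds them with classical (Torgerson) MDS. All three choices, as well as the toy vectors and group names, are assumptions for illustration; the distance measure and MDS variant actually used for Fig. 6 are not specified in this section.

```python
import numpy as np


def centroid(vectors):
    """Mean vector of all members of a cluster or field."""
    return np.asarray(vectors, dtype=float).mean(axis=0)


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(dist, n_components=3):
    """Classical MDS: embed a distance matrix into n_components dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centred squared distances give the Gram matrix of the embedding.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical toy data: two clusters and two fields, two vectors each.
toy_groups = {
    "cluster_a": [[1, 0, 0, 2], [3, 0, 0, 0]],
    "cluster_b": [[0, 2, 1, 0], [0, 4, 1, 0]],
    "field_x": [[1, 1, 0, 1], [1, 1, 2, 1]],
    "field_y": [[0, 0, 5, 1], [2, 0, 3, 1]],
}
toy_centroids = np.array([centroid(v) for v in toy_groups.values()])
toy_dist = pairwise_distances(toy_centroids)
toy_coords = classical_mds(toy_dist, n_components=3)
print(toy_coords)
```

With only four centroids, three dimensions reproduce the original distances exactly; with 30 centroids, as in Fig. 6, a three-dimensional map is necessarily an approximation.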

Furthermore, we used the Jaccard index, i.e., the ratio of the cardinality of the inter-

section of two sets and the cardinality of their union, to compare each cluster with each

field. The best matching field for each cluster is presented in Table 8 and the best matching

cluster for each field can be found in Table 9. “Strong” matches between fields and clusters are found, for instance, between cluster #4 and the SOOI field Chemistry, and between cluster #12 and Physics. However, not all fields and clusters could be matched uniquely; the multiple assignments in the SOOI system might be an important factor here.
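The Jaccard-based matching can be sketched in a few lines. Clusters and fields are represented as sets of journals, and the best matching field of a cluster is the one with the highest Jaccard index. The journal identifiers and set contents below are hypothetical.

```python
def jaccard(a, b):
    """Cardinality of the intersection divided by the cardinality of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(target, candidates):
    """Name of the candidate set with the highest Jaccard index to `target`."""
    return max(candidates, key=lambda name: jaccard(target, candidates[name]))


# Hypothetical toy data: one cluster compared with two fields.
toy_cluster = {"J1", "J2", "J3", "J4"}
toy_fields = {
    "CHEMISTRY": {"J1", "J2", "J3", "J5"},
    "PHYSICS": {"J4", "J6"},
}
toy_best_field = best_match(toy_cluster, toy_fields)
print(toy_best_field, jaccard(toy_cluster, toy_fields["CHEMISTRY"]))
```

Running the same function with the roles of clusters and fields exchanged yields the best matching cluster for each field; the two directions need not agree, which is why Tables 8 and 9 are reported separately.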

On the basis of these matches, we found that some SOOI fields tend to fall apart. For instance, Biology (Organismic & supraorganismic level) splits up into clusters #6 and #11. The cross-citation network of the clusters shows that, although both clusters #6 and #11 have Biology as their best matching field, they are

relatively separated in the network. Cluster #6 has very strong links with cluster #3, which

is actually the biosciences cluster. Further combined with the analysis of the most related

terms, most of the top TF-IDF terms in cluster #6 are related to Microbiology, Molecular

Biology or Biosciences; while terms in cluster #11 represent Biology. The second field that

seems to fall apart by clustering is Clinical and experimental medicine II (Non-internal

medicine specialties), which actually splits into cluster #7 and cluster #10. In the cross-

citation network, these two clusters are strongly inter-linked. We also observed that the

strongest link in the network was established between cluster #10 (a medical cluster) and

cluster #3 (the bioscience cluster). This result indicates that Clinical and experimental

Table 8 Best matching SOOI field for each cluster (based on Jaccard index)

Cluster SOOI Field

9 AGRICULTURE & ENVIRONMENT

6 BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL)

11 BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL)

3 BIOSCIENCES (GENERAL, CELLULAR & SUBCELLULAR BIOLOGY; GENETICS)

4 CHEMISTRY

7 CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES)

10 CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES)

1 ENGINEERING

2 GEOSCIENCES & SPACE SCIENCES

8 MATHEMATICS

13 NEUROSCIENCE & BEHAVIOR

14 NEUROSCIENCE & BEHAVIOR

12 PHYSICS

5 SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)

15 SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)


medicine II (Non-internal medicine specialties) tends to split into two clusters, one of

which has strong affinity with Biosciences. Similarly, Neuroscience & Behavior tends to

split into clusters #14 and #13. The former is more closely related to the “social sciences” and the latter to the “sciences”.

On the other hand, based on the best matches in Table 9, we observed that some fields tend to merge. For instance, Biosciences (general, cellular and subcellular biology; genetics) and Biomedical research merge into one cluster (#3); the two fields of Clinical and experimental medicine merge into cluster #10; and the three fields in the social sciences and humanities tend to merge into cluster #5. However, these tendencies are not clear in the cross-citation network of SOOI fields (Fig. 5), where only the two Clinical and experimental medicine fields are strongly interlinked, while the other two groups of merging fields are rather loosely connected.

Detecting “migration” of journals and improving the “intellectual” classification schemes

In the previous part, based on the Jaccard index, we determined for each SOOI field the cluster

that best matches the field. More than 40% of the journals are not assigned to the cluster

which best matches their SOOI field. We call these journals migrated journals. We expect

that the analysis of journal ‘‘migration’’ helps improve our ‘‘intellectual’’ classification

scheme. The large share of ‘‘migrations’’ may be due to several reasons. First, there are

different bases for the classification. The SOOI scheme is based on the classification of ISI

subject categories. By contrast, our clustering results are generated from the chosen

algorithm and are based on cross-citation similarities only. Second, by our definition, “migrations” are detected on the basis of the “best matching” result, but the matching between SOOI and the clustering is not always clear, especially because the clustering is a partition while SOOI allows multiple assignments.

Table 9 Best matching cluster for each SOOI field (based on Jaccard index)

SOOI field Cluster#

AGRICULTURE & ENVIRONMENT 9

BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL) 11

BIOSCIENCES (GENERAL, CELLULAR & SUBCELLULAR BIOLOGY; GENETICS) 3

BIOMEDICAL RESEARCH 3

CLINICAL AND EXPERIMENTAL MEDICINE I (GENERAL & INTERNAL MEDICINE) 10

CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES) 10

NEUROSCIENCE & BEHAVIOR 14

CHEMISTRY 4

PHYSICS 12

GEOSCIENCES & SPACE SCIENCES 2

ENGINEERING 1

MATHEMATICS 8

SOCIAL SCIENCES I (GENERAL, REGIONAL & COMMUNITY ISSUES) 5

SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES) 5

ARTS & HUMANITIES 5


In the SOOI classification, more than 70% of journals have a single assignment. For these journals, migration is readily determined. For those with multiple assignments, migration is determined if the journal has been assigned to a cluster which is not the best matching cluster for any of its SOOI fields. Table 10 presents the 10 strongest migration patterns based on the journals which have a single assignment in SOOI fields. It is not surprising that there is a strong emigration from different SOOI fields to cluster #1, which is the most heterogeneous cluster. Apart from migrations to cluster #1, several other groups of remarkable emigration have drawn our attention, e.g., from Chemistry to cluster #12 (Physics), from Clinical and experimental medicine I to cluster #3 (Biosciences) and from Social sciences I to cluster #14 (Neuroscience & Behaviour). These distinct migration patterns

indicate possible adjustment or improvement of journal assignments.
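The migration rule stated above can be written down compactly. The sketch below encodes the best matching clusters of Table 9, keyed by the field abbreviations introduced for Fig. 5, and flags a journal as migrated when its cluster is not the best match of any of its SOOI fields. The function name and the example journals are hypothetical.

```python
# Best matching cluster for each SOOI field, as listed in Table 9.
BEST_CLUSTER = {
    "AGRI": 9, "BIOL": 11, "BIOS": 3, "BIOM": 3, "CLI1": 10,
    "CLI2": 10, "NEUR": 14, "CHEM": 4, "PHYS": 12, "GEOS": 2,
    "ENGN": 1, "MATH": 8, "SOC1": 5, "SOC2": 5, "HUMA": 5,
}


def is_migrated(assigned_cluster, sooi_fields, best_cluster=BEST_CLUSTER):
    """True if the cluster is not the best match of any of the journal's fields."""
    if not sooi_fields:
        raise ValueError("a journal needs at least one SOOI field")
    return all(best_cluster[field] != assigned_cluster for field in sooi_fields)


# Hypothetical journals: (assigned cluster, SOOI fields).
print(is_migrated(12, ["CHEM"]))          # single assignment, moved to cluster #12
print(is_migrated(4, ["CHEM", "PHYS"]))   # stays with one of its best matches
print(is_migrated(3, ["CHEM", "PHYS"]))   # matches neither field's best cluster
```

For journals with a single assignment the rule reduces to a direct comparison, which is the case underlying Table 10.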

Nevertheless, not all “migrations” can be used for the improvement of the reference classification system, since we distinguish between “good migration” and “bad migration”. “Good migration” is observed if the goodness of the unit’s classification improves and, based on its title and scope, the journal should clearly be assigned to the cluster to which it has moved. Otherwise we call it “bad migration” (Janssens et al. 2009). According to the scope of the journals concerned, bad migrations are not convincing, although in the cross-citation clustering these journals indeed show more affinity with a different field. On the one hand, this phenomenon might reflect some strong information flow between different subject fields; on the other hand, it may also indicate a possible change in the profile of these journals. However, “bad migrations” may also just be “poor results” generated by the clustering algorithm. Therefore, we propose to always combine quantitative approaches, e.g., the clustering analysis, with the “intellectual” assessment to adjust and improve the existing classification system.

The distinction between good and bad migrations is important for validating and

adjusting the “intellectual” subject classification. In future work, we will deepen the “migration” analysis through a hybrid clustering algorithm that combines citation and text information. Based on the hybrid clustering, good and bad migrations could be further validated and confirmed, and the efficiency of the classification could thus be improved further.

Table 10 Top 10 strongest migration patterns (for single assignment in SOOI fields)

From SOOI field                                                             To cluster   Number of migrated journals
SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)                          1            195
ARTS & HUMANITIES                                                           1            194
BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL)                                6            190
MATHEMATICS                                                                 1            144
CHEMISTRY                                                                   12           138
GEOSCIENCES & SPACE SCIENCES                                                1            127
CLINICAL AND EXPERIMENTAL MEDICINE I (GENERAL & INTERNAL MEDICINE)          3            122
SOCIAL SCIENCES I (GENERAL, REGIONAL & COMMUNITY ISSUES)                    14           119
CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES)   7            117
SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)                          15           103


Conclusions and discussions

The hard-clustering algorithm for the journal cross-citation analysis provides important

information for studying the cognitive structure of the journal cross-citation network. The

text mining technique allows “labelling” the clusters, making their cognitive characteristics visible. Through the cross-citation analysis, we have identified three groups of representative journals in each cluster, namely, “central journals” with the strongest links, journals with the highest entropy, and “top” journals according to the PageRank algorithm. These journals play an important part as nodes in the communication network, but with different functions, and there is a clear divergence between centrality and high entropy.

A generic classification scheme as well as a global map of science has been generated

based on the cross-citation clustering analysis. This could help better understand the

cognitive base of scientific knowledge, and adjust the existing delineation of science

subject fields. The comparison of the clustering result with the ‘‘intellectual’’ classification

may reflect the dynamic development of journals, and may also reveal the underlying

convergence and emergence of new fields. The clustering approach has some obvious advantages: among others, the basis of the classification is quantitatively stable and testable, and the classification result is generated automatically by the clustering algorithm, which objectively reflects the underlying affinities among journals. However, we should also mention the weakness of hard clustering: subjects often overlap, and there are sometimes “poor clusters” as well. For the final classification, we should therefore

combine the results with the ‘‘intellectual’’ approach.

Acknowledgements The research was supported by Steunpunt O&O Indicatoren of the Flemish Government, the National Natural Science Foundation of China (grant no. 70673019), and the China Scholarship Council. We thank Bart Thijs for his assistance in collecting and processing data.

References

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Cambridge: Addison-Wesley.

Bassecoulard, E., & Zitt, M. (1999). Indicators in a research institute: A multi-level classification of journals. Scientometrics, 44(3), 323–345.

Batagelj, V., & Mrvar, A. (2002). Pajek: Analysis and visualization of large networks. Graph Drawing, 2265, 477–478. (ISSN).

Bollen, J., Rodriguez, M. A., & Van De Sompel, H. (2006). Journal status. Scientometrics, 69(3), 669–687.

Borner, K., Chen, C., & Boyack, K. W. (2003). Visualizing knowledge domains. Annual Review of Information Science and Technology, 37, 179–255.

Boyack, K. W., Klavans, R., & Borner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.

Braam, R. R., Moed, H. F., & Van Raan, A. F. J. (1991). Mapping science by combined co-citation and word analysis 1: Structural aspects. Journal of the American Society for Information Science, 42(4), 233–251.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.

Carpenter, M. P., & Narin, F. (1973). Clustering of scientific journals. Journal of the American Society for Information Science, 24(6), 425–436.

Ding, Y., Chowdhury, G., & Foo, S. (2000). Journal as markers of intellectual space: Journal co-citation analysis of information retrieval area, 1987–1997. Scientometrics, 47(1), 55–73.

Garfield, E. (1998). Mapping the world of science. The 150 Anniversary Meeting of the AAAS, Philadelphia, PA, http://www.garfield.library.upenn.edu/papers/mapsciworld.html.


Glanzel, W., & Schubert, A. (2003). A new classification scheme of science fields and subfields designed for scientometric evaluation purposes. Scientometrics, 56(3), 357–367.

Hatcher, E., & Gospodnetic, O. (2004). Lucene in action. New York: Manning Publications Co.

Hicks, D. (1987). Limitations of co-citation analysis as a tool for science policy. Social Studies of Science, 17(2), 295–316.

Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall.

Janssens, F. (2007). Clustering of scientific fields by integrating text mining and bibliometrics. Ph.D. Thesis, Faculty of Engineering, Katholieke Universiteit Leuven, Belgium, http://hdl.handle.net/1979/847.

Janssens, F., Glanzel, W., & De Moor, B. (2008). A hybrid mapping of information science. Scientometrics, 75(3), 607–631.

Janssens, F., Lin, Z., & Glanzel, W. (2009). Hybrid clustering for validation and improvement of subject-classification schemes. Information Processing & Management, 45(6), 683–702.

Jarneving, B. (2005). The combined application of bibliographic coupling and the complete link cluster method in bibliometric science mapping. Ph.D. Thesis, University College of Boras/Goteborg University, Sweden.

Leydesdorff, L. (2002). Dynamic and evolutionary updates of classificatory schemes in scientific journal structures. Journal of the American Society for Information Science and Technology, 53(12), 987–994.

Leydesdorff, L. (2004a). Clusters and maps of science journals based on bi-connected graphs in the Journal Citation Reports. Journal of Documentation, 60(4), 371–427.

Leydesdorff, L. (2004b). Top-down decomposition of the Journal Citation Report of the Social Science Citation Index: Graph- and factor-analytical approaches. Scientometrics, 60(2), 159–180.

Leydesdorff, L. (2006). Can scientific journals be classified in terms of aggregated journal-journal citation relations using the Journal Citation Reports? Journal of the American Society for Information Science & Technology, 57(5), 601–613.

Leydesdorff, L., & Rafols, I. (2008). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2), 348–362.

McCain, K. W. (1991). Mapping economics through the journal literature: An experiment in journal co-citation analysis. Journal of the American Society for Information Science, 42(4), 290–296.

McCain, K. W. (1998). Neural networks research in context: A longitudinal journal co-citation analysis of an emerging interdisciplinary field. Scientometrics, 41(3), 389–410.

Morris, T. A., & McCain, K. W. (1998). The structure of medical informatics journal literature. Journal of the American Medical Informatics Association, 5(5), 448–466.

Narin, F. (1976). Evaluative bibliometrics: The use of publication and citation analysis in the evaluation of scientific activity. Washington, DC: National Science Foundation.

Narin, F., Carpenter, M., & Berlt, N. C. (1972). Interrelationships of scientific journals. Journal of the American Society for Information Science, 23(5), 323–331.

Price, D. J. D. (1965). Networks of scientific papers. Science, 149(3683), 510–515.

Rousseeuw, P. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65.

Tsay, M. Y., Xu, H., & Wu, C. W. (2003). Journal co-citation analysis of semiconductor literature. Scientometrics, 57(1), 7–25.

Zhang, L., Glanzel, W., & Liang, L. M. (2009). Tracing the role of individual journals in a cross-citation network based on different indicators. Scientometrics, 81(3), 821–838.
