Journal cross-citation analysis for validation and improvement of journal-based subject classification in bibliometric research
Lin Zhang • Frizo Janssens • Liming Liang • Wolfgang Glanzel
Received: 11 May 2009 / Published online: 17 February 2010
© Akadémiai Kiadó, Budapest, Hungary 2010
Abstract The objective of this study is to use a clustering algorithm based on journal
cross-citation to validate and to improve the journal-based subject classification schemes.
The cognitive structure based on the clustering is visualized by the journal cross-citation
network and three kinds of representative journals in each cluster among the communi-
cation network have been detected and analyzed. As an existing reference system the
15-field subject classification by Glanzel and Schubert (Scientometrics 56:55–73, 2003)
has been compared with the clustering structure.
Keywords Journal cross-citation • Cluster analysis • Mapping of science • Subject classification
Introduction
Among the scientific aggregations, journals play a vital role in the spread of information
within and between disciplines. As early as in the 1960s, Price (1965) suggested that
journals would be the appropriate units of analysis and that aggregated mutual citations
among journals might reveal the disciplinary and finer-grained delineations. There is an
L. Zhang • F. Janssens • W. Glanzel (✉)
Centre for R&D Monitoring, Department of MSI, K. U. Leuven, Leuven, Belgium
e-mail: [email protected]
L. Zhang • L. Liang
WISE Lab, Dalian University of Technology, Dalian, China
F. Janssens
K. U. Leuven, ESAT-SCD, Leuven, Belgium
L. Liang
Institute for Science, Technology, and Society, Henan Normal University, Xinxiang, China
W. Glanzel
Hungarian Academy of Sciences, IRPS, Budapest, Hungary
Scientometrics (2010) 82:687–706
DOI 10.1007/s11192-010-0180-1
inherent challenge for structuring these scientific aggregations, since it may well reflect the
mosaic of cognitive knowledge (Carpenter and Narin 1973). The problem of how to
delineate the journal sets consistently has been a major concern of the scientometric
researchers since its early days (Narin et al. 1972; Narin 1976; Leydesdorff 2002).
Along with the development of computerised scientometrics, mapping of science plays
an increasing role in information science. Recently, the progress in visualization tech-
niques has added the ability to visualize knowledge domains (e.g., Borner et al. 2003).
However, most of the published journal-based maps have typically been focused on small
or single disciplines (e.g., McCain 1991, 1998; Morris and McCain 1998; Ding et al. 2000;
Tsay et al. 2003). Garfield (1998) stated that the new techniques of visualization make it
possible to generate global science maps and to identify emerging research fronts. A few
more recent works have tried to map journals on a larger scale. Bassecoulard and Zitt
(1999) produced a hierarchical journal structure using 1993 JCR (Journal Citation Reports)
data from 32 disciplines. Leydesdorff (2004a, b) used the 2001 JCR data to map the
journals from the SCI and SSCI, using Pearson correlations of citing counts as the
edge weights.
Subject classification schemes have a long tradition in library and information science
and management. Most of these classification systems are based on longstanding practice
and experience and the intellectual interpretation of the literature’s content. Although more
recent schemes also include computerised components, we will call them “intellectual”
schemes to distinguish them from machine-based solutions that are, to the largest
possible extent, independent of assignment by human judgement and expertise.
There are several existing “intellectual” classification schemes used in bibliometrics,
such as the 22 broad field classification scheme of the Essential Science Indicators
database and the 240+ subject categories system of the Journal Citation Reports database.
Glanzel and Schubert (2003) and Boyack et al. (2005), respectively, proposed 12 subject
areas, though their results differ from each other. The question arises to what extent it is
feasible to validate and further adjust the existing subject classification schemes by relying
on computerised techniques.
The main objective of this research is twofold: first, we study the cognitive structure
based on journal cross-citation cluster analysis; then we compare the cluster structure with
traditional “intellectual” subject-classification schemes.
Several methodological approaches are possible: Both co-citation and bibliographic
coupling have to cope with methodological problems. This has been reported, for instance,
by Hicks (1987) in the context of co-citation analysis and by Janssens et al. (2008) and
Jarneving (2005) with regard to bibliographic coupling. One solution is to combine these
techniques with other methods such as lexical-based approaches (Braam et al. 1991;
Janssens 2007; Janssens et al. 2008), or to make use of direct reference-citation links
among pre-defined units as, for instance, journal cross-citations. The cross-citation method
certainly has advantages, such as the possibility of analysing the direction of information
flows among the units under study (Zhang et al. 2009). Leydesdorff (2006) and Leydesdorff
and Rafols (2008) have used journal cross-citation matrices to visualise the structure of
science and its dynamics. In contrast to the latter studies, our method is not based on the
JCR. We calculate citations on a paper-by-paper basis and then assign individual papers to
the journals in which they have been published. This offers four important opportunities:
(1) Selection of document types,
(2) Choice of pre-defined publication periods and citation windows,
(3) Use of keywords extracted from titles and abstracts to characterise the cognitive
composition of the clusters and
(4) Combination of the citation-based classification with other methods (e.g., with text
mining).
The 15-field subject classification scheme developed by Glanzel and Schubert (2003) is
used as a “control structure” to compare the obtained results with an “intellectual” subject
classification.
Data
The data have been collected from the Web of Science (WoS) of Scientific (part of
Thomson Reuters) for the period 2002–2006. Only papers of the type articles, notes, letters
and reviews were taken into account. Citations have been summed up from the publication
year till 2006. The complete database has been indexed and all terms extracted from titles,
abstracts and keywords have been used for ‘‘labelling’’ the obtained clusters.
All journals have been checked for continuity. Journals that changed name, were merged
or were split up were identified and unified, and journals that were not covered in the entire
period have been omitted. Furthermore, in order to build meaningful and statistically
reliable measures, only journals that had published at least 50 papers and whose
references and citations summed to at least 30 during 2002–2006 were considered. The
resulting number of remaining journals was 8,305. Most of the subsequent analyses were
performed in Java and MATLAB.
Methods
Our analysis is conducted in the following five steps.
(1) Clustering data into journal groupings and evaluating the obtained clusters.
(2) Labelling clusters using most relevant terms.
(3) Studying the cognitive structure based on cross-citation cluster analysis.
(4) Comparing the cluster structure with an existing ‘‘intellectual’’ classification scheme.
(5) Detecting ‘‘migration’’ of journals to improve the ‘‘intellectual’’ classification
scheme.
For the cluster analysis we use the agglomerative hierarchical clustering algorithm with
Ward’s method (Jain and Dubes 1988). This is a hard clustering algorithm, which means
that each individual journal is assigned to exactly one cluster.
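To make the hard-clustering step concrete, the following is a minimal Python sketch of agglomerative clustering with Ward's criterion, written with the Lance-Williams update. It is an illustration only and not the implementation used in the study: the function name `ward_clusters`, the toy points and the use of squared Euclidean distances are assumptions, whereas the study clusters journals on cross-citation similarities.

```python
def ward_clusters(points, k):
    """Merge points bottom-up with Ward's criterion until k clusters remain.

    points: list of equal-length numeric tuples (hypothetical toy input).
    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    clusters = {i: [i] for i in range(len(points))}
    # Pairwise squared Euclidean distances between the initial singletons.
    d = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d[(i, j)] = sum((x - y) ** 2 for x, y in zip(points[i], points[j]))
    next_id = len(points)
    while len(clusters) > k:
        i, j = min(d, key=d.get)          # cheapest merge under Ward's criterion
        dij = d.pop((i, j))
        ni, nj = len(clusters[i]), len(clusters[j])
        merged = clusters.pop(i) + clusters.pop(j)
        for m in clusters:
            nm = len(clusters[m])
            dmi = d.pop((min(m, i), max(m, i)))
            dmj = d.pop((min(m, j), max(m, j)))
            # Lance-Williams update for Ward's method.
            d[(m, next_id)] = ((ni + nm) * dmi + (nj + nm) * dmj - nm * dij) / (ni + nj + nm)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())
```

Every index ends up in exactly one of the returned lists, which is the "hard" assignment described above.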
Textual content was indexed with the Jakarta Lucene platform (Hatcher and Gospodnetic
2004) and encoded in the Vector Space Model using the TF-IDF weighting scheme (Baeza-
Yates and Ribeiro-Neto 1999). Stop words were neglected during indexing and the Porter
stemmer was applied to all remaining terms from titles, abstracts, and keyword fields. The
most relevant terms in each cluster were collected for labelling the clusters.
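The labelling idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Lucene-based pipeline of the study: each cluster is treated as one aggregated bag of already stemmed, stop-word-free tokens, and the function name `top_terms` and the toy tokens are hypothetical.

```python
import math
from collections import Counter

def top_terms(cluster_tokens, k=3):
    """Rank the terms of each cluster by a TF-IDF score.

    cluster_tokens: dict mapping a cluster id to a list of stemmed tokens.
    TF is the relative term frequency inside the cluster; IDF is
    log(number of clusters / number of clusters containing the term).
    """
    n = len(cluster_tokens)
    counts = {c: Counter(tokens) for c, tokens in cluster_tokens.items()}
    df = Counter()
    for cnt in counts.values():
        df.update(cnt.keys())
    labels = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        scores = {t: (f / total) * math.log(n / df[t]) for t, f in cnt.items()}
        # Highest score first; ties are broken alphabetically.
        labels[c] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return labels
```

A term that occurs in every cluster receives an IDF of zero and therefore never becomes a label, which is the intended effect of the weighting.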
The comparison of the cluster structure with the existing classification system is based on
multidimensional scaling and the Jaccard index.
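As a small illustration of the Jaccard index in this setting, the sketch below compares the journal set of one cluster with the journal sets of pre-defined fields. The helper names `jaccard` and `best_matching_field` and the journal identifiers are hypothetical and serve only to show the calculation.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of journal ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matching_field(cluster, fields):
    """Return (field name, Jaccard index) for the field most similar to the cluster."""
    return max(
        ((name, jaccard(cluster, members)) for name, members in fields.items()),
        key=lambda pair: pair[1],
    )
```

A value of 1 means that cluster and field contain exactly the same journals; a value of 0 means that they share none.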
Several measures were applied to detect the important or representative journals in each
cluster, namely, the journal strong links, journal entropies, and a modified version of
Google’s PageRank algorithm (Brin and Page 1998). The first two indicators were also
used to compare clusters with respect to their communication characteristics.
(1) Symmetrised link strength between journal i and j (SLij)
$$SL_{ij} = \frac{a_{ij} + a_{ji}}{\sqrt{(TC_i + TR_i)\,(TC_j + TR_j)}} \qquad (1)$$
where i and j denote journals, TCi the total number of citations of journal i, TRi the total
number of references of journal i and aij the number of citations journal i receives from
journal j. This indicator measures the strength of cross-citation links between two journals in the
symmetric matrix. It can be considered as a measure of “centrality” of journals.
(2) Journal citation entropy according to symmetrised links (ELi)
$$EL_i = -\sum_{j=1}^{n} \frac{a_{ij} + a_{ji}}{TC_i + TR_i} \cdot \log_2\!\left(\frac{a_{ij} + a_{ji}}{TC_i + TR_i}\right) \qquad (2)$$
This indicator measures the extent to which references/citations are spread among other
journals but, unlike the previous indicator, it does not measure centrality. Therefore,
indicators (1) and (2) complement each other.
(3) PageRank algorithm (Brin and Page 1998)
$$PR_i = \frac{1-\alpha}{n} + \alpha \sum_{j} PR_j \, \frac{a_{ij}/P_i}{\sum_{k} a_{kj}/P_k} \qquad (3)$$
where PRi is the PageRank of journal i, α is a scalar between 0 and 1 (α = 0.9 in our
implementation), n is the number of journals, and Pi is the number of papers published by
journal i, both in the period under study. The PageRank of a journal can be understood here
as the probability that a random reader will be reading that journal when he/she randomly
and with equal probability looks up cited references to other journals (different from the
current one), but once in a while randomly picks another journal from the library (cluster).
In general, the journals ranking highest represent their cluster in an adequate manner
(Janssens et al. 2009).
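The three indicators defined in Eqs. 1–3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the matrix entry `a[i][j]` holds the citations journal i receives from journal j, TCi and TRi are taken as the row and column sums of that same matrix, self-citations are ignored in the PageRank step, and journals without outgoing citations spread their score uniformly. All function names are hypothetical.

```python
import math

def totals(a):
    """Row sums (citations received, TC) and column sums (references given, TR)."""
    n = len(a)
    tc = [sum(a[i]) for i in range(n)]
    tr = [sum(a[k][i] for k in range(n)) for i in range(n)]
    return tc, tr

def link_strength(a, i, j):
    """Symmetrised link strength SL_ij of Eq. 1."""
    tc, tr = totals(a)
    denom = math.sqrt((tc[i] + tr[i]) * (tc[j] + tr[j]))
    return (a[i][j] + a[j][i]) / denom if denom > 0 else 0.0

def link_entropy(a, i):
    """Citation entropy EL_i of Eq. 2 (terms with zero share contribute nothing)."""
    tc, tr = totals(a)
    total = tc[i] + tr[i]
    if total == 0:
        return 0.0
    h = 0.0
    for j in range(len(a)):
        p = (a[i][j] + a[j][i]) / total
        if p > 0:
            h -= p * math.log2(p)
    return h

def modified_pagerank(a, papers, alpha=0.9, tol=1e-12, max_iter=1000):
    """Power iteration for Eq. 3 with citation weights a_ij / P_i."""
    n = len(a)
    # Self-citations are dropped (assumption based on "different from the current one").
    w = [[a[i][j] / papers[i] if i != j else 0.0 for j in range(n)] for i in range(n)]
    col = [sum(w[k][j] for k in range(n)) for j in range(n)]
    pr = [1.0 / n] * n
    for _ in range(max_iter):
        # Assumption: journals that cite nobody distribute their score uniformly.
        dangling = sum(pr[j] for j in range(n) if col[j] == 0)
        new = []
        for i in range(n):
            flow = sum(pr[j] * w[i][j] / col[j] for j in range(n) if col[j] > 0)
            new.append((1 - alpha) / n + alpha * (flow + dangling / n))
        change = sum(abs(x - y) for x, y in zip(new, pr))
        pr = new
        if change < tol:
            break
    return pr
```

With these conventions the PageRank values sum to one, so they can be read as the visiting probabilities of the random reader described above.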
Results
Clustering data into journal groupings and making evaluations
of the cluster analysis
Our analysis is based on the symmetric journal cross-citation matrix. In terms of the
similarity measures between journals, we opted for the “second-order” journal cross-citation
similarities for clustering, taking into account that two journals might be highly cited by a
third one. By “second-order similarities” we mean that citation links between a journal and
all other journals were used as input for another step of pairwise similarity calculation. The
second-order similarities are found by calculating the cosine of the angle between pairs of
vectors containing all symmetric journal cross-citation values between the two respective
journals and all other journals (Janssens 2007). On the basis of these similarity measures
we applied the agglomerative hierarchical cluster algorithm using Ward’s method
(e.g., Jain and Dubes 1988).
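The second-order similarity can be illustrated with a short Python sketch. It assumes that the first-order input is a symmetric matrix of cross-citation values given as a list of rows and that each journal's full row, including its diagonal entry, is used as its vector; the function name and the toy matrix are hypothetical.

```python
import math

def second_order_similarity(s):
    """Cosine similarity between the row vectors of a symmetric first-order matrix."""
    n = len(s)
    norms = [math.sqrt(sum(x * x for x in row)) for row in s]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(x * y for x, y in zip(s[i], s[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim

# Journals 0 and 1 have no direct link, but both are linked to journal 2.
first_order = [
    [0, 0, 3],
    [0, 0, 2],
    [3, 2, 0],
]
similarity = second_order_similarity(first_order)
```

In this toy case journals 0 and 1 obtain a high second-order similarity although their direct link is zero, which is the situation of two journals related through a third one.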
In a first step we determine the optimal number of clusters, which depends on different
factors (the similarity measure, data representation, validation method, etc.). This is done
by comparing the quality of different clustering solutions based on various numbers of
clusters, where we mainly relied on mean Silhouette curves.
We finally chose 15 clusters, which promise to provide stable results (Fig. 2).
This also allows us to compare the result with, and to validate, the existing 15-field
classification scheme developed at SOOI in Leuven. Figure 1 shows the cluster dendrogram
for cross-citation hierarchical clustering, cut off at 15 clusters on the left-hand side, which
yields a relatively well-separated cluster scheme.
Aware of the difficulty of determining the optimal cut-off point on a dendrogram
(Jain and Dubes 1988), we complemented this method with the mean Silhouette curves.
Silhouette values range from -1 to +1 and compare the similarity of a document with
documents in its own cluster vs. documents in other clusters (Rousseeuw 1987). In
particular, S(i) is defined as follows:
$$S(i) = \frac{\min_{j} B(i, C_j) - W(i)}{\max\left[\min_{j} B(i, C_j),\; W(i)\right]} \qquad (4)$$
where W(i) is the average distance from document i to all other documents within its
cluster, and B(i, Cj) is the average distance from document i to all documents in another
cluster Cj. Since the calculation of Silhouette values is based on distances, different
distance measures might provide different results. Figure 2 compares the performance of
cross-citation clustering for various numbers of clusters based on the cosine similarity of
cross-citation. However, from this figure it is difficult to determine a clear local optimum
for the number of clusters. Considering one of our main objectives, the comparison of the
results with the SOOI classification, 15 proved an appropriate choice for the number of clusters.
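Eq. 4 translates directly into code. The sketch below is a plain Python illustration that assumes a precomputed symmetric distance matrix and one cluster label per document; assigning a value of 0 to singleton clusters is a convention chosen here, and the function name and toy data are hypothetical.

```python
def silhouette_values(dist, labels):
    """Silhouette value S(i) of Eq. 4 for every document.

    dist: symmetric matrix of pairwise distances (list of lists).
    labels: cluster label of each document; at least two clusters are required.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("Silhouette values need at least two clusters")
    values = []
    for i, label in enumerate(labels):
        own = [k for k in clusters[label] if k != i]
        if not own:
            values.append(0.0)  # convention for singleton clusters
            continue
        w = sum(dist[i][k] for k in own) / len(own)              # W(i)
        b = min(                                                 # min_j B(i, C_j)
            sum(dist[i][k] for k in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(b, w)
        values.append((b - w) / denom if denom > 0 else 0.0)
    return values

def mean_silhouette(dist, labels):
    """Mean Silhouette value, the quantity traced by a Silhouette curve."""
    values = silhouette_values(dist, labels)
    return sum(values) / len(values)
```

Evaluating `mean_silhouette` for the partitions obtained at different numbers of clusters gives one point of a mean Silhouette curve per partition.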
Fig. 1 Cluster dendrogram for cross-citation hierarchical clustering of 8,305 journals (cutting off at 15 clusters on the left-hand side)
The quality of a specific clustering scheme can be further visualized in a Silhouette
plot, which expresses the contrast between intra- and inter-cluster similarities. Figure 3
presents the evaluation of the cross-citation clustering with 15 clusters, based on Silhouette
values computed from the cross-citation distances (left) and from the linearly combined
citation-text distances (right). The further the Silhouette profile of a cluster lies to the right
of the vertical line at the value 0, the more coherent the cluster is. Accordingly, the
Silhouette plots substantiate that the results of this cluster analysis are acceptable except
for cluster #1. This cluster represents a somewhat heterogeneous, less consistent field.
Labelling clusters using most relevant terms
In order to describe the cognitive characteristics of individual clusters, we labelled each
cluster using the best TF-IDF terms on the basis of a textual approach. In particular, these
Fig. 2 Silhouette curves based on the cosine similarity of cross-citation for different numbers of clusters
Fig. 3 Evaluation of the cross-citation clustering solution with 15 clusters by Silhouette plot based on citations (left) and citation-text combined distances (right)
Table 1 The 50 best TF-IDF terms describing the 15 cross-citation clusters
1 finit;nonlinear;firm;web;busi;fuzzi;graph;custom;logic;motion;machin;nois;ltd;internet;asymptot;schedul;compani;price;robot;organiz;polit;stochast;fault;crack;polynomi;librari;traffic;elast;veloc;iter;neural;court;vehicl;job;circuit;concret;corpor;text;semant;queri;sensor;ey;execut;node;wireless;video;trade;student;team;compress;
2 rock;basin;sediment;miner;fault;isotop;ma;tecton;sea;metamorph;volcan;cretac;assemblag;mantl;faci;ocean;magma;lake;sedimentari;southern;climat;fossil;melt;granit;magmat;miocen;northern;belt;marin;crust;basalt;continent;stratigraph;or;pb;geolog;geochem;river;eastern;jurass;ic;quartz;triassic;sr;deform;fauna;holocen;crustal;zircon;seismic;
3 receptor;dna;cancer;tumor;rat;mice;mutat;genom;transcript;il;immun;enzym;peptid;kinas;inhibitor;apoptosi;chromosom;therapi;antibodi;molecul;infect;mrna;rna;antigen;bone;muscl;mous;metabol;vitro;ca2;vivo;therapeut;assai;prolifer;liver;lipid;inflammatori;neuron;cytokin;mutant;transplant;phenotyp;ligand;brain;amino;serum;phosphoryl;lymphocyt;polymorph;syndrom;
4 polym;ion;catalyst;thermal;solvent;bond;crystal;adsorpt;soil;hydrogen;film;aqueou;molecul;atom;nmr;polymer;ligand;methyl;poli;spectroscopi;electrod;cu;ltd;fuel;reactor;chromatographi;spectra;ms;veloc;toxic;coat;gel;column;solubl;copper;dry;blend;mol;salt;cation;pollut;chemistri;wast;enzym;copolym;surfact;emiss;cd;catalyt;bi;
5 polit;war;urban;gender;parti;reform;discours;democraci;democrat;economi;capit;contemporari;civil;narr;essai;ethnic;liber;immigr;sociolog;british;german;rural;land;geographi;labor;china;union;labour;coloni;ideolog;religi;elect;welfar;privat;moral;militari;foreign;sector;employ;race;trade;nineteenth;african;actor;write;citizen;racial;america;crisi;africa;
6 infect;dog;viru;vaccin;hiv;cow;hors;milk;pcr;bacteria;pathogen;antibodi;parasit;bacteri;pig;dna;cat;cattl;immun;coli;enzym;diet;breed;calv;antibiot;assai;genom;viral;antigen;serum;sheep;herd;farm;egg;sp;therapi;genotyp;bovin;microbi;dairi;malaria;antimicrobi;fed;veterinari;bird;chicken;pneumonia;vitro;tuberculosi;virul;
7 cancer;tumor;breast;surgeri;carcinoma;surgic;ct;lesion;therapi;resect;malign;tumour;recurr;chemotherapi;arteri;mri;mr;lung;postop;radiotherapi;node;aneurysm;metastas;histolog;biopsi;bone;prostat;liver;invas;neck;pet;lymph;underw;brain;median;metastat;colorect;patholog;pain;endoscop;tomographi;injuri;nerv;lymphoma;symptom;cervic;hospit;preoper;laparoscop;pancreat;
8 algebra;theorem;manifold;finit;infin;let;omega;invari;polynomi;singular;compact;inequ;lambda;asymptot;ellipt;conjectur;bar;proof;convex;hyperbol;lie;phi;nonlinear;banach;topolog;epsilon;metric;sigma;infinit;curvatur;cohomolog;symmetr;semigroup;holomorph;hilbert;bundl;subgroup;commut;integ;graph;abelian;automorph;riemannian;isomorph;norm;prime;dirichlet;geometr;eigenvalu;parabol;
9 soil;seed;crop;forest;leaf;cultivar;seedl;shoot;ha;wheat;fruit;rice;fertil;germin;flower;irrig;land;veget;weed;season;dry;maiz;grain;agricultur;harvest;nitrogen;nutrient;farm;biomass;genotyp;transgen;pathogen;arabidopsi;grown;co2;inocul;pollen;pine;wood;potato;genom;drought;infect;fungi;dna;grass;barlei;cultiv;trait;farmer;
10 therapi;diabet;hospit;pain;nurs;cancer;physician;injuri;arteri;chronic;coronari;infect;renal;syndrom;symptom;men;mortal;hypertens;surgeri;serum;cardiac;muscl;pregnanc;cardiovascular;bone;infant;obes;insulin;ci;smoke;hiv;pulmonari;surgic;myocardi;questionnair;liver;elderli;rat;student;ventricular;fractur;metabol;intak;nutrit;hepat;transplant;lung;cholesterol;glucos;knee;
11 habitat;fish;forest;egg;genu;predat;sea;nest;season;larva;ecolog;reproduct;prei;bird;lake;sp;island;river;taxa;breed;veget;seed;mate;sediment;southern;diet;ecosystem;soil;marin;juvenil;biomass;phylogenet;forag;parasit;coastal;insect;larval;genera;ocean;winter;northern;fisheri;nutrient;summer;nov;eastern;landscap;coloni;atlant;flower;
12 film;alloi;laser;quantum;crystal;atom;ion;steel;thermal;beam;coat;glass;si;grain;spin;microstructur;silicon;nm;dope;scatter;powder;ceram;diffract;corros;fabric;dielectr;deform;lattic;excit;photon;emiss;microscopi;nonlinear;fiber;cu;fe;crack;voltag;neutron;anneal;sinter;ni;spectra;polym;sensor;semiconductor;spectroscopi;hydrogen;weld;oscil;
terms were extracted from the title, abstract and keywords of the individual documents
published in the journals. The top 50 terms of each cluster are presented in Table 1.
Through these terms, we could roughly grasp the topics covered in each cluster.
We obtain the following structure: biosciences (cluster #3), neurosciences and
psychology (#13 and #14), two clinical medicine clusters (#7 and #10), agriculture and
environment (#9), biology (#6 and #11), geosciences and space sciences (#2), chemistry
(#4), physics (#12), mathematics (#8), economics (#15) and one further social sciences
cluster (#5). The terms characterising cluster #1 confirm the observation in the Silhouette
plots (Fig. 3), as there are many terms from different fields without any apparent common
characteristics. Therefore, cluster #1 is indeed a heterogeneous cluster of lesser quality.
Although the most relevant terms already provided a recognisable description for the
cognitive characteristics of each cluster, we noticed that there were considerable overlaps
between some pairs of clusters. We found several cluster pairs between which the share of
overlapping terms exceeds 20%: cluster #3 and #6, with terms related to Bioscience,
Biology and Biomedical Science; cluster #7 and #10, between which there are many
common terms representing Medical Science; cluster #5 and #15, which share many terms
focusing on Social Science.
Studying the cognitive structure based on cross-citation cluster analysis
Studying the hierarchical and network structure of cross-citation clusters
The visualisation of the network structure based on cross-citation links among journals is
presented in Fig. 4 (visualized by Pajek; Batagelj and Mrvar 2002). For measuring the
citation link strength between clusters, a normalised similarity is applied based on formula
(1), where now i and j denote cluster i and cluster j. Here, intra-cluster ‘‘self-citations’’ are
counted only once. All clusters are represented by the three most relevant TF-IDF terms.
The overall structure presented in the network is in line with the architecture in the cluster
dendrogram (Fig. 1), and also agrees with the analysis of the most characteristic terms.
Within the whole network, the strongest link is found between clusters #3 and #10,
focussing on biosciences and clinical medicine, respectively. Some other obvious
links also stand out in the network, for instance, between clusters #3 and #13, #3 and #6,
#4 and #12, #7 and #10, and #13 and #14. A distinct group related to the
Table 1 continued
13 neuron;brain;rat;receptor;cognit;motor;cortex;stroke;cerebr;sleep;neural;cortic;epilepsi;seizur;nerv;mice;injuri;stimulu;deficit;muscl;stimuli;lesion;neurolog;pain;dementia;syndrom;parkinson;eeg;symptom;alzheim;synapt;spinal;axon;ms;sensori;nervou;dopamin;hippocamp;neuropsycholog;schizophrenia;frontal;hippocampu;sclerosi;therapi;chronic;auditori;pd;cue;nucleu;emot;
14 student;school;psycholog;cognit;teacher;adolesc;mental;emot;child;anxieti;symptom;gender;psychiatr;abus;attitud;interview;skill;mother;disabl;sexual;item;alcohol;cope;teach;belief;violenc;word;schizophrenia;suicid;instruct;youth;profession;questionnair;english;classroom;peer;academ;men;therapi;discours;development;offend;client;aggress;verbal;speech;satisfact;mood;linguist;colleg;
15 price;firm;trade;economi;wage;incom;tax;capit;invest;monetari;labor;welfar;financi;bank;sector;employ;household;inflat;privat;game;incent;reform;polit;unemploy;worker;insur;stock;foreign;poverti;asset;forecast;labour;fiscal;export;profit;inequ;school;union;agricultur;shock;monei;earn;macroeconom;volatil;busi;job;currenc;panel;credit;alloc;
life and medical sciences, consisting of clusters #3, #10, #13, #6 and #7, contrasts strongly
with the rest of the structure, which is closely interconnected. Another interesting
phenomenon is found concerning clusters #13 (neuro- and behavioural science) and #14
(psychology), where cluster #14 dissociates from the life science clusters and is, in fact,
closer to the social science group. This effect is apparently caused by the strong social
orientation of the subject categories psychology and psychiatry.
Detecting and comparing “important journals” of each cluster among the cross-citation network
Within the entire cross-citation structure, some journals play a significant role in the
communication network. These journals can be identified on the basis of different indicators
that consider centrality and significance within the cluster. We compared three kinds of
journal rankings according to the indicators introduced at the outset of the paper.
The first indicator (cf. Eq. 1) is based on the high frequency of strong links with other
journals. This measure (SL) can be considered an indicator of ‘‘centrality’’. Journals
strongly linked with numerous other journals can therefore be regarded as important nodes
in the cross-citation network. We ranked all journals in each cluster according to their
number of ‘‘strong links’’. The top five journals in each cluster are presented in Table 2.
Unlike the first approach, citation entropy describes the extent to which citations are
spread among other journals. This expresses diversity rather than “centrality”. While the
first approach reflects a strong influence from/on other representatives of a social network,
the second one merely measures the “diffusion” of contacts regardless of their intensity.
Thus, entropy reflects the “width”, SL the “depth” of social contacts.
Table 3 shows the top journals with high entropies in each cluster. These journals are
not necessarily ‘‘active’’ nodes in the cross-citation network, but spread information over
and/or collect information from a variety of other entities.
Clearly, Tables 2 and 3 provide different types of information about the role of journals
in the cross-citation network. Consequently, the contents of the two tables differ considerably.
Fig. 4 Network structure of cross-citation clusters represented by the three most relevant TF-IDF terms (thickness of lines is proportional to the strength of citation links, the size of the nodes is proportional to the number of journals)
Centrality according to the first indicator also implies a rather thematic focus, which is
why we find many general and multidisciplinary journals in Table 3 (e.g., Science, Nature,
Ann NY Acad Sci), while Table 2 is dominated by more specialised periodicals.
American Economic Review is the only journal appearing in both groups. This phenomenon
is in line with the results of Zhang et al. (2009). “Central” actors thus
contribute to forming coherent sub-clusters in the network and act as “cores” in
these clusters. By contrast, the “high-entropy” journals are mainly active in extending their
communication network, and act as important nodes facilitating and spreading
information flows throughout the whole cross-citation network.
Table 2 Top five journals of each cluster based on the number of strong links
Cluster 1 Cluster 2 Cluster 3
1. LECT NOTES COMPUT SC 1. EARTH PLANET SC LETT 1. J AM ACAD DERMATOL
2. OPHTHALMOLOGY 2. TECTONOPHYSICS 2. J IMMUNOL
3. AM J OPHTHALMOL 3. PALAEOGEOGR PALAEOCL 3. BLOOD
4. IEEE J SEL AREA COMM 4. LITHOS 4. J BIOL CHEM
5. COLUMBIA LAW REV 5. PRECAMBRIAN RES 5. ARTH RHEUM/AR C RES
Cluster 4 Cluster 5 Cluster 6
1. INT J HEAT MASS TRAN 1. ENVIRON PLANN A 1. APPL ENVIRON MICROB
2. INORG CHEM 2. PROG HUM GEOG 2. J AM VET MED ASSOC
3. J AM CHEM SOC 3. AM J POLIT SCI 3. J CLIN MICROBIOL
4. TETRAHEDRON LETT 4. J POLIT 4. BIOL REPROD
5. ANAL CHEM 5. URBAN STUD 5. AIDS
Cluster 7 Cluster 8 Cluster 9
1. LARYNGOSCOPE 1. J DIFFER EQUATIONS 1. PLANT PHYSIOL
2. J NEUROSURG 2. J MATH ANAL APPL 2. SOIL SCI SOC AM J
3. RADIOLOGY 3. INVENT MATH 3. FOREST ECOL MANAG
4. AM J ROENTGENOL 4. J ALGEBRA 4. PLANT J
5. J NUCL MED 5. DUKE MATH J 5. THEOR APPL GENET
Cluster 10 Cluster 11 Cluster 12
1. J BONE JOINT SURG AM 1. MAR ECOL-PROG SER 1. PHYS REV D
2. J UROLOGY 2. AQUACULTURE 2. PHYS LETT B
3. CIRCULATION 3. J ECON ENTOMOL 3. PHYS REV B
4. KIDNEY INT 4. J GEOPHYSICAL RES-OCEANS 4. PHYS REV LETT
5. AM J SPORT MED 5. J PHYS OCEANOGR 5. APPL PHYS LETT
Cluster 13 Cluster 14 Cluster 15
1. J NEUROSCI 1. J PERS SOC PSYCHOL 1. AM ECON REV
2. NEUROLOGY 2. CHILD DEV 2. AM J AGR ECON
3. NEUROIMAGE 3. PERS SOC PSYCHOL B 3. ECONOMETRICA
4. NEUROPSYCHOLOGIA 4. J EDUC PSYCHOL 4. J MONETARY ECON
5. VISION RES 5. EXCEPT CHILDREN 5. J INT ECON
Google’s PageRank algorithm (Brin and Page 1998) computes the status of a Web page
based on a combination of the number of hyperlinks that point to the page and the status of
the pages that the hyperlinks originate from. By taking into account both the popularity and
the prestige factor of status, Google avoided assigning high ranks to popular but otherwise
irrelevant Web pages (Bollen et al. 2006). In this study, we applied a modified version of
the PageRank algorithm, in which the number of citations is taken into account, normalized
by the number of published papers (see Eq. 3). We re-ranked all the journals in each cluster
according to their modified PageRank values and present the most representative journals
in Table 4.
Table 3 Top five journals of each cluster based on citation entropy
Cluster 1 Cluster 2 Cluster 3
1. ZOOTAXA 1. PROG NAT SCI 1. SCIENCE
2. P IEEE 2. CHINESE SCI BULL 2. BRAZ J MED BIOL RES
3. J ENVIRON MANAGE 3. CURR SCI INDIA 3. NATURE
4. COMPUT METH PROG BIO 4. EARTH-SCI REV 4. MED HYPOTHESES
5. CRIT REV ORAL BIOL M 5. APPL GEOCHEM 5. MICROSC RES TECHNIQ
Cluster 4 Cluster 5 Cluster 6
1. TOXICOL LETT 1. SOC SCI MED 1. TRENDS BIOTECHNOL
2. TOXICOLOGY 2. TRANSPORT RES REC 2. APMIS
3. J PHARM PHARMACOL 3. AM BEHAV SCI 3. CURR OPIN BIOTECH
4. ANAL BIOCHEM 4. ANNU REV SOCIOL 4. PATHOL BIOL
5. J BIOCHEM BIOPH METH 5. ANNU REV ANTHROPOL 5. J BIOTECHNOL
Cluster 7 Cluster 8 Cluster 9
1. J CLIN PATHOL 1. LECT NOTES MATH 1. AFR J BIOTECHNOL
2. LANCET ONCOL 2. B AM MATH SOC 2. TRANSGENIC RES
3. ANTICANCER RES 3. CR MATH 3. CRIT REV PLANT SCI
4. CANCER 4. ACTA APPL MATH 4. ADV AGRON
5. EUR J CANCER 5. T AM MATH SOC 5. CAN J BOT
Cluster 10 Cluster 11 Cluster 12
1. LANCET 1. AM MUS NOVIT 1. J MICROSC-OXFORD
2. NEW ENGL J MED 2. NATURWISSENSCHAFTEN 2. J CENT SOUTH UNIV T
3. JAMA-J AM MED ASSOC 3. BIOL REV 3. MAT SCI ENG C-BIO S
4. J KOREAN MED SCI 4. COMP BIOCHEM PHYS B 4. CURR OPIN SOLID ST M
5. ARCH MED RES 5. COMP BIOCHEM PHYS A 5. J PHYS IV
Cluster 13 Cluster 14 Cluster 15
1. ANN NY ACAD SCI 1. J PSYCHOSOM RES 1. J ECON PERSPECT
2. LIFE SCI 2. ANNU REV PSYCHOL 2. J ECON LIT
3. CELL TISSUE RES 3. PSYCHOL BULL 3. WORLD DEV
4. NEUROSCI LETT 4. AM PSYCHOL 4. AM ECON REV
5. EUR J PHARMACOL 5. PSYCHOSOM MED 5. ECON J
The PageRank expresses a third aspect of social contacts. In particular, it adds the
dimension of “prominence” to the picture. Frequent contacts with prominent members of
the network increase the score. Therefore, we expect results different from the previous
ones. However, we also found some coincidence between the top journals based on PageRank
and the other two groups of important journals. In principle, there is no large
overlap among the three groups, but coincidences occur in several clusters.
“Central” and “high-PageRank” journals overlap, for instance, in clusters #5, #6, #8, #9,
#10 and #15; “high-entropy” and “high-PageRank” journals overlap
in clusters #2, #5, #10, #14 and #15.
Table 4 Top five journals of each cluster based on a modified PageRank algorithm
Cluster 1 Cluster 2 Cluster 3
1. ACM COMPUT SURV 1. REV MINERAL GEOCHEM 1. ANNU REV IMMUNOL
2. ADMIN SCI QUART 2. EARTH-SCI REV 2. NAT REV IMMUNOL
3. J FINANC 3. CONTRIB MINERAL PETR 3. NAT IMMUNOL
4. MIS QUART 4. J PETROL 4. CELL
5. ACAD MANAGE J 5. GEOLOGY 5. ANNU REV BIOCHEM
Cluster 4 Cluster 5 Cluster 6
1. CHEM REV 1. AM POLIT SCI REV 1. CLIN MICROBIOL REV
2. PROG POLYM SCI 2. AM SOCIOL REV 2. MICROBIOL MOL BIOL R
3. PROG ENERG COMBUST 3. WORLD POLIT 3. ANNU REV MICROBIOL
4. ANNU REV FLUID MECH 4. ANNU REV SOCIOL 4. AIDS
5. ACCOUNTS CHEM RES 5. AM J POLIT SCI 5. J ACQ IMMUN DEF SYND
Cluster 7 Cluster 8 Cluster 9
1. CA-CANCER J CLIN 1. J AM MATH SOC 1. ANNU REV PLANT BIOL
2. J NATL CANCER I 2. ACTA MATH-DJURSHOLM 2. PLANT CELL
3. J CLIN ONCOL 3. ANN MATH 3. CURR OPIN PLANT BIOL
4. SEMIN NUCL MED 4. INVENT MATH 4. PLANT J
5. ANN SURG ONCOL 5. DUKE MATH J 5. TRENDS PLANT SCI
Cluster 10 Cluster 11 Cluster 12
1. CIRCULATION 1. ANNU REV ECOL EVOL S 1. REV MOD PHYS
2. JAMA-J AM MED ASSOC 2. SYSTEMATIC BIOL 2. MAT SCI ENG R
3. MILBANK Q 3. ANNU REV ENTOMOL 3. ANNU REV NUCL PART S
4. J AM COLL CARDIOL 4. OCEANOGR MAR BIOL 4. PHYS REP
5. NEW ENGL J MED 5. TRENDS ECOL EVOL 5. NAT MATER
Cluster 13 Cluster 14 Cluster 15
1. ANNU REV NEUROSCI 1. ANNU REV PSYCHOL 1. Q J ECON
2. NAT REV NEUROSCI 2. PSYCHOL BULL 2. J ECON LIT
3. NEURON 3. REV EDUC RES 3. J POLIT ECON
4. NAT NEUROSCI 4. PSYCHOL REV 4. ECONOMETRICA
5. PROG NEUROBIOL 5. AM EDUC RES J 5. REV ECON STUD
Comparison of each cluster based on different communication characteristics
Besides identifying “central” journals, strong links also reflect the affinities between
journals within each cluster of the communication network. Table 5 compares the ratios of
strong links within each cluster (SL means strong links). The ratio of strong links in cluster
i is calculated as the number of strong links inside cluster i divided by the total number of
possible symmetric links within the same cluster. The highest ratio of strong links is found
in cluster #2 (Geosciences & Space sciences), and the lowest in cluster #3 (Biosciences).
The three clusters #2, #15 and #8 at the top of Table 5 are also among the “best”
clusters according to their Silhouette values (cf. Fig. 3). The two social science clusters are
interesting cases, as the ratio of strong links is much higher in cluster #15 (economics) than
in #5 (social and political sciences).
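The ratio defined above can be sketched in a few lines. The sketch assumes that the “possible symmetric links” within a cluster of n journals are the n(n-1)/2 unordered journal pairs, and it takes the journal counts from Table 6; the function name is hypothetical. Under this reading, the computed values agree with the ratios listed in Table 5 for the clusters checked here.

```python
def strong_link_ratio(n_strong_links, n_journals):
    """Strong links inside a cluster divided by the number of unordered journal pairs."""
    possible_links = n_journals * (n_journals - 1) // 2
    return n_strong_links / possible_links


# (number of strong links from Table 5, number of journals from Table 6)
clusters = {2: (207, 162), 15: (136, 173), 8: (135, 183), 3: (339, 830)}
ratios = {
    cluster: round(strong_link_ratio(n_sl, n_journals), 4)
    for cluster, (n_sl, n_journals) in clusters.items()
}
print(ratios)
```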
As the clustering algorithm is based on cross-citation similarities, it is quite plausible
that most of the strong links are found within the clusters. However, more than 9% of all
strong links connect journals in different clusters; we call these “foreign strong links”.
They indicate information flow between individual journals that are assigned to different
categories. The analysis of “foreign strong links” can help us detect journals that are
important nodes for exchanging information among different clusters, since they can be
considered interfaces between different subject categories. As extreme cases, American
Journal of Physical Anthropology, which is assigned to cluster #3 (Biosciences), has nine
strong links in total, five of which are “foreign strong links” with journals in cluster #5
(Social sciences), cluster #10 (Clinical medicine) and cluster #11 (Biology). Journal of
Medicinal Chemistry, which belongs to cluster #4 (Chemistry), has seven strong links, five of
which are with journals in cluster #3 (Biosciences). These examples reveal aspects of the
interdisciplinarity of journals in these categories. We found 12 strong links between journals in
Table 5 Comparison of strong links within each cluster
Cluster #  Number of SL  Ratio of SL    Cluster #  Number of SL  Ratio of SL    Cluster #  Number of SL  Ratio of SL
2          207           0.0159         6          304           0.0048         4          834           0.0019
15         136           0.0091         11         381           0.0045         10         612           0.0015
8          135           0.0081         13         170           0.0035         5          248           0.0014
9          221           0.0073         12         423           0.0034         1          1807          0.0012
7          249           0.0069         14         522           0.0024         3          339           0.0010
Table 6 Average citation entropy of journals in each cluster
Cluster #  Number of journals  Average entropy    Cluster #  Number of journals  Average entropy    Cluster #  Number of journals  Average entropy
3          830                 7.50               4          925                 6.22               9          246                 5.88
13         308                 7.23               14         586                 6.19               12         498                 5.84
10         891                 7.08               11         412                 6.14               2          162                 5.65
7          269                 7.02               8          183                 6.10               1          1542                5.38
6          356                 6.64               15         173                 5.93               5          442                 5.37
cluster #5 (Social sciences) and cluster #3 (Biosciences), which might indicate the
underlying trend of information flows between the corresponding fields of science and
social sciences.
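Counting “foreign strong links” amounts to checking, for every strong link, whether its two journals belong to different clusters. The sketch below illustrates this on hypothetical data; the journal labels, the link list and the function name are assumptions, and only the cluster numbers echo those used in the text.

```python
from collections import Counter


def foreign_strong_links(strong_links, cluster_of):
    """Return the share of links crossing cluster borders and per-journal foreign counts."""
    foreign = [(a, b) for a, b in strong_links if cluster_of[a] != cluster_of[b]]
    per_journal = Counter()
    for a, b in foreign:
        per_journal[a] += 1
        per_journal[b] += 1
    share = len(foreign) / len(strong_links) if strong_links else 0.0
    return share, per_journal


# Hypothetical cluster assignment and strong links (unordered journal pairs).
cluster_of = {"J1": 3, "J2": 3, "J3": 5, "J4": 10}
strong_links = [("J1", "J2"), ("J1", "J3"), ("J1", "J4"), ("J3", "J4")]

share, per_journal = foreign_strong_links(strong_links, cluster_of)
print(share, per_journal.most_common())
```

Journals with a high foreign count relative to their total number of strong links are candidates for the interface role described above.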
In Table 6, we present the average entropy of journals in each cluster, which shows the
overall degree of information spreading in each cluster. Cluster #3 (Biosciences) has the
highest average entropy but one of the lowest SL ratios, which again illustrates that these
two measures can be considered two sides of the same coin: its links are far-ranging, but
most of these relationships are rather loose. It is followed by the neuroscience and
psychology cluster (#13), two clinical medicine clusters (#10 and #7) and the biology
cluster (#6). Cluster #2 (Geosciences & Space sciences), on the other hand, can be regarded
as the opposite case: its links are not far-ranging but of relatively strong affinity. The
comparison between the rankings in Tables 5 and 6 thus indicates the diversity of
communication characteristics among the clusters.
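The averaging behind Table 6 can be illustrated as follows. This is only a sketch: it assumes that the citation entropy of a journal is the Shannon entropy of the distribution of its cross-citations over partner journals, and the base-2 logarithm is likewise an assumption, since the exact definition is given earlier in the paper and is not restated in this section. All counts and names are hypothetical.

```python
import math


def citation_entropy(counts):
    """Shannon entropy (base 2, an assumption) of a journal's cross-citation counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def average_entropy(journal_counts):
    """Mean entropy over all journals of one cluster."""
    entropies = [citation_entropy(counts) for counts in journal_counts.values()]
    return sum(entropies) / len(entropies)


# Hypothetical cluster with two journals: one spreads its citations evenly
# over four partners, the other concentrates them on a single partner.
toy_cluster = {
    "spread": [25, 25, 25, 25],
    "focused": [97, 1, 1, 1],
}
print(citation_entropy(toy_cluster["spread"]))
print(average_entropy(toy_cluster))
```

A journal whose citations are spread widely has a higher entropy than one whose citations are concentrated, which is why a cluster can combine high average entropy with a low ratio of strong links.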
Comparison of cluster structure and existing “intellectual” classification
The above-mentioned SOOI classification scheme comprises 15 major fields, including 12
science fields and three fields in the social sciences and humanities (see Table 7). This
scheme is updated annually so that the comparison between the cross-citation clustering
and SOOI fields is based on valid assignments and settings. The concordance between the
15 major fields and the cross-citation clusters allows a direct comparison of the two
systems. The results of the cluster analysis may in turn help to improve and fine-tune the
classification scheme developed at SOOI.
We first visualize the cross-citation network among SOOI fields using Pajek (Fig. 5).
The abbreviations of SOOI fields in Fig. 5 are as follows. AGRI = Agriculture & Envi-
ronment; BIOL = Biology; BIOS = Biosciences; BIOM = Biomedical research; CLI1 =
Clinical and experimental medicine I; CLI2 = Clinical and experimental medicine II;
NEUR = Neuroscience & Behaviour; CHEM = Chemistry; PHYS = Physics; GEOS =
Table 7 SOOI 15-field subject classification scheme
Field #  SOOI field
1        AGRICULTURE & ENVIRONMENT
2        BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL)
3        BIOSCIENCES (GENERAL, CELLULAR & SUBCELLULAR BIOLOGY; GENETICS)
4        BIOMEDICAL RESEARCH
5        CLINICAL AND EXPERIMENTAL MEDICINE I (GENERAL & INTERNAL MEDICINE)
6        CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES)
7        NEUROSCIENCE & BEHAVIOR
8        CHEMISTRY
9        PHYSICS
10       GEOSCIENCES & SPACE SCIENCES
11       ENGINEERING
12       MATHEMATICS
13       SOCIAL SCIENCES I (GENERAL, REGIONAL & COMMUNITY ISSUES)
14       SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)
15       ARTS & HUMANITIES
Geosciences & Space sciences; ENGN = Engineering; MATH = Mathematics; SOC1 =
Social sciences I; SOC2 = Social sciences II; HUMA = Arts & Humanities. Since the
SOOI scheme is not a partition, i.e., some journals are assigned to more than one field,
more cross-citations (thicker lines) can be observed among fields (see Fig. 5) than
among clusters (cf. Fig. 4). The three fields related to the social sciences and humanities
are relatively separated from the rest of the fields. Nevertheless, “Social sciences I” has
comparatively more and closer links with some bioscience, neuroscience and medical
science fields.
Fig. 5 Cross-citation networks among SOOI fields represented by the three most important TF-IDF terms
Fig. 6 Three-dimensional MDS map visualising distances between the centroids of the 15 clusters and 15 SOOI fields
Another comparison between the cluster structure and SOOI classification scheme is
based on the centroids of the two systems. The centroid of a cluster or field is defined as the
linear combination of all documents in it and is thus a vector in the same vector space. For
each cluster and each field, the centroid is calculated and the MDS of pairwise distances
between all centroids is shown in Fig. 6. This map gives an overview of the similarities
and differences between the clusters and fields of the two schemes.
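The centroid comparison can be sketched as follows, under two loudly stated assumptions: the linear combination is taken to be the plain mean of the member vectors, and the pairwise distance is taken to be the cosine distance. Neither choice is specified in this section, the vectors are hypothetical, and the MDS projection itself is omitted.

```python
import math


def centroid(vectors):
    """Mean of the member vectors (one possible linear combination, an assumption)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


# Hypothetical vectors for one cluster and two fields in a shared vector space.
cluster_vectors = [[4.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
field_a_vectors = [[3.0, 0.0, 1.0], [6.0, 0.0, 2.0]]
field_b_vectors = [[0.0, 5.0, 0.0]]

cluster_centroid = centroid(cluster_vectors)
dist_a = cosine_distance(cluster_centroid, centroid(field_a_vectors))
dist_b = cosine_distance(cluster_centroid, centroid(field_b_vectors))
print(cluster_centroid, dist_a, dist_b)
```

A matrix of such pairwise distances between all 30 centroids is the kind of input a multidimensional scaling routine would then project into three dimensions.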
Furthermore, we used the Jaccard index, i.e., the ratio of the cardinality of the
intersection of two sets to the cardinality of their union, to compare each cluster with
each field. The best matching field for each cluster is presented in Table 8 and the best
matching cluster for each field can be found in Table 9. “Strong” matches are found, for
instance, between cluster #4 and the SOOI field Chemistry, and between cluster #12 and
Physics. However, not all fields and clusters could be matched uniquely; the multiple
assignments in the SOOI system might be an important factor here.
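The matching procedure described above can be sketched as follows. The journal sets, labels and function names are hypothetical; the journal J4 is deliberately placed in two fields to mimic the multiple assignments of the SOOI scheme.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(sets_a, sets_b):
    """For each set in sets_a, the name of the set in sets_b with the highest Jaccard index."""
    return {
        name: max(sets_b, key=lambda other: jaccard(members, sets_b[other]))
        for name, members in sets_a.items()
    }


# Hypothetical journal sets; J4 carries two field assignments.
clusters = {"c4": {"J1", "J2", "J3"}, "c12": {"J4", "J5"}}
fields = {"CHEM": {"J1", "J2", "J3", "J4"}, "PHYS": {"J4", "J5", "J6"}}

print(jaccard(clusters["c4"], fields["CHEM"]))
print(best_match(clusters, fields))   # best field per cluster, as in Table 8
print(best_match(fields, clusters))   # best cluster per field, as in Table 9
```

Running the matching in both directions is what yields two separate tables: the best field for a cluster need not have that cluster as its own best match.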
On the basis of these matches, we detected that some SOOI fields tend to fall apart. For
instance, Biology (Organismic & supraorganismic level) splits up into clusters #6 and
#11. In the cross-citation network of the clustering, we can see that although both
clusters #6 and #11 have Biology as their best matching field, they are relatively
separated. Cluster #6 has very strong links with cluster #3, which is the biosciences
cluster. The most relevant terms confirm this: most of the top TF-IDF terms in cluster #6
are related to Microbiology, Molecular Biology or Biosciences, while the terms in cluster
#11 represent Biology. The second field that
seems to fall apart by clustering is Clinical and experimental medicine II (Non-internal
medicine specialties), which actually splits into cluster #7 and cluster #10. In the cross-
citation network, these two clusters are strongly inter-linked. We also observed that the
strongest link in the network was established between cluster #10 (a medical cluster) and
cluster #3 (the bioscience cluster). This result indicates that Clinical and experimental
Table 8 Best matching SOOI field for each cluster (based on Jaccard index)
Cluster SOOI Field
9 AGRICULTURE & ENVIRONMENT
6 BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL)
11 BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL)
3 BIOSCIENCES (GENERAL, CELLULAR & SUBCELLULAR BIOLOGY; GENETICS)
4 CHEMISTRY
7 CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES)
10 CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES)
1 ENGINEERING
2 GEOSCIENCES & SPACE SCIENCES
8 MATHEMATICS
13 NEUROSCIENCE & BEHAVIOR
14 NEUROSCIENCE & BEHAVIOR
12 PHYSICS
5 SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)
15 SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES)
medicine II (Non-internal medicine specialties) tends to split into two clusters, one of
which has strong affinity with Biosciences. Similarly, Neuroscience & Behavior tends to
split into clusters #14 and #13. The former is more closely related to the “social
sciences”, the latter to the “sciences”.
On the other hand, based on the best matches in Table 9, we observed that some fields tend
to merge. For instance, Biosciences (general, cellular and subcellular biology; genetics)
and Biomedical research merge into one cluster (#3); the two fields of Clinical and
experimental medicine merge into cluster #10; and the three fields in the social sciences
and humanities tend to merge into cluster #5. These tendencies are not clear in the
cross-citation network of SOOI fields (Fig. 5), however, where only the two Clinical and
experimental medicine fields are strongly interlinked, while the other two groups of
merging fields are rather loosely connected.
Detecting “migration” of journals and improving the “intellectual” classification schemes
In the previous part, we determined for each SOOI field the cluster that best matches it
according to the Jaccard index. More than 40% of the journals are not assigned to the
cluster which best matches their SOOI field. We call these journals migrated journals. We
expect that the analysis of journal “migration” helps improve our “intellectual”
classification scheme. The large share of “migrations” may be due to several reasons.
First, the two classifications rest on different bases: the SOOI scheme is based on the
classification of ISI subject categories, whereas our clustering results are generated by
the chosen algorithm and are based on cross-citation similarities only. Secondly, according
to our definition, “migrations” are detected on the basis of the “best matching” result,
but the “matching” between SOOI and the clustering is not always clear, especially
because the clustering is a partition while SOOI allows multiple assignments.
Table 9 Best matching cluster for each SOOI field (based on Jaccard index)
SOOI field Cluster#
AGRICULTURE & ENVIRONMENT 9
BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL) 11
BIOSCIENCES (GENERAL, CELLULAR & SUBCELLULAR BIOLOGY; GENETICS) 3
BIOMEDICAL RESEARCH 3
CLINICAL AND EXPERIMENTAL MEDICINE I (GENERAL & INTERNAL MEDICINE) 10
CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES) 10
NEUROSCIENCE & BEHAVIOR 14
CHEMISTRY 4
PHYSICS 12
GEOSCIENCES & SPACE SCIENCES 2
ENGINEERING 1
MATHEMATICS 8
SOCIAL SCIENCES I (GENERAL, REGIONAL & COMMUNITY ISSUES) 5
SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES) 5
ARTS & HUMANITIES 5
In the SOOI classification, more than 70% of the journals have a single assignment. For
these journals, migration is readily determined. A journal with multiple assignments is
considered migrated if it has been assigned to a cluster which is not the best matching
cluster for any of its SOOI fields. Table 10 presents the 10 strongest migration patterns
among journals with a single assignment in the SOOI fields. It is not surprising that there
is strong emigration from different SOOI fields to cluster #1, which is the most
heterogeneous cluster. Apart from migrations to cluster #1, several other remarkable
emigration patterns have drawn our attention, e.g., from Chemistry to cluster #12
(Physics), from Clinical and experimental medicine I to cluster #3 (Biosciences) and from
Social sciences I to cluster #14 (Neuroscience & Behaviour). These distinct migration
patterns indicate possible adjustments or improvements of journal assignments.
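The migration rule for single and multiple assignments can be sketched as follows. The best matching clusters are taken from Table 9 for four fields; the journal labels, their assignments and the function name are hypothetical.

```python
from collections import Counter

# Best matching cluster per SOOI field, as listed in Table 9 (subset).
best_cluster_of_field = {
    "CHEMISTRY": 4,
    "PHYSICS": 12,
    "BIOSCIENCES": 3,
    "BIOMEDICAL RESEARCH": 3,
}


def is_migrated(journal_cluster, journal_fields):
    """A journal has migrated if its cluster is the best match for none of its fields."""
    return all(best_cluster_of_field[field] != journal_cluster for field in journal_fields)


# Hypothetical journals: (SOOI fields, cluster assigned by the clustering).
journals = {
    "J_A": (["CHEMISTRY"], 12),             # single assignment, migrated
    "J_B": (["CHEMISTRY", "PHYSICS"], 12),  # matches PHYSICS, not migrated
    "J_C": (["CHEMISTRY", "PHYSICS"], 3),   # matches neither field, migrated
    "J_D": (["BIOSCIENCES"], 3),            # single assignment, not migrated
}

migrated = {name for name, (fields, cluster) in journals.items() if is_migrated(cluster, fields)}

# Migration patterns (field -> cluster) for single-assignment journals only, as in Table 10.
patterns = Counter(
    (fields[0], cluster)
    for name, (fields, cluster) in journals.items()
    if len(fields) == 1 and name in migrated
)
print(sorted(migrated), patterns.most_common())
```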
Nevertheless, not all “migrations” can be used for the improvement of the reference
classification system, since we have to distinguish between “good migration” and “bad
migration”. “Good migration” is observed if the goodness of the unit’s classification
improves and, based on its title and scope, the unit clearly should be assigned to the
cluster to which it has moved. Otherwise we call it “bad migration” (Janssens et al. 2009).
Judged by the scope of the journals, bad migrations are not convincing, but in the
cross-citation clustering these journals indeed show more affinity with a different field.
On the one hand, this phenomenon might reflect strong information flow between different
subject fields; on the other hand, it may indicate a possible change in the profile or
dynamics of these journals. However, “bad migrations” may also simply be “poor results”
generated by the clustering algorithm. Therefore, we propose to always combine
quantitative approaches, e.g., cluster analysis, with “intellectual” assessment to
adjust and improve the existing classification system.
The distinction between good and bad migrations is important for validating and
adjusting the “intellectual” subject classification. In the future, we intend to deepen the
“migration” analysis through a hybrid clustering algorithm that combines citation and text
information. Based on the hybrid clustering, good and bad migrations could be further
validated and confirmed, and the efficiency of classification could thus be further
improved.
Table 10 Top 10 strongest migration patterns (for single assignment in SOOI fields)
From SOOI field | To cluster | Number of migrated journals
SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES) | 1 | 195
ARTS & HUMANITIES | 1 | 194
BIOLOGY (ORGANISMIC & SUPRAORGANISMIC LEVEL) | 6 | 190
MATHEMATICS | 1 | 144
CHEMISTRY | 12 | 138
GEOSCIENCES & SPACE SCIENCES | 1 | 127
CLINICAL AND EXPERIMENTAL MEDICINE I (GENERAL & INTERNAL MEDICINE) | 3 | 122
SOCIAL SCIENCES I (GENERAL, REGIONAL & COMMUNITY ISSUES) | 14 | 119
CLINICAL AND EXPERIMENTAL MEDICINE II (NON-INTERNAL MEDICINE SPECIALTIES) | 7 | 117
SOCIAL SCIENCES II (ECONOMICAL & POLITICAL ISSUES) | 15 | 103
Conclusions and discussions
The hard-clustering algorithm for the journal cross-citation analysis provides important
information for studying the cognitive structure of the journal cross-citation network. The
text mining technique allows “labelling” the clusters, making their cognitive
characteristics visible. Through the cross-citation analysis, we have identified three
groups of representative journals in each cluster, namely, “central journals” with the
strongest links, journals with the highest entropy, and “top” journals according to the
PageRank algorithm. These journals play an important part as nodes in the communication
network, although with different functions, and there is a clear divergence between
centrality and high entropy.
A generic classification scheme as well as a global map of science has been generated
based on the cross-citation clustering analysis. This could help to better understand the
cognitive base of scientific knowledge and to adjust the existing delineation of science
subject fields. The comparison of the clustering result with the “intellectual”
classification may reflect the dynamic development of journals, and may also reveal the
underlying convergence and emergence of new fields. The clustering approach has some
obvious advantages: among others, the basis of the classification is quantitatively stable
and testable, and the classification result is generated automatically by the clustering
algorithm, which objectively reflects the underlying affinities among journals. However, we
should also mention the weaknesses of hard clustering: subjects often overlap, and there
are sometimes “poor clusters” as well. For the final classification, we should therefore
combine the results with the “intellectual” approach.
Acknowledgements The research was supported by Steunpunt O&O Indicatoren of the Flemish Government, the National Natural Science Foundation of China (grant no. 70673019), and the China Scholarship Council. We thank Bart Thijs for his assistance in collecting and processing data.
References
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Cambridge: Addison-Wesley.
Bassecoulard, E., & Zitt, M. (1999). Indicators in a research institute: A multi-level classification of journals. Scientometrics, 44(3), 323–345.
Batagelj, V., & Mrvar, A. (2002). Pajek: Analysis and visualization of large networks. Graph Drawing, 2265, 477–478. (ISSN).
Bollen, J., Rodriguez, M. A., & Van De Sompel, H. (2006). Journal status. Scientometrics, 69(3), 669–687.
Borner, K., Chen, C., & Boyack, K. W. (2003). Visualizing knowledge domains. Annual Review of Information Science and Technology, 37, 179–255.
Boyack, K. W., Klavans, R., & Borner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.
Braam, R. R., Moed, H. F., & Van Raan, A. F. J. (1991). Mapping science by combined co-citation and word analysis 1: Structural aspects. Journal of the American Society for Information Science, 42(4), 233–251.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.
Carpenter, M. P., & Narin, F. (1973). Clustering of scientific journals. Journal of the American Society for Information Science, 24(6), 425–436.
Ding, Y., Chowdhury, G., & Foo, S. (2000). Journal as markers of intellectual space: Journal co-citation analysis of information retrieval area, 1987–1997. Scientometrics, 47(1), 55–73.
Garfield, E. (1998). Mapping the world of science. The 150 Anniversary Meeting of the AAAS, Philadelphia, PA, http://www.garfield.library.upenn.edu/papers/mapsciworld.html.
Glanzel, W., & Schubert, A. (2003). A new classification scheme of science fields and subfields designed for scientometric evaluation purposes. Scientometrics, 56(3), 357–367.
Hatcher, E., & Gospodnetic, O. (2004). Lucene in action. New York: Manning Publications Co.
Hicks, D. (1987). Limitations of co-citation analysis as a tool for science policy. Social Studies of Science, 17(2), 295–316.
Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall.
Janssens, F. (2007). Clustering of scientific fields by integrating text mining and bibliometrics. Ph.D. Thesis, Faculty of Engineering, Katholieke Universiteit Leuven, Belgium, http://hdl.handle.net/1979/847.
Janssens, F., Glanzel, W., & De Moor, B. (2008). A hybrid mapping of information science. Scientometrics, 75(3), 607–631.
Janssens, F., Lin, Z., & Glanzel, W. (2009). Hybrid clustering for validation and improvement of subject-classification schemes. Information Processing & Management, 45(6), 683–702.
Jarneving, B. (2005). The combined application of bibliographic coupling and the complete link cluster method in bibliometric science mapping. Ph.D. Thesis, University College of Boras/Goteborg University, Sweden.
Leydesdorff, L. (2002). Dynamic and evolutionary updates of classificatory schemes in scientific journal structures. Journal of the American Society for Information Science and Technology, 53(12), 987–994.
Leydesdorff, L. (2004a). Clusters and maps of science journals based on bi-connected graphs in the Journal Citation Reports. Journal of Documentation, 60(4), 371–427.
Leydesdorff, L. (2004b). Top-down decomposition of the Journal Citation Report of the Social Science Citation Index: Graph- and factor-analytical approaches. Scientometrics, 60(2), 159–180.
Leydesdorff, L. (2006). Can scientific journals be classified in terms of aggregated journal-journal citation relations using the Journal Citation Reports? Journal of the American Society for Information Science & Technology, 57(5), 601–613.
Leydesdorff, L., & Rafols, I. (2008). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2), 348–362.
McCain, K. W. (1991). Mapping economics through the journal literature: An experiment in journal co-citation analysis. Journal of the American Society for Information Science, 42(4), 290–296.
McCain, K. W. (1998). Neural networks research in context: A longitudinal journal co-citation analysis of an emerging interdisciplinary field. Scientometrics, 41(3), 389–410.
Morris, T. A., & McCain, K. W. (1998). The structure of medical informatics journal literature. Journal of the American Medical Informatics Association, 5(5), 448–466.
Narin, F. (1976). Evaluative bibliometrics: The use of publication and citation analysis in the evaluation of scientific activity. Washington, DC: National Science Foundation.
Narin, F., Carpenter, M., & Berlt, N. C. (1972). Interrelationships of scientific journals. Journal of the American Society for Information Science, 23(5), 323–331.
Price, D. J. D. (1965). Networks of scientific papers. Science, 149(3683), 510–515.
Rousseeuw, P. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65.
Tsay, M. Y., Xu, H., & Wu, C. W. (2003). Journal co-citation analysis of semiconductor literature. Scientometrics, 57(1), 7–25.
Zhang, L., Glanzel, W., & Liang, L. M. (2009). Tracing the role of individual journals in a cross-citation network based on different indicators. Scientometrics, 81(3), 821–838.