Putting OAC-triclustering on MapReduce
Sergey Zudin, Dmitry V. Gnatyshak, and Dmitry I. Ignatov
National Research University Higher School of Economics, Russian Federation
Faculty of Computer Science
CLA 2015, Clermont-Ferrand, France
October 13-16
S. Zudin et al. () OAC-triclustering on MapReduce CLA 2015 1 / 39
Outline
1 Motivation and previous work
2 Prime OAC-triclustering
    Triadic Formal concept analysis
    Basic algorithm
    Online version of the algorithm
3 OAC-triclustering on MapReduce
    MapReduce technology
    MapReduce implementation
4 Experiments
    Description of the experiments
    Datasets
    Results
5 Conclusion
Motivation
Large amounts of multimodal data:
    Gene expression data
    Folksonomies
    Recommender systems
    Communities in multi-mode (social) networks
    Pattern mining in relational databases
    ...
Non-binary data can be scaled (possibly increasing the dimensionality).
The growing amount of big data requires fast and/or distributed algorithms (linear or sublinear, one-pass).
Existing methods find all n-sets (multimodal clusters) satisfying some conditions (often an exponential number of patterns).
Motivation
IMDB example [Mirkin et al., 2011]
Clump | Movie-Keyword-Genre
Bicluster | {12 Angry Men (1957), To Kill a Mockingbird (1962), Witness for the Prosecution (1957)}, {Murder, Trial}, {n/a}
Tricluster | {12 Angry Men (1957), Double Indemnity (1944), Chinatown (1974), The Big Sleep (1946), Witness for the Prosecution (1957), Dial M for Murder (1954), Shadow of a Doubt (1943)}, {Murder, Trial, Widow, Marriage, Private detective, Blackmail, Letter}, {Crime, Drama, Thriller, Mystery, Film-Noir}
Previous and related work
A short (incomplete) list
Triadic FCA [Wille, 1995; Lehmann and Wille, 1995] and Polyadic FCA [Voutsadakis, 2002]
TRIAS [Jaeschke et al., 2006] for mining (frequent) triconcepts
DataPeeler for closed n-sets [Cerf et al., 2009], MultiDupeHack [Cerf et al., 2013]
TriBox [Mirkin et al., 2011] for mining dense triboxes with the LS criterion
Box OAC-triclustering and Spectral Triclustering [Ignatov et al., 2011, 2013]
Multi-way set enumeration in weight tensors [Scholkopf et al., 2011]
Quadri-concepts for personalised folksonomies [Jelassi et al., 2012, 2013]
Prime OAC-triclustering [Gnatyshak et al., 2012–2014]
Triadic Boolean tensor factorisation [Miettinen et al., 2011; Belohlavek et al., 2013] and Boolean tensor clustering [Miettinen et al., 2015]
Closed and connected patterns in multi-relational data [Spyropoulou et al., 2011–14]
Triadic FCA and triclustering: Searching for optimal patterns. Machine Learning journal [Ignatov et al., 2015] and CLA 2013
. . .
Prime OAC-triclustering
Formal concept analysis: triadic case
Definition
Let G, M, B be sets and let the ternary relation I be a subset of their Cartesian product: I ⊆ G × M × B. Then the tuple K = (G, M, B, I) is called a triadic formal context.
G is a set of objects, M is a set of attributes, B is a set of conditions.
G\M m1 m2 m3 m1 m2 m3 m1 m2 m3
g1 x x x x x x x x
g2 x x x x x
g3 x x x x
g4 x x x x x x
B b1 b2 b3
Definition
Galois operators (prime operators) are defined in a similar way to the dyadic case:
$2^G \to 2^{M \times B}$, $2^G \times 2^M \to 2^B$
$2^M \to 2^{G \times B}$, $2^G \times 2^B \to 2^M$
$2^B \to 2^{G \times M}$, $2^M \times 2^B \to 2^G$
({g1, g2}, {m1,m2})′ = {b1, b3}
m2′ = {(g1, b1), (g2, b1), (g3, b1), (g1, b2), (g1, b3), (g2, b3), (g4, b3)}
Definition
The triple (X, Y, Z) is called a triadic formal concept of the context K = (G, M, B, I) if X ⊆ G, Y ⊆ M, Z ⊆ B, (X, Y)′ = Z, (X, Z)′ = Y, and (Y, Z)′ = X.
X is called the (formal) extent, Y the (formal) intent, and Z the (formal) modus.
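The three closure conditions of the definition can be checked mechanically. The sketch below is only an illustration: it does not use the context table of this slide, but assumes the nine triples from the online-algorithm walkthrough later in the deck. The function names (prime_xy, prime_xz, prime_yz, is_triadic_concept) are hypothetical.

```python
from itertools import product

# Illustrative context: the nine triples of the online-algorithm walkthrough
# in these slides (an assumption; this is not the context table of this slide).
I = {
    ("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
    ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
    ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2"),
}
G = {g for g, _, _ in I}
M = {m for _, m, _ in I}
B = {b for _, _, b in I}


def prime_xy(X, Y):
    """(X, Y)': conditions b with (g, m, b) in I for every g in X, m in Y."""
    return {b for b in B if all((g, m, b) in I for g, m in product(X, Y))}


def prime_xz(X, Z):
    """(X, Z)': attributes m with (g, m, b) in I for every g in X, b in Z."""
    return {m for m in M if all((g, m, b) in I for g, b in product(X, Z))}


def prime_yz(Y, Z):
    """(Y, Z)': objects g with (g, m, b) in I for every m in Y, b in Z."""
    return {g for g in G if all((g, m, b) in I for m, b in product(Y, Z))}


def is_triadic_concept(X, Y, Z):
    """Check the three closure conditions of the definition."""
    return prime_xy(X, Y) == Z and prime_xz(X, Z) == Y and prime_yz(Y, Z) == X


print(is_triadic_concept({"g1", "g2"}, {"m1", "m2"}, {"b1"}))
print(is_triadic_concept({"g1", "g2"}, {"m1", "m2"}, {"b1", "b2"}))
```

On this assumed context the first triple satisfies all three conditions, while the second does not, because (g1, m1, b2) is missing from I.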
Basic algorithm [Gnatyshak et al., 2013]
This method uses the following types of prime operators (for the context K = (G, M, B, I)):
(g, m)′ = {b ∈ B | (g, m, b) ∈ I},
(g, b)′ = {m ∈ M | (g, m, b) ∈ I},
(m, b)′ = {g ∈ G | (g, m, b) ∈ I}
Definition
The triple T = ((m, b)′, (g, b)′, (g, m)′) is called the prime-based OAC-tricluster for a triple (g, m, b) ∈ I. The components of a tricluster are called, respectively, the tricluster extent, intent, and modus. The triple (g, m, b) is called the generating triple of the tricluster T.
Definition
Density of a tricluster: $\rho(X, Y, Z) = \frac{|I \cap (X \times Y \times Z)|}{|X|\,|Y|\,|Z|}$
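As a small worked instance of the two definitions, the following sketch computes the prime-based tricluster of one generating triple and its density. The data are assumed to be the nine triples of the online walkthrough in these slides; oac_tricluster and density are hypothetical helper names.

```python
# Assumed example relation I: the nine triples of the online walkthrough.
I = {
    ("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
    ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
    ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2"),
}


def oac_tricluster(g, m, b, triples):
    """Return ((m, b)', (g, b)', (g, m)') for the generating triple (g, m, b)."""
    extent = frozenset(x for x, y, z in triples if y == m and z == b)
    intent = frozenset(y for x, y, z in triples if x == g and z == b)
    modus = frozenset(z for x, y, z in triples if x == g and y == m)
    return extent, intent, modus


def density(X, Y, Z, triples):
    """|I ∩ (X × Y × Z)| / (|X| |Y| |Z|); `triples` must be a set."""
    covered = sum(1 for g, m, b in triples if g in X and m in Y and b in Z)
    return covered / (len(X) * len(Y) * len(Z))


T = oac_tricluster("g1", "m2", "b1", I)
print(T, density(*T, I))
```

For the generating triple (g1, m2, b1) this gives the tricluster ({g1, g2}, {m1, m2}, {b1, b2}); seven of its eight cells are in I, so the density is 7/8.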
Figure: An example of a tricluster based on a triple (g, m, b).
Input: K = (G, M, B, I), a triadic context; ρmin, a density threshold
Output: 𝒯 = {T = (X, Y, Z)}
𝒯 := ∅
for all (g, m) : g ∈ G, m ∈ M do
    PrimesObjAttr[g, m] = (g, m)′
end for
for all (g, b) : g ∈ G, b ∈ B do
    PrimesObjCond[g, b] = (g, b)′
end for
for all (m, b) : m ∈ M, b ∈ B do
    PrimesAttrCond[m, b] = (m, b)′
end for
for all (g, m, b) ∈ I do
    T = (PrimesAttrCond[m, b], PrimesObjCond[g, b], PrimesObjAttr[g, m])
    Tkey = hash(T)
    if Tkey ∉ 𝒯.keys ∧ ρ(T) ≥ ρmin then
        𝒯[Tkey] := T
    end if
end for
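The basic algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: prime sets are filled in one pass over I rather than by looping over G × M, G × B and M × B, the tricluster itself serves as the dictionary key instead of an explicit hash, and density is computed by a naive scan. The name basic_oac_prime is hypothetical.

```python
from collections import defaultdict


def basic_oac_prime(triples, rho_min=0.0):
    """Return {tricluster: density} for the prime OAC-triclusters of `triples`."""
    triples = set(triples)  # I is a relation, so duplicates are dropped
    primes_oa = defaultdict(set)  # (g, m) -> (g, m)'
    primes_oc = defaultdict(set)  # (g, b) -> (g, b)'
    primes_ac = defaultdict(set)  # (m, b) -> (m, b)'
    # One pass over I fills every non-empty prime set; empty primes are
    # never needed because only triples from I generate triclusters.
    for g, m, b in triples:
        primes_oa[g, m].add(b)
        primes_oc[g, b].add(m)
        primes_ac[m, b].add(g)

    result = {}
    for g, m, b in triples:
        X = frozenset(primes_ac[m, b])
        Y = frozenset(primes_oc[g, b])
        Z = frozenset(primes_oa[g, m])
        T = (X, Y, Z)
        if T in result:
            continue  # same extent, intent and modus already stored
        # Naive density: scan I once per candidate tricluster.
        covered = sum(1 for x, y, z in triples if x in X and y in Y and z in Z)
        rho = covered / (len(X) * len(Y) * len(Z))
        if rho >= rho_min:
            result[T] = rho
    return result


# Assumed input: the nine triples of the online walkthrough in these slides.
example = [
    ("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
    ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
    ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2"),
]
print(len(basic_oac_prime(example)), len(basic_oac_prime(example, rho_min=0.9)))
```

On the assumed input there are five distinct triclusters; one of them has density 7/8, so a threshold of 0.9 keeps four.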
Online version of the algorithm [Gnatyshak et al., 2014]
Let K = (G, M, B, I) be a triadic context. We do not know G, M, B, I, or their cardinalities in advance.
Input on each iteration: {(g, m, b)} = J ⊆ I.
Goal: maintain an up-to-date version of the results and update them efficiently when new triples are received.
We need to keep in memory the results of applying the prime operators (prime sets):
    PrimesObjAttr: dictionary with elements of type ((g, m), {b ∈ B}), g ∈ G, m ∈ M;
    PrimesObjCond: dictionary with elements of type ((g, b), {m ∈ M}), g ∈ G, b ∈ B;
    PrimesAttrCond: dictionary with elements of type ((m, b), {g ∈ G}), m ∈ M, b ∈ B.
Remark
In this case we need to treat triclusters generated by different triples as different, even if their extents, intents, and modi are equal.
Algorithm for adding triples:
Input: J is a set of triples to add;
       𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set;
       PrimesObjAttr, PrimesObjCond, PrimesAttrCond.
Output: 𝒯 = {T = (∗X, ∗Y, ∗Z)};
        PrimesObjAttr, PrimesObjCond, PrimesAttrCond.
for all (g, m, b) ∈ J do
    PrimesObjAttr[g, m] := PrimesObjAttr[g, m] ∪ {b}
    PrimesObjCond[g, b] := PrimesObjCond[g, b] ∪ {m}
    PrimesAttrCond[m, b] := PrimesAttrCond[m, b] ∪ {g}
    𝒯 := 𝒯 ∪ {(&PrimesAttrCond[m, b], &PrimesObjCond[g, b], &PrimesObjAttr[g, m])}
end for
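The pointer notation (∗X, &Primes...) means that each tricluster refers to the live prime sets rather than copies, so earlier triclusters grow automatically as new triples arrive. A minimal Python sketch of this idea follows; the class name OnlineOAC and its attributes are hypothetical, and triclusters are stored in a dictionary keyed by generating triple as one possible reading of the set 𝒯.

```python
from collections import defaultdict


class OnlineOAC:
    """Sketch of the triple-addition step with shared (pointer-like) sets."""

    def __init__(self):
        self.primes_oa = defaultdict(set)  # (g, m) -> set of conditions
        self.primes_oc = defaultdict(set)  # (g, b) -> set of attributes
        self.primes_ac = defaultdict(set)  # (m, b) -> set of objects
        self.triclusters = {}              # generating triple -> (X, Y, Z)

    def add(self, triples):
        for g, m, b in triples:
            self.primes_oa[g, m].add(b)
            self.primes_oc[g, b].add(m)
            self.primes_ac[m, b].add(g)
            # Store references to the live sets, not copies: later updates
            # of a prime set are visible through every tricluster using it.
            self.triclusters[g, m, b] = (
                self.primes_ac[m, b],
                self.primes_oc[g, b],
                self.primes_oa[g, m],
            )


oac = OnlineOAC()
oac.add([("g1", "m1", "b1")])
first = oac.triclusters["g1", "m1", "b1"]
oac.add([("g1", "m2", "b1"), ("g2", "m1", "b1")])
print(first)  # the same tuple now shows the enlarged extent and intent
```

Because only references are stored, each added triple costs a constant number of dictionary and set operations, which matches the one-pass character of the online version.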
A user may need to remove triclusters with the same extent, intent, and modus at the post-processing stage. At this stage we can also check various conditions (for instance, a minimal density condition).
Input: 𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set;
Output: 𝒯out = {T = (∗X, ∗Y, ∗Z)} is the processed tricluster hash-set.
𝒯out := ∅
for all T ∈ 𝒯 do
    compute hash(T)
    if hash(T) ∉ 𝒯out.keys() then
        𝒯out := 𝒯out ∪ {T}
    end if
end for
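The post-processing step amounts to keeping the first tricluster seen for each (extent, intent, modus) value. A minimal sketch, assuming triclusters are given as triples of Python sets; the function name postprocess is hypothetical, and Python's own hashing of frozensets stands in for hash(T).

```python
def postprocess(triclusters):
    """Keep one tricluster per distinct (extent, intent, modus), in input order."""
    unique = {}
    for X, Y, Z in triclusters:
        key = (frozenset(X), frozenset(Y), frozenset(Z))  # hashable snapshot
        if key not in unique:
            unique[key] = key
    return list(unique.values())


# The nine triclusters of the walkthrough in these slides, one per generating triple.
big = ({"g1", "g2"}, {"m1", "m2"}, {"b1", "b2"})
raw = [
    ({"g1", "g2"}, {"m1", "m2"}, {"b1"}),
    big, big, big,
    ({"g3"}, {"m3"}, {"b1", "b2"}),
    ({"g1", "g2"}, {"m2"}, {"b1", "b2"}),
    ({"g2"}, {"m1", "m2"}, {"b1", "b2"}),
    big,
    ({"g3"}, {"m3"}, {"b1", "b2"}),
]
for T in postprocess(raw):
    print(T)
```

A density filter could be applied in the same loop, at the price of one density computation per distinct tricluster.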
Complexity summary:
Time complexity: O(|I|) (as there is a constant number of operations at each step).
More precisely, 8|I| operations in total, that is, 8 per added triple:
    1 Modification of 3 prime sets (3);
    2 Creation of a new tricluster (1);
    3 Addition of pointers to its extent, intent, and modus (3);
    4 Addition of the tricluster to the set of all triclusters (1).
Memory complexity: O(|I|) (as we need to keep in memory only the prime sets: |I| elements in each dictionary, plus keys).
Example:
→ (g1, m1, b1)
    1 PrimesObjAttr = {((g1, m1), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1})}
    3 PrimesAttrCond = {((m1, b1), {g1})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m1, b1], PrimesObjCond[g1, b1], PrimesObjAttr[g1, m1])}
→ (g1, m2, b1)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1, m2})}
    3 PrimesAttrCond = {((m1, b1), {g1}), ((m2, b1), {g1})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b1], PrimesObjCond[g1, b1], PrimesObjAttr[g1, m2])}
→ (g2, m1, b1)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1})}
    3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m1, b1], PrimesObjCond[g2, b1], PrimesObjAttr[g2, m1])}
→ (g2, m2, b1)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2})}
    3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b1], PrimesObjCond[g2, b1], PrimesObjAttr[g2, m2])}
→ (g3, m3, b1)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3})}
    3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m3, b1], PrimesObjCond[g3, b1], PrimesObjAttr[g3, m3])}
→ (g1, m2, b2)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2})}
    3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b2], PrimesObjCond[g1, b2], PrimesObjAttr[g1, m2])}
→ (g2, m1, b2)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2), {b1}), ((g3, m3), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2}), ((g2, b2), {m1})}
    3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1}), ((m1, b2), {g2})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m1, b2], PrimesObjCond[g2, b2], PrimesObjAttr[g2, m1])}
→ (g2, m2, b2)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2), {b1, b2}), ((g3, m3), {b1})}
    2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2}), ((g2, b2), {m1, m2})}
    3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1, g2}), ((m1, b2), {g2})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b2], PrimesObjCond[g2, b2], PrimesObjAttr[g2, m2])}
→ (g3, m3, b2)
    1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2), {b1, b2}), ((g3, m3), {b1, b2})}
    2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2}), ((g2, b2), {m1, m2}), ((g3, b2), {m3})}
    3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1, g2}), ((m1, b2), {g2}), ((m3, b2), {g3})}
    4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m3, b2], PrimesObjCond[g3, b2], PrimesObjAttr[g3, m3])}
Postprocessing:
    1 T(g1, m1, b1) = ({g1, g2}, {m1, m2}, {b1}) ← add
    2 T(g1, m2, b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← add
    3 T(g2, m1, b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1, m2, b1), skip
    4 T(g2, m2, b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1, m2, b1), skip
    5 T(g3, m3, b1) = ({g3}, {m3}, {b1, b2}) ← add
    6 T(g1, m2, b2) = ({g1, g2}, {m2}, {b1, b2}) ← add
    7 T(g2, m1, b2) = ({g2}, {m1, m2}, {b1, b2}) ← add
    8 T(g2, m2, b2) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1, m2, b1), skip
    9 T(g3, m3, b2) = ({g3}, {m3}, {b1, b2}) ← the same as T(g3, m3, b1), skip
The final output set of triclusters:
    1 T1 = ({g1, g2}, {m1, m2}, {b1})
    2 T2 = ({g1, g2}, {m1, m2}, {b1, b2})
    3 T3 = ({g3}, {m3}, {b1, b2})
    4 T4 = ({g1, g2}, {m2}, {b1, b2})
    5 T5 = ({g2}, {m1, m2}, {b1, b2})
MapReduce Technology
MapReduce scheme [Dean and Ghemawat, 2004]
MapReduce example
Figure: Word counting. Source: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
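Since the word-counting figure is not reproduced in this transcript, the same idea can be sketched as a single-process simulation of the map, shuffle and reduce phases. This is plain Python, not Hadoop code; the function names and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word of one input document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["deer bear river", "car car river", "deer car bear"]))
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network, which is why the communication cost discussed next matters.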
Communication costs: Mining of Massive Datasets [Leskovec et al., 2013]
Chapter 2: MapReduce and the New Software Stack
“Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.”
MapReduce Implementation
Previous lattice-oriented M/R implementations
A version of the Close-by-One algorithm was ported to the M/R framework [Krajca & Vychodil, 2009]
An M/R algorithm for computing closed cube lattices was proposed [Kudryavcev & Kuznecov, 2009]
[Xu et al., 2012] demonstrated that iterative algorithms like Ganter’s NextClosure can benefit from iterative M/R schemes
Technologies and code repositories
Technologies used:
    Apache Hadoop (http://hadoop.apache.org/)
    Apache Maven (framework for automatic project assembly)
    Apache Commons (for working with extended Java collections)
    Google Guava (utilities and data structures)
    Jackson JSON (open-source library for converting the object-oriented representation of an object, such as a tricluster, to a string)
    TypeTools (for real-time type resolution of inbound and outbound key-value pairs)
    ...
Implementations:
    Source 1: “Chaining-job” module (https://github.com/zydins/chaining-job)
    Source 2: M/R-based OAC Triclustering (https://github.com/zydins/DistributedTriclustering)
Two-stage MapReduce Implementation
Distributed OAC-triclustering: First Map
Input: S is a set of input triples as strings;
       r is the number of reducers;
       i is a grouping index (objects, attributes or conditions).
Output: J is a list of ⟨key, triple⟩ pairs.
for all s ∈ S do
    t := transform(s)
    key := hash(t[i]) mod r
    J := J ∪ {⟨key, t⟩}
end for
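A minimal Python sketch of the first map follows. It is not the Hadoop mapper of the authors: the tab-separated input format, the function names and the use of CRC32 as a stable hash are assumptions (Python's built-in hash of strings changes between runs, so it would not partition reproducibly).

```python
import zlib


def stable_hash(value):
    """Deterministic hash of a string (CRC32), stable across processes."""
    return zlib.crc32(value.encode("utf-8"))


def first_map(lines, r, i=0, sep="\t"):
    """Yield (key, triple) pairs; key selects one of r reducers.

    i is the grouping index: 0 = object, 1 = attribute, 2 = condition.
    The separator and line format are assumptions for this sketch.
    """
    for s in lines:
        t = tuple(s.strip().split(sep))
        if len(t) != 3:
            continue  # skip malformed lines
        yield stable_hash(t[i]) % r, t


lines = ["g1\tm1\tb1", "g1\tm2\tb1", "g2\tm1\tb1"]
for key, triple in first_map(lines, r=4):
    print(key, triple)
```

With i = 0, all triples of one object receive the same key and therefore reach the same reducer, which is what makes the prime sets (g, m)′ and (g, b)′ complete inside that reducer.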
Distributed OAC-triclustering: First Reduce
Input: J is a list of triples (for a certain key);
       𝒯 = {T = (X, Y, Z)} is the current set of triclusters;
       PrimesOA, PrimesOC, PrimesAC.
Output: file of strings, the encoded ⟨triple, tricluster⟩ pairs.
Primes ← initialise a new multimap
for all (g, m, b) ∈ J do
    Primes[g, m] := Primes[g, m] ∪ {b}
    Primes[g, b] := Primes[g, b] ∪ {m}
    Primes[m, b] := Primes[m, b] ∪ {g}
end for
for all (g, m, b) ∈ J do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    s := encode(⟨(g, m, b), T⟩)
    store s
end for
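The first reduce can be sketched as follows. The slides mention Jackson JSON for serialisation, so JSON is used here, but the exact record layout (the "triple" and "tricluster" field names) is an assumption, as is the tagging of multimap keys with "OA", "OC" and "AC" to keep the three kinds of pairs apart.

```python
import json
from collections import defaultdict


def first_reduce(group):
    """group: list of (g, m, b) triples that share one reducer key.

    Returns JSON strings encoding (triple, partial tricluster) pairs.
    """
    primes = defaultdict(set)
    for g, m, b in group:
        primes["OA", g, m].add(b)
        primes["OC", g, b].add(m)
        primes["AC", m, b].add(g)
    encoded = []
    for g, m, b in group:
        tricluster = [
            sorted(primes["AC", m, b]),  # extent seen by this reducer only
            sorted(primes["OC", g, b]),  # intent
            sorted(primes["OA", g, m]),  # modus
        ]
        encoded.append(json.dumps({"triple": [g, m, b], "tricluster": tricluster}))
    return encoded


# Triples of object g1 from the walkthrough, as one reducer would see them.
group_g1 = [("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g1", "m2", "b2")]
for line in first_reduce(group_g1):
    print(line)
```

Note that when grouping by objects the extent in each record is only partial (here it contains g1 alone); the second stage has to merge these partial sets.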
Distributed OAC-triclustering: Second Map
Input: S is a list of strings.
Output: 𝒯 is a list of ⟨tricluster, tricluster⟩ pairs.
Primes ← initialise a new multimap
I := ∅
for all s ∈ S do
    ⟨(g, m, b), T⟩ := decode(s)
    update the Primes multimap appropriately
    I := I ∪ {(g, m, b)}
end for
for all (g, m, b) ∈ I do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    𝒯 := 𝒯 ∪ {⟨T, T⟩}
end for
Distributed OAC-triclustering: Second Reduce
Input: 𝒯 is a list of ⟨tricluster, list of triclusters⟩ pairs.
Output: file with the final set of triclusters {T = (X, Y, Z)}.
for all ⟨T, [T, ..., T]⟩ ∈ 𝒯 do
    store T
end for
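The second stage can be sketched as below. It assumes that "update Primes appropriately" means taking the union of the partial extents, intents and modi coming from the first stage; this reading, the function names and the in-memory record format are assumptions rather than the authors' code.

```python
from collections import defaultdict


def second_map(records):
    """records: decoded (triple, partial tricluster) pairs from stage one."""
    primes = defaultdict(set)
    triples = []
    for (g, m, b), (X, Y, Z) in records:
        primes["AC", m, b] |= set(X)  # merge partial extents
        primes["OC", g, b] |= set(Y)  # merge partial intents
        primes["OA", g, m] |= set(Z)  # merge partial modi
        triples.append((g, m, b))
    for g, m, b in triples:
        T = (
            frozenset(primes["AC", m, b]),
            frozenset(primes["OC", g, b]),
            frozenset(primes["OA", g, m]),
        )
        yield T, T  # the tricluster is both key and value


def second_reduce(pairs):
    """Group by key and store each distinct tricluster once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return list(grouped)


# Stage-one output for the walkthrough triples, grouped by object.
records = [
    (("g1", "m1", "b1"), (["g1"], ["m1", "m2"], ["b1"])),
    (("g1", "m2", "b1"), (["g1"], ["m1", "m2"], ["b1", "b2"])),
    (("g1", "m2", "b2"), (["g1"], ["m2"], ["b1", "b2"])),
    (("g2", "m1", "b1"), (["g2"], ["m1", "m2"], ["b1", "b2"])),
    (("g2", "m2", "b1"), (["g2"], ["m1", "m2"], ["b1", "b2"])),
    (("g2", "m1", "b2"), (["g2"], ["m1", "m2"], ["b1", "b2"])),
    (("g2", "m2", "b2"), (["g2"], ["m1", "m2"], ["b1", "b2"])),
    (("g3", "m3", "b1"), (["g3"], ["m3"], ["b1", "b2"])),
    (("g3", "m3", "b2"), (["g3"], ["m3"], ["b1", "b2"])),
]
final = second_reduce(second_map(records))
print(len(final))
```

On these records the merge restores the full prime sets, and the reducer keeps the five distinct triclusters of the online example.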
Communication costs
The time complexity of the M/R solution is composed of two terms, one for each stage: O(|I|/r) (or O(|I|)) and O(|I|).
The replication rate for the first M/R stage is $r_1 = 1$ (each triple is passed as one key-value pair); the reducer size is $q_1 = |I|/r$.
The replication rate for the second M/R stage is $r_2 = 1$ (it assigns one key-value pair to each tricluster), but the reducer size varies from $q_2^{\min} = 1$ (no duplicate triclusters) to $q_2^{\max} = |I|$ (one final tricluster when all the initial triples belong to one absolutely dense cuboid).
Experiments
Description of the experiments
OS X 10, 1.8 GHz Intel Core i5, 4 GB of 1600 MHz DDR3 memory, and 8 GB of free space on the hard drive (typical commodity hardware).
Two M/R modes have been tested: sequential task completion, and an emulation of the distributed mode with 16 reducers for the first stage and 32 threads for the second stage.
To evaluate the runtime more carefully, the average over 5 runs of each algorithm has been recorded for each context.
Datasets
Synthetic datasets: 1) 20,000 triples (25 unique entities of each type); 2) 100,000 triples (50 unique entities of each type); 3) 1,000,000 triples (all possible combinations of 100 unique entities of each type).
The 1st dataset contains duplicates, since 25 × 25 × 25 gives only 15,625 unique triples. The 2nd one contains fewer triples than $50^3 = 125{,}000$, the number of all possible combinations. The 3rd one is an absolutely dense 100 × 100 × 100 cuboid.
The 3rd dataset does not result in $3^{\min(|G|,|M|,|B|)}$ formal triconcepts; it is an example of the worst-case scenario for the second reducer ($q_2^{\max} = |I|$).
IMDB. Top-250 list of the best movies from the Internet Movie Database.
Bibsonomy. The data of bibsonomy.org from the ECML PKDD Discovery Challenge 2008.

Context     |G|     |M|      |B|      # triples   Density
20k         25      25       25       20,000      1
100k        50      50       50       100,000     0.8
1m          100     100      100      1,000,000   1
IMDB        250     795      22       3,818       0.00087
BibSonomy   2,337   67,464   28,920   816,197     1.8 · 10^−7
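The density column follows from the other columns as the number of unique triples divided by |G| · |M| · |B|. The short check below recomputes it; for the 20k dataset it assumes, as the slide text suggests, that the 20,000 generated triples cover all 15,625 possible cells. The function name context_density is hypothetical.

```python
def context_density(n_unique_triples, n_objects, n_attributes, n_conditions):
    """Density of a triadic context: |I| / (|G| * |M| * |B|)."""
    return n_unique_triples / (n_objects * n_attributes * n_conditions)


# (unique triples, |G|, |M|, |B|) taken from the table above.
rows = {
    "20k": (25 ** 3, 25, 25, 25),  # assumption: duplicates collapse to all 15,625 cells
    "100k": (100_000, 50, 50, 50),
    "1m": (1_000_000, 100, 100, 100),
    "IMDB": (3_818, 250, 795, 22),
    "BibSonomy": (816_197, 2_337, 67_464, 28_920),
}
for name, row in rows.items():
    print(name, f"{context_density(*row):.2g}")
```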
Results

Algorithm/Context      IMDB            20k       100k      1m        Bibsonomy
                       (≈3k triples)   triples   triples   triples   (≈800k triples)
Tribox                 324             800       1,265     >3,000    >3,000
TRIAS                  189             362       862       >3,000    >3,000
OAC Box                374             756       1,265     >3,000    >3,000
OAC Prime              7               8         734       >3,000    >3,000
Online OAC prime       3               3         3         5         >3,000
M/R OAC prime seq.     12              30        81        166       1,534
M/R OAC prime distr.   1               15        20        25        520
Alternative MapReduce decomposition
Variant I: First stage
First Map: Finding primes. During this phase every input triple (g, m, b) is encoded by three key-value pairs ⟨(g, m), b⟩, ⟨(g, b), m⟩, and ⟨(m, b), g⟩. These pairs are passed to the first reducer.
The replication rate is $r_1 = 3$.
First Reduce: Finding primes. This reducer fills three corresponding dictionaries for primes of keys. For example, the first dictionary, PrimeOA, contains key-value pairs ⟨(g, m), {b1, b2, ..., bn}⟩.
The reducer size is $q_1 = \max(|G|, |M|, |B|)$.
The process can be stopped after the first reduce phase, with all the triclusters found as (Prime[g, m], Prime[g, b], Prime[m, b]) by enumerating each (g, m, b) ∈ I. However, to do this faster and keep the result for further computation, M/R can be used for this step as well.
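The first stage of Variant I can be simulated in one process as follows. This is a sketch under assumptions: keys are tagged with "OA", "OC" and "AC" so the three kinds of pairs cannot collide, the function names are hypothetical, and the triclusters are returned in (extent, intent, modus) order, that is, (Prime[m, b], Prime[g, b], Prime[g, m]).

```python
from collections import defaultdict


def first_map(triple):
    """Emit the three key-value pairs of one input triple."""
    g, m, b = triple
    yield ("OA", g, m), b
    yield ("OC", g, b), m
    yield ("AC", m, b), g


def first_reduce(pairs):
    """Collect the values of each key into its prime set."""
    primes = defaultdict(set)
    for key, value in pairs:
        primes[key].add(value)
    return primes


def triclusters_after_stage_one(triples):
    """Stop after the first reduce and enumerate (g, m, b) in I directly."""
    triples = list(triples)  # assumed non-empty
    emitted = [kv for t in triples for kv in first_map(t)]
    primes = first_reduce(emitted)
    replication_rate = len(emitted) / len(triples)
    found = {
        (g, m, b): (primes["AC", m, b], primes["OC", g, b], primes["OA", g, m])
        for g, m, b in triples
    }
    return found, replication_rate


example = [
    ("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
    ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
    ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2"),
]
found, rate = triclusters_after_stage_one(example)
print(rate, found["g1", "m2", "b1"])
```

Every triple produces exactly three pairs, so the measured replication rate equals 3 regardless of the input.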
Variant I: Second stage
Second Map: Tricluster generation. The second map does the tricluster combining job: for each triple (g, m, b) it composes a new key-value pair ⟨(g, m, b), ∅⟩. For each pair of each type, ⟨(g, m), Prime[g, m]⟩, ⟨(g, b), Prime[g, b]⟩, and ⟨(m, b), Prime[m, b]⟩, it generates key-value pairs ⟨(g, m, b), Prime[g, m]⟩, ⟨(g, m, b), Prime[g, b]⟩, and ⟨(g, m, b), Prime[m, b]⟩, where g ∈ G, m ∈ M, and b ∈ B.
$r_2 = \frac{|I| + 3|G||M||B|}{|I| + |G||M| + |G||B| + |M||B|} \le \frac{\rho + 3}{\rho + 3/\max(|G|, |M|, |B|)}$, where ρ is the density of the input tricontext.
Second Reduce: Tricluster generation. For each key (g, m, b), the generating triple, the second reducer assembles a single value, its tricluster (Prime[g, m], Prime[g, b], Prime[m, b]). If there is no key-value pair ⟨(g, m, b), ∅⟩ for a particular triple (g, m, b), it does not output any key-value pair for that key.
The reducer size $q_2$ is either 3 (no output) or 4 (tricluster assembled).
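A single-process sketch of this second stage is given below. It assumes the primes arrive as a dictionary with tagged keys ("OA", g, m), ("OC", g, b), ("AC", m, b), uses None as the ∅ marker, and outputs triclusters in (extent, intent, modus) order; all names and the tiny input are illustrative assumptions.

```python
from collections import defaultdict


def second_map(triples, primes, G, M, B):
    """Emit the marker pair per triple and expand every prime pair."""
    for triple in triples:
        yield triple, None  # the pair <(g, m, b), ∅> marks a real triple
    for (kind, a, c), values in primes.items():
        payload = (kind, frozenset(values))
        if kind == "OA":    # key (g, m): expand over all conditions
            for b in B:
                yield (a, c, b), payload
        elif kind == "OC":  # key (g, b): expand over all attributes
            for m in M:
                yield (a, m, c), payload
        else:               # "AC", key (m, b): expand over all objects
            for g in G:
                yield (g, a, c), payload


def second_reduce(pairs):
    """Assemble one tricluster per key that carries the marker."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    out = {}
    for key, values in groups.items():
        if None not in values:
            continue  # (g, m, b) is not in I: no output for this key
        parts = dict(v for v in values if v is not None)
        out[key] = (parts["AC"], parts["OC"], parts["OA"])
    return out


triples = [("g1", "m1", "b1"), ("g2", "m1", "b1"), ("g1", "m2", "b2")]
primes = {
    ("OA", "g1", "m1"): {"b1"}, ("OA", "g2", "m1"): {"b1"}, ("OA", "g1", "m2"): {"b2"},
    ("OC", "g1", "b1"): {"m1"}, ("OC", "g2", "b1"): {"m1"}, ("OC", "g1", "b2"): {"m2"},
    ("AC", "m1", "b1"): {"g1", "g2"}, ("AC", "m2", "b2"): {"g1"},
}
result = second_reduce(second_map(triples, primes, {"g1", "g2"}, {"m1", "m2"}, {"b1", "b2"}))
print(result)
```

The expansion over the whole third dimension is what makes the replication rate of this stage large; keys that never receive the marker are simply dropped by the reducer.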
Variant II: Second stage
Second Map: Tricluster generation with duplicate generating triples. The second map does the tricluster combining job: for each triple (g, m, b) it composes a new key-value pair ⟨(Prime[g, m], Prime[g, b], Prime[m, b]), (g, m, b)⟩.
Second Reduce: Tricluster generation with duplicate generating triples. The second reducer just groups the values for each key: ⟨(X, Y, Z), {(g1, m1, b1), ..., (gn, mn, bn)}⟩.
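Variant II keys the second stage by the tricluster itself, so each final tricluster keeps the list of its generating triples. A minimal sketch, assuming the primes from the first stage are available as a dictionary with tagged keys and using (extent, intent, modus) order for the key; the names and the tiny input are hypothetical.

```python
from collections import defaultdict


def second_map(triples, primes):
    """Emit <tricluster, generating triple> for every input triple."""
    for g, m, b in triples:
        key = (
            frozenset(primes["AC", m, b]),
            frozenset(primes["OC", g, b]),
            frozenset(primes["OA", g, m]),
        )
        yield key, (g, m, b)


def second_reduce(pairs):
    """Group generating triples by their tricluster."""
    groups = defaultdict(list)
    for tricluster, triple in pairs:
        groups[tricluster].append(triple)
    return dict(groups)


triples = [("g1", "m1", "b1"), ("g2", "m1", "b1"), ("g1", "m2", "b2")]
primes = {
    ("OA", "g1", "m1"): {"b1"}, ("OA", "g2", "m1"): {"b1"}, ("OA", "g1", "m2"): {"b2"},
    ("OC", "g1", "b1"): {"m1"}, ("OC", "g2", "b1"): {"m1"}, ("OC", "g1", "b2"): {"m2"},
    ("AC", "m1", "b1"): {"g1", "g2"}, ("AC", "m2", "b2"): {"g1"},
}
grouped = second_reduce(second_map(triples, primes))
for tricluster, generators in grouped.items():
    print(tricluster, generators)
```

The number of generating triples per tricluster is also the reducer size for that key, and it could be used for ranking or filtering triclusters.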
Conclusion and further work
A MapReduce implementation of prime OAC-triclustering has been proposed.
Communication costs have been analysed.
The online version and the M/R version have been compared.
Further experiments are needed with other M/R variants and other triclustering algorithms.
A proper comparison of the proposed OAC-triclustering with noise-tolerant patterns in n-ary relations, e.g., those found by DataPeeler descendants [Cerf et al., 2013], has not yet been conducted.