A clustering-based approach to detect probable outcomes of lawsuits
Undergraduate thesis/final project
Escola de Informática Aplicada - UNIRIO
Author: Daniel Lemes Gribel <[email protected]>
Committee:
Leonardo G. Azevedo 1,2 (supervisor)
Maíra A. C. Gatti 2 (supervisor)
Adriana C. de F. Alvim 1
Sean W. M. Siqueira 1
1 UNIRIO, 2 IBM Research
December 19, 2014
The project idea
IBM Research, 2013: inspired by a Social Media Simulator (SMSim project) developed to predict the behavior of Twitter users.
First idea: to model judges' behavior and then predict lawsuit outcomes through multi-agent simulation, as in SMSim.
New proposal: develop an approach to suggest possible outcomes for a given lawsuit based on modelling, similarity detection and clustering.
Project contributions
Results showed that, by analysing past data, it was possible to identify the most likely outcome and to estimate its degree of uncertainty.
Problem statement
A large amount of unstructured data coming from numerous lawsuits ⇒ a large amount of hidden or unknown information
★ How do we know which similar lawsuits can be a reference for a new lawsuit?
★ How do we estimate the time it takes to reach a decision?
★ How do we estimate a likelihood for the possible emergent results?
The STF and its responsibilities
The Brazilian Supreme Court (STF) is a body of the Brazilian judiciary system, responsible for safeguarding and interpreting the Constitution. The STF decides matters related to the Constitution or cases where there is doubt or controversy regarding legal actions ².
² STF. Institucional. 2011. Available from internet: http://www.stf.jus.br/portal/cms/verTexto.asp?servico=sobreStfConhecaStfInstitucional
STF judgement configuration
The STF currently comprises 11 judges, who sit on its Panels as well as in its Plenary.
1. Monocratic: decision taken by a single judge.
2. Collegial: one of the judges acts as rapporteur, each judge votes individually, and the majority decision prevails.
   a. First Panel (Primeira Turma): 5 judges.
   b. Second Panel (Segunda Turma): 5 judges.
   c. Plenary: 11 judges (currently, there is an open position).
Law classes
There are several lawsuit classes in the Brazilian judicial system: Habeas Corpus, Interlocutory Appeal, Extraordinary Appeal, etc.
In this work, only lawsuits belonging to the Appeal class are considered *.
* The choice of the Appeal class was supported by conversations with a professor and a student at the Law School of Fundação Getúlio Vargas (FGV).
Appeal: “the instrument to cause a review of a decision by the same judicial authority, or other hierarchically higher, in order to obtain their reform or modification” ³
● Appeals account for more than 50% of the ~1.5M lawsuits judged by the STF, which is important in terms of the heterogeneity of the data.
● Appeals have similar dynamics in their life cycles, which is important in terms of pattern detection.
³ Moacyr Amaral Santos, professor, lawyer and minister of the Supreme Court.
Mental modelling
1. Look up an appeal lawsuit page on the STF website and identify its metadata: lawsuit id, period (start and end date), state of origin, rapporteur, author, defendant, type (area of Law) and subjects associated with the lawsuit.
2. Identify the summary and the claim of the lawsuit, found in a document called “Acórdão”.
3. Extract decisions and votes from the “Acórdão”.
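The three steps above suggest a simple record layout. The following is only a minimal sketch of how the extracted fields could be held in memory; the class names, field names and the example record are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Set


@dataclass
class Decision:
    decision_id: str
    lawsuit_id: str
    decision_type: str            # e.g. "monocratic" or "collegial"
    decided_on: date
    votes: Dict[str, str]         # judge name -> vote
    result: str                   # resultant decision


@dataclass
class Lawsuit:
    lawsuit_id: str
    start: date
    end: date
    state_of_origin: str
    rapporteur: str
    author: str
    defendant: str
    law_type: str                 # area of Law
    subjects: Set[str] = field(default_factory=set)
    summary: str = ""             # taken from the "Acórdão"
    claim: str = ""               # taken from the "Acórdão"
    decisions: List[Decision] = field(default_factory=list)

    def duration_days(self) -> int:
        """Length of the lawsuit's life cycle in days."""
        return (self.end - self.start).days


# Hypothetical record, invented for illustration only.
example = Lawsuit(
    lawsuit_id="X-1", start=date(2010, 1, 1), end=date(2010, 1, 31),
    state_of_origin="RJ", rapporteur="Judge A", author="Author A",
    defendant="Defendant A", law_type="Tax", subjects={"tax", "pension"},
)
```

Keeping subjects as a set makes set-based similarity measures straightforward to apply later.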
Classification and clustering
Clustering goals ⁴:
1. Development of a typology or classification.
2. Investigation of conceptual schemes for grouping entities.
3. Hypothesis generation through data exploration.
4. Hypothesis testing, or the attempt to determine if types defined through other procedures are in fact present in a dataset.
⁴ ALDENDERFER, M. S.; BLASHFIELD, R. K. Cluster Analysis. Beverly Hills: Sage, 1984.
Classification and clustering
Adapted from WOOYOUNG, K. Parallel Clustering Algorithms: Survey. Available from internet: http://www.solver.com/hierarchical-clustering-intro
Hierarchical clustering
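[Figure: dendrogram over elements A, B, C, D, E. Clusters A,B and D,E form first, then C,D,E, then A,B,C,D,E. Agglomerative clustering builds the tree bottom-up and divisive clustering builds it top-down; a tree cut at a chosen level yields the clusters.]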
Adapted from Frontline Solvers. Cluster Analysis. Available from internet: http://www.solver.com/hierarchical-clustering-intro
+ Advantages:
● Does not require a pre-defined number of clusters.
● Accepts any valid measure of distance.
● Less influenced by cluster shapes and less sensitive to clusters with different densities.
- Disadvantages:
● Time complexity, in general at least O(n²), makes these algorithms too slow for large datasets.
Ward’s algorithm
In Ward’s minimum variance criterion, a special case of Ward’s general method, the objective is to minimize the total within-cluster variance.
In general, Ward’s minimum variance method leads to compact and spherical clusters.
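As an illustration of the criterion, the sketch below computes how much the total within-cluster sum of squares grows when two clusters are merged; Ward's method merges the pair with the smallest increase. The function names and the sample points are hypothetical, and the points are assumed to be numeric vectors.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length numeric vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)


# Hypothetical one-dimensional clusters.
a = [[0.0], [2.0]]
near = [[3.0]]
far = [[10.0]]

cost_near = ward_cost(a, near)
cost_far = ward_cost(a, far)
```

Here merging `a` with `near` costs less than merging `a` with `far`, so Ward's criterion would prefer the first merge.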
Single-linkage algorithm
In single-linkage clustering, the distance between two clusters is defined by the two elements (one in each cluster) that are closest to each other.
The shortest of these links causes the fusion of the two clusters whose elements are involved.
Complete-linkage algorithm
In complete-linkage clustering, the distance between two clusters is defined by the two elements (one in each cluster) that are farthest away from each other.
The shortest of these links causes the fusion of the two clusters whose elements are involved.
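The difference between the two linkage rules can be seen in a small pure-Python sketch. It is a naive agglomerative loop over a precomputed distance matrix that stops when k clusters remain (the tree cut); the function names and the sample points are hypothetical, and this is not the implementation used in the project.

```python
def cluster_distance(dist, a, b, linkage):
    """Distance between clusters a and b (lists of indices) under a linkage rule."""
    pair_distances = [dist[i][j] for i in a for j in b]
    # single: closest pair of elements; complete: farthest pair of elements
    return min(pair_distances) if linkage == "single" else max(pair_distances)


def agglomerate(dist, k, linkage="single"):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(dist, clusters[x], clusters[y], linkage)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # fuse the shortest link
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical points on a line, used only to build a distance matrix.
points = [0.0, 1.0, 2.5, 4.5, 8.0]
dist = [[abs(p - q) for q in points] for p in points]

single = agglomerate(dist, k=3, linkage="single")
complete = agglomerate(dist, k=3, linkage="complete")
```

On these points single linkage chains the third point onto the first two, while complete linkage pairs it with the fourth, which shows how the linkage rule alone can change the clusters.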
Proposed solution
Similarity calculation
From the modelled dataset, calculate the similarities between lawsuits:
1. Each pair of lawsuits receives a similarity coefficient for each property.
2. Then, a mean (resultant) matrix is obtained from the property matrices.
Output: Similarity matrix
Similarity metric - Jaccard index: J(A, B) = |A ∩ B| / |A ∪ B|
Mean similarity: the mean, over all properties, of the similarity coefficients of a pair of lawsuits.
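A minimal sketch of the two quantities follows, assuming each property value is represented as a set (a single-valued property such as the rapporteur becomes a one-element set) and that the mean is a plain unweighted average. Both are assumptions of this sketch, as are the function names and the sample data.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def property_matrix(lawsuits, prop):
    """Pairwise Jaccard similarity of all lawsuits for one property."""
    return [[jaccard(x[prop], y[prop]) for y in lawsuits] for x in lawsuits]


def mean_matrix(matrices):
    """Element-wise unweighted mean of several equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]


# Hypothetical lawsuits with two properties each.
lawsuits = [
    {"rapporteur": {"Judge A"}, "subjects": {"tax", "pension"}},
    {"rapporteur": {"Judge A"}, "subjects": {"tax", "labor"}},
    {"rapporteur": {"Judge B"}, "subjects": {"labor"}},
]

by_property = [property_matrix(lawsuits, p) for p in ("rapporteur", "subjects")]
similarity = mean_matrix(by_property)
```

If properties carry different weights, the plain average in `mean_matrix` would be replaced by a weighted one.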
Lawsuits clustering
From the similarities observed, run the hierarchical clustering algorithm.
Output: lawsuits classified into clusters.
Lawsuit instance assignment
From the detected clusters, calculate the similarities between the new lawsuit instance and the other lawsuits already classified.
Output: new instance assigned to the most similar cluster.
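One way to read this step is sketched below. It assumes that "most similar cluster" means the cluster with the highest average similarity between its members and the new instance; the slides do not state the exact rule, so this choice, the function name and the sample values are all hypothetical.

```python
def assign_to_cluster(similarity_to_new, clusters):
    """Return the label of the cluster whose members are, on average, most similar.

    similarity_to_new: dict mapping lawsuit id -> similarity to the new instance
    clusters: dict mapping cluster label -> list of lawsuit ids
    """
    def mean_similarity(members):
        return sum(similarity_to_new[m] for m in members) / len(members)

    return max(clusters, key=lambda label: mean_similarity(clusters[label]))


# Hypothetical clusters and similarities.
clusters = {"c1": ["a", "b"], "c2": ["c"]}
similarity_to_new = {"a": 0.8, "b": 0.6, "c": 0.5}

assigned = assign_to_cluster(similarity_to_new, clusters)
```

Other rules, such as using the single most similar member of each cluster, would fit the description equally well.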
Decisions compilation
Considering a list of judges that will decide the lawsuit:
1. Collect their past votes observed in the cluster.
2. Compute the degree of agreement between them.
For each judge jx, compare his or her decisions with those of every other judge in the input list, lawsuit by lawsuit.
The ratio (number of common votes) / (number of common decisions) determines the degree of agreement for each judge.
Output: the likely outcome – a number between 0 and 1, indicating the probable decision.
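The degree of agreement can be sketched as follows. The sketch assumes that a "common decision" is a past decision in which both judges voted and that a "common vote" is one where their votes match. How the agreement degrees are combined into the final number between 0 and 1 is not described here, so that step is left out. The function name and the sample votes are hypothetical.

```python
def agreement(judge, judges, decisions):
    """Ratio of matching votes to shared decisions between one judge and the others.

    judges: the input list of judges that will decide the lawsuit
    decisions: list of dicts (judge name -> vote), one per past decision in the cluster
    Returns None when the judge shares no decision with any other judge.
    """
    common_decisions = 0
    common_votes = 0
    for votes in decisions:
        if judge not in votes:
            continue
        for other in judges:
            if other == judge or other not in votes:
                continue
            common_decisions += 1
            if votes[other] == votes[judge]:
                common_votes += 1
    return common_votes / common_decisions if common_decisions else None


# Hypothetical past votes observed in one cluster.
judges = ["A", "B", "C"]
decisions = [
    {"A": "grant", "B": "grant", "C": "deny"},
    {"A": "deny", "B": "deny"},
    {"B": "grant", "C": "grant"},
]

degrees = {j: agreement(j, judges, decisions) for j in judges}
```

In this sample, judge B agrees with the others most often and judge C least often.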
Datasets
lawsuit_16.csv: 16 lawsuits
decision_16.csv: 24 decisions
Lawsuits: lawsuit id, start/end date of lawsuit, state of origin, rapporteur, defendant, author, type, subjects, summary and claim.
Decisions: associated lawsuit id, decision id, type of decision, date, votes tuple <judge name, vote> and resultant decision.
Similarity analysis
[Figure: panels "Rapporteur" and "Summary"; scale from completely similar to completely different.]
[Figure: panels "Mean similarity" and "Mean similarity (Pearson correlation)"; scale from completely similar to completely different.]
Clustering analysis
[Figure: clustering analysis; scale from completely similar to completely different.]
Performance of the agglomerative algorithms
Prediction results
reveals an… Optimization problem!
● The correct choice of the number k of clusters is not trivial, depending on the distribution of points in a dataset and on the desired clustering resolution.
● Possible approach: define a search space, overestimate k, and then develop optimization heuristics to determine a new stopping point (k2) when the algorithm finds a good solution.
● A stopping point, in this case, could be when the algorithm finds a cluster that is similar enough to the instance being tested and struggles to improve on the best rate found so far.
Main contributions
● By analysing past data, it is possible to find similar cases that were already judged.
● Results showed that it was possible to identify the most likely outcome and to estimate its degree of uncertainty.
● Prediction results were satisfactory: lawsuit instances were correctly assigned to clusters and the similarity comparison revealed a good coefficient between lawsuits.
Future work
● Use more sophisticated machine learning techniques.
● Investigate a more efficient clustering method than hierarchical clustering, considering optimization issues.
● Discriminate decisions by type.
● Develop a better mechanism to find the weights of lawsuit properties.
● Have a training and a testing dataset. Then, use evaluation metrics to check if predictions match real outcomes.
● Investigate stochastic simulation approaches.
Code and datasets are in a Git repository at bitbucket.org. Contact [email protected] for access!
Thank you! Questions?