A clustering-based approach to detect probable outcomes of lawsuits

Undergraduate thesis/final project

Escola de Informática Aplicada - UNIRIO

Author: Daniel Lemes Gribel <[email protected]>

Committee:

Leonardo G. Azevedo 1,2 (supervisor)

Maíra A. C. Gatti 2 (supervisor)

Adriana C. de F. Alvim 1

Sean W. M. Siqueira 1

1 UNIRIO, 2 IBM Research

December 19, 2014

The project idea

IBM Research, 2013: inspired by a Social Media Simulator (SMSim project) developed to predict the behavior of Twitter users.

First idea: model the behavior of judges and then predict lawsuit outcomes through multi-agent simulation, as in SMSim.

New proposal: develop an approach to suggest possible outcomes for a given lawsuit based on modelling, similarity detection and clustering.

Project contributions

Results showed that, by analysing past data, it was possible to verify the most likely outcome and to detect its degree of uncertainty.

Problem statement

A large amount of unstructured data coming from the numerous lawsuits ⇒ a large amount of hidden or unknown information

★ How do we know which similar lawsuits can be a reference to a new lawsuit?

★ How do we estimate the time for taking the decisions?

★ How do we estimate a likelihood for the possible emergent results?


The STF and its responsibilities

The Brazilian Supreme Court (STF) is a body of the Brazilian Judiciary System, responsible for safeguarding and interpreting the Constitution. STF decides matters related to the Constitution, or cases in which there is doubt or controversy regarding legal actions ².

² STF. Institucional. 2011. Available from internet: http://www.stf.jus.br/portal/cms/verTexto.asp?servico=sobreStfConhecaStfInstitucional




STF judgement configuration

STF currently has 11 judges, who sit in its Panels as well as in its Plenary. Decisions are of two kinds:

1. Monocratic: decision taken by a single judge.

2. Collegial: one of the judges acts as rapporteur, each judge votes individually, and the majority decision prevails.

a. First Panel (Primeira Turma): 5 judges.

b. Second Panel (Segunda Turma): 5 judges.

c. Plenary: 11 judges (currently, there is an open position).

Law classes

There are several lawsuit classes in the Brazilian judicial system: Habeas Corpus, Interlocutory Appeal, Extraordinary Appeal, etc.

In this work, only lawsuits belonging to the Appeal class are considered *.

* The choice of the Appeal class was supported by conversations with a professor and a student of the Law School at Fundação Getúlio Vargas (FGV).

Law classes

Appeal: “the instrument to cause a review of a decision by the same judicial authority, or other hierarchically higher, in order to obtain their reform or modification” ³

● Appeals account for more than 50% of the ~1.5M lawsuits judged by STF, which is important in terms of the heterogeneity of the data.

● Appeals have similar dynamics in their life cycles, which is important in terms of pattern detection.

³ Moacyr Amaral Santos, professor, lawyer and minister of the Supreme Court.

Mental modelling

1. Look for an appeal lawsuit page on the STF website and identify its metadata: lawsuit id, period (start and end date), state of origin, rapporteur, author, defendant, type (area of Law) and subjects associated with the lawsuit.

2. Identify the summary and the claim of the lawsuit, found in a document called “Acórdão”.

3. Extract decisions and votes from the “Acórdão”.

Classification and clustering

Clustering goals ⁴:

1. Development of a typology or classification.

2. Investigation of conceptual schemes for grouping entities.

3. Hypothesis generation through data exploration.

4. Hypothesis testing, or the attempt to determine if types defined through other procedures are in fact present in a dataset.

⁴ ALDENDERFER, M. S.; BLASHFIELD, R. K. Cluster Analysis. Beverly Hills: Sage, 1984.

Classification and clustering

[Figure] Adapted from WOOYOUNG, K. Parallel Clustering Algorithms: Survey. Available from internet: http://www.solver.com/hierarchical-clustering-intro


Hierarchical clustering

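[Figure: dendrogram over elements A, B, C, D, E. Merge levels: {A,B} and {D,E}; then {C,D,E}; then {A,B,C,D,E}. Agglomerative clustering reads the tree bottom-up and divisive clustering top-down; a tree cut selects the clusters.]

Adapted from Frontline Solvers. Cluster Analysis. Available from internet: http://www.solver.com/hierarchical-clustering-intro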




Hierarchical clustering

+ Advantages:

● Does not require a pre-defined number of clusters.

● Accepts any valid measure of distance.

● Less influenced by cluster shapes and less sensitive to clusters with different densities.

- Disadvantages:

● Time complexity is in general at least O(n²), which makes these algorithms too slow for large datasets.

Ward’s algorithm

In Ward’s minimum variance criterion, a special case of Ward’s general method, the objective is to minimize the total within-cluster variance.

As a general result, Ward’s minimum variance method leads to compact and spherical clusters.
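A minimal sketch of the criterion, assuming points are numeric tuples and using the within-cluster sum of squares as the variance measure. The helper names (`within_ss`, `ward_cost`, `best_ward_merge`) are hypothetical and are not taken from the project code.

```python
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_cost(a, b):
    """Increase in the total within-cluster sum of squares if a and b are merged."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)

def best_ward_merge(clusters):
    """Indices (i, j) of the pair whose merge increases the objective the least."""
    return min(combinations(range(len(clusters)), 2),
               key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))

# Illustrative data only.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(0.5, 0.5)], [(5.0, 5.0), (6.0, 5.0)]]
print(best_ward_merge(clusters))
```

At each agglomerative step, the pair returned by `best_ward_merge` would be the one to fuse.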


Single-linkage algorithm

In single-linkage clustering, the distance between two clusters is defined by the two elements (one in each cluster) that are closest to each other.

The shortest of these links causes the fusion of the two clusters whose elements are involved.

Complete-linkage algorithm

In complete-linkage clustering, the distance between two clusters is defined by the two elements (one in each cluster) that are farthest away from each other.

The shortest of these links causes the fusion of the two clusters whose elements are involved.
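The two linkage rules differ only in how the inter-cluster link is measured. The sketch below assumes one-dimensional points and an absolute-difference distance purely for illustration; the function names are hypothetical.

```python
def single_linkage(a, b, dist):
    """Link length = distance between the closest pair (one element per cluster)."""
    return min(dist(x, y) for x in a for y in b)

def complete_linkage(a, b, dist):
    """Link length = distance between the farthest pair (one element per cluster)."""
    return max(dist(x, y) for x in a for y in b)

def clusters_to_merge(clusters, linkage, dist):
    """Indices of the two clusters joined by the shortest link under the given rule."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))

def abs_dist(x, y):
    return abs(x - y)

# Illustrative data only.
clusters = [[0.0, 4.0], [5.0], [8.0, 9.0]]
print(clusters_to_merge(clusters, single_linkage, abs_dist))
print(clusters_to_merge(clusters, complete_linkage, abs_dist))
```

On this toy input the two rules choose different pairs, which shows why the choice of linkage can change the resulting clusters.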

Proposed solution


Similarity calculation

From the modelled dataset, calculate the similarities between lawsuits:

1. Each pair of lawsuits receives a similarity coefficient for each property.

2. Then, a mean (resultant) matrix is obtained from the property matrices.

Output: similarity matrix

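Similarity calculation

Similarity metric - Jaccard index: J(A, B) = |A ∩ B| / |A ∪ B|, for two sets A and B.

Mean similarity: the mean of the per-property similarity coefficients for each pair of lawsuits.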
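As a sketch of the two steps, the code below computes one Jaccard matrix per property and then an unweighted mean matrix. It assumes each property value is represented as a set and that all properties weigh the same; the dictionary keys and sample values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

def similarity_matrix(lawsuits, prop):
    """Pairwise Jaccard similarity of all lawsuits for one property."""
    return [[jaccard(x[prop], y[prop]) for y in lawsuits] for x in lawsuits]

def mean_similarity(lawsuits, props):
    """Element-wise unweighted mean of the per-property similarity matrices."""
    mats = [similarity_matrix(lawsuits, p) for p in props]
    n = len(lawsuits)
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
            for i in range(n)]

# Illustrative records only; real lawsuits carry more properties.
lawsuits = [
    {"rapporteur": {"Judge A"}, "subjects": {"tax", "pension"}},
    {"rapporteur": {"Judge A"}, "subjects": {"tax"}},
    {"rapporteur": {"Judge B"}, "subjects": {"labor"}},
]
sim = mean_similarity(lawsuits, ["rapporteur", "subjects"])
print(sim[0][1])
```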

Lawsuits clustering

From the similarities observed, run the hierarchical clustering algorithm.

Output: lawsuits classified into clusters.
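A compact sketch of agglomerative clustering driven by a similarity matrix. It assumes distance = 1 - similarity, a fixed target number of clusters k, and complete linkage by default; none of these choices is confirmed as the configuration used in the project, and the function name is hypothetical.

```python
def agglomerative(sim, k, linkage=max):
    """Merge clusters of item indices until k clusters remain.

    sim: symmetric similarity matrix with values in [0, 1].
    linkage: max gives complete linkage, min gives single linkage.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(sim))]

    def cluster_distance(a, b):
        # Distance is taken as 1 - similarity (an assumption of this sketch).
        return linkage(1.0 - sim[i][j] for i in a for j in b)

    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Illustrative similarity matrix for four lawsuits.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(agglomerative(sim, 2))
```

This naive loop recomputes all pairwise links at every step, so it is only suitable for small inputs such as the 16-lawsuit dataset.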


Lawsuit instance assigning

From the detected clusters, calculate the similarities between the new lawsuit instance and the other lawsuits already classified.

Output: new instance assigned to the most similar cluster.
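One way to sketch the assignment step, assuming "most similar cluster" means the cluster with the highest mean similarity to the new lawsuit. That aggregation rule and the function name are assumptions of this example.

```python
def assign_to_cluster(new_sims, clusters):
    """Return the index of the cluster most similar to the new lawsuit.

    new_sims[i]: similarity between the new lawsuit and already-clustered lawsuit i.
    clusters: list of clusters, each a list of lawsuit indices.
    """
    def mean_sim(cluster):
        return sum(new_sims[i] for i in cluster) / len(cluster)

    return max(range(len(clusters)), key=lambda c: mean_sim(clusters[c]))

# Illustrative values only.
clusters = [[0, 1], [2, 3]]
new_sims = [0.2, 0.1, 0.7, 0.9]
print(assign_to_cluster(new_sims, clusters))
```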


Decisions compilation

Considering a list of judges that will decide the lawsuit:

1. Collect their past votes observed in the cluster.

2. Compute the degree of agreement between them.

For each judge jx, compare his/her decisions with each decision taken by every other judge in the input list, lawsuit by lawsuit.

The ratio (number of common votes) / (number of common decisions) determines the degree of agreement for each judge.

Output: the likely outcome, a number between 0 and 1 indicating the probable decision.
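The agreement ratio can be sketched as follows, under one reading of the description: a "common decision" is a past decision in which both judges voted, and a "common vote" is a common decision in which they voted the same way. How the per-judge ratios are combined into the final outcome is not specified here and is left out. The data layout and names are hypothetical.

```python
def agreement(judge, others, decisions):
    """Degree of agreement of `judge` with the other judges in the input list.

    decisions: one dict per past decision in the cluster, mapping judge name -> vote.
    Returns common votes / common decisions, or None if there is no common decision.
    """
    common_decisions = 0
    common_votes = 0
    for votes in decisions:
        if judge not in votes:
            continue
        for other in others:
            if other in votes:
                common_decisions += 1
                if votes[other] == votes[judge]:
                    common_votes += 1
    if common_decisions == 0:
        return None
    return common_votes / common_decisions

# Illustrative votes only.
decisions = [
    {"A": "grant", "B": "grant", "C": "deny"},
    {"A": "deny", "B": "deny"},
    {"B": "grant", "C": "grant"},
]
print(agreement("A", ["B", "C"], decisions))
```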


Datasets

lawsuit_16.csv: 16 lawsuits

decision_16.csv: 24 decisions

Lawsuits: lawsuit id, start/end date of lawsuit, state of origin, rapporteur, defendant, author, type, subjects, summary and claim.

Decisions: associated lawsuit id, decision id, type of decision, date, votes tuple <judge name, vote> and resultant decision.
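The two record types listed above could be modelled roughly as below. Field names and types are assumptions derived from the listed attributes, not the actual CSV column names, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Decision:
    lawsuit_id: str
    decision_id: str
    decision_type: str
    decided_on: date
    votes: list[tuple[str, str]]  # <judge name, vote> tuples
    result: str                   # resultant decision

@dataclass
class Lawsuit:
    lawsuit_id: str
    start_date: date
    end_date: date
    state_of_origin: str
    rapporteur: str
    defendant: str
    author: str
    law_type: str                 # area of Law
    subjects: list[str]
    summary: str
    claim: str
    decisions: list[Decision] = field(default_factory=list)

# Placeholder values for illustration only.
lawsuit = Lawsuit("example-1", date(2010, 1, 1), date(2012, 1, 1), "RJ",
                  "Judge A", "Defendant X", "Author Y", "Tax",
                  ["tax"], "summary text", "claim text")
lawsuit.decisions.append(
    Decision("example-1", "d1", "collegial", date(2011, 6, 1),
             [("Judge A", "grant"), ("Judge B", "deny")], "grant"))
print(len(lawsuit.decisions))
```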


Similarity analysis

[Figure: similarity matrices for the Rapporteur and Summary properties, on a scale from completely different to completely similar.]

Similarity analysis

[Figure: Mean similarity and Mean similarity (Pearson correlation) matrices.]


Clustering analysis



Agglomerative algorithms performances


Prediction results


...reveals an optimization problem!

● Choosing the correct number k of clusters is not trivial: it depends on the distribution of points in the dataset and on the desired clustering resolution.

● Possible approach: define a search space, overestimate k, and then develop optimization heuristics to determine a new stopping point (k2) when the algorithm finds a good solution.

● A stopping point, in this case, could be reached when the algorithm finds a cluster that is similar enough to the instance being tested and struggles to improve on the best rate found so far.

Main contributions

● By analysing past data, it is possible to find similar cases that were already judged.

● Results showed that it was possible to verify the most likely outcome and to detect the degree of uncertainty of the outcome.

● Prediction results were satisfactory: lawsuit instances were correctly assigned to clusters and the similarity comparison revealed a good coefficient between lawsuits.

Future work

● Use more sophisticated machine learning techniques.

● Investigate a more efficient clustering method than hierarchical clustering, considering optimization issues.

● Discriminate decisions by type.

● Develop a better mechanism to find the weights of lawsuit properties.

● Have a training and a testing dataset. Then, use evaluation metrics to check if predictions match real outcomes.

● Investigate stochastic simulation approaches.

Code and datasets are in a Git repository at bitbucket.org. Contact [email protected] for access!

Thank you! Questions?
