Jusletter IT

Toward AI-enhanced Computer-supported Peer Review in Legal Education

  • Authors: Kevin Ashley / Ilya Goldin
  • Category: Scientific Articles
  • Region: USA
  • Field of law: AI & Law
  • Citation: Kevin Ashley / Ilya Goldin, Toward AI-enhanced Computer-supported Peer Review in Legal Education, in: Jusletter IT 12 September 2012
Applying Bayesian data analysis to model a computer-supported peer-review process in a law class writing exercise yielded pedagogically useful information about students' understanding of problem-specific legal concepts and of more general domain-relevant legal writing criteria, as well as about the criteria's effectiveness. The approach suggests how AI and Law can have an impact on legal education.

Table of Contents

  • 1. Introduction
  • 2. Peer review as social network discussion
  • 3. Problem-specific legal-concept review criteria
  • 4. Application of hierarchical Bayesian modelling
  • 5. Discussion
  • 6. Conclusions
  • 7. References

1. Introduction

[1]
While a few law students may encounter AI and Law research papers in specialized seminars, AI and Law systems have not had much direct impact on the way law students are educated. Most systems fielded or tested in legal classrooms have used an AI model of legal reasoning to engage students in legal problem-solving [13] or in making or diagramming legal arguments (see e.g., [1; 3; 5; 6]).
[2]
This paper examines how AI can help legal education via computer-supported peer review (CSPR) among students. Students’ conceptual knowledge is a classroom resource that can benefit other students. Students may know a good deal, or may acquire relevant knowledge while working on an assignment. CSPR can harness students’ developing knowledge and apply it for the benefit of the class [14].
[3]
CSPR integrates students’ knowledge in solving legal problems with the computer’s ability to manage the extensive data generated in a peer review exercise. When using a CSPR system such as SWoRD [7] or Comrade [9; 10], (1) students write their compositions as per the instructor’s assignment and submit them to the system. (2) The system distributes the compositions to a group of N student peers for review. (3) Using a set of instructor-specified review criteria and forms, the reviewers assess the authors’ works against those criteria and submit their feedback via the system, assigning numerical ratings for each criterion and providing written justifications. The authors receive the anonymous reviews, (4) rate their helpfulness in back-reviews submitted to the system, and possibly (5) revise their drafts.
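The data such a workflow produces can be pictured as a small set of records per exercise. The sketch below is a minimal illustration in Python; the class and field names are hypothetical, not the actual SWoRD or Comrade schema.

```python
# Illustrative data model for a CSPR exercise (hypothetical names, not the
# Comrade/SWoRD schema): each submission accumulates reviews, and each
# review can later receive a back-review rating from the author.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Review:
    reviewer_id: str
    author_id: str
    criterion: str                     # e.g. "tsm" or "argument development"
    rating: int                        # 7-point scale used in the study
    comment: str
    helpfulness: Optional[int] = None  # back-review rating from the author

@dataclass
class Submission:
    author_id: str
    text: str
    reviews: list = field(default_factory=list)   # list of Review objects
```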
[4]
Instructors may direct peer review by devising the rubric of review criteria [2]. Criteria may be oriented to legal concepts, i.e., focused on how the concepts apply in specific contexts and on identifying arguments for, and counterarguments against, the concepts’ application given specific facts. Peer assessment according to such a problem-specific (PS), legal-concept-oriented rubric may show instructors how well students understand legal concepts. Instructors may also use a rubric of domain-relevant (DR) criteria that address more general skills of legal writing.
[5]
In a recent study with law students in an Intellectual Property (IP) course engaged in a CSPR exercise, we randomly assigned students to two conditions, one using only PS criteria and the other only DR criteria. The PS criteria specifically addressed five legal claims raised by the problem scenario. The DR criteria addressed four general legal writing skills: issue identification, argument development, justifying an overall conclusion, and writing quality. Each student gave feedback to and received feedback from four others in the same condition. For each criterion, reviewers rated peer works on a 7-point scale (grounded at 1, 3, 5, 7) and wrote comments, yielding two datasets: PS and DR.
[6]
We inquired how well hierarchical Bayesian modeling, an AI/statistical method, could mine these datasets to answer instructor questions regarding the review criteria and legal concepts: Q1. Are the rating criteria noisy, appropriately grounded, and mutually independent? Q2. What is the relative importance of the criteria, i.e., how well do peer ratings on the various criteria predict the instructor’s scores? Q3. Which legal concepts were challenging? Q4. Which students are struggling with which legal concepts? We chose hierarchical Bayesian modeling as well suited to analyzing the hierarchically structured information in a network of authors and peer reviewers.

2. Peer review as social network discussion

[7]
A peer-review exercise is a discussion within a social network of students focused on the topic of the review criteria. Students evidence their understanding of the criteria both as authors and as reviewers. Peer review may be seen as a directed graph (Figure 1).
[8]
Each node is a student, and an arc connects a reviewer (arc tail) to an author (arc head). Feedback given by a student reviewer is said to be the student’s outbound feedback, and feedback received by a student author is said to be the student’s inbound feedback.
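This graph view is easy to operationalize. The sketch below uses the networkx library with invented student names and ratings; it illustrates the representation only and is not part of the Comrade system.

```python
# Peer review as a directed multigraph: nodes are students, each arc runs
# from reviewer (tail) to author (head) and carries one rating.
import networkx as nx

G = nx.MultiDiGraph()
G.add_edge("student_a", "student_b", criterion="tsm", rating=5)
G.add_edge("student_c", "student_b", criterion="tsm", rating=6)
G.add_edge("student_b", "student_a", criterion="nda", rating=4)

def inbound_feedback(graph, student):
    """Feedback received by `student` as an author (incoming arcs)."""
    return [data for _, _, data in graph.in_edges(student, data=True)]

def outbound_feedback(graph, student):
    """Feedback given by `student` as a reviewer (outgoing arcs)."""
    return [data for _, _, data in graph.out_edges(student, data=True)]

print(inbound_feedback(G, "student_b"))   # two inbound reviews
print(outbound_feedback(G, "student_b"))  # one outbound review
```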
[9]
The graph representation of peer review can support inference about a student’s understanding of conceptual issues addressed by the rubric. First, each student provides evidence of his understanding in multiple roles: as an author, as a reviewer, and as a back-reviewer critiquing reviews. Second, the evidence is made more reliable by having multiple data points (multiple inbound or outbound arcs) for each role. We applied statistical techniques that, in principle, can take into account progressively more of the information associated with these multiple participants in their multiple roles in the peer-review process, although, as reported below, our models have not yet fully accounted for student roles of reviewer and back-reviewer.

3. Problem-specific legal-concept review criteria

[10]
The open-ended problem scenario in Figure 2 was assigned as an essay-type, midterm examination in the IP course. The instructor (Ashley) designed the single question’s factual scenario to raise a number of legal claims and issues addressed in the first third of the course. Students submitted their answers to Comrade, a CSPR system [10], for review by classmates. Meanwhile, the instructor graded the exams independently.
[11]
In providing advice on particular parties’ rights and liabilities, students were expected to analyze the facts, identify the claims and issues raised, make arguments pro and con resolution of the issues in terms of the concepts, rules, and cases discussed in class, and make recommendations accordingly. Students should identify the different kinds of ideas and information that could be protected under the relevant IP laws: an iPhone-based instrument-controller interface, the idea of the Guitar-Pyro game, the name «Guitar-Pyro», the Guitar-Pyro images, and the synchronization method and code. Students should consider not only the IP claims of the primary intellectual property claimant, Prof. Smith, against others but also the IP claims of others against him:
  1. Smith v. Barry for breach of the nondisclosure/noncompetition agreement (nda),
  2. Smith v. Barry and VG for trade-secret misappropriation (tsm),
  3. Jack v. Smith for misappropriating the iPhone-based instrument-controller interface idea (idea1),
  4. Barry v. Smith for misappropriating Barry’s idea for the design of a Jimi-Hydrox-related look with flames for winning (idea2),
  5. Estate of Jimi Hydrox v. Smith for violating the right of publicity (rop),

as well as unfair competition, and passing off (federal Lanham Act s. 1125(a)).

Students needed to address some general issues across IP claims (e.g., the extent and nature of the alleged infringer’s use of the ideas or information required for misappropriation or infringement, and the degree of similarity required between the claimant’s and the alleged infringer’s ideas and information). Since the instructor included factual weaknesses as well as strengths for each claim, the problem is ill-defined; that is, there is no one right answer, but competing reasonable arguments can be made. In the instructor’s view, given the course materials and problem, students should make arguments for and against each of the five claims, citing cases as shown in Table 1.

[12]
The PS rubric focused on the five IP claims in Table 1. For each claim, students rated the author’s answer using the rating scale in Figure 3. By contrast, the DR rubric did not mention specific claims but prompted reviewers to rate (on a 7-point Likert scale) how well the author does at issue identification (identifies and clearly explains all relevant legal issues; does not raise irrelevant issues), argument development (for all legal issues, applies principles, doctrines, and precedents; considers counterarguments), justifying overall conclusion (assesses strengths and weaknesses of parties’ legal positions in detail; recommends and justifies an overall conclusion), and writing quality (makes insightful, clear arguments in a well-organized manner).
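To make the contrast concrete, each rubric can be treated as a small piece of configuration from which the review form is generated. The encoding below is an illustrative assumption; only the criterion labels and the 7-point scale come from the study.

```python
# Hypothetical encoding of the two rubrics used in the study.
PS_RUBRIC = {
    "scale": [1, 2, 3, 4, 5, 6, 7],   # anchored at 1, 3, 5, 7
    "criteria": ["nda", "tsm", "idea1", "idea2", "rop"],
}
DR_RUBRIC = {
    "scale": [1, 2, 3, 4, 5, 6, 7],
    "criteria": ["issue identification", "argument development",
                 "justifying overall conclusion", "writing quality"],
}
```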
[14]
Both kinds of rubrics are pedagogically valuable but they also represent tradeoffs. A problem-specific rubric has to be tailored for each new problem, but it reflects the substantive course content the instructor regarded as important. A DR rubric is more general and could apply to many legal writing assignments, yet it does not specifically capture how well students have mastered the course’s substantive content.
[15]
We have analyzed the peer ratings elicited by the two rubrics. To evaluate the validity of peer review with the two rubrics, we correlated the peer ratings of students’ essays with the instructor’s independently assigned scores. Each type of rubric is valid as determined by similarity to instructor scores [9]; for each condition, the Pearson correlations of the mean inbound peer ratings with the instructor’s scores were significant. Neither rubric consistently produced reliable peer scoring, but this may be unsurprising, and even desirable, with open-ended legal problems.
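The validity check reported here is straightforward to reproduce for any peer-review dataset; a sketch using SciPy with placeholder numbers (not the study data):

```python
# Correlate each author's mean inbound peer rating with the instructor's
# independently assigned score; the arrays are invented examples.
import numpy as np
from scipy.stats import pearsonr

instructor_scores = np.array([78.0, 85.0, 62.0, 90.0, 71.0])
mean_inbound_ratings = np.array([5.2, 6.1, 4.0, 6.5, 4.8])

r, p_value = pearsonr(mean_inbound_ratings, instructor_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
```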
[16]

We also examined the peer comments elicited by the two rubrics for any differences. In a preliminary categorization of a sample of PS comments, we focused on the level of detail in the discussion of DR aspects and, separately, PS aspects (Table 2). We observed that problem-specific rubric criteria can elicit comments that are high or low in either PS or DR detail. This justifies to some extent our motivation for exploring the different rubrics. Although comments are always made (and delivered) in the context of some rubric criterion, even our preliminary investigation shows that a problem-specific prompt is no guarantee that a comment will mention problem-specific aspects. We will analyze comments more systematically, perhaps in terms of the depth and breadth of a comment’s discussion of PS/DR aspects. Given that a trade-secret misappropriation claim rests on multiple factors, identifying a factor, tying it to the facts, and making an argument could be construed as an indicator of depth, whereas mentioning multiple relevant factors could indicate breadth.

4. Application of hierarchical Bayesian modelling

[17]
An instructor who looks for insight into a peer review exercise by examining individual ratings or comments will be hindered by a lack of context and overwhelmed by the large quantities of data. Although it would be desirable to mine useful information from the raw peer review data, fitting models to the data is challenging due to repeated measures (each paper receives multiple reviews), sparseness (any student reviews only a few papers), and nesting (papers are assessed on multiple criteria).
[18]
To address this, we applied hierarchical Bayesian modeling to peer review [8]. The models estimated parameters such as the quality of student works, as assessed by peer reviewers, and the variability across reviewers. These and other parameters (Table 3) can only be estimated, not observed directly. Each model maps between a response variable (here, the instructor’s independently assigned exam scores) and explanatory variables (the peer ratings within the social network). Markov chain Monte Carlo (MCMC) sampling computes all the quantities of interest at once using all the available data; different parameters help estimate each other according to specified expressions of likelihood. The analysis yields «a complete posterior distribution over the conjoint parameter space, [indicating] the relative credibility of every possible combination of parameter values.» [12]
[19]
The models were validated separately on the PS and DR datasets. For each dataset, the best-fitting model was selected with the Deviance Information Criterion (DIC).
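As background (this definition is not spelled out in the paper), DIC combines the posterior mean deviance with a penalty for the effective number of parameters:

$$
\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}), \qquad D(\theta) = -2\log p(y \mid \theta),
$$

where $\bar{D}$ is the posterior mean of the deviance, $\bar{\theta}$ is the posterior mean of the parameters, and $p_D$ is the effective number of parameters; lower DIC indicates a better tradeoff between fit and complexity.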
[20]
DIC quantifies the tradeoff between how well the model fits the data, defined as deviance, and its complexity, the effective number of parameters in the model. This complexity is computed at «run time» from how information is shared across groups in a multilevel model, rather than at «compile time» from the mathematical model’s structure. Each model took into account progressively more information from the social network, both in terms of different combinations of the parameters in Table 3 and of the hierarchical structural aspects of the peer-review process represented. The simplest model (5.1a) is a regression of the midterm scores as a function of the pupils’ inbound peer ratings. It averaged all the ratings received by a peer author, treating all of a pupil’s inbound peer ratings as interchangeable, ignoring the distinct rating dimensions, and treating all authors as independent. In a more complex model (5.2a), student authors are not considered independent. Instead, the model shares (i.e., pools) key parameters so that what is known about authors as a group informs the estimates for individual authors, and vice versa. In addition, it evaluates the utility of the rating criteria by representing them separately rather than together, as in the first model.
[21]

A comparison of the more formal descriptions of the models in Table 4 illustrates these differences. Both models treat the response variable (the instructor-assigned midterm exam score) Yp as normally distributed, with mean µp, the per-pupil knowledge estimate, and overall variance estimate σ2 (Table 4, row 2). In the terminology of hierarchical modeling [8], model 5.1a is a «no pooling» regression because it does not pool (i.e., share) information across pupils. Each pupil is described via an individual intercept αp, an overall (between pupils) variance σ2, and an individual mean peer rating X1p with weight β1 (Table 4, row 2). A pupil’s inbound peer ratings are treated as normally distributed according to the pupil’s individual mean X1p and individual ratings variance σ2p[IPR]. Lacking strong prior beliefs, the individual pupil means X1p are treated as normally distributed with «uninformative» priors (i.e., to avoid biasing the model): a mean of 0 and a variance of 1000.
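A minimal sketch of a no-pooling model in this spirit, written with PyMC. The variable names mirror the notation above; the data are invented, and the priors and the handling of the per-pupil intercepts are illustrative assumptions rather than the authors' implementation (with a single exam score per pupil, per-pupil intercepts under a vague prior are only weakly identified).

```python
# Model 5.1a (sketch): instructor score Y_p regressed on the latent
# per-pupil mean inbound rating X1_p, with no pooling across pupils.
import numpy as np
import pymc as pm

n_pupils = 4
scores = np.array([72.0, 85.0, 64.0, 90.0])                     # Y_p (placeholder)
ratings = np.array([5.0, 6, 4, 5, 3, 4, 6, 7, 5, 6, 7, 6])      # inbound peer ratings
rating_author = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # author of each rating

with pm.Model() as model_51a:
    # Latent per-pupil mean rating X1_p with a vague prior (variance 1000)
    x1 = pm.Normal("x1", mu=0.0, sigma=np.sqrt(1000.0), shape=n_pupils)
    # Per-pupil ratings variance sigma^2_p[IPR]
    sigma_ipr = pm.HalfNormal("sigma_ipr", sigma=5.0, shape=n_pupils)
    pm.Normal("ipr", mu=x1[rating_author],
              sigma=sigma_ipr[rating_author], observed=ratings)

    # Regression of the instructor's score on the latent mean rating
    alpha = pm.Normal("alpha", mu=0.0, sigma=np.sqrt(1000.0), shape=n_pupils)
    beta1 = pm.Normal("beta1", mu=0.0, sigma=10.0)
    sigma_y = pm.HalfNormal("sigma_y", sigma=20.0)
    pm.Normal("Y", mu=alpha + beta1 * x1, sigma=sigma_y, observed=scores)
```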

[22]

In Model 5.2a, by contrast, information is shared (partially pooled) across pupils; what the model learns about one student informs the estimates of other students’ parameters. The individual pupil intercepts αp are no longer treated as independent but are drawn from a common distribution that is estimated from, and in turn constrains, the estimates of the individual students’ intercepts.

[23]

Model 5.2a also differs by incorporating information on the distinct criteria of the inbound peer ratings (IPRs). Each observed IPR is modeled as normally distributed with mean Xnp, corresponding to the average of the ratings received by author p on rating criterion n, and a per-criterion variance σ2n[IPR] that is shared across all pupils (Table 4, row 3). The explanatory variable matrix X is altered to include one column per rating criterion, and the lone regression coefficient β1 is replaced by a regression coefficient βn for each rating criterion n (i.e., 4 domain-relevant criteria and 5 problem-specific criteria). Within each criterion, individual pupils’ mean inbound peer ratings Xnp are pooled by stipulating a shared «uninformative» prior distribution across students. Peer ratings for each criterion are centered about that criterion’s respective mean. Because fewer ratings are observed per pupil per criterion, the variances σ2n[IPR] were fitted per rating dimension rather than per pupil.
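A corresponding sketch of the partially pooled, per-criterion model. Again the data and priors are placeholders, and the per-criterion centering of ratings described above is omitted for brevity; this illustrates the structure and is not the authors' code.

```python
# Model 5.2a (sketch): partially pooled intercepts, per-criterion latent
# means X_np with a shared per-criterion ratings variance, and one
# regression coefficient beta_n per criterion.
import numpy as np
import pymc as pm

n_pupils, n_criteria = 4, 5
scores = np.array([72.0, 85.0, 64.0, 90.0])           # instructor scores (placeholder)
ratings = np.array([5.0, 6, 4, 5, 3, 4, 6, 7, 5, 6])  # inbound peer ratings
rating_author = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 3])
rating_criterion = np.array([0, 1, 0, 2, 1, 3, 4, 0, 2, 4])

with pm.Model() as model_52a:
    # Partial pooling: intercepts alpha_p drawn from a common distribution
    mu_alpha = pm.Normal("mu_alpha", mu=0.0, sigma=50.0)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=20.0)
    alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_pupils)

    # Per-criterion latent means X_np with a shared vague prior and a
    # per-criterion ratings variance sigma^2_n[IPR]
    x = pm.Normal("x", mu=0.0, sigma=np.sqrt(1000.0),
                  shape=(n_criteria, n_pupils))
    sigma_ipr = pm.HalfNormal("sigma_ipr", sigma=5.0, shape=n_criteria)
    pm.Normal("ipr", mu=x[rating_criterion, rating_author],
              sigma=sigma_ipr[rating_criterion], observed=ratings)

    # One regression coefficient per rating criterion
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=n_criteria)
    sigma_y = pm.HalfNormal("sigma_y", sigma=20.0)
    mu_y = alpha + pm.math.dot(beta, x)   # alpha_p + sum_n beta_n * X_np
    pm.Normal("Y", mu=mu_y, sigma=sigma_y, observed=scores)
```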

[24]

A third model, 5.1b, adds partial pooling for the intercepts αp and for the pupils’ individual means of inbound ratings X1p, but does not distinguish rating criteria.

[25]
Each model was run separately for the two conditions, one on the PS and the other on the DR datasets. Each model was fit three times, with a different randomly determined starting value for each MCMC simulation. The simulations were examined to ensure that they converged in their estimates of parameter values. Each fit was allowed 100,000 iterations, with 20,000 initial iterations discarded to avoid bias.
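Continuing the sketches above, the run just described might look as follows with PyMC and ArviZ; the iteration counts mirror the text, although the paper's analysis was presumably run with other software and modern samplers typically need far fewer draws.

```python
# Three chains with different random starting points; 20,000 tuning
# iterations are discarded and 80,000 draws retained per chain, mirroring
# the 100,000-iteration runs described in the text.
import arviz as az
import pymc as pm

with model_52a:                      # the partial-pooling sketch above
    trace = pm.sample(draws=80_000, tune=20_000, chains=3, random_seed=7)

# The r_hat column (values near 1.0) is a standard convergence check.
print(az.summary(trace, var_names=["beta"]))
```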
[26]
Taking Model 5.1a as a baseline, we found that (1) adding partial pooling (in model 5.1b) improved fit significantly over baseline for both PS and DR datasets and (2) distinguishing the rating criteria in 5.2a improved the fit over model 5.1b for the PS dataset, but hurt the fit for the DR dataset.

5. Discussion

[27]

We found that the hierarchical statistical analysis generated pedagogical information useful for instructors. The βn regression coefficients represent the importance of each criterion’s average inbound peer rating for estimating the instructor’s score (Q2). Two of the five problem-specific criteria, idea2 and rop, contributed to estimating the instructor’s score for at least 95% of the students, and a third, tsm, contributed for at least 80% of the students (Table 5). By distinguishing among the rating criteria, model 5.2a improved the fit to the data (i.e., reduced DIC) for the PS criteria.

[28]

Consistency in the signs of the βn coefficients provides a built-in sanity check for the model. A negative βn (as for the conclusion and writing DR criteria, Table 5) implies, implausibly, that high performance on, say, justifying a conclusion corresponds to a reduction in the instructor’s score. Such sign inconsistency may be due to collinearity (redundancy) of the rating dimensions. The fact that the βn for the problem-specific criteria (Table 5, bottom) were all positive is a good sign that these criteria are mutually independent and linearly additive (Q1).

[29]

Narrow credible intervals on the βn showed that the model is confident in its coefficient estimates for four of the five PS criteria (i.e., all but idea1) (Q1). The wide credible interval for idea1 may indicate considerable noise in how reviewers applied that criterion. Among the four domain-relevant criteria, estimates are confident for argument and issue but not for conclusion or writing. The combination of low confidence and negative signs for the conclusion and writing dimensions likely reflects high pairwise correlation between the mean inbound peer ratings for the DR criteria (Q1). Such correlation may cause instability and interactions among the βn coefficients (even though it may not affect overall model fit). We found pairwise correlations between the mean inbound peer ratings for all pairs of the DR rating dimensions, indicating that the dimensions may provide redundant information to authors; such correlations can also be computed over the Bayesian estimates of Xnp (Q1).

[30]
Among the problem-specific criteria, we found pairwise correlations between mean inbound peer ratings for tsm vs. idea1 and tsm vs. nda. Coefficients that were estimated confidently yet were found not to be significant contributors to estimating the instructor’s score (e.g., nda, tsm) may also indicate redundancy (Q1). As an empirical matter, these two problem-specific criteria correlated significantly. They are also conceptually linked: noncompetition agreements are not enforceable unless trade secrets are at stake.
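The redundancy checks described here reduce to a criterion-by-criterion correlation matrix over mean inbound ratings (or over the Bayesian estimates Xnp). A sketch with synthetic data for 28 papers and the five PS criteria:

```python
# Pairwise Pearson correlations between mean inbound peer ratings across
# rating criteria; high off-diagonal values suggest redundant criteria.
# The matrix is randomly generated for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
mean_ratings = pd.DataFrame(rng.normal(5, 1, size=(28, 5)),
                            columns=["nda", "tsm", "idea1", "idea2", "rop"])
print(mean_ratings.corr(method="pearson").round(2))
```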
[31]

Knowing that some rubric criteria are problematic may suggest to the instructor that they need to be omitted or redefined. In addition, a criterion may not be adequately anchored; e.g., low σ2n[IPR] may indicate a criterion with too coarse a scale.

[32]

Model 5.2a can also show which concepts challenged which students by examining Xnp, the assessment of the work of author p on criterion n aggregated across peer reviewers. Positive skew in the distribution of Xnp is a sign that the concept challenges many students. Students whose Xnp estimates are lower likely find the concept more challenging than students with higher Xnp estimates (Q4). As a validity measure of the aggregated peer ratings, non-Bayesian (i.e., classical) means of the ratings on the n=28 papers in the problem-specific condition correlated significantly with the ratings of a trained rater for each of the problem-specific criteria except idea2 [9]; this correlation could also be computed with respect to Xnp (Q3). Intuitively, idea2 may have challenged students because it was a claim against the main IP claimant, it is weak, and it is the second instance of a claim of misappropriation of ideas; test-taker «common sense» may have misled students, who may not have expected to encounter two instances of the same type of claim in one test question.
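A sketch of how an instructor-facing tool might surface these diagnostics from the posterior estimates of Xnp; the matrix below is synthetic, and the choice of the three lowest-rated authors is an arbitrary illustration.

```python
# For each criterion: skew of the estimated X_np distribution (positive
# skew suggests many students found the concept challenging) and the
# authors with the lowest estimates (candidates for follow-up).
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
criteria = ["nda", "tsm", "idea1", "idea2", "rop"]
x_np = rng.normal(5, 1, size=(5, 28))       # posterior means of X_np (synthetic)

for n, name in enumerate(criteria):
    lowest = np.argsort(x_np[n])[:3]        # three lowest-rated authors
    print(f"{name}: skew = {skew(x_np[n]):+.2f}, lowest authors = {lowest}")
```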

6. Conclusions

[33]

This work points to new applications for AI in a legal educational context. The problem in Figure 2 challenges any current AI and Law program. Although Table 1 mentions trade secret factors and ownership of employee-generated information [1; 4], it is unlikely that any current program could drive intelligent tutoring for such problems.

[34]

AI can still play a role, however. Bayesian models of computer-supported peer review yield pedagogically useful information about student learning and about grading schemes. This suggests that there may be opportunities for further research in legal education to clarify links among content, instruction, and assessment without the burden of having to develop strong domain models. Additionally, Bayesian models could be combined with other AI techniques to assist instructors. Since the models estimate many parameters, an intelligent user interface could manage these outputs. In related work, we are developing a «teacher-side dashboard» to provide teachers a comprehensible global overview of aggregated peer-review information while allowing them to drill down and efficiently retrieve detailed assignment information for any particular student who, according to the statistical analysis, is struggling with particular concepts. We are also developing machine learning (ML) tools to automatically process free-text feedback in peer reviews, as well as first and revised drafts of author papers, to detect pedagogically useful information for teachers, such as (1) the types of changes peers comment on and authors implement as a result of feedback or (2) the presence of recurring comments across reviewers.

[35]

Additionally, authors could use argument diagramming [3; 11] to prepare or summarize arguments applying problem-specific criteria to legal problems, and students could review each other’s diagrams. Argument-diagramming support systems [3], equipped with argumentation schemes and critical questions [11], would help students construct and review their diagrams; model-generated argument diagrams could serve as examples. AI techniques such as ML and Natural Language Processing could help reviewers improve their feedback so that it targets specific problems and suggests solutions [15]. The Bayesian models would then analyze student data from both the argument-diagramming and subsequent argument-writing exercises.

[36]

Intriguingly, CSPR in legal education may even help develop larger-scale AI and Law models. An instructor who creates an exam problem defines an answer key that relates broad legal concepts (e.g., the right of publicity) to component concepts (e.g., whether the right is descendible) and ultimately to generic facts. These relations amount to a rubric (even more problem-specific than the PS rubric) on which to base a review interface. Across multiple peer-review exercises, this relational information can link diverse legal concepts and generic facts in an ontology. Since the labor costs of creating ontologies are high, crowdsourcing by students and instructors could enable new applications in legal education and in AI and Law.

7. References

[1] Aleven, V. (2003) Using background knowledge in case-based legal reasoning: a computational model and an intelligent learning environment. Artif Intell 150:183–238.

 

[2] Andrade, H. (2000). Using rubrics to promote thinking and learning. Educational Leadership, 57(5),13.

 

[3] Ashley, K. (2009) Teaching a process model of legal argument with hypotheticals. Artificial Intelligence and Law, 17(4), 321–370.

 

[4] Ashley, K. and S. Brüninghaus (2006) «Computer Models for Legal Prediction.» Jurimetrics 46, 309–352.

 

[5] Carr, C. (2003) Using computer supported argument visualization to teach legal argumentation. In: Visualizing argumentation, 75–96. Springer, London.

 

[6] Centina, F., Routen, T., Hartmann, A. and Hegarty, C. (1995) Statutor: Too Intelligent by Half? Legal Knowledge Based Systems, Jurix ’95: Telecommunication and AI & Law:121–131.

 

[7] Cho, K., Schunn, C.D. (2007) Scaffolded writing and rewriting in the discipline: A web-based reciprocal peer review system. Computers and Education. 48.

 

[8] Gelman, A., Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

 

[9] Goldin, I. M. (2011, April 29). A Focus on Content: The Use of Rubrics in Peer Review to Guide Students and Instructors (PhD Dissertation). University of Pittsburgh, Pittsburgh, PA. http://etd.library.pitt.edu/ETD/available/etd-07142011-004329/

 

[10] Goldin, I.M., Ashley, K.D. (2010). Eliciting informative feedback in peer review: importance of problem-specific scaffolding. In: 10th International Conference on Intelligent Tutoring Systems, Pittsburgh.

 

[11] Gordon, T. F., Prakken, H., & Walton, D. (2007). The Carneades model of argument and burden of proof. Artificial Intelligence, 171(10–15), 875–896.

 

[12] Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14(7), 293–300.

 

[13] Muntjewerff, A.J. (2009). ICT in Legal Education. In: German Law Journal (GLJ). Special Issue, Vol. 10, No. 06, pp. 359–406.

 

[14] Topping, K. J. (1998). Peer assessment between students in colleges and universities. Review of Educational Research, 68(3), 249–276.

 

[15] Xiong, W., Litman, D., & Schunn, C. D. (2010). Assessing Reviewers’ Performance Based on Mining Problem Localization in Peer-Review Data. In 3rd Intl Conf. on Educational Data Mining. Pittsburgh.


Kevin Ashley, University of Pittsburgh School of Law, Learning Research and Development Center, and University of Pittsburgh Intelligent Systems Program; corresponding author.

 

Ilya Goldin, University of Pittsburgh Intelligent Systems Program.

 

This article is republished with permission of IOS Press, the authors, and JURIX, Legal Knowledge and Information Systems, from: Katie M. Atkinson (ed.), Legal Knowledge and Information Systems. JURIX 2011: The Twenty-Fourth Annual Conference, IOS Press, Amsterdam et al.