Jusletter IT

Toward Extracting Information from Public Health Statutes using Text Classification and Machine Learning

  • Authors: Matthias Grabmair / Kevin D. Ashley / Rebecca Hwa / Patricia M. Sweeney
  • Category: Scientific Articles
  • Region: USA
  • Field of law: AI & Law
  • Citation: Matthias Grabmair / Kevin D. Ashley / Rebecca Hwa / Patricia M. Sweeney, Toward Extracting Information from Public Health Statutes using Text Classification and Machine Learning, in: Jusletter IT 12 September 2012
This paper presents preliminary results in extracting semantic information from US state public health legislative provisions using natural language processing techniques and machine learning classifiers. Challenges in the density and distribution of the data as well as the structure of the prediction task are described. Decision tree models trained on a unigram representation with TFIDF measures in most cases outperform the baselines by varying margins, leaving room for further improvement.

Table of Contents

  • 1. Introduction
  • 2. Task Description
  • 3. Our Framework
  • 3.1. Preprocessing
  • 3.2. Chunk Dataset
  • 3.3. Machine Learning Environment
  • 3.4. Bag-of-Words and TFIDF
  • 3.5. Code Ranking
  • 4. Experiments
  • 4.1. Experiment Setup
  • 4.2. Evaluation Metrics
  • 4.3. Results
  • 4.4. Discussion
  • 5. Relationship to Prior Work
  • 6. Conclusions
  • 7. Acknowledgements
  • 8. References

1. Introduction

[1]
The public health system is a network of state and federal, public and private sector agencies, institutions, and individuals, which endeavors daily to foster societal conditions that improve people’s health. Emergency preparedness and response are fundamental functions of the system. Each interdependent agent, however, is directed by and accountable to a discrete and independent set of federal, state or local laws and regulations. Policy analysts at the University of Pittsburgh’s Graduate School of Public Health are comparing how states’ laws governing agents in the public health system either facilitate or frustrate each system’s ability to plan for and respond to public health emergencies. Statutes direct certain actors to coordinate actions to achieve specified goals and purposes in a particular timeframe under certain conditions. The states’ readiness for public health emergencies can be assessed and compared in terms of these statutory frameworks. Encoding the relevant statutes according to the legislative directives enables a comprehensive, objective analysis of these frameworks, within and across states. See, e.g., Fig. 1. However, encoding even one state’s public health system emergency preparedness and response statutes is a time- and labor-intensive task; encoding those of fifty states would entail exorbitant cost. The goal of this work is to explore the possibility of using the already encoded statutes of one state as a means to automate (or semi-automate) the encoding of statutes for new states.
[2]
Different states have different statutes, so training classifiers on the encodings of one state to predict another may be very challenging. The problem is difficult because a legal statute is a complex document from which the coders need to extract different types of information. We propose a framework that uses Natural Language Processing (NLP) and Machine Learning (ML) to automatically encode statutes using classifiers that have been trained on data from the same state. We show how to decompose the task of statute encoding into several classification problems that can be tackled automatically by NLP/ML methods. We devised a method for identifying a feature representation of the provisions and performed experiments to evaluate each classifier. The results show that most classifiers achieve better performance than the baselines. This suggests that automatic semantic information extraction is a promising direction for expediting statute encoding and analysis.

2. Task Description

[3]
The analysis of a given state’s emergency-related legislation is a two-step process: First, all relevant statutory provisions need to be retrieved. Here, we understand a provision to be a statutory element identified by a root-level §-citation, e.g. 4 Pa. Code §3.24. It is deemed relevant if it contains information that should be part of the public health analysis. Second, semantic information needs to be extracted from the provisions. Traditionally, this is performed manually. At the Graduate School of Public Health, approximately 1700 Pennsylvania (PA) statutory passages had been retrieved from the LexisNexis legal database using expert hand-crafted queries. Each provision contained therein must then be manually coded according to a standardized codebook. The information that must be identified includes: the public health agents that are the objects of the provision, the action the provision directs and whether it is permitted or obligatory, the goal or product of the action, the purpose of the statute, the type of emergency in which the direction occurs and under what timeframe and conditions, the provision’s citation, etc.
[4]

Relevant provisions are encoded into a scheme consisting of a citation and nine attributes. The scheme was developed and used before the application of NLP/ML methods was considered. It is similar to the provision «arguments» used by [1], but differs in that we do not distinguish different provision types and are only concerned with provisions regulating actions among agents of interest. However, the prescription attribute (see below) distinguishes between «may», «must», etc. commandments, thereby representing more than one regulatory type of provision. Our work is also distinct from [1] in that our semantic extraction is a classification task as opposed to an XML text markup task.

[5]
To automate the analysis using ML methods, the process decomposes into several classification tasks. Determining whether a provision is relevant is a binary classification decision (single-label prediction). Identifying the primary acting agent is a 31-way classification decision (multiple-label prediction); determining the goal(s) of an action is a harder decision of picking up to five out of 143 choices. Below, we provide more details about the coding scheme to which the human coders adhere. Most attributes can have more than one entry, up to a maximum number of assigned columns, and every attribute needs to be assigned at least one code. The scheme can be specified as follows (see also the sketch after the list):
  1. Acting agent – Who is acting? Represented as ten available slots, each consisting of primary (31 codes), secondary (ten codes) and «footnote» code (20 codes).
  2. Prescription – How is the action prescribed? (Must (not), can, etc.) Represented as one slot with five codes.
  3. Action – Which action is being taken? Represented as three slots (82 codes).
  4. Goal – What goal is the action supposed to achieve? Represented as five slots (143 codes).
  5. Purpose – For what purpose is the action being taken? Represented as four slots (four codes).
  6. Emergency type – Which types of emergencies are covered by the provision? Represented as seven slots (19 codes).
  7. Partner agent – Towards whom is the action taken? Represented as 15 available slots, each consisting of primary, secondary and «footnote» code.
  8. Timeframe – In what timeframe can/must the action be taken? Represented as two slots (150 codes, to be reduced to six in future work).
  9. Condition – Under what condition does the action need to be taken, e.g. during a declared emergency? Represented as three slots (172 codes).
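As an illustration only (the field names, slot handling and code values below are our own and not the project’s actual codebook format), the slot-based scheme above could be represented roughly as follows:

```python
# Illustrative only: a slot-based record for one coded provision chunk.
# Slot counts mirror the attribute list above; code values are strings
# taken from a (hypothetical) codebook.
from dataclasses import dataclass, field

MAX_SLOTS = {
    "acting_agent": 10,   # each slot: primary / secondary / "footnote" code
    "prescription": 1,
    "action": 3,
    "goal": 5,
    "purpose": 4,
    "emergency_type": 7,
    "partner_agent": 15,
    "timeframe": 2,
    "condition": 3,
}

@dataclass
class CodedChunk:
    citation: str                              # e.g. "4 Pa. Code §3.24 (1)(a)"
    codes: dict = field(default_factory=dict)  # attribute -> list of assigned codes

    def add_code(self, attribute: str, code: str) -> None:
        slots = self.codes.setdefault(attribute, [])
        if len(slots) >= MAX_SLOTS[attribute]:
            raise ValueError(f"no free slot left for {attribute}")
        slots.append(code)
```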
[6]
A set of histograms showing the distributions of the codes for each attribute can be found on the authors’ website at http://lrdc.pitt.edu/Ashley. The reference data stems from manual coding by, typically, one human coder. There has been some investigation of inter-coder agreement on a sampled set of provisions (annotated by two coders) from different states, both in terms of determining whether a provision is relevant and in terms of its substantive coding. The percentage of disagreement over substantive coding appears to shrink over the course of the project. For example, PA, the first coded state, yielded 27.2% disagreement, whereas Florida had only 19.1% coder disagreement. We take this as an indication of increasing coherence in human coding, which potentially leads to better datasets and hence better classifiers.

3. Our Framework

[7]
To automate the process of taking full-text provisions as input and predicting the coding data described in section 2, our system performs several steps.

3.1. Preprocessing

[8]
The source data consisted of a set of MS Word documents, each containing one provision in full text, i.e. a series of sentences divided into paragraphs, subparagraphs, etc. Each document (and the provision therein) contained at least one relevant/codable part. The documents were converted into plain text, segmented into a tree structure, and stored in a database. Each provision is represented as a tree of clauses (and, recursively, subclauses), with the leaf nodes as sentences. Each clause has its own specific non-root sub-citation (e.g. 4 Pa. Code §3.24 (1)(a)). From the database, each provision can be reconstructed into its original form. While the structural parser was developed to automate the data entry process, some hand-annotation of the text files had to be done in order to provide it with the necessary clues. The correctness of the data after entry has not been validated through sampling, but no errors have become apparent so far.
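As a rough illustration of this representation (the class and field names are ours, not the structural parser’s), a provision tree with per-clause sub-citations and leaf sentences might be modelled like this:

```python
# Sketch of the tree representation described above; names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Clause:
    sub_citation: str               # e.g. "(1)" or "(a)"; empty for the root
    sentences: List[str] = field(default_factory=list)
    children: List["Clause"] = field(default_factory=list)

@dataclass
class Provision:
    citation: str                   # root-level §-citation, e.g. "4 Pa. Code §3.24"
    root: Clause = field(default_factory=lambda: Clause(sub_citation=""))

    def full_text(self) -> str:
        """Reconstruct the provision text by a depth-first walk of the tree."""
        parts: List[str] = []
        def walk(clause: Clause) -> None:
            parts.extend(clause.sentences)
            for child in clause.children:
                walk(child)
        walk(self.root)
        return " ".join(parts)
```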

3.2. Chunk Dataset

[9]
Since the target is to identify and analyse relevant parts of a provision, the tree representation of the provisions was used to produce chunks. A chunk is a subpart of a provision consisting of all the clauses from the tree root to a specific clause node; it is represented by a text consisting of all the leaf-node sentences belonging to the clauses along that path. Each chunk has its own citation represented by a chain of citation subelements. For example, the citation 4 Pa. Code §3.25(a)(2)(iv) identifies the chunk of 4 Pa. Code §3.25 that incorporates all sentences under subsection (a), not taking into account any of its subsubsections except (2) and, recursively, its child clause (iv).
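Building on the toy Provision/Clause classes sketched in section 3.1, chunk construction along a root-to-node citation path could look like the following (again an illustration, not the authors’ code):

```python
# Build a chunk for a target clause: collect the sentences of every clause
# on the path from the provision root down to (and including) that clause.
from typing import List, Optional

def build_chunk(provision: "Provision", path: List[str]) -> Optional[str]:
    """path: chain of sub-citations, e.g. ["(a)", "(2)", "(iv)"]."""
    clause = provision.root
    sentences = list(clause.sentences)          # root-level text
    for sub in path:
        matches = [c for c in clause.children if c.sub_citation == sub]
        if not matches:
            return None                         # citation cannot be linked to a chunk
        clause = matches[0]
        sentences.extend(clause.sentences)      # sibling subsections are skipped
    return " ".join(sentences)

# e.g. build_chunk(prov, ["(a)", "(2)", "(iv)"]) for 4 Pa. Code §3.25(a)(2)(iv)
```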
[10]
The target data contains a citation for every coded piece of a provision. It has been assumed that this chunking method is adequate to associate a piece of a provision with its target coding entry, i.e. that all information necessary to code the provision piece can be extracted from the chunk’s text. A number of challenges are known. For example, provisions may contain references to other provisions and incorporate information from there. Also, certain subsections may depend on information present in their siblings as opposed to parent clauses. We intend to address these challenges in future work if feasible.
[11]
A noticeable restriction imposed by this chunking method is that it does not produce chunks for parts of provisions that contain no text themselves but contain subsections with text. While one can think of ways to overcome this limitation (e.g. adding a chunk for non-text clauses containing the sentences of all their subclauses), they may greatly inflate the number of chunks, create ambiguity as to which chunk a unique citation refers to, or enlarge the chunks to a degree suboptimal for ML. The working corpus for this paper contained 16 such target coding citations which could not be linked to a chunk.

3.3. Machine Learning Environment

[12]

To train the classifiers to predict the desired labels, supervised learning methods are applied. In a nutshell, ML classifiers provide methods to produce a mathematical function mapping a vector of input features onto an output label. The task is to represent each part of a provision as a finite set of features and a target label (i.e. the code assigned to an attribute). The majority of these feature-target pairs become so-called «training examples», which are used to compute an optimal mapping function that can then be evaluated on the remaining testing examples.

[13]
Chunking and feature vector translation were done using Python scripts and the NLTK toolkit [5]. All ML was done using the R statistical software [13]. Decision tree models were trained for all tasks using R’s rpart package [12], which employs the CART algorithm. SVM models (using R’s e1071 package [3]) were also trained for the relevance prediction task, using a radial kernel and a self-optimizing grid search (using a fifth of the training data for tuning) over possible parameter configurations. The ranges tried were γ = 0.01, 0.1, 0.5 and cost c = 0.1, 1, 10. In most cases, these converged to γ = 0.01 and c = 10. We plan to further expand these parameter search ranges to improve results.
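The authors ran this grid search with R’s e1071 package; purely as an illustration, a rough Python/scikit-learn analogue of the same radial-kernel search over these ranges might look like the following (data loading omitted, placeholder arrays used):

```python
# Rough scikit-learn analogue of the radial-kernel SVM grid search described
# above (the paper's experiments used R's e1071, not this code).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"gamma": [0.01, 0.1, 0.5], "C": [0.1, 1, 10]}   # ranges from the text

X = np.random.rand(200, 50)            # placeholder TFIDF feature vectors
y = np.random.randint(0, 2, 200)       # placeholder relevance labels

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)             # in the paper, mostly gamma=0.01 and C=10
```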

3.4. Bag-of-Words and TFIDF

[14]

During testing and training, each chunk is translated into a feature vector by means of a bag-of-words representation, i.e. the occurrence of each word becomes a feature, where the word is isolated from its context and lemmatized using the WordNet lemmatizer [4]. Stopwords («to», «for», etc.) and punctuation are removed before the translation.
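A minimal sketch of this preprocessing step with NLTK follows (the paper names NLTK and the WordNet lemmatizer; the exact tokenization and stopword list used here are our assumptions):

```python
# Minimal preprocessing sketch: tokenize, drop stopwords/punctuation, lemmatize.
# Requires the NLTK data packages 'punkt', 'stopwords' and 'wordnet'.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()
_stopwords = set(stopwords.words("english"))

def bag_of_words(chunk_text: str) -> list:
    tokens = nltk.word_tokenize(chunk_text.lower())
    return [
        _lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok not in _stopwords and tok not in string.punctuation
    ]
```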

[15]

We used a TFIDF (Term Frequency / Inverse Document Frequency) representation, where a word’s feature value becomes a numerical measure of the relative importance of that word within a chunk. An advantage of using TFIDF values in the feature vector is that the number of features can be reduced by removing all terms whose maximum TFIDF value does not exceed a static threshold, i.e. terms that are simply not informative enough. Finally, chunk size has been added as an additional feature to the vector.
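As an illustration of this threshold-based feature reduction, the following sketch uses a standard TFIDF formulation; the paper does not spell out the exact weighting variant, so the formula below is an assumption:

```python
# Sketch of TFIDF feature reduction: drop every term whose maximum TFIDF
# value over all chunks stays below a static threshold.
import math
from collections import Counter
from typing import Dict, List

def tfidf_vectors(chunks: List[List[str]]) -> List[Dict[str, float]]:
    n = len(chunks)
    df = Counter(term for chunk in chunks for term in set(chunk))
    vectors = []
    for chunk in chunks:
        tf = Counter(chunk)
        vectors.append({
            term: (count / len(chunk)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

def reduce_features(vectors: List[Dict[str, float]], threshold: float) -> set:
    """Keep only terms whose maximum TFIDF across all chunks exceeds the threshold."""
    max_tfidf: Dict[str, float] = {}
    for vec in vectors:
        for term, value in vec.items():
            max_tfidf[term] = max(max_tfidf.get(term, 0.0), value)
    return {term for term, value in max_tfidf.items() if value > threshold}
```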

3.5. Code Ranking

[16]

The first prediction target attempted is relevance, i.e. a binary decision about whether a chunk is relevant or not, where «relevant» means that it is worth coding, i.e. that it adds information pertinent to the public health analysis. The training provisions are chunked and translated into feature vectors. If a chunk’s citation is found in the reference coding sheet (i.e. it had been coded by a human before), then the target attribute is 1, and 0 otherwise. A trained model is evaluated by having it predict the target attribute of a given feature vector from the validation data.

[17]
Except for prescription and relevance, however, the target attributes can be annotated with a set of codes. The prediction task is then to assign an unspecified number of labels (from one to the maximum number of slots) to a chunk’s feature vector. As the enumeration in section 2 shows, the number of possible target codes varies greatly in size. Training binary classifiers for each code would require larger amounts of data. Also, some of these codes appear very infrequently, leading to experiment runs where only a significantly smaller portion of the target codes actually appears in the training data; the classifier hence cannot predict them in the validation data. Until a better solution is apparent, we train classifiers using duplicate feature vectors. For example, if a chunk has been annotated with three actions, then that one chunk is turned into three training instances: its feature vector is used three times and added to the training data with a different target action code each time. Empty slots are not counted for this purpose. We are aware that this method, among other shortcomings, does not overcome the sparsity problem.
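The instance duplication can be sketched as follows (a toy illustration of the construction just described; names are ours):

```python
# Turn one multi-labelled chunk into several single-label training instances
# by repeating its feature vector once per assigned code (empty slots ignored).
from typing import Dict, List, Optional, Tuple

def expand_training_instances(
    feature_vector: Dict[str, float],
    slot_codes: List[Optional[str]],
) -> List[Tuple[Dict[str, float], str]]:
    return [(feature_vector, code) for code in slot_codes if code is not None]

# e.g. a chunk annotated with three actions yields three training pairs:
# expand_training_instances(vec, ["A12", "A07", "A33", None, None])
```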
[18]
As the system is expected to predict a set of codes, we proceed as follows. Assume we predict an attribute that has n slots available in the reference coding sheet. Given a validation-set feature vector, the classifier assigns a probability to each possible label. Instead of picking only the label with the highest probability, the predictor skims off the n codes with the highest probabilities and then, additionally, disregards any code whose probability does not pass a static threshold (0.1 in the given experiments). If this static threshold would cause all codes to be discarded, it is repeatedly divided by two until at least one of the n best codes passes. We plan to develop more sophisticated shortening methods in future work.
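A sketch of this skimming step (the 0.1 threshold and the halving fallback follow the description above; the per-code probabilities are assumed to come from the trained classifier):

```python
# Select a set of codes from a classifier's per-code probabilities:
# take at most n_slots codes, drop those below a static threshold, and
# halve the threshold until at least one code survives.
from typing import Dict, List

def select_codes(code_probs: Dict[str, float], n_slots: int,
                 threshold: float = 0.1) -> List[str]:
    ranked = sorted(code_probs.items(), key=lambda kv: kv[1], reverse=True)[:n_slots]
    while threshold > 0.0:
        selected = [code for code, prob in ranked if prob >= threshold]
        if selected:
            return selected
        threshold /= 2.0            # relax the cutoff until something passes
    return [ranked[0][0]] if ranked else []

# e.g. select_codes({"G01": 0.45, "G07": 0.30, "G12": 0.05}, n_slots=5)
# -> ["G01", "G07"]
```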

4. Experiments

4.1. Experiment Setup

[19]
Data Split. Only PA data has been used in the experiments. After examining the 1700 retrieved statutory passages, 1515 passages were eventually coded, distributed over a total of 599 full-text provisions. Of these, 120 have been randomly sampled and set aside as final evaluation data; they are not the subject of the work in this paper. The remaining 479 are considered working data. As described above, they have been structurally parsed and chunked. Experiments were run using 4-fold cross-validation: the 479 provisions of the working dataset were randomly split into four parts, and during four runs a classifier was trained on three parts as training data and validated on the fourth, choosing a different validation set each time. Performance statistics were averaged over all four runs. This resulted in the subparts being of different sizes, as some provisions are larger than others and may contain multiple relevant chunks, a realistic scenario. Moreover, not spreading a provision’s chunks across partial datasets seems to be the only way of preventing chunks of one provision from being used in both training and testing.
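The provision-level split, which keeps all chunks of one provision in the same fold, might be sketched as follows (an illustration, not the scripts actually used):

```python
# 4-fold cross-validation at the provision level: provisions (not chunks)
# are shuffled into folds so that chunks of one provision never appear in
# both training and validation data.
import random
from typing import Dict, List

def provision_folds(provision_ids: List[str], k: int = 4,
                    seed: int = 0) -> List[List[str]]:
    ids = list(provision_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def train_validation_splits(chunks_by_provision: Dict[str, list], k: int = 4):
    folds = provision_folds(list(chunks_by_provision), k)
    for i in range(k):
        val = [c for pid in folds[i] for c in chunks_by_provision[pid]]
        train = [c for j in range(k) if j != i
                 for pid in folds[j] for c in chunks_by_provision[pid]]
        yield train, val
```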
[20]

Baselines The first baseline we use is that of the most frequent code (MFC). It is a simple predictor which determines the single most frequent code for a given attribute from the training data and predicts it for all the validation data. For relevance prediction, this is equivalent to predicting all chunks as irrelevant (Irr.), as only a small portion of the overall set of chunks in the working dataset is relevant (1342 / 6010). The second, more sophisticated baseline is a keyword enhanced version of the first (MFC+). As it would have been too big an endeavour to manually gather keywords for all labels of all attributes, we have used the terms of the small dictionary used in the manual coding process. It lists action types, purposes, timeframes, etc. The terms are lemmatized and used as signal terms (ST) for their respective codes with the most frequent code being the fallback mechanism in case none of the terms fire. Similarly, for relevance prediction, we have manually gathered a number of signal terms (e.g. «emergency», «flood», etc.) which flag a relevant chunk. Characteristics and peculiarities of these baselines will be addressed in light of the results further below in the discussion section.
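The two baselines can be sketched roughly as follows (the signal terms in the usage comment are invented for illustration; the project’s actual dictionary differs):

```python
# MFC: always predict the single most frequent training code.
# MFC+: predict a code if one of its signal terms occurs in the chunk,
#       otherwise fall back to the most frequent code.
from collections import Counter
from typing import Dict, List, Set

def most_frequent_code(training_codes: List[str]) -> str:
    return Counter(training_codes).most_common(1)[0][0]

def mfc_plus_predict(chunk_tokens: Set[str], signal_terms: Dict[str, Set[str]],
                     fallback: str) -> str:
    for code, terms in signal_terms.items():
        if chunk_tokens & terms:          # any signal term fires
            return code
    return fallback

# Illustrative use (terms invented for the example):
# fallback = most_frequent_code(train_codes)
# mfc_plus_predict({"emergency", "plan"}, {"REL": {"emergency", "flood"}}, fallback)
```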

4.2. Evaluation Metrics

[21]
Experiments were conducted for relevance and all attributes except the actor columns, where only the primary code was attempted. To reduce the number of features, static TFIDF thresholds have been applied; that is, a term is removed from the feature set whenever its maximum TFIDF value in the dataset does not surpass the threshold. For the relevance prediction task, precision and recall were calculated in the traditional way. The prescription attribute is a non-binary single-label prediction and hence is only measured in terms of classification error. The other attributes use a custom definition of precision and recall to take account of the ranking task, empty coding slots in the reference data, and the prediction of a set of codes by the classifier. Assume a set of predicted codes P = {p1, ..., pn} that is compared to a set of target codes T = {t1, ..., tn}. In validation, each chunk’s precision and recall are averaged over the full validation set for a given run. Finally, precision and recall are averaged over all four runs and the F measure is calculated from these averages.
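For illustration, the following is a standard set-overlap formulation of per-chunk precision and recall; the paper’s custom definition additionally accounts for empty slots and the ranking, so this sketch only approximates the metric actually used:

```python
# Standard set-overlap precision/recall for one chunk; the paper's custom
# definition additionally handles empty slots and the ranking, so this is
# only an approximation of the metric actually reported.
from typing import Set, Tuple

def chunk_precision_recall(predicted: Set[str], target: Set[str]) -> Tuple[float, float]:
    overlap = len(predicted & target)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(target) if target else 0.0
    return precision, recall

# Per-chunk values are averaged over the validation set and over the four
# cross-validation runs; F is then computed from the averaged P and R.
```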

4.3. Results

[22]
The most interesting results are shown in Table 1. The first two columns show the performance of the baselines; the next four show the TFIDF methods with different thresholds. Aside from the second row, which lists the average number of features used in the model, the remaining rows present the performance results for the classification tasks. Within each classification task, we distinguish the system that achieves the highest average precision (P), the highest average recall (R), the highest harmonic mean of the two (F) and/or the lowest average classification error (CE) – highlighted in bold font in the table. Note that CE is only provided for single-label classification tasks. Some experiments were run multiple times, exhibiting some fluctuation in results (typically between −0.06 and +0.05; for the emergency type going as far as −0.08 to +0.07), but not to the point where the implications of this discussion would have changed.
[23]
For most classification tasks, ML methods outperform the baselines. The proposed learning method reduces classification errors over the baselines for the single-label prediction tasks (relevance and prescription). Both relevance and the prescription attribute show a significant reduction in classification error, which is promising. Further efforts should examine the wrongly predicted instances to see whether a more sophisticated feature set (such as information extracted based on syntactic patterns) can improve this even further. Interestingly, the relevance prediction baseline produces high recall and comparatively low precision, whereas the decision tree classifier only achieves low recall but higher precision. Intuitively, one would think that the small number of relevant chunks skews the model learning towards a very discriminative approach. One fruitful continuation could be, as a first step, to retrieve chunks using the high-recall keyword predictor to increase the density of positive instances and then train an ML classifier on this dataset. The SVM model, however, scores very high precision but at the cost of very low recall. As stated above, we plan to tune the SVMs to a greater extent before applying them to the ranking task of the other attributes.

4.4. Discussion

[24]
The MFC baseline yields considerable precision and recall for the condition and timeframe attributes. This reveals that these attributes are dominated by one code (see also the histograms on our website referenced above). More diverse attributes (e.g. goal) show much lower baseline performance; here, the most frequent action only accounts for 19% of all cases. In the multiple-label prediction tasks, with few exceptions, ML methods outperform the baselines at at least one TFIDF threshold setting. However, a pattern emerges: except for relevance and action, the ML classifier always produces higher recall than the baselines. Since the baseline only ever predicts the single most frequent code and not a set of codes, validation chunks that have been annotated with more than one code will account for low recall. The ML classifier, by contrast, will frequently predict a set of codes, leading to much higher recall. The most striking example of this might be the purpose attribute, where there are four slots for four possible codes, resulting in recall above 0.9. However, the ease with which the ML predictor can produce such sets prevents it from achieving good precision. With the exception of the action, purpose and receiving agent attributes, though, ML methods perform comparably well (e.g. emergency type or acting agent) if not slightly better than the baselines (e.g. timeframe), suggesting that precision may be improved if a better way of shortening the list of retrieved codes in the ranking process can be found. Finally, we observe that increasing TFIDF filtering regularly impacts performance negatively for the decision tree models, but seems to enhance the SVM model to a limited extent.

5. Relationship to Prior Work

[25]

In the last decade, a growing body of AI & Law research has focused on the automatic classification and analysis of statutory texts using ML techniques. Some of these studies focused on relatively coarse classifications. For example, [9] employed SVMs to categorize legal documents as belonging to one of ten major types («Administrative Law», «Private Law», «Computer Science Law», etc.), achieving 85% accuracy. [11] achieved 63% accuracy with SVMs classifying statutory texts in terms of six abstract categories. [10] achieved 50% precision in a massively multiple-label classification problem assigning directory codes and subject matters (e.g., intellectual property law, internal market, industrial and commercial property) using a perceptron-based approach.

[26]
ML techniques have also been applied to classify cases in terms of regulatory functions. [2] compared a manually crafted knowledge engineering classifier to an SVM-based classifier. They categorized sentences (as opposed to full provisions or chunks thereof) into a finite set of categories concerning the type of norm (i.e., definition, permission, obligation, etc.). They achieved accuracies around 90% and above, albeit lower when the classifier is tested on bills none of whose sentences have been used in training.
[27]

Francesconi et al. applied SVMs to categorize multi-sentence statutory texts in terms of regulatory functions (e.g., as a definition, prohibition, duty, etc.) with accuracies as high as 93% [1,8,6]. They used an NLP approach to extract the typical features associated with each function, achieving 83% precision and 74% recall. For example, the provision «A controller intending to process personal data falling within the scope of application of this Act shall have to notify the Guarantor thereof» is classified as a «duty» with the features: Bearer (of the duty) = «controller», Action = «notification», Counterpart = «Guarantor», and Object = «process personal data». Once mapped into conceptual indices (i.e., legal thesauri, dictionaries, or ontologies), the extracted information could enable the retrieval of statutory texts responsive to conceptual queries, e.g. «Return all provisions that impose upon controllers duties regarding privacy protection of personal data.» [7]

[28]
Although the information we attempt to learn (i.e., the provision’s purpose, objects, directed action, etc.) is related to the features learned in the above work, it is more domain specific as is consistent with our ultimate goal of supporting inferences across state statutes dealing with a particular subject matter. Our primary goal for the automated analysis of legal statutory texts is to support inferences about similarities and differences in the ways that states’ regulatory systems address emergency preparedness and response. We necessarily must employ a set of classifiers to extract semantic information about a given provision using multiple labels ranked by their probability.

6. Conclusions

[29]
We have presented preliminary results obtained in an experiment applying NLP and ML techniques to data from the human coding of public health statutory law. Our goal is to assess whether the already encoded statutes of one state can be used to automate (or semi-automate) the encoding of statutes for new states. As a first step, we divided one state’s (Pennsylvania’s) data and evaluated how well a program could learn from the training set to classify the test set. In the binary classification of determining whether a provision chunk is relevant or not, ML techniques improve precision over the baseline, but at the cost of lower recall. With respect to the single-label classification task, prescription, the ML approach significantly reduced classification error. For most multi-label classifiers (acting agent, goal, emergency type, condition and timeframe), the ML prediction outperformed the baselines’, but not for receiving agent and purpose. Predicting an unspecified number of codes for the attributes of interest and discriminating among ranked codes to achieve good precision remain major challenges. Peculiar data patterns, specifically sparsity and dominating codes, also cause problems we have not yet solved. We have yet to adapt and explore SVMs for code ranking and to search for additional features. Since the manual coding process has improved in both quantity and quality, we plan to run a similar experiment on more recently coded state data (e.g. Wisconsin).

7. Acknowledgements

[30]
We are grateful to the University of Pittsburgh’s University Research Council Multidisciplinary Small Grant Program for funding this project. This publication was also supported in part by the Cooperative Agreement 5P01TP000304 from the Centers for Disease Control and Prevention. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the CDC.

8. References

[1] Biagioli, C., Francesconi, E., Passerini, A., Montemagni, S. and Soria, C., Automatic semantics extraction in law documents, ICAIL 2005 Proceedings, 133–140, ACM Press (2005).

 

[2] de Maat, E., Krabben, K. and Winkels, R., Machine Learning versus Knowledge Based Classification of Legal Texts, Jurix 2010 Proceedings, pp. 87–96, R.G.F. Winkels (Ed.), IOS Press (2010).

 

[3] Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D. and Weingessel, A. (2011). e1071: Misc Functions of the Dept. of Statistics, TU Wien. R package ver. 1.5-26. http://CRAN.R-project.org/package=e1071.

 

[4] Fellbaum, C. (1998, ed.), WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

 

[5] Bird, S., Loper, E., and Klein, E., (2009). Natural Language Processing with Python. O’Reilly Media.

 

[6] Francesconi, E., An Approach to Legal Rules Modelling and Automatic Learning. JURIX 2009 Proceedings (G. Governatori, Ed.), 59–68, IOS Press (2009).

 

[7] Francesconi, E., Montemagni, S., Peters, W. and Tiscornia, D., Integrating a Bottom-Up and Top-Down Methodology for Building Semantic Resources for the Multilingual Legal Domain. In Semantic Processing of Legal Texts. LNAI 6036, pp. 95–121. Springer: Berlin (2010).

 

[8] Francesconi, E. and Passerini, A., Automatic Classification of Provisions in Legislative Texts, Artificial Intelligence and Law 15:1–17 (2007).

 

[9] Francesconi, E. and Peruginelli, G., Integrated access to legal literature through automated semantic classification. Artificial Intelligence and Law 17:31–49 (2008).

 

[10] Mencía, E. and Fürnkranz, J., Efficient Multilabel Classification Algorithms for Large-Scale Problems in the Legal Domain. In Semantic Processing of Legal Texts. LNAI 6036, pp. 192–215. Springer (2010).

 

[11] Opsomer, R., De Meyer, G., Cornelis, C., Van Eetvelde, G., Exploiting Properties of Legislative Texts to Improve Classification Accuracy, Proc. Jurix 2009 (G. Governatori, ed.), 136–145, IOS Press (2009).

 

[12] Therneau, T. M. and Atkinson, B., R port by Ripley, B. (2011). rpart: Recursive Partitioning. R package version 3.1-49. http://CRAN.R-project.org/package=rpart.

 

[13] R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.


Matthias Grabmair, Intelligent Systems Program, University of Pittsburgh, USA.

 

Kevin D. Ashley, Intelligent Systems Program, School of Law, University of Pittsburgh, USA.

 

Rebecca Hwa, Intelligent Systems Program, Department of Computer Science, University of Pittsburgh, USA.

 

Patricia M. Sweeney, Graduate School of Public Health, University of Pittsburgh, USA.

 

This article is republished with permission of IOS Press, the authors, and JURIX from: Katie M. Atkinson (ed.), Legal Knowledge and Information Systems, JURIX 2011: The Twenty-Fourth Annual Conference, IOS Press, Amsterdam et al.