Jusletter IT

Hypertext – the deep structure

  • Author: Jon Bing
  • Category: Articles
  • Region: Norway
  • Collection: Tagungsband IRIS 2014
  • Citation: Jon Bing, Hypertext – the deep structure, in: Jusletter IT 20 February 2014
Hypertext may loosely be described as pointers from one location in a document to another. The concept has been widely popularised by links between web pages, where typically an element in one page serves as a button which executes a link to another page (rather than to an element in that other page). The links create a maze of networks between documents, and make it possible to unravel a thread of related documents by following the links from document to document. Because hypertext has been popularised by the technology associated with the World Wide Web and HTML coding of documents, some have claimed that this is a novel way of navigating within or between documents. In this brief paper, some relations to the basic theory of text retrieval and to older efforts are sketched, hopefully providing a perspective that is also useful for appreciating hyperstructures.

Table of contents

  • 1. The functions of an information system
  • 2. Hyperlink and indexes
  • 3. Types of indexes
  • 3.1. Introduction
  • 3.2. Alphabetical indexes
  • 3.3. Systematic indexes
  • 3.4. Citation indexes
  • 4. Memex

1. The functions of an information system

[1]

Any information system has to support three functions:

  • A retrieval function
  • A relevance function
  • A source function
[2]
As stated, any information system has to offer all these three functions, otherwise it will not satisfy user needs.
[3]
Taking the list above from the bottom up, one will appreciate that the source function is vital. It is of little comfort to the professional user to learn that a relevant book has been published, if the system does not provide any means of acquiring this book.1 And the material must be available in a form which satisfies the needs of the user. Often, there will be norms for the domain which specify in which form the document has to be in order for the user to base arguments on the document. Within the legal domain,2 for instance, a statute has to be available in authentic form, that is the form it was given by the legislator.3 It would not be permitted to base the interpretation of a statute on, for instance, a notice from the government in the press informing the general public that a new statute has been passed. Such secondary sources will, however, be permitted, but the user will then base the argument on legal literature, a different source with a lower priority in the hierarchy of legal sources, and with a different role in the legal decision process. One may already at this early stage note that legal material is characterised by its composite nature – legal literature will, for instance, base its discussion on statutes and cases, the representation of the content of these documents being embedded in the text of the legal literature. This indicates that hyperlinks are basic to the structure of legal material, as will be further explored in this paper.
[4]

The relevance function is an auxiliary function: it consists of features of the document or the information system which make it easier – more efficient – for the user to determine the relevance4 of a document. In the document, relevance functions are typically enabled by the inclusion of an abstract, and user research has demonstrated that this is rather efficient, especially for excluding non-relevant documents.5 Occasionally, elements of the authentic form have been designed to assist relevance assessment, typically headings. A system function assisting relevance assessment is what is often known as a focus-function;6 the system lets the user read a document by homing in on the specified search terms identified in the document, as this makes it easier for the user to determine whether the term occurs in a context relevant for his or her problem. The relevance of a document can finally only be determined by referring to the authentic form, and it should be emphasised that the relevance assessment is not objective, but will be relative both to user and context.7

[5]
The function to be discussed further in this paper is the retrieval function. The user always in principle has the possibility of reading through the available source material from beginning to end, for instance taking a set of case reporters, typically starting some time in the latter part of the last century, and reading each and every report up to the current issue. It is obvious that this is an option only rarely open to a user – the available resources for legal research will not permit the associated expenditure of time or money. Therefore, retrieval systems have been developed, allowing the user to formulate a search request in the formalisms defined by the retrieval system, and to receive as a result suggestions as to which documents may be relevant to the problem. Any information system imposes limitations on the formulation of the search request – a simple but important example is that a back-in-the-book subject index will only permit the user to phrase search requests using the terms predefined by the index. The efficiency of a retrieval function depends upon how well it allows the user to formulate search requests as a hypothesis of relevance, either by characterising the problem to be addressed (problem oriented search) or the legal issues the user perceives that the problem embraces (systematic search). In both cases, the efficiency depends on the user’s knowledge of the possibilities of the retrieval function, and of the probable relation between specified requests and the documents he or she wants to retrieve. Therefore, one can even at this very introductory stage of the discussion note that the more experienced and expert the user, the better a retrieval function will perform: An efficient retrieval system does not replace, but enhances experience.
[6]
It has already been implied by this introduction of the retrieval function that it is based on indexes. There is really no alternative to this strategy, though there are very many types of indexes, and they may be established by different means. A brief discussion of different types of indexes will be included below, but first the relation between hyperlinks and indexes will be indicated.

2. Hyperlink and indexes

[7]
An index in our context may be characterised as a sorted list of terms. The sorting criterion may be alphabetic or according to a pre-defined system (for instance a hierarchical topic scheme). The terms may be extracted from the document, or assigned to the document; they may be freely chosen, or chosen from a predefined list of indexing terms; and they may be selected by an indexer (intellectual indexing) or generated by a computer (automatic indexing).
[8]
A typical index will be an alphabetical back-in-the-book subject index. It has basically a very simple structure:

[indexing term1] * [page in the book1]

[indexing term2] * [page in the book2]

[indexing term3] * [page in the book3]

[9]
The user will decide which indexing term best characterises the problem on which he or she is working, or the document (page) he or she thinks may be relevant. Looking up the term, the user is referred to a page, and has to leaf manually through the book in order to find the page. Usually, the reference is to the whole page,8 and therefore the user typically has to look sequentially9 through the page in order to identify the part of the text that is relevant.
[10]
It is rather obvious that this is some sort of simple hyper-structure – a link is established between an element of the index and the element of the text. In a computerised text, it would be trivial to implement this as a jump directly from the index to the (part of the) page indicated, and indeed most help functions for current application programs will give ample examples of exactly this solution.
[11]
Commonly, also slightly more complex hyper structures are established, as indicated by this example:

[indexing term1] * [page in the book1], [page in the book 2]

[12]
As in the previous example, this implies a hyperlink between the indexing term and two pages in the book. But more interestingly, it also implies a hyperlink between the two pages: they are both characterised by the same indexing term, and the figure may be redrawn in another, equally simple form illustrating this:
[13]
In the conventional system, the relation is implied and not visible, the user reading the first or the second page will not be aware of their relationship unless consulting the index. This also holds for a computerised system based on the same structure, but for someone working consciously with hyperlinks, the fact that the same indexing term is used to characterise both pages may be sufficient for him or her to establish a direct link between the pages, typically using a word, phrase or other elements from the two pages as a button for executing the link.
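The step from an index to links between pages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index entries, page numbers and the function name `implied_page_links` are hypothetical and do not come from any system discussed in this paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical back-in-the-book index: indexing term -> pages referred to.
index = {
    "contract": [12, 47],
    "damages": [47, 103],
    "tort": [103],
}

def implied_page_links(index):
    """Return page -> set of (other page, shared indexing term).

    Two pages are linked when the same indexing term refers to both,
    which makes the relation implied by the index explicit.
    """
    links = defaultdict(set)
    for term, pages in index.items():
        for a, b in combinations(sorted(set(pages)), 2):
            links[a].add((b, term))
            links[b].add((a, term))
    return dict(links)

links = implied_page_links(index)
```

Here page 47 ends up linked to page 12 through «contract» and to page 103 through «damages», although neither link is visible to someone reading page 47 alone.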
[14]
In this simple way, relations between indexes and hyperlinks may be established, mapping the index onto a hyperlink structure. Before moving on to discuss some hyperstructures, a few comments on different forms of indexes – especially for the legal domain – may be offered.

3. Types of indexes

3.1. Introduction

[15]
There are several ways of classifying indexes, and in this paper no effort will be made to be exhaustive. It will suffice for our purposes to distinguish between three types of indexes:
[16]
  • Alphabetical indexes
  • Systematic indexes
  • Citation indexes
[17]
There are also different ways for establishing such indexes, typically manual ways of assigning indexing terms to the item being indexed (intellectual indexing), or using some fully automated way of establishing the index (automatic indexing). Obviously, there will be intermediary solutions, where a computer assisted system is used to support the indexer, proposing indexing terms, controlling that only valid terms are selected (where the system defines valid terms), etc.
[18]

One should also make the distinction between indexes where the indexing terms are freely chosen and indexes using a pre-defined set of indexing terms, generally known as a thesaurus.10

[19]
In this section, there will be a brief discussion of the three types of indexes, relating them to hyperstructures and the legal domain.

3.2. Alphabetical indexes

[20]
An alphabetical index is basically very simple. Often a back-in-the-book index is based on a manual reading of the book, the indexer noting for each page the terms thought appropriate to characterise that page, and finally sorting the references (term-page number couplets) into a consolidated, alphabetical index using – obviously – the terms as sorting criterion. In its simplest form, there are no formal controls for the indexing; the terms are chosen freely as deemed appropriate by the indexer. Even in such a situation, there would be, however, a tendency for the indexer to choose terms which appear in the text, especially in those parts of the text which are emphasised as descriptive (and which also play a role for the relevance function), typically headings.
[21]
In order to improve consistency in indexing – both inter- and intra-indexer consistency – one may refer the indexer to a thesaurus. This thesaurus would include a pre-defined set of indexing terms, often setting out relations between the terms (broader-narrower, related etc)11 and including recommendations on which term to select in different contexts, preferred terms where there are synonyms, etc. The terms may be expressed in an artificial «language», for instance a decimal system like the Dewey Decimal Classification – within the legal domain, the West Key system may be the best known example. In this case, one may have the definitions of the classification numbers in different languages, making it possible to index documents with the same numbers when working with indexers or documents of different languages. Computer-assisted systems have been devised for some of these indexing languages, ensuring that valid terms are specified, and helping the indexer to comply with the recommendations for using the indexing terms.
[22]

Even using a controlled indexing vocabulary, consistency remains a problem.12 Also, the thesaurus needs to be maintained in order to reflect new knowledge, new trends and re-evaluation of relations between terms. The resources necessary for maintaining the thesaurus should not be underestimated. Also, the process of intellectual indexing itself requires high competence on the part of the indexers, and is time-consuming, even when computer aided.

[23]
It is therefore understandable that one quite early started to look for methods of automating this process; the solution emerging in the early 1960s was text retrieval.13 This establishes an alphabetical index based on the text (authentic text, abstracts or whatever form of documents are to be indexed). This index was then inverted: looking up a page, the indexing terms in that page would be given – actually the «addresses» of the terms would be very detailed, typically giving document number, paragraph number, sentence number and word number within the sentence. The basic relations can be indicated by the small figure below:
[24]
It is obvious that searching for the word in the search file gives you one or more addresses, which then can be looked up in the text file. The link can be perceived as a hyperlink between the two files, and will also provide «hyperlinks» between documents in the sense that two documents indexed by the same term will be available, though only through the list resulting from the search, not from the documents themselves (cf the small figure illustrating this in sect 2 above). This also illustrates the basis for different operators defining the relation between search terms in the request, typically Boolean operators (AND, OR).
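The relation between a text file and a search file can be sketched as follows. This is a minimal illustration, not a description of any historical system: the two sample documents are invented, the helper names are hypothetical, and the address is simplified to (document, sentence, word) without the paragraph level mentioned above.

```python
import re
from collections import defaultdict

# Hypothetical "text file": document number -> authentic text.
text_file = {
    1: "The seller shall deliver the goods. The buyer shall pay the price.",
    2: "The buyer may claim damages. Delivery of goods is not required.",
}

def build_search_file(text_file):
    """Map each word to its addresses: (document, sentence, word position)."""
    search_file = defaultdict(list)
    for doc_no, text in text_file.items():
        sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
        for sent_no, sentence in enumerate(sentences, start=1):
            words = re.findall(r"\w+", sentence.lower())
            for word_no, word in enumerate(words, start=1):
                search_file[word].append((doc_no, sent_no, word_no))
    return search_file

def documents(search_file, term):
    """Return the set of documents in which the term occurs."""
    return {doc_no for doc_no, _, _ in search_file.get(term.lower(), [])}

search_file = build_search_file(text_file)
# Boolean operators reduce to set operations on the retrieved documents.
both = documents(search_file, "buyer") & documents(search_file, "damages")     # AND
either = documents(search_file, "seller") | documents(search_file, "damages")  # OR
```

Because every address carries a sentence and word position, the same structure could also support distance operators, which is one reason such detailed addresses are useful.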
[25]
Different authors will obviously make subjective choices for what terms to use in discussing the same issues. There were therefore in the early years some quite heated discussions on the relative performance of systems based on text retrieval and intellectual indexing.

    This can in the legal domain be exemplified by the competition between Mead Data Central and West Publishing Company. Mead’s system, Lexis, included little or no intellectual indexing, while West relied heavily on its Key Number System to the extent that when Westlaw was first introduced in 1975, it was based on headnotes and the numbers, leaving the user to look up the authentic text in one of the case reporters. This actually both reduced retrieval performance and impaired the source function compared to the competitor, Lexis, and led to West changing its policy in 1978, also including the authentic text of the case.

[26]
There is today no reason to elaborate on the different positions taken at that early stage. For both practical reasons and due to system performance, automatic indexing has prevailed – though one should note that this indexing often will include editorial elements of the document which have been added to the authentic form, both abstracts (headnotes) and indexing terms. Empirical research has clearly demonstrated that though retrieval based on the indexing of authentic text alone performed better than retrieval based only on the indexing of abstracts, retrieval based on the indexing of both elements outperformed either taken alone.14
[27]
Referring to the figure in sect 2, one may claim that if the same word occurred in different documents, then there was an implied hyperlink between two or more documents that could be made explicit, for instance by making the words into buttons: when activated, the link would be executed and the user taken to the other documents. This hypothesis will, perhaps, hold true for intellectual indexing, where there is a decision made on whether the document may be appropriately characterised by the indexing term chosen. But for automatic indexing of authentic text, this is as a general proposition not feasible. What has been called the extensitivity of authentic texts – especially for documents like cases – will make the relation between one single term and the issue discussed in the document too tenuous: the term may occur in an aside remark, in a slightly different context or with a wholly different meaning.15
[28]

One way of utilising the vocabulary to create a type of hyper-structure is offered by systems based on connectionism or neural networks.16 In this approach, a document is represented as a weighted graph. The identification of the document is one node, the different words of the documents in the database are a dimension of other nodes, and other dimensions may be the authors, statutory citations, etc.

[29]
In this graph, the upper line (dimension) represents the words of the authentic texts of the documents, while the lower line (dimension) represents the authors of the documents in the database. In the example, all the four words occur in the document «Min84», which has two joint authors. The nodes are linked, and the links are symmetrical, one going from the document to the words occurring in that document, the others going from a word to the documents in which the word occurs. In the same way the nodes representing authors are linked to documents. These links are assigned weights, initially for instance based on inverse frequency, the higher the frequency of a word or an author in the database, the lower the weight.
[30]
The system is used by specifying search terms. These are matched with the words and authors, and the matching nodes are loaded with a starting weight. This is then distributed throughout the network: following the links to the documents containing the word, the initial load is split into equal parts and multiplied with the weight of that link, resulting in a certain load being collected by different document nodes in a first cycle. In the second cycle, the same distribution is carried out, this time following the links from the documents to the words and authors occurring in that document, and so on. The result will be that the document nodes will collect different loads; after a specified number of cycles, all documents above a certain threshold load will be considered «retrieved», and ranked from highest to lowest load. One will appreciate that among the retrieved documents may be documents with no words matching those of the initial request, the load being gathered through the cycling from document nodes.
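The cycling described above can be sketched as a small spreading-activation routine. This is a sketch under several assumptions that the description leaves open: the toy database is invented (only the identifier «Min84» echoes the example), the link weight is taken as one divided by the number of documents containing the word or author, the same weight is used in both directions, and the load collected by a document node is summed over all cycles. The function name `spread` and its parameters are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy database: document -> features (words and authors).
docs = {
    "Min84": {"hypertext", "index", "retrieval", "law", "author:A", "author:B"},
    "Doc2": {"index", "thesaurus", "author:A"},
    "Doc3": {"thesaurus", "classification"},
}

def spread(docs, query, cycles=3, threshold=0.01, start_load=1.0):
    # Invert the document -> feature links to obtain feature -> documents.
    feature_docs = defaultdict(set)
    for doc, features in docs.items():
        for feature in features:
            feature_docs[feature].add(doc)
    # Assumed weighting: the more documents a feature occurs in, the lower.
    weight = {f: 1.0 / len(ds) for f, ds in feature_docs.items()}

    feature_load = {f: start_load for f in query if f in feature_docs}
    doc_load = {}
    collected = defaultdict(float)  # load gathered by each document node
    for cycle in range(1, cycles + 1):
        if cycle % 2 == 1:
            # Odd cycles: from words and authors to documents.
            doc_load = defaultdict(float)
            for f, load in feature_load.items():
                for d in feature_docs[f]:
                    doc_load[d] += load / len(feature_docs[f]) * weight[f]
            for d, load in doc_load.items():
                collected[d] += load
        else:
            # Even cycles: from documents back to words and authors.
            feature_load = defaultdict(float)
            for d, load in doc_load.items():
                for f in docs[d]:
                    feature_load[f] += load / len(docs[d]) * weight[f]
    ranked = sorted(collected.items(), key=lambda item: item[1], reverse=True)
    return [(d, load) for d, load in ranked if load >= threshold]

result = spread(docs, ["index"])
```

With the single search term «index», the third cycle passes a small load to «Doc3» by way of «thesaurus», so that a document containing none of the search terms is retrieved, though ranked last.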
[31]
One may see the result of such an exercise as some sort of topological map of the document domain, and this might be used to hyperlink the documents, even using the different loads to assign strengths to the links, and using passages in the documents representing the search request as buttons.

    The method is related to other strategies for enhancing performance (especially recall), such as relevance feedback, perhaps especially the «local metrical feedback» suggested by Attar and Fraenkel.17 Using an initial traditional search request, a part of the text of the retrieved documents (this part being words within a certain distance from the search terms, explaining the «metrical» element in the method) was examined by statistical means, and terms occurring with a certain significance in the identified parts would be included in a new search request, retrieving additional documents. This could also go on for several cycles, and might be represented in a hyper-structure similar to the one suggested above.

[32]
Though both connectionism and relevance feedback in principle may be used to create hyper-structures, this has not been explored in practice. And it would perhaps seem uncertain whether the links provided in such a way actually would be appropriate for indicating semantic relationships between documents, or parts of documents. Generally this uncertainty will prevail when using terms of the natural language in authentic text documents for creating hyperlinks, but one may find micro-vocabularies within the authentic text with a much more stable semantic interpretation. An example may be geographical names, which often are unique (at least if one takes into account that they start with a capital letter), and which may be sorted into a hierarchical structure describing a territory.

    Norway is basically divided into territories (mainland, continental shelf, Spitzbergen, and a few other odd islands in the Arctic and Antarctic regions). The mainland is divided into counties, the counties into municipalities.18 Towns, railway stations, lakes etc may be related to municipalities. Regrettably, the administrative division of the country is not this simple, for different purposes, more than 20 ways of dividing the country prevail – harbour districts, parishes, court jurisdictions, midwife districts, environmental zones along watercourses, natural parks, etc. The mapping of different features onto this rather complex structure is not trivial, and would certainly be even more complex for countries with a more sophisticated administrative structure, for instance a federal structure.

[33]
Other rather obvious micro-vocabularies would relate to companies (legal persons) or authorities (the government will typically have a rather well-defined hierarchical structure, where each agency slots into the structure in an unambiguous way). One might find it worthwhile to exploit such implied hyperstructures, for instance in case law material – the name of a company may be defined as a button linking that document to any other cases in which the company is mentioned, or – more restricted – in which the company is mentioned as party to the decision.19
[34]

An early example of trying to define such a micro vocabulary, were the attempts to find an algorithm which would qualify statutory definitions. If such a definition was identified, it would then be possible to set up – at least within the statute (or regulation) – intra-documentary links to those provisions using the defined term.20 A major example is the early work of Niblett and Nunn-Price on the British STATUS21 project, where a DEFINE function was included of the following structure:

[defined term] AND (mean OR means OR meaning OR include OR includes OR defined OR definitions OR deemed OR construed)(+12, –12).

[35]
This would find a provision defining the term if the defined term co-occurred with one of the disjunctive terms in the parenthesis within a «distance» of 12 words before or after the defined term. It was found that this worked for the British atomic energy legislation at that time,22 but attempting a similar exercise on a Norwegian corpus failed,23 perhaps because the drafting of Norwegian statutes is less stringent than English. It is suggested, however, that finding an algorithm for identifying statutory definitions is generally not feasible, but related work on citations has been more successful.
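The logic of such a proximity search can be sketched as below. This is not the STATUS implementation, whose internals are not described here; it only re-creates the stated condition for a single-word defined term. The sample provisions and the function name `looks_like_definition` are hypothetical.

```python
import re

# The disjunctive cue words of the DEFINE structure quoted above.
CUE_WORDS = {"mean", "means", "meaning", "include", "includes",
             "defined", "definitions", "deemed", "construed"}

def looks_like_definition(provision, term, distance=12):
    """True if a cue word occurs within `distance` words of the term."""
    words = re.findall(r"[a-z]+", provision.lower())
    term = term.lower()
    for i, word in enumerate(words):
        if word != term:
            continue
        # Window of `distance` words before and after the defined term.
        window = words[max(0, i - distance): i + distance + 1]
        if CUE_WORDS & set(window):
            return True
    return False

# Hypothetical provisions, written for illustration only.
defining = "In this Act, licence means a licence granted by the Minister."
using = "No person shall use a site without a licence."
```

Here `looks_like_definition(defining, "licence")` holds and `looks_like_definition(using, "licence")` does not. The sketch also makes the weakness visible: any provision that happens to contain «includes» or «deemed» near the term would be flagged, which is in line with the doubts about the general feasibility of the approach.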
[36]
The possibilities of micro-vocabularies for automatically establishing hyperlinks should not be discarded. By identifying micro-vocabularies in which the terms are distinct from those of the general natural language used in legal instruments, and which have somewhat stable and specific semantics, one would be able to establish inter-documentary links, and perhaps enhance the hyperstructures of legal material.
[37]

A special use of this possibility is demonstrated by one of the (rather few) systems for conceptual text retrieval, RUBRIC.24 In this system, a semantic representation of the domain25 is constructed in the tradition of conventional knowledge based systems. A link is determined between the knowledge based structure and the authentic text of the document by using «evidence rules». For instance, one of the concepts in the semantic structure is the concept of an offer being «friendly». For this there is constructed an evidence rule determining whether a certain document should be considered to fulfil this requirement:

(Evidence friendly

    ((SENTENCE «BOARD» «OFFER» «RECOMMEND») 0.9))
[38]
This is to be read as a document being qualified as satisfying the condition «friendly» with a probability of 0.9 if a sentence of that document contains the words «board», «offer» AND «recommend». In this way, it would be possible to establish links between a conceptual description of a domain (which in most ways would correspond to a structured and detailed systematic map of the domain, cf below) and the documents. The user may access the documents through the knowledge representation, and retrieve those documents identically classified by the evidence rules. This is really a two-layer hyper-structure, with the knowledge representation as a map to the documents through which documents can be accessed and grouped. It obviously would take little additional effort to introduce buttons for the concepts defined in the knowledge based structure in the documents themselves, linking them with other documents which the evidence rules have classified as containing the identical concept (perhaps with a probability threshold).
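The reading of the evidence rule given above can be sketched as follows. This is only an approximation written for illustration, not RUBRIC itself: the rule table, the sample documents and the function name `evidence` are hypothetical, and words are matched exactly, so an inflected form such as «recommends» would not match.

```python
import re

# Hypothetical rule base: concept -> (words required in one sentence, value).
EVIDENCE_RULES = {
    "friendly": ({"board", "offer", "recommend"}, 0.9),
}

def evidence(document, concept):
    """Return the rule's value if one sentence holds all required words."""
    required, value = EVIDENCE_RULES[concept]
    for sentence in re.split(r"[.!?]", document.lower()):
        if required <= set(re.findall(r"[a-z]+", sentence)):
            return value
    return 0.0

# Hypothetical documents.
doc_a = "The board met on Monday. The board will recommend the offer."
doc_b = "The board rejected the offer. Analysts recommend caution."

# Grouping by concept: documents at or above an assumed threshold of 0.5.
friendly_docs = [name for name, text in [("a", doc_a), ("b", doc_b)]
                 if evidence(text, "friendly") >= 0.5]
```

Only `doc_a` is classified as «friendly»: in `doc_b` the three words occur, but not within one sentence. The resulting list is the kind of grouping from which buttons between identically classified documents could be generated.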
[39]
Though reports on the performance of RUBRIC are impressive, it may still be questioned if the evidence rules perform sufficiently well outside a very narrow domain to be used for establishing any kind of inter-documentary hyper-structure.

3.3. Systematic indexes

[40]
Systematic indexing is really just a special case of intellectual indexing based on a thesaurus, where the thesaurus has the ambition of reflecting the systematic structure of the total legal domain, or of the sub-domain documented. Actually, the West Key Number System is such a rather detailed systematic scheme, and it may also be taken as an example of the interaction between a user community and the way a domain is perceived as structured: By using the system, the users are trained in understanding the structure of the domain according to the premises of the system – it becomes a «natural» way of perceiving this structure. In fact, such structures will always be based on choices, as in any other thesaurus, which may be more or less appropriate, and where what is «appropriate» easily will be a reference to what the users will accept. If a certain structure has been internalised by the members of the user community, it will be very difficult to have the community adopt another structure.
[41]
The systematic index will typically have a hierarchical structure. In the rather recently devised systematic structure of the Norwegian compilation of statutes in force and related publications,26 a structure with 26 top nodes has been introduced. These are here offered by the way of illustration:27
    1. Civil procedure law
    2. Claims, obligations
    3. Commerce, competition
    4. Companies
    5. Computers and law
    6. Constitutional law
    7. Criminal law
    8. Criminal procedure law
    9. European Economic Area, European Union, EFTA
    10. Expropriation
    11. Family, inheritance, probate
    12. General compilations of statutes
    13. Intellectual properties, copyright
    14. Interlegal law
    15. Labour
    16. Legal profession
    17. Liability, insurance
    18. Liquidation, negotiation of debts
    19. National insurance, social benefits and health
    20. Persons
    21. Petroleum
    22. Property
    23. Public administrative law
    24. Statistics
    25. Taxation and tariffs
    26. Transport
[42]
A reader will easily find elements in such a structure with which he or she disagrees.28 The structure is further refined to a depth of two levels, and becomes rather detailed. In our context, however, the point is that this really is a general map of the material, which imposes a hyper-structure. This can be – and is often – exploited by information systems, making it possible to make a search request based on the indexing, and in this way constructing a hit-list of all the material within the same category. The indexing term indicating the location within the structure is usually manually assigned to the document,29 and may easily be identified and used as a hyperlink button in the document itself – generally implemented as a two-step process: Pressing the button will give the user a list of documents classified with the same indexing term, and pressing any of the document titles will take the user to that document.
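The two-step process can be sketched with a small catalogue. Everything in the example is an assumption made for illustration: the document titles are invented, and the codes merely reuse the top-node numbers 7 (Criminal law) and 13 (Intellectual properties, copyright) from the list above with invented sub-levels.

```python
from collections import defaultdict

# Hypothetical catalogue: document title -> manually assigned code.
catalogue = {
    "Copyright Act": "13.1",
    "Patents Act": "13.2",
    "Case on software copyright": "13.1",
    "Penal Code": "7.1",
}

def build_category_index(catalogue):
    """Invert the catalogue: classification code -> document titles."""
    by_code = defaultdict(list)
    for title, code in catalogue.items():
        by_code[code].append(title)
    return by_code

def press_button(by_code, catalogue, title):
    """Step one: list the other documents carrying the same code."""
    return [t for t in by_code[catalogue[title]] if t != title]

def in_branch(catalogue, prefix):
    """All documents at or below a node of the hierarchy."""
    return sorted(t for t, code in catalogue.items()
                  if code == prefix or code.startswith(prefix + "."))

by_code = build_category_index(catalogue)
hit_list = press_button(by_code, catalogue, "Copyright Act")
```

Step two, following a title from `hit_list` to the document itself, is an ordinary link. `in_branch(catalogue, "13")` shows how the hierarchy lets the same button be widened from one category to a whole branch.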
[43]
One will appreciate the similarity between the hyperstructures created based on the intellectual indexing using a systematic approach as indicated above, and strategies like the one suggested by the RUBRIC system briefly mentioned above – in both cases, a systematic or semantic «map» of the domain is represented and linked to the documents, this providing a hyper-structure of the documents and the inter-documentary relations.

3.4. Citation indexes

[44]
Citations may be looked upon as a special kind of micro-vocabulary within the legal domain. They are intriguing for several reasons, one of them being that within a jurisdiction there will often be meta-norms governing the use of citations, typically a requirement that an argument based on a statute or a regulation should identify this instrument through a citation, or that when claiming that a precedent applies, the case citation should be included. They are also generally of a certain form which makes it feasible to identify the occurrence of a citation in the authentic text. It may be argued that the use of citations is by far the most obvious example of hyperlinks in legal material, and even more exciting because they have a certain semantic interpretation unlike the interpretation of a word or phrase occurring in the text. It is a rather well-developed micro-vocabulary which, perhaps, has not been explored to the extent one should expect.
[45]

Citation indexes are also well known outside the legal domain. Using a standard work within the domain addressing a certain issue, the index will list subsequent works citing this work, and in this way provide a perspective on the domain.30 A well-known system is the Science Citation Index (SCI).31 Within law, Shepard’s is perhaps the best known citation index, having given rise to the verb shepardizing (which also is a trade mark) for use of the index.

    Frank Shepard32 was a young salesperson working for the legal publisher EB Meyer & Co in Chicago. He was not a lawyer himself, but in his work, he became familiar with the problems lawyers encountered in legal research. He noted the importance of precedents within US law, and observed that many lawyers would jot citations to subsequent decisions in the margin of their case reporters. He started by examining decisions from Illinois, and noted the precedents cited, which he printed on labels which could be pasted into the margins of the case reporters, in this way replacing the hand-written notes. Shepard’s Adhesive Annotations were first published in 1873, and were at once successful. Shepard soon worked full time on his project, expanding the system to other states. He lived to see his system becoming indispensable for lawyers; his files have bountiful evidence of how useful lawyers found his system. Oliver Wendell Holmes, jr, for instance, wrote to him and stated that he found the system the greatest simplification of lawyers’ work ever. He died at his desk 28 September 1900, and was followed by his brother-in-law, Reid A Kathan, who replaced the labels by a book publication – a red leather bound book with one single word on the cover: Shepard. Kathan was followed by William Gutherie Packard («Pack») in 1929, and though there now were competitors in the market, Shepard remained leader.

    When a factory for bomb sights was extended during the Second World War, the company had to move. During two years, the operation was moved to Colorado Springs, it is maintained that this not only was due to the central location within the country, but because Packard wanted to locate the company as far from the coast as possible to reduce the peril of enemy air attacks, and because he was a dedicated outdoor person. The removal load was 23 railway carriages, and the property of 40 families filled another 20. The weight of the lead type was 250 tons, and the type was moved without a single line being misplaced – or a single edition delayed.

    In 1966, Shepard became a wholly owned subsidiary of McGraw Hill, and in 1990 the company moved to new facilities specially designed for an information technology infrastructure with some 100 kilometres of fibre optic cable and a database of 64 billion bytes, holding the citations of 5 million cases. Today, Shepard is integrated with legal on-line services – citations are updated with a maximum delay of 48 hours counting from the time the decision is handed down.

[46]
In the US, using the hyper-structure provided by citation indexes such as Shepard’s was common before the advent of computerised systems. But in many other jurisdictions – like the Norwegian – such tools have not traditionally been available, probably due to the lack of a market, which in turn reflects both a legal system less dependent on precedents than the American and a smaller lawyer population. But text retrieval makes it easy to find subsequent cases citing a case of interest: using the citation of the case of interest as a search request, one will retrieve all other documents containing that citation.
[47]
This has been used to devise a strategy for hyper-linking which differs from the one most commonly used, in which an explicit link between two documents is established using their system addresses. The strategy presumes that citations can be identified and interpreted by the system. As stated (and mentioned below), the micro-vocabulary of citations is sufficiently characteristic to do this with some certainty, but a legal information service provider would nevertheless opt for running a check when updating a database to ensure that the citations have a standard form. A program is then run, which for each document (1) secures the identification of the document in the form of a citation, and (2) identifies all citations within the document, organising them as a meta-document of the form indicated in sect 2 – looking up the citing document, all the cited documents are listed. This meta-document is then processed by the text retrieval system, establishing a «text file» (containing the citations of the citing documents) and a «search file» (containing the citations of the cited documents with the citing document as address). The result is a meta-database, often termed a «shadow database».
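The program run described in this paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of any actual service provider: the citation pattern (a form such as «Rt-1995-530»), the sample texts and the function names are all hypothetical, chosen for the example.

```python
import re

# ASSUMPTION: citations have one standard form such as "Rt-1995-530".
# A real micro-vocabulary of citations would need a richer pattern.
CITATION = re.compile(r"\bRt-\d{4}-\d+\b")


def build_meta_document(own_citation, text):
    """Return the citing document's citation and the distinct citations it contains."""
    cited = set(CITATION.findall(text))
    cited.discard(own_citation)  # ignore a document mentioning its own citation
    return own_citation, sorted(cited)


def build_shadow_files(documents):
    """documents maps each document's own citation to its text.

    Returns a "text file" (citing document -> cited citations) and a
    "search file" (cited citation -> citing documents used as addresses).
    """
    text_file = {}
    search_file = {}
    for own_citation, text in documents.items():
        citing, cited = build_meta_document(own_citation, text)
        text_file[citing] = cited
        for citation in cited:
            search_file.setdefault(citation, []).append(citing)
    return text_file, search_file


# Hypothetical sample documents, invented for the illustration.
documents = {
    "Rt-1990-100": "The court follows Rt-1980-50.",
    "Rt-1995-530": "See Rt-1990-100 and Rt-1980-50; compare Rt-1990-100.",
}
text_file, search_file = build_shadow_files(documents)
print(text_file)
print(search_file)
```

Because both files are derived from the documents alone, rerunning the program on new documents is all that is needed to keep the structure current.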
[48]
When the user has retrieved a document, a special command may be executed. The system constructs a search request consisting of the identification of the current document in the form of a citation, and will retrieve from the shadow database (1) the documents cited by the current document, and (2) all documents citing the current document. The user will be able to hyperjump to any of these documents, between them, and back to the current document. Or the user may choose to make a document accessed in this way the starting point for a new search of the shadow database.
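The special command may be illustrated by a small sketch. It assumes that the shadow database is available as a simple mapping from each citing document to the documents it cites; the document names and the function name are hypothetical, and a real text retrieval system would answer the request from its search file rather than by scanning every meta-document.

```python
# Hypothetical shadow database: citing document -> documents it cites.
META_DOCUMENTS = {
    "Case C": ["Case A", "Case B"],
    "Case B": ["Case A"],
    "Case D": ["Case C"],
}


def related_documents(current, meta_documents):
    """Use the citation of the current document as the search request.

    Returns (1) the documents cited by the current document and
    (2) all documents citing the current document.
    """
    cites = list(meta_documents.get(current, []))
    cited_by = sorted(
        doc for doc, cited in meta_documents.items() if current in cited
    )
    return {"cites": cites, "cited_by": cited_by}


# The user has retrieved "Case C" and executes the special command.
first = related_documents("Case C", META_DOCUMENTS)
print(first)

# A document reached in this way becomes the starting point of a new search.
second = related_documents(first["cites"][0], META_DOCUMENTS)
print(second)
```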
[49]
An attractive feature of this strategy is that it also maintains the hyper-structure with little effort. When the database is updated, the shadow database is updated as well, and no extra resources are necessary (apart from checking that the citations have the appropriate format in the new documents).33
[50]

A more extensive study of case citations as a hyperstructure was made by Tapper.34 The objective of the study was, taking a current case as the starting point, to find other «relevant» cases. The method proposed was a version of a vector system. In the vector systems generally applied to legal research, each document is represented in n-dimensional space, «n» being the number of different words in the database, and each element of the vector (representing a word) is assigned a value: either «0» for a word not occurring in the document, or a positive value, typically the frequency of the word in that document. «Similarity» between documents is then measured by the angle between the vectors, typically using a cosine function.
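The general vector approach can be made concrete with a minimal sketch. It assumes plain word frequencies as vector values and whitespace splitting as the only text processing; the sample texts and function names are hypothetical. Words absent from a document are simply not stored, which corresponds to the value «0».

```python
import math
from collections import Counter


def word_vector(text):
    """Represent a document as word -> frequency (missing words count as 0)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical three-word "documents", invented for the illustration.
contract_1 = word_vector("contract breach damages")
contract_2 = word_vector("contract breach remedy")
tax = word_vector("tax assessment appeal")

print(cosine_similarity(contract_1, contract_2))  # two of three words shared
print(cosine_similarity(contract_1, tax))         # no words shared
```

A value near 1 means a small angle between the vectors (similar documents), and 0 means that the documents share no words.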

[51]

In Tapper’s study, the vectors were made up of the case citations occurring in the cases. Different methods for comparing similarity were also explored. Of particular interest, and pertinent to our discussion, is the semantic interpretation made of citations. In assigning values to the vector elements, several aspects of the cited case were considered:35

[52]
  • Age: The older the case, the higher the value (the year of the cited cases was subtracted from the year of the experimental data – 1974 – and rounded down to the nearest whole number).
  • Jurisdiction: The more remote the jurisdiction, the higher the value. For the American material,36 the jurisdictional value was nil if the cited case was decided by the Supreme Court or the same Circuit Court, ten if it was decided by a court of a different circuit, and twenty if it was decided by a state court. In the English material,37 the value was nil for any English or Welsh cases, ten for any other jurisdiction.
  • Hierarchical value: In the American material, this was ten for the Supreme Court, twenty for the Court of Appeals, and thirty for Federal District courts, with broadly similar rules for the state courts. In the English material, decisions by the House of Lords were assigned the value ten, the Court of Appeal the value twenty and the High Court the value thirty.
  • Frequency: The number of occurrences of the citation in the current case.
[53]
One will from this scheme see that a high value will be ascribed to a cited case if the case is old, is from a remote jurisdiction, is low in the hierarchical court system, and is cited frequently. To some extent this is the opposite of the weight ascribed to arguments derived from the cited cases (with the exception of frequency): in that respect the weight will be high if the precedent is recent, is from the same jurisdiction, and is from the Supreme Court or another court close to the top of the hierarchy. This is exactly why Tapper decided to invert the weight scheme for measuring proximity between the current and another case: the older the cited decision, the more remote its jurisdiction, and the more minor the court, the stronger an indication of similarity it is that the same decision is cited in two cases.
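The four aspects listed for the American material can be written out as a small sketch. The component values follow the scheme as described above, but how the components were combined into one vector element is not stated here; the plain sum below is an assumption made only so that the example yields a single number, and the category labels and function names are likewise hypothetical.

```python
EXPERIMENT_YEAR = 1974  # year of the experimental data

# Values for the American material as listed above; the keys are
# hypothetical labels invented for this sketch.
JURISDICTION_VALUE = {
    "supreme_or_same_circuit": 0,
    "other_circuit": 10,
    "state_court": 20,
}
HIERARCHY_VALUE = {
    "supreme_court": 10,
    "court_of_appeals": 20,
    "district_court": 30,
}


def citation_components(year, jurisdiction, court, frequency):
    """Return the four component values for one cited case."""
    return {
        "age": EXPERIMENT_YEAR - year,
        "jurisdiction": JURISDICTION_VALUE[jurisdiction],
        "hierarchy": HIERARCHY_VALUE[court],
        "frequency": frequency,
    }


def combined_value(components):
    # ASSUMPTION: a plain sum, used only for illustration. The actual
    # combination used in the study is not given in the text.
    return sum(components.values())


recent_supreme = citation_components(1972, "supreme_or_same_circuit", "supreme_court", 1)
old_remote = citation_components(1924, "other_circuit", "district_court", 1)

print(combined_value(recent_supreme))
print(combined_value(old_remote))
```

Under this assumption the old decision of a district court in another circuit receives a far higher value than the recent Supreme Court decision, which is the inversion of ordinary precedential weight discussed above.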
[54]
This remains one of the more important efforts to interpret the semantics of case citations, and the clustering based on such vectors provided some sort of topography of the «citation space» which could also be exploited in establishing hyperlinks between the documents. One would therefore not only be able to use the citations as hyperlink buttons taking the user to the cited case, but might also provide general buttons giving the user a ranked table of the cases most similar as measured by the citations.
[55]

The experiment has demonstrated that this approach is promising. As Tapper concludes:38

    «It is possible to summarise the position by asserting that the research described in this paper has elaborated and vindicated the technique of using citation vectors. It still remains to be perfected.»

[56]
Regrettably, we are still waiting.
[57]
Of course, there are citation structures also in different legal material, typically based on regulatory citations (used as a common term for citations to statutes, regulations and similar legal sources based on statutory authority – the exact hierarchy of such instruments varies between jurisdictions). In such structures, one may make some distinctions for the analysis:
[58]
  • Authentic and editorial citations. Authentic citations are those which are embedded in the text of the legal source itself. They have the same legal authority as the text: a statutory citation in the text of one statutory provision incorporates the cited provision in the citing provision, and may be an alternative to statutory definitions. Editorial citations are those added by the editor of a compilation of statutes in force or of another publication. They are a special example of legal literature, representing the opinion of the editor that there is a certain relation between the two provisions. Often the editorial citation is just a mirror of an authentic citation in another provision: Section A includes an authentic citation to Section B, but Section B does not cite Section A – this is then made explicit by the editor including an editorial citation from Section B to Section A. Taken as a whole, one will find that statutory citation structures tend to be symmetric.
  • Intra- and inter-regulatory citations. Some citations will be between provisions within the same regulation, while others will be between different regulations. The intra-regulatory citations are only one of several important structural elements in a regulation; another important structure is the sequence of provisions, as there often are implied links between adjoining provisions.39
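The mirroring of authentic citations by editorial citations can be illustrated with a short sketch. It assumes that citations are held as (citing, cited) pairs of provision names; the sample provisions and the function name are hypothetical.

```python
def editorial_mirrors(authentic):
    """Return the reverse citations an editor would add.

    authentic is a set of (citing, cited) pairs found in the text of the
    provisions themselves. A mirror is added only where the cited
    provision does not already cite back.
    """
    return {
        (cited, citing)
        for citing, cited in authentic
        if (cited, citing) not in authentic
    }


# Section A cites Section B, but B does not cite A.
# Sections A and C already cite each other.
authentic = {
    ("Section A", "Section B"),
    ("Section A", "Section C"),
    ("Section C", "Section A"),
}
mirrors = editorial_mirrors(authentic)
combined = authentic | mirrors

# With the editorial citations added, every link can be followed both ways.
is_symmetric = all((cited, citing) in combined for citing, cited in combined)
print(mirrors)
print(is_symmetric)
```

Keeping the two sets apart preserves the distinction in legal authority between the authentic citations and those supplied by the editor.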
[59]
Obviously, there will be regulatory or case citations also in other legal sources, like case law and legal literature. The citations riddle the legal material with a rich hyperstructure which can be exploited, and where the semantics are rather different than for those pointers set on the basis of the ad hoc opinion of the person designing a web page.

4.

Memex ^

[60]
In 1945, Vannevar Bush40 published a paper entitled «As we may think»,41 in which he addresses the problem of managing large files of documents or records. To solve this problem, he suggests a device he dubs «memex»:

    «Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, «memex» will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.

    It consists of a desk, and while it can presumably be operated from a distance, it is primarily the piece of furniture at which he works. On the top are slanting translucent screens, on which material can be projected for convenient reading. There is a keyboard, and sets of buttons and levers. Otherwise it looks like an ordinary desk.

    In one end is the stored material. The matter of bulk is well taken care of by improved microfilm. Only a small part of the interior of the memex is devoted to storage, the rest to mechanism. Yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so he can be profligate and enter material freely.

    Most of the memex contents are purchased on microfilm ready for insertion. Books of all sorts, pictures, current periodicals, newspapers, are thus obtained and dropped into place. Business correspondence takes the same path. And there is provision for direct entry. On the top of the memex is a transparent platen. On this are placed longhand notes, photographs, memoranda, all sort of things. When one is in place, the depression of a lever causes it to be photographed onto the next blank space in a section of the memex film, dry photography being employed.

    There is, of course, provision for consultation of the record by the usual scheme of indexing. If the user wishes to consult a certain book, he taps its code on the keyboard, and the title page of the book promptly appears before him, projected onto one of his viewing positions. Frequently-used codes are mnemonic, so that he seldom consults his code book; but when he does, a single tap of a key projects it for his use. Moreover, he has supplemental levers. On deflecting one of these levers to the right he runs through the book before him, each page in turn being projected at a speed which just allows a recognizing glance at each. If he deflects it further to the right, he steps through the book 10 pages at a time; still further at 100 pages at a time. Deflection to the left gives him the same control backwards.

    A special button transfers him immediately to the first page of the index. Any given book of his library can thus be called up and consulted with far greater facility than if it were taken from a shelf. As he has several projection positions, he can leave one item in position while he calls up another. He can add marginal notes and comments, taking advantage of one possible type of dry photography, and it could even be arranged so that he can do this by a stylus scheme, such as is now employed in the telautograph seen in railroad waiting rooms, just as though he had the physical page before him.»

[61]

When text retrieval systems came along, many hailed these as the realisation of the memex devices envisioned by Bush.42 The same happened when hyperlinks became popular within the context of World-Wide Web.43 In this paper, an attempt has been made to indicate that both approaches basically rely on the same considerations – either relying on intellectual indexing or using syntactic features of the document to establish a link which can be given a semantic interpretation. Hyperlinks assigned on the basis of an intellectual effort by the person designing a web-page have the same advantages and disadvantages as traditional intellectual indexing: they may be based on an expert opinion, and therefore be of great value to the user, but the reason for establishing the link may not be self-evident, and may rest on a gut feeling rather than on any principles which have been, or easily may be, made explicit – the relation will often be described by the two documents being «important» or «interesting» or even «relevant» (whatever that may be taken to mean) with respect to each other. Such hyperlinks are also messy to maintain.

[62]
Therefore one should consider methods which are based on more objective criteria, and where the semantics of the hyperlink may be appreciated in advance by the user, basing this on micro-vocabularies, domain maps, citations etc as suggested above. This will not replace intellectual indexing – the advent of automatic indexing by text retrieval systems did not replace intellectual indexing, not even for computerised systems – but will certainly richly supplement such indexing. And it may lead to a deeper appreciation or understanding of the structures in and between documents implied by the syntactical elements in the documents of a certain domain, the legal domain being unusually rich in such structures.

 

Jon Bing

Norwegian Research Center for Computers and Law

Faculty of Law, University of Oslo

PO Box 6702 St Olavs plass, N-0130 OSLO, Norway

 


  1. 1 But one should bear in mind that the functions may be implemented by separate systems interacting to create one complete information system. Typically, the library has a card catalogue with brief entries for each document in its files, computerised or manual. This will enable the user to retrieve documents, and make a preliminary relevance assessment based on the title and, perhaps, a brief abstract of the document. But in order to satisfy the source function, the user has to leave the catalogue and walk along the shelves in order to find the physical publication – or, the user may have to make an inter-library loan in order to satisfy the source function. In such a composite information system, the retrieval and relevance functions may be rather efficient, while the source function is somewhat more cumbersome and costly.
  2. 2 This paper will restrict itself to examples from the legal domain.
  3. 3 With some qualifications, most jurisdictions will allow the user to base arguments on consolidated versions of legislation, where amending statutes or regulations have been edited into the amended statute or regulation.
  4. 4 «Relevance» is a concept used in many different meanings in the literature of information systems, often in a rather informal way. In this paper, an attempt is made to restrict the use of «relevance» to the situation where an argument derived from a document contributes towards the legal decision. It is a binary concept: a document either is or is not relevant in this terminology. Grading is reserved for the derived argument, which will have a relative weight with respect to other arguments available in the decision process. The weight is determined by several factors; two major factors will be the rank of the type of legal source (typically statutes ranked above regulations), and the proximity of the facts of the argument to the facts of the issue which is to be decided. How to consolidate a decision with conflicting arguments will not be discussed in this paper.
  5. 5 Cf Jon Bing Handbook of Legal Information Retrieval, North-Holland, Amsterdam 1984:93–96 with references.
  6. 6 It really is similar to the even more traditional KWIC-format (key words in context).
  7. 7 Not only the context of the problem, but also the context of the other documents appreciated by the user.
  8. 8 But it may refer to an element of the page, typically a footnote or a figure.
  9. 9 In principle the user reads sequentially, but typically the experienced user has acquired reading skills which allow him or her just to «glance» at the page to determine the relevant passage.
  10. 10 «Thesaurus» is a term used with slightly different meanings. Often it refers to a dictionary which, when the user looks up a certain term, specifies synonyms or related terms in the natural language – as the thesaurus tool for English in a word processing program like MS Word. In information retrieval, it is also often used for a list that takes words of the search request as input, and expands these words by what is defined as synonyms or near synonyms – in this way, thesaurus is used in important examples of legal information retrieval systems, like the former and quite famous function of the ITAGIURE system. Such «search thesauri» provide a post co-ordination of the search request, and are in principle rather different from the thesauri discussed here, which provide ante co-ordination in indexing the documents themselves. Obviously, such thesauri also may be useful to consult when specifying a search request, but that is nearly just a side-effect of their use.
  11. 11 This would allow the user to move up or along such links for retrieval, actually providing intra-hyperlinks within the index, cf below.
  12. 12 A detailed example is the MEDLARS evaluation (1966–1967), cf Jon Bing Handbook of Legal Information Retrieval, North-Holland, Amsterdam 1984:214–216.
  13. 13 Often called «full text retrieval». The «full text» usually indicates the form of the document to be indexed, namely that this is the authentic text (as formulated by the original author, court or regulator) in contrast to an abbreviated form of the text (like an abstract). In this paper, the term «text retrieval» is used for the method, while the type of document indexed will be specified (or implied).
  14. 14 Jon Bing Handbook of Legal Information Retrieval, North-Holland, Amsterdam 1984:246.
  15. 15 Jon Bing and Trygve Harvold Rettskildebruk og informasjonssystemer, Papers on Computers and Law 2a, Oslo 1973:192–195.
  16. 16 The example is based on Adaptive Information Retrieval, a system described by Richard K Belew «A Connectionist Approach to Conceptual Information Retrieval», Proceedings of the first international conference on artificial intelligence in law, ACM Press, New York 1987:116–126. The most comprehensive discussion may be found in Daniel E Rose A Symbolic and Connectionist Approach to Legal Information Retrieval, Lawrence Erlbaum Associates, Publishers, Hillsdale NJ (1994). The simple example in the text should be sufficient to illustrate the possibilities.
  17. 17 Jon Bing Handbook of Legal Information Retrieval, North-Holland, Amsterdam 1984:232 with references.
  18. 18 «Fylker» and «kommuner».
  19. 19 This would not be difficult, as the structure of a decision generally will have a paragraph containing the names of the parties which can be identified, even if this paragraph is not explicitly tagged.
  20. 20 Cf for instance GBF Niblett and NH Price The STATUS Project: Searching Atomic Energy Law by Computer, Culham Laboratories, 1969 and «Mechanized searching of Acts of Parliament», Information Storage and Retrieval 1970:269.
  21. 21 STATUS was probably the first portable text retrieval system, written in FORTRAN.
  22. 22 The total length of this legislation was 138,661 words.
  23. 23 Jon Bing and Trygve Harvold Rettskildebruk og informasjonssystemer, Papers on Computers and Law 2a, Oslo 1973:178.
  24. 24 Cf Jon Bing Conceptual Text Retrieval, CompLex 9/88, Tano, Oslo 1988:20–24 with references.
  25. 25 The domain in question was that of corporate mergers and acquisitions.
  26. 26 The structure is imbedded in the bibliographical program BibJure developed by Pål A Bertnes and marketed by DIAGNOSTICA. In the original, the slots 1-3 and 6 are open (certainly for good reasons, but not obvious to the author), therefore the last category in the original is numbered 30.
  27. 27 In the original, the titles are organised alphabetically. By translation into English, this order is obviously corrupted. The list has been re-organised for the benefit of the example, and for this reason the numbers do not correspond to those of the original.
  28. 28 And the author should hasten to add that the cause may be inadequate translations of the brief captions in the structure.
  29. 29 Often used to produce manual or back-in-the-book indexes for the material in addition to the use in the computerised system.
  30. 30 Cf for instance Th P Loosjes On Documentation of Scientific Literature, Butterworths, London 1967.
  31. 31 SCI was founded by Eugene Garfield, who discusses his approach for instance in The Foundation to Access to Knowledge, Syracuse University 1968:169–196.
  32. 32 This anecdotal material is mainly based on material made available by Shepard at http://www.lawtown.com/ccentral/ccentral.html.
  33. 33 The author acknowledges that this strategy has been brought to his attention by the Norwegian national legal information service provider, Lovdata, which has implemented this in its system based on the text retrieval program SIFT and the user interface SIR. It is not only implemented for case citations, as the discussion in the text may imply, but also for citations of and between statutes and regulations.
  34. 34 Cf Colin Tapper An Experiment in the Use of Citation Vectors in the Area of legal Data, CompLex 9/82, Scandinavian University Press, Oslo 1982. The work reported in this publication follows work completed at Stanford University in 1976.
  35. 35 Cf Colin Tapper An Experiment in the Use of Citation Vectors in the Area of legal Data, CompLex 9/82, Scandinavian University Press, Oslo 1982:3.
  36. 36 Volume 500 of the Federal Reporter (Second series), covering part of 1974, totalling 248 cases on 1,397 pages.
  37. 37 Queen’s Bench Reports for 1974, 67 cases on 837 pages.
  38. 38 Cf Colin Tapper An Experiment in the Use of Citation Vectors in the Area of legal Data, CompLex 9/82, Scandinavian University Press, Oslo 1982:101.
  39. 39 In text retrieval systems, a provision is generally qualified as a document. If a provision is retrieved, it is necessary for the system to support some method for being able to jump to the preceding or following provisions, which typically will not be part of the answer set. Surprisingly many text retrieval systems do not offer this control structure, which in principle is a hyper-structure in the regulatory material.
  40. 40 Vannevar Bush was president of the Carnegie Institution of Washington from 1939 to 1955.
  41. 41 Atlantic Monthly July 1945, cf also http://www.isg.sfu.ca/~duchier/misc/vbush.
  42. 42 Lewis O Kelso may be the first to suggest creating an automatic retrieval system to assist legal research in «Does the Law need a technical Revolution?», Rocky Mountain Law Review, cf Reed C Lawlor «Information technology and law», Advances in Computers 1962:310. Kelso was inspired by Bush’s suggestion.
  43. 43 Cf Tor Nørretranders Stedet som ikke er, Aschehoug, Copenhagen 1997:75.