As more and more textual data becomes digitally available, the need for performant and maintainable algorithms and technologies for text mining and natural language processing grows. Most approaches to information extraction (IE) fall into one of two categories: rule-based (i.e. knowledge-based) or machine learning (ML) based technologies. A large part of the information extraction systems in research are nowadays based on statistical methods using models generated by ML algorithms, leaving rule-based methodologies out of the focus of modern research on IE. In industrial practice, however, systems based on ML are still underrepresented compared with rule-based ones (Chiticariu et al. 2013). Although this is not fully in line with current trends in academic research, there is strong evidence that rule-based information extraction offers huge research potential in both the application and the development of tools. In addition, there are many research questions that are highly relevant to information extraction practice but are addressed by neither academic nor industrial research. These include:
- Expressiveness of rule and domain specific languages
- Induction and creation of rules
- Testing and evaluation of rules (including performance)
- Maintenance of rules in an industrial and productive environment
- Frameworks and tools for the management of rules and their application on large document corpora
The main focus of this short paper is to discuss the importance of domain specific languages for rule-based pattern annotation on textual data. It concentrates on current shortcomings of pattern languages and of rule-based IE in general, and proposes possible research questions that address these deficiencies.
Advantages of Rule-based IE
Rule-based and ML approaches need not be used exclusively; they can complement each other very well. As supervised learning always requires a (large) amount of training data, one of the first steps is to sample and create training data sets. In domains where pre-labelled samples are scarce or non-existent, this step is usually time-consuming. Rule languages can be used to extract a large number of high-quality training samples. Owing to their declarative nature, the knowledge engineer is free to control the resulting precision and recall, either by specifying broader rules that capture more samples but also some false examples, or by specifying narrower rules that yield a smaller training data set of higher precision. This methodology is a form of bootstrapping modern machine learning technologies, and recent projects focus on exactly this challenge. For example, the project «Snorkel», run by Stanford University, is a framework that allows «creating, modelling, and managing training data, currently focused on accelerating the development of structured or «dark» data extraction applications for domains in which large labelled training sets are not available or easy to obtain.» The main idea is called «data programming»: functions, i.e. rules, are used to create data sets, which are subsequently used to train powerful and flexible machine learning approaches.
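The trade-off between broader and narrower rules can be illustrated with a minimal Python sketch. It does not use the Snorkel API; the rule functions, the example sentences and their gold labels are all hypothetical and serve only to show how rule breadth shifts precision and recall when selecting training samples for a "contains a date mention" task.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical mini-corpus with hand-assigned gold labels
# (True = sentence contains a date mention). Illustration only.
CORPUS: List[Tuple[str, bool]] = [
    ("The contract was signed on 12 March 2015.", True),
    ("Delivery is due in 2019.", True),
    ("The invoice number is 4711.", False),
    ("Payment was received on 3 June 2016.", True),
    ("No deadline was agreed.", False),
]

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")


def broad_rule(sentence: str) -> bool:
    """Broad rule: any four-digit number is treated as a date mention."""
    return re.search(r"\b\d{4}\b", sentence) is not None


def narrow_rule(sentence: str) -> bool:
    """Narrow rule: only a full 'day month year' pattern counts."""
    pattern = rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"
    return re.search(pattern, sentence) is not None


def precision_recall(rule: Callable[[str], bool],
                     corpus: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Precision and recall of the samples selected by a rule."""
    selected = [rule(text) for text, _ in corpus]
    gold = [label for _, label in corpus]
    true_pos = sum(1 for s, g in zip(selected, gold) if s and g)
    n_selected = sum(selected)
    n_gold = sum(gold)
    precision = true_pos / n_selected if n_selected else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    for name, rule in (("broad", broad_rule), ("narrow", narrow_rule)):
        p, r = precision_recall(rule, CORPUS)
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

On this toy corpus the broad rule selects every true date sentence but also the invoice number, whereas the narrow rule selects only correct samples and misses the bare year. The sentences a rule selects could then serve as (noisy) positive training examples for a supervised learner.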
Limitations and Shortcomings
Although rule-based IE has many advantages, current state-of-the-art rule languages such as UIMA Ruta and JAPE also have several drawbacks. Their declarative nature offers huge potential for directly applying domain knowledge to information extraction tasks. However, the tooling and syntax of these languages still prevent their widespread use outside the NLP community. Domain experts, i.e. end-users with little technical background, find it hard to familiarize themselves with the syntax and quirks of the current mainstream rule languages. Consequently, the need emerges for joint development among domain experts, software engineers and so-called legal data scientists.
Addressing the Shortcomings of Rule-based IE: A Brief Research Perspective
A first step toward revitalized research interest in rule-based information extraction would be the definition of a modern standardized rule language. This can be accomplished either by a formally well-defined pattern language or by the specification of a virtual machine that fosters the embedding of different IE frameworks such as Apache UIMA and GATE. Such an open and common platform could spark and bundle new research interest. Possible research can address issues such as the representation and efficient indexing of annotations, the automatic optimization of rules and platform independent tooling. Beyond the technical concepts and implementations, a centralized community surrounding an open standard would increase the amount of documentation, tools, language patterns and training material, in contrast to the current rule languages, for which documentation is scarce.
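To make the issue of annotation representation and indexing more concrete, the following Python sketch shows one possible design: stand-off annotations stored as character offsets and kept sorted so that span queries can use binary search. The class and method names are hypothetical and are not taken from Apache UIMA, GATE or any other framework.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True, order=True)
class Annotation:
    """Stand-off annotation: half-open character span [begin, end) plus a label."""
    begin: int
    end: int
    label: str


class AnnotationIndex:
    """Hypothetical index that keeps annotations sorted by begin offset."""

    def __init__(self, annotations: Iterable[Annotation]) -> None:
        self._items: List[Annotation] = sorted(annotations)
        self._begins: List[int] = [a.begin for a in self._items]

    def covered_by(self, begin: int, end: int,
                   label: Optional[str] = None) -> List[Annotation]:
        """Return annotations lying fully inside [begin, end)."""
        # Binary search narrows the candidates to those starting in the span.
        lo = bisect_left(self._begins, begin)
        hi = bisect_left(self._begins, end)
        return [a for a in self._items[lo:hi]
                if a.end <= end and (label is None or a.label == label)]


TEXT = "Payment was received on 3 June 2016."

INDEX = AnnotationIndex([
    Annotation(24, 25, "Number"),
    Annotation(26, 30, "Month"),
    Annotation(31, 35, "Number"),
    Annotation(24, 35, "Date"),
])

if __name__ == "__main__":
    # A rule such as "all Number annotations inside a Date" becomes a span query.
    for a in INDEX.covered_by(24, 35, label="Number"):
        print(a.label, repr(TEXT[a.begin:a.end]))
```

A sorted list suffices for this illustration; a production system would probably need an index that also supports insertions and overlapping-span queries efficiently, which is exactly the kind of question a common platform could investigate.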
Chiticariu, L./Li, Y./Reiss, F.R. (2013), Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems! In EMNLP, pp. 827–832.
Kluegl, P./Atzmueller, M./Hermann, T./Puppe, F. (2009), A Framework for Semi-Automatic Development of Rule-based Information Extraction Applications. In LWA, pp. KDML–56.
Li, Y./Kim, E./Touchette, M.A./Venkatachalam, R./Wang, H. (2015), VINERY: A Visual IDE for Information Extraction. Proceedings of the VLDB Endowment, 8(12), pp. 1948–1951.
Wilcock, G. (2017), The Evolution of Text Annotation Frameworks. In: Ide, N./Pustejovsky, J. (eds.), Handbook of Linguistic Annotation. Springer, Dordrecht.