Jusletter IT

Machine translation, language analysis, and mobile applications in the terminology domain

  • Authors: Bartholomäus Wloka / Gerhard Budin / Werner Winiwarter
  • Category: Articles
  • Region: Austria
  • Field of law: Law and Language
  • Collection: Tagungsband-IRIS-2013
  • Citation: Bartholomäus Wloka / Gerhard Budin / Werner Winiwarter, Machine translation, language analysis, and mobile applications in the terminology domain, in: Jusletter IT 20 February 2013
In this paper we present the results of our contribution to the TES4IP project, funded by the Austrian Research Promotion Agency (FFG), 10th call. Our focus was the integration of machine translation into the broad spectrum of the project in order to offer a translation service, to lay the basis for further language analysis on a structural level, and to study and implement the capabilities of modern handheld devices, i.e. smartphones, to further enhance the results and potential of the tasks in TES4IP.

Table of Contents

  • 1. Introduction
  • 1.1. Translation Service
  • 2. Language Analysis
  • 3. Applications for Mobile Devices
  • 4. Conclusion
  • 5. Literature

1.

Introduction ^

[1]
The goal of the TES4IP (Terminological Services for the Intellectual Property Domain) project was to establish a framework consisting of workflows to enhance queries of various documents, including legal databases. Legal documents in particular contain a significant amount of recurring terms, phrases, and linguistic constructs, which was exploited in the query workflows to enable fast and goal-oriented query processing. The conceptual modeling of e-government workflows in conjunction with the linguistic analysis of the textual data resulted in the implementation of a real-world application framework. An important aspect within this framework was multilingual application support. Due to globalization and an increasing involvement in international legal problems, the need for queries in multiple languages has become imperative. In order to allow for quick analysis and evaluation, the translation of a document has to be done quickly and accurately. Quick translation of large text resources by human experts is not feasible in this case; therefore we have integrated an automated translation service.
[2]
Automated translation services, i.e. Machine Translation (MT) services, have become ubiquitous over the last few years (Boitet, 2009). The applications range from translations of Web pages of any content for casual users to highly specific translation tasks in professional environments. The goal of MT for the general public is to offer wide translation coverage, whereas the goal of MT for professional, domain-specific applications is to increase accuracy; our focus is the latter. We utilize the approach of Statistical Machine Translation (SMT) (Brown, 1990), which we have combined with methods from Example-based Machine Translation and algorithms from bioinformatics to create a hybrid MT system, based on our previous work (Wloka, 2010).
[3]
While processing training data for the MT system, a wide range of Natural Language Processing (NLP) procedures is applied to make the raw textual data more meaningful and hence improve the accuracy of the translation result. In addition, this NLP information can be utilized in a variety of scenarios for analyzing text corpus data, which adds the opportunity to integrate a language analysis component into the translation pipeline.
[4]
While MT is already widely available on the World Wide Web, e.g. through Google Translate, the application potential of MT on mobile devices is barely tapped. We aim to survey these potential applications and find efficient and practical ways of implementing them.

1.1.

Translation Service ^

[5]
We have chosen SMT as the basis of our translation service due to its advantages over other methods such as rule-based or example-based translation. The SMT process can be automated to a large extent and does not require expert evaluation for training. Furthermore, SMT avoids the problem of word ambiguity by relying solely on the statistical probability of the result rather than on lexical entries. However, this requires a large amount of input data, i.e. a bilingual corpus, in order to train the system. In most cases it is helpful for the corpus to be domain-specific.
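For reference, the standard SMT formulation (Brown, 1990) underlying this approach selects, for a source sentence f, the target sentence e that maximizes the posterior probability, decomposed into a translation model and a language model:

\hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f) = \operatorname*{arg\,max}_{e} P(f \mid e)\, P(e)

Both P(f | e) and P(e) are estimated from the training data (the aligned bilingual corpus and monolingual target-language text, respectively), which is why the size, quality, and domain of the corpus directly determine translation accuracy.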
[6]

We have chosen Moses (Koehn, 2007; http://www.statmt.org/moses/manual/manual.pdf) as the baseline tool for our work. It is a state-of-the-art SMT toolkit which has gained tremendous momentum over the last few years within the MT community and is used in small- and large-scale projects all over the world.

[7]
The translation service, including a Web interface, was implemented on a server infrastructure located at the Zentraler Informatikdienst of the University of Vienna (Illustration 1).

Illustration 1: Web Interface of the translation service

[8]
The coverage of an MT system generally conflicts with its accuracy: the wider the domain of the training set, i.e. the bilingual corpus, the higher the coverage of the production system. As mentioned in the introduction, the system can be trained with multiple domain resources in one session or divided into separate tasks. To guarantee the best possible accuracy, the quality of the bilingual corpus is crucial, since the translation depends solely on the training from this resource. The best results are achieved if the domain of the corpus is kept in line with the translations demanded in the production environment. This naturally conflicts with coverage, as mentioned previously. In order to resolve this conflict, the system can be trained with multiple domain resources, though the size of the resources has to be significantly larger to maintain high accuracy, or the process can be split into domain-specific tasks, which can then be chosen in the production environment.
[9]

In order to provide the best possible functionality we have focused on the following technological and functional requirements:

  • Technological: For seamless integration with other modules and services of the TES4IP platform, the MT task has to offer UTF-8 character support for the various language-specific special characters and high error recovery, avoiding crashes due to noisy input data (wrong characters, misplaced carriage returns, etc.); see the sketch following this list.
  • Functional: The functional integration demands the swift delivery of translations of queries and sub-queries. Ideally, the system should offer a number of translation candidates from which the user can choose the best fit. A combination with Computer Assisted Translation (CAT) services is possible by displaying PoS tags, word segmentation (for Japanese and Chinese), and relevant sentences from the training corpus; this is not a functionality of the MT system as such, but can be integrated with low overhead.
  • Usability: The MT service should be available to the user at all times within the process of linguistic analysis; hence speed and ease of use are key. A mark/copy/paste functionality within the working environment is vital to allow for convenient usage. The graphical display of the results, i.e. the translation candidates mentioned in the two points above, has to be integrated in such a way that it does not clutter the workspace but offers the most crucial results.
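The technological requirement calls for robust handling of noisy, multilingual input. The following sketch (illustrative only, not part of the TES4IP implementation; the function name is hypothetical) shows one way to normalize query text to clean UTF-8 and strip stray control characters before it is passed to the MT service:

import unicodedata

def sanitize_query(text: str) -> str:
    """Normalize a query to clean UTF-8 and remove stray control characters."""
    # Canonical Unicode normalization so language-specific special characters
    # are represented consistently.
    text = unicodedata.normalize("NFC", text)
    # Replace control characters (e.g. misplaced carriage returns) with spaces.
    text = "".join(" " if unicodedata.category(ch).startswith("C") else ch
                   for ch in text)
    # Collapse the whitespace runs left over from the cleanup.
    return " ".join(text.split())

print(sanitize_query("Gebrauchsmuster\r\n  für   ein   Verfahren"))
# Gebrauchsmuster für ein Verfahren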
[10]
In order to fulfill these criteria, the translation service is kept as simple as possible. It receives XML-RPC requests, i.e. the sentences which are to be translated, and sends back the results, i.e. the translation candidates. This architecture allows for a seamless integration with arbitrary systems, provided the interface requirements are adhered to.
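The sketch below illustrates how an arbitrary client can talk to such a service; it assumes a Moses XML-RPC server reachable at the given URL (host, port, and path are placeholders, not the actual TES4IP configuration):

import xmlrpc.client

SERVER_URL = "http://localhost:8080/RPC2"  # hypothetical endpoint

def translate(sentence: str) -> str:
    proxy = xmlrpc.client.ServerProxy(SERVER_URL)
    # The Moses server accepts a struct containing the source text and returns
    # a struct whose "text" field holds the translation.
    result = proxy.translate({"text": sentence})
    return result["text"]

print(translate("Das Europäische Patentamt hat die Anmeldung geprüft."))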

2.

Language Analysis ^

[11]
NLP analysis of textual resources and their structural annotation can be utilized for a wide variety of semantic analysis scenarios. The annotation includes lemmatization, chunking, dependency trees, named entity recognition, part-of-speech tagging, morphological information, and word sense disambiguation. This linguistic information enables a deep semantic understanding of the textual data and therefore significant query refinement and semantic search. Consequently, the domain of the search is not only defined by pure textual information, such as individual keywords and conjunctions of words or phrases, but also by its underlying context and meaning. Furthermore, the semantic classification of the textual resources, even before the deep query process, avoids unnecessary searches through large amounts of textual data, which increases the performance of the query.
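As an illustration of such annotation (not part of the TES4IP pipeline itself), the following example uses the spaCy library to produce lemmas, part-of-speech tags, dependency relations, and named entities for an English sentence; the model "en_core_web_sm" must be installed separately:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The applicant filed a patent with the European Patent Office in 2012.")

# Lemma, part-of-speech tag, and dependency relation for each token.
for token in doc:
    print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} {token.dep_}")

# Named entities recognized in the sentence, e.g. organizations and dates.
for ent in doc.ents:
    print(ent.text, ent.label_)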

3.

Applications for Mobile Devices ^

[12]
The integration of the service for mobile devices requires careful planning in architecture and design. The usability depends on many factors, such as the limited screen space of smartphones and varying connection speeds in Wi-Fi and/or 3G networks. The user interface can be implemented as a website, as a native application on the device, or as a hybrid solution. The hybrid approach has recently gained more and more popularity due to its advantages over purely web-based and native solutions. Illustration 2 gives an overview of the advantages in the categories that are crucial for a mobile application.

Illustration 2: Advantages of hybrid development on mobile devices (http://www.scribd.com/doc/50805466/Native-Web-or-Hybrid-Mobile-App-Development)

[13]
In order for the mobile user interface to be user-friendly and accessible, it has to be kept as simple and ergonomic as possible. The architecture of the communication takes into account the network and computational limitations of mobile devices: the mobile device merely sends the request to the server, where all computations are performed, and receives the result to be displayed to the user.
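A minimal sketch of this thin-client pattern is given below: a hypothetical HTTP relay endpoint (implemented here with Flask; all URLs and field names are illustrative assumptions) accepts the query from the mobile device, delegates the computation to the XML-RPC translation service, and returns a compact JSON result for display:

import xmlrpc.client
from flask import Flask, jsonify, request

app = Flask(__name__)
MT_SERVICE = xmlrpc.client.ServerProxy("http://localhost:8080/RPC2")  # hypothetical

@app.route("/translate", methods=["POST"])
def translate():
    sentence = request.get_json(force=True).get("text", "")
    # All computation happens on the server side; the device only sends the
    # request and displays the returned translation.
    result = MT_SERVICE.translate({"text": sentence})
    return jsonify({"translation": result["text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)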

4.

Conclusion ^

[14]
In this report we have summarized our contribution to the FFG 10th call project TES4IP. It involves the implementation of a translation service for both desktop and mobile applications, as well as a basis for semantic inference based on language structure analysis. The implementation of the system is generic and hence data-independent; therefore, the domain of the application can be changed depending on the particular needs, e.g. translation, language analysis, and text mining in legal informatics.

5.

Literature ^

Christian Boitet, Herve Blanchon, Mark Seligman, Valerie Bellynck: Evolution of MT with the Web. In: Proceedings of the Conference «Machine Translation 25 Years On», Cranfield, England (2009).

Peter Brown et al.: A statistical approach to machine translation. In: Computational Linguistics 16(2), pp. 79–85, MIT Press (1990).

Philipp Koehn et al.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the ACL 2007 Demo and Poster Sessions, pp. 177–180, Association for Computational Linguistics (2007).

Bartholomäus Wloka, Werner Winiwarter: Enhancing Language Learning and Translation with Ubiquitous Applications. In: Proceedings of the International Conference on Advances in Mobile Computing and Multimedia, pp. 203–210, ACM (2010).

Bartholomäus Wloka, Werner Winiwarter: TREF – TRanslation Enhancement Framework for Japanese-English. In: Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 541–546 (2010).

Bartholomäus Wloka, doctoral candidate, Österreichische Akademie der Wissenschaften, Institut für Corpuslinguistik und Texttechnologie.

 

Gerhard Budin, Professor, Universität Wien, Zentrum für Translationswissenschaften.

 

Werner Winiwarter, Professor, Universität Wien, Forschungsgruppe Data Analytics and Computing.