The Center for Language and Speech Technologies and Applications (TALP) of the Universitat Politècnica de Catalunya · BarcelonaTech (UPC), a member of CIT UPC, has developed a prototype of an automatic translation system for patents in the biomedical area. The system can be used to create multilingual documents with the same structure as the original patents, including images, formulae and other kinds of annotations. In addition, the system works in real time and can be incorporated into web applications.
The TALP UPC researchers’ work, carried out over three years, is part of a collaborative project called MOLTO within the European Union’s Seventh Framework Programme. MOLTO has involved research groups in Göteborg (Sweden), Helsinki (Finland), Utrecht (the Netherlands), Sofia (Bulgaria) and Zurich (Switzerland).
With the general objective of building an automatic translation system that produces high-quality translations in several languages, MOLTO researchers have worked on three cases: the wording of mathematical exercises, descriptions of museum objects and a model for translating patents, which is the case that TALP members have worked on directly.
As a general technique in the MOLTO project, the researchers used syntactic-semantic grammars created on the basis of domain-specific ontologies (conceptual schemes that facilitate information exchange between systems). These grammars have in turn been written in Grammatical Framework (GF), the software tool that makes automatic translation between different languages possible through a common abstract representation. To facilitate its use online, an Application Programming Interface (API) has been designed so that the tool can be included in any web application.
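The idea of translating through a common abstract representation can be illustrated with a toy sketch in Python. This is not Grammatical Framework code and does not use its API; the names `CONCRETE`, `linearize`, `parse` and `translate`, as well as the three-word vocabulary, are assumptions made up for the illustration. Each language only defines how to render the abstract functions, so a sentence parsed into an abstract tree in one language can be rendered in any other.

```python
# Toy illustration only: this is NOT Grammatical Framework code or its API.
# An abstract tree is a nested tuple: (function_name, *arguments).

# Hypothetical concrete grammars: one rendering template per abstract function.
CONCRETE = {
    "eng": {"Treats": "{0} treats {1}", "Aspirin": "aspirin", "Fever": "fever"},
    "fre": {"Treats": "{0} traite {1}", "Aspirin": "l'aspirine", "Fever": "la fièvre"},
    "ger": {"Treats": "{0} behandelt {1}", "Aspirin": "Aspirin", "Fever": "Fieber"},
}


def linearize(tree, lang):
    """Render an abstract tree as a string in the given language."""
    name, *args = tree
    rendered = [linearize(arg, lang) for arg in args]
    return CONCRETE[lang][name].format(*rendered)


def parse(sentence, lang, candidates):
    """Return the first candidate tree whose rendering equals the sentence.

    A real grammar formalism parses properly; this brute-force search only
    keeps the sketch short.
    """
    for tree in candidates:
        if linearize(tree, lang) == sentence:
            return tree
    return None


def translate(sentence, source, target, candidates):
    """Translate by parsing to the shared abstract tree, then rendering it."""
    tree = parse(sentence, source, candidates)
    return None if tree is None else linearize(tree, target)


tree = ("Treats", ("Aspirin",), ("Fever",))
print(translate("aspirin treats fever", "eng", "ger", [tree]))
```

Because every language is defined against the same abstract tree, adding a language in this sketch means adding one more entry to `CONCRETE` rather than one translator per language pair.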
For the patent translation, the researchers used hybrid techniques that combine the Grammatical Framework with statistical methods. GF produces grammatically correct translations, whilst statistical techniques (similar to those used by machine translation services such as Google Translate) make it possible to cover extensive domains such as biomedicine.
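One simple way to picture such a combination is a fallback scheme: use the precise grammar wherever it covers the input and a broad-coverage statistical component elsewhere. The sketch below is only an assumption about how a hybrid could be arranged, not a description of the TALP system; both translators are hypothetical stand-ins, and the statistical one merely tags its input instead of translating it.

```python
# Sketch only: both translators below are hypothetical stand-ins, not real systems.

# Small, precise coverage (stands in for the grammar-based component).
GRAMMAR_PHRASES = {
    "pharmaceutical composition": "composition pharmaceutique",
}


def grammar_translate(phrase):
    """Return a translation if the grammar covers the phrase, else None."""
    return GRAMMAR_PHRASES.get(phrase)


def statistical_translate(phrase):
    """Stand-in for a broad-coverage statistical model; it only tags the input."""
    return f"<smt:{phrase}>"


def hybrid_translate(phrases):
    """Prefer the grammar for each phrase; fall back to the statistical stand-in."""
    output = []
    for phrase in phrases:
        translated = grammar_translate(phrase)
        if translated is None:
            translated = statistical_translate(phrase)
        output.append(translated)
    return output


print(hybrid_translate(["pharmaceutical composition", "comprising"]))
```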
In addition, the patents are part of a document retrieval system that initially could only search for documents in English. Special care has therefore been taken to create a method that preserves the complex arrangement of tags and semantic annotations present in the documents. Among other things, this means that the structure of chemical compounds described in biotechnology records is maintained, and documents can be searched for in the language of the translation.
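A minimal sketch of tag-preserving translation, assuming a simple angle-bracket markup: the document is split into tags and text segments, only the text segments are passed to a translator, and the pieces are joined back in their original order. The function name `translate_preserving_tags`, the `<claim>` and `<chem>` tags and the lookup-based translator are all hypothetical and chosen only for the example; the markup and method of the actual system are not described here.

```python
import re

# The capture group makes re.split keep the tags in the resulting list.
TAG = re.compile(r"(<[^>]+>)")


def translate_preserving_tags(markup, translate_text):
    """Translate only the text between tags, leaving every tag untouched."""
    pieces = []
    for part in TAG.split(markup):
        if TAG.fullmatch(part) or not part.strip():
            pieces.append(part)  # a tag, or empty/whitespace-only text
        else:
            pieces.append(translate_text(part))
    return "".join(pieces)


def toy_translator(text):
    """Hypothetical stand-in for a real translation engine."""
    return text.replace("a compound", "un composé")


source = '<claim id="1">a compound <chem>C9H8O4</chem></claim>'
print(translate_preserving_tags(source, toy_translator))
```

Translating each segment in isolation loses sentence context that spans a tag, so a production system would need a more careful alignment between source and target annotations than this sketch provides.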
The result is the automatic translation of patents into English, French and German (the three official languages of the European Patent Office), with the added advantage that the translations can be carried out in real time. This is particularly useful for multilingual database searches.