NDLI: Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study

Content Provider	Springer Nature Link
Author	Magdy, Walid; Jones, Gareth J. F.
Copyright Year	2013
Abstract	Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Development of effective MT systems of this type requires large training resources and high computational power for training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, meaning that query translation using MT systems can be very slow. However, in contrast to MT, the focus for information retrieval (IR) is on the conceptual meaning of the search words regardless of their surface form or the linguistic structure of the output. Thus much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step leads to a significant decrease in the computational and training resource requirements of MT. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains. The new MT technique also enables patent document translation in a practical amount of time, with a resulting significant improvement in retrieval effectiveness.
Starting Page	492
Ending Page	519
Page Count	28
File Format	PDF
ISSN	1386-4564
Journal	Information Retrieval
Volume Number	17
Issue Number	5-6
e-ISSN	1573-7659
Language	English
Publisher	Springer Netherlands
Publisher Date	2013-11-21
Publisher Place	Dordrecht
Access Restriction	Subscribed
Subject Keyword	Cross-language patent retrieval; Prior-art; Patent search; Cross-language information retrieval; Large-data CLIR; Machine translation; Information Storage and Retrieval; Document Preparation and Text Processing; Data Mining and Knowledge Discovery; Data Structures, Cryptology and Information Theory; Pattern Recognition
Content Type	Text
Resource Type	Article
Subject	Library and Information Sciences; Information Systems
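
The pre-processing step described in the abstract (stop word removal and stemming applied to the MT training corpus before training) can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the stop word list, the crude suffix-stripping stemmer, and the names `STOP_WORDS`, `toy_stem`, `ir_preprocess`, and `preprocess_corpus` are all hypothetical. A real system would presumably use a full stop word list and a proper stemmer for each language side of the parallel corpus.

```python
import re

# Assumption: a tiny illustrative English stop word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "for", "with", "by", "on"}

# Assumption: crude suffix stripping stands in for a real stemmer.
# Longer suffixes are listed first so they are tried before shorter ones.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ir_preprocess(sentence):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return " ".join(toy_stem(t) for t in tokens if t not in STOP_WORDS)


def preprocess_corpus(lines):
    """Pre-process one language side of a training corpus.

    Every input line yields exactly one output line (possibly empty), so the
    line alignment with the other side of a parallel corpus is preserved.
    """
    return [ir_preprocess(line) for line in lines]


if __name__ == "__main__":
    print(ir_preprocess("The coating is applied to the heated surfaces"))
    # prints: coat appli heat surfac
```

The processed lines are shorter and draw on a smaller vocabulary than the raw text, which is the property the abstract credits for the reduced training and translation cost. Queries would need the same pre-processing before translation so that they match the form of the training data.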

Similar Documents

Using multiple query representations in patent prior-art search

Multilayer source selection as a tool for supporting patent search and classification

Searching strategies for the Bulgarian language

Leveraging comparable corpora for cross-lingual information retrieval in resource-lean language pairs

Wikipedia-based query phrase expansion in patent class search

Mining subtopics from different aspects for diversifying search results

Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora

An efficient method for using machine translation technologies in cross-language patent search

Special issue of The Journal of Information Retrieval on web mining for search