NDLI: Automatic detection of spelling variation in historical corpus: An application to build a Brazilian Portuguese spelling variants dictionary

Content Provider	Semantic Scholar
Author	Giusti, Rafael; Candido, Arnaldo; Muniz, Marcelo; Cucatto, Livia; Aluísio, Sandra M.
Copyright Year	2007
Abstract	The Historical Dictionary of Brazilian Portuguese (HDBP), the first of its kind, is based on a corpus of Brazilian Portuguese (BP) texts from the sixteenth through the eighteenth centuries, plus some texts from the beginning of the nineteenth century. It is being developed under the sponsorship of the Brazilian funding agency CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico). The three-year project started in 2006 and aims to fill a gap in Brazilian culture with a dictionary describing the vocabulary of Brazilian Portuguese from the beginning of the country’s history. The corpus totals more than 3,000 texts with approximately 7.5 million words. Our working corpus, i.e. the portion already processed by the corpus processing system UNITEX (http://www-igm.univ-mlv.fr/~unitex/), is encoded in Unicode (UTF-16) and totals 1,733 texts, 57.1 MB, and 4.9 million words. One difficulty in using historical corpora for lexicographic tasks is identifying all spelling variants of a given word, since spelling variation distorts frequency counts, a usual criterion for selecting dictionary entries. In our project, a further challenge is to select all variants of a dictionary entry that occur in the corpus, both to illustrate the absence of an orthographic system in those centuries and to provide example sentences for them. This paper introduces an approach based on transformation rules that clusters distinct spelling variants around a common form, which is not always the orthographic (or modern) form, and describes the choices made in building a dictionary of BP spelling variants from these clusters. Currently, we have forty-three manually developed rules, which generated 12,189 clusters of spelling variants, totalling 27,199 variants from our working corpus. After a careful analysis of these clusters, we adopted the DELA format to build our dictionary.
The BP dictionary of spelling variants enables sophisticated searches in the historical corpus using UNITEX, supporting the construction of the main dictionary of the HDBP project. Moreover, the variants of a given word can be searched using Dicionario, an application we developed to display dictionaries in DELA format. Since we also use Philologic (http://philologic.uchicago.edu/index.php) to support the building of the HDBP, we carried out a comparative evaluation between our approach to clustering spelling variants and AGREP (http://www.tgries.de/agrep/), which is used in Philologic to check for similar or alternative spellings.
Affiliation	University of Sao Paulo, NILC, CP 668, 13560-970, Sao Carlos/SP, Brazil; e-mail: rg@grad.icmc.usp.br, arnaldoc@icmc.usp.br, marcelo.muniz@gmail.com, liviacucatto@yahoo.com.br, sandra@icmc.usp.br
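
The abstract describes clustering spelling variants around a common form by applying transformation rules, and notes that unclustered variants distort frequency counts. The following Python fragment is only a minimal sketch of that general idea. The four rules, the function names, and the sample word counts are hypothetical illustrations and are not the forty-three rules or the data from the paper.

```python
import re
from collections import defaultdict

# Hypothetical rewrite rules (pattern, replacement), applied in order.
# These are illustrative only; they are NOT the paper's rule set.
RULES = [
    (re.compile(r"ph"), "f"),       # ph -> f
    (re.compile(r"th"), "t"),       # th -> t
    (re.compile(r"y"), "i"),        # y -> i
    (re.compile(r"(.)\1"), r"\1"),  # collapse any doubled letter
]


def normalize(word):
    """Map a word to a common form by applying every rule in sequence."""
    form = word.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def cluster_variants(counts):
    """Group words sharing a common form; keep groups with 2+ variants."""
    clusters = defaultdict(dict)
    for word, frequency in counts.items():
        clusters[normalize(word)][word] = frequency
    return {key: group for key, group in clusters.items() if len(group) > 1}


def pooled_frequency(cluster):
    """Sum the frequencies of all variants in one cluster."""
    return sum(cluster.values())


# Made-up word counts, assumed for illustration (not taken from the corpus).
SAMPLE_COUNTS = {
    "pharmacia": 3, "farmacia": 5,
    "elle": 7, "ele": 20,
    "rey": 4, "rei": 9,
    "casa": 12,
}

if __name__ == "__main__":
    for common_form, group in cluster_variants(SAMPLE_COUNTS).items():
        print(common_form, sorted(group), pooled_frequency(group))
```

Keying each cluster by the normalized form means the common form need not be a modern spelling, which matches the abstract's remark that the cluster centre is not always the orthographic form. Pooling the counts per cluster is one way to correct the frequency distortion mentioned above.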
File Format	PDF HTM / HTML
Alternate Webpage(s)	http://ucrel.lancs.ac.uk/publications/cl2007/paper/238_Paper.pdf
Alternate Webpage(s)	http://www.researchgate.net/profile/Sandra_Aluisio/publication/228527552_Automatic_detection_of_spelling_variation_in_historical_corpus_An_application_to_build_a_Brazilian_Portuguese_spelling_variants_dictionary/links/02bfe510834f764be1000000.pdf
Alternate Webpage(s)	http://www.nilc.icmc.usp.br/nilc/download/corpus_linguistics_2007.pdf
Alternate Webpage(s)	http://www.nilc.icmc.usp.br/nilc/projects/hpc/cl_2007/cl_2007_presentation.pdf
Language	English
Access Restriction	Open
Content Type	Text
Resource Type	Article
