NDLI: A machine-learning approach to discovering company home pages

Content Provider	IEEE Xplore Digital Library
Author	Gryc, W.; Melville, P.; Lawrence, R.D.
Copyright Year	2010
Description	Author affiliation: Oxford Internet Institute, University of Oxford, UK OX1 3JS (Gryc, W.) || IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA (Melville, P.; Lawrence, R.D.)
Abstract	For many marketing and business applications, it is necessary to know the home page of a company specified only by its company name. If we require the home page for a small number of big companies, this task is readily accomplished via use of Internet search engines or access to domain registration lists. However, if the entities of interest are small companies, these approaches can lead to mismatches, particularly if a specified company lacks a home page. We address this problem using a supervised machine-learning approach in which we train a binary classification model. We classify potential website matches for each company name based on a set of explanatory features extracted from the content on each candidate website. Our approach is related to web-based business intelligence in two ways: (1) we build the training set for our learning algorithms through crowdsourcing tools and illustrate their potential for business research, and (2) the success of our model allows one to easily use corporate home pages as data inputs into other research projects. Through the successful use of crowdsourcing, our approach is able to identify a correct home page or recognize that a valid home page does not exist with an accuracy that is 57% better than simply taking the highest ranked search engine result as the correct match.
Starting Page	361
Ending Page	366
File Size	372496
Page Count	6
File Format	PDF
ISBN	9781424455515
ISSN	2150-4938
e-ISBN	9781424455539
DOI	10.1109/DEST.2010.5610621
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2010-04-13
Publisher Place	United Arab Emirates
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Training; Biological system modeling; Web pages; Companies; Search engines; Feature extraction; Logistics
Content Type	Text
Resource Type	Article
