Improving Researcher Homepage Classification with Unlabeled Data

Content Provider ACM Digital Library
Author Gollapalli, Sujatha Das; Caragea, Cornelia; Mitra, Prasenjit; Giles, C. Lee
Copyright Year 2015
Abstract A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changing content on the Web? We investigate this question in the context of identifying researcher homepages. We show experimentally that classifiers trained on existing datasets of academic homepages underperform on “non-homepages” present on current-day academic websites. As an alternative to obtaining labeled datasets to retrain classifiers for the new content, in this article we ask the following question: “How can we effectively use the unlabeled data readily available from academic websites to improve researcher homepage classification?” We design novel URL-based features and use them in conjunction with content-based features for representing homepages. Within the co-training framework, these sets of features can be treated as complementary views enabling us to effectively use unlabeled data and obtain remarkable improvements in homepage identification on the current-day academic websites. We also propose a novel technique for “learning a conforming pair of classifiers” that mimics co-training. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We argue that this loss formulation provides insights for understanding co-training and can be used even in the absence of a validation dataset. Our next set of findings pertains to the evaluation of other state-of-the-art techniques for classifying homepages. First, we apply feature selection (FS) and feature hashing (FH) techniques independently and in conjunction with co-training to academic homepages. FS is a well-known technique for removing redundant and unnecessary features from the data representation, whereas FH is a technique that uses hash functions for efficient encoding of features. We show that FS can be effectively combined with co-training to obtain further improvements in identifying homepages. However, using hashed feature representations, a performance degradation is observed possibly due to feature collisions. Finally, we evaluate other semisupervised algorithms for homepage classification. We show that although several algorithms are effective in using information from the unlabeled instances, co-training that explicitly harnesses the feature split in the underlying instances outperforms approaches that combine content and URL features into a single view.
Starting Page 1
Ending Page 32
Page Count 32
File Format PDF
ISSN 1559-1131
e-ISSN 1559-114X
DOI 10.1145/2767135
Volume Number 9
Issue Number 4
Journal ACM Transactions on the Web (TWEB)
Language English
Publisher Association for Computing Machinery (ACM)
Publisher Date 2015-10-19
Publisher Place New York
Access Restriction One Nation One Subscription (ONOS)
Subject Keyword Researcher homepage classification; Co-training; Conforming classifiers; Unlabeled data
Content Type Text
Resource Type Article
Subject Computer Networks and Communications
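
The co-training idea summarized in the abstract (two complementary views, URL features and content features, each training a classifier that labels unlabeled pages for the shared training set) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny naive Bayes model, the function names (`train_nb`, `log_odds`, `co_train`, `predict`), the `rounds` and `per_round` parameters, and the toy tokens are all assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter


def train_nb(examples):
    """Fit a tiny multinomial naive Bayes model on (tokens, label) pairs."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter()
    for tokens, label in examples:
        counts[label].update(tokens)
        priors[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, priors, vocab


def log_odds(model, tokens):
    """Log-odds of class 1 (homepage) vs class 0, with add-one smoothing."""
    counts, priors, vocab = model
    score = math.log((priors[1] + 1) / (priors[0] + 1))
    for label, sign in ((1, 1.0), (0, -1.0)):
        # The extra +1 reserves probability mass for unseen tokens.
        total = sum(counts[label].values()) + len(vocab) + 1
        for tok in tokens:
            score += sign * math.log((counts[label][tok] + 1) / total)
    return score


def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """Co-training over two views.

    labeled:   list of ((url_tokens, content_tokens), label)
    unlabeled: list of (url_tokens, content_tokens)
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # One classifier per view, trained on the current labeled set.
        models = [train_nb([(x[v], y) for x, y in labeled]) for v in (0, 1)]
        for v in (0, 1):
            # The view-v classifier labels the pages it is most sure about;
            # those pages (with both views) join the shared labeled set.
            pool.sort(key=lambda x: abs(log_odds(models[v], x[v])), reverse=True)
            for x in pool[:per_round]:
                labeled.append((x, int(log_odds(models[v], x[v]) > 0)))
            pool = pool[per_round:]
    return [train_nb([(x[v], y) for x, y in labeled]) for v in (0, 1)]


def predict(models, x):
    """Combine both views by summing their log-odds."""
    return int(sum(log_odds(m, x[v]) for v, m in enumerate(models)) > 0)


# Hypothetical toy data: view 0 = URL tokens, view 1 = page-content tokens.
labeled = [
    ((["people", "jsmith"], ["research", "publications", "teaching"]), 1),
    ((["news", "2015"], ["seminar", "announcement", "schedule"]), 0),
]
unlabeled = [
    (["people", "adoe"], ["publications", "students", "research"]),
    (["news", "events"], ["seminar", "registration", "schedule"]),
    (["faculty", "adoe"], ["publications", "students", "cv"]),
    (["events", "calendar"], ["registration", "deadline", "announcement"]),
]

models = co_train(labeled, unlabeled)
print(predict(models, (["people", "xlee"], ["research", "cv"])))
print(predict(models, (["news", "calendar"], ["seminar", "deadline"])))
```

In this sketch each view's classifier adds its single most confident unlabeled page per round, so evidence found in the URL view can teach the content view and vice versa. A practical system would typically use stronger classifiers, balanced per-class selection, and a confidence threshold.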