NDLI: Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts

Content Provider	IEEE Xplore Digital Library
Author	Lipka, N. Stein, B. Shanahan, J.G.
Copyright Year	2012
Abstract	Automated text classification is one of the most important learning technologies to fight information overload. However, the information society is confronted not only with an information flood but also with an increase in "information volatility", by which we mean that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among other places, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution, we partition a test sample by means of clustering. Then, using repeated resampling with different margin distributions over the clustering, the effectiveness characteristics are studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering, both theoretically and on standard text corpora. An important result is a heuristic for expected effectiveness estimation that is based solely on the initial test sample and that can be computed without resampling.
Starting Page	972
Ending Page	977
File Size	278034
Page Count	6
File Format	PDF
ISBN	9781467346498
ISSN	15504786
DOI	10.1109/ICDM.2012.89
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2012-12-10
Publisher Place	Belgium
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Vectors; Estimation; Standards; Clustering algorithms; Mathematical model; Media; Machine learning; clustering; Classification; Concept Drift; unknown distributions; Model Selection
Content Type	Text
Resource Type	Article
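
The resampling procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the margin distributions are drawn or how the lower bound is defined, so the sketch assumes cluster proportions drawn uniformly from the simplex (a Dirichlet with all parameters equal to 1), accuracy as the effectiveness measure, and a one-sided normal bound of the form mean minus z times the standard deviation. The names `resampled_effectiveness`, `lower_bound`, `cluster_mean_effectiveness`, and all parameters are hypothetical.

```python
import numpy as np


def resampled_effectiveness(correct, clusters, n_rounds=1000, sample_size=None, seed=0):
    """Accuracy of a fixed classifier under randomly shifted cluster proportions.

    correct  : 1 where the classifier was right on a test document, 0 otherwise
    clusters : cluster id of each test document (from any clustering method)
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    clusters = np.asarray(clusters)
    ids = np.unique(clusters)
    members = [np.flatnonzero(clusters == c) for c in ids]
    n = sample_size or len(correct)

    scores = np.empty(n_rounds)
    for r in range(n_rounds):
        # Assumption: each round draws cluster proportions uniformly from the simplex.
        weights = rng.dirichlet(np.ones(len(ids)))
        counts = rng.multinomial(n, weights)
        # Resample documents with replacement inside each cluster.
        picked = np.concatenate(
            [rng.choice(m, size=k, replace=True) for m, k in zip(members, counts)]
        )
        scores[r] = correct[picked].mean()
    return scores


def lower_bound(scores, z=1.645):
    """Assumed one-sided bound: mean minus z standard deviations (z=1.645 is about 95%)."""
    return scores.mean() - z * scores.std(ddof=1)


def cluster_mean_effectiveness(correct, clusters):
    """Unweighted mean of the per-cluster accuracies; needs no resampling."""
    correct = np.asarray(correct, dtype=float)
    clusters = np.asarray(clusters)
    return float(np.mean([correct[clusters == c].mean() for c in np.unique(clusters)]))
```

Under these assumptions, candidate classifiers could be compared by `lower_bound(resampled_effectiveness(correct, clusters))` rather than by plain test accuracy, and `cluster_mean_effectiveness` gives a resampling-free quantity to compare against the resampled mean.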

Similar Documents

Estimating the expected effectiveness of text classification solutions under subclass distribution shifts.

Adaptive Supervised Learning Model for Training Set Selection under Concept Drift Data Streams

Diagnosing a priori unknown faults by modified supervised-unsupervised learning algorithm

Distribution mixtures, a reduced-bias estimation algorithm

Analysis of classification learning based on estimation of distribution algorithms

Meta-learning, model selection, and example selection in machine learning domains with concept drift (2005).

Classification with a reject option under Concept Drift: The Droplets algorithm

Hierarchical Stability-Based Model Selection for Clustering Algorithms

A new clustering technique for the identification of PWARX hybrid models

Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts