Implementation of the Centroid Decomposition Algorithm on Big Data Platforms — Apache Spark vs. Apache Flink
| Content Provider | Semantic Scholar |
|---|---|
| Author | Liu, Qian |
| Copyright Year | 2016 |
| Abstract | The Centroid Decomposition (CD) algorithm is an approximation of the Singular Value Decomposition (SVD), one of the most widely used matrix decomposition techniques for real-world data analysis tasks. The CD algorithm is based on a greedy algorithm, termed the Scalable Sign Vector (SSV) algorithm, that efficiently determines vectors whose elements are 1s and -1s, called sign vectors. The CD algorithm is generally applied to data analysis tasks that involve long time series, i.e., where the number of rows (observations) is much larger than the number of columns (time series). The goal of this thesis is to implement the CD algorithm on two Big Data platforms, Apache Spark and Apache Flink. The implementation compares two different data structures on both platforms. The first, the per-element data structure, transforms the matrix in a distributed fashion one element at a time. The second, the per-vector data structure, performs every transformation on whole row or column vectors. We empirically evaluate the efficiency of the non-streamed Spark and Flink CD implementations. To simulate streams of time series, we use Apache Kafka to periodically produce new matrix data to a broker, and Spark Streaming and Flink Data Streaming to regularly fetch the data and run the CD algorithm. |
| File Format | PDF, HTM/HTML |
| Alternate Webpage(s) | https://exascale.info/assets/pdf/students/2016-Qian_CD_Flink-vs-Spark.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
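
The abstract describes the core of CD as the SSV step: greedily choosing a sign vector z in {-1, +1}^n that maximizes ||X^T z||. As a minimal, single-machine sketch of that greedy iteration, here is an illustration in Scala (the language commonly used for Spark and Flink jobs). It follows the published SSV idea rather than the thesis' distributed per-element or per-vector implementations, which are not available in this record; the object and method names are placeholders.

```scala
// A minimal, single-machine sketch of the SSV (sign vector) step of Centroid
// Decomposition. Illustration only: the thesis' distributed per-element and
// per-vector Spark/Flink implementations are not reproduced here.
object SsvSketch {
  // Greedily finds z in {-1, +1}^n that locally maximizes ||X^T z||:
  // flip the single sign whose flip yields the largest gain, and repeat
  // until no flip improves the objective.
  def signVector(x: Array[Array[Double]]): Array[Int] = {
    require(x.nonEmpty && x(0).nonEmpty, "matrix must be non-empty")
    val n = x.length
    val m = x(0).length
    val z = Array.fill(n)(1)
    var changed = true
    while (changed) {
      changed = false
      // s = X^T z, accumulated row by row
      val s = new Array[Double](m)
      for (i <- 0 until n; j <- 0 until m) s(j) += z(i) * x(i)(j)
      // v_i = x_i . s - z_i * ||x_i||^2; flipping z_i increases the
      // objective exactly when z_i * v_i < 0, with gain 4 * |v_i|
      var best = -1
      var bestGain = 0.0
      for (i <- 0 until n) {
        val dot   = (0 until m).map(j => x(i)(j) * s(j)).sum
        val norm2 = x(i).map(e => e * e).sum
        val v = dot - z(i) * norm2
        if (z(i) * v < 0 && math.abs(v) > bestGain) {
          best = i
          bestGain = math.abs(v)
        }
      }
      if (best >= 0) {
        z(best) = -z(best)
        changed = true
      }
    }
    z
  }

  def main(args: Array[String]): Unit = {
    val x = Array(
      Array(1.0, 2.0),
      Array(-3.0, 1.0),
      Array(2.0, -1.0)
    )
    println(signVector(x).mkString("z = [", ", ", "]"))
  }
}
```

Each flip of z(i) changes ||X^T z||^2 by -4 * z(i) * v(i), so flipping the index with the largest |v(i)| among those with z(i) * v(i) < 0 gives the largest improvement, and the loop terminates once no flip helps. A distributed per-vector variant in the spirit of the abstract would compute s and the v(i) values with row-wise map/reduce steps over a Spark RDD or a Flink DataSet instead of local loops.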