Toronto Networking Seminar

Organized by the Department of Computer Science and the Department of Electrical and Computer Engineering, University of Toronto


 

Real Time Memory Efficient Data Redundancy Removal

 

 

Vikas Kumar Garg

IBM Research, India

 

 

Date: Monday, 25-OCT-10, 4:10pm

Room: BA 3116


Abstract:

Data-intensive computing has become a central theme in the research community and in industry. There is an ever-growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), climate warning systems, etc. Removing redundancy in the data is an important problem because it improves resource and compute efficiency for downstream processing of massive (1 billion to 10 billion records) datasets. In application domains such as IR, stock markets, telecom and others, there is a strong need for real-time data redundancy removal (referred to as DRR) on enormous amounts of data flowing at a rate of 1 GB/s or more. Real-time, scalable data redundancy removal on massive datasets is a challenging problem. We present the design of a novel parallel data redundancy removal algorithm for both in-memory and disk-based execution. We also develop a queueing-theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500 million records, our parallel algorithm can perform complete de-duplication in 255 seconds on a 16-core Intel Xeon 5570 architecture, with in-memory execution. This gives a throughput of about 2M records/s. For 6 billion records, our parallel algorithm can perform complete de-duplication in less than 4.5 hours, using 6 cores of an Intel Xeon 5570, with disk-based execution. This gives a throughput of around 370K records/s. To the best of our knowledge, this is the highest real-time throughput for data redundancy removal on such massive datasets. We also demonstrate the scalability of our algorithm with an increasing number of cores and a growing volume of data.
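The two throughput figures quoted in the abstract follow directly from the record counts and running times it states. The short Python sketch below only redoes that arithmetic; the function name `throughput` is a hypothetical helper introduced here for illustration and is not part of the speaker's work. Because the disk-based run is reported as taking "less than 4.5 hours", the second figure is a lower bound.

```python
def throughput(records: int, seconds: float) -> float:
    """Return records processed per second (hypothetical helper)."""
    return records / seconds


# In-memory run: 500 million records in 255 seconds.
in_memory = throughput(500_000_000, 255)

# Disk-based run: 6 billion records in at most 4.5 hours.
disk_based = throughput(6_000_000_000, 4.5 * 3600)

print(f"in-memory: {in_memory / 1e6:.2f}M records/s")
print(f"disk-based (lower bound): {disk_based / 1e3:.0f}K records/s")
```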


 

Bio:

Vikas Kumar Garg received his B.E. in Information Technology from the Netaji Subhas Institute of Technology (NSIT)/Delhi College of Engineering (DCE), University of Delhi, Delhi in 2006, and his M.E. in Computer Science and Engineering from the Indian Institute of Science (IISc), Bangalore in 2009. He is currently working for the Next Generation Systems and Smarter Planet Solutions group at IBM Research - India. Prior to joining IBM Research, he was a Researcher with the GLAMS center of excellence at the Indian School of Business (ISB), Hyderabad. His research interests include statistical machine learning and its applications in bioinformatics, networks, information retrieval and stream databases. He is also interested in knowledge discovery, data mining and high-performance computing.


 

Host of Talk:

Jorg Liebeherr (jorg@comm.utoronto.ca)