Date of Award
College of Information Technology and Engineering
Type of Degree
Jamil M. Chaudri
YouTube is currently the most popular and successful video-sharing website. Because of its broad social impact, YouTube analytics has become an active research area, and the videos on YouTube are a rich source of data. Gaining access to YouTube data at scale, however, is a challenge. Previous studies have analyzed only very small volumes of YouTube video data, and to date no mechanism exists to systematically and continuously collect, process, and store this data. This thesis presents a methodology for systematically and continuously mining and storing YouTube data. The methodology has two modules: video discovery and video metadata collection. YouTube provides an API for issuing search requests analogous to the searches a user performs on the YouTube website. However, the API's 'search' operation was not designed to return large volumes of data; it returns only a limited set of search results (metadata), sized for a human to handle. The proposed discovery process makes search scalable, robust, and fast by (a) taking as input a few carefully selected video IDs (seeds) from each YouTube video category, to obtain wider coverage, and (b) using each seed to find related videos over multiple generations. The discovery process uses in-memory data management to suppress the explosion of redundant results and to find new videos rapidly. In addition, a batch-caching mechanism ensures that the high-velocity data generated by the discovery process does not exhaust memory, which increases the reliability of the methodology. The performance of the proposed methodology was measured over a period of two months, during which 16,000,000 videos were discovered and the complete metadata of more than 42,000 videos was mined.
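The discovery process described above (seed IDs, related videos followed over multiple generations, in-memory duplicate suppression, and batch caching) can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the thesis's implementation: `discover`, `fetch_related`, `cache_dir`, `max_generations`, and `batch_size` are hypothetical names, and `fetch_related` stands in for whatever API call returns the IDs of videos related to a given video. The abstract does not say how the thesis divides data between memory and disk; here the set of seen IDs stays in memory while newly discovered IDs are written to disk in fixed-size batches.

```python
import json
from collections import deque
from pathlib import Path


def discover(seeds, fetch_related, cache_dir, max_generations=3, batch_size=1000):
    """Breadth-first discovery of video IDs starting from seed IDs.

    fetch_related is a caller-supplied function (hypothetical here) that
    takes one video ID and returns an iterable of related video IDs.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    seen = set()        # in-memory set: each ID is recorded only once
    frontier = deque()  # (video_id, generation) pairs still to expand
    batch = []          # newly discovered IDs not yet written to disk
    batch_files = []    # paths of the batch files written so far

    def flush():
        # Write the pending batch to disk and empty the in-memory buffer.
        if not batch:
            return
        path = cache_dir / f"batch_{len(batch_files):06d}.json"
        path.write_text(json.dumps(batch))
        batch_files.append(path)
        batch.clear()

    def record(video_id, generation):
        # Skip IDs that were already discovered (redundancy suppression).
        if video_id in seen:
            return
        seen.add(video_id)
        frontier.append((video_id, generation))
        batch.append(video_id)
        if len(batch) >= batch_size:
            flush()

    for seed in seeds:
        record(seed, 0)

    while frontier:
        video_id, generation = frontier.popleft()
        if generation >= max_generations:
            continue  # do not expand beyond the last generation
        for related_id in fetch_related(video_id):
            record(related_id, generation + 1)

    flush()  # write any remaining partial batch
    return len(seen), batch_files
```

In this sketch a caller would pass one or more seed IDs per category and a `fetch_related` function wrapping the real API; the returned batch files could then feed a separate metadata collection step.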
The thesis also explores several possible extensions to the proposed framework. The two most prominent are (a) channel discovery: every YouTube user who has ever made a comment contributes to a channel, and a channel can hold hundreds of YouTube videos and their related metadata, so discovering channels can speed up video discovery by up to 100-fold; and (b) channel metadata collection: because the volume of videos in a channel is massive, a mechanism needs to be developed in which multiple machines run software agents that collaborate and communicate with each other to collect the metadata of billions of videos in a distributed fashion.
YouTube (Electronic resource)
Data mining -- Software.
Tian, Zifeng, "A Robust Framework for Mining YouTube Data" (2017). Theses, Dissertations and Capstones. 1129.