EnTube: A dataset for YouTube video engagement analytics

YouTube, one of the largest online video-sharing platforms today, gives content creators a place to share information and earn extra income. Anticipating whether a video will engage viewers is essential for helping creators improve video content and quality before publishing. To facilitate this task, we build an annotated dataset of 23,738 videos collected from 72 YouTube channels in Vietnam, spanning four categories (comedy, travel-and-events, education, science-and-technology) and published over 12 years. We evaluate a number of metrics for measuring video engagement and propose a novel measure of a video's engagement: the Q score. Using our proposed measure, we annotate videos with three levels of engagement: Engage, Neutral, and Not Engage. From the annotated dataset, we construct a multimodal model to infer the degree of engagement from the content of a YouTube video, such as its title, audio, thumbnail, video, and tags. We believe our dataset and metric will be useful for engagement analytics as well as studies of social media content.

Metadata

Column	Description
channel_id	ID of the channel
channel_name	Name of the channel
channel_category	Category of the channel
channel_started	Year the channel started
channel_rank	Rank of the channel among the most-subscribed Vietnamese channels
channel_subscribers	Number of subscribers of the channel
id	ID of the video
title	Title of the video
length_title	Length of the video title (tokens)
categories	Categories of the video
description	Description of the video
tags	Tags of the video
num_tags	Number of tags of the video
upload_date	Upload date of the video
delta_upload_date	Time between the upload date and the collection date (days)
duration	Duration of the video (minutes)
view_count	Number of views of the video
like_count	Number of likes of the video
comment_count	Number of comments on the video
dislike_count	Number of dislikes of the video
like_per_view	Number of likes per view
comment_per_view	Number of comments per view
dislike_per_view	Number of dislikes per view
engagement_rate_1	Total comments and likes per view
engagement_rate_2	Total comments, likes, and dislikes per view
q_score	Q score of the video
label_1	Engagement level of the video based on engagement_rate_1
label_2	Engagement level of the video based on q_score
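
The ratio columns can be recomputed from the raw counts. The following is a minimal sketch, assuming each "per view" column is the corresponding count divided by view_count and that the two engagement rates are the sums described in the table. The function name `engagement_ratios` is hypothetical, and the Q score is left out because its formula is not given in this README.

```python
def engagement_ratios(view_count, like_count, comment_count, dislike_count):
    """Recompute the per-view columns from raw counts (assumed definitions)."""
    if view_count <= 0:
        raise ValueError("view_count must be positive")
    return {
        "like_per_view": like_count / view_count,
        "comment_per_view": comment_count / view_count,
        "dislike_per_view": dislike_count / view_count,
        # engagement_rate_1: comments and likes per view
        "engagement_rate_1": (like_count + comment_count) / view_count,
        # engagement_rate_2: comments, likes, and dislikes per view
        "engagement_rate_2": (like_count + comment_count + dislike_count) / view_count,
    }


# Made-up counts, for illustration only.
ratios = engagement_ratios(view_count=1000, like_count=50, comment_count=10, dislike_count=5)
print(ratios["engagement_rate_1"], ratios["engagement_rate_2"])
```

Comparing these values against the stored columns for a few rows is a quick way to check whether the assumed definitions match the dataset.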

MissingRate

Column	Missing rate
channel_id	0.000000
channel_name	0.000000
channel_category	0.000000
channel_started	0.000000
channel_rank	0.000000
channel_subscribers	0.000000
id	0.000000
title	0.000000
length_title	0.000000
categories	0.000000
description	0.017609
tags	0.000000
num_tags	0.000000
upload_date	0.000000
delta_upload_date	0.000000
duration	0.000000
view_count	0.000000
like_count	0.000000
comment_count	0.000000
dislike_count	0.000000
like_per_view	0.000000
comment_per_view	0.000000
dislike_per_view	0.000000
engagement_rate_1	0.000000
engagement_rate_2	0.000000
q_score	0.000000
label_1	0.000000
label_2	0.000000
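
A missing rate of this kind is usually the fraction of rows in which a column has no value. The sketch below illustrates that computation on a few made-up rows, assuming a missing value is represented as `None` or an absent key; the helper name `missing_rates` is hypothetical and this is not necessarily how the table above was produced.

```python
def missing_rates(rows, columns):
    """Return, for each column, the fraction of rows where it is None or absent."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Made-up rows, for illustration only.
rows = [
    {"id": "a", "description": "first video"},
    {"id": "b", "description": None},
    {"id": "c", "description": "third video"},
    {"id": "d", "description": "fourth video"},
]
rates = missing_rates(rows, ["id", "description"])
print(rates)
```

With the metadata loaded into a pandas DataFrame `df`, the same per-column fractions can be obtained with `df.isna().mean()`.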

Sample

sample/audio_by_year: folder containing audio files, organized by year
sample/thumbnails_by_year: folder containing thumbnails, organized by year
sample/video_by_year: folder containing videos, organized by year
sample/entube_final.parquet: file containing the metadata
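
As a starting point for working with the sample, the sketch below maps the four entries listed above to paths and reads the metadata file. The helper names `sample_paths` and `load_metadata` are hypothetical. It assumes the `sample` folder sits in the current working directory and that pandas with a parquet engine such as pyarrow is installed. The layout inside each by-year folder is not described here, so the code does not assume one.

```python
from pathlib import Path


def sample_paths(root="sample"):
    """Map each documented sample entry to its path under `root`."""
    root = Path(root)
    return {
        "audio": root / "audio_by_year",
        "thumbnails": root / "thumbnails_by_year",
        "video": root / "video_by_year",
        "metadata": root / "entube_final.parquet",
    }


def load_metadata(root="sample"):
    """Read the metadata table into a pandas DataFrame."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return pd.read_parquet(sample_paths(root)["metadata"])


paths = sample_paths()
print(paths["metadata"])
```

Calling `load_metadata()` should then return a DataFrame whose columns are the ones listed in the metadata table.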

Embedding

The extracted features can be downloaded here.

The input data consists of 3 files: entube_embedding_train.pt, entube_embedding_val.pt, entube_embedding_test.pt
Each file contains a list in which each item is a dictionary with the following keys:

'id': ID of the video on YouTube
'embedding_title': tensor of features extracted from the title, shape (768,)
'embedding_tag': tensor of features extracted from the tags, shape (768,)
'embedding_thumbnail': tensor of features extracted from the thumbnail, shape (2560,)
'embedding_video': tensor of features extracted from the video, shape (2304, 1, 2, 2)
'embedding_audio': tensor of features extracted from the audio, shape (62, 128)
'engagement_rate_label': tensor holding label 1, which does not use the Q score
'q_score_label': tensor holding label 2, which uses the Q score
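
The sketch below shows one way to load a split and check each item against the shapes listed above. It assumes the files were saved with `torch.save` and can be read with `torch.load`. The names `EXPECTED_SHAPES`, `shape_mismatches`, and `load_split` are hypothetical. The demonstration at the end uses stand-in objects rather than real tensors, so it runs without PyTorch or the data files.

```python
from types import SimpleNamespace

# Shapes as documented in the key list above.
EXPECTED_SHAPES = {
    "embedding_title": (768,),
    "embedding_tag": (768,),
    "embedding_thumbnail": (2560,),
    "embedding_video": (2304, 1, 2, 2),
    "embedding_audio": (62, 128),
}


def shape_mismatches(item):
    """Return {key: actual_shape} for embeddings whose shape differs from the documented one."""
    mismatches = {}
    for key, expected in EXPECTED_SHAPES.items():
        actual = tuple(item[key].shape)
        if actual != expected:
            mismatches[key] = actual
    return mismatches


def load_split(path):
    """Load one split file as a list of dictionaries; requires PyTorch."""
    import torch  # imported lazily so the shape check runs without PyTorch

    return torch.load(path, map_location="cpu")


# Stand-in item with one deliberately wrong shape, for illustration only.
fake_item = {key: SimpleNamespace(shape=shape) for key, shape in EXPECTED_SHAPES.items()}
fake_item["embedding_audio"] = SimpleNamespace(shape=(60, 128))
print(shape_mismatches(fake_item))
```

On the real data, `[shape_mismatches(item) for item in load_split("entube_embedding_train.pt")]` would be expected to contain only empty dictionaries if every item follows the documented shapes.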

Zip

The zipped data can be downloaded here.

The input data consists of 3 folders: audio_short_zip, video_short_zip, thumbnails_zip
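
After downloading, the archives need to be unpacked. The sketch below assumes each of the three folders holds one or more `.zip` files directly inside it; that layout is an assumption, not something stated here. The helper name `extract_archives` and the destination path in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archives(folder, destination):
    """Extract every .zip file found directly inside `folder` into `destination`.

    Each archive is unpacked into its own subfolder named after the archive.
    Returns the names of the archives that were extracted.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(destination / archive.stem)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives("thumbnails_zip", "data/thumbnails")` would unpack every archive in `thumbnails_zip` into its own subfolder of `data/thumbnails`.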
