truongcntn2017/EnTube

EnTube: A dataset for Youtube video engagement analytics

YouTube, one of the largest online video-sharing platforms, gives content creators a place to share information and earn extra income. Predicting whether viewers will engage with a video helps creators improve its content and quality before publishing. To facilitate this task, we built an annotated dataset of 23,738 videos collected from 72 Vietnamese YouTube channels in four categories (comedy, travel-and-events, education, and science-and-technology), published over a span of 12 years. We evaluate a number of metrics for measuring video engagement and propose a novel measure of a video's engagement: the Q score. Using this measure, we annotate each video with one of three engagement levels: Engage, Neutral, and Not Engage. From the labeled dataset, we built a multimodal model that infers the degree of engagement from the content of a YouTube video, such as its title, audio, thumbnail, video, and tags. We believe our dataset and metric will be useful for engagement analytics as well as for studies of social media content.

Table of Contents

  1. Metadata
  2. Missing rate
  3. Sample
  4. Embedding
  5. Zip

Metadata

Column                Description
channel_id            ID of the channel
channel_name          Name of the channel
channel_category      Category of the channel
channel_started       Year the channel started
channel_rank          Rank of the channel among the most-subscribed Vietnamese channels
channel_subscribers   Number of channel subscribers
id                    ID of the video
title                 Title of the video
length_title          Length of the video title (tokens)
categories            Categories of the video
description           Description of the video
tags                  Tags of the video
num_tags              Number of tags on the video
upload_date           Upload date of the video
delta_upload_date     Days between the upload date and the collection date
duration              Duration of the video (minutes)
view_count            Number of views
like_count            Number of likes
comment_count         Number of comments
dislike_count         Number of dislikes
like_per_view         Likes per view
comment_per_view      Comments per view
dislike_per_view      Dislikes per view
engagement_rate_1     Comments plus likes, per view
engagement_rate_2     Comments, likes, and dislikes combined, per view
q_score               Q score of the video
label_1               Engagement level based on engagement_rate_1
label_2               Engagement level based on q_score
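
The per-view columns above can be read as simple ratios of the raw counts. The following is a minimal sketch of that reading, not the authors' code: the function name `engagement_rates` is hypothetical, the ratios are assumed to be plain fractions (not percentages), and rejecting videos with zero views is an assumption made only to keep the example well defined.

```python
def engagement_rates(view_count, like_count, comment_count, dislike_count):
    """Return the per-view ratios described in the metadata table.

    Assumption: ratios are fractions of view_count, not percentages.
    """
    if view_count <= 0:
        # Assumption: the README does not say how zero-view videos are handled.
        raise ValueError("view_count must be positive")
    return {
        "like_per_view": like_count / view_count,
        "comment_per_view": comment_count / view_count,
        "dislike_per_view": dislike_count / view_count,
        # engagement_rate_1: comments and likes per view
        "engagement_rate_1": (comment_count + like_count) / view_count,
        # engagement_rate_2: comments, likes, and dislikes per view
        "engagement_rate_2": (comment_count + like_count + dislike_count) / view_count,
    }


# Made-up counts, for illustration only.
rates = engagement_rates(view_count=1000, like_count=50, comment_count=10, dislike_count=5)
```

The Q score is not defined in this README, so it is left out of the sketch.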

Missing rate

Column                Missing rate
channel_id            0.000000
channel_name          0.000000
channel_category      0.000000
channel_started       0.000000
channel_rank          0.000000
channel_subscribers   0.000000
id                    0.000000
title                 0.000000
length_title          0.000000
categories            0.000000
description           0.017609
tags                  0.000000
num_tags              0.000000
upload_date           0.000000
delta_upload_date     0.000000
duration              0.000000
view_count            0.000000
like_count            0.000000
comment_count         0.000000
dislike_count         0.000000
like_per_view         0.000000
comment_per_view      0.000000
dislike_per_view      0.000000
engagement_rate_1     0.000000
engagement_rate_2     0.000000
q_score               0.000000
label_1               0.000000
label_2               0.000000
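
The README does not state how these rates were computed. A plausible reading is the fraction of rows in which a column has no value; under that reading, 0.017609 for description would correspond to roughly 418 of the 23,738 videos. The sketch below illustrates that computation on a few made-up rows. The helper name `missing_rate` and the choice to treat None, NaN, and absent keys as missing are assumptions.

```python
import math


def missing_rate(rows, columns):
    """Fraction of rows in which each column is missing.

    Assumption: a value counts as missing if it is None, NaN, or absent.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    total = len(rows)
    if total == 0:
        return {column: 0.0 for column in columns}
    return {
        column: sum(is_missing(row.get(column)) for row in rows) / total
        for column in columns
    }


# Made-up rows, for illustration only.
rows = [
    {"id": "a", "description": "first video"},
    {"id": "b", "description": None},
    {"id": "c", "description": "third video"},
    {"id": "d", "description": float("nan")},
]
missing = missing_rate(rows, ["id", "description"])
```

With the metadata loaded into a pandas DataFrame `df`, the equivalent one-liner would be `df.isna().mean()`.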

Sample

  • sample/audio_by_year: folder containing audio files, organized by year
  • sample/thumbnails_by_year: folder containing thumbnails, organized by year
  • sample/video_by_year: folder containing videos, organized by year
  • sample/entube_final.parquet: file containing the metadata

Embedding

The extracted feature embeddings can be downloaded here.

  • The data consists of 3 files: entube_embedding_train.pt, entube_embedding_val.pt, entube_embedding_test.pt
  • Each file holds a list in which every item is a dictionary with the following keys:
      'id': ID of the video on YouTube
      'embedding_title': tensor of features extracted from the title, shape (768,)
      'embedding_tag': tensor of features extracted from the tags, shape (768,)
      'embedding_thumbnail': tensor of features extracted from the thumbnail, shape (2560,)
      'embedding_video': tensor of features extracted from the video, shape (2304, 1, 2, 2)
      'embedding_audio': tensor of features extracted from the audio, shape (62, 128)
      'engagement_rate_label': tensor holding label 1, which does not use the Q score
      'q_score_label': tensor holding label 2, which is based on the Q score
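
Before training on these files, it can help to confirm that each record matches the layout listed above. The sketch below checks the keys and tensor shapes of one record. The names `EXPECTED_SHAPES`, `check_record`, and `FakeTensor` are hypothetical; `FakeTensor` is only a stand-in with a `shape` attribute so the example runs without PyTorch installed.

```python
# Shapes as listed in this README.
EXPECTED_SHAPES = {
    "embedding_title": (768,),
    "embedding_tag": (768,),
    "embedding_thumbnail": (2560,),
    "embedding_video": (2304, 1, 2, 2),
    "embedding_audio": (62, 128),
}
OTHER_KEYS = ("id", "engagement_rate_label", "q_score_label")


def check_record(record):
    """Return a list of problems found in one embedding record (empty if none)."""
    problems = []
    for key in OTHER_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
    for key, expected in EXPECTED_SHAPES.items():
        if key not in record:
            problems.append(f"missing key: {key}")
            continue
        actual = tuple(record[key].shape)  # works for any object with .shape
        if actual != expected:
            problems.append(f"{key}: expected shape {expected}, got {actual}")
    return problems


class FakeTensor:
    """Stand-in for a tensor; only the shape attribute is modelled."""

    def __init__(self, shape):
        self.shape = shape


# A made-up record with the documented layout.
record = {key: FakeTensor(shape) for key, shape in EXPECTED_SHAPES.items()}
record.update({"id": "example_id", "engagement_rate_label": 0, "q_score_label": 0})
problems = check_record(record)
```

Assuming the files were written with `torch.save`, the real records could presumably be obtained with `torch.load("entube_embedding_train.pt")` and passed to `check_record` one at a time.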

Zip

The zipped data can be downloaded here.

  • The data consists of 3 folders: audio_short_zip, video_short_zip, thumbnails_zip
