Description

Want to know how quirky a particular movie is? Or how to find the most visually appealing movies of all time? Or how to find a movie that is similar to another movie you’ve seen but less big budget and more cerebral?

The tag genome is a data structure that enables you to answer queries such as these. As described in this article, the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

Dataset

This data set contains the tag relevance values that make up the tag genome, described here. Tag relevance represents the relevance of a tag to a movie on a continuous scale from 0 to 1. Tag relevance values are provided for 9,734 movies and 1,128 tags.

The data are contained in three files: tag_relevance.dat, movies.dat, and tags.dat. More details about the contents and use of these files follow.

This and other GroupLens data sets are publicly available for download at GroupLens Data Sets.

Citation

Please include the following citation if referencing this data set:

Vig, Jesse; Sen, Shilad; Riedl, John (2012): The Tag Genome: Encoding Community Knowledge to Support Novel Interaction. In: ACM Transactions on Interactive Intelligent Systems (TiiS), 2, 2012, ISSN: 2160-6455.

Usage License

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:

The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
The user must acknowledge the use of the data set in publications resulting from the use of the data set, and must send us an electronic or paper copy of those publications.
The user may not redistribute the data without separate permission.
The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.

Executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.

In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).

If you have any further questions or comments, please email grouplens-info

Further Information About GroupLens

GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens' research projects have explored a variety of fields including:

Information Filtering
Recommender Systems
Online Communities
Mobile and Ubiquitous Technologies
Digital Libraries
Local Geographic Information Systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens.

Content and Use of Files

Tag Relevance Data File Structure

Tag relevance values are contained in the file tag_relevance.dat in a tab-delimited format. Each line of this file represents the relevance of a tag to a movie, and has the following format:

<MovieID><TagID><Relevance>

Relevance values are on a continuous 0-1 scale. A value of 1 indicates that a tag is strongly relevant to a movie and a value of 0 indicates that a tag has no relevance to a movie.
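
As a minimal sketch of reading this format, the Python below parses tab-delimited lines into a dictionary keyed by (MovieID, TagID). It assumes that both IDs are integers and that fields are separated by a single tab. The function name parse_tag_relevance and the sample rows are illustrative and are not taken from the data set.

```python
import io


def parse_tag_relevance(lines):
    """Map (movie_id, tag_id) -> relevance from tab-delimited lines."""
    relevance = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        movie_id, tag_id, value = line.split("\t")
        # Assumption: both IDs are integers; relevance is a float in [0, 1].
        relevance[(int(movie_id), int(tag_id))] = float(value)
    return relevance


# Made-up rows for illustration. With the real file, pass an open file
# handle for tag_relevance.dat instead of this in-memory sample.
sample = io.StringIO("1\t1\t0.025\n1\t2\t0.93\n2\t1\t0.4\n")
relevance = parse_tag_relevance(sample)
```

Keying on the (MovieID, TagID) pair makes single lookups cheap; for whole-genome computations a movie-by-tag matrix may be a better fit.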

Many of the movie-tag pairs have low relevance values that are close to zero but not exactly zero. Very low values can be rounded to zero in any application of the tag genome that displays relevance values to the end user. However, rounding can also create discontinuities that might affect, for example, applications that compute differences in tag relevance values.
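
The trade-off can be seen in a small sketch. The cutoff of 0.05 is an arbitrary assumption chosen for illustration (the data set does not prescribe one), the helper name round_low_relevance is hypothetical, and the two relevance values are made up.

```python
def round_low_relevance(value, cutoff=0.05):
    """Return 0.0 for values below the cutoff; leave other values unchanged."""
    return 0.0 if value < cutoff else value


# Two made-up relevance values for the same tag on two different movies,
# sitting just either side of the assumed cutoff.
movie_a, movie_b = 0.049, 0.051

raw_difference = abs(movie_a - movie_b)
rounded_difference = abs(
    round_low_relevance(movie_a) - round_low_relevance(movie_b)
)

print(f"raw difference:     {raw_difference:.3f}")
print(f"rounded difference: {rounded_difference:.3f}")
```

Here two nearly identical values end up about 0.051 apart after rounding instead of about 0.002, which is why it may be safer to round only at display time and to compute on the unrounded values.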

Movie Data File Structure

All movies in the tag genome are contained in the file movies.dat in a tab-delimited format. Each line of this file represents a single movie, and has the following format:

<MovieID><Title><MoviePopularity>

MovieID is the movie id from MovieLens. Title is the movie title from MovieLens. MoviePopularity is the number of ratings for this movie on MovieLens. This can optionally be used to filter the movies. Tag relevance values may be slightly less accurate for very obscure movies as they have less data associated with them.
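
A minimal sketch of filtering by MoviePopularity follows. It assumes MovieID and MoviePopularity are integers and that titles contain no tab characters. The Movie type, the parse_movies function, the min_ratings threshold of 100, and the sample rows are all hypothetical.

```python
import io
from typing import NamedTuple


class Movie(NamedTuple):
    movie_id: int
    title: str
    popularity: int  # number of ratings on MovieLens


def parse_movies(lines):
    """Parse tab-delimited <MovieID><Title><MoviePopularity> lines."""
    movies = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        movie_id, title, popularity = line.split("\t")
        movies.append(Movie(int(movie_id), title, int(popularity)))
    return movies


# Made-up rows for illustration; these are not real entries from movies.dat.
sample = io.StringIO(
    "1\tExample Movie A (1995)\t12000\n"
    "2\tExample Movie B (2001)\t35\n"
)
movies = parse_movies(sample)

# Hypothetical threshold: keep only movies with at least this many ratings.
min_ratings = 100
well_rated = [m for m in movies if m.popularity >= min_ratings]
```

The right threshold depends on the application; the value above is only a placeholder.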

If accented characters (e.g. Misérables, Les (1995)) or other special characters in movie titles display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured to handle Unicode characters.
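
In a script, this usually means stating the encoding explicitly when opening the file. The sketch below assumes the files are encoded as UTF-8, which this document does not confirm; the MovieID and popularity in the sample row are invented, and a temporary file stands in for movies.dat so the example runs on its own.

```python
import os
import tempfile

# Hypothetical row: only the title comes from the text above.
row = "1\tMisérables, Les (1995)\t100\n"

# Write a throwaway file so the example runs without the real data set.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".dat", delete=False
) as tmp:
    tmp.write(row)
    path = tmp.name

# Passing encoding explicitly avoids relying on the platform default,
# which differs between operating systems and locales.
with open(path, encoding="utf-8") as handle:
    title = handle.readline().rstrip("\n").split("\t")[1]

os.remove(path)
```

If the real files use a different encoding, decoding will typically either raise UnicodeDecodeError or produce garbled titles, in which case the encoding argument should be changed to match.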

Movie titles, by policy, should be entered identically to those found in IMDb, including year of release. However, they are entered manually, so errors and inconsistencies may exist.

Tag Data File Structure

Tag information is contained in the file tags.dat in a tab-delimited format. Each line of this file represents one tag from the tag genome, and has the following format:

<TagID><Tag><TagPopularity>

TagID is a unique ID for each tag in the tag genome, and is specific to this data set.

TagPopularity equals the number of distinct users on MovieLens who have applied the tag, as discussed here. Tag popularity can be used as a way to filter the tags selected from the tag genome. For example, the Movie Tuner application only displays tags with a popularity score greater than 50.
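
The popularity filter described above can be sketched as follows. The code assumes TagID and TagPopularity are integers; the function name parse_tags is hypothetical, and the IDs and popularity counts in the sample rows are made up (only the tag names appear elsewhere in this document).

```python
import io


def parse_tags(lines):
    """Return (tag_id, tag, popularity) tuples from tab-delimited lines."""
    tags = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        tag_id, tag, popularity = line.split("\t")
        tags.append((int(tag_id), tag, int(popularity)))
    return tags


# Made-up rows for illustration; these are not real entries from tags.dat.
sample = io.StringIO(
    "1\tatmospheric\t250\n"
    "2\tthought-provoking\t51\n"
    "3\trealistic\t50\n"
)
tags = parse_tags(sample)

# Keep tags whose popularity is strictly greater than 50, mirroring the
# Movie Tuner rule mentioned above.
popular_tags = [tag for _, tag, popularity in tags if popularity > 50]
```

Note that the comparison is strict, so a tag with a popularity of exactly 50 is excluded.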