Want to know how quirky a particular movie is? Or how to find the most visually appealing movies of all time? Or how to find a movie that is similar to another movie you’ve seen but less big budget and more cerebral?
The tag genome is a data structure that enables you to answer queries such as these. As described in this article, the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.
This data set contains the tag relevance values that make up the tag genome, described here. Tag relevance represents the relevance of a tag to a movie on a continuous scale from 0 to 1. Tag relevance values are provided for 9,734 movies and 1,128 tags.
The data are contained in three files: tag_relevance.dat, movies.dat, and tags.dat. More details about the contents and use of these files follow.
This and other GroupLens data sets are publicly available for download at GroupLens Data Sets.
Please include the following citation if referencing this data set:
(2012): The Tag Genome: Encoding Community Knowledge to Support Novel Interaction. In: ACM Transactions on Interactive Intelligent Systems (TiiS), 2, 2012, ISSN: 2160-6455.
Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:
Executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.
In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).
If you have any further questions or comments, please email grouplens-info
GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens' research projects have explored a variety of fields including:
GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens.
Tag relevance values are contained in the file tag_relevance.dat in a tab-delimited format. Each line of this file represents the relevance of a tag to a movie, and has the following format:
<MovieID><TagID><Relevance>
Relevance values are on a continuous 0-1 scale. A value of 1 indicates that a tag is strongly relevant to a movie and a value of 0 indicates that a tag has no relevance to a movie.
Many of the movie-tag pairs have relevance values that are close to zero but not exactly zero. Very low values can be rounded to zero in any application of the tag genome that displays relevance values to the end user. However, rounding can also create discontinuities that might affect, for example, applications that compute differences in tag relevance values.
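The effect of rounding on differences can be illustrated with a minimal Python sketch. The function name `read_tag_relevance`, the `zero_below` threshold of 0.05, and the sample rows are all assumptions made for illustration; they are not part of the data set or of any GroupLens tool.

```python
import csv
import io

def read_tag_relevance(lines, zero_below=0.0):
    """Parse tab-delimited MovieID, TagID, Relevance rows into a dict.

    zero_below is a hypothetical display threshold: any relevance value
    under it is replaced by 0.0. The default of 0.0 leaves values unchanged.
    """
    relevance = {}
    for movie_id, tag_id, value in csv.reader(lines, delimiter="\t"):
        score = float(value)
        if score < zero_below:
            score = 0.0
        relevance[(int(movie_id), int(tag_id))] = score
    return relevance

# Made-up sample rows standing in for tag_relevance.dat.
sample = io.StringIO("1\t1\t0.025\n1\t2\t0.9\n2\t1\t0.031\n2\t2\t0.4\n")

raw = read_tag_relevance(sample)
sample.seek(0)
rounded = read_tag_relevance(sample, zero_below=0.05)

# Difference in tag 1 between movie 2 and movie 1, before and after rounding.
raw_diff = raw[(2, 1)] - raw[(1, 1)]
rounded_diff = rounded[(2, 1)] - rounded[(1, 1)]
print(raw_diff, rounded_diff)
```

With the unrounded values the two movies differ slightly on tag 1; after rounding, the difference collapses to zero. To read the real file, the same function could be passed `open("tag_relevance.dat", newline="")` in place of the in-memory sample.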
All movies in the tag genome are contained in the file movies.dat in a tab-delimited format. Each line of this file represents a single movie, and has the following format:
<MovieID><Title><MoviePopularity>
MovieID is the movie id from MovieLens. Title is the movie title from MovieLens. MoviePopularity is the number of ratings for this movie on MovieLens. This can optionally be used to filter the movies. Tag relevance values may be slightly less accurate for very obscure movies as they have less data associated with them.
If accented characters (e.g. Misérables, Les (1995)) or other special characters in movie titles display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured to handle Unicode characters.
Movie titles, by policy, should be entered identically to those found in IMDb, including year of release. However, they are entered manually, so errors and inconsistencies may exist.
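As a rough sketch of reading movies.dat from a script, the Python below decodes the text explicitly and filters on MoviePopularity. It assumes the file is UTF-8 encoded, which this document does not state; the function name `read_movies`, the cutoff of 1000 ratings, and the sample rows (including their ids and rating counts) are made up for illustration.

```python
import csv
import io

def read_movies(stream, min_popularity=0):
    """Parse tab-delimited MovieID, Title, MoviePopularity rows.

    Returns {movie_id: (title, popularity)} for movies with at least
    min_popularity ratings. QUOTE_NONE keeps quote characters in titles
    as literal text instead of treating them as CSV quoting.
    """
    movies = {}
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for movie_id, title, popularity in reader:
        if int(popularity) >= min_popularity:
            movies[int(movie_id)] = (title, int(popularity))
    return movies

# Made-up rows, encoded as UTF-8 bytes to stand in for movies.dat on disk.
data = (
    "1\tToy Story (1995)\t50000\n"
    "73\tMisérables, Les (1995)\t300\n"
).encode("utf-8")

def open_sample():
    # Decoding explicitly as UTF-8 (an assumption) keeps accented titles intact.
    return io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="")

all_movies = read_movies(open_sample())
popular = read_movies(open_sample(), min_popularity=1000)
print(all_movies[73][0])
print(sorted(popular))
```

For the real file, `open("movies.dat", encoding="utf-8", newline="")` would take the place of `open_sample()`, provided the encoding assumption holds.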
Tag information is contained in the file tags.dat in a tab-delimited format. Each line of this file represents one tag from the tag genome, and has the following format:
<TagID><Tag><TagPopularity>
TagID is a unique ID for each tag in the tag genome, and is specific to this data set. TagPopularity equals the number of distinct users on MovieLens who have applied the tag, as discussed here. Tag popularity can be used as a way to filter the tags selected from the tag genome. For example, the Movie Tuner application only displays tags with a popularity score greater than 50.
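A popularity filter of this kind might look like the following Python sketch. The function name `read_tags` and the sample popularity counts are hypothetical; only the "greater than 50" cutoff comes from the Movie Tuner example above, and this is not Movie Tuner's actual code.

```python
import csv
import io

def read_tags(stream, popularity_above=50):
    """Parse tab-delimited TagID, Tag, TagPopularity rows.

    Keeps only tags whose popularity is strictly greater than
    popularity_above, and returns them as {tag_id: tag}.
    """
    tags = {}
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for tag_id, tag, popularity in reader:
        if int(popularity) > popularity_above:
            tags[int(tag_id)] = tag
    return tags

# Made-up rows standing in for tags.dat; the counts are illustrative only.
sample = io.StringIO(
    "1\tatmospheric\t120\n"
    "2\tthought-provoking\t51\n"
    "3\trealistic\t50\n"
)

visible_tags = read_tags(sample)
print(visible_tags)
```

Because the comparison is strict, a tag with a popularity of exactly 50 is excluded, matching the "greater than 50" wording.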