Building a Book Recommendation System with Cosine Similarity and Tf-idf Vectorization Techniques

Kevin Kibe
5 min read · Feb 7, 2023


Book recommendation systems are a useful tool for finding new books to read based on user preferences. One approach to building these systems uses cosine similarity and Tf-idf vectorization. Tf-idf vectorization converts text into numerical vectors that reflect the importance of each word, while cosine similarity measures the similarity between two such vectors. Together, these techniques let us measure how similar books are based on their genre and summary information, and provide recommendations accordingly. This article walks through the implementation of this technique in a book recommendation system. The code is linked at the bottom of the article.
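For intuition, here is a minimal, self-contained sketch of the two ideas working together; the toy sentences are mine, not from the dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy documents: the first two are about a similar topic
docs = [
    "a wizard attends a school of magic",
    "a young wizard studies magic and friendship",
    "a detective investigates a murder in london",
]

# Each row of the matrix is one document's Tf-idf vector
vectors = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities; the two wizard stories score highest
print(cosine_similarity(vectors))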

Step 1: Importing Libraries and Loading the Data

The dataset, from Kaggle, consists of plot summaries for 16,559 books extracted from Wikipedia, along with aligned metadata from Freebase, including book author, title, and genre.

import re
import csv
import json

import pandas as pd
from tqdm import tqdm


data = []

with open("booksummaries.txt", 'r') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in tqdm(reader):
        data.append(row)

We then store the data, which is a list of rows, in a dataframe.

# Storing the data in a dataframe
# Initialize empty lists to store the data
book_id = []
book_name = []
summary = []
genre = []

# Iterate over the rows in the data
for i in tqdm(data):
    # Extract the information for each column and store it in the corresponding list
    book_id.append(i[0])    # Wikipedia article ID
    book_name.append(i[2])  # book title
    genre.append(i[5])      # Freebase genre tags
    summary.append(i[6])    # plot summary

# Create a Pandas DataFrame from the lists
books = pd.DataFrame({'book_id': book_id, 'book_name': book_name,
                      'genre': genre, 'summary': summary})
books.head(2)

Output:

Step 2: Preprocessing the Data

We define a function that converts the input to a string, lowercases it, removes special characters and digits, tokenizes it, removes stop words, and lemmatizes the tokens.


# Preprocessing
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Initialize the stop words
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert the input text to string
    text = str(text)

    # Convert to lowercase
    text = text.lower()

    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z]', ' ', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize the tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Return the processed text as a string
    return " ".join(tokens)

def preprocess_dataframe(df, column_name):
    df[column_name] = df[column_name].apply(preprocess_text)
    return df

# Apply the preprocess_dataframe function to the books DataFrame
books_df = preprocess_dataframe(books, 'book_name')
books_df = preprocess_dataframe(books_df, 'genre')
books_df = preprocess_dataframe(books_df, 'summary')

# Display the first five rows of the processed dataframe
books_df.head()

Output:
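To get a feel for what the preprocessing does, here is a quick check on a single string; the example sentence and the expected result are my own illustration, not from the original walkthrough:

preprocess_text("The Wizard of Oz (1900) is a children's novel!")
# expected to yield something like: 'wizard oz child novel'
# (lowercased, digits and punctuation stripped, stop words removed,
# and 'children' lemmatized to 'child')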

I then combined the data in the genre and summary columns into a new column named book_info and dropped all columns except book_name and book_info.


# Placing the entries in the summary and genre columns in a column called book_info
books_df["book_info"] = books_df["summary"] + " " + books_df["genre"]

# Deleting the summary and genre columns
books_df.drop(['summary', 'genre'], inplace=True, axis=1)

# Dropping the book_id column
books_df.drop(['book_id'], inplace=True, axis=1)
books_df.sample(3)

Output:

Step 3: Data Vectorization and Cosine Similarity

I used TfidfVectorizer from the scikit-learn library to vectorize the data, i.e., convert the text into a numerical representation. I then computed pairwise cosine similarity using cosine_similarity, also from scikit-learn.


# Vectorizing the book_info column using TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=0, stop_words='english')

tfidf_matrix = tf.fit_transform(books_df['book_info'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

Output:
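As an optional sanity check (my addition, not part of the original walkthrough), you can inspect the shape of the Tf-idf matrix and a few of the learned features:

# Inspect the vectorized representation
print(tfidf_matrix.shape)  # (number of books, vocabulary size)
print(tf.get_feature_names_out()[:10])  # a few unigram/bigram features (scikit-learn >= 1.0)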

I then converted the book_name column into a Series named indices, which maps row positions to book titles.


indices = pd.Series(books_df['book_name'])
indices[:5]

Output:

Step 4: Recommendation Function

The function takes a book title as input and returns a list of the 10 most similar books.


def recommend(title, cosine_sim=cosine_sim):
    if title not in indices.values:
        return "Title not found in the database."

    recommended_books = []
    idx = indices[indices == title].index[0]  # get the index of the book name matching the input title
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)  # similarity scores in descending order
    top_10_indices = list(score_series.iloc[1:11].index)  # get the indices of the top 10 most similar books
    # [1:11] excludes index 0, which is the input book itself

    for i in top_10_indices:  # append the titles of the top 10 similar books to the recommended_books list
        recommended_books.append(list(books_df['book_name'])[i])

    return recommended_books

A demonstration. Note that book titles were preprocessed along with the rest of the data, so the input title should be in its preprocessed (lowercased) form:

# To output the recommendations, e.g. for 'animal farm'
recommend('animal farm')

The system returns 'Title not found in the database.' when the book is not in the database. I'll soon publish a part 2 on deploying it as a web app using Streamlit.
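As a preview, a minimal Streamlit sketch of my own (illustrative only, not the part 2 code) could wire the function up along these lines:

# Illustrative only; assumes recommend() and preprocess_text() are already defined and loaded
import streamlit as st

st.title("Book Recommender")
title = st.text_input("Enter a book title")
if title:
    st.write(recommend(preprocess_text(title)))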

Follow me for more content on data science, machine learning, and AI.

The link to the repository.
