Building a Book Recommendation System with Cosine Similarity and Tf-idf Vectorization Techniques

Kevin Kibe
5 min read · Feb 7, 2023


Book recommendation systems are a useful tool for finding new books to read based on user preferences. One approach to building these systems uses cosine similarity and Tf-idf vectorization. Tf-idf vectorization converts text into numerical vectors that reflect the importance of each word, while cosine similarity measures the similarity between two such vectors. Together, these techniques let us measure how similar books are based on their genre and summary information, and provide recommendations accordingly. This article walks through the implementation of this technique in a book recommendation system. The code is linked at the bottom of the article.
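For intuition, here is a minimal, self-contained sketch of the two ideas working together; the toy sentences are mine, not from the dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy documents: the first two are about a similar topic
docs = [
    "a wizard attends a school of magic",
    "a young wizard studies magic and friendship",
    "a detective investigates a murder in london",
]

# Each row of the matrix is one document's Tf-idf vector
vectors = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities; the two wizard stories score highest
print(cosine_similarity(vectors))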

Step 1: Importing Libraries and Loading the Data

The dataset, from Kaggle, consists of plot summaries for 16,559 books extracted from Wikipedia, along with aligned metadata from Freebase, including book author, title, and genre.

import re
import csv
import json

import pandas as pd
from tqdm import tqdm


data = []

with open("booksummaries.txt", 'r') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in tqdm(reader):
        data.append(row)

We then store the data, which is a list of rows, in a dataframe.

# Storing the data in a dataframe
# Initialize empty lists to store the data
book_id = []
book_name = []
summary = []
genre = []

# Iterate over the rows in the data
for i in tqdm(data):
    # Extract the information for each column and store it in the corresponding list
    book_id.append(i[0])    # Wikipedia article ID
    book_name.append(i[2])  # book title
    genre.append(i[5])      # Freebase genre tags
    summary.append(i[6])    # plot summary

# Create a Pandas DataFrame from the lists
books = pd.DataFrame({'book_id': book_id, 'book_name': book_name,
                      'genre': genre, 'summary': summary})
books.head(2)

Output:

Step 2: Preprocessing the Data

We define a function that converts the input to a string, lowercases it, removes special characters and digits, tokenizes it, removes stop words, and lemmatizes the tokens.


# Preprocessing
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Initialize the stop words
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert the input text to string
    text = str(text)

    # Convert to lowercase
    text = text.lower()

    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z]', ' ', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize the tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Return the processed text as a string
    return " ".join(tokens)

def preprocess_dataframe(df, column_name):
    df[column_name] = df[column_name].apply(preprocess_text)
    return df

# Apply the preprocess_dataframe function to the books DataFrame
books_df = preprocess_dataframe(books, 'book_name')
books_df = preprocess_dataframe(books_df, 'genre')
books_df = preprocess_dataframe(books_df, 'summary')

# Display the first five rows of the processed dataframe
books_df.head()

Output:
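To get a feel for what the preprocessing does, here is a quick check on a single string; the example sentence and the expected result are my own illustration, not from the original walkthrough:

preprocess_text("The Wizard of Oz (1900) is a children's novel!")
# expected to yield something like: 'wizard oz child novel'
# (lowercased, digits and punctuation stripped, stop words removed,
# and 'children' lemmatized to 'child')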

I then combined the data in the genre and summary columns into a new column named book_info and dropped all columns except book_name and book_info.


# Placing the entries in the summary and genre columns in a column called book_info
books_df["book_info"] = books_df["summary"] + " " + books_df["genre"]

# Deleting the summary and genre columns
books_df.drop(['summary', 'genre'], inplace=True, axis=1)

# Dropping the book_id column
books_df.drop(['book_id'], inplace=True, axis=1)
books_df.sample(3)

Output:

Step 3: Data Vectorization and Cosine Similarity

I used TfidfVectorizer from the scikit-learn library to vectorize the data, i.e., convert the text into a numerical representation. I then computed pairwise cosine similarity using cosine_similarity, also from scikit-learn.


# Vectorizing the book_info column using TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=0, stop_words='english')

tfidf_matrix = tf.fit_transform(books_df['book_info'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

Output:
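As an optional sanity check (my addition, not part of the original walkthrough), you can inspect the shape of the Tf-idf matrix and a few of the learned features:

# Inspect the vectorized representation
print(tfidf_matrix.shape)  # (number of books, vocabulary size)
print(tf.get_feature_names_out()[:10])  # a few unigram/bigram features (scikit-learn >= 1.0)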

I then converted the book_name column into a Series named indices, which maps row positions to book titles.


indices = pd.Series(books_df['book_name'])
indices[:5]

Output:

Step 4: Recommendation Function

The function takes a book title as input and returns a list of the 10 most similar books.


def recommend(title, cosine_sim=cosine_sim):
    if title not in indices.values:
        return "Title not found in the database."

    recommended_books = []
    idx = indices[indices == title].index[0]  # get the index of the book name matching the input title
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)  # similarity scores in descending order
    top_10_indices = list(score_series.iloc[1:11].index)  # get the indices of the top 10 most similar books
    # [1:11] excludes index 0, which is the input book itself

    for i in top_10_indices:  # append the titles of the top 10 similar books to the recommended_books list
        recommended_books.append(list(books_df['book_name'])[i])

    return recommended_books

A demonstration. Note that book titles were preprocessed along with the rest of the data, so the input title should be in its preprocessed (lowercased) form:

# To output the recommendations, e.g. for 'animal farm'
recommend('animal farm')

The system returns 'Title not found in the database.' when the book is not in the database. I'll soon publish a part 2 on deploying it as a web app using Streamlit.
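As a preview, a minimal Streamlit sketch of my own (illustrative only, not the part 2 code) could wire the function up along these lines:

# Illustrative only; assumes recommend() and preprocess_text() are already defined and loaded
import streamlit as st

st.title("Book Recommender")
title = st.text_input("Enter a book title")
if title:
    st.write(recommend(preprocess_text(title)))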

Follow me for more content on data science, machine learning, and AI.

The link to the repository.
