Building a Multi RAG Streamlit Application to Interact with PDFs

Chatting with large PDFs is useful: you can query your notes, books, and other documents in plain language. This blog post walks you through building a Multi RAG, Streamlit-based web application that reads, processes, and interacts with PDF data through a conversational AI chatbot. Here’s a step-by-step breakdown of how the application works, in simple language for easy understanding.

Setting the Stage with Necessary Tools

The application begins by importing various powerful libraries:
– Streamlit: Used to create the web interface.
– PyPDF2: A tool for reading PDF files.
– LangChain: A framework for building applications on top of language models, used here for text splitting, retrieval, and the conversational agent.
– FAISS: A library for efficient similarity search of vectors, which is useful for finding information quickly in large datasets.

import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.tools.retriever import create_retriever_tool
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.agents import AgentExecutor, create_tool_calling_agent
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

Reading and Processing PDF Files

The first major function within our application is designed to read PDF files:
– PDF Reader: When a user uploads one or more PDF files, the application reads each page of these documents and extracts the text, merging it into a single continuous string.

Once the text is extracted, it is split into manageable chunks:

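– Text Splitter: Using the LangChain library, the text is divided into chunks of up to 1000 characters, with a 200-character overlap between consecutive chunks so that text cut at a boundary still appears intact in one of them. This segmentation keeps each piece small enough to embed and retrieve efficiently.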
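def pdf_read(pdf_doc):
    """Extract the text of every page of every uploaded PDF into one string."""
    text = ""
    for pdf in pdf_doc:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # Fall back to "" for pages with no extractable text (e.g. scanned images)
            text += page.extract_text() or ""
    return text


def get_chunks(text):
    """Split the text into chunks of up to 1000 characters with a 200-character overlap."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_text(text)
    return chunks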
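
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified sliding window, not the algorithm RecursiveCharacterTextSplitter uses (which also tries to break on separators such as paragraphs and sentences), and the function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 2500 characters of sample text: the digits 0-9 repeated
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk reappear at the start of the next one, which is why a sentence that straddles a boundary is still fully contained in at least one chunk.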

Creating a Searchable Text Database and Making Embeddings

To make the text searchable, the application converts the text chunks into vector representations:
– Vector Store: The application embeds each text chunk as a vector using a spaCy model, then uses the FAISS library to index those vectors and save the index locally. This transformation is crucial because it lets the system find the chunks most similar to a question quickly, without scanning the full text.

embeddings = SpacyEmbeddings(model_name="en_core_web_sm")


def vector_store(text_chunks):
    # Embed every chunk, build the FAISS index, and persist it to the faiss_db folder
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    vector_store.save_local("faiss_db")
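
The idea behind vector search can be sketched without any of the libraries above. The toy example below uses word counts as a stand-in for spaCy embeddings and a brute-force cosine-similarity ranking as a stand-in for the FAISS index; the names embed, cosine, and top_k are hypothetical, and the metric a real FAISS index uses may differ.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


docs = [
    "faiss builds an index over vectors",
    "streamlit renders the web interface",
    "the splitter cuts text into chunks",
]
best = top_k("which library renders the interface", docs, k=1)
print(best)  # ['streamlit renders the web interface']
```

A real index avoids comparing the query against every stored vector one by one in Python, which is what makes retrieval fast on large document sets.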

Setting Up the Conversational AI

The core of this application is the conversational AI, which uses OpenAI’s powerful models:
– AI Configuration: The app sets up a conversational AI using OpenAI’s GPT model. This AI is designed to answer questions based on the PDF content it has processed.
– Conversation Chain: The AI uses a set of prompts to understand the context and provide accurate responses to user queries. If the answer to a question isn’t available in the text, the AI is programmed to respond with “answer is not available in the context,” ensuring that users do not receive incorrect information.

def get_conversational_chain(tools, ques):
    # Read the key from the environment rather than hard-coding it in the source
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key=os.getenv("OPENAI_API_KEY"))
    prompt = ChatPromptTemplate.from_messages([...])  # full prompt is listed in the Entire Code section
    tool = [tools]
    agent = create_tool_calling_agent(llm, tool, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tool, verbose=True)
    response = agent_executor.invoke({"input": ques})
    print(response)
    st.write("Reply: ", response['output'])


def user_input(user_question):
    # Load the index saved by vector_store() and expose it to the agent as a retrieval tool
    new_db = FAISS.load_local("faiss_db", embeddings, allow_dangerous_deserialization=True)
    retriever = new_db.as_retriever()
    retrieval_chain = create_retriever_tool(retriever, "pdf_extractor", "This tool is to give answer to queries from the pdf")
    get_conversational_chain(retrieval_chain, user_question)
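
The model call needs a valid OpenAI API key. One way to keep it out of the source file is to read it from an environment variable and fail early with a clear message when it is missing. The sketch below assumes the variable is named OPENAI_API_KEY; the helper name require_api_key is hypothetical.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable, or raise if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before starting the app")
    return key
```

The returned value could then be passed as the api_key argument when the model is created, so a missing key surfaces at startup instead of on the first question.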

User Interaction

With the backend ready, the application uses Streamlit to create a user-friendly interface:
– User Interface: Users are presented with a simple text input where they can type their questions related to the PDF content. The application then displays the AI’s responses directly on the web page.
– File Uploader and Processing: Users can upload PDF files at any time. When they click Submit & Process, the application extracts the text and rebuilds the local FAISS index so the AI can search the new content.

def main():
    st.set_page_config("Chat PDF")
    st.header("RAG based Chat with PDF")

    # Main area: question box; answers are written below it
    user_question = st.text_input("Ask a Question from the PDF Files")
    if user_question:
        user_input(user_question)

    # Sidebar: upload PDFs and build the index
    with st.sidebar:
        pdf_doc = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", accept_multiple_files=True)
        if st.button("Submit & Process"):
            with st.spinner("Processing..."):
                raw_text = pdf_read(pdf_doc)
                text_chunks = get_chunks(raw_text)
                vector_store(text_chunks)
                st.success("Done")
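
Because the question box is available before any PDF has been processed, a question typed too early would try to load an index that does not exist yet. A small guard can check for the saved files first. This sketch assumes the index was saved to the faiss_db folder under the file names index.faiss and index.pkl, which is what LangChain's FAISS.save_local writes by default as far as I know; the helper name faiss_index_exists is hypothetical.

```python
import tempfile
from pathlib import Path


def faiss_index_exists(folder: str = "faiss_db", index_name: str = "index") -> bool:
    """Return True when both files of a saved index are present in the folder."""
    base = Path(folder)
    return (base / f"{index_name}.faiss").is_file() and (base / f"{index_name}.pkl").is_file()


# Demonstration with a temporary folder instead of a real index
with tempfile.TemporaryDirectory() as tmp:
    before = faiss_index_exists(tmp)      # nothing saved yet
    (Path(tmp) / "index.faiss").touch()
    (Path(tmp) / "index.pkl").touch()
    after = faiss_index_exists(tmp)       # both files now present
print(before, after)  # False True
```

In the app, the check could run before user_input is called, with a message asking the user to upload and process a PDF first when it returns False.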

Conclusion

Flow Chart of how answers are streamed

Entire Code

import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.tools.retriever import create_retriever_tool
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.agents import AgentExecutor, create_tool_calling_agent
import os

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
embeddings = SpacyEmbeddings(model_name="en_core_web_sm")

def pdf_read(pdf_doc):
    text = ""
    for pdf in pdf_doc:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # Fall back to "" for pages with no extractable text
            text += page.extract_text() or ""
    return text


def get_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_text(text)
    return chunks


def vector_store(text_chunks):
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    vector_store.save_local("faiss_db")


def get_conversational_chain(tools, ques):
    # Alternative model:
    # llm = ChatAnthropic(model="claude-3-sonnet-20240229", temperature=0, api_key=os.getenv("ANTHROPIC_API_KEY"), verbose=True)
    # Read the key from the environment rather than hard-coding it in the source
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key=os.getenv("OPENAI_API_KEY"))
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """You are a helpful assistant. Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
provided context just say, "answer is not available in the context", don't provide the wrong answer""",
            ),
            ("placeholder", "{chat_history}"),
            ("human", "{input}"),
            ("placeholder", "{agent_scratchpad}"),
        ]
    )
    tool = [tools]
    agent = create_tool_calling_agent(llm, tool, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tool, verbose=True)
    response = agent_executor.invoke({"input": ques})
    print(response)
    st.write("Reply: ", response['output'])


def user_input(user_question):
    new_db = FAISS.load_local("faiss_db", embeddings, allow_dangerous_deserialization=True)
    retriever = new_db.as_retriever()
    retrieval_chain = create_retriever_tool(retriever, "pdf_extractor", "This tool is to give answer to queries from the pdf")
    get_conversational_chain(retrieval_chain, user_question)


def main():
    st.set_page_config("Chat PDF")
    st.header("RAG based Chat with PDF")

    user_question = st.text_input("Ask a Question from the PDF Files")
    if user_question:
        user_input(user_question)

    with st.sidebar:
        st.title("Menu:")
        pdf_doc = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", accept_multiple_files=True)
        if st.button("Submit & Process"):
            with st.spinner("Processing..."):
                raw_text = pdf_read(pdf_doc)
                text_chunks = get_chunks(raw_text)
                vector_store(text_chunks)
                st.success("Done")


if __name__ == "__main__":
    main()

Save the code as app.py and run the application with:

streamlit run app.py

Output:

What it looks like!
