# Task
Build a chatbot for the Indian Penal Code using the document at "https://www.indiacode.nic.in/repealedfileopen?rfilename=A1860-45.pdf".

## Download the document

### Subtask:
Download the PDF document from the provided link.


In [1]:
import requests

# URL the PDF is fetched from (note: this differs from the indiacode.nic.in link in the task statement)
url = "https://www.iitk.ac.in/wc/data/IPC_186045.pdf"

# Local filename to save as
filename = "Indian_Penal_Code_1860.pdf"

try:
    # Send GET request; the timeout keeps the cell from hanging on a stalled connection
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # Raise error for bad status codes

    # Write the file in binary mode
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)

    print(f"✅ File successfully downloaded and saved as: {filename}")

except requests.exceptions.RequestException as e:
    print(f"❌ Error downloading file: {e}")


✅ File successfully downloaded and saved as: Indian_Penal_Code_1860.pdf
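
A successful HTTP status does not guarantee that the saved file is a PDF, since a server can return an HTML error page with status 200. The following is a minimal sketch of a sanity check, assuming it is enough to look for the `%PDF-` signature at the start of the file. The helper name `looks_like_pdf` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file at `path` exists, is non-empty, and starts with b'%PDF-'."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    with p.open("rb") as f:
        # PDF files begin with the signature "%PDF-" followed by a version number.
        return f.read(5) == b"%PDF-"
```

In the notebook this could be called as `looks_like_pdf(filename)` right after the download cell, before attempting text extraction.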


## Extract text from the document

### Subtask:
Extract the text content from the downloaded PDF document.

In [2]:
!pip install pypdf -q

In [3]:
from pypdf import PdfReader

# Local filename of the downloaded document
filename = "Indian_Penal_Code_1860.pdf"

try:
    reader = PdfReader(filename)
    text = ""
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. scanned images)
        text += page.extract_text() or ""
    print("✅ Text extracted successfully.")
except Exception as e:
    print(f"❌ Error extracting text: {e}")

✅ Text extracted successfully.
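
It can be useful to know which pages produced no text, since scanned pages would silently drop out of the chatbot's knowledge. Below is a hedged sketch of a helper that joins page text and records empty pages. The name `join_page_texts` is hypothetical; it only assumes each page object exposes an `extract_text()` method, as the pages from `PdfReader` do.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate per-page text and report 1-based numbers of pages with no text."""
    parts = []
    empty_pages = []
    for number, page in enumerate(pages, start=1):
        # Treat a missing result the same as an empty string.
        page_text = page.extract_text() or ""
        if page_text.strip():
            parts.append(page_text)
        else:
            empty_pages.append(number)
    return separator.join(parts), empty_pages
```

With the reader from the cell above, a call such as `text, empty_pages = join_page_texts(reader.pages)` would return the combined text together with the list of pages worth inspecting.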


## Preprocess and split the document into chunks

### Subtask:
Clean and preprocess the extracted text and divide it into smaller chunks.

In [4]:
import re

# Basic preprocessing: remove excessive whitespace and newlines
cleaned_text = re.sub(r'\s+', ' ', text).strip()

# Simple chunking: fixed-size chunks of 1000 characters (the last chunk may be shorter)
chunk_size = 1000
chunks = [cleaned_text[i:i + chunk_size] for i in range(0, len(cleaned_text), chunk_size)]

print(f"✅ Document split into {len(chunks)} chunks.")

✅ Document split into 447 chunks.
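
Fixed-size slicing can cut a section of the Code in half, so that a definition lands in one chunk and its punishment in the next. A common mitigation is to let consecutive chunks overlap. The sketch below illustrates the idea; the function name `chunk_with_overlap` and the default overlap of 200 characters are assumptions, not values from the original notebook.

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=200):
    """Split `text` into chunks of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `overlap=0` this reduces to the plain slicing used above. A larger overlap yields more chunks and therefore more embeddings to compute.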


## Generate embeddings for the chunks

### Subtask:
Create numerical representations (embeddings) for each chunk.

In [5]:
!pip install sentence-transformers -q

ERROR: Operation cancelled by user

In [6]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for each chunk
embeddings = model.encode(chunks)

print(f"✅ Generated embeddings for {len(embeddings)} chunks.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Generated embeddings for 447 chunks.
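
Because retrieval later compares vectors by cosine similarity, the embeddings can optionally be scaled to unit length once, after which a plain dot product gives the same ranking. The following is a minimal NumPy sketch of that normalization. The helper name `l2_normalize` is hypothetical; `SentenceTransformer.encode` also accepts `normalize_embeddings=True`, which should have a similar effect.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row of `vectors` to unit L2 norm, leaving all-zero rows as zeros."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # eps avoids division by zero for rows that are entirely zero.
    return vectors / np.maximum(norms, eps)
```

After `unit_embeddings = l2_normalize(embeddings)`, the product `unit_embeddings @ unit_query` is the cosine similarity of each chunk with a unit-length query vector.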


## Build a retrieval system

### Subtask:
Implement a system to retrieve relevant document chunks based on user queries.

In [22]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def retrieve_chunks(query, embeddings, chunks, sentence_encoder_model, top_n=3):
    """
    Retrieves the most relevant chunks based on a query using cosine similarity.
    """
    # Print the type of the object being used for encoding
    print(f"Type of sentence_encoder_model inside retrieve_chunks: {type(sentence_encoder_model)}")

    # Generate embedding for the query
    query_embedding = sentence_encoder_model.encode([query])

    # Calculate cosine similarity between query embedding and chunk embeddings
    similarities = cosine_similarity(query_embedding, embeddings)[0]

    # Get the indices of the top_n most similar chunks
    most_similar_indices = np.argsort(similarities)[::-1][:top_n]

    # Return the most similar chunks
    return [chunks[i] for i in most_similar_indices]

print("✅ Retrieval function defined.")

✅ Retrieval function defined.
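
When tuning retrieval it helps to see the similarity scores and not just the chunk text, for example to spot queries where even the best match is weak. The sketch below computes cosine similarity directly in NumPy and returns index and score pairs. The name `top_n_with_scores` is hypothetical, and this is an illustration of the same ranking idea, not a replacement for `retrieve_chunks`.

```python
import numpy as np


def top_n_with_scores(query_vec, chunk_vecs, top_n=3):
    """Return [(chunk_index, cosine_similarity), ...] for the best-matching chunks."""
    query_vec = np.asarray(query_vec, dtype=np.float64).ravel()
    chunk_vecs = np.asarray(chunk_vecs, dtype=np.float64)

    # Cosine similarity = dot product divided by the product of the norms.
    denom = np.linalg.norm(query_vec) * np.linalg.norm(chunk_vecs, axis=1)
    scores = (chunk_vecs @ query_vec) / np.maximum(denom, 1e-12)

    # Sort descending and keep the first top_n indices.
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]
```

In the notebook, the query vector could come from `model.encode([query])[0]`, and the returned indices can be used to look up entries in `chunks`.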


## Build the chatbot

### Subtask:
Integrate the retrieval system with a language model to create a chatbot.

In [28]:
# You will need to replace this with your actual API key or method for accessing a language model
import google.generativeai as genai
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize the Generative Model (replace with your preferred model)
gemini_model = genai.GenerativeModel('gemini-pro-latest')

def chatbot_response(query, embeddings, chunks, sentence_transformer_model):
    """
    Generates a response to a query using retrieved chunks and a language model.
    """
    # Retrieve relevant chunks
    relevant_chunks = retrieve_chunks(query, embeddings, chunks, sentence_transformer_model)

    # Combine relevant chunks into a single context
    context = " ".join(relevant_chunks)

    # Create a prompt for the language model
    prompt = f"""Based on the following text from the Indian Penal Code, answer the user's question.

    Context: {context}

    Question: {query}

    Answer:
    """

    # Generate a response using the language model (replace with your model's generation call)
    try:
        response = gemini_model.generate_content(prompt)
        return response.text
    except Exception as e:
        return f"Error generating response: {e}"

print("✅ Chatbot function defined.")

✅ Chatbot function defined.
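
The prompt construction can be separated from the model call, which makes it easier to inspect what the language model actually sees. Below is a hedged sketch of such a helper. The name `build_prompt`, the excerpt labels, and the `max_context_chars` limit are all assumptions introduced for illustration; the instruction to admit when the excerpts lack an answer is one possible way to discourage answers from outside the document.

```python
def build_prompt(query, relevant_chunks, max_context_chars=6000):
    """Build a prompt with numbered excerpts, capped at `max_context_chars` of chunk text."""
    sections = []
    used = 0
    for number, chunk in enumerate(relevant_chunks, start=1):
        # Stop adding excerpts once the character budget would be exceeded.
        if used + len(chunk) > max_context_chars:
            break
        sections.append(f"[Excerpt {number}]\n{chunk}")
        used += len(chunk)

    context = "\n\n".join(sections)
    return (
        "Answer the question using only the excerpts from the Indian Penal Code below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
```

Printing `build_prompt(query, relevant_chunks)` before sending it to the model is a quick way to check whether the retrieved excerpts are relevant to the question.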


## Test and refine the chatbot

### Subtask:
Test the chatbot with various queries and refine its performance.
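
One way to test several queries at once is a small harness that collects each answer, so a failure on one query does not stop the rest. This is a sketch only: `run_test_queries` is a hypothetical helper, and `answer_fn` stands for any function that maps a query string to an answer string.

```python
def run_test_queries(answer_fn, queries):
    """Call `answer_fn` on each query and return a list of (query, answer) pairs."""
    results = []
    for query in queries:
        try:
            answer = answer_fn(query)
        except Exception as exc:
            # Record the failure and continue with the remaining queries.
            answer = f"Error: {exc}"
        results.append((query, answer))
    return results
```

With the functions defined earlier, `answer_fn` could be `lambda q: chatbot_response(q, embeddings, chunks, model)`, and the query list would contain whatever questions are worth checking against the text of the Code.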

In [29]:
from sentence_transformers import SentenceTransformer

# Re-initialize the SentenceTransformer model just in case
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example usage (requires the Gemini model configured in the previous cell)
query = "What is the punishment for theft?"
response = chatbot_response(query, embeddings, chunks, model) # 'model' here refers to the global SentenceTransformer model
print(response)

Type of sentence_encoder_model inside retrieve_chunks: <class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
Based on the text provided, the punishment for theft varies depending on the circumstances:

*   **General Theft (Section 379):** The punishment is imprisonment for a term which may extend to three years, or a fine, or both.

*   **Theft in a dwelling house, etc. (Section 380):** If the theft is committed in any building, tent, or vessel used as a human dwelling or for the custody of property, the punishment is imprisonment for a term which may extend to seven years, and a fine.

*   **Theft by a clerk or servant (Section 381):** If a clerk or servant commits theft of property in the possession of their master or employer, the punishment is imprisonment for a term which may extend to seven years, and a fine.

*   **Theft after preparation for causing death, hurt, or restraint (Section 382):** If a person commits theft after having made preparations to cause dea

In [10]:
import google.generativeai as genai
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

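# Use a loop variable other than `model`, which holds the SentenceTransformer instance
for available_model in genai.list_models():
    if 'generateContent' in available_model.supported_generation_methods:
        print(available_model.name)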

models/gemini-2.5-pro-preview-03-25
models/gemini-2.5-flash-preview-05-20
models/gemini-2.5-flash
models/gemini-2.5-flash-lite-preview-06-17
models/gemini-2.5-pro-preview-05-06
models/gemini-2.5-pro-preview-06-05
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-preview-image-generation
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-2.0-pro-exp
models/gemini-2.0-pro-exp-02-05
models/gemini-exp-1206
models/gemini-2.0-flash-thinking-exp-01-21
models/gemini-2.0-flash-thinking-exp
models/gemini-2.0-flash-thinking-exp-1219
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/learnlm-2.0-flash-experimental
models/gemma-3-1b-it
models/gemma-3-4b-it
models/gemma-3-12b-it
models/gemma-3-27b-it
models/gemma-3n-e4b-it
models/gemma-3n-e2b-it
models

## Finish task

### Subtask:
Summarize the process and present the working chatbot.
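
To present the chatbot interactively, the pieces above (download, extraction, chunking, embeddings, retrieval, generation) can sit behind a simple read-and-answer loop. The following is a minimal sketch, assuming a plain text interface is sufficient. The name `run_chat` and the `quit`/`exit` commands are hypothetical choices, and `answer_fn` is any function from a query string to an answer string.

```python
def run_chat(answer_fn, read_line=input, write_line=print):
    """Read queries until 'quit' or 'exit' and write one answer per query.

    Returns the number of queries that were answered.
    """
    answered = 0
    while True:
        try:
            query = read_line("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if query.lower() in {"quit", "exit"}:
            break
        if not query:
            # Ignore empty input and prompt again.
            continue
        write_line(f"Bot: {answer_fn(query)}")
        answered += 1
    return answered
```

In the notebook this could be started with `run_chat(lambda q: chatbot_response(q, embeddings, chunks, model))`. The `read_line` and `write_line` parameters exist so the loop can be exercised without a live console.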