From Web Scraping to Conversational AI


💡 A step-by-step tutorial on how to build an AI chatbot for your website

We will learn and use:
uv: a blazing-fast pip replacement for Python
Scrapy: a web scraping toolkit
OpenAI: embeddings API & LLM
Langchain: semantic chunking
Qdrant: a vector DB to store embeddings and query them

Have you lately seen an increase in the number of websites that have an AI chatbot, one that allows you to ask questions about pricing, documentation, advanced topics, and more?

If you've noticed this trend, you're not alone. AI chatbots have become increasingly popular on websites across various industries, as they provide a convenient and efficient way for users to get the information they need quickly. These chatbots are powered by advanced technologies such as natural language processing (NLP), machine learning, and deep learning, which enable them to understand and respond to user queries in a human-like manner.

We will look into the process of creating your own AI chatbot for your website using Python and a combination of powerful tools and libraries. We'll cover web scraping using Scrapy to gather content from your site, creating embeddings from the scraped content using the OpenAI API, and utilizing Qdrant as a vector database to store and query the embeddings. Additionally, we'll explore how to use Langchain for semantic chunking and integrate the queried content with OpenAI's GPT-4 to provide accurate and context-aware responses to user queries.
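Before wiring up the real services, the core retrieval idea can be illustrated in a few lines of plain Python: turn text into vectors, store them alongside the original content, and answer a question by finding the most similar stored vector. The embed() function below is a toy keyword-count stand-in for the OpenAI embeddings API, and the list-based store is a stand-in for Qdrant; both are assumptions for illustration only:

```python
import math
import re

# Toy "embedding": count occurrences of a few keywords. In the real
# pipeline, the OpenAI embeddings API produces these vectors instead.
VOCAB = ["pricing", "docs", "chatbot", "python"]

def embed(text: str) -> list[float]:
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy "vector DB": a list of (embedding, payload) pairs. Qdrant plays
# this role in the actual pipeline, with indexed nearest-neighbor search.
store = []
for chunk in ["pricing page: plans and pricing", "docs: python chatbot guide"]:
    store.append((embed(chunk), chunk))

# Answer a question by retrieving the most similar chunk.
query = embed("what is the pricing?")
best = max(store, key=lambda pair: cosine(pair[0], query))
print(best[1])  # → pricing page: plans and pricing
```

This is exactly the shape of what follows: Scrapy gathers the text, Langchain splits it into chunks, OpenAI embeds them, and Qdrant stores and queries the vectors at scale.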

Set Up & Env

We will add a twist and use uv as our blazing-fast Python package installer, so go ahead and install it if you don't have it already:

# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows.
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Then, we will create a new directory and install all the needed libraries:

mkdir chatbot && cd chatbot
uv venv
uv pip install scrapy
uv pip install beautifulsoup4
uv pip install openai
uv pip install qdrant-client
uv pip install langchain
uv pip install langchain_experimental
uv pip install langchain_openai
uv pip install gradio

Website Scraping

Now let's create a new Scrapy project for our website; for our use case, we will use yours truly, codereliant.io:

scrapy startproject codereliant

Then define a crawl spider inside the project's spiders directory (e.g. spiders/codereliant.py):
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from codereliant.items import CodereliantItem


class CodereliantSpider(CrawlSpider):
    name = 'codereliant'

    allowed_domains = ['codereliant.io', 'www.codereliant.io']
    start_urls = ['https://www.codereliant.io/']
    # follow=True keeps the crawler following links beyond the start page;
    # when a callback is set, Rule defaults to follow=False.
    rules = (Rule(LinkExtractor(allow=r'.*'), callback='parse_item', follow=True),)

    def parse_item(self, response):
        item = CodereliantItem()
        item['url'] = response.url
        return item

Crawl Scraper

In the file items.py modify the class CodereliantItem like below:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CodereliantItem(scrapy.Item):
    url = scrapy.Field()

Let's do a simple test to see if our scraping is working as intended:

cd codereliant
scrapy crawl codereliant -L ERROR -O output.json

From within the codereliant folder we run the scraping command, which produces a file output.json containing all of our website's URLs:

less output.json
[
{"url": "https://www.codereliant.io"},
{"url": "https://www.codereliant.io/sre-interview-prep-plan-week-6/"},
{"url": "https://www.codereliant.io/free-goodies/"},
{"url": "https://www.codereliant.io/about/"},
{"url": "https://www.codereliant.io/the-most-tragic-bug/"},
{"url": "https://www.codereliant.io/pod-doctor/"},
{"url": "https://www.codereliant.io/from-reactive-to-proactive-transforming-software-maintenance-with-auto-remediation/"},
{"url": "https://www.codereliant.io/topics/"},
{"url": "https://www.codereliant.io/the-2038-problem/"},
{"url": "https://www.codereliant.io/14-years-of-go/"},
......

We now know that we will need more than just the URLs if we want to create the chatbot.

Data Extraction & Preprocessing

We have already collected the URL; now we need other interesting data: the page title, description, and page body.

Let's add these items to our CodereliantItem class in items.py:

class CodereliantItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    body = scrapy.Field()

Now that our item class has these fields defined, let's actually grab the data in the parse_item method defined in CodereliantSpider within spiders/codereliant.py:
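As a sketch of what that extraction could look like, BeautifulSoup (which we installed earlier) can pull the title, meta description, and readable body text out of the raw HTML. The generic selectors below are assumptions; every theme structures its markup differently, so adjust them to your site:

```python
from bs4 import BeautifulSoup

def extract_page_fields(html: str) -> dict:
    """Pull title, meta description, and visible body text from raw HTML.

    The selectors here are generic assumptions, not specific to any theme.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta["content"] if meta and meta.has_attr("content") else ""
    # Drop script/style tags so only human-readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    body = " ".join(soup.get_text(separator=" ").split())
    return {"title": title, "description": description, "body": body}

# Inside the spider, parse_item could then fill the remaining fields:
#   def parse_item(self, response):
#       item = CodereliantItem()
#       item['url'] = response.url
#       fields = extract_page_fields(response.text)
#       item['title'] = fields['title']
#       item['description'] = fields['description']
#       item['body'] = fields['body']
#       return item

sample = (
    "<html><head><title>Code Reliant</title>"
    '<meta name="description" content="SRE articles"></head>'
    "<body><p>Hello world</p><script>ignored()</script></body></html>"
)
print(extract_page_fields(sample))
```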

This post is for subscribers only