How I Built a RAG Chatbot for My Portfolio

Jun 9, 2026

Most portfolio sites have a contact form. I wanted something more interesting — a chatbot that can actually answer questions about me. Not a fake one that returns canned responses, but a real Retrieval-Augmented Generation (RAG) pipeline backed by my CV, streaming answers token-by-token, living right inside the portfolio UI.

This post walks through exactly how it works, from the database to the floating chat button.


The Architecture at a Glance

[Figure: architecture diagram]

Step 1 — The Vector Database

I used PostgreSQL with the pgvector extension to store document embeddings. A single migration sets everything up:

-- Enable pgvector
create extension if not exists vector;

-- Store chunked CV text alongside its embedding
create table if not exists public.documents (
    id        bigserial primary key,
    content   text,
    metadata  jsonb,
    embedding vector(768)
);

-- IVFFlat index for fast cosine similarity search
create index if not exists documents_embedding_idx
on public.documents
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);

The match_documents function does the actual semantic search. It takes the question embedding and returns the top-N most similar chunks ranked by cosine similarity:

create function public.match_documents(
    query_embedding vector,
    match_count     integer default 10,
    filter          jsonb   default '{}'::jsonb
)
returns table (id bigint, content text, metadata jsonb, similarity double precision)
language plpgsql
as $$
begin
    return query
    select
        d.id,
        d.content,
        d.metadata,
        1 - (d.embedding <=> query_embedding) as similarity
    from public.documents as d
    where d.embedding is not null
      and d.metadata @> filter
    order by d.embedding <=> query_embedding
    limit match_count;
end;
$$;

The <=> operator is pgvector's cosine distance operator. Similarity is 1 - distance, and ordering by ascending distance puts the closest chunks first.
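To make the distance-to-similarity conversion concrete, here is a minimal TypeScript sketch of the same arithmetic outside the database. The function names are illustrative and not part of pgvector or this project; the database does this work natively.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance as used for ordering: 0 means same direction, 2 means opposite.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}
```

Sorting by cosineDistance ascending gives the same order as sorting by cosineSimilarity descending, which is why the SQL function can order by distance and still report similarity.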


Step 2 — Prisma with the PrismaPg Adapter

The Prisma schema maps the documents table and uses vector(768) as an unsupported (raw) type — pgvector isn't a native Prisma type yet, so raw SQL queries handle the vector operations:

model Document {
    id        BigInt                      @id @default(autoincrement())
    content   String?
    metadata  Json?
    embedding Unsupported("vector(768)")?

    @@map("documents")
}

Because this is a Next.js app with edge-ish server functions, I use the PrismaPg driver adapter over a pg connection pool to keep connections efficient and avoid the default binary protocol issues with pgvector:

import { PrismaClient } from '@/generated/prisma/client';
import { PrismaPg } from '@prisma/adapter-pg';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const adapter = new PrismaPg(pool);

const globalForPrisma = globalThis as unknown as { prisma?: PrismaClient };

export const prisma =
  globalForPrisma.prisma ?? new PrismaClient({ adapter });

if (process.env.NODE_ENV !== 'production') {
  globalForPrisma.prisma = prisma;
}

The globalThis singleton pattern prevents creating a new Prisma client on every hot-reload in development.
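The idea behind the singleton can be shown without Prisma at all. The sketch below is a generic illustration; getOrCreateGlobal and the '__demoClient' key are hypothetical names, not something the project defines.

```typescript
// Hypothetical helper: cache one instance per key on globalThis so that
// re-evaluating a module (as hot reload does) reuses the existing instance.
function getOrCreateGlobal<T>(key: string, create: () => T): T {
  const store = globalThis as unknown as Record<string, unknown>;
  if (!(key in store)) {
    store[key] = create();
  }
  return store[key] as T;
}

// Stand-in for an expensive client; counts how many times it is constructed.
let created = 0;
const makeClient = () => ({ id: ++created });

// Two lookups simulate two evaluations of the same module.
const first = getOrCreateGlobal('__demoClient', makeClient);
const second = getOrCreateGlobal('__demoClient', makeClient);
```

Both lookups return the same object, and the factory runs only once.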


Step 3 — Parsing and Chunking the CV

Rather than maintaining a separate data ingestion pipeline, I parse my CV PDF directly at runtime. The CV lives at public/info.pdf.

I wrote a lightweight PDF text extractor that reads the raw PDF byte stream and pulls out string objects using the Tj text-show operator — no external PDF library needed:

function extractTextFromInfoPdf(buffer: Buffer) {
  const pdf = buffer.toString('latin1');
  const textParts: string[] = [];
  let index = 0;

  while (index < pdf.length) {
    const stringStart = pdf.indexOf('(', index);
    if (stringStart === -1) break;

    let cursor = stringStart + 1;
    let escaped = false;
    let depth = 1;
    let value = '';

    while (cursor < pdf.length && depth > 0) {
      const char = pdf[cursor];
      if (escaped) {
        value += `\\${char}`;
        escaped = false;
      } else if (char === '\\') {
        escaped = true;
      } else if (char === '(') {
        depth += 1;
        value += char;
      } else if (char === ')') {
        depth -= 1;
        if (depth > 0) value += char;
      } else {
        value += char;
      }
      cursor += 1;
    }

    const nextOperator = pdf.slice(cursor, cursor + 20);
    if (/\sTj\b/.test(nextOperator)) {
      textParts.push(unescapePdfString(value));
    }

    index = cursor;
  }

  return textParts.join('\n').replace(/\s+\n/g, '\n').trim();
}
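
The extractor calls unescapePdfString, which is not shown in the post. A minimal sketch of what such a helper might look like follows, assuming it only needs to handle the standard PDF literal-string escapes (\n, \r, \t, \b, \f, escaped parentheses and backslashes, octal codes, and backslash-newline continuations). The actual helper in the project may differ.

```typescript
// Hypothetical implementation of the unescapePdfString helper used above.
const SIMPLE_ESCAPES: Record<string, string> = {
  n: '\n',
  r: '\r',
  t: '\t',
  b: '\b',
  f: '\f',
  '(': '(',
  ')': ')',
  '\\': '\\',
};

function unescapePdfString(value: string): string {
  return value.replace(
    /\\(\r\n|\r|\n|[0-7]{1,3}|[\s\S])/g,
    (_match: string, escape: string) => {
      // A backslash before a line break is a continuation and yields nothing.
      if (escape === '\r\n' || escape === '\r' || escape === '\n') return '';
      // One to three octal digits encode a single byte.
      if (/^[0-7]+$/.test(escape)) {
        return String.fromCharCode(parseInt(escape, 8) & 0xff);
      }
      // Known single-character escape; for unknown ones, drop the backslash.
      return SIMPLE_ESCAPES[escape] ?? escape;
    },
  );
}
```

For example, unescapePdfString('Node\\056js \\(LTS\\)') returns 'Node.js (LTS)', since octal 056 is the period character.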

The extracted text is then split into overlapping chunks using LangChain's RecursiveCharacterTextSplitter:

async function loadCvChunks() {
  const filePath = join(process.cwd(), 'public', 'info.pdf');
  const text = extractTextFromInfoPdf(await readFile(filePath));

  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 100,
  });

  return textSplitter.createDocuments([text], [{ source: 'public/info.pdf' }]);
}
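
To illustrate what chunkSize and chunkOverlap mean, here is a deliberately simplified fixed-window chunker. It is not LangChain's algorithm: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and lines, so its chunks will differ. The function name is made up for this example.

```typescript
// Simplified sliding-window chunker: each chunk starts (chunkSize - chunkOverlap)
// characters after the previous one, so neighbours share chunkOverlap characters.
function chunkWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const result: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    result.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return result;
}

// With a size of 4 and an overlap of 1, each chunk repeats the last
// character of the previous chunk.
const demoChunks = chunkWithOverlap('abcdefghij', 4, 1);
// demoChunks is ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.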

Step 4 — Embedding and Seeding

Every chat request first checks whether the documents table is empty. If it is, which in practice only happens on the first-ever request, it runs the seed:

async function ensureDocumentsLoaded() {
  const count = await prisma.document.count();
  if (!count) {
    await seedDocuments();
  }
}
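
One possible refinement, not part of the original code: if two requests arrive at the same time while the table is still empty, both could start seeding. The sketch below shares a single in-flight promise within one server process. The once helper is hypothetical, and it does not coordinate across multiple server instances.

```typescript
// Hypothetical wrapper: run an async initializer at most once per process,
// sharing the in-flight promise between concurrent callers.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = init().catch((error: unknown) => {
        pending = undefined; // allow a retry after a failure
        throw error;
      });
    }
    return pending;
  };
}
```

Under this assumption, ensureDocumentsLoaded could be wrapped as once(ensureDocumentsLoaded) so that concurrent first requests await the same seed run.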

Seeding embeds each chunk with Google Gemini (gemini-embedding-001) at 768 dimensions, then inserts everything in a single Prisma transaction:

async function seedDocuments() {
  const docs = await loadCvChunks();
  const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });

  const embeddings = await Promise.all(
    docs.map(async (doc) => {
      const response = await ai.models.embedContent({
        model: 'gemini-embedding-001',
        contents: doc.pageContent,
        config: {
          taskType: 'RETRIEVAL_DOCUMENT',
          outputDimensionality: 768,
        },
      });

      return {
        content: doc.pageContent,
        metadata: doc.metadata ?? {},
        embedding: response.embeddings?.[0]?.values ?? [],
      };
    }),
  );

  await prisma.$transaction(
    embeddings.map((entry) =>
      prisma.$executeRaw`
        insert into documents (content, metadata, embedding)
        values (
          ${entry.content},
          ${JSON.stringify(entry.metadata)}::jsonb,
          ${toVectorLiteral(entry.embedding)}::vector
        )
      `,
    ),
  );
}

toVectorLiteral converts a number[] into Postgres vector literal syntax — e.g. [0.12,0.84,...].
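
The post does not show toVectorLiteral itself. A minimal sketch of how it might be written follows; the validation is an assumption added here so that an empty or malformed embedding fails early with a clear message.

```typescript
// Hypothetical implementation of the toVectorLiteral helper referenced above.
// Produces the text form pgvector parses, e.g. "[0.12,0.84,-1]".
function toVectorLiteral(values: number[]): string {
  if (values.length === 0) {
    throw new Error('Cannot build a vector literal from an empty embedding');
  }
  for (const value of values) {
    if (!Number.isFinite(value)) {
      throw new Error(`Embedding contains a non-finite value: ${value}`);
    }
  }
  return `[${values.join(',')}]`;
}
```

Throwing on an empty array matters because the seeding code falls back to [] when the API response has no embedding; failing here gives a clearer error than the one the database insert would produce.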


Step 5 — The RAG Query Pipeline

When a user sends a question, the API:

  1. Embeds the question using the same Gemini model (with RETRIEVAL_QUERY task type)
  2. Calls match_documents via a Prisma raw query
  3. Assembles the top chunks into a numbered context block

async function buildContext(question: string) {
  await ensureDocumentsLoaded();

  const ai = createAiClient();
  const response = await ai.models.embedContent({
    model: 'gemini-embedding-001',
    contents: question,
    config: {
      taskType: 'RETRIEVAL_QUERY',
      outputDimensionality: 768,
    },
  });
  const questionVector = response.embeddings?.[0]?.values ?? [];

  const chunks = await prisma.$queryRaw<MatchDocumentRow[]>`
    select * from match_documents(
      ${toVectorLiteral(questionVector)}::vector,
      ${10},
      ${JSON.stringify({})}::jsonb
    )
  `;

  return chunks
    .map((chunk, i) => (chunk.content?.trim() ? `[${i + 1}] ${chunk.content.trim()}` : ''))
    .filter(Boolean)
    .join('\n\n');
}
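
The context-assembly step can be isolated into a small pure function, which makes it easy to test. The sketch below is a variation on the code above rather than the project's implementation: it filters out empty chunks before numbering, so the [n] labels stay consecutive. ContextChunk and formatContext are names invented for this example; any row with a content field, such as the rows returned by match_documents, fits the type.

```typescript
// Minimal shape needed for formatting; match_documents rows satisfy it.
type ContextChunk = { content: string | null };

// Trim each chunk, drop empty ones, then number what remains from 1.
function formatContext(rows: ContextChunk[]): string {
  return rows
    .map((row) => row.content?.trim() ?? '')
    .filter((content) => content.length > 0)
    .map((content, i) => `[${i + 1}] ${content}`)
    .join('\n\n');
}
```

Numbered chunks give the system prompt something concrete to refer to when it tells the model to answer only from the supplied context.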

Step 6 — Streaming the LLM Response

The context is injected into a carefully scoped system prompt and fed to LLaMA 3.1 70B Instruct through the NVIDIA API using LangChain's ChatOpenAI (it's OpenAI-compatible):

function createLlmClient() {
  return new ChatOpenAI({
    apiKey: process.env.NVIDIA_API_KEY,
    model: 'meta/llama-3.1-70b-instruct',
    temperature: 0.7,
    maxTokens: 1024,
    streaming: true,
    configuration: {
      baseURL: 'https://integrate.api.nvidia.com/v1',
    },
  });
}

The response is streamed back as a ReadableStream<Uint8Array> using the Web Streams API, with each chunk of text encoded to UTF-8 bytes as it arrives:

async function createAnswerStream(question: string) {
  const context = await buildContext(question);

  const llm = createLlmClient();
  const stream = await llm.stream([
    ['system', `You are "Ask Prajwol"...\n\n<context>\n${context}\n</context>`],
    ['human', question],
  ]);

  const encoder = new TextEncoder();

  return new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of stream) {
        const text = extractChunkText(chunk.content);
        if (text) controller.enqueue(encoder.encode(text));
      }
      controller.close();
    },
  });
}

export async function POST(request: Request) {
  const { question } = await request.json();
  const answerStream = await createAnswerStream(question);

  return new NextResponse(answerStream, {
    headers: {
      'content-type': 'text/plain; charset=utf-8',
      'cache-control': 'no-cache, no-transform',
    },
  });
}
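
The extractChunkText helper used in createAnswerStream is not shown. A plausible sketch follows, assuming chunk content is either a plain string or an array of parts in which text parts carry a string text field. The real helper may handle more cases.

```typescript
// Hypothetical implementation of extractChunkText.
// Assumed part shape: text parts look like { type: 'text', text: '...' }.
type ContentPart = { type?: string; text?: unknown };

function extractChunkText(content: unknown): string {
  if (typeof content === 'string') return content;
  if (!Array.isArray(content)) return '';
  return content
    .map((part: unknown) => {
      if (typeof part === 'string') return part;
      if (part !== null && typeof part === 'object') {
        const text = (part as ContentPart).text;
        return typeof text === 'string' ? text : '';
      }
      return ''; // ignore anything that is not text
    })
    .join('');
}
```

Returning an empty string for non-text content lets the streaming loop skip those chunks with its existing if (text) check.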

Step 7 — The Chat UI

The floating chat button is a self-contained React client component. It opens a panel, manages message state, and reads the streamed response incrementally using the Fetch ReadableStream reader:

const reader = response.body.getReader();
const decoder = new TextDecoder();
let answer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  answer += decoder.decode(value, { stream: true });
  // update the assistant message bubble in real time
  setMessages((current) =>
    current.map((msg) =>
      msg.id === assistantMessageId ? { ...msg, content: answer } : msg,
    ),
  );
}
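
The same loop can be pulled out of the component into a standalone function. The sketch below is one way to do that, with two small additions that are assumptions on my part: a final decoder.decode() call to flush any bytes held back from an incomplete character, and a releaseLock() in a finally block. readTextStream is a hypothetical name.

```typescript
// Read a byte stream as UTF-8 text, reporting the accumulated answer after each chunk.
async function readTextStream(
  stream: ReadableStream<Uint8Array>,
  onUpdate: (answer: string) => void,
): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let answer = '';
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps a partial multi-byte character buffered until the next chunk.
      answer += decoder.decode(value, { stream: true });
      onUpdate(answer);
    }
    // Flush whatever the decoder is still holding.
    answer += decoder.decode();
    onUpdate(answer);
  } finally {
    reader.releaseLock();
  }
  return answer;
}
```

In the component, onUpdate would be the setMessages call shown above. The stream: true option is what keeps a character such as é intact when its two bytes arrive in different chunks.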

While waiting for the first tokens to arrive, rotating loading messages give the user feedback:

useEffect(() => {
  if (!isOpen || !isSending) return;

  const interval = window.setInterval(() => {
    setLoadingMessageIndex((i) => (i + 1) % loadingMessages.length);
  }, 3000);

  return () => window.clearInterval(interval);
}, [isOpen, isSending]);

The button itself has a BorderBeam animation to draw attention without being intrusive.


Environment Variables

DATABASE_URL=postgresql://...
GOOGLE_API_KEY=...       # Gemini embeddings
NVIDIA_API_KEY=...       # LLaMA 3.1 70B via NVIDIA

What I'd Do Differently

  • Hybrid search: combine keyword search (BM25) with vector search for better recall on exact name and technology queries.
  • Incremental seeding: detect CV changes by hash and re-embed only the modified chunks, instead of a one-shot seed on an empty table.
  • Rate limiting: add per-IP rate limiting on the API route to prevent abuse.
  • Conversation history: pass prior turns to the LLM for multi-turn context; currently each question is stateless.
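
As an illustration of the rate-limiting item, here is a minimal in-memory fixed-window limiter. It is a sketch under assumptions, not something the project has: createRateLimiter and its parameters are hypothetical, and because the state lives in process memory it would not be shared across multiple server instances.

```typescript
// Hypothetical fixed-window limiter: allow at most `limit` calls per key
// in each window of `windowMs` milliseconds. The clock is injectable for testing.
function createRateLimiter(
  limit: number,
  windowMs: number,
  now: () => number = Date.now,
) {
  const windows = new Map<string, { start: number; count: number }>();

  return function isAllowed(key: string): boolean {
    const current = now();
    const entry = windows.get(key);
    // No window yet, or the old one expired: start a fresh window.
    if (!entry || current - entry.start >= windowMs) {
      windows.set(key, { start: current, count: 1 });
      return true;
    }
    if (entry.count >= limit) return false;
    entry.count += 1;
    return true;
  };
}
```

The route handler could call isAllowed with the client IP before building the answer stream and return a 429 response when it comes back false.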