TableRAG

SQL-native RAG system for tabular data that answers natural-language questions over Excel workbooks.

AI/ML

Tech Stack

Python, PostgreSQL, pgvector, SQLAlchemy, SentenceTransformers, multilingual-e5-base, Flask/FastAPI, dotenv, Docker

Key Highlights

Excel→PostgreSQL ETL with typed columns

Row-level embeddings with IVFFlat indexing

Pure SQL cosine ANN retrieval via pgvector

Configurable summarization with row citations

Idempotent re-indexing with safety controls

Evaluation notebooks with Recall@k metrics

Project Details

I built a self-contained module that answers natural-language questions over a single Excel workbook. Each row is embedded with multilingual-e5-base (dim=768) and stored in PostgreSQL + pgvector; a question is answered by running cosine ANN retrieval in pure SQL and returning the most relevant rows with a concise summary.

**Ingestion & normalization:** ETL from Excel → PostgreSQL with typed columns, locale-safe currency/decimal cleanup, and a canonical text builder per row (joins important fields + lightweight templating).
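
The cleanup and text-building steps could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's loader: `parse_amount`, `build_canonical_text`, the explicit `decimal_sep` argument, and the `name: value | name: value` template are all hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(raw, decimal_sep="."):
    """Parse a currency-like cell into a Decimal, or return None.

    Assumption: the caller knows which decimal separator the workbook
    uses, so nothing has to be guessed from the value itself.
    """
    if raw is None:
        return None
    # Keep digits, the minus sign, and the declared decimal separator;
    # currency symbols, spaces, and thousands separators are dropped.
    keep = re.escape(decimal_sep)
    cleaned = re.sub(rf"[^\d\-{keep}]", "", str(raw))
    if not cleaned:
        return None
    try:
        return Decimal(cleaned.replace(decimal_sep, "."))
    except InvalidOperation:
        return None


def build_canonical_text(row, fields):
    """Join the chosen fields of one row into a single string to embed."""
    parts = []
    for name in fields:
        value = row.get(name)
        if value is None or str(value).strip() == "":
            continue  # skip empty cells so they add no noise to the text
        parts.append(f"{name}: {str(value).strip()}")
    return " | ".join(parts)


row = {"product": "Laptop", "price": parse_amount("1.234,56 TL", decimal_sep=",")}
print(build_canonical_text(row, ["product", "price"]))
```

Passing the separator explicitly avoids guessing whether `1,234` means one thousand or a decimal fraction, which is the usual failure mode of locale-dependent parsing.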

**Embeddings at the DB edge:** Row embeddings are stored in a vector(768) column and indexed with IVFFlat, with the lists parameter tuned for fast ANN search; pre-warm and VACUUM routines round out index maintenance.
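
As a rough sketch of the index setup, the helper below generates the DDL from Python. The names `table_rows` and `embedding` are hypothetical, and the `lists` heuristic is the starting point suggested in the pgvector documentation (rows / 1000 up to one million rows, sqrt(rows) beyond that), not necessarily the tuned value used in this project.

```python
import math


def suggest_ivfflat_lists(row_count):
    """Starting value for `lists`, following pgvector's documented guidance."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, int(math.sqrt(row_count)))


def ivfflat_index_ddl(table, column, lists):
    """Build the CREATE INDEX statement for a cosine-distance IVFFlat index.

    Identifiers are interpolated directly, so they must come from trusted
    configuration and never from user input.
    """
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_ivfflat "
        f"ON {table} USING ivfflat ({column} vector_cosine_ops) "
        f"WITH (lists = {lists});"
    )


# Hypothetical table and column names, for illustration only.
print(ivfflat_index_ddl("table_rows", "embedding", suggest_ivfflat_lists(50_000)))
```

pgvector recommends building an IVFFlat index after the table already holds data, and the recall/speed trade-off at query time can then be adjusted with `SET ivfflat.probes`.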

**Pure-SQL retrieval:** Prompt → embedding → ORDER BY embedding <=> :qvec LIMIT k (cosine distance via pgvector) → return rows + relevance scores; deterministic tie-breaks.
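
One way to express this retrieval step is sketched below. The table `table_rows`, its columns, and the helper names are assumptions for illustration. The inner query keeps a single distance expression in its ORDER BY, which is the query shape pgvector documents for index use, and the outer query applies an id-based tie-break as one possible reading of "deterministic tie-breaks". The `query: ` prefix follows the E5 model card's convention for query-side inputs.

```python
def prepare_query(question):
    """E5 models expect query-side text to carry a 'query: ' prefix."""
    return "query: " + question.strip()


def to_vector_literal(values):
    """Format floats in pgvector's text input form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# Hypothetical schema: table_rows(id, canonical_text, embedding vector(768)).
# `<=>` is pgvector's cosine distance, so 1 - distance is cosine similarity.
SEARCH_SQL = """
SELECT id, canonical_text, 1 - distance AS score
FROM (
    SELECT id,
           canonical_text,
           embedding <=> CAST(:qvec AS vector) AS distance
    FROM table_rows
    ORDER BY distance
    LIMIT :k
) AS nearest
ORDER BY distance, id
"""

print(prepare_query("  Which orders exceeded the budget?  "))
print(to_vector_literal([0.5, 1, -2.25]))
```

With SQLAlchemy, the statement could be run as `conn.execute(text(SEARCH_SQL), {"qvec": to_vector_literal(vec), "k": 5})`, where `vec` is the encoded output for `prepare_query(question)`. `CAST(... AS vector)` is used instead of the `::vector` shorthand so the bind parameter is not confused with the cast syntax.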

**Summarization:** Compact answer generator that cites row IDs; configurable to return only rows (no LLM) for audit-friendly workflows.

**Ops & safety:** Idempotent re-indexing, chunk/batch size controls, retry/backoff, and simple PII redaction hooks for sensitive columns.
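
The controls listed above might be sketched as follows. All names here (`batched`, `with_retry`, `redact`, `UPSERT_SQL`, and the `table_rows` schema) are hypothetical, and the upsert is only one common way to make re-indexing idempotent.

```python
import time


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def redact(row, sensitive, mask="[REDACTED]"):
    """Mask non-empty values of sensitive columns before text is built."""
    return {
        key: (mask if key in sensitive and value is not None else value)
        for key, value in row.items()
    }


# Re-running the loader with the same ids overwrites rows instead of
# duplicating them (assumed schema; PostgreSQL ON CONFLICT upsert).
UPSERT_SQL = """
INSERT INTO table_rows (id, canonical_text, embedding)
VALUES (:id, :text, CAST(:vec AS vector))
ON CONFLICT (id) DO UPDATE
SET canonical_text = EXCLUDED.canonical_text,
    embedding = EXCLUDED.embedding
"""

for batch in batched([{"id": 1, "email": "a@b.c"}, {"id": 2, "email": None}], 1):
    print([redact(row, {"email"}) for row in batch])
```

Injecting `sleep` as a parameter keeps the backoff logic testable without real delays.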

**Observability:** Query traces (SQL + timings), ANN vs. exact recall probes, and quick evaluation notebooks (Recall@k / MRR on a small QA set).
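
For reference, Recall@k and MRR over a small QA set can be computed as in the sketch below. The function names and the `(retrieved_ids, relevant_ids)` input shape are assumptions, not the project's evaluation harness.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant row IDs that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    relevant = set(relevant)
    for rank, row_id in enumerate(retrieved, start=1):
        if row_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k):
    """Average both metrics over (retrieved_ids, relevant_ids) pairs."""
    count = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, g, k) for r, g in runs) / count,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / count,
    }


# Made-up row IDs purely to show the call shape.
runs = [([3, 7, 1], [7]), ([4, 5, 6], [9]), ([2, 8], [2, 8])]
print(evaluate(runs, k=2))
```

The same `recall_at_k` function can serve as an ANN vs. exact probe by passing the exact (sequential scan) top-k IDs as the relevant set and the index-backed results as the retrieved list.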

My contributions:

Designed the PostgreSQL schema and pgvector index strategy; wrote the Excel→DB loader and canonical text builder.

Implemented the embedding pipeline (batching, retries, caching) and the SQL ANN search with cosine distance.

Authored the summarization prompt, guardrails (max tokens, safe fallbacks), and evaluation harness; packaged everything behind clean Python APIs.

© 2025 Hüseyin Bora Baran. All rights reserved.