TableRAG
SQL-native RAG system for tabular data that answers natural-language questions over Excel workbooks.
Tech Stack
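Python · PostgreSQL + pgvector · multilingual-e5-base embeddings · Excel (xlsx) ingestion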
Key Highlights
• Excel→PostgreSQL ETL with typed columns
• Row-level embeddings with IVFFlat indexing
• Pure SQL cosine ANN retrieval via pgvector
• Configurable summarization with row citations
• Idempotent re-indexing with safety controls
• Evaluation notebooks with Recall@k metrics
Project Details
I built a self-contained module that answers natural-language questions over a single Excel workbook: each row is embedded (multilingual-e5-base, dim=768) and stored in PostgreSQL + pgvector, cosine ANN retrieval runs in pure SQL, and the module returns the most relevant rows with a concise summary.
**Ingestion & normalization:** ETL from Excel → PostgreSQL with typed columns, locale-safe currency/decimal cleanup, and a canonical text builder per row (joins important fields + lightweight templating).
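A minimal sketch of this stage, assuming pandas and psycopg; the `rows` table, its columns, and the currency handling are illustrative, not the project's actual schema:

```python
# Hypothetical Excel -> PostgreSQL loader; the "rows" table, its columns, and
# the currency handling are assumptions for illustration, not the real schema.
import pandas as pd
import psycopg

def clean_currency(value) -> float:
    """Locale-safe cleanup: drop symbols/spaces, detect ',' vs '.' as decimal separator."""
    s = str(value).strip().replace("€", "").replace("$", "").replace(" ", "")
    if "," in s and s.rfind(",") > s.rfind("."):
        s = s.replace(".", "").replace(",", ".")   # '1.234,56' -> '1234.56'
    else:
        s = s.replace(",", "")                     # '1,234.56' -> '1234.56'
    return float(s)

def canonical_text(row: pd.Series) -> str:
    """Canonical text per row: join the important fields with a light template."""
    return f"product: {row['product']} | category: {row['category']} | price: {row['price']:.2f}"

def load_workbook(xlsx_path: str, dsn: str) -> None:
    df = pd.read_excel(xlsx_path, sheet_name=0)          # single workbook, first sheet
    df["price"] = df["price"].map(clean_currency)        # typed numeric column
    df["doc_text"] = df.apply(canonical_text, axis=1)    # one canonical string per row
    records = [
        (str(r.product), str(r.category), float(r.price), r.doc_text)
        for r in df.itertuples(index=False)
    ]
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""CREATE TABLE IF NOT EXISTS rows (
                           id       bigserial PRIMARY KEY,
                           product  text,
                           category text,
                           price    numeric,
                           doc_text text NOT NULL)""")
        cur.executemany(
            "INSERT INTO rows (product, category, price, doc_text) VALUES (%s, %s, %s, %s)",
            records,
        )
```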
**Embeddings at the DB edge:** Row embeddings stored in a vector(768) column; IVFFLAT index with tuned lists for fast ANN; pre-warm and VACUUM routines.
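Roughly what the column, index, and backfill look like; this sketch assumes sentence-transformers for the encoder, and `lists = 100` is a placeholder tuning value rather than the project's setting:

```python
# Hypothetical sketch: add the vector(768) column, build the IVFFlat index, and
# backfill embeddings in batches. The model ID is the public multilingual-e5-base
# checkpoint; the E5 family expects a "passage: " prefix on indexed text.
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")   # dim = 768

def to_vector_literal(vec) -> str:
    """pgvector's text input format: '[x1,x2,...]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

def ensure_vector_column(dsn: str, lists: int = 100) -> None:
    # lists is a tuning knob; in practice the index is (re)built after the bulk
    # load so IVFFlat trains its centroids on real data.
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("ALTER TABLE rows ADD COLUMN IF NOT EXISTS embedding vector(768)")
        cur.execute(
            "CREATE INDEX IF NOT EXISTS rows_embedding_ivfflat "
            "ON rows USING ivfflat (embedding vector_cosine_ops) "
            f"WITH (lists = {lists})"
        )

def backfill_embeddings(dsn: str, batch_size: int = 256) -> None:
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT id, doc_text FROM rows WHERE embedding IS NULL")
        pending = cur.fetchall()
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            vecs = model.encode(["passage: " + text for _, text in batch],
                                normalize_embeddings=True)
            cur.executemany(
                "UPDATE rows SET embedding = %s::vector WHERE id = %s",
                [(to_vector_literal(v), row_id) for (row_id, _), v in zip(batch, vecs)],
            )
```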
**Pure-SQL retrieval:** Prompt → embedding → ORDER BY embedding <=> :qvec LIMIT k (cosine distance via pgvector) → return rows + relevance scores; deterministic tie-breaks.
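The query path in sketch form, reusing the encoder and `to_vector_literal` helper from the previous block; `ivfflat.probes` and the id tie-break shown here are illustrative choices:

```python
# Hypothetical query path: embed the question with the "query: " prefix, rank by
# cosine distance (<=>) inside PostgreSQL, then apply the deterministic tie-break.
def search(dsn: str, question: str, k: int = 5, probes: int = 10):
    qvec = to_vector_literal(model.encode("query: " + question, normalize_embeddings=True))
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        # probes trades recall for speed on the IVFFlat index (session-level setting).
        cur.execute("SELECT set_config('ivfflat.probes', %s, false)", (str(probes),))
        cur.execute(
            """
            SELECT id, doc_text, distance
            FROM (
                SELECT id, doc_text, embedding <=> %s::vector AS distance
                FROM rows
                ORDER BY embedding <=> %s::vector     -- index-friendly ANN ordering
                LIMIT %s
            ) AS hits
            ORDER BY distance, id                     -- deterministic tie-break
            """,
            (qvec, qvec, k),
        )
        return cur.fetchall()                         # [(row_id, doc_text, distance), ...]
```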
**Summarization:** Compact answer generator that cites row IDs; configurable to return only rows (no LLM) for audit-friendly workflows.
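A hedged sketch of the answer layer; the prompt wording and the `llm` callable are placeholders, the point is the rows-only switch and the row-ID citations:

```python
# Hypothetical answer layer: return the retrieved rows verbatim (audit-friendly,
# no LLM) or hand them to a caller-supplied llm() callable that must cite row IDs.
from typing import Callable, Optional

def answer(dsn: str, question: str, k: int = 5,
           llm: Optional[Callable[[str], str]] = None) -> dict:
    hits = search(dsn, question, k=k)
    if llm is None:
        return {"rows": hits}                         # rows-only mode: no generation at all
    context = "\n".join(f"[row {row_id}] {text}" for row_id, text, _ in hits)
    prompt = (
        "Answer the question using only the rows below and cite the supporting "
        "rows as [row <id>].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return {"rows": hits, "summary": llm(prompt)}
```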
**Ops & safety:** Idempotent re-indexing, chunk/batch size controls, retry/backoff, and simple PII redaction hooks for sensitive columns.
**Observability:** Query traces (SQL + timings), ANN vs. exact recall probes, and quick evaluation notebooks (Recall@k / MRR on a small QA set).
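The metrics themselves are small enough to show; `qa_set` maps each test question to the row IDs that count as relevant, and retrieval is the `search` sketch from above:

```python
# Recall@k / MRR over a small QA set; qa_set maps each question to the set of
# row IDs considered relevant (hypothetical data layout).
def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[int], relevant: set[int]) -> float:
    for rank, row_id in enumerate(retrieved, start=1):
        if row_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(dsn: str, qa_set: dict[str, set[int]], k: int = 5) -> dict[str, float]:
    recalls, mrrs = [], []
    for question, relevant in qa_set.items():
        ids = [row_id for row_id, _, _ in search(dsn, question, k=k)]
        recalls.append(recall_at_k(ids, relevant, k))
        mrrs.append(mrr(ids, relevant))
    n = max(len(qa_set), 1)
    return {"recall@k": sum(recalls) / n, "mrr": sum(mrrs) / n}
```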
My contributions:
• Designed the PostgreSQL schema and pgvector index strategy; wrote the Excel→DB loader and canonical text builder.
• Implemented the embedding pipeline (batching, retries, caching) and the SQL ANN search with cosine distance.
• Authored the summarization prompt, guardrails (max tokens, safe fallbacks), and evaluation harness; packaged everything behind clean Python APIs.
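End to end, usage of the sketches above looks roughly like this; the names and connection string are illustrative, not the module's public API:

```python
# Hypothetical end-to-end usage of the sketches above; not the module's real API.
DSN = "postgresql://user:pass@localhost:5432/tablerag"    # placeholder connection string

load_workbook("catalog.xlsx", DSN)       # Excel -> typed rows + canonical text
ensure_vector_column(DSN)                # vector(768) column + IVFFlat index
backfill_embeddings(DSN)                 # batched multilingual-e5-base embeddings

print(answer(DSN, "Which products cost more than 100?", k=5))   # rows-only mode
```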