TableRAG
SQL-native RAG system for tabular data that answers natural-language questions over Excel workbooks.
Tech Stack
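Python · PostgreSQL + pgvector · multilingual-e5-base embeddings · Excel (xlsx) ingestion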
Key Highlights
• Excel→PostgreSQL ETL with typed columns
• Row-level embeddings with IVFFlat indexing
• Pure SQL cosine ANN retrieval via pgvector
• Configurable summarization with row citations
• Idempotent re-indexing with safety controls
• Evaluation notebooks with Recall@k metrics
Project Details
I built a self-contained module that answers natural-language questions over a single Excel workbook: each row is embedded (multilingual-e5-base, dim=768) and stored in PostgreSQL + pgvector, cosine ANN retrieval runs in pure SQL, and the module returns the most relevant rows with a concise summary.
**Ingestion & normalization:** ETL from Excel → PostgreSQL with typed columns, locale-safe currency/decimal cleanup, and a canonical text builder per row (joins important fields + lightweight templating).
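A minimal sketch of this stage, assuming pandas and psycopg; the `rows` table, its columns, and the currency handling are illustrative, not the project's actual schema:

```python
# Hypothetical Excel -> PostgreSQL loader; the "rows" table, its columns, and
# the currency handling are assumptions for illustration, not the real schema.
import pandas as pd
import psycopg

def clean_currency(value) -> float:
    """Locale-safe cleanup: drop symbols/spaces, detect ',' vs '.' as decimal separator."""
    s = str(value).strip().replace("€", "").replace("$", "").replace(" ", "")
    if "," in s and s.rfind(",") > s.rfind("."):
        s = s.replace(".", "").replace(",", ".")   # '1.234,56' -> '1234.56'
    else:
        s = s.replace(",", "")                     # '1,234.56' -> '1234.56'
    return float(s)

def canonical_text(row: pd.Series) -> str:
    """Canonical text per row: join the important fields with a light template."""
    return f"product: {row['product']} | category: {row['category']} | price: {row['price']:.2f}"

def load_workbook(xlsx_path: str, dsn: str) -> None:
    df = pd.read_excel(xlsx_path, sheet_name=0)          # single workbook, first sheet
    df["price"] = df["price"].map(clean_currency)        # typed numeric column
    df["doc_text"] = df.apply(canonical_text, axis=1)    # one canonical string per row
    records = [
        (str(r.product), str(r.category), float(r.price), r.doc_text)
        for r in df.itertuples(index=False)
    ]
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""CREATE TABLE IF NOT EXISTS rows (
                           id       bigserial PRIMARY KEY,
                           product  text,
                           category text,
                           price    numeric,
                           doc_text text NOT NULL)""")
        cur.executemany(
            "INSERT INTO rows (product, category, price, doc_text) VALUES (%s, %s, %s, %s)",
            records,
        )
```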
**Embeddings at the DB edge:** Row embeddings stored in a vector(768) column; IVFFLAT index with tuned lists for fast ANN; pre-warm and VACUUM routines.
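Roughly what the column, index, and backfill look like; this sketch assumes sentence-transformers for the encoder, and `lists = 100` is a placeholder tuning value rather than the project's setting:

```python
# Hypothetical sketch: add the vector(768) column, build the IVFFlat index, and
# backfill embeddings in batches. The model ID is the public multilingual-e5-base
# checkpoint; the E5 family expects a "passage: " prefix on indexed text.
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")   # dim = 768

def to_vector_literal(vec) -> str:
    """pgvector's text input format: '[x1,x2,...]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

def ensure_vector_column(dsn: str, lists: int = 100) -> None:
    # lists is a tuning knob; in practice the index is (re)built after the bulk
    # load so IVFFlat trains its centroids on real data.
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("ALTER TABLE rows ADD COLUMN IF NOT EXISTS embedding vector(768)")
        cur.execute(
            "CREATE INDEX IF NOT EXISTS rows_embedding_ivfflat "
            "ON rows USING ivfflat (embedding vector_cosine_ops) "
            f"WITH (lists = {lists})"
        )

def backfill_embeddings(dsn: str, batch_size: int = 256) -> None:
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT id, doc_text FROM rows WHERE embedding IS NULL")
        pending = cur.fetchall()
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            vecs = model.encode(["passage: " + text for _, text in batch],
                                normalize_embeddings=True)
            cur.executemany(
                "UPDATE rows SET embedding = %s::vector WHERE id = %s",
                [(to_vector_literal(v), row_id) for (row_id, _), v in zip(batch, vecs)],
            )
```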
**Pure-SQL retrieval:** Prompt → embedding → ORDER BY embedding <=> :qvec LIMIT k (cosine distance via pgvector) → return rows + relevance scores; deterministic tie-breaks.
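The query path in sketch form, reusing the encoder and `to_vector_literal` helper from the previous block; `ivfflat.probes` and the id tie-break shown here are illustrative choices:

```python
# Hypothetical query path: embed the question with the "query: " prefix, rank by
# cosine distance (<=>) inside PostgreSQL, then apply the deterministic tie-break.
def search(dsn: str, question: str, k: int = 5, probes: int = 10):
    qvec = to_vector_literal(model.encode("query: " + question, normalize_embeddings=True))
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        # probes trades recall for speed on the IVFFlat index (session-level setting).
        cur.execute("SELECT set_config('ivfflat.probes', %s, false)", (str(probes),))
        cur.execute(
            """
            SELECT id, doc_text, distance
            FROM (
                SELECT id, doc_text, embedding <=> %s::vector AS distance
                FROM rows
                ORDER BY embedding <=> %s::vector     -- index-friendly ANN ordering
                LIMIT %s
            ) AS hits
            ORDER BY distance, id                     -- deterministic tie-break
            """,
            (qvec, qvec, k),
        )
        return cur.fetchall()                         # [(row_id, doc_text, distance), ...]
```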
**Summarization:** Compact answer generator that cites row IDs; configurable to return only rows (no LLM) for audit-friendly workflows.
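A hedged sketch of the answer layer; the prompt wording and the `llm` callable are placeholders, the point is the rows-only switch and the row-ID citations:

```python
# Hypothetical answer layer: return the retrieved rows verbatim (audit-friendly,
# no LLM) or hand them to a caller-supplied llm() callable that must cite row IDs.
from typing import Callable, Optional

def answer(dsn: str, question: str, k: int = 5,
           llm: Optional[Callable[[str], str]] = None) -> dict:
    hits = search(dsn, question, k=k)
    if llm is None:
        return {"rows": hits}                         # rows-only mode: no generation at all
    context = "\n".join(f"[row {row_id}] {text}" for row_id, text, _ in hits)
    prompt = (
        "Answer the question using only the rows below and cite the supporting "
        "rows as [row <id>].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return {"rows": hits, "summary": llm(prompt)}
```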
**Ops & safety:** Idempotent re-indexing, chunk/batch size controls, retry/backoff, and simple PII redaction hooks for sensitive columns.
**Observability:** Query traces (SQL + timings), ANN vs. exact recall probes, and quick evaluation notebooks (Recall@k / MRR on a small QA set).
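The metrics themselves are small enough to show; `qa_set` maps each test question to the row IDs that count as relevant, and retrieval is the `search` sketch from above:

```python
# Recall@k / MRR over a small QA set; qa_set maps each question to the set of
# row IDs considered relevant (hypothetical data layout).
def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[int], relevant: set[int]) -> float:
    for rank, row_id in enumerate(retrieved, start=1):
        if row_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(dsn: str, qa_set: dict[str, set[int]], k: int = 5) -> dict[str, float]:
    recalls, mrrs = [], []
    for question, relevant in qa_set.items():
        ids = [row_id for row_id, _, _ in search(dsn, question, k=k)]
        recalls.append(recall_at_k(ids, relevant, k))
        mrrs.append(mrr(ids, relevant))
    n = max(len(qa_set), 1)
    return {"recall@k": sum(recalls) / n, "mrr": sum(mrrs) / n}
```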
My contributions:
• Designed the PostgreSQL schema and pgvector index strategy; wrote the Excel→DB loader and canonical text builder.
• Implemented the embedding pipeline (batching, retries, caching) and the SQL ANN search with cosine distance.
• Authored the summarization prompt, guardrails (max tokens, safe fallbacks), and evaluation harness; packaged everything behind clean Python APIs.
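End to end, usage of the sketches above looks roughly like this; the names and connection string are illustrative, not the module's public API:

```python
# Hypothetical end-to-end usage of the sketches above; not the module's real API.
DSN = "postgresql://user:pass@localhost:5432/tablerag"    # placeholder connection string

load_workbook("catalog.xlsx", DSN)       # Excel -> typed rows + canonical text
ensure_vector_column(DSN)                # vector(768) column + IVFFlat index
backfill_embeddings(DSN)                 # batched multilingual-e5-base embeddings

print(answer(DSN, "Which products cost more than 100?", k=5))   # rows-only mode
```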