# TelcoScrape

Ethical web-scraping pipeline for phone release data, with normalization and business-analytics insights.
## Key Highlights
• Ethical web scraping with robots.txt compliance
• Multi-source reconciliation with confidence scores
• Data quality metrics and anomaly detection
• Brand-level business analytics insights
• Idempotent resume with CLI controls
• Schema validation and QA testing pipeline
## Project Details
I built an ethical web-scraping → cleaning → analysis pipeline that collects phone release and announce dates across brands, normalizes them into a clean table, and answers campaign questions: brand-level average bills, e-invoice adoption by city, and a sanity check of device 5G support against 5G subscriptions.
**Scraper:** Respectful crawling (robots.txt compliance, rotating user-agents, exponential backoff with jitter on retries), HTML parsing into brand / model / announced_date / release_date, and idempotent resume.
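A minimal sketch of the backoff/jitter and robots.txt checks described above (the helper names `backoff_delay` and `allowed_by_robots` are illustrative, not the project's actual API):

```python
import random
from urllib import robotparser

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    """Exponential backoff with jitter: base * 2^attempt, capped at `cap`,
    then scaled by a random factor in [1 - jitter, 1 + jitter]."""
    delay = min(cap, base * (2 ** attempt))
    return delay * (1 + random.uniform(-jitter, jitter))

def allowed_by_robots(robots_url, user_agent, target_url):
    """Check robots.txt before fetching a page (performs a network read)."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, target_url)
```

The jittered delay avoids synchronized retry bursts against the same host; the robots check runs once per host and is cached in the real pipeline.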
**Normalization:** Model name de-aliasing (Pro/Plus/5G variants), date parsing to ISO, duplicate collapse, multi-source reconciliation with confidence scores.
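The de-aliasing and date-parsing rules can be sketched roughly as follows (the suffix list and accepted date formats are assumptions for illustration; the real rule set is larger):

```python
import re
from datetime import datetime

# Connectivity suffixes collapsed during de-aliasing (illustrative subset).
SUFFIXES = re.compile(r"\s+(5g|4g|lte)$", re.IGNORECASE)

def normalize_model(name):
    """Lowercase, collapse whitespace, and strip connectivity suffixes
    so variants like 'Galaxy S21 5G' and 'Galaxy S21' collapse together."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return SUFFIXES.sub("", name)

def to_iso(date_str, formats=("%Y-%m-%d", "%B %d, %Y", "%d %b %Y")):
    """Try a few common source formats; return an ISO date or None."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` rather than raising keeps unparseable rows in the table, where the data-quality stage can flag them.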
**Data quality:** Coverage/freshness metrics, anomaly flags (release < announce, missing year), brand/year completeness heatmap.
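A minimal pandas sketch of the anomaly flags above, assuming the column names from the scraper output (`announced_date`, `release_date`); here "missing year" is approximated as an unparseable date:

```python
import pandas as pd

def quality_flags(df):
    """Add boolean anomaly columns: release before announce, unparseable dates."""
    out = df.copy()
    ann = pd.to_datetime(out["announced_date"], errors="coerce")
    rel = pd.to_datetime(out["release_date"], errors="coerce")
    out["flag_release_before_announce"] = rel < ann   # NaT comparisons are False
    out["flag_missing_year"] = ann.isna() | rel.isna()
    return out
```

Flag columns (rather than dropped rows) let the completeness heatmap count anomalies per brand/year.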
**Case study analytics:**
• Brand-level average bills (from a workbook mirroring the production schema).
• E-invoice adoption by city (groupby + rates, small multiples).
• Sanity check: device_supports_5g vs subscription_5g_flag consistency report.
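The 5G sanity check in the last bullet can be sketched as a short pandas filter; the column names (`device_supports_5g`, `subscription_5g_flag`) come from the check described above, while the helper itself is hypothetical:

```python
import pandas as pd

def five_g_consistency(df):
    """Return lines flagged as having a 5G subscription although the
    device does not support 5G, plus the mismatch rate."""
    suspect = df[df["subscription_5g_flag"] & ~df["device_supports_5g"]]
    rate = len(suspect) / len(df) if len(df) else 0.0
    return suspect, rate
```

The asymmetric filter is deliberate: a 5G-capable device on a 4G plan is a sales opportunity, while a 5G subscription on a non-5G device is a likely data error.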
**Outputs:** release_dates.csv (canonical), notebook with plots (histograms/boxplots), and a short, auditable commentary ("assumptions & caveats").
**My contributions:**
• Implemented the scraper core (session, retries, caching), parsers, and the normalization rules; wrote the de-dup & confidence logic.
• Built the pandas cleaning pipeline, currency/locale normalization, and the analytics notebook answering the three stakeholder questions concisely.
• Added QA checks (schema validator, null/duplicate tests) and a lightweight CLI (--resume, --since YYYY) for incremental runs; documented the runbook and ethics policy.
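The CLI surface mentioned above can be sketched with `argparse`; only the two documented flags are shown, and the program name is assumed:

```python
import argparse

def build_parser():
    """CLI sketch exposing the incremental-run flags (--resume, --since YYYY)."""
    p = argparse.ArgumentParser(prog="telcoscrape")
    p.add_argument("--resume", action="store_true",
                   help="skip models already present in the local cache")
    p.add_argument("--since", type=int, metavar="YYYY",
                   help="only scrape releases from this year onward")
    return p
```

Usage: `python telcoscrape.py --resume --since 2020` re-runs the pipeline without re-fetching cached models.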