# TelcoScrape

Ethical web-scraping pipeline for phone release data, with normalization and business-analytics insights.
## Key Highlights
• Ethical web scraping with robots.txt compliance
• Multi-source reconciliation with confidence scores
• Data quality metrics and anomaly detection
• Brand-level business analytics insights
• Idempotent resume with CLI controls
• Schema validation and QA testing pipeline
## Project Details
I built an ethical web-scraping → cleaning → analysis pipeline that collects phone release and announce dates across brands, normalizes them into a clean table, and answers campaign questions: brand-level average bills, e-invoice adoption by city, and a sanity check of device 5G support against 5G subscriptions.
**Scraper:** Respectful crawling (robots.txt compliance, rotating user-agents, exponential backoff with jitter on retries), HTML parsing into brand / model / announced_date / release_date, and idempotent resume.
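A minimal sketch of the backoff/jitter and robots.txt checks described above (the helper names `backoff_delay` and `allowed_by_robots` are illustrative, not the project's actual API):

```python
import random
from urllib import robotparser

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    """Exponential backoff with jitter: base * 2^attempt, capped at `cap`,
    then scaled by a random factor in [1 - jitter, 1 + jitter]."""
    delay = min(cap, base * (2 ** attempt))
    return delay * (1 + random.uniform(-jitter, jitter))

def allowed_by_robots(robots_url, user_agent, target_url):
    """Check robots.txt before fetching a page (performs a network read)."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, target_url)
```

The jittered delay avoids synchronized retry bursts against the same host; the robots check runs once per host and is cached in the real pipeline.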
**Normalization:** Model name de-aliasing (Pro/Plus/5G variants), date parsing to ISO, duplicate collapse, multi-source reconciliation with confidence scores.
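The de-aliasing and date-parsing rules can be sketched roughly as follows (the suffix list and accepted date formats are assumptions for illustration; the real rule set is larger):

```python
import re
from datetime import datetime

# Connectivity suffixes collapsed during de-aliasing (illustrative subset).
SUFFIXES = re.compile(r"\s+(5g|4g|lte)$", re.IGNORECASE)

def normalize_model(name):
    """Lowercase, collapse whitespace, and strip connectivity suffixes
    so variants like 'Galaxy S21 5G' and 'Galaxy S21' collapse together."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return SUFFIXES.sub("", name)

def to_iso(date_str, formats=("%Y-%m-%d", "%B %d, %Y", "%d %b %Y")):
    """Try a few common source formats; return an ISO date or None."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` rather than raising keeps unparseable rows in the table, where the data-quality stage can flag them.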
**Data quality:** Coverage/freshness metrics, anomaly flags (release < announce, missing year), brand/year completeness heatmap.
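A minimal pandas sketch of the anomaly flags above, assuming the column names from the scraper output (`announced_date`, `release_date`); here "missing year" is approximated as an unparseable date:

```python
import pandas as pd

def quality_flags(df):
    """Add boolean anomaly columns: release before announce, unparseable dates."""
    out = df.copy()
    ann = pd.to_datetime(out["announced_date"], errors="coerce")
    rel = pd.to_datetime(out["release_date"], errors="coerce")
    out["flag_release_before_announce"] = rel < ann   # NaT comparisons are False
    out["flag_missing_year"] = ann.isna() | rel.isna()
    return out
```

Flag columns (rather than dropped rows) let the completeness heatmap count anomalies per brand/year.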
**Case study analytics:**
• Brand-level average bills (from a workbook mirroring the production schema).
• E-invoice adoption by city (groupby + rates, small multiples).
• Sanity check: device_supports_5g vs subscription_5g_flag consistency report.
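The 5G sanity check in the last bullet can be sketched as a short pandas filter; the column names (`device_supports_5g`, `subscription_5g_flag`) come from the check described above, while the helper itself is hypothetical:

```python
import pandas as pd

def five_g_consistency(df):
    """Return lines flagged as having a 5G subscription although the
    device does not support 5G, plus the mismatch rate."""
    suspect = df[df["subscription_5g_flag"] & ~df["device_supports_5g"]]
    rate = len(suspect) / len(df) if len(df) else 0.0
    return suspect, rate
```

The asymmetric filter is deliberate: a 5G-capable device on a 4G plan is a sales opportunity, while a 5G subscription on a non-5G device is a likely data error.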
**Outputs:** release_dates.csv (canonical), notebook with plots (histograms/boxplots), and a short, auditable commentary ("assumptions & caveats").
**My contributions:**
• Implemented the scraper core (session, retries, caching), parsers, and the normalization rules; wrote the de-dup & confidence logic.
• Built the pandas cleaning pipeline, currency/locale normalization, and the analytics notebook answering the three stakeholder questions concisely.
• Added QA checks (schema validator, null/duplicate tests) and a lightweight CLI (--resume, --since YYYY) for incremental runs; documented the runbook and ethics policy.
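The CLI surface mentioned above can be sketched with `argparse`; only the two documented flags are shown, and the program name is assumed:

```python
import argparse

def build_parser():
    """CLI sketch exposing the incremental-run flags (--resume, --since YYYY)."""
    p = argparse.ArgumentParser(prog="telcoscrape")
    p.add_argument("--resume", action="store_true",
                   help="skip models already present in the local cache")
    p.add_argument("--since", type=int, metavar="YYYY",
                   help="only scrape releases from this year onward")
    return p
```

Usage: `python telcoscrape.py --resume --since 2020` re-runs the pipeline without re-fetching cached models.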