Overview

A grocery delivery platform’s product team struggled with inconsistent merchant catalog updates that degraded search and browsing. Feeds arrived with missing categories, mismatched units, broken images, and duplicate items, and the search index reflected those issues. Intelligex implemented a validation and enrichment pipeline that checked incoming data against a canonical schema, auto-fixed common issues, quarantined out-of-policy records, and opened Jira tickets with diffs and evidence. Catalog quality stabilized, search complaints subsided, and release notes matched what customers actually saw—without replacing the search stack or merchant integrations.

Client Profile

  • Industry: Online grocery delivery and marketplace
  • Company size (range): Multi-region platform with diverse merchant integrations
  • Stage: Established catalog ingestion and search; quality handled with manual checks and ad hoc scripts
  • Department owner: Product Management & R&D
  • Other stakeholders: Catalog Operations, Merchant Operations, Search/Relevance, Data Engineering/Analytics, Web/Mobile Engineering, Customer Support, Legal/Compliance

The Challenge

Merchant catalogs arrived via APIs, flat files, and partner platforms with uneven structure and quality. Variations in category taxonomies, units, images, and identifiers led to inconsistent browse and search behavior. Facets disappeared when categories were missing, price sorting broke when units diverged, and identical products appeared multiple times with slight naming differences. Customers reported that obvious items were hard to find, and Support escalations piled up during peak hours.

Teams tried to patch the symptoms. Catalog Operations ran spot checks, engineers added conditional transforms for specific merchants, and product managers edited release notes to explain expected changes. None of that addressed the root cause: feeds were accepted as long as they parsed, the search index ingested whatever landed, and the organization lacked a governed path to validate, enrich, and track issues at the record level.

The platform needed a way to catch and correct issues before indexing, surface actionable diffs to the right owners, and maintain a clear audit trail. The approach had to work with the existing data stack and search infrastructure, align with product identifier standards like GS1 GTIN, and use familiar schema and validation practices such as JSON Schema and data checks inspired by Great Expectations. The search index remained on Elasticsearch, and tickets flowed through Jira.

Why It Was Happening

The root causes were fragmented taxonomies and insufficient validation. Merchants used different category trees and identifier conventions (SKU vs. GTIN), and some feeds mixed units and currencies within the same assortment. The ingestion layer enforced shape but not semantics, so missing categories, unit mismatches, and broken media links passed through. Deduplication relied on brittle name matching instead of stable identifiers, and the index reflected that drift.

Ownership was split. Merchant Operations chased feed issues, Data Engineering maintained ingest jobs, Search owned indexing and relevance, and PMs were accountable for customer experience. Without a canonical catalog model, enrichment rules, and a gate to quarantine bad records, teams reacted to incidents instead of preventing them. Quality trends by merchant were obscured, so repeat offenders and high-impact failures were hard to prioritize.

The Solution

Intelligex implemented a catalog validation and enrichment pipeline between ingestion and search indexing. The pipeline validated records against a canonical schema, normalized units and categories, deduplicated by identifiers and attributes, and repaired common issues such as missing images or malformed URLs. Records that failed semantic checks were quarantined with reason codes, while auto-fixes proceeded with annotations. For unresolved issues, the system opened Jira tickets with merchant context, diffs, and sample records. Dashboards exposed quality scores and trends by merchant and category. Search indexing ran only on curated, validated outputs.

  • Integrations: Ingestion from existing brokers (for example, Apache Kafka or batch loaders); schema checks with JSON Schema; semantic validations aligned to patterns from Great Expectations; transformations orchestrated in tools such as Apache Airflow with curated models managed in dbt; indexing to Elasticsearch; tickets and approvals in Jira.
  • Canonical catalog model: Standard fields for identifiers (GTIN, merchant SKU), title/brand/size, units and currency, category path, attributes, availability, images, and effective dates. Mappings aligned merchant taxonomies to a shared tree.
  • Validation rules: Checks for required fields, category presence, unit/currency consistency, image URL reachability, attribute type conformance, and identifier sanity (including GTIN check digits).
  • Enrichment and normalization: Unit and currency normalization, brand cleanup, category mapping, and attribute standardization. Image fallback sourcing and URL normalization where permissible.
  • Deduplication: Heuristics combining identifiers, brand, size, and attribute signatures to merge duplicates safely; confidence thresholds and human review for low-confidence merges.
  • Quarantine and tickets: Out-of-policy records held from indexing with reason codes; Jira tickets auto-opened with diffs, sample payloads, and merchant metadata; watch lists tracked chronic issues.
  • Dashboards: Merchant-level scorecards, category coverage, trendlines of validation failures and auto-fix rates, and indexing posture. Filters highlighted high-impact breakages by search facet or storefront.
  • Governance: Change control for taxonomy, validation thresholds, and auto-fix rules; exception approvals with expirations; audit logs for all edits and overrides.
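As a concrete instance of the identifier sanity checks above, a GS1 GTIN check-digit validator fits in a few lines of Python. This is a minimal sketch of one rule, not the platform's full rule set:

```python
def gtin_check_digit_ok(gtin: str) -> bool:
    """Validate the GS1 check digit for GTIN-8/12/13/14 strings."""
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in gtin]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3,1,3,1,... starting from the digit
    # immediately to the left of the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

A record failing this check would carry an identifier reason code into quarantine rather than being dropped silently.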

Implementation

  • Discovery: Mapped current feed types and merchant sources; inventoried common failure modes (missing categories, unit mismatches, image errors, duplicates); traced how records moved from ingestion to Elasticsearch; collected Support and relevance pain points.
  • Design: Authored the canonical catalog schema and taxonomy mapping; defined validation checks and severity; specified normalization and dedup rules; designed quarantine paths and Jira ticket templates; outlined dashboards and ownership.
  • Build: Implemented schema validation and semantic checks; built normalization and enrichment services; added deduplication with confidence scoring; wired quarantine storage and Jira automation with diffs and samples; integrated curated outputs with Elasticsearch indexers; assembled dashboards.
  • Testing/QA: Ran the pipeline in shadow mode, validating and enriching feeds while indexing continued unchanged; compared curated vs. raw index behavior on test collections; tuned thresholds, mappings, and auto-fix rules; conducted human review on low-confidence merges.
  • Rollout: Enabled curation for selected merchants and top categories first; retained legacy ingest as a controlled fallback; expanded coverage after stable cycles and clear improvements in search behavior and Support feedback.
  • Training/hand-off: Delivered sessions for Catalog Ops, Merchant Ops, Search, and PMs on dashboards, ticket patterns, and taxonomy governance; updated SOPs for merchant onboarding and feed remediation; transferred ownership of rules and mappings to Product Ops and Catalog Ops under change control.
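The quarantine-to-Jira wiring built in this phase can be sketched as a small ticket assembler that pairs reason codes with a record diff. The reason codes, field names, and payload shape below are illustrative assumptions, not the platform's actual schema:

```python
import difflib
import json

# Hypothetical reason codes; the production catalog used its own taxonomy.
REASONS = {
    "MISSING_CATEGORY": "Record has no category path",
    "UNIT_MISMATCH": "Unit diverges from assortment policy",
    "BAD_IMAGE_URL": "Image URL failed reachability check",
}

def build_ticket(merchant: str, raw: dict, curated: dict,
                 reasons: list[str]) -> dict:
    """Assemble a Jira-style ticket body with reason codes and a record diff."""
    diff = "\n".join(difflib.unified_diff(
        json.dumps(raw, indent=2, sort_keys=True).splitlines(),
        json.dumps(curated, indent=2, sort_keys=True).splitlines(),
        fromfile="raw", tofile="curated", lineterm=""))
    return {
        "summary": f"[{merchant}] catalog quarantine: {', '.join(reasons)}",
        "description": "\n".join(REASONS[r] for r in reasons) + "\n\n" + diff,
        "labels": ["catalog-quality", merchant],
    }
```

Giving every ticket the same structure (summary, reasons, diff, merchant label) is what let owners triage without reproducing the feed locally.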

Results

Catalog issues were caught before they reached customers. Category gaps, unit mismatches, and broken images tripped validation checks and were either auto-fixed or quarantined. When feeds failed policy, Jira tickets carried exactly what changed, concrete examples, and the merchant context. Search indexing consumed curated outputs, so browse and facet behavior matched product expectations and release notes.

Planning conversations shifted from symptom chasing to prioritized improvements. Dashboards showed which merchants and categories drove the most validation work, and PMs coordinated targeted remediation and communication. Support scripts referenced the same scorecards, and relevance teams tuned ranking on a cleaner catalog. The ingestion stack and search infrastructure stayed intact; the organization added a governed layer that made quality predictable.

What Changed for the Team

  • Before: Feeds were accepted if they parsed. After: Schema and semantic checks blocked or fixed bad records before indexing.
  • Before: Duplicates and broken images leaked into search. After: Deduplication and URL validation cleaned the catalog automatically.
  • Before: Merchant issues surfaced through Support. After: Auto-opened Jira tickets with diffs and samples alerted owners proactively.
  • Before: Taxonomies drifted by merchant. After: Mappings normalized categories to a shared tree with governed updates.
  • Before: Release notes hedged around data caveats. After: Notes matched observed storefront changes and facet behavior.
  • Before: Quality trends were anecdotal. After: Dashboards showed merchant and category scorecards with clear priorities.

Key Takeaways

  • Validate semantics, not just shape; unit, category, and identifier checks prevent subtle breakages.
  • Normalize and enrich before you index; a curated catalog keeps search behavior predictable.
  • Automate the routine; auto-fix common issues and quarantine the rest with reason codes.
  • Make ownership clear; open tickets with diffs and samples so the right teams can act.
  • Govern taxonomy and rules; controlled updates avoid churn across merchants and storefronts.
  • Integrate, don’t replace; layer validation and enrichment onto your existing ingestion and search stack.

FAQ

What tools did this integrate with? The pipeline sat between existing ingestion and search. It validated schemas with JSON Schema, enforced semantic checks aligned with Great Expectations-style rules, orchestrated transforms in Apache Airflow and curated models in dbt, and indexed curated outputs to Elasticsearch. Jira handled tickets and approvals. Identifier checks aligned with GS1 GTIN.

How did you handle quality control and governance? Validation thresholds, taxonomy mappings, auto-fix rules, and dedup heuristics lived under change control. Low-confidence merges and category mappings required human review. All quarantines, fixes, and overrides were logged with reason codes and expirations. Dashboards made merchant and category quality visible, and exceptions were revisited on a schedule.

How did you roll this out without disruption? The pipeline ran in shadow mode first, producing curated outputs and draft tickets while indexing continued from raw feeds. Teams compared curated vs. raw storefront behavior and tuned rules. Curation and indexing were then enabled for selected merchants and categories, with legacy paths retained as a controlled fallback during early cycles.

How did auto-fixes work safely? Auto-fixes covered deterministic cases: unit normalization, URL repair, brand cleanup, and category mappings with high confidence. Each fix annotated the record with the rule applied. When confidence was low or a change could affect pricing or compliance, the record moved to quarantine and a Jira ticket requested review.
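A deterministic weight-unit fix of the kind described here can be sketched as follows. The conversion table, field names, and rule identifier (`unit_normalize_v1`) are illustrative assumptions, not the production rules:

```python
# Hypothetical conversion table; real rules lived under change control.
TO_GRAMS = {"g": 1.0, "kg": 1000.0, "oz": 28.3495, "lb": 453.592}

def normalize_weight(record: dict) -> dict:
    """Deterministic auto-fix: convert a weight to grams and annotate the rule."""
    unit = record.get("unit", "").lower()
    if unit in TO_GRAMS and unit != "g":
        fixed = dict(record)  # never mutate the raw record
        fixed["size_value"] = round(record["size_value"] * TO_GRAMS[unit], 2)
        fixed["unit"] = "g"
        fixed.setdefault("annotations", []).append(
            {"rule": "unit_normalize_v1", "from_unit": unit})
        return fixed
    return record
```

The annotation is the safety mechanism: every auto-fixed record carries the rule that touched it, so a bad rule can be found and rolled back.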

How were duplicates identified and merged? Deduplication combined stable identifiers (GTIN where available) with brand, size, and attribute signatures. Confidence scoring determined whether an automatic merge was safe. Low-confidence candidates were flagged for Catalog Ops review, and accepted merges fed back into the signature library.
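The identifier-plus-signature approach can be illustrated with a toy confidence score. The field choices and scoring below are a simplified sketch, not the production heuristics:

```python
import re

def signature(rec: dict) -> tuple:
    """Attribute signature used when a stable identifier is unavailable."""
    norm = lambda s: re.sub(r"\W+", "", s.lower())
    return (norm(rec.get("brand", "")), norm(rec.get("title", "")),
            rec.get("size_value"), rec.get("unit"))

def merge_confidence(a: dict, b: dict) -> float:
    """Toy score: a GTIN match is decisive; otherwise compare signatures."""
    if a.get("gtin") and a.get("gtin") == b.get("gtin"):
        return 1.0
    sa, sb = signature(a), signature(b)
    matches = sum(x == y for x, y in zip(sa, sb))
    return matches / len(sa)  # fraction of matching signature fields
```

A merge would proceed automatically only above a tuned threshold; anything below it goes to human review, as the case study describes.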

Did merchants have to change their feeds? No. Existing feeds continued. The platform applied mappings and validations internally. When systemic issues were detected, Jira tickets and merchant communications included concrete examples and suggested fixes, and feed validators were shared as guidance during onboarding.

How did this affect the search index refresh? The indexers consumed curated tables only. If a category mapping or validation failed, the affected records were withheld from refresh until resolved, preventing partial or inconsistent facets from reaching customers. Successful runs included annotations so relevance and PMs could trace storefront changes back to feed events.

You need a similar solution?

Get a FREE Proof of Concept & Consultation

No Cost, No Commitment!