The Invisible Chaos: How Inconsistent Product Attributes Sabotage E-Commerce at Scale
When retailers talk about scaling, they think of search engines, real-time inventory, and checkout optimization. These are visible problems. But beneath the surface lurks a more stubborn one: attribute values that simply don’t match. In real product catalogs, these values are rarely consistent. They are formatted differently, semantically ambiguous, or just incorrect. And when you multiply this across millions of products, a small annoyance becomes a systemic disaster.
The Problem: Small individually, but massive at scale
Real catalogs are full of concrete examples: the same value written in different formats, semantically ambiguous entries, and outright incorrect ones. Each seems harmless on its own. But once you’re working with more than 3 million SKUs, each with dozens of attributes, these small inconsistencies compound into a real problem.
This is the silent suffering lurking behind almost every large e-commerce catalog.
The approach: AI with guardrails instead of chaos algorithms
I didn’t want a black-box solution that reorders values in ways nobody can explain. Instead, I aimed for a hybrid pipeline: the LLM handles the nuanced cases, deterministic rules handle the simple ones, and humans keep the final say.
The result: AI that thinks intelligently but always remains transparent.
The architecture: Offline jobs instead of real-time madness
All attribute processing runs in the background—not in real time. This was not a quick fix but a strategic design decision.
Real-time pipelines sound tempting, but they couple data processing to customer traffic: every model hiccup becomes a user-facing delay, and load spikes drive up cost and failure rates.
Offline jobs, on the other hand, can be batched, retried, and scheduled freely, and a failed run never touches the storefront.
Separating customer-facing systems from data processing is crucial at this scale.
The process: From trash to clean data
Before the AI touches the data, a critical cleaning step runs.
This guarantees that the LLM only ever works with clean inputs. The principle is simple: garbage in, garbage out. Small errors at this scale turn into big problems later.
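A minimal sketch of what such a cleaning step can look like; the function and the exact rules here are illustrative, not the project’s actual code:

```python
import re

def clean_values(raw_values):
    """Strip markup, normalize whitespace, and deduplicate attribute values."""
    seen = set()
    cleaned = []
    for value in raw_values:
        v = re.sub(r"<[^>]+>", "", value)   # strip stray HTML remnants
        v = re.sub(r"\s+", " ", v).strip()  # collapse and trim whitespace
        if not v:
            continue                        # drop empty entries outright
        key = v.casefold()                  # deduplicate case-insensitively
        if key not in seen:
            seen.add(key)
            cleaned.append(v)
    return cleaned

print(clean_values([" Red ", "red", "<b>Blue</b>", "", "Green  "]))
# -> ['Red', 'Blue', 'Green']
```

The first spelling wins on duplicates, so the casing a merchant originally entered is preserved.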
The LLM service: Smarter than just sorting
The LLM doesn’t sort blindly alphabetically; it thinks contextually. It receives each attribute along with its surrounding catalog context, and with that context the model understands what kind of attribute it is looking at and what ordering a shopper would expect. It returns the values in exactly that order.
This allows handling different attribute types without coding each category individually.
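As a sketch, the contract between the pipeline and the LLM service might look like this; the class and field names (`SortRequest`, `rationale`, and so on) are assumptions for illustration, not the article’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class SortRequest:
    category: str        # e.g. "Drills": gives the model domain context
    attribute_name: str  # e.g. "Battery voltage"
    values: list[str]    # cleaned candidate values

@dataclass
class SortResponse:
    sorted_values: list[str]
    rationale: str = ""  # a short explanation keeps the model auditable

def build_prompt(req: SortRequest) -> str:
    # The prompt hands the model the context it needs to infer the attribute type.
    return (
        f"Category: {req.category}\n"
        f"Attribute: {req.attribute_name}\n"
        f"Values: {', '.join(req.values)}\n"
        "Return these values in the order a shopper would expect, "
        "with a one-sentence rationale."
    )

print(build_prompt(SortRequest("Drills", "Battery voltage", ["18 V", "12 V", "36 V"])))
```

Carrying a rationale field back from the model is what keeps the system transparent rather than a black box.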
Deterministic fallbacks: Not everything needs AI
Many attributes have a natural order that needs no artificial intelligence.
These receive simple, deterministic sorting.
The pipeline automatically detects these cases and uses deterministic logic. This keeps the system efficient and avoids unnecessary LLM calls.
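For instance, a purely numeric attribute can be detected and ordered without any model call. This sketch shows the idea; the function name and the regex are illustrative:

```python
import re

# Matches a leading number, allowing "." or "," as the decimal separator.
_NUMERIC = re.compile(r"^\s*(\d+(?:[.,]\d+)?)")

def try_numeric_sort(values):
    """Return numerically sorted values, or None if values aren't all numeric."""
    parsed = []
    for v in values:
        m = _NUMERIC.match(v)
        if not m:
            return None  # mixed content: fall through to the LLM path
        parsed.append((float(m.group(1).replace(",", ".")), v))
    return [v for _, v in sorted(parsed)]

print(try_numeric_sort(["500 ml", "250 ml", "1000 ml"]))  # ['250 ml', '500 ml', '1000 ml']
print(try_numeric_sort(["Red", "Blue"]))                  # None -> handled by the LLM
```

Returning None for non-numeric input is what lets the pipeline route each attribute automatically: cheap rules first, the model only when needed.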
Human vs. machine: Dual control
Retailers need control over critical attributes. Therefore, each category can be marked as either AI-managed or manually managed.
This system distributes the workload: AI handles the bulk, humans make final decisions. It also builds trust, as teams can override the model when needed.
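A minimal sketch of that dual-control flag; the mode names are illustrative:

```python
from enum import Enum

class SortMode(Enum):
    AI_MANAGED = "ai"            # pipeline may overwrite the ordering
    MANUALLY_MANAGED = "manual"  # a merchant's ordering always wins

def resolve_order(mode, ai_order, manual_order):
    # Humans make the final call: a manual ordering is never overwritten.
    if mode is SortMode.MANUALLY_MANAGED and manual_order:
        return manual_order
    return ai_order

print(resolve_order(SortMode.MANUALLY_MANAGED, ["S", "M", "L"], ["L", "M", "S"]))
# -> ['L', 'M', 'S']
```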
Infrastructure: Simple, centralized, scalable
All results are stored directly in a MongoDB database, the single operational store for everything the pipeline produces.
This makes it easy to review changes, overwrite values, reprocess categories, and synchronize with other systems.
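The write path can be as simple as one upsert per category/attribute pair. This sketch assumes a pymongo-style collection; the collection and field names are hypothetical:

```python
# `collection` is any pymongo-style collection exposing update_one(),
# e.g. MongoClient("mongodb://localhost:27017").catalog.attribute_orders.
# Collection and field names here are assumptions, not the real schema.
def store_sorted_attribute(collection, category, attribute, sorted_values, source):
    # One document per (category, attribute): easy to review, overwrite,
    # reprocess, and synchronize with other systems.
    collection.update_one(
        {"category": category, "attribute": attribute},
        {"$set": {"sorted_values": sorted_values, "source": source}},
        upsert=True,  # create on the first run, update on reprocessing
    )
```

Tagging each document with its source ("llm", "rule", or "manual") makes it obvious later which orderings a human may still want to review.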
Search integration: Where quality becomes visible
After sorting, the values flow into two search assets.
This is where the work pays off: in search, good attribute sorting finally becomes visible.
The results: From chaos to clarity
The impact was measurable.
Key lessons
Sorting attribute values may seem trivial, but it becomes a real challenge with millions of products. Combining LLM intelligence with clear rules and merchant control creates a system that transforms invisible chaos into scalable clarity.