As we continue to scale our AI-powered expert platform, the quality and structure of our data become more important than ever. In this blog post, we share how we reimagined our expert tagging pipeline — moving from an early BERT-like classifier to a context-aware, LLM-driven system. What started as a practical solution to improve recommendation accuracy evolved into a broader initiative that improved data quality, added depth to our taxonomy, and delivered surprising cost efficiencies. Here’s a look inside the thinking, experimentation, and engineering behind our new approach.
Several years ago, we made the decision to tag our expert data with roles and seniority levels — a fairly standard practice across the industry, aimed at improving internal search and filtering capabilities. Our initial implementation was built around an early-generation BERT-like classifier. While it offered a promising start, its performance in production quickly revealed limitations. The model frequently misclassified roles and seniority levels, leading to inconsistencies that impacted downstream systems.
More importantly, the setup required a continuous feedback loop and regular re-training cycles to maintain quality — introducing overhead and complexity to our maintenance processes. With the emergence of publicly available GPT models, it became increasingly clear that the era of traditional classifiers was coming to an end. Last year, as we began rolling out new AI-driven features for our clients — tools that rely heavily on expert recommendations — the need for accurate, context-aware classification grew sharply. These products demand a deeper understanding of professional backgrounds, where subtle differences in roles or industries can significantly affect outcomes. High-quality tagging was no longer a backend enhancement — it became a core requirement.
It quickly became clear to us that our BERT-like classifier, while a solid starting point, was still too early-stage to fully grasp the semantics and structure of language the way modern LLMs can. The model often misinterpreted context, failing to distinguish between titles like “Data Centres Construction Engineer” and “Data Engineer in Construction.” These nuances are crucial, particularly when working with expert profiles across multiple languages and cultural contexts, from Europe to the US to Japan. To address these limitations, we made the decision to rebuild our tagging pipeline from the ground up, adopting a modern architecture powered by large language models that are purpose-built for contextual understanding and linguistic flexibility.
LLMs offer a clear advantage: they are context-aware, capable of understanding word order, and, in many cases, inherently multilingual. From the start, we debated whether to invest in fine-tuning a dedicated model or to pursue a more agile few-shot prompting approach. Fine-tuning, while powerful, demands a high-quality training dataset, which requires significant time and resources to develop. Few-shot prompting, on the other hand, offers flexibility, allowing us to iterate quickly and refine the tagging process in real time.
We chose to begin with a few-shot prompting approach, experimenting with a range of lightweight and compact language models. Using full-scale models wasn’t feasible given the scale of our data — we needed to strike a careful balance between output quality and processing cost. Compact models proved to be a strong fit, delivering consistent results at a fraction of the cost. However, models optimized for reasoning presented unexpected challenges. Their outputs were significantly larger, often producing ten times more tokens than necessary. This overhead wasn’t just noise — the additional tokens were consumed by the model’s internal reasoning process, which, while valuable in some contexts, made these models inefficient for high-volume tagging where concise, taxonomy-aligned output is critical.
With the technical foundation in place, our next step was to define a robust taxonomy for roles and seniority levels. This set the stage for a focused series of experiments and prompt tuning iterations.
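For illustration, here is a minimal sketch of how such a taxonomy can be represented in code. The specific role and seniority entries below are hypothetical placeholders, not our actual taxonomy.

```python
# Hypothetical taxonomy sketch: the real role and seniority sets are internal,
# and these entries are illustrative placeholders only.
SENIORITY_LEVELS = [
    "Intern",
    "Junior",
    "Mid-level",
    "Senior",
    "Director",
    "VP",
    "C-level",
]

ROLES = [
    "Data Engineer",
    "Construction Engineer",
    "Software Engineer",
    "Product Manager",
    # ... many more entries in the real taxonomy
]

# Set form, used later for out-of-taxonomy validation.
ROLE_SET = set(ROLES)
SENIORITY_SET = set(SENIORITY_LEVELS)
```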
To ensure the model stayed focused and avoided role-seniority confusion, we split the tagging prompts into two distinct flows — one dedicated to roles and another to seniority. We also applied aggressive text preprocessing to reduce noise. This helped minimize ambiguity and steer the model toward cleaner, more consistent outputs.
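As a rough sketch of what the two flows look like, the snippet below shows separate role and seniority prompt templates plus a simple preprocessing helper. The prompt wording and the `clean_title` regexes are illustrative assumptions, not our production prompts.

```python
import re

def clean_title(raw: str) -> str:
    """Aggressive preprocessing: strip bracketed notes, collapse whitespace,
    and trim stray punctuation from a raw position title."""
    text = re.sub(r"\(.*?\)|\[.*?\]", " ", raw)  # drop parenthetical noise
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    return text.strip(" .,-")

# Two separate flows: one prompt for roles, one for seniority.
ROLE_PROMPT = (
    "You are tagging an expert's work experience with a role from a fixed taxonomy.\n"
    "Allowed roles: {roles}\n"
    "Return only the role name for the position below.\n"
    "Position: {position}"
)

SENIORITY_PROMPT = (
    "You are tagging an expert's work experience with a seniority level.\n"
    "Allowed levels: {levels}\n"
    "Return only the seniority level for the position below.\n"
    "Position: {position}"
)
```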
To optimize cost without compromising quality, we decided to tag all employment positions in a given expert profile as a single bundled request. This reduced the number of API calls and token usage significantly. To help the model navigate these bundled inputs, we introduced simple structural cues — such as numbering each line (1., 2., 3.) — which guided the model in maintaining alignment between input and output.
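A minimal sketch of the bundling and parsing logic, assuming the model mirrors the input numbering in its answer; it reuses the `clean_title` helper from the sketch above, and the exact prompt wording is again an assumption.

```python
import re  # clean_title is defined in the preprocessing sketch above

def build_bundled_prompt(positions: list[str]) -> str:
    """Bundle every position from one expert profile into a single numbered request."""
    numbered = "\n".join(f"{i}. {clean_title(p)}" for i, p in enumerate(positions, 1))
    return (
        "Tag each numbered position with exactly one role from the allowed taxonomy.\n"
        "Answer with the same numbering, one role per line.\n\n"
        + numbered
    )

def parse_bundled_response(response: str, expected: int) -> dict[int, str]:
    """Map '1. Data Engineer'-style output lines back to their input positions."""
    tags: dict[int, str] = {}
    for line in response.splitlines():
        match = re.match(r"^\s*(\d+)[.)]\s*(.+?)\s*$", line)
        if match:
            tags[int(match.group(1))] = match.group(2)
    if len(tags) != expected:
        raise ValueError(f"expected {expected} tags, got {len(tags)}")
    return tags
```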
Another insight came from monitoring how models handled taxonomy boundaries. Constraining outputs strictly to predefined role sets proved difficult. The model would occasionally generate roles outside our taxonomy — while this required validation, it also revealed taxonomy gaps. After tagging our entire expert database, we found that just 0.5% of roles fell outside our defined taxonomy — a manageable number, and a useful feedback signal.
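Validation against the taxonomy is straightforward; the sketch below shows one way to keep in-taxonomy roles and count everything else as a gap signal. The review-queue handling is an assumption for illustration.

```python
from collections import Counter

# Roles the model produced that are not in our taxonomy; each one is a
# candidate gap rather than just an error.
out_of_scope: Counter[str] = Counter()

def validate_roles(tags: dict[int, str], taxonomy: set[str]) -> dict[int, str | None]:
    """Keep roles that exist in the taxonomy; record the rest as gap signals."""
    validated: dict[int, str | None] = {}
    for idx, role in tags.items():
        if role in taxonomy:
            validated[idx] = role
        else:
            out_of_scope[role] += 1   # feedback signal for taxonomy curation
            validated[idx] = None     # flagged for manual review instead
    return validated
```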
Given the low cost of the compact model, we processed historical data in batches using a semi-manual pipeline, avoiding the overhead of a full backend system. For real-time data, we use standard APIs to keep things simple and maintainable.
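For the semi-manual batch path, the sketch below assumes an OpenAI-style Batch API, where requests are written to a JSONL file and submitted in bulk. The model name and endpoint are assumptions rather than our exact setup, and it reuses `build_bundled_prompt` from the earlier sketch.

```python
import json

def write_batch_file(profiles: dict[str, list[str]], path: str) -> None:
    """Write one JSONL request line per expert profile for a bulk tagging run."""
    with open(path, "w", encoding="utf-8") as f:
        for expert_id, positions in profiles.items():
            request = {
                "custom_id": expert_id,           # lets us join results back to the profile
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",       # illustrative compact model
                    "messages": [
                        {"role": "user", "content": build_bundled_prompt(positions)}
                    ],
                },
            }
            f.write(json.dumps(request, ensure_ascii=False) + "\n")
```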
We also discovered that including company names alongside position titles produced better tagging results for roles. However, this limited our ability to cache results effectively, since caching by position title alone no longer captured the full input context. To measure how much caching was still viable, we prompted the model to mark any role whose classification depended on the company name with an asterisk or similar notation. Our analysis showed only about 30% of results were company-agnostic, too small a share for reliable caching.
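A sketch of how that marker convention can be parsed and used to measure the share of results that would be safe to cache by title alone; the exact marker handling is an assumption.

```python
def split_company_dependence(role_text: str) -> tuple[str, bool]:
    """The model appends '*' to roles whose classification depended on the company name."""
    role = role_text.strip()
    if role.endswith("*"):
        return role.rstrip("* ").strip(), True
    return role, False

def company_agnostic_share(tagged_roles: list[str]) -> float:
    """Fraction of results that could be cached by position title alone."""
    if not tagged_roles:
        return 0.0
    agnostic = sum(1 for r in tagged_roles if not split_company_dependence(r)[1])
    return agnostic / len(tagged_roles)
```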
Lastly, one of our favorite “hacks” remains surprisingly effective: instructing the model with a sharp rule like “You will be fined $1,000 for selecting any other role.” This kind of prompt constraint consistently improves output alignment with the taxonomy.
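In practice this is just an extra line appended to the role prompt, for example (building on the illustrative `ROLE_PROMPT` from the sketch above):

```python
# Illustrative only: a hard constraint appended to the role-tagging prompt.
ROLE_PROMPT_STRICT = ROLE_PROMPT + (
    "\nSelect strictly from the allowed roles above. "
    "You will be fined $1,000 for selecting any other role."
)
```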
Within a short timeframe, we built a robust and cost-efficient expert tagging system that far surpasses our previous classifier-based solution. The new approach delivers significantly higher-quality results: it understands context, handles multiple languages, and adapts to nuanced role descriptions with far greater precision.
One of the most surprising outcomes was the overall cost efficiency of the new system. Compact models proved especially effective at keeping operational costs low, making high-quality tagging sustainable at scale. For historical data, reprocessing through batch APIs — where available — offered an additional opportunity for cost reduction, allowing us to process large volumes of information without the need for heavy infrastructure or caching strategies.
Beyond performance and cost, the new system also became a valuable tool for continuously improving our taxonomy. By tracking out-of-scope results from the model, we’ve been able to identify and close important gaps — especially within our role classifications. This feedback loop ensures our taxonomy stays relevant and reflective of real-world data.
Finally, one important takeaway: not everything has to be solved with heavy code. Certain processes — like historical data tagging — can be executed manually or semi-manually using batch APIs. This keeps the system lightweight, reduces potential bugs, and speeds up delivery without sacrificing accuracy or control.