• p-s-v 7 hours ago
    I built a scraping project to solve a niche domain problem: figuring out which chef knife steels are actually "good" vs. just marketing hype, based on r/chefknives archives.

    The core technical challenge was Entity Resolution. I didn't want to burn thousands of tokens feeding raw threads to an LLM just to identify common terms like "Wüsthof" or "VG-10."

    My solution was a 4-step "Inverse Masking" pipeline:

    Local Fuzzy Match: Fuse.js scans text against a local catalog of ~500 brands/steels. This catches 80% of entities for zero cost (rough sketch after the list).

    Masking: I replace each found match in the text with a placeholder (e.g., "[ENTITY_FOUND]") to hide it.

    LLM Discovery: I send the remaining text to an LLM (via OpenRouter). Because the "loud" common entities are masked, the model is much better at spotting obscure artisan makers or slang that the fuzzy matcher missed (second sketch below).

    Sentiment: The LLM assigns context-aware scores (-1.0 to 1.0) to distinguish between "I want to buy X" (Neutral) and "X chipped immediately" (Negative).
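
    A minimal sketch of steps 1-2, heavily simplified (the catalog shape, the unigram/bigram tokenization, the 0.25 threshold, and helper names like maskKnownEntities are placeholders, not the exact code):

      // Step 1-2 sketch: fuzzy-match against the local catalog, then mask hits.
      import Fuse from "fuse.js";

      interface CatalogEntry {
        name: string;      // canonical name, e.g. "VG-10"
        aliases: string[]; // common variants / misspellings
      }

      const catalog: CatalogEntry[] = [
        { name: "Wüsthof", aliases: ["Wusthof", "Wustof"] },
        { name: "VG-10", aliases: ["VG10", "VG 10"] },
        { name: "MagnaCut", aliases: ["Magna Cut", "CPM MagnaCut"] },
        // ...the real catalog has ~500 brands/steels
      ];

      const fuse = new Fuse(catalog, {
        keys: ["name", "aliases"],
        includeScore: true, // 0 = exact match, 1 = no match
        threshold: 0.25,    // strict, so only confident hits get masked
      });

      // Unigrams + bigrams so two-word names like "Silver 3" are findable.
      function candidateSpans(text: string): string[] {
        const words = text.split(/\s+/).filter(Boolean);
        const spans = [...words];
        for (let i = 0; i < words.length - 1; i++) spans.push(`${words[i]} ${words[i + 1]}`);
        return spans;
      }

      function maskKnownEntities(text: string) {
        const found = new Map<string, string>(); // raw span -> canonical name
        for (const span of candidateSpans(text)) {
          const [best] = fuse.search(span);
          if (best && best.score !== undefined && best.score <= 0.25) {
            found.set(span, best.item.name);
          }
        }
        let masked = text;
        for (const span of found.keys()) {
          // Hide the "loud" known entity so the LLM focuses on what's left.
          masked = masked.split(span).join("[ENTITY_FOUND]");
        }
        return { masked, found };
      }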
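
    And a rough sketch of steps 3-4 against OpenRouter's OpenAI-compatible chat completions endpoint (the model id, prompt wording, and response handling are placeholders, not what I actually run):

      // Step 3-4 sketch: send the masked text to an LLM for discovery + scoring.
      interface EntitySentiment {
        entity: string;    // maker/steel the model spotted in the unmasked remainder
        sentiment: number; // -1.0 (negative) .. 1.0 (positive); purchase intent ~0
      }

      async function discoverAndScore(maskedText: string): Promise<EntitySentiment[]> {
        const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
          method: "POST",
          headers: {
            Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
            "Content-Type": "application/json",
          },
          body: JSON.stringify({
            model: "openai/gpt-4o-mini", // placeholder model id
            messages: [
              {
                role: "system",
                content:
                  "Known knife brands/steels are already masked as [ENTITY_FOUND]. " +
                  "List any remaining makers or steels and score each from -1.0 to 1.0, " +
                  "where purchase intent like 'I want to buy X' is ~0 (neutral). " +
                  "Reply with a JSON array of {entity, sentiment}.",
              },
              { role: "user", content: maskedText },
            ],
          }),
        });
        const data = await res.json();
        // Assumes the model returns clean JSON; the real pipeline should validate this.
        return JSON.parse(data.choices[0].message.content) as EntitySentiment[];
      }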

    The Findings (from 542 threads):

    MagnaCut is the technical winner (28:1 positive ratio).

    Ginsan (Silver 3) is the practical kitchen favorite. It beats premium powdered steels simply because users almost never complain about chipping.

    VG-10 is the most controversial. It has the highest mention volume, but also the highest proportion of "micro-chipping" complaints.

    Stack is Node.js and MongoDB. Full charts and breakdown are in the post.
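
    Something like this aggregation reproduces the ratio numbers above (the collection/field names and the ±0.2 cutoffs are illustrative, not the exact query):

      // Ratio sketch: group per-mention sentiment scores by entity and compare
      // positive vs. negative counts.
      import { MongoClient } from "mongodb";

      async function sentimentRatios(uri: string) {
        const client = await MongoClient.connect(uri);
        try {
          return await client
            .db("knifeday")             // placeholder db name
            .collection("mentions")     // one doc per (thread, entity) mention
            .aggregate([
              {
                $group: {
                  _id: "$entity",
                  positive: { $sum: { $cond: [{ $gt: ["$sentiment", 0.2] }, 1, 0] } },
                  negative: { $sum: { $cond: [{ $lt: ["$sentiment", -0.2] }, 1, 0] } },
                  mentions: { $sum: 1 },
                },
              },
              // Guard against divide-by-zero for steels nobody complains about.
              { $addFields: { ratio: { $divide: ["$positive", { $max: ["$negative", 1] }] } } },
              { $sort: { ratio: -1 } },
            ])
            .toArray();
        } finally {
          await client.close();
        }
      }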

    I'm curious whether others are using this "Fuzzy First -> LLM Second" pattern for NER tasks to save context window, or whether I should just move to vector embeddings for the initial lookup.

    https://new.knife.day/blog/reddit-steel-sentiment-analysis
