We built a benchmark of 1,000 questions that people actually ask. Not random prompts. We scraped 50,000 real user queries from search logs, forum threads, and support tickets across 12 industries, clustered them by intent, and generated 1,000 representative questions.
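The post doesn't specify how the intent clustering was done, so here is a minimal toy sketch of the idea: treat each query as a set of tokens and greedily group queries whose token sets are similar enough. A real pipeline at this scale would almost certainly use embeddings; the function names and threshold here are illustrative assumptions, not the study's method.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_by_intent(queries, threshold=0.5):
    """Greedy single-pass clustering: each query joins the first
    existing cluster whose representative is similar enough,
    otherwise it starts a new cluster."""
    clusters = []  # list of (representative_token_set, member_queries)
    for q in queries:
        tokens = set(q.lower().split())
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                members.append(q)
                break
        else:
            clusters.append((tokens, [q]))
    return [members for _, members in clusters]

queries = [
    "best crm for small business",
    "best crm small business",
    "how to file taxes freelancer",
]
groups = cluster_by_intent(queries)
# The two CRM queries land in one cluster; the tax query gets its own.
```

One representative question per cluster then gives the 1,000-question set.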
We asked those 1,000 questions of 5 AI models: ChatGPT (GPT-4), Gemini (Ultra), Perplexity (Pro), Claude (4.5 Sonnet), and Llama (3). We ran the experiment daily for 30 days and tracked every citation at the source level.
The goal: measure citation overlap. How often do these models cite the same source for the same question?
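A natural way to score this is pairwise Jaccard overlap: for one question, compare the set of sources each model cited against every other model's set. This is a sketch under that assumption (the post doesn't name its exact metric), and the model names and domains below are illustrative, not results from the study.

```python
from itertools import combinations

def citation_overlap(citations: dict) -> dict:
    """Pairwise Jaccard overlap of the source sets that two models
    cited for the same question. `citations` maps model name -> set
    of cited domains."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    return {
        (m1, m2): jaccard(citations[m1], citations[m2])
        for m1, m2 in combinations(sorted(citations), 2)
    }

# Hypothetical citations for a single question on a single day.
citations = {
    "gpt4": {"nytimes.com", "wikipedia.org", "cdc.gov"},
    "gemini": {"wikipedia.org", "cdc.gov", "webmd.com"},
    "perplexity": {"wikipedia.org", "healthline.com"},
}
overlap = citation_overlap(citations)
# e.g. overlap[("gemini", "gpt4")] == 0.5 (2 shared of 4 total sources)
```

Averaging these pairwise scores over all 1,000 questions and 30 days gives a single overlap number per model pair.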