Hi, first of all, thank you for LiveCodeBench. It has become one of the default references whenever people talk about realistic coding benchmarks, especially in contrast to older, saturated datasets.
I am coming from a slightly different direction. I maintain an open framework called WFGY, and the current release, WFGY 3.0, is a pure-text "Singularity Demo" pack. It consists of 131 S-class questions written as long-form prompts, targeting alignment edge cases, long-horizon planning, fragile world models, and other high-tension scenarios. Everything is MIT-licensed and is already used as a long-range stress test for LLMs.
One idea I have been exploring is this: instead of only measuring raw coding performance, we could also ask what happens if the model first absorbs a highly loaded theoretical context and then tries to solve hard code problems. In other words, use a text file like WFGY 3.0 as a fixed preamble and see whether LiveCodeBench scores shift, collapse, or reveal new failure modes.
Would something like a “high-tension preamble” track fit into your roadmap at all? Even a small experimental variant, where selected LiveCodeBench tasks are run with and without such a preamble, could already be very informative.
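To make the idea concrete, here is a rough sketch of what such a with/without-preamble comparison could look like. The function and helper names (`compare_pass_rates`, `solve`, `check`, the `wfgy_3.0.txt` filename, and the shape of the task list) are hypothetical placeholders for illustration only, not the actual LiveCodeBench harness:

```python
"""Sketch of an A/B run: same tasks, with and without a fixed text preamble.

All names here are hypothetical placeholders, not the LiveCodeBench API:
- solve:  prompt -> candidate solution (one model call)
- check:  (task prompt, solution) -> True if the solution passes the tests
- tasks:  a list of task prompts selected from the benchmark
"""
from typing import Callable, Sequence


def compare_pass_rates(
    tasks: Sequence[str],
    solve: Callable[[str], str],
    check: Callable[[str, str], bool],
    preamble: str,
) -> tuple[float, float]:
    """Return (baseline pass rate, pass rate with the preamble prepended)."""

    def pass_rate(prefix: str) -> float:
        # Run every task with the given prefix and count passing solutions.
        passed = sum(check(task, solve(prefix + task)) for task in tasks)
        return passed / len(tasks)

    return pass_rate(""), pass_rate(preamble + "\n\n")


# Hypothetical usage:
# from pathlib import Path
# preamble = Path("wfgy_3.0.txt").read_text()
# baseline, loaded = compare_pass_rates(tasks, solve, check, preamble)
# print(f"baseline {baseline:.3f} vs. with preamble {loaded:.3f}")
```

The only moving part is the fixed prefix, so any score difference between the two runs can be attributed to the preamble itself rather than to changes in the tasks or the evaluation.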
If this is compatible with your design philosophy, I can draft a minimal proposal so you can quickly judge whether it is worth integrating or better kept as an external experiment.