Is your feature request related to a problem? Please describe.
Yes. Current LLM evaluation systems struggle to detect deep structural and syntactic errors in complex sentences, especially in morphologically rich languages like Turkish, where embedded clauses can change meaning without visible surface-level errors. This leads to incomplete evaluation of model performance in real-world complex language scenarios.
Describe the solution you'd like
I propose a structural evaluation layer that transforms sentences into hierarchical representations and recursively detects embedded clause structures across different syntactic roles. This allows the system to identify where LLM outputs preserve or distort deep grammatical relationships rather than only evaluating surface-level correctness.
Describe alternatives you've considered
Traditional evaluation methods such as BLEU scores, semantic similarity metrics, and general parsing approaches. However, these methods do not capture recursive embedded clause structures or fine-grained syntactic dependencies, especially in complex sentence constructions.
Additional context
I have implemented a working prototype in Google Colab that demonstrates this approach using structured JSON representations and recursive extraction of embedded clauses in Turkish sentences, including medical-style texts. A demo video is attached showing the full pipeline from sentence input to structured output extraction.
Demo 🔗 👉 https://screenrec.com/share/4dIDcZSfbF
Is your feature request related to a problem? Please describe.
Yes. Current LLM evaluation systems struggle to detect deep structural and syntactic errors in complex sentences, especially in morphologically rich languages like Turkish, where embedded clauses can change meaning without visible surface-level errors. This leads to incomplete evaluation of model performance in real-world complex language scenarios.
Describe the solution you'd like
I propose a structural evaluation layer that transforms sentences into hierarchical representations and recursively detects embedded clause structures across different syntactic roles. This allows the system to identify where LLM outputs preserve or distort deep grammatical relationships rather than only evaluating surface-level correctness.
Describe alternatives you've considered
Traditional evaluation methods such as BLEU scores, semantic similarity metrics, and general parsing approaches. However, these methods do not capture recursive embedded clause structures or fine-grained syntactic dependencies, especially in complex sentence constructions.
Additional context
I have implemented a working prototype in Google Colab that demonstrates this approach using structured JSON representations and recursive extraction of embedded clauses in Turkish sentences, including medical-style texts. A demo video is attached showing the full pipeline from sentence input to structured output extraction.
Demo 🔗 👉 https://screenrec.com/share/4dIDcZSfbF