Argumentative essay assessment with LLMs: A critical scoping review

Authors: Favero, L. A., Gaudeau, G., Pérez-Ortiz, J. A., Käser, T., Oliver, N.

Publication: 2026

Large Language Models are rapidly reshaping Automated Essay Scoring (AES), yet the methodological, conceptual, and ethical foundations of Argumentative Automated Essay Scoring (AAES) remain underdeveloped. This critical review synthesizes 46 studies published between 2022 and 2025, following PRISMA 2020 guidelines and a preregistered protocol. We map the landscape of LLM-based AAES across six dimensions: datasets, traits, models, methods, evaluation, and analytics. Our findings show that AAES research remains fragmented and insufficiently grounded in argumentation theory. The field relies on non-comparable datasets that vary in availability, prompt diversity, rater configuration, and linguistic background. Trait analysis reveals substantial overrepresentation of rhetorical and linguistic features and sparse coverage of reasoning-oriented constructs (e.g., logical cogency, dialectical quality). Studies mainly rely on proprietary GPT-family models and rubric-based prompting, while only a minority employ fine-tuning, multi-agent approaches, or reasoning LLMs. Evaluation practices remain uneven: although studies report high human-model agreement, robustness analyses expose sensitivity to prompting, score distributions, and learner proficiency. FATEN analyses reveal recurrent concerns regarding fairness (e.g., style and L1 bias), transparency, randomness sensitivity, limited pedagogical alignment, and an absence of work on privacy or deployment safety. Taken together, the evidence suggests that while LLMs can approximate human scoring on several traits, current systems insufficiently model core argumentative reasoning and lack the validity, interpretability, and accountability required for high-stakes assessment. We conclude by proposing a research agenda focused on construct-valid datasets and rubrics, psychometric modeling, transparent evaluation protocols, and responsible design frameworks.