Beyond CEFR level prediction of texts in learner corpora: Exploring feedback to learners and learning analytics

October 30, 2019 all-day
Nicolas Ballier

A one-day workshop at Université de Paris, 30 Oct 2019, Olympe de Gouges Building, room 115, first floor (tbc)

Provisional programme

MORNING: discussing our results

9h00 Opening, N. Ballier: The Ulysse PHC project: aims, data and limitations

9h20 Thomas Gaillat: Investigating learner micro-systems and customizing CEFR criterial features: the micro-system feature set and its regex syntax

9h40 Discussion

10h30 Bernardo Stearns (tbc) and Annanda Sousa: The user interface prototype demo
We hope to deliver a Docker and a GitHub version of our user interface that lets you paste a text, have a coffee while the text is processed, and then get the probability that the text belongs to a given CEFR level.

10h45 Discussion

11h15 Andrew Simpkins: Overfitting? Comparison with a graded corpus
As a preliminary step, we have tested our current user interface with the CEFR ASAG corpus to check whether our model is biased towards the A1 level.

11h30 General discussion

12h15 LUNCH BREAK (poster session at Diderot)
Posters displayed at Diderot and on a shared Google Drive for remote participants (titles tbc).
Thomas Gaillat: the Viz project for visualizing metrics
Carlos Balhana (Cambridge): Grammatical Error Correction and Interlanguage Event Representation
Vinogradova et al.: a module for punctuation with the REALEC data
Vinogradova et al.: the REALEC web interface: data, activities, technologies
Volodina et al.: A System Architecture for Intelligent CALL: examples of NLP approaches to Swedish
Nikolay Babakov: recommendation system for CEFR-indexed texts (from Russian to English?)
O’Donnell et al.: The concept approach to learner errors (incl. details on data, NLP techniques used)

AFTERNOON: Learner corpora and beyond: collecting and interpreting learning process and product data

A blueprint is to be circulated pointing out potential future directions.

13h30 STRAND 1 Adding more metrics/NLP-based methods for error detection / problematic areas for learners

15h STRAND 2 Exploring the relation between learner corpus annotation, language testing, and individual feedback to learners

16h30 coffee break

17h STRAND 3 Should we try to link learner corpus and learning analytics research – and what is there to be gained? Ideas for tracking development paths? (Fuchs, Götz & Werner 2016) How to develop learner profiles based on student input?

18h15 Closing remarks and future plans

18h30 End of the workshop

Call for participation

As a closing event of a European-funded project, we invite colleagues to share their ideas about the automatic analysis of learner corpora and how it can be applied to interlanguage analysis, CEFR level prediction, and error detection – and extended to support individual feedback to learners and learning analytics.

The morning session will present some of the results of this French-Irish project, “PHC Ulysse 2019”: the features of the EFCAMDAT corpus we used as the first step for our experiments, the methodology we developed, and our main findings. We will present our prototype user interface for the automatic detection of CEFR levels and discuss aspects such as overfitting of a model based on the French and Spanish components of EFCAMDAT. We will also discuss the shared task we held on a portion of this

We will discuss posters over lunch recapitulating some of the issues. Poster presenters are asked to send their A0 PDF to by Oct 15th midnight, summarizing their approach, which may include results previously presented. The afternoon functions as a round table intended to build collaborations and extensions of our project and discuss potential work packages for a follow-up project. Invited colleagues will summarize their methodologies and share their views on possible next steps.

Admission is free but registration is compulsory (on a first come, first served basis) on this webpage:

The summary of the Ulysse PHC Project can be found here:

Discussants at Diderot:

Taylor Arnold (University of Richmond) is Assistant Professor of Statistics at the University of Richmond and has a strong interest in NLP as a data scientist and digital humanist, see

Detmar Meurers (University of Tübingen) is Professor of Computational Linguistics and head of the research group on Intelligent Computer-Assisted Language Learning there:

Contact person:
Nicolas Ballier :
