SPEAKING ASSESSMENT: IMPACT OF TRAINING SESSIONS

The article focuses on the problem of examiner objectivity in rating Speaking proficiency in a foreign language at standardized high-stakes tests. Since different factors may affect assessment reliability, special training sessions are widely used by testing centers. They are expected to eliminate examiners' subjectivity and lead to interrater agreement and intrarater consistency. The research described in the article was aimed at finding empirical evidence of the efficiency of such sessions. The outcomes of the study showed the sessions to be efficient in terms of the examiners' rating accuracy.

Sample papers, scoring scales and rubrics were given to the participants at the beginning of the training session. Cambridge Assessment FCE Oral Exam educational videos were used as samples at the initial stage of the training session. The training sessions were conducted by professional educators, assessors and teachers.
The 40 participants assessed 20 candidates each: 10 at a pre-training session and 10 at a post-training one. The responses were scored holistically by each rater individually in 5 domains (Grammar and Vocabulary, Discourse Management, Pronunciation, Interactive Communication and Global Achievement), and the scoring was adapted to a 10-point scale.
As a result, the 2-day event provided the study with 2,000 scores for the pre-training assessment session and 2,000 scores for the post-training session (40 raters × 10 candidates × 5 domains in each). This does not include the experts' preliminary scoring, which was carried out according to the same specially created scoring scale (150 scores in total: 50 for the first session and 100 for the second). The expert scores were later used as reference scores and compared with the participants' scores to calculate the level of agreement before and after the training session, as well as to estimate each participant's degree of severity or leniency and intrarater consistency.
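The comparison described above is straightforward to sketch in code. The fragment below is a minimal illustration in Python, not the authors' actual procedure: the placeholder data, the array layout and the use of mean absolute discrepancy as an agreement index are all our assumptions.

```python
# Minimal sketch: comparing rater scores with an expert benchmark.
# Shapes, names and the random placeholder data are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# 40 raters x 10 candidates x 5 domains, on the adapted 10-point scale
rater_scores = rng.integers(1, 11, size=(40, 10, 5)).astype(float)

# Expert benchmark: one reference score per candidate and domain
benchmark = rng.integers(1, 11, size=(10, 5)).astype(float)

# Discrepancy: rater score minus benchmark (broadcast over raters).
# Positive values indicate leniency, negative values severity.
discrepancy = rater_scores - benchmark              # shape (40, 10, 5)

# Mean discrepancy per rater summarizes severity/leniency ...
severity = discrepancy.mean(axis=(1, 2))            # shape (40,)

# ... while mean absolute discrepancy serves as a crude agreement index
agreement = np.abs(discrepancy).mean(axis=(1, 2))   # shape (40,)

for r, (s, a) in enumerate(zip(severity, agreement), start=1):
    print(f"rater {r:2d}: mean discrepancy {s:+.2f}, mean |disc| {a:.2f}")
```

Run separately on the pre-training and post-training score matrices, the same computation yields the before/after comparison used in the analysis below.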
On the first day of training, the format of the FCE and its assessment scale with rubrics were introduced to the participants so that they knew what to expect during the test and what to do. There was no further instruction on the methods of assessing the exemplars at this stage, so the scores received from the participants on the first day were later used as the pre-training results. On the second day, immediately after 8 hours of training on how to assess each of the criteria, the participants assessed another group of 10 candidates, and the results obtained were used as the post-training results.
The instruction consisted of 5 lectures, one on each criterion of the FCE scoring band: 1) "English for Global Opportunities. Speaking Overview of the Most Widespread International Exams"; 2) "What's in a Tongue: Reflecting on Vocabulary and Grammar of Spoken English"; 3) "Being Ready for Everything. Exam Interviews"; 4) "Intelligible or is it? Teaching and Evaluating Pronunciation for B2 Level Exams"; 5) "We Need to Talk... Assessment Criteria for Interactive Communication". The training sessions included discussions of the assessment of each criterion using the exemplars and were followed by a question-and-answer session for further clarification of the rating procedure.
The obtained scores were brought together in special forms for each participant for both the pre-training and post-training assessments. A discrepancy score for each criterion and each exemplar was then calculated by comparing the raters' scores with the benchmark.

Table 2. First-day pre-training discrepancy scores

As a discrepancy score does not show the relation of a participant's scores to those of the whole group and their distribution around the mean, a z-score for each participant was calculated for further analysis. A positive z-score shows that the rater's score is more lenient than that of an expert; a negative z-score, on the contrary, indicates the examiner's higher than expected severity.

The comparative analysis of the results before and after the training shows that the discrepancy z-score of most of the raters improved. The highest degree of leniency, a z-score of 2.22 for rater No. 29 and the most "subjective" positive score at the first stage, changed to 0.46 at the second stage, signifying the highest degree of training efficiency among the raters. The most severe score, that of rater No. 7, still differed considerably from the benchmark at the second stage, but its severity was halved, a significant positive dynamic. Some raters, such as No. 4, 22, 26 and 39, tended to become even harsher after the training, and No. 14 and No. 16, surprisingly, increased their severity tenfold. No. 6 and No. 8, though becoming harsher and changing their discrepancy score from positive to negative, still moved toward the benchmark. Conversely, examiners No. 10, 11 and 34 developed an even higher level of leniency, and No. 36 changed the score from negative to positive so dramatically that the new leniency score even exceeded the previous severity score. However, No. 13, 32 and 40, having changed their results from negative to positive, approached the benchmark.

Table 4 illustrates the progress or regression of the raters whose z-score at either of the stages was above +1 or below −1. The analysis of the data leads to the conclusion that for most of the raters the training session was efficient and resulted in their scores approaching the benchmark. Out of 40 raters, only 10 showed negative changes in their assessment; thus, the efficiency of the training session can be rated at 75%.
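To make the z-scores in the tables easier to interpret, the standardization they imply can be written out. The notation below is ours, a plausible reconstruction consistent with the description above rather than the authors' stated formula:

$$\bar{d}_r = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(x_{rnk} - b_{nk}\right), \qquad z_r = \frac{\bar{d}_r - \mu}{\sigma},$$

where $x_{rnk}$ is rater $r$'s score for candidate $n$ in domain $k$, $b_{nk}$ is the expert benchmark, $\bar{d}_r$ is the rater's mean discrepancy, and $\mu$ and $\sigma$ are the mean and standard deviation of the $\bar{d}_r$ values across all 40 raters. On this reading, $z_r > 0$ marks leniency relative to the rater pool and $z_r < 0$ severity, matching the interpretation used above.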

Fig. 2. Raters' progress after training
The study was also accompanied by 2 background Google Forms surveys. One was intended to collect personal data on the participants, and the other was a follow-up survey used to collect the participants' feedback about their possible progress after the training. Interestingly, the "subjective" self-assessment of the participants' progress and the outcomes of the statistical calculations appeared to be almost identical.

Fig. 3. Self-assessment data of the participants' progress
The findings of the study reported above show that though the examiners' bias in Speaking assessment was not absolutely eliminated, most of the raters became more accurate in their scoring. This supports the idea that although causal connections between attitudes and outcomes cannot be proved, it may be assumed that training which produces greater user satisfaction is more effective, "leading to greater compliance with the benchmarks and hence increased conformity of rater behavior" [4; p. 57].
The slight differences in the raters' severity, which cannot be eliminated, can thus "be modelled in MFRA to some extent and the reduction of the variability in raters' severity should not be the main purpose of rater training" [6; p. 4].
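For context, the many-facet Rasch analysis (MFRA) referred to here treats rater severity as an explicit facet of the measurement model. In Linacre's standard formulation, the log-odds of candidate $n$ receiving category $k$ rather than $k-1$ on criterion $i$ from rater $j$ is

$$\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k,$$

where $B_n$ is the candidate's ability, $D_i$ the difficulty of criterion $i$, $C_j$ the severity of rater $j$, and $F_k$ the difficulty of scale category $k$. Since $C_j$ is estimated for each rater, residual severity differences can be adjusted for statistically rather than trained away, which is the point of the quotation above.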
Conclusions. In terms of personal attitude to possible progress in rating, the participants with little or no confidence in their progress showed less improvement than those who felt more confident about the benefit of the training. As for the level of the examiners' severity and leniency, it changed in most cases, but differently: some of the raters became harsher while others became more lenient. However, 75% of the participants of the training sessions made progress toward the benchmark score, which signifies a noticeable increase in their objectivity.