Assessing Yemeni EFL Learners’ Oral Skills via the Conceptualization of Target Language Use Domain: A Testing Framework

There is an evident lack of a comprehensive basis for evaluating Yemeni learners’ speaking skills in the English Department, Hodeidah University. The present paper presents a detailed framework of oral assessment criteria that involves a description of target language use domains and then shows how such domains can be systematically related to test design. The framework takes as its main goal the development and description of a criterion-referenced rating scale representing real-world criterion elements. The aim of the testing framework, therefore, is to ensure maximum appropriateness of test score interpretations and to maximize the validity and fairness of local speaking tests. A five-point Likert-scale questionnaire was administered to elicit 10 trained raters’ perceptions of using the pilot scale. The research findings support the use and appropriateness of the scale, as it helps raters identify underlying aspects of their learners’ oral discourse that cannot be observed in traditional discrete-point tests.


Introduction
In the current teaching situation, teachers usually rate students on the basis of intuition rather than disciplined testing scales, whether the speaking task in hand is an interview, a role-play or a presentation. Segmental features of pronunciation and the use of appropriate vocabulary are the only aspects of oral discourse evaluated by local raters. Suprasegmental features of pronunciation and the morphological and syntactic features of words and sentences are not attended to, or even listed in the examiners' score sheet, if there is one. Learners' oral discourse is sometimes recorded, and raters subjectively arrive at a score, mostly basing their overall assessment on whether learners' speech is intelligible or unintelligible. Hence, one central design decision of the testing framework relates to providing proper rating information to rating scale users. The rating scale, developed as an integral component of the present testing framework, comprises multiple descriptors drawn from multiple sources of information, principally associated with linguistic and stylistic features of real language use. The aim is to help teachers rate accurately how well a student can speak the language according to pre-defined criteria for different levels of performance. EFL teachers, therefore, can base decisions about test takers' actual performance on multiple sources of authentic discourse-based information rather than on traditional constructed-response items, and can thus tailor effective instruction that fits the learners' needs in their subsequent learning.

Testing speaking skills
Speaking is a difficult skill to teach and evaluate, particularly in an EFL context where learners have limited exposure to an L2 environment and teachers use L2 materials that mainly adopt the written norms of language, favoring accuracy at the expense of fluency. Bailey and Savage (1994) stated that "Speaking in a second or foreign language has often been viewed as the most demanding of the four skills … yet for many people, speaking is seen as the central skill" (pp. vi-vii). Speaking is also, as Golebiowska (1990) put it, "the major and one of the most difficult task confronting any teacher of languages" (p. ix).
Several studies in testing language performance have pinpointed crucial considerations and guidelines for developing and conducting speaking tests. For instance, Gorsuch (2001) pointed to the need for appropriate selection of published speaking tests, as many textbook-based speaking tests do not provide adequate opportunities for learners to exhibit their speaking abilities. Fulcher and Márquez (2003) claimed that one way to reduce task difficulty is to consider cross-cultural differences in developing a speaking test. With regard to non-verbal language, Jenkins and Parra (2003) suggested that non-verbal strategies should be evaluated in any oral interview, as they can establish interactive involvement in the same way verbal strategies do. Another consideration that is rarely assessed in oral tests is the evaluation of sociolinguistic rules of speaking. Lazaraton (1992) pointed out that aspects of conversational interaction such as turn taking, minimal responses and fillers should be built into effective criteria for assessing oral proficiency. Orr (2002) pointed to the necessity of training raters and encouraging them to follow the criteria on which the rating scales are based. The influence of gender on test takers' performance in oral interviews has also been a controversial issue in SLA. O'Loughlin (2002) claimed that gender does not affect individuals' performance on speaking tests, regardless of the gender of the raters or the test takers. In contrast, O'Sullivan (2000) claimed that test takers performed better when interviewed by women, regardless of the gender of the participants, as women usually tend to show emphatic support more than men do.

Performance rating scales
According to Underhill (1987), the purpose of using a rating scale is: "To describe briefly what the typical learner at each level can do, so that it is easier for the assessor to decide what level or score to give each learner in a test. The rating scale therefore offers the assessor a series of prepared descriptions, and she then picks the one which best fits each learner" (p. 98).
This seems to be an overly simplified view of rating scales, given the complexity of validity and reliability issues in assessing language performance. Second language inquiry represents a broader scope in second language assessment, with multiple perspectives and a wider application of sophisticated testing methodologies (Bachman & Savignon, 1986; Bachman, Lynch & Mason, 1995; Douglas & Selinker, 1993; Fulcher, 1996, 2011; Elder, Iwashita & McNamara, 2002; Matthews, 1990; Robinson, 2001).
The literature is also rife with discussions and overviews regarding the validity, reliability and appropriateness of performance rating scales (Bachman & Savignon, 1986; Fulcher, 1987; Fulcher & Márquez, 2003; Matthews, 1990; Upshur & Turner, 1995, 1999). Those studies pointed out reliability problems associated with published rating scales, such as raters' inconsistency, the inadequacy of such scales for measuring learners' progress in later stages of their learning, and the usability of rating scales in different learning settings. Validity problems were examined even more closely. They involve the mismatch between a scale's descriptors and the language features addressed by course objectives, the inability of language learners to address some pre-defined descriptors, and the ordering of linguistic features in rating scales.
Further, Brindley (1998) argued that a valid rating scale should leave some room for personal judgments by raters, particularly when they are faced with vague descriptors. Nevertheless, raters' judgments can be problematic, as they can affect the whole process of performance assessment (Brown, 1995; Caban, 2003; Kim, 2009). Upshur and Turner (1995) argued that two raters of the same student's performance would arrive at different results, as each rater has his or her own interpretation of scale descriptors. Lumley (2005) argued that raters' agreement on the interpretation of test scores is not because of the rating scale, but is rather "derived from the broadly common experience shared by raters, that of language teaching" (p. 301).
Clearly established criteria for rating performances can be found in Norris, Brown, Hudson, and Yoshioka (1998, p. 10), who claimed that performance rating scales should be based on appropriate: (a) categories of language learning and development; (b) breadth of information regarding learner performance abilities; and (c) standards that are both authentic and clear to students. They further held that, to enhance the reliability and validity of decisions as well as accountability, performance assessments should be combined with other methods for gathering information (for instance, self-assessments, portfolios, conferences, classroom behaviors, and so forth).

Test Fairness
For the last two decades, the process of test validation has been a central issue in a great deal of research focusing on the development and use of educational tests. One important consideration in using a test is that it must be fair to all candidates and that its measures should not carry any bias (Winke, Gass, & Myford, 2013). Such validity, as pointed out by Roever (2005), provides optimal opportunities for test takers to exhibit their potential language abilities relevant to the purpose of the test. A test, then, should not exclude test takers on any basis other than the examinee's lack of knowledge. Test takers should be able to demonstrate the skills the test is intended to measure regardless of age, gender, disability, race, ethnicity, or any other personal characteristics.
Many SLA studies have argued that training raters is key to increasing test fairness, the validity of oral assessment and the accuracy of reporting test scores, particularly when the assessment criteria involve multiple descriptors (Kim, 2009; Elder, Barkhuizen, Knoch, & von Randow, 2007). Fairness is, then, not an isolated concept, but must be conceptualized as an essential element throughout the process of designing and using oral assessment tests. Fairness, for instance, should extend to the accurate reporting of individual and group test results.
The present testing framework has been developed bearing in mind the above-mentioned concerns, with the aim of providing EFL teachers with well-articulated testing practices that help ensure teachers operate a fairer testing scale when interpreting test scores. As Taylor (2006) pointed out, "teaching and testing depend heavily upon having well-described models of language use" (p. 58). Hence, the study provides a description of the target language use domain and the test task, a description of the tables of specifications, the test procedures, the scoring method, and a description of the scale's descriptors.
IJALEL 3(5):57-71, 2014

Rationale
The uniqueness of the proposed framework lies in the fact that it seeks to establish a reciprocal correspondence between real-life tasks and the definitions of the actual abilities to be assessed. Such a relationship can be seen clearly in the detailed specification of test procedures used to support inferences about real language use. Further, the framework considers the assessment of speech styles, which are rarely mentioned in published speaking tests. Speech styles are included in this testing framework as they are important means of initiating conversational interaction between interlocutors and will show the degree of students' involvement while performing the role-play activities. In addition, the present framework is designed with the Yemeni EFL teaching and learning context in mind, considering factors such as large EFL classes, the newness of the proposed testing criteria to local teachers and test takers, and course objectives.

Purpose
The test task is designed to provide evidence of students' ability to converse appropriately in a small interactive talk by role-playing the act of "Buying Transportation Tickets" (plane ticket/train ticket/bus ticket). In addition, the test is meant to provide students with meaningful feedback in order to guide them in their subsequent development in speaking.

Test type and scoring method
In the same vein, the interpretation of the test scores is based on a criterion-referenced scale of multiple descriptors of real language use in order to better describe students according to their potential ability to perform the task in hand. The test constitutes part of an achievement test. Students in pairs will role-play a task taken from the general content area "encounter services" and the thematic subdivision "choosing among different types of transportation tickets".

Target language use domain
The description of the target language use (TLU) domain is adapted from Bachman and Palmer's (1996) model of TLU task characteristics. The TLU domain is defined as "a set of specific language use tasks that the test taker is likely to encounter outside of the test itself, and to which we want our inferences about language ability to generalize" (Bachman & Palmer, 1996, p. 44). The sample of the TLU domain for the present project represents three different situations of buying transportation tickets (plane ticket/train ticket/bus ticket).

Construct definition
The construct definition in this test, following a construct-based approach to performance assessment (Bachman, 2002; Norris et al., 1998), is realized via predictions of the test takers' abilities to accomplish a role-play task. Hence, the construct is defined as the ability to converse in a small interactive talk in different situations of buying transportation tickets through role-play activities. This ability requires correct syntax, comprehensible pronunciation, adequate and appropriate use of vocabulary, and appropriate register. It also requires students to use grammatical, textual, functional and strategic competences by asking and answering questions about transportation tickets (price, class, schedule, time, stops), giving opinions (expensive/cheap prices), etc. Conversation characteristics (speech styles), such as the use of backchannels and fillers, will be assessed. Sociolinguistic rules such as register are also assessed. Writing and reading are not tested. Listening is involved in the performance but not tested.

Description of the test task
The task chosen is a representative sample of the above-mentioned TLU domain and bears characteristics similar to those of TLU domain tasks. The test task is a role-play. The students may choose among three situations of buying transportation tickets. The targeted students are second-year teacher trainees majoring in English at the English Department, Faculty of Education, Hodeidah University, Yemen. There are 60 students aged between 18 and 22. Each student chooses a partner, and each pair role-plays only one situation, for example, a dialogue about buying plane tickets. In each pair, one student plays the role of a clerk and the other plays the role of a customer.
With regard to the physical characteristics, the test is held in a small room in the English Department, Faculty of Education, Hodeidah, Yemen. The room is quiet at the time of the activity, well lit and non-distracting. Test takers have the option to bring materials such as maps or schedules. Test takers are familiar with the rater (their teacher) and with role-play activities. The participants are the test takers, who will play the roles of customers, employees, clerks, etc. Each pair of students should make up their own dialogue and act out the role-play in front of their teacher. The teacher will not take part in the role-play activities.
Considering the characteristics of the test rubric, instructions will be given one week before the test so that students have the chance to prepare themselves for the test task. The rubric is presented in the target language (English) through the written channel. Specifications of procedures and tasks are explicit. The test comprises three role-play tasks that involve buying transportation tickets. Five to ten minutes are allotted for each task.
The criteria for correctness are criterion-referenced. Students are evaluated on a language ability scale from 1 to 4 for use of appropriate pronunciation, vocabulary, morphosyntax, and speech styles. Regarding the scoring procedures, only one rater will rate students' performance on the criterion-referenced scale (1-4). The rater will follow predefined criteria for rating students.
The language characteristics involve organizational and pragmatic characteristics. The organizational characteristics include grammatical characteristics that are involved in producing accurate utterances using knowledge of morphology, syntax and phonology. The language domain contains general, formal/informal and frequent vocabulary used in situations of buying transportation tickets (tickets, train, plane, bus, etc.). Morphology and syntax consist primarily of organized structures. Phonology represents standard use of speech sounds. Nevertheless, some situations may involve examples of non-standard use of morphology, syntax and phonology. The organizational characteristics also include textual, cohesive and rhetorical characteristics. In the above-mentioned TLU domain, cohesion involves the use of a narrow range of cohesive devices, such as pronouns, linking words, adverbs, etc. The rhetorical organization involves clear organizational development of information in language use.
The pragmatic knowledge involves functional and sociolinguistic characteristics. The functional characteristics in the TLU domain involve ideational and manipulative functions, including requesting, asking for information, accepting, refusing, interrupting, etc. The sociolinguistic characteristics include features such as standard dialect, formal register, natural delivery of language, and minimal cultural references. The topical characteristics are relevant to the type of information and language features that are used in the above situations (e.g. the ability to ask about the direction of flights or the ability to provide information about the price of tickets).
An important category to be considered is the relationship between the input and the response, which Bachman and Palmer (1996) define as "the extent to which the input or the response affects subsequent input and responses" (p. 55). Such a relationship is reciprocal in terms of reactivity. That is, the participants usually exhibit interactive involvement when performing the task in hand. The scope of the relationship is narrow, as the relationship between the interlocutors in the above situations is often distant. The relationship is direct, as responses address specific questions in the input (See Appendix A for a description of the target language use domain).

Description of the table of specification
Many researchers have claimed that a better specification of scoring criteria might increase rater reliability (Hamp-Lyons, 1991; North, 1995, 2003; North & Schneider, 1998). In this testing situation, the table of specifications comprises four tables specifying the functions to be assessed with regard to the language construct of the present testing framework. As Chalhoub-Deville (2001, p. 225) put it, "Language testers and researchers need to expand their test specifications to include the knowledge and skills that underlie the language construct".
The assessment criteria, therefore, contain linguistic and stylistic aspects (pronunciation, morphosyntax, vocabulary and speech styles). Organizational features (textual and grammatical organization) are embedded in the description of the morphosyntax aspects. Pragmatic features (functional, sociolinguistic and topical characteristics) are embedded in the description of the speech styles aspects. Grammatical and pragmatic features of the TLU domain are also realized in aspects of task completion (greeting, asking for/offering help, requesting information, providing information, and thanking). Table (1) specifies the total score (100%), which is divided equally between linguistic and stylistic aspects (50%) and task completion (50%).

Component                          Score (%)
Linguistic and stylistic aspects   50
Task completion                    50
Total                              100

Table (2) presents the criteria for measuring linguistic and stylistic features. In the first column, there are four linguistic and stylistic features of spoken language that will be measured. Each feature will be assessed on a scale from 1 to 4, as shown in column two. Column three shows the weight (actual number) given to each criterion, by which the rating is multiplied to obtain the score in column four. Pronunciation is given the lowest score (10% of the total score), as the test takers are familiar with pronunciation aspects such as individual speech sounds, stress, intonation, etc. Speech styles (minimal responses, backchannels, fillers, etc.) are given the highest score (16% of the total score), as the test takers have only recently been introduced to aspects of speech styles. Morphosyntax and vocabulary are together given 24% of the total score. The four criteria total 50.
In table (3), five functions for task completion that will be observed during the test takers' conversational interaction are listed in the first column. A specific weight of the assessment criteria is dedicated to each function. Greeting and thanking are given only 10% of the total score (50%), as they are fixed formulaic expressions and students are supposed to know how and when to use them. Offering and asking for help, though a kind of formulaic speech, are weighted with 10% of the total score (50%). Requesting information and providing information are weighted with 30%, as they will enable the examiner to observe and assess his/her students' extended oral production and also their ability to use different speech styles. The presence of these functions will be scored (1) and their absence (0), as shown in the second column (1-0). The third column indicates the weight of the different tasks and the fourth column indicates the score and the percentage of each task.
As illustrated above, the test task has a composite score of 100 points, divided equally: 50% for linguistic and stylistic features and 50% for task completion. The examiner develops a set of instructions in English to guide the students in accomplishing the test task (See Appendix B). Furthermore, this set of instructions provides the students with the assessment criteria for language aspects and task completion (See Appendix C).
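The weighting described for tables (2) and (3) can be illustrated with a minimal sketch. The per-criterion weights below are assumptions inferred from the stated percentages (an equal split of the 24% between morphosyntax and vocabulary, of the 10% between greeting and thanking, and of the 30% between requesting and providing information); the published tables remain the authority where they differ, and all function and variable names are hypothetical.

```python
# Assumed weights: a band of 4 times the weight gives the maximum share
# stated in the text for each criterion (10 + 12 + 12 + 16 = 50).
LINGUISTIC_WEIGHTS = {
    "pronunciation": 2.5,   # 4 * 2.5 = 10
    "morphosyntax": 3.0,    # 4 * 3.0 = 12 (assumed half of the 24%)
    "vocabulary": 3.0,      # 4 * 3.0 = 12 (assumed half of the 24%)
    "speech_styles": 4.0,   # 4 * 4.0 = 16
}

# Assumed weights for the five task-completion functions (sum = 50).
TASK_WEIGHTS = {
    "greeting": 5,                  # assumed half of the 10%
    "asking_or_offering_help": 10,
    "requesting_information": 15,   # assumed half of the 30%
    "providing_information": 15,    # assumed half of the 30%
    "thanking": 5,                  # assumed half of the 10%
}


def linguistic_score(ratings):
    """Multiply each band (1-4) by its weight and sum the results."""
    if set(ratings) != set(LINGUISTIC_WEIGHTS):
        raise ValueError("ratings must cover exactly the four criteria")
    total = 0.0
    for criterion, band in ratings.items():
        if band not in (1, 2, 3, 4):
            raise ValueError(f"band for {criterion} must be 1, 2, 3 or 4")
        total += band * LINGUISTIC_WEIGHTS[criterion]
    return total


def task_completion_score(observed):
    """Add the weight of every function that was observed (1) and skip absent ones (0)."""
    if set(observed) != set(TASK_WEIGHTS):
        raise ValueError("observed must cover exactly the five functions")
    return float(sum(TASK_WEIGHTS[name] for name, present in observed.items() if present))


def composite_score(ratings, observed):
    """Composite score out of 100: linguistic/stylistic part plus task completion."""
    return linguistic_score(ratings) + task_completion_score(observed)


# Hypothetical test taker, for illustration only.
example_ratings = {"pronunciation": 3, "morphosyntax": 2, "vocabulary": 3, "speech_styles": 4}
example_observed = {
    "greeting": True,
    "asking_or_offering_help": True,
    "requesting_information": True,
    "providing_information": True,
    "thanking": False,
}
print(composite_score(example_ratings, example_observed))
```

Under these assumed weights, a test taker rated 4 on every criterion who performs all five functions reaches the full 100 points, which matches the equal 50/50 division described above.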
The scale includes descriptors that represent features of pronunciation, such as segmental aspects (e.g. individual sounds) and suprasegmental aspects (e.g. intonation, stress); morphosyntax (e.g. derivational and inflectional morphemes); vocabulary (e.g. nouns, adverbs of time); and features of speech styles (e.g. fillers, minimal responses) presented in Lazaraton (1992) and Biber et al. (1992). The test takers have been introduced to these features throughout four spoken English courses (See Appendix E).
Finally, a score sheet for both raters and students is developed in order to facilitate the recording and reporting of the assessment information (See Appendices C & D). In the above teaching situation, giving this kind of score sheet to students is unusual. However, providing students with this score sheet will be of great value, as it will not only help them focus on important areas of language ability but also guide them in their subsequent processing of features of real language use.

Test takers
The test takers in this speaking test are 60 intermediate second-year students majoring in English in the English Department, Faculty of Education, Hodeidah, Yemen. They are 20 males and 40 females, aged between 18 and 22, chosen from Spoken English (course 4), second semester. They are familiar with role-play activities, as these activities were regularly introduced to them earlier in their speaking classes.

Administration
The test will be administered at the end of the speaking course. The test will take place in a small room in the English Department. The sixty students will be divided into 30 pairs. The pairs will be tested in groups over two consecutive days. Each pair in a group is given only 5 to 10 minutes to perform the role-play activity. The test takers should be given the test instructions one week before the test.

Scoring procedures
The scale used for assessing the test takers' oral performance is an analytic scale. It is a criterion-referenced language ability scale, including four aspects of linguistic and stylistic features (pronunciation, morphosyntax, vocabulary and speech styles). Each aspect is assessed on a four-band scale (1-4: 1 is poor, 4 is excellent). This part of the test constitutes 50% of the total score of the test task (100%). The test also includes a second part for task completion that is weighted with 50%. Therefore, the test task has a total score of 100%. The teacher will not take part in the role-play activities. The teacher will be the main rater in this speaking test. However, the teacher will select a small sample of the students' performances to be rated by another rater in order to check the consistency of the ratings. The second rater's scores for this sample will then be correlated with the teacher's ratings of the same sample. Such criteria for scoring are operationalized to provide an insight into what raters should pay attention to in the process of rating and, thus, contribute towards the validation of rating scales.
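The correlation between the teacher's ratings and the second rater's ratings could be computed as in the following sketch. It assumes a Pearson product-moment correlation, which the text does not specify, and the scores shown are hypothetical values used only for illustration; the function name is likewise hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one rater gives identical scores")
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical composite scores (out of 100) for a sample of six test takers.
teacher_scores = [62.0, 75.5, 88.0, 54.5, 91.0, 70.0]
second_rater_scores = [60.0, 78.0, 85.5, 58.0, 89.0, 73.5]
print(round(pearson_r(teacher_scores, second_rater_scores), 3))
```

A coefficient close to 1 would suggest that the two raters rank the sampled test takers in a similar way; it does not by itself show that they assign the same absolute scores.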

Reliability
Consistency will be sought across situations. That is, the three different situations of buying transportation tickets should be carefully evaluated in terms of the level of difficulty, the performance required to accomplish each task, and the clarity of instructions. There should be intra-rater consistency in following the scoring criteria mentioned above. The teacher will select a 10% sample of the test scores to ensure inter-rater consistency. A standard error of assessment will be calculated in order to reasonably estimate the test takers' true scores and their relationship with the observed scores.
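The two reliability steps mentioned above can be sketched as follows. The sketch assumes that the "standard error of assessment" corresponds to the classical standard error of measurement, SEM = SD × √(1 − reliability), and that the 10% sample is drawn at random; neither detail is stated in the framework, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import pstdev


def draw_rating_sample(student_ids, fraction=0.10, seed=0):
    """Randomly pick a fraction of the test takers for second rating."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    sample_size = max(1, round(len(student_ids) * fraction))
    return sorted(rng.sample(student_ids, sample_size))


def standard_error_of_measurement(observed_scores, reliability):
    """Classical SEM: standard deviation of observed scores times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    # pstdev treats the scores as the whole group of test takers (an assumption).
    return pstdev(observed_scores) * sqrt(1.0 - reliability)


# With 60 test takers, a 10% sample contains six students.
sample = draw_rating_sample(list(range(1, 61)))
print(len(sample))
```

An observed score plus or minus one SEM then gives a rough band within which the test taker's true score is expected to fall, under the assumptions of classical test theory.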

Construct validity
The content of the test task should reflect the skills that are to be measured. This can be achieved by including tasks in the role-plays that require the test takers to provide evidence of using, for example, appropriate pronunciation and speech styles. The content of the test task, which involves the performance of aspects of language ability, should be primarily related to the content of the materials that the test takers have been taught in their speaking courses. It should also be authentic, in that it reflects aspects of the target language use domain. Thus, the content of the test should be a representative sample of the relevant language skills that students have been introduced to and that reflect the target language use domain.

Impact
Students should receive meaningful feedback to guide them in their subsequent learning. The teacher will meet the students after the speaking test to discuss their performance. The students should take part in the discussion and describe their own experiences in preparing for the role-play activities. In addition, decision procedures should be applied uniformly to all groups of test takers. In this way, all students are treated fairly regardless of individual test takers' group membership.

Practicality
Due to the number of students, some considerations should be taken into account. First, only five to ten minutes should be allotted to each role-play task. Second, role-play tasks should be administered over two consecutive days so that the teacher can conduct the activities without becoming exhausted and so that the practice effect of role-playing the same test activities is reduced. Third, tasks should not be administered during working hours, in order to avoid noise and disturbance. They will be administered in the afternoon, after closing hours, at 3:00 p.m.
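The practical load implied by these constraints can be estimated with a short calculation. The sketch assumes that the 30 pairs are split evenly across the two days and ignores changeover time between pairs; both are assumptions, and the function name is hypothetical.

```python
def testing_time_per_day(n_students=60, days=2, min_minutes=5, max_minutes=10):
    """Return (pairs per day, minimum minutes per day, maximum minutes per day)."""
    n_pairs = n_students // 2
    pairs_per_day = -(-n_pairs // days)  # ceiling division, in case of an uneven split
    return pairs_per_day, pairs_per_day * min_minutes, pairs_per_day * max_minutes


pairs, shortest, longest = testing_time_per_day()
print(pairs, shortest, longest)
```

With the figures given in the text, this amounts to 15 pairs per day and between 75 and 150 minutes of role-play per afternoon session.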

Methods
A five-point Likert scale is used to determine the raters' beliefs about using the rating scale. It consists of statements with which the respondents can express agreement or disagreement. Each statement is given a numerical score to reflect its degree of attitudinal approval. The Likert scale includes 12 items, grouped into two categories. The first category includes six items that address the possible limitations of the proposed rating scale, whereas the six items in the second category address the potential advantages of using the rating scale. It is thought that twelve items would give a good picture of raters' perceptions of the proposed rating scale, considering that all raters chosen to participate are M.Ed. holders and are able to express clearly their stand on the use of the pilot scale.
This Likert scale is appropriate for the present study, as the purpose is to urge test users and developers to operationalize comprehensive rating scales to ensure validity and test fairness in oral assessment. Such a methodology, however, is thought to be of little value if it relies on students' opinions of the way their oral abilities are judged. This is because students might not be candid in reporting their true perception of such a rating scale. Linguistic and stylistic features involved in the oral assessment could be viewed by many students as difficult to cope with, requiring extra effort to incorporate such features in their oral discourse, regardless of their importance in any speaking context. Hence, learners might resist being judged on multiple aspects of oral discourse, and as such there is a strong probability that stakeholders would turn down the proposal and stick to traditional discrete-point tests.
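The item and category means on which the analysis relies could be computed as in the following sketch. It assumes that items 1-6 form the limitations category and items 7-12 the advantages category, and that each rater's responses are stored as a list of twelve ratings from 1 to 5. The two response rows are hypothetical and do not reproduce the study's data; all names are illustrative.

```python
from statistics import mean

LIMITATION_ITEMS = range(1, 7)    # first category: possible limitations
ADVANTAGE_ITEMS = range(7, 13)    # second category: potential advantages


def item_means(responses):
    """Mean rating per item (1-12) across all raters."""
    for row in responses:
        if len(row) != 12 or any(rating not in (1, 2, 3, 4, 5) for rating in row):
            raise ValueError("each rater must give twelve ratings between 1 and 5")
    return {item: mean(row[item - 1] for row in responses) for item in range(1, 13)}


def category_means(means):
    """Average of the item means within each of the two categories."""
    return {
        "limitations": mean(means[item] for item in LIMITATION_ITEMS),
        "advantages": mean(means[item] for item in ADVANTAGE_ITEMS),
    }


# Two hypothetical raters, for illustration only.
hypothetical_responses = [
    [1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5],
    [2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4],
]
print(category_means(item_means(hypothetical_responses)))
```

Because the first category is worded in terms of limitations, a low mean there and a high mean in the second category both point towards a favourable view of the scale.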

Participants
The participants are ten English teachers, 7 females and 3 males. All of them hold an M.Ed. in language teaching and education. Their teaching experience in schools and at Hodeidah University ranges from 3 to 8 years. They have been introduced, in a workshop, to the concepts of test validity and test fairness, to the different descriptors included in the proposed rating scale (grammatical, linguistic and stylistic features, etc.), and to how they can effectively incorporate them while undertaking the oral assessment procedures.

Current results and Discussion
Specifically for the present study, in the first category (items 1-6), lower means indicate the raters' disagreement with any possible limitations of the speaking rating scale. Therefore, lower means show the positive side of the Likert scale. In the same category, higher means indicate the raters' agreement on the presence of clear limitations in the rating scale. Higher means then represent the negative side of the Likert scale. In table (5), items 1, 2, 4 and 5 have lower means and, as specified above, these are substantially significant. These items are consistent with the usability and usefulness of the rating scale for oral assessment. Interestingly, in the first category, items 3 and 6 show higher means, though not substantially significant, indicating raters' agreement on two issues. Regarding item 3, the raters show a noticeable tendency towards the need for special training in the use of the new speaking rating scale. Item 6 indicates that most raters feel that the multiple descriptors involved in the scale could be problematic and difficult for students to cope with. This seems to be normal, as it is their first time operationalizing such a scale in assessing learners' oral skills. In individual interviews, the raters revealed that the two-hour workshop was not enough to gain a good grasp of the scale descriptors and that they had to examine the scale several times. Nine raters mentioned that the pilot scale guided them to focus more on different aspects of students' oral discourse. Seven raters indicated that the scale was objective and as such helped them arrive easily at a score. All raters felt that the elaborated descriptive scale would provide students with specific information about where they did well and what they need to work on. In general, the pilot speaking scale was perceived positively by the raters. However, one limitation is that the pilot scale is a complete novelty for raters, and that could affect the way they use it in their rating process. More prolonged use of the rating scale could yield even more tangible evidence of the efficacy of such a scale for local speaking tests.

Conclusion
The present study sought to design a testing framework for assessing Yemeni learners' oral skills via the development of a rating scale of multiple descriptors representing features of real language in use. The testing framework, in the present study, underpins the use of a performance-based approach to oral assessment that is operationalized via the description of the test task in relation to observable domains of target language use. Accordingly, the rating scale is developed to bear features of real-world communication. The rating scale, therefore, places primary value upon observations of language performance, with the aim of offering a more descriptive and complete picture of learners' performance than that of single-criterion rating scales or discrete-point tests. In sum, the present scale is meant to provide meaningful interpretations and inferences from test scores to the type of learners' actual performance in specified domains of target language use.
A vital research goal is, then, to place confidence in the quality of information and interpretation of test scores provided by local raters to the examination board. Another goal is to give guidance for test users and test developers in choosing and selecting appropriate testing tools, delivering valid interpretation of test scores, and providing test takers with appropriate feedback for their subsequent learning.
It is worth mentioning here that a sound and more effective scaling and description of real-world language elements that can be traced back to actual performance can be seen in Fulcher's (2011) Performance Decision Tree model. The model is innovative in that it describes pragmatic and discourse variables via a boundary choice approach at arbitrary levels rather than ordering such variables onto a single scale. However, such a scale is novel and needs to be put to the test to validate its effectiveness for scoring speaking tests in classroom practice.
To conclude, the study's findings support the use and the appropriateness of the rating scale as a measure of speaking proficiency, as well as the utility of the devised discourse-based descriptors for the validation of speaking tasks in other assessment contexts.

Table 1. Total Score

Table 2. Criteria for measuring linguistic and stylistic features

Table (4) displays an overall representation of the scoring criteria for both linguistic/stylistic features and task completion.

Table 5
In the second category, higher means indicate the raters' agreement on the usability and usefulness of the rating scale for assessing oral skills. The overall means in the second category (items 7 to 12) are between 4.4 and 4.8. Such results, as specified for the second category, are substantially significant and indicate positive attitudes towards the use of the pilot rating scale.
The mean of item 12, in particular, is substantially significant. It clearly shows the raters' positive attitudes towards the necessity of validating such an informative speaking rating scale so that it can be officially operationalized in the teaching context at Hodeidah University. The mean of item 8 (4.7) is also informative of the fairness and validity of the rating scale as perceived by the raters.