Assessing Reliability of Two Versions of Vocabulary Levels Tests in Iranian Context

This study examined the equivalence and reliability of the two versions of the Vocabulary Levels Test in an Iranian context. It was motivated by the fact that the Vocabulary Levels Test is increasingly being used in Iran for both research and pedagogical purposes without having been checked for validity and reliability in this context. The equivalence and reliability of the two versions of the test were examined through the parallel-form approach to reliability in Classical True Score theory. Seventy-five intermediate learners of English as a foreign language at the Iran Language Institute took the two versions of the test in a counterbalanced fashion, with a one-week interval between the two administrations. To examine the equivalence of the two versions, the means and variances of the scores obtained for the two tests were compared using a paired-sample t-test and a one-way ANOVA, respectively. The results of the analyses indicated that the difference between the means of the two versions was significant, and the two versions cannot be considered parallel forms. To assess the reliability of the two versions, the correlation between the scores obtained from them was estimated using Pearson Product Moment correlation. The results showed that the two versions are highly correlated and are reliable tests. It is concluded that the two versions should not be treated as equivalent in longitudinal and gain score studies.


Introduction
Vocabulary is an important component of language use. As Wilkins (1972) states, "without grammar very little can be conveyed, without vocabulary nothing can be conveyed" (p. 111). Lack of vocabulary knowledge impedes English as a Second Language (ESL) learners' ability to comprehend messages or express themselves clearly in English. There is plenty of evidence pointing to the importance of vocabulary in language use. One of the most systematic explorations of the relationship between vocabulary knowledge and language proficiency occurred as part of the development of the DIALANG tests (Alderson, 2005, as cited in Schmitt, 2010). His research team compared scores on various vocabulary tests with the scores from the other language components of the DIALANG test. The results revealed that vocabulary has strong relationships with the language skills. The checklist test and the vocabulary test battery correlate with reading at .64, listening from .61 to .65, writing from .70 to .79, and grammar at .64. Several other studies have highlighted the importance of vocabulary knowledge for second language (L2) learners in reading (Haynes & Baker, 1993; Huckin & Bloch, 1993), speaking (Joe, 1998), listening (Elley, 1989; Ellis, 1994), and writing (Lee, 2003; Hinkel, 2001; Laufer & Nation, 1995). What the above studies appear to show is that vocabulary knowledge contributes a great deal to overall language success.
Given the importance of vocabulary in language use, one key issue in vocabulary studies is how much vocabulary is necessary to enable communication. The short answer is a lot, but it depends on one's learning goals. If one wishes to achieve native-like proficiency, then presumably it is necessary to have a vocabulary size similar to that of native speakers. There have been a few well-designed studies which provide reliable estimates of native language vocabulary size. Goulden, Nation, and Read (1990) found that their New Zealand university undergraduates had a vocabulary size of about 17,000 word families. D'Anna, Zechmeister, and Hall (1991) found that their university students knew a little under 17,000 of the headwords in the 1980 Oxford American Dictionary. Schmitt (2010) remarks that despite the fact that native speakers will always vary in their vocabulary size to some extent, a range of 16,000-20,000 word families seems a fair estimate of the vocabulary size for educated native speakers.
Luckily, L2 learners do not need to achieve native-like vocabulary sizes in order to use English well. A more reasonable vocabulary goal for these learners is the amount of lexis necessary to enable the various forms of communication in English. One of the most basic things a person might want to do is to communicate orally on an everyday basis (e.g., asking directions to the train station, describing one's holiday). If we assume that 98% of the vocabulary needs to be known (Hu and Nation, 2000), we can estimate the number of word families it takes to be able to engage in informal daily conversation. According to Nation (2006), a base of 6,000-7,000 word families is needed to meet this goal. He also estimated that 95% coverage would require knowledge of about 3,000 word families, plus proper nouns. Overall, the current evidence suggests that it requires between 2,000 and 3,000 word families to be conversant in English (if 95% coverage is adequate) or between 6,000 and 7,000 word families if 98% coverage is needed. However, there is not enough evidence to confidently establish a coverage requirement for listening at the moment. We are on firmer ground for estimates of written vocabulary. To further complete the picture of L2 vocabulary size, Nation (2006) used the British National Corpus (BNC) data and 98% coverage to calculate that 8,000-9,000 word families are required to read authentic texts (e.g., novels or newspapers) in English. As such, it makes sense to be able to measure learners' knowledge of vocabulary.
The need for reliable and valid tests of vocabulary size is a critically important issue in the field of second language acquisition (SLA). This is equally true whether we are interested in pedagogical assessment in classrooms or in language acquisition research. Given this, one might expect there to be an accepted vocabulary test available for these uses. However, this is not the case yet. In lamenting the lack of reliable tests for measuring vocabulary size, Meara (1996) states that the nearest thing the field has to a standard test in vocabulary is the Vocabulary Levels Test (Nation, 1983, 1990). Different versions of the test have been widely employed in both assessment and research around the world, and it is starting to be used for different purposes in Iran. Despite this widespread use, the test has been properly checked for reliability and validity by very few studies. This article aims at examining the reliability of the test in an Iranian context.

Reliability
A language test as a measuring instrument is required to generate individual scores that are reliable and valid. Ary, Jacobs, and Sorensen (2010) define reliability of a measuring instrument as the "degree of consistency with which it measures whatever it is measuring" (p. 236). Reliability means that scores from an instrument are stable and consistent. Scores should remain nearly the same when researchers administer the instrument on different occasions. Also, scores need to be consistent. When an individual answers certain questions one way, the individual should consistently answer closely related questions in the same way. Validity is the development of sound evidence to demonstrate that the test interpretation (of scores about the concept or construct that the test is assumed to measure) matches its proposed use (AERA, APA, NCME, 1999).
On a theoretical level, reliability is concerned with the effect of error on the consistency of scores. We must be concerned about errors of measurement, or unreliability, because we know that test performance is affected by factors other than the abilities we want to measure. Bachman (1990) groups factors other than communicative language ability that affect performance on language tests into the following three broad categories: (1) test method facets; (2) attributes of the test taker which are not among the language abilities we are interested in measuring; and (3) random factors which are not stable over time and cannot be predicted. Systematicity is a feature of the test method facets, which means that they do not vary from one test administration to the next. Attributes of the test takers refer to individual characteristics such as cognitive style and content knowledge of a particular area, or group characteristics such as gender, race, and ethnic background. Similar to test method facets, such attributes are systematic to the extent that they affect individuals' test performance in a regular pattern. An individual's test score is not affected only by systematic sources of error. Unsystematic, or random, factors also exert their impact to some extent. Ary et al. (2010) introduce three sources of random error that may lead to inconsistency in scores:
1. Characteristics of the individual: variations in individuals' motivation, level of fatigue, physical health, anxiety, and other mental and emotional factors may affect test results.
2. The administration procedures and conditions: administering or scoring of a test may depart from standardized procedures. Testing conditions such as light, heat, ventilation, time of day, and the presence of distractions can influence test performance. Also, test-taking instructions and directions may not be clear enough. The scoring method may introduce a further source of error.
3. The testing instrument: a major threat to reliability is a test being brief. The longer the test, the more reliable it is likely to be. A short test gives a small sample of behavior and this may result in an unstable score. Luck has a greater chance to contribute in a short test than in a long test.
The effect of such random errors of measurement on the consistency of test scores is what reliability deals with. The classical true score theory approaches the issue of estimating reliability in three different ways, each of which is concerned with different sources of error. Error sources originating within the test and scoring procedures are addressed in internal consistency estimates, while stability estimates show the degree to which test scores are consistent across different administrations. The comparability of scores on alternate forms of a test is examined through the equivalence estimates (Bachman, 1990).
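The logic of the classical true score model referred to above can be written compactly. The block below is a sketch in commonly used notation rather than the notation of any source cited here: the symbols x, x_t, x_e and the variance symbols are assumptions introduced for illustration, and the decomposition of variance assumes that error scores are uncorrelated with true scores.

```latex
% Observed score = true score + error score
x = x_t + x_e

% With errors uncorrelated with true scores, the variances add
s_x^2 = s_t^2 + s_e^2

% Reliability: the proportion of observed-score variance that is
% true-score variance. For strictly parallel forms x and x', it is
% estimated by the correlation between the two forms.
r_{xx'} = \frac{s_t^2}{s_x^2} = 1 - \frac{s_e^2}{s_x^2}
```

Under this reading, a larger share of random error variance lowers the reliability coefficient, which is why each of the three estimation approaches targets a different source of error.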
Equivalence estimates (parallel forms reliability) check the equivalence of scores from alternate versions of a test.
In cases where internal consistency estimates are not possible or suitable, the equivalence estimates are appropriate options for estimating reliability. In some situations, equivalent forms of a test may already be used to decrease the practice effect or the chance of cheating. The parallel forms reliability is particularly suitable in such cases (Bachman, 1990). If the students' scores obtained from alternate forms of a test in different administrations are correlated, the related coefficient is known as the coefficient of stability and equivalence. Two facets of test reliability are revealed by this coefficient: fluctuations in test performance from one occasion to another and fluctuations from one form of the test to another. A high coefficient of stability and equivalence suggests that the same ability is measured by the two forms of the test and that this measurement is consistent across different occasions. This is the most challenging and the most accurate measure available for estimating the reliability of a test (Ary et al., 2010).
In order to determine the reliability of alternate forms of a given test, the procedure used is to administer both forms to a group of individuals. The means and standard deviations for each of the two forms can then be computed and compared to determine their equivalence, after which the correlation between the two sets of scores can be computed. This correlation is then interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one (Bachman, 1990).
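The procedure just described (descriptive statistics for each form, followed by the correlation between the two sets of scores) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the eight pairs of scores are hypothetical and are not data from any study, and the sketch does not include the significance tests that would be needed to compare the means and variances formally.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired lists with at least two scores")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def parallel_forms_summary(form_a, form_b):
    """Means, sample standard deviations, and correlation for two forms."""
    return {
        "mean_a": mean(form_a),
        "mean_b": mean(form_b),
        "sd_a": stdev(form_a),
        "sd_b": stdev(form_b),
        "r": pearson_r(form_a, form_b),
    }


# Hypothetical scores of the same eight test takers on two forms.
form_a = [52, 61, 47, 80, 66, 39, 73, 58]
form_b = [55, 60, 50, 84, 65, 43, 75, 61]
summary = parallel_forms_summary(form_a, form_b)
print(summary)
```

Similar means and standard deviations together with a high correlation would be consistent with parallel forms; a high correlation alone would not, since two forms can rank test takers identically while differing in difficulty.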

The Vocabulary Levels Test
The Vocabulary Levels Test was developed in the early 1980s by Paul Nation at the Victoria University of Wellington in New Zealand. Initially, it constituted a simple instrument serving the classroom purpose of helping teachers prepare a vocabulary teaching and learning plan. It was then published in Nation (1983, 1990) and has been extensively used in New Zealand and elsewhere since then (Xing & Fulcher, 2007). In an initial validation study in 1988, Read established the reliability of the instrument. He also found that individual scores for each frequency level formed an implicational scale according to which knowledge of lower-frequency words indicated knowing higher-frequency ones. Read's (1988) study was the last attempt for some time to validate the Levels Test. However, as Nation's book turned out to be a major vocabulary reference source, the test gained international popularity.
Ten years after the test was first published, Norbert Schmitt revised the Levels Test in Nation's book (Version A) and wrote three additional versions (Versions B, C and D) using new sets of words for each level. The original specifications remained intact in the new versions. No validation study was conducted for the versions written by Schmitt at that time. Still, the four versions' potential as a useful assessment instrument received considerable attention in numerous educational environments. Some vocabulary research studies have also used the tests as their instrument (e.g., Cobb, 1997; Schmitt and Meara, 1997; Laufer and Paribakht, 1998). Beglar and Hunt (1999) administered two forms of Schmitt's (1993) version for the 2000-Word-Level and for the University-Word-Level to EFL learners in secondary and tertiary institutions in Japan. They used the results to select 54 items among the best-performing ones to develop two fresh tests for each level containing 27 items each. They then equated the two pairs of tests statistically. Schmitt, Schmitt, and Clapham (2001) undertook a similar test-development project with the four full forms of the test. They administered the tests to 106 non-native speaking British university students and created two longer versions which included 30 items instead of the original 18. This article examines these two new versions developed by Schmitt et al. (2001).
The Vocabulary Levels Test uses a word-definition matching format that requires test-takers to match words to definitions. Rather than giving a single estimate of total vocabulary size, it measures knowledge of words at five levels: 2000, 3000, 5000, 10,000, and academic English words. Each level contains 30 items which are arranged in 10 clusters.
The ratio of different word classes in English is maintained in the test, with each section containing five noun clusters, three verb clusters, and two adjective clusters. The instructions for a noun cluster read as follows: You must choose the right word to go with each meaning. Write the number of that word next to its meaning. There are three definitions on the right and six words on the left. Candidates need to choose three out of the six words to match the three on the right. In total at each level, 30 definitions need to be matched to 30 out of 60 words. Schmitt et al. (2001) summarize the considerations kept in mind while writing each cluster as follows:
1. The options in this format are words instead of definitions.
2. The definitions are kept short, so that there is a minimum of reading, allowing for more items to be taken within a given period of time.
3. Words are learned incrementally, and tests should aim to tap into partial lexical knowledge.The Levels Test was designed to do this.The option words in each cluster are chosen so that they have very different meanings.Thus, even if learners have only a minimal impression of a target word's meaning, they should be able to make the correct match.
4. The clusters are designed to minimize aids to guessing.The target words are in alphabetical order, and the definitions are in order of length.In addition, the target words to be defined were selected randomly.
5. The words used in the definitions are always more frequent than the target words.The 2000 level words are defined with 1000 level words and, wherever possible, the target words at other levels are defined with words from the GSL (essentially the 2000 level).This is obviously important as it is necessary to ensure that the ability to demonstrate knowledge of the target words is not compromised by a lack of knowledge of the defining words.
6. The word counts from which the target words were sampled typically give base forms. However, derived forms are sometimes the most frequent members of a word family. Therefore, the frequency of the members of each target word family was checked, and the most frequent one attached to the test.
7. As much as possible, target words in each cluster begin with different letters and do not have similar orthographic forms. Likewise, similarities between the target words and words in their respective definitions were avoided whenever possible (p. 59).
As was mentioned earlier, despite the widespread use of the Vocabulary Levels Test, very few studies have examined the reliability and validity of the test, none of which have been conducted in Iran. Also, since reliability will be a function not only of the test, but of the performance of the individuals who take the test, any given estimate of reliability based on the CTS model is limited to the sample of test scores upon which it is based (Bachman, 1990; Ary et al., 2010). Thus, the reliability estimates based on the CTS model reported in other studies cannot be transferred to the Iranian context, and we must always estimate the reliability of scores of the specific groups with whom we may want to use the test. The purpose of the current study is to estimate the equivalence and reliability of the two new versions of the Vocabulary Levels Test developed by Schmitt et al. (2001) for Iranian intermediate learners of English as a foreign language.

Research Questions
This article is an attempt to address the following two research questions formulated based on the purpose of the study:

Participants
Seventy-five intermediate students who were studying English as a foreign language at the Iran Language Institute in Boukan served as the participants of the present study. Their proficiency level was judged to be intermediate based on the institute's placement test. The participants ranged in age from 16 to 27. Out of the 75 participants in this study, 46 were males and 29 females.

Instrumentation
Version 1 and version 2 of the Vocabulary Levels Test developed by Schmitt et al. (2001) were used as the data elicitation instruments. Each version is composed of five sections: 2000, 3000, 5000, 10,000, and the academic vocabulary. Each section is made up of ten three-item clusters. The total possible score for each section is 30, and the total possible score for the whole test is 150 (see the Appendix).
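The score arithmetic described above (five sections, ten clusters of three items each) can be made explicit with a short sketch. The constant and function names below are hypothetical labels chosen for illustration, and the sample score sheet is invented; only the section structure and the maxima of 30 and 150 come from the description of the instrument.

```python
# Section labels follow the five levels named in the text.
SECTIONS = ("2000", "3000", "5000", "10000", "academic")
CLUSTERS_PER_SECTION = 10
ITEMS_PER_CLUSTER = 3

MAX_SECTION_SCORE = CLUSTERS_PER_SECTION * ITEMS_PER_CLUSTER  # 30
MAX_TOTAL_SCORE = MAX_SECTION_SCORE * len(SECTIONS)           # 150


def total_score(section_scores):
    """Sum per-section scores after checking each lies in 0..30."""
    for section in SECTIONS:
        score = section_scores[section]  # KeyError if a section is missing
        if not 0 <= score <= MAX_SECTION_SCORE:
            raise ValueError(f"score for section {section} out of range: {score}")
    return sum(section_scores[section] for section in SECTIONS)


# Hypothetical score sheet for one test taker.
example = {"2000": 27, "3000": 21, "5000": 14, "10000": 5, "academic": 18}
example_total = total_score(example)
print(example_total, "out of", MAX_TOTAL_SCORE)
```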

Data Collection Procedure
Before the study was conducted, version 1 of the test was piloted on four intermediate students at the same institute from which the participants came. The time needed to complete the test and the difficulties they had in the process were observed, and the data was used to set the time and administration conditions for the main study. The participants then took both versions of the test separately, with a one-week interval between the two administrations. Half of the participants took version 1 first, and the other half took version 2 first, to counterbalance the practice effect that would confound the relative equivalence of the two versions. Each participant took the two versions under the same testing conditions, including familiarity of the place, personnel, time of testing, and physical conditions.
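A counterbalanced order of administration of this kind can be set up as in the sketch below. It rests on assumptions that go beyond the description above: how the two halves were formed is not reported, so random assignment with a fixed seed is used here purely for illustration, and the function name and participant numbering are hypothetical. With an odd number of participants, such as 75, the two groups necessarily differ in size by one.

```python
import random


def assign_orders(participant_ids, seed=0):
    """Randomly split participants into two administration orders."""
    ids = list(participant_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    orders = {}
    for index, pid in enumerate(ids):
        if index < half:
            orders[pid] = ("version 1", "version 2")
        else:
            orders[pid] = ("version 2", "version 1")
    return orders


# Hypothetical participant numbers 1..75.
orders = assign_orders(range(1, 76))
version_1_first = sum(1 for order in orders.values() if order[0] == "version 1")
print(version_1_first, len(orders) - version_1_first)
```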

Data Analysis
The data obtained from the participants was analyzed using version 20.0 of the Statistical Package for the Social Sciences (SPSS). To examine the equivalence of the two versions and the individual sections, the means and standard deviations of the two versions and the individual sections were subjected to paired-sample t-tests and one-way ANOVAs, respectively. Furthermore, Pearson Product Moment correlation was used to examine the reliability of the two versions and the individual sections.
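For readers without access to SPSS, the paired-sample t statistic used for comparing the means can be sketched directly. The function name and the six pairs of scores below are hypothetical, and the sketch returns only the t value and its degrees of freedom; obtaining a p value would additionally require the t distribution, which is not included here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired-sample t statistic and degrees of freedom for two score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists with at least two scores")
    differences = [b - a for a, b in zip(first, second)]
    sd = stdev(differences)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(differences)
    t = mean(differences) / (sd / sqrt(n))
    return t, n - 1


# Hypothetical scores of six test takers on two versions of a test.
version_1 = [40, 55, 62, 71, 48, 90]
version_2 = [44, 54, 66, 75, 50, 95]
t_value, df = paired_t(version_1, version_2)
print(round(t_value, 3), df)
```

The paired design matters because the same learners sit both versions: the test is computed on the within-person differences rather than on two independent groups of scores.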

Results and Discussion
The first research question concerns the equality of the two versions of the test. As was mentioned earlier, two tests are considered equal or parallel if their means and variances are the same (Bachman, 1990). Table 1 shows the descriptive statistics for the scores obtained from the two versions of the test. The results indicate that the 2.13 difference between the means of the two versions is statistically significant (t = 2.28, N = 75, p < .05). Therefore, the means of the two versions are not the same. To compare the variances of the two versions of the test, their standard deviations were subjected to a one-way ANOVA. The results are shown in Table 3. The F value for the two standard deviations is .323, which is not significant, p = .571. This shows that the variances of the two versions are the same. However, since the means of the two versions are not the same, it is concluded that the two versions are not equal or parallel forms. Version 1 seems to be more difficult than version 2. To investigate the roots of this inequality between the two versions, the individual sections comprising the two versions were compared in terms of their means and standard deviations to determine if they are equal. Table 4 shows the descriptive statistics for the individual sections of each version. To determine whether the differences between the means and variances of the individual sections are significant, paired-sample t-tests and one-way ANOVAs were used for the mean differences and standard deviation differences, respectively. The results are shown in Table 5.
As can be seen in Table 5, some individual sections of the two versions are not equal or parallel. These sections are the 3000, 5000, and the academic sections, whose means for the two versions are statistically different. Therefore, it is safe to claim that the inequality of the two versions as a whole results from the inequality of the 3000, 5000, and the academic sections. The inequality in each section may stem from the unbalanced distribution of items with high levels of item difficulty.
To address the second research question, which concerns estimation of the reliability of the two versions of the test through a parallel forms approach, the correlation between the two versions was calculated. Correlations tell us to what extent two variables are related to each other. If there is a correlation between two variables, we should be able to impose a line on the scatterplot data points. The more tightly clustered around the line the data is, the stronger the correlation. Thus, the first assumption we must satisfy in order to test for correlation is that the relationship between the data is linear. The first step in performing correlation is to take a graphic look at our data (Larson-Hall, 2010). Figure 1 shows the scatterplot for the means of the scores obtained from the two versions. As can be seen, although the points do not lie in a perfect line, there is an obvious upward trend in the presented data. It is therefore appropriate to test for a linear relationship in the data by performing a correlation. To estimate the correlation between the means of the two versions of the Vocabulary Levels Test, Pearson Product Moment correlation was used. The results are shown in Table 6. The correlation coefficient obtained for the two versions is significant beyond the .01 level (r = .938, p < .001, N = 75). This high level of correlation can be taken as evidence for the reliability of both version 1 and version 2 of the Vocabulary Levels Test. Additional support for the reliability of the two versions comes from the correlation between the individual sections. Table 7 shows the correlation coefficients obtained for the individual sections of the two versions. As is reported in Table 7, all the sections of the two versions are correlated with each other, which can be taken as additional support for the reliability of the two versions. Therefore, it can be concluded that version 1 and version 2 of the Vocabulary Levels Test are reliable tests.
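As a rough check on the reported significance of the correlation, a Pearson coefficient can be converted to a t statistic with N - 2 degrees of freedom using the standard formula t = r * sqrt((N - 2) / (1 - r^2)). The sketch below applies this conversion to the reported values r = .938 and N = 75; the function name is hypothetical, and the calculation is offered as an illustration rather than as a reproduction of the SPSS output.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic (df = n - 2) for testing a Pearson correlation against zero."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 3:
        raise ValueError("need at least three paired observations")
    return r * sqrt((n - 2) / (1 - r * r))


# Values reported for the two versions of the test.
t_value = t_from_r(0.938, 75)
degrees_of_freedom = 75 - 2
print(round(t_value, 2), degrees_of_freedom)
```

A t value of this size with 73 degrees of freedom lies far beyond conventional critical values, which agrees with the conclusion that the correlation is significant beyond the .01 level.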

Conclusion
The study reported in this paper aimed at examining the equivalence and reliability of the two new versions of the Vocabulary Levels Test for Iranian learners of English as a foreign language. The results of the data analysis revealed a high correlation between version 1 and version 2 and between their individual sections. The two versions, therefore, can be considered highly reliable. However, the two versions were found not to be equivalent or parallel. Keeping this result in mind, treating the two versions as equal forms is untenable; hence, vocabulary researchers are warned against using them as parallel tests, particularly in the case of longitudinal or gain score research studies. However, given the relatively small scale of the differences between the mean scores that the two versions and their individual sections yield, they can probably be used in programs as alternate forms, as long as no high-stakes conclusions are drawn from a comparison between the two forms, and as long as the potential differences in scores between the two versions are kept in mind.

RQ 1: Are version 1 and version 2 of the Vocabulary Levels Test parallel forms?
RQ 2: Are version 1 and version 2 of the Vocabulary Levels Test reliable tests?

Figure 1. Scatterplot for the Means of Scores Obtained from Version 1 and Version 2

Table 1. Descriptive Statistics for Version 1 and Version 2 Scores

The mean of the scores obtained from version 1 of the test is 57.26, while the mean of the scores obtained from version 2 is 59.40. The standard deviations for version 1 and version 2 are 23.22 and 22.77, respectively. To determine whether the difference between the means of the two versions is significant, a paired-sample t-test was used. The results are shown in Table 2.

Table 3. One-Way ANOVA Results for Version 1 and Version 2 Standard Deviations

Table 4. Descriptive Statistics for the Scores of the Individual Sections of Version 1 and Version 2

Table 5. Paired-Sample T-Tests and One-Way ANOVA Results for the Means and Variances of the Individual Sections
** Difference is significant at the .01 level.

Table 6. Pearson Product Moment Correlation between Version 1 and Version 2
** Correlation is significant at the .01 level.

Table 7. Pearson Product Moment Correlation between Individual Sections of Version 1 and Version 2