A Study of the Lexical Complexity of Homogeneous Texts Using Stochastic Modeling and Analysis

Yanhui Zhang


This paper takes a system dynamics approach to the study of homogeneous texts, focusing on how the lexical richness of such texts evolves over time. It is hypothesized that the growth of lexical complexity is driven by how far the process is from its maximum level of complexity, while remaining subject to fluctuations arising from the dynamic nature of the system. It is shown that the lexical dynamics of homogeneous texts can be effectively modeled by a stochastic differential equation with proper upper bounds. The linguistic validity and the statistical goodness of fit of the model are tested empirically on the texts of CGWR. Given the ubiquity of diffusion phenomena in language and linguistic studies (e.g., language development), the findings of the current work should provide a useful methodological reference for comparison with classic approaches such as statistical regression.
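The kind of model the abstract describes can be illustrated with a minimal simulation sketch. The abstract does not give the paper's equation, so everything below is an assumption made for illustration: the specific form dX = rate (x_max - X) dt + sigma (x_max - X) dW, the Euler-Maruyama discretization, the clipping at the upper bound, and all names (`simulate_richness`, `x_max`, `rate`, `sigma`) and parameter values.

```python
import math
import random


def simulate_richness(x0, x_max, rate, sigma, n_steps, dt=1.0, seed=0):
    """Euler-Maruyama path of a bounded growth process (illustrative only).

    Assumed form, not taken from the paper:
        dX = rate * (x_max - X) dt + sigma * (x_max - X) dW
    The drift pulls X toward the upper bound x_max in proportion to the
    remaining gap; the noise term also shrinks as the gap closes.
    """
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        gap = x_max - path[-1]
        # Brownian increment over one time step: Normal(0, dt)
        dw = rng.gauss(0.0, math.sqrt(dt))
        step = rate * gap * dt + sigma * gap * dw
        # Clip at the assumed maximum level of complexity
        path.append(min(path[-1] + step, x_max))
    return path


# Hypothetical parameter values chosen only to produce a visible trend
path = simulate_richness(x0=0.1, x_max=1.0, rate=0.05, sigma=0.02, n_steps=200)
print(f"start={path[0]:.3f}, end={path[-1]:.3f}")
```

With these assumed values the simulated richness rises quickly at first and levels off near `x_max`, which is the qualitative behavior the hypothesis describes: growth slows as the distance to the maximum shrinks.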


Lexical Richness, Homogeneous Texts, Dynamical Complexity, Language Diffusion, Stochastic Modeling





DOI: http://dx.doi.org/10.7575/aiac.alls.v.11n.5p.1



This work is licensed under a Creative Commons Attribution 4.0 International License.

2010-2021 (CC-BY) Australian International Academic Centre PTY.LTD.

Advances in Language and Literary Studies
