Notebook Example

[1]:
from lexicalrichness import LexicalRichness
import lexicalrichness
lexicalrichness.__version__
[1]:
'0.4.0'
[2]:
# Enter your own text here if you prefer
text = """Measure of textual lexical diversity, computed as the mean length of sequential words in
                a text that maintains a minimum threshold TTR score.

                Iterates over words until TTR scores falls below a threshold, then increase factor
                counter by 1 and start over. McCarthy and Jarvis (2010, pg. 385) recommends a factor
                threshold in the range of [0.660, 0.750].
                (McCarthy 2005, McCarthy and Jarvis 2010)"""

# instantiate new text object (use the tokenizer=blobber argument to use the textblob tokenizer)
lex = LexicalRichness(text)

Attributes

[3]:
# Get list of words
list_of_words = lex.wordlist
print(list_of_words[:10], list_of_words[-10:])
['measure', 'of', 'textual', 'lexical', 'diversity', 'computed', 'as', 'the', 'mean', 'length'] ['factor', 'threshold', 'in', 'the', 'range', 'of', 'mccarthy', 'mccarthy', 'and', 'jarvis']
[4]:
# Return word count (w).
lex.words
[4]:
57
[5]:
# Return (unique) word count (t).
lex.terms
[5]:
39

Type-token ratio (TTR; Chotlos 1944, Templin 1957):

\[TTR = \frac{t}{w}\]

where \(t\), or \(t(w)\), is the number of unique terms in a text of \(w\) words.

[6]:
# Return type-token ratio (TTR) of text.
lex.ttr
[6]:
0.6842105263157895
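
As a rough cross-check, the counts behind TTR can be sketched in plain Python. The regex tokenizer below (lowercase, runs of ASCII letters only) is a hypothetical stand-in used for illustration; it is not necessarily how the library preprocesses text.

```python
import re


def simple_tokenize(text):
    """Hypothetical tokenizer: lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(tokens):
    """TTR = t / w, with t unique terms among w words."""
    w = len(tokens)
    if w == 0:
        raise ValueError("TTR is undefined for an empty token list")
    t = len(set(tokens))
    return t / w


# First sentence of the example text above
sample = ("Measure of textual lexical diversity, computed as the mean length of "
          "sequential words in a text that maintains a minimum threshold TTR score.")
tokens = simple_tokenize(sample)
print(len(tokens), len(set(tokens)), type_token_ratio(tokens))
```

Shorter texts tend to have higher TTR values, which is one reason the measures that follow adjust for text length.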

Root TTR (RTTR; Guiraud 1954, 1960):

\[RTTR = \frac{t}{\sqrt{w}}\]
[7]:
# Return root type-token ratio (RTTR) of text.
lex.rttr
[7]:
5.165676192553671

Corrected TTR (CTTR; Carroll 1964):

\[CTTR = \frac{t}{\sqrt{2w}}\]
[8]:
# Return corrected type-token ratio (CTTR) of text.
lex.cttr
[8]:
3.6526846651686067

Herdan’s C (Herdan 1960, 1964):

\[C = \frac{\log(t)}{\log(w)}\]
[9]:
# Return Herdan's C
lex.Herdan
[9]:
0.9061378160786574

Summer’s index (Summer 1966):

\[Summer = \frac{\log(\log(t))}{\log(\log(w))}\]
[10]:
# Return Summer's index
lex.Summer
[10]:
0.9294460323356605

Dugast’s index (Dugast 1978):

\[Dugast = \frac{\log(w)^2}{\log(w) - \log(t)}\]
[11]:
# Return Dugast's index
lex.Dugast
[11]:
43.074336212149774

Maas’s index (Maas 1972):

\[Maas = \frac{\log(w) - \log(t)}{\log(w)^2}\]
[12]:
# Return Maas's index
lex.Maas
[12]:
0.023215679867353005
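
The measures from RTTR to Maas's index depend only on the word count \(w\) and the term count \(t\), so they can be recomputed directly from the two counts reported above. The sketch below assumes natural logarithms; with that assumption the printed values should agree with the outputs above.

```python
from math import log, sqrt

w, t = 57, 39  # word count and unique term count reported above

rttr = t / sqrt(w)                        # root TTR
cttr = t / sqrt(2 * w)                    # corrected TTR
herdan_c = log(t) / log(w)                # Herdan's C
summer = log(log(t)) / log(log(w))        # Summer's index
dugast = log(w) ** 2 / (log(w) - log(t))  # Dugast's index
maas = (log(w) - log(t)) / log(w) ** 2    # Maas's index (reciprocal of Dugast)

for name, value in [("RTTR", rttr), ("CTTR", cttr), ("Herdan C", herdan_c),
                    ("Summer", summer), ("Dugast", dugast), ("Maas", maas)]:
    print(f"{name:>8}: {value:.4f}")
```

Note that Dugast's and Maas's indices are undefined when every word is unique, because \(\log(w) - \log(t)\) is then zero.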

Yule’s K (Yule 1944, Tweedie and Baayen 1998):

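\[K = 10^4 \times \left\{\sum^{n_{\text{max}}}_{i=1} f(i,w) \left(\frac{i}{w}\right)^2 -\frac{1}{w} \right\}\]

where \(f(i,w)\) is the number of terms that occur \(i\) times in a text of \(w\) words, and \(n_{\text{max}}\) is the highest term frequency.

[13]:
# Return Yule's K
lex.yulek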
[13]:
153.8935056940597

Yule’s I (Yule 1944, Tweedie and Baayen 1998):

\[I = \frac{t^2}{\sum^{n_{\text{max}}}_{i=1} i^2f(i,w) - t}\]
[14]:
# Return Yule's I
lex.yulei
[14]:
22.36764705882353

Herdan’s Vm (Herdan 1955, Tweedie and Baayen 1998):

\[V_m = \sqrt{\sum^{n_{\text{max}}}_{i=1} f(i,w) \left(\frac{i}{w} \right)^2 - \frac{1}{t}}\]
[15]:
# Return Herdan's Vm
lex.herdanvm
[15]:
0.08539428890448784

Simpson’s D (Simpson 1949, Tweedie and Baayen 1998):

\[D = \sum^{n_{\text{max}}}_{i=1} f(i,w) \frac{i}{w}\frac{i-1}{w-1}\]
[16]:
# Return Simpson's D
lex.simpsond
[16]:
0.015664160401002505
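
Yule's K, Yule's I, Herdan's Vm and Simpson's D are all built from the frequency spectrum \(f(i,w)\), the number of terms that occur exactly \(i\) times. The sketch below shows one way to compute them from a list of tokens. The function names are hypothetical, and Herdan's Vm subtracts \(1/t\) here, an assumption chosen because it reproduces the value printed above.

```python
from collections import Counter
from math import sqrt


def frequency_spectrum(tokens):
    """Return {i: f(i, w)}: how many terms occur exactly i times."""
    return Counter(Counter(tokens).values())


def spectrum_measures(tokens):
    """Compute four spectrum-based measures; needs at least one repeated term."""
    w = len(tokens)
    t = len(set(tokens))
    spectrum = frequency_spectrum(tokens)
    # sum over i of f(i, w) * (i / w)^2, shared by Yule's K and Herdan's Vm
    sum_sq = sum(f * (i / w) ** 2 for i, f in spectrum.items())
    # sum over i of i^2 * f(i, w), used by Yule's I
    sum_i2 = sum(f * i ** 2 for i, f in spectrum.items())
    if sum_i2 == t:
        raise ValueError("Yule's I is undefined when no term is repeated")
    return {
        "yule_k": 1e4 * (sum_sq - 1 / w),
        "yule_i": t ** 2 / (sum_i2 - t),
        "herdan_vm": sqrt(max(0.0, sum_sq - 1 / t)),  # guard against rounding below zero
        "simpson_d": sum(f * (i / w) * ((i - 1) / (w - 1)) for i, f in spectrum.items()),
    }


tokens = "the cat and the dog and the bird".split()
print(dict(frequency_spectrum(tokens)))
print(spectrum_measures(tokens))
```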

Methods

MSTTR: Mean segmental type-token ratio

  • Computed as the average of the TTR scores of the segments in a text

  • Split a text into consecutive segments of length segment_window and compute the TTR of each segment. The MSTTR score is the sum of these scores divided by the number of segments

  • (Johnson 1944)

[17]:
lex.msttr(
    segment_window=25  # size of each segment
)
[17]:
0.88
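
The segment-and-average procedure can be sketched as follows. The function name is hypothetical, and dropping a trailing segment shorter than segment_window is an assumption; it is the choice that reproduces the 0.88 above for the 57-word example.

```python
def msttr_sketch(tokens, segment_window):
    """Mean TTR over consecutive, non-overlapping segments of equal length."""
    if not 0 < segment_window <= len(tokens):
        raise ValueError("segment_window must be between 1 and the number of tokens")
    # Assumption: a trailing segment shorter than segment_window is discarded
    n_segments = len(tokens) // segment_window
    scores = []
    for k in range(n_segments):
        segment = tokens[k * segment_window:(k + 1) * segment_window]
        scores.append(len(set(segment)) / segment_window)
    return sum(scores) / n_segments


# Segments: ['a', 'a'] -> 0.5 and ['b', 'c'] -> 1.0; the leftover 'd' is dropped
print(msttr_sketch(["a", "a", "b", "c", "d"], segment_window=2))  # 0.75
```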

MATTR: Moving average type-token ratio

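  • Computed as the average of the TTRs over successive, overlapping windows of a text

  • Estimate the TTR for tokens 1 to n, 2 to n+1, 3 to n+2, and so on until the end of the text (where n is window_size), then take the average of all windows’ TTRs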

  • (Covington 2007, Covington and McFall 2010)

[18]:
# Return moving average type-token ratio (MATTR).
lex.mattr(
    window_size=25  # Size of each sliding window
)
[18]:
0.8351515151515151
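
A minimal sliding-window sketch of the same idea is shown below. The function name is hypothetical and the code favours clarity over speed: each window's set of terms is recomputed from scratch.

```python
def mattr_sketch(tokens, window_size):
    """Average TTR over every window of window_size consecutive tokens."""
    if not 0 < window_size <= len(tokens):
        raise ValueError("window_size must be between 1 and the number of tokens")
    n_windows = len(tokens) - window_size + 1
    total = 0.0
    for start in range(n_windows):
        window = tokens[start:start + window_size]
        total += len(set(window)) / window_size
    return total / n_windows


# Windows: ['a', 'a'] -> 0.5 and ['a', 'b'] -> 1.0
print(mattr_sketch(["a", "a", "b"], window_size=2))  # 0.75
```

Unlike MSTTR, every token contributes to the score because the windows overlap and no trailing segment is left over.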

MTLD: Measure of Textual Lexical Diversity

  • Computed as the mean length of sequential words in a text that maintains a minimum threshold TTR score

  • Iterates over words until the TTR score falls below a threshold, then increases the factor counter by 1 and starts over

  • (McCarthy 2005, McCarthy and Jarvis 2010)

[19]:
lex.mtld(
    # Factor threshold for MTLD.
    # Algorithm skips to a new segment when TTR goes below the threshold
    threshold=0.72
)
[19]:
46.79226361031519
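
The factor-counting loop can be sketched as below. Several details are assumptions rather than statements about the library: the function names are hypothetical, a TTR at or below the threshold closes a factor, the leftover words contribute a partial factor of (1 - TTR) / (1 - threshold), and the final score averages a forward and a backward pass. With these choices the sketch reproduces the value above for the 57-word example.

```python
def _mtld_one_direction(tokens, threshold):
    """Number of tokens divided by the (possibly fractional) factor count."""
    terms = set()
    word_count = 0
    factors = 0.0
    ttr = 1.0
    for token in tokens:
        word_count += 1
        terms.add(token)
        ttr = len(terms) / word_count
        if ttr <= threshold:  # assumption: reaching the threshold closes a factor
            factors += 1
            terms.clear()
            word_count = 0
    if word_count > 0:
        # Partial factor for the words left over after the last full factor
        factors += (1 - ttr) / (1 - threshold)
    if factors == 0:
        raise ValueError("TTR never reached the threshold; MTLD is undefined here")
    return len(tokens) / factors


def mtld_sketch(tokens, threshold=0.72):
    """Average of a forward pass and a backward pass over the tokens."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2


# TTR drops to 0.5 after every second word, giving 2 factors over 4 words
print(mtld_sketch(["a", "a", "a", "a"]))  # 2.0
```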

voc-D

  • Vocd score of lexical diversity derived from a series of TTR samplings and curve fittings

  • Step 1: Take 100 random samples of 35 words from the text. Compute the mean TTR from the 100 samples

  • Step 2: Repeat this procedure for samples of 36 words, 37 words, and so on, up to ntokens words (default and recommended value: 50). For each sample size, compute the mean TTR over the 100 samples. This gives an array of mean TTR values for ntokens=35, ntokens=36, …, up to ntokens=50

  • Step 3: Fit the theoretical curve of TTR against sample size (ntokens) to these mean TTR values. The value of D that provides the best fit is the vocd score

  • Step 4: Repeat steps 1 to 3 a number of times given by iterations (default=3) and average the resulting D values. This average is the returned value
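
The four steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: all function names are hypothetical, samples are drawn without replacement, the model curve is the commonly cited form TTR = (D/N)(sqrt(1 + 2N/D) - 1), and a simple grid search over D stands in for a proper curve-fitting routine. Because the random sampling differs, the result will not match the library's output digit for digit.

```python
import random


def model_ttr(d, n):
    """TTR predicted for a sample of n tokens under diversity parameter d."""
    return (d / n) * ((1 + 2 * n / d) ** 0.5 - 1)


def mean_sampled_ttr(tokens, n, within_sample, rng):
    """Step 1: mean TTR over within_sample random samples of n tokens."""
    ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(within_sample)]
    return sum(ttrs) / within_sample


def fit_d(sizes, mean_ttrs, d_max=200.0, step=0.1):
    """Step 3: grid search for the d with the least squared error (swapped-in fitter)."""
    best_d, best_err = step, float("inf")
    for k in range(1, int(round(d_max / step)) + 1):
        d = k * step
        err = sum((model_ttr(d, n) - y) ** 2 for n, y in zip(sizes, mean_ttrs))
        if err < best_err:
            best_d, best_err = d, err
    return best_d


def vocd_sketch(tokens, ntokens=50, within_sample=100, iterations=3, seed=42):
    """Average the fitted D over several rounds of sampling and fitting."""
    if ntokens < 35 or len(tokens) < ntokens:
        raise ValueError("need 35 <= ntokens <= number of tokens")
    rng = random.Random(seed)
    sizes = list(range(35, ntokens + 1))
    estimates = []
    for _ in range(iterations):  # step 4: repeat steps 1 to 3
        mean_ttrs = [mean_sampled_ttr(tokens, n, within_sample, rng) for n in sizes]  # steps 1-2
        estimates.append(fit_d(sizes, mean_ttrs))  # step 3
    return sum(estimates) / iterations


# Toy text: 60 tokens drawn from a 20-word vocabulary
toy_tokens = [f"w{i % 20}" for i in range(60)]
print(vocd_sketch(toy_tokens))
```

The grid search caps D at d_max, so a text whose sampled TTRs are all close to 1 will simply return that cap.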

[20]:
lex.vocd(
    ntokens=50,  # Maximum number for the token/word size in the random samplings
    within_sample=100,  # Number of samples
    iterations=3,  # Number of times to repeat steps 1 to 3 before averaging
    seed=42  # Seed for reproducibility
)
[20]:
46.27679899103406
[21]:
# Calling vocd() with its default arguments returns the same value
lex.vocd()
[21]:
46.27679899103406

voc-D plot utility

  • Utility to plot empirical voc-D curve and the best fitting line

[22]:
lex.vocd_fig(
    ntokens=50,  # Maximum number for the token/word size in the random samplings
    within_sample=100,  # Number of samples
    seed=42,  # Seed for reproducibility
    savepath="images/vocd.png",
)
[22]:
<AxesSubplot:xlabel='Sample size', ylabel='TTR'>
[Figure: empirical voc-D curve and best-fitting line; x-axis: Sample size, y-axis: TTR]

HD-D

  • Hypergeometric distribution diversity (HD-D) score

  • (McCarthy and Jarvis 2007)

[23]:
lex.hdd(
    draws=42  # Number of random draws in the hypergeometric distribution
)
[23]:
0.7468703323966486
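
One way to read HD-D: for each term, take the probability that it appears at least once in a random sample of draws words taken without replacement, divide by draws, and sum over all terms. The sketch below implements that reading with the hypergeometric "at least one" probability. The function name is hypothetical and the formulation is an assumption about the method; it does reproduce the value above for the 57-word example.

```python
from collections import Counter
from math import comb


def hdd_sketch(tokens, draws=42):
    """Sum over terms of P(term appears in a sample of `draws` tokens) / draws."""
    n_tokens = len(tokens)
    if not 0 < draws <= n_tokens:
        raise ValueError("draws must be between 1 and the number of tokens")
    total_ways = comb(n_tokens, draws)
    score = 0.0
    for freq in Counter(tokens).values():
        # P(term absent) = C(N - f, draws) / C(N, draws); comb returns 0 if N - f < draws
        p_present = 1 - comb(n_tokens - freq, draws) / total_ways
        score += p_present / draws
    return score


# Two terms, each occurring twice; each has a 5/6 chance of appearing in 2 draws
print(hdd_sketch(["a", "a", "b", "b"], draws=2))
```

Under this reading the score equals the expected TTR of a random sample of draws words.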