Notebook Example

[1]:

from lexicalrichness import LexicalRichness
import lexicalrichness
lexicalrichness.__version__

[1]:

'0.4.0'

[2]:

# Enter your own text here if you prefer
text = """Measure of textual lexical diversity, computed as the mean length of sequential words in
                a text that maintains a minimum threshold TTR score.

                Iterates over words until TTR scores falls below a threshold, then increase factor
                counter by 1 and start over. McCarthy and Jarvis (2010, pg. 385) recommends a factor
                threshold in the range of [0.660, 0.750].
                (McCarthy 2005, McCarthy and Jarvis 2010)"""

# instantiate new text object (use the tokenizer=blobber argument to use the textblob tokenizer)
lex = LexicalRichness(text)

Attributes

[3]:

# Get list of words
list_of_words = lex.wordlist
print(list_of_words[:10], list_of_words[-10:])

['measure', 'of', 'textual', 'lexical', 'diversity', 'computed', 'as', 'the', 'mean', 'length'] ['factor', 'threshold', 'in', 'the', 'range', 'of', 'mccarthy', 'mccarthy', 'and', 'jarvis']

[4]:

# Return word count (w).
lex.words

[4]:

57

[5]:

# Return (unique) word count (t).
lex.terms

[5]:

39

Type-token ratio (TTR; Chotlos 1944, Templin 1957):

\[TTR = \frac{t}{w}\]

where \(t\), or \(t(w)\), is the number of unique terms as a function of the text length of \(w\) words.

[6]:

# Return type-token ratio (TTR) of text.
lex.ttr

[6]:

0.6842105263157895
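As a rough illustration of the TTR formula, the sketch below counts words and unique terms with a naive tokenizer and divides one by the other. The helper name `simple_ttr` and the regex tokenizer are assumptions made for this example; the library's own preprocessing may differ.

```python
import re


def simple_ttr(text):
    """Type-token ratio t / w with a naive lowercase alphabetic tokenizer (assumption)."""
    words = re.findall(r"[a-z]+", text.lower())  # w = len(words)
    if not words:
        raise ValueError("text contains no words")
    return len(set(words)) / len(words)  # t / w


sample = "the cat sat on the mat and the dog sat too"
print(simple_ttr(sample))  # 8 unique terms over 11 words
```

Because every repeated word raises \(w\) without raising \(t\), TTR tends to fall as texts get longer, which is the motivation for the length-adjusted measures that follow.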

Root TTR (RTTR; Guiraud 1954, 1960):

\[RTTR = \frac{t}{\sqrt{w}}\]

[7]:

# Return root type-token ratio (RTTR) of text.
lex.rttr

[7]:

5.165676192553671

Corrected TTR (CTTR; Carroll 1964):

\[CTTR = \frac{t}{\sqrt{2w}}\]

[8]:

# Return corrected type-token ratio (CTTR) of text.
lex.cttr

[8]:

3.6526846651686067
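The two square-root corrections can be checked by hand from the counts alone. The sketch below is illustrative only: the function names are hypothetical, and the counts w = 57 and t = 39 are the values consistent with the TTR and RTTR outputs reported above for the sample text.

```python
import math


def rttr_from_counts(t, w):
    """Root TTR: t / sqrt(w)."""
    return t / math.sqrt(w)


def cttr_from_counts(t, w):
    """Corrected TTR: t / sqrt(2w)."""
    return t / math.sqrt(2 * w)


w, t = 57, 39  # counts consistent with the outputs above (assumption)
print(rttr_from_counts(t, w))
print(cttr_from_counts(t, w))
```

Note that CTTR is simply RTTR divided by \(\sqrt{2}\), so the two measures always rank texts identically.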

Herdan’s C (Herdan 1960, 1964):

\[C = \frac{\log(t)}{\log(w)}\]

[9]:

# Return Herdan's C
lex.Herdan

[9]:

0.9061378160786574

Summer’s index (Summer 1966)

\[Summer = \frac{\log(\log(t))}{\log(\log(w))}\]

[10]:

# Return Summer's index
lex.Summer

[10]:

0.9294460323356605

Dugast’s index (Dugast 1978):

\[Dugast = \frac{\log(w)^2}{\log(w) - \log(t)}\]

[11]:

# Return Dugast's index
lex.Dugast

[11]:

43.074336212149774

Maas’s index (Maas 1972):

\[Maas = \frac{\log(w) - \log(t)}{\log(w)^2}\]

[12]:

# Return Maas's index
lex.Maas

[12]:

0.023215679867353005
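The four logarithmic indices above depend only on \(w\) and \(t\), so they can be sketched in a few lines. The function name `log_indices` is hypothetical, natural logarithms are assumed (they reproduce the outputs shown above), and w = 57, t = 39 are the counts consistent with the earlier TTR output.

```python
import math


def log_indices(t, w):
    """Herdan, Summer, Dugast and Maas indices from counts t and w.

    Requires 1 < t < w: Summer needs log(t) > 0, and Dugast divides by
    log(w) - log(t), which is zero when every word is unique.
    """
    log_t, log_w = math.log(t), math.log(w)
    return {
        "Herdan": log_t / log_w,
        "Summer": math.log(log_t) / math.log(log_w),
        "Dugast": log_w ** 2 / (log_w - log_t),
        "Maas": (log_w - log_t) / log_w ** 2,
    }


for name, value in log_indices(t=39, w=57).items():
    print(name, value)
```

Dugast and Maas are reciprocals of each other, so a higher Dugast value and a lower Maas value both indicate a richer vocabulary.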

Yule’s K (Yule 1944, Tweedie and Baayen 1998):

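\[K = 10^4 \times \left\{\sum_{i=1}^{n_{\text{max}}} f(i,w) \left(\frac{i}{w}\right)^2 -\frac{1}{w} \right\}\]

where \(f(i,w)\) is the number of terms that occur \(i\) times in a text of \(w\) words, and \(n_{\text{max}}\) is the highest such frequency.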

[13]:

# Return Yule's K
lex.yulek

[13]:

153.8935056940597

Yule’s I (Yule 1944, Tweedie and Baayen 1998):

\[I = \frac{t^2}{\sum^{n_{\text{max}}}_{i=1} i^2f(i,w) - t}\]

[14]:

# Return Yule's I
lex.yulei

[14]:

22.36764705882353

Herdan’s Vm (Herdan 1955, Tweedie and Baayen 1998):

\[V_m = \sqrt{\sum^{n_{\text{max}}}_{i=1} f(i,w) \left(\frac{i}{w} \right)^2 - \frac{1}{t}}\]

[15]:

# Return Herdan's Vm
lex.herdanvm

[15]:

0.08539428890448784

Simpson’s D (Simpson 1949, Tweedie and Baayen 1998):

\[D = \sum^{n_{\text{max}}}_{i=1} f(i,w) \frac{i}{w}\frac{i-1}{w-1}\]

[16]:

# Return Simpson's D
lex.simpsond

[16]:

0.015664160401002505
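Yule's K, Yule's I, Herdan's Vm and Simpson's D are all built from the frequency spectrum \(f(i,w)\), the number of terms that occur exactly \(i\) times. The sketch below is a minimal illustration, not the library's code; the function name `spectrum_measures` is hypothetical. Herdan's Vm is computed here with a \(1/t\) correction term, as in Tweedie and Baayen (1998), which appears to be what reproduces the value reported above.

```python
import math
from collections import Counter


def spectrum_measures(words):
    """Frequency-spectrum measures from a list of tokens (illustrative sketch)."""
    w = len(words)
    term_freqs = Counter(words)               # term -> number of occurrences
    t = len(term_freqs)
    spectrum = Counter(term_freqs.values())   # i -> f(i, w)
    s2 = sum(f * i ** 2 for i, f in spectrum.items())  # sum of i^2 * f(i, w)
    return {
        "yulek": 1e4 * (s2 / w ** 2 - 1 / w),
        # Undefined (division by zero) when every word is unique, since s2 == t.
        "yulei": t ** 2 / (s2 - t),
        # max() guards against tiny negative values caused by rounding.
        "herdanvm": math.sqrt(max(0.0, s2 / w ** 2 - 1 / t)),
        "simpsond": sum(
            f * (i / w) * ((i - 1) / (w - 1)) for i, f in spectrum.items()
        ),
    }


print(spectrum_measures(["a", "a", "a", "b", "b", "c"]))
```

In the toy input, the spectrum is one term each at frequencies 1, 2 and 3, so the sum of \(i^2 f(i,w)\) is 14 with \(w = 6\) and \(t = 3\).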

Methods

MSTTR: Mean segmental type-token ratio

Computed as the average of the TTR scores of the segments in a text.
Split the text into segments of length segment_window. For each segment, compute the TTR. The MSTTR score is the sum of these scores divided by the number of segments.
(Johnson 1944)

[17]:

lex.msttr(
    segment_window=25  # size of each segment
)

[17]:

0.88
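The segmentation described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: the function name `msttr_sketch` is hypothetical, and a trailing segment shorter than `segment_window` is simply dropped.

```python
def msttr_sketch(words, segment_window=25):
    """Mean TTR over consecutive, non-overlapping segments of equal length."""
    if segment_window < 1:
        raise ValueError("segment_window must be at least 1")
    n_segments = len(words) // segment_window  # trailing partial segment is dropped
    if n_segments == 0:
        raise ValueError("text is shorter than one segment")
    scores = []
    for k in range(n_segments):
        segment = words[k * segment_window:(k + 1) * segment_window]
        scores.append(len(set(segment)) / segment_window)  # TTR of this segment
    return sum(scores) / n_segments


# Two segments of four tokens, each with three unique terms.
print(msttr_sketch(list("abacddef"), segment_window=4))
```

Since every segment has the same length, the score no longer depends on the overall length of the text, only on the chosen segment size.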

MATTR: Moving average type-token ratio

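Computed as the average of the TTRs of successive, overlapping windows of a text.
Slide a window of window_size words through the text one word at a time and compute the TTR within each window. Then take the average of all the windows' TTRs.
(Covington 2007, Covington and McFall 2010)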

[18]:

# Return moving average type-token ratio (MATTR).
lex.mattr(
    window_size=25  # Size of each sliding window
)

[18]:

0.8351515151515151
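A minimal sketch of the moving-window idea, assuming the window advances one word at a time. The function name `mattr_sketch` is hypothetical and this is not the library's code.

```python
def mattr_sketch(words, window_size=25):
    """Average TTR over all overlapping windows of window_size words."""
    n_windows = len(words) - window_size + 1
    if window_size < 1 or n_windows < 1:
        raise ValueError("window_size must be between 1 and the number of words")
    ttrs = [
        len(set(words[start:start + window_size])) / window_size
        for start in range(n_windows)
    ]
    return sum(ttrs) / n_windows


# Windows: (a, a), (a, b), (b, b) with TTRs 0.5, 1.0 and 0.5.
print(mattr_sketch(["a", "a", "b", "b"], window_size=2))
```

Unlike MSTTR, no words are discarded at the end of the text, at the cost of recomputing the TTR for every window. This direct version is quadratic in the worst case, which is fine for short texts.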

MTLD: Measure of Textual Lexical Diversity

Computed as the mean length of sequential word strings in a text that maintain a minimum threshold TTR score.
Iterates over the words until the running TTR falls below the threshold, then increases the factor counter by 1 and starts over with the next word.
(McCarthy 2005, McCarthy and Jarvis 2010)

[19]:

lex.mtld(
    # Factor threshold for MTLD.
    # Algorithm skips to a new segment when TTR goes below the threshold
    threshold=0.72
)

[19]:

46.79226361031519
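The factor-counting procedure can be sketched as below. Several details are assumptions in the spirit of McCarthy and Jarvis (2010) rather than a statement of what the library does: a factor is closed when the running TTR is at or below the threshold, the leftover words contribute a partial factor of (1 - TTR) / (1 - threshold), and the final score averages a forward and a backward pass. The function names are hypothetical.

```python
def mtld_one_pass(words, threshold):
    """Number of words divided by the (possibly fractional) factor count."""
    terms = set()
    count = 0
    factors = 0.0
    ttr = 1.0
    for word in words:
        count += 1
        terms.add(word)
        ttr = len(terms) / count
        if ttr <= threshold:      # assumption: "at or below" closes a factor
            factors += 1
            terms = set()
            count = 0
    if count > 0:                 # partial factor for the unfinished segment
        factors += (1 - ttr) / (1 - threshold)
    if factors == 0:
        raise ValueError("TTR never dropped below 1, so MTLD is undefined")
    return len(words) / factors


def mtld_sketch(words, threshold=0.72):
    """Average of the forward and backward passes (threshold must be below 1)."""
    forward = mtld_one_pass(words, threshold)
    backward = mtld_one_pass(words[::-1], threshold)
    return (forward + backward) / 2


print(mtld_sketch(["a", "a", "b", "b"]))
```

In the toy input, each pass closes a factor after every second word, giving two factors over four words in both directions.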

voc-D

The vocd score of lexical diversity is derived from a series of TTR samplings and curve fittings.
Step 1: Take 100 random samples of 35 words from the text. Compute the mean TTR of the 100 samples.
Step 2: Repeat this procedure for samples of 36 words, 37 words, and so on, up to ntokens words (default and recommended value: 50). For each sample size, compute the mean TTR over the 100 samples. This yields an array of mean TTR values for sample sizes 35, 36, …, 50.
Step 3: Fit the theoretical curve of TTR as a function of sample size to these empirical mean TTR values. The value of D that gives the best fit is the vocd score.
Step 4: Repeat steps 1 to 3 several times (default: 3) and average the resulting D values. This average is the returned value.

[20]:

lex.vocd(
    ntokens=50,  # Maximum number for the token/word size in the random samplings
    within_sample=100,  # Number of samples
    iterations=3,  # Number of times to repeat steps 1 to 3 before averaging
    seed=42  # Seed for reproducibility
)

[20]:

46.27679899103406

[21]:

lex.vocd()

[21]:

46.27679899103406
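The four steps can be sketched in pure Python as follows. This is a simplified illustration, not the library's implementation: the function names are hypothetical, the theoretical curve is assumed to be the usual vocd model \(TTR = \frac{D}{N}\left(\sqrt{1 + 2N/D} - 1\right)\), samples are drawn without replacement, and the fit is a brute-force least-squares search over a grid of D values rather than a proper optimizer.

```python
import random


def vocd_model(n, d):
    """Theoretical TTR for sample size n and diversity parameter d (assumed model)."""
    return (d / n) * ((1 + 2 * n / d) ** 0.5 - 1)


def fit_d(sizes, mean_ttrs):
    """Step 3: grid search for the D that minimises the squared error."""
    d_grid = [k / 10 for k in range(10, 2001)]  # candidate D values 1.0 .. 200.0

    def squared_error(d):
        return sum((vocd_model(n, d) - y) ** 2 for n, y in zip(sizes, mean_ttrs))

    return min(d_grid, key=squared_error)


def vocd_sketch(words, ntokens=50, within_sample=100, iterations=3, seed=42):
    if len(words) < ntokens:
        raise ValueError("text must contain at least ntokens words")
    rng = random.Random(seed)
    sizes = list(range(35, ntokens + 1))
    d_values = []
    for _ in range(iterations):                      # step 4
        mean_ttrs = []
        for n in sizes:                              # steps 1 and 2
            ttrs = [len(set(rng.sample(words, n))) / n for _ in range(within_sample)]
            mean_ttrs.append(sum(ttrs) / within_sample)
        d_values.append(fit_d(sizes, mean_ttrs))     # step 3
    return sum(d_values) / iterations


toy_words = [f"w{i % 30}" for i in range(120)]  # made-up tokens for illustration
print(vocd_sketch(toy_words))
```

The grid caps D at 200 and resolves it only to one decimal place, so values from this sketch should not be expected to match the library's output exactly.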

voc-D plot utility

Utility to plot the empirical voc-D curve and the best-fitting line

[22]:

lex.vocd_fig(
    ntokens=50,  # Maximum number for the token/word size in the random samplings
    within_sample=100,  # Number of samples
    seed=42,  # Seed for reproducibility
    savepath="images/vocd.png",
)

[22]:

<AxesSubplot:xlabel='Sample size', ylabel='TTR'>

HD-D

Hypergeometric distribution diversity (HD-D) score
(McCarthy and Jarvis 2007)

[23]:

lex.hdd(
    draws=42  # Number of random draws in the hypergeometric distribution
)

[23]:

0.7468703323966486
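As a hedged sketch of the idea behind HD-D: for each term, compute the probability that it appears at least once in a random sample of `draws` words taken without replacement (a hypergeometric probability), then sum these probabilities scaled by 1/draws. The function name `hdd_sketch` and the 1/draws scaling are assumptions for illustration; the library may organise the computation differently.

```python
from collections import Counter
from math import comb


def hdd_sketch(words, draws=42):
    """Sum over terms of P(term appears in a sample of `draws` words) / draws."""
    n_words = len(words)
    if not 0 < draws <= n_words:
        raise ValueError("draws must be between 1 and the number of words")
    all_samples = comb(n_words, draws)
    total = 0.0
    for freq in Counter(words).values():
        # Probability that none of this term's occurrences is drawn.
        p_absent = comb(n_words - freq, draws) / all_samples
        total += (1 - p_absent) / draws
    return total


# 50 distinct made-up tokens: each appears in a 42-word sample with probability 42/50.
print(hdd_sketch([f"w{i}" for i in range(50)], draws=42))
```

Each term's contribution can be read as its expected share of the types in a sample of that size, so the total behaves like an expected TTR for samples of `draws` words.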
