Attributes and Methods in LexicalRichness
This addendum documents the lexical richness measures exposed as attributes and methods of the LexicalRichness class.
TTR: Type-Token Ratio (Chotlos 1944, Templin 1957)
- lexicalrichness.LexicalRichness.ttr()
Type-token ratio (TTR) computed as t/w, where t is the number of unique terms/vocab, and w is the total number of words. (Chotlos 1944, Templin 1957)
- Returns:
Type-token ratio
- Return type:
Float
RTTR: Root Type-Token Ratio (Guiraud 1954, 1960)
- lexicalrichness.LexicalRichness.rttr()
Root TTR (RTTR) computed as t/sqrt(w), where t is the number of unique terms/vocab, and w is the total number of words. Also known as Guiraud’s R and Guiraud’s index. (Guiraud 1954, 1960)
- Returns:
Root type-token ratio
- Return type:
Float
CTTR: Corrected Type-Token Ratio (Carroll 1964)
- lexicalrichness.LexicalRichness.cttr()
Corrected TTR (CTTR) computed as t/sqrt(2 * w), where t is the number of unique terms/vocab, and w is the total number of words. (Carroll 1964)
- Returns:
Corrected type-token ratio
- Return type:
Float
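The three ratio measures above (TTR, RTTR, CTTR) differ only in the denominator. A minimal stdlib sketch of the formulas (not the library's implementation), assuming a pre-tokenized list of words:

```python
import math

tokens = "the quick brown fox jumps over the lazy dog the fox".split()
t = len(set(tokens))          # number of unique terms: 8
w = len(tokens)               # total number of words: 11

ttr = t / w                   # Type-Token Ratio
rttr = t / math.sqrt(w)       # Root TTR (Guiraud's R)
cttr = t / math.sqrt(2 * w)   # Corrected TTR
```

Note that CTTR is just RTTR scaled by a constant factor of 1/sqrt(2), so the two always rank texts identically.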
Herdan: Herdan’s C (Herdan 1960, 1964)
- lexicalrichness.LexicalRichness.Herdan()
Computed as log(t)/log(w), where t is the number of unique terms/vocab, and w is the total number of words. Also known as Herdan’s C. (Herdan 1960, 1964)
- Returns:
Herdan’s C
- Return type:
Float
Summer: Summer (Summer 1966)
- lexicalrichness.LexicalRichness.Summer()
Computed as log(log(t)) / log(log(w)), where t is the number of unique terms/vocab, and w is the total number of words. (Summer 1966)
- Returns:
Summer
- Return type:
Float
Dugast: Dugast (Dugast 1978)
- lexicalrichness.LexicalRichness.Dugast()
Computed as (log(w) ** 2) / (log(w) - log(t)), where t is the number of unique terms/vocab, and w is the total number of words. (Dugast 1978)
- Returns:
Dugast
- Return type:
Float
Maas: Maas (Maas 1972)
- lexicalrichness.LexicalRichness.Maas()
Maas’s TTR, computed as (log(w) - log(t)) / (log(w) * log(w)), where t is the number of unique terms/vocab, and w is the total number of words. Unlike the other measures, lower maas measure indicates higher lexical richness. (Maas 1972)
- Returns:
Maas
- Return type:
Float
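The four logarithmic measures above (Herdan's C, Summer, Dugast, Maas) can be sketched the same way. Note from the formulas that Dugast is the reciprocal of Maas, which is why a lower Maas score (but a higher Dugast score) indicates greater richness. A stdlib sketch, not the library's implementation:

```python
import math

tokens = "the quick brown fox jumps over the lazy dog the fox".split()
t, w = len(set(tokens)), len(tokens)   # t = 8 unique terms, w = 11 words

herdan_c = math.log(t) / math.log(w)                         # log(t) / log(w)
summer = math.log(math.log(t)) / math.log(math.log(w))       # log(log(t)) / log(log(w))
dugast = math.log(w) ** 2 / (math.log(w) - math.log(t))      # log(w)^2 / (log(w) - log(t))
maas = (math.log(w) - math.log(t)) / math.log(w) ** 2        # reciprocal of Dugast
```

Summer's measure requires t > e (so that log(log(t)) is defined), and Dugast and Maas require t < w; degenerate inputs like a fully unique token list will raise a ZeroDivisionError in this sketch.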
yulek: Yule’s K (Yule 1944, Tweedie and Baayen 1998)
- lexicalrichness.LexicalRichness.yulek()
Yule’s K (Yule 1944, Tweedie and Baayen 1998).
\[k = 10^4 \times \left\{\sum_{i=1}^n f(i,N) \left(\frac{i}{N}\right)^2 -\frac{1}{N} \right\}\]
See also
frequency_wordfrequency_table
Get table of i frequency and number of terms that appear i times in text of length N.
- Returns:
Yule’s K
- Return type:
Float
yulei: Yule’s I (Yule 1944, Tweedie and Baayen 1998)
- lexicalrichness.LexicalRichness.yulei()
Yule’s I (Yule 1944).
\[I = \frac{t^2}{\sum^{n_{\text{max}}}_{i=1} i^2f(i,w) - t}\]
See also
frequency_wordfrequency_table
Get table of i frequency and number of terms that appear i times in text of length N.
- Returns:
Yule’s I
- Return type:
Float
herdanvm: Herdan’s Vm (Herdan 1955, Tweedie and Baayen 1998)
- lexicalrichness.LexicalRichness.herdanvm()
Herdan’s Vm (Herdan 1955, Tweedie and Baayen 1998)
\[V_m = \sqrt{\sum^{n_{\text{max}}}_{i=1} f(i,w) \left(\frac{i}{w} \right)^2 - \frac{1}{w}}\]
See also
frequency_wordfrequency_table
Get table of i frequency and number of terms that appear i times in text of length N.
- Returns:
Herdan’s Vm
- Return type:
Float
simpsond: Simpson’s D (Simpson 1949, Tweedie and Baayen 1998)
- lexicalrichness.LexicalRichness.simpsond()
Simpson’s D (Simpson 1949, Tweedie and Baayen 1998)
\[D = \sum^{n_{\text{max}}}_{i=1} f(i,w) \frac{i}{w}\frac{i-1}{w-1}\]
See also
frequency_wordfrequency_table
Get table of i frequency and number of terms that appear i times in text of length N.
- Returns:
Simpson’s D
- Return type:
Float
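Yule's K, Yule's I, Herdan's Vm, and Simpson's D are all computed from the frequency spectrum f(i, N), i.e. the number of terms that appear exactly i times in the text. A stdlib sketch of the four formulas (the library builds the spectrum with frequency_wordfrequency_table; this illustration uses collections.Counter instead):

```python
from collections import Counter
import math

tokens = "the quick brown fox jumps over the lazy dog the fox".split()
w = len(tokens)                            # text length: 11
term_freqs = Counter(tokens)               # term -> frequency
spectrum = Counter(term_freqs.values())    # i -> f(i, w): number of terms appearing i times
t = sum(spectrum.values())                 # number of unique terms: 8

yule_k = 1e4 * (sum(f * (i / w) ** 2 for i, f in spectrum.items()) - 1 / w)
yule_i = t ** 2 / (sum(i ** 2 * f for i, f in spectrum.items()) - t)
herdan_vm = math.sqrt(sum(f * (i / w) ** 2 for i, f in spectrum.items()) - 1 / w)
simpson_d = sum(f * (i / w) * ((i - 1) / (w - 1)) for i, f in spectrum.items())
```

Comparing the formulas shows that Yule's K equals 10^4 times the square of Herdan's Vm.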
msttr: Mean Segmental Type-Token Ratio (Johnson 1944)
- lexicalrichness.LexicalRichness.msttr(self, segment_window=100, discard=True)
Mean segmental TTR (MSTTR) computed as average of TTR scores for segments in a text.
Split a text into segments of length segment_window. For each segment, compute the TTR. MSTTR score is the sum of these scores divided by the number of segments. (Johnson 1944)
See also
segment_generator
Split a list into s segments of size r (segment_size).
- Parameters:
segment_window (int) – Size of each segment (default=100).
discard (bool) – If True, discard the remaining segment (e.g. for a text size of 105 and a segment_window of 100, the last 5 tokens will be discarded). Default is True.
- Returns:
Mean segmental type-token ratio (MSTTR)
- Return type:
float
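The segmenting logic above can be sketched as follows. This is an illustration of the procedure, not the library's implementation; a small segment_window is used in the example so the behavior is visible on short input:

```python
def msttr(tokens, segment_window=100, discard=True):
    """Mean of per-segment TTRs; assumes at least one full segment when discard=True."""
    segments = [tokens[i:i + segment_window]
                for i in range(0, len(tokens), segment_window)]
    if discard and len(segments[-1]) < segment_window:
        segments = segments[:-1]            # drop the short remainder segment
    ttrs = [len(set(seg)) / len(seg) for seg in segments]
    return sum(ttrs) / len(ttrs)
```

With segment_window=4, the token list "a b a c d a b e" splits into two full segments with TTRs 0.75 and 1.0, giving an MSTTR of 0.875; a trailing ninth token would be discarded by default.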
mattr: Moving Average Type-Token Ratio (Covington 2007, Covington and McFall 2010)
- lexicalrichness.LexicalRichness.mattr(self, window_size=100)
Moving average TTR (MATTR) computed using the average of TTRs over successive segments of a text.
Estimate TTR for tokens 1 to n, 2 to n+1, 3 to n+2, and so on until the end of the text (where n is window size), then take the average. (Covington 2007, Covington and McFall 2010)
See also
list_sliding_window
Returns a sliding window generator (of size window_size) over a sequence
- Parameters:
window_size (int) – Size of each sliding window.
- Returns:
Moving average type-token ratio (MATTR)
- Return type:
float
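The moving-average procedure can be sketched like so (an illustration, not the library's implementation; the library uses the list_sliding_window helper):

```python
def mattr(tokens, window_size=100):
    """Mean TTR over every window of window_size consecutive tokens."""
    if window_size >= len(tokens):          # fall back to plain TTR on short texts
        return len(set(tokens)) / len(tokens)
    windows = (tokens[i:i + window_size]
               for i in range(len(tokens) - window_size + 1))
    ttrs = [len(set(win)) / window_size for win in windows]
    return sum(ttrs) / len(ttrs)
```

For ["a", "b", "b", "c"] with window_size=2, the windows are (a, b), (b, b), (b, c) with TTRs 1.0, 0.5, and 1.0, so MATTR is 2.5/3.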
mtld: Measure of Textual Lexical Diversity (McCarthy 2005, McCarthy and Jarvis 2010)
- lexicalrichness.LexicalRichness.mtld(self, threshold=0.72)
Measure of textual lexical diversity, computed as the mean length of sequential words in a text that maintains a minimum threshold TTR score.
Iterates over words until the TTR score falls below a threshold, then increases the factor count by 1 and starts over. McCarthy and Jarvis (2010, p. 385) recommend a factor threshold in the range of [0.660, 0.750]. (McCarthy 2005, McCarthy and Jarvis 2010)
- Parameters:
threshold (float) – Factor threshold for MTLD. Algorithm skips to a new segment when TTR goes below the threshold (default=0.72).
- Returns:
Measure of textual lexical diversity (MTLD)
- Return type:
float
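The factor-counting procedure can be sketched as below. The published measure averages a forward and a backward pass and credits the leftover stretch with a partial factor; this sketch follows that scheme but is an illustration under those assumptions, not the library's implementation:

```python
def mtld_pass(tokens, threshold=0.72):
    """One directional pass: count factors, i.e. stretches that keep TTR above threshold."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:   # TTR fell below threshold: complete a factor
            factors += 1
            types, count = set(), 0
    if count > 0:                            # leftover stretch contributes a partial factor
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))

def mtld(tokens, threshold=0.72):
    # average a forward pass and a pass over the reversed text
    return (mtld_pass(tokens, threshold) + mtld_pass(tokens[::-1], threshold)) / 2
```

For ["a", "b", "a", "b", "a", "b"], the TTR drops to 2/3 (below 0.72) on every third token, so each pass yields 2 factors over 6 tokens and MTLD is 3.0.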
hdd: Hypergeometric Distribution Diversity (McCarthy and Jarvis 2007)
- lexicalrichness.LexicalRichness.hdd(self, draws=42)
Hypergeometric distribution diversity (HD-D) score.
For each term (t) in the text, compute the probability (p) of getting at least one appearance of t in a random draw of size n < N (text size). The contribution of t to the final HD-D score is p * (1/n). The final HD-D score thus sums p * (1/n) over all terms t. Described in McCarthy and Jarvis (2007, pp. 465-466). (McCarthy and Jarvis 2007)
- Parameters:
draws (int) – Number of random draws in the hypergeometric distribution (default=42).
- Returns:
Hypergeometric distribution diversity (HD-D) score
- Return type:
float
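The "at least one appearance" probability is hypergeometric: for a term with frequency f in a text of N tokens, it equals 1 - C(N - f, n) / C(N, n) for a draw of size n. A stdlib sketch of the score (an illustration, not the library's implementation; it assumes draws <= len(tokens)):

```python
from collections import Counter
from math import comb

def hdd(tokens, draws=42):
    """Sum of P(term appears in a random draw) * (1/draws) over all unique terms."""
    N = len(tokens)
    score = 0.0
    for term, freq in Counter(tokens).items():
        # probability that the term does NOT appear in a draw of `draws` tokens
        p_none = comb(N - freq, draws) / comb(N, draws)
        score += (1 - p_none) * (1 / draws)
    return score
```

For the toy text ["a", "b", "a", "c"] with draws=2, the contributions are (5/6)/2 for "a" and (1/2)/2 each for "b" and "c", totaling 11/12.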
vocd: voc-D (McKee, Malvern, and Richards 2010)
- lexicalrichness.LexicalRichness.vocd(self, ntokens=50, within_sample=100, iterations=3, seed=42)
Vocd score of lexical diversity derived from a series of TTR samplings and curve fittings.
Vocd is meant as a measure of lexical diversity robust to varying text lengths. See also hdd. Vocd is computed in 4 steps, as follows.
Step 1: Take 100 random samples of 35 words from the text. Compute the mean TTR over the 100 samples.
Step 2: Repeat this procedure for samples of 36 words, 37 words, and so on, up to ntokens (default=50, the recommended value), computing the mean TTR at each sample size. This yields an array of mean TTR values for ntoken=35, ntoken=36, …, up to ntoken=50.
Step 3: Fit a curve to the empirical relationship between TTR and sample size (ntokens). The value of D that provides the best fit is the vocd score.
Step 4: Repeat steps 1 to 3 iterations (default=3) times and average the resulting D values; this average is the returned value.
See also
ttr_nd
TTR as a function of latent lexical diversity (d) and text length (n).
- Parameters:
ntokens (int) – Maximum number for the token/word size in the random samplings (default=50).
within_sample (int) – Number of samples for each token/word size (default=100).
iterations (int) – Number of times to repeat steps 1 to 3 before averaging (default=3).
seed (int) – Seed for the pseudo-random number generator in random.sample() (default=42).
- Returns:
voc-D
- Return type:
float
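The four steps above can be sketched with random sampling plus a coarse grid search over D in place of a proper least-squares curve fit. The ttr_nd model below is the standard voc-D curve; the grid-search fitting is an illustrative simplification, not the library's implementation:

```python
import random
from math import sqrt

def ttr_nd(n, d):
    """Model TTR as a function of sample size n and latent lexical diversity d."""
    return (d / n) * (sqrt(1 + 2 * n / d) - 1)

def vocd(tokens, ntokens=50, within_sample=100, iterations=3, seed=42):
    rng = random.Random(seed)
    d_estimates = []
    for _ in range(iterations):
        # Steps 1-2: mean TTR for each sample size from 35 up to ntokens
        points = []
        for n in range(35, ntokens + 1):
            ttrs = [len(set(rng.sample(tokens, n))) / n
                    for _ in range(within_sample)]
            points.append((n, sum(ttrs) / len(ttrs)))
        # Step 3: grid-search the D (0.1-step grid) that best fits the empirical curve
        best_d = min((sum((ttr - ttr_nd(n, d)) ** 2 for n, ttr in points), d)
                     for d in [x / 10 for x in range(10, 1000)])[1]
        d_estimates.append(best_d)
    return sum(d_estimates) / len(d_estimates)   # Step 4: average over iterations
```

The text must contain at least ntokens words for random.sample to succeed; larger D values correspond to flatter TTR curves and hence greater diversity.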
Helper: lexicalrichness.segment_generator
- lexicalrichness.segment_generator(List, segment_size)
Split a list into s segments of size r (segment_size).
- Parameters:
List (list) – List of items to be segmented.
segment_size (int) – Size of each segment.
- Yields:
List – s lists with r items in each list (the last segment may be shorter).
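A minimal sketch of such a segmenting generator (an illustration, not the library's implementation):

```python
def segment_generator(List, segment_size):
    """Yield successive segments of segment_size items from List."""
    for i in range(0, len(List), segment_size):
        yield List[i:i + segment_size]
```

The final segment is shorter when len(List) is not a multiple of segment_size; callers such as msttr decide whether to keep or discard it.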
Helper: lexicalrichness.list_sliding_window
- lexicalrichness.list_sliding_window(sequence, window_size=2)
Returns a sliding window generator (of size window_size) over a sequence. Taken from https://docs.python.org/release/2.3.5/lib/itertools-example.html
Example:
List = ['a', 'b', 'c', 'd']
window_size = 2
- list_sliding_window(List, 2) ->
('a', 'b')
('b', 'c')
('c', 'd')
- Parameters:
sequence (sequence (string, unicode, list, tuple, etc.)) – Sequence to be iterated over. window_size=1 is just a regular iterator.
window_size (int) – Size of each window.
- Yields:
List – Tuples of window_size consecutive items from the sequence.
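Following the itertools example linked above, the generator can be sketched as:

```python
from itertools import islice

def list_sliding_window(sequence, window_size=2):
    """Yield tuples of window_size consecutive items from sequence."""
    it = iter(sequence)
    window = tuple(islice(it, window_size))
    if len(window) == window_size:
        yield window
    for item in it:
        window = window[1:] + (item,)    # slide the window forward by one item
        yield window
```

With window_size=1 this degenerates into a plain iterator over 1-tuples, matching the note in the Parameters section.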
Helper: lexicalrichness.frequency_wordfrequency_table
- lexicalrichness.frequency_wordfrequency_table(bow)
Get a table of frequencies i and the number of terms that appear i times in a text of length N. Used by Yule’s I, Yule’s K, and Simpson’s D. In the returned table, the freq column is the frequency of appearance (i) in the text, and the fv_i_N column is the number of terms in the text of length N that appear freq times.
- Parameters:
bow (array-like) – List of words
- Return type:
pandas.core.frame.DataFrame
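The frequency spectrum itself needs only two counting passes. A stdlib sketch using collections.Counter (the library returns a pandas DataFrame with freq and fv_i_N columns; this illustration returns a plain dict instead):

```python
from collections import Counter

def frequency_table(bow):
    """Map each frequency i to the number of terms appearing i times, f(i, N)."""
    term_freqs = Counter(bow)                   # term -> number of appearances
    return dict(Counter(term_freqs.values()))   # frequency -> number of terms with it
```

For "the quick the fox the", "the" appears 3 times and the other two terms once each, so the spectrum is {1: 2, 3: 1}.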