
Hindi for Bengalis · Part III, Ch. 6

The Bollywood corpus by frequency

Try this first

Guess: how many of the most frequent Hindi words do you need to know to recognise about three-quarters of the words in a typical Bollywood song? Hold a number in your head — 200? 2,000? 20,000? — and check it against the data below.

Most language-learning advice is anecdotal: "learn ten words a day" or "the first thousand words get you most of the way". The actual numbers depend on the language and the corpus. For Hindi, in the specific domain of Bollywood lyrics, you can compute them exactly. The numbers below come from a frequency analysis of 11,959 Hindi film songs spanning 1931 to 2023.

11,959 songs in the corpus
1.54M word tokens (running)
41,984 unique words
1931 – 2023 year range

The one idea

Word frequency in any natural corpus follows a steep curve: a tiny fraction of words covers most of the running text. Learning words in frequency order is the highest-leverage way to spend your vocabulary study time. An hour spent on the top 100 words gives you well over ten times the coverage of an hour spent on words ranked 10,000–11,000: in this corpus the top 100 alone cover 45.4% of running text, while the 9,300 words between ranks 8,300 and 17,600 add only about three points between them.

The coverage curve, drawn exactly

The chart below maps the top N most frequent words (x-axis, log scale) to the percentage of all running text those N words cover (y-axis).

[Chart: top N words (x-axis: vocabulary size, log scale, ticks at 100, 1,000 and 10,000) against % of text covered (y-axis, ticks at 25%, 50%, 75% and 95%). Landmark points: 1k → 76%, 3k → 88%, 8.3k → 95%.]
The coverage curve, with three landmark points. Diminishing returns kick in hard past 3,000 words.
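
A curve like the one above can be computed from nothing more than a list of tokens. The sketch below is a minimal illustration, not the pipeline behind the real numbers: `coverage_curve` and the ten-token `toy_tokens` sample are made up for this example, and whitespace splitting stands in for whatever tokenisation the actual analysis used.

```python
from collections import Counter

def coverage_curve(tokens, cutoffs):
    """Return {N: share of running tokens covered by the N most frequent word forms}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Frequencies sorted from most to least common.
    freqs = [n for _, n in counts.most_common()]
    return {cutoff: sum(freqs[:cutoff]) / total for cutoff in cutoffs}

# Toy stand-in for tokenised lyrics; the real corpus is not reproduced here.
toy_tokens = "दिल है दिल में प्यार है दिल की बात है".split()

curve = coverage_curve(toy_tokens, cutoffs=[1, 2, 3])
for n, share in curve.items():
    print(f"top {n}: {share:.0%} of running text")
```

Even in this ten-token toy, the two most frequent forms cover 60% of the text, which is the same steep shape in miniature.
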
The coverage curve as a table
| Top N words | % of running text covered | What this gets you |
|---|---|---|
| 100 | 45.4% | About half the words on any page, mostly function words. |
| 500 | 67.2% | Two-thirds. Enough to follow simple sentences. |
| 1,000 | 76.0% | Skeleton of most sentences readable. |
| 3,000 | 87.9% | Practical comprehension with frequent guessing. |
| 5,000 | 91.9% | Solid middle ground. |
| 8,300 | ~95% | Comfortable comprehension with context-guessing for the rest. |
| 17,600 | 98% | Near-full, but the next several thousand words are heavily diminishing. |

Going from 1,000 to 3,000 words adds about twelve points of coverage (76.0% to 87.9%). Going from 3,000 to 8,300 adds about seven points and costs 5,300 more words. Going from 8,300 to 17,600 adds another three points and costs 9,300 more. This is the standard shape of any natural-language frequency distribution, but seeing it on this corpus makes the trade-off concrete.
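
The trade-off is easier to feel as a price: how many extra words each extra point of coverage costs. The sketch below works that out from four rows of the table. The helper name `words_per_point` is invented for this example, and the 95% and 98% rows are rounded in the source, so treat the results as rough.

```python
# (vocabulary size, % of running text covered), copied from the table above.
# The 8,300 and 17,600 rows are approximate figures in the source.
landmarks = [(1_000, 76.0), (3_000, 87.9), (8_300, 95.0), (17_600, 98.0)]

def words_per_point(points):
    """Extra words needed per extra percentage point, for each consecutive step."""
    steps = []
    for (n0, c0), (n1, c1) in zip(points, points[1:]):
        steps.append((n0, n1, (n1 - n0) / (c1 - c0)))
    return steps

for n0, n1, cost in words_per_point(landmarks):
    print(f"{n0:>6,} -> {n1:>6,}: about {cost:,.0f} words per point")
```

The price per point comes out at roughly 170 words in the first step, about 750 in the second, and 3,100 in the third.
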

What's actually in the high-frequency list

The most frequent meaningful word in the corpus is दिल (heart), with 21,389 occurrences. प्यार (love) is second at 10,444. The full top ten is exactly what you'd guess for a corpus made of film songs.

Top ten meaningful words (function words excluded)
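| # | Word | Meaning | Tokens |
|---|---|---|---|
| 1 | दिल | heart | 21,389 |
| 2 | प्यार | love | 10,444 |
| 3 | दुनिया | world | 4,728 |
| 4 | मन | mind | 3,881 |
| 5 | आज | today | 3,822 |
| 6 | रात | night | 3,598 |
| 7 | दिन | day | 3,412 |
| 8 | बात | matter, talk | 2,988 |
| 9 | जान | life / beloved | 2,960 |
| 10 | नज़र | gaze, glance | 2,539 |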

The function-word top ten is even more concentrated: है, हैं, के, की, का, में, से, को, पर, ने. These ten words alone account for a huge fraction of every sentence's structural glue — which is why postpositions and copulas were the focus of Ch. 3.
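
Separating "meaningful" words from function words is, at its simplest, a stop-list filter. The sketch below uses only the ten function words named above as the stop list. That is an assumption made for illustration, since the exclusion list actually used for the table is not given here, and `top_content_words` and the toy sentence are likewise invented.

```python
from collections import Counter

# The ten function words listed above, used here as a minimal stop list.
# Assumption: the real analysis may have excluded more words than these.
FUNCTION_WORDS = {"है", "हैं", "के", "की", "का", "में", "से", "को", "पर", "ने"}

def top_content_words(tokens, k):
    """Count tokens that are not on the stop list and return the k most frequent."""
    counts = Counter(t for t in tokens if t not in FUNCTION_WORDS)
    return counts.most_common(k)

# Toy stand-in for tokenised lyrics.
toy_tokens = "दिल की बात है दिल में प्यार है दिल की बात".split()

top = top_content_words(toy_tokens, 2)
print(top)
```
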

One bonus: Hindi inflection makes the curve more generous

The 8,300-word target sounds like a lot, but Hindi is heavily inflected. जाना / जाने / गया / गई / जाओ / जाऊँ / जाएगा count as seven separate word forms in the frequency list, but they are all forms of one verb, जाना (to go). After collapsing inflected forms under their dictionary forms (lemmas), the real lemma count for 95% coverage is closer to 4,000–5,000. For a native speaker of another Indo-Aryan language, with ~30% of those lemmas already recognisable on sight from Bengali, the active learning load is smaller still.
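
Collapsing forms under a lemma is just a re-keyed count. The sketch below shows the idea with a hand-made lookup that covers only the seven forms of जाना listed above. The `LEMMA_OF` table, the `collapse` helper, and the counts are all invented for illustration; a real pass over the corpus would need a proper morphological analyser.

```python
from collections import Counter

# Hypothetical hand-made lookup from inflected form to lemma (dictionary form).
# It covers only the seven forms of जाना mentioned in the text.
LEMMA_OF = {form: "जाना" for form in
            ["जाना", "जाने", "गया", "गई", "जाओ", "जाऊँ", "जाएगा"]}

def collapse(form_counts):
    """Merge counts of inflected forms under their lemma; unknown forms stay as-is."""
    lemma_counts = Counter()
    for form, n in form_counts.items():
        lemma_counts[LEMMA_OF.get(form, form)] += n
    return lemma_counts

# Invented counts, for illustration only.
form_counts = Counter({"जाना": 5, "गया": 4, "गई": 3, "जाओ": 2, "दिल": 9})

lemma_counts = collapse(form_counts)
print(len(form_counts), "forms ->", len(lemma_counts), "lemmas")
```

Here five entries shrink to two, and जाना overtakes दिल once its forms are pooled, which is why a lemma-based list is shorter than a form-based one for the same coverage.
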

Work one, then finish one

Worked. Suppose you've memorised the top 1,000 words. A song goes by with about 200 word tokens. Roughly how many of those 200 will you recognise? Apply 76% coverage: 152 words. The other 48 are content words you'll need context or a dictionary for. That's enough to get the broad meaning of a verse, especially when the chorus and structure are highly repetitive.
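
The same arithmetic as a tiny function, in case you want to try other song lengths or coverage figures. `expected_known` is a name made up here, and it assumes the song behaves like the corpus average, which any single song may not.

```python
def expected_known(tokens_in_song, coverage):
    """Split a song's token count into expected known and unknown tokens.

    Assumes the song matches the corpus-wide coverage figure.
    """
    known = round(tokens_in_song * coverage)
    return known, tokens_in_song - known

known, unknown = expected_known(200, 0.76)   # top 1,000 words
print(known, "known,", unknown, "unknown")

known_3k, unknown_3k = expected_known(200, 0.879)   # top 3,000 words
print(known_3k, "known,", unknown_3k, "unknown")
```
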

Your turn. You want to enjoy songs without looking anything up — about one unfamiliar word per minute is your tolerance. A song delivers about 80 words per minute. What target coverage do you need, and roughly how many words does that demand?

(Answer. One unfamiliar word per 80 = ~1.25% unknown, so ~98.75% coverage. From the table, 98% needs ~17,600 words — and 98.75% is past where the corpus even reliably measures. The honest answer: at that comfort threshold the comprehension-by-coverage approach starts breaking down; you'd switch to learning the rest in context, not from a frequency list.)
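
The answer above can be checked with a short lookup against the chapter's table. This is a sketch under stated assumptions: `required_coverage` and `smallest_vocabulary` are hypothetical helpers, the table stops at 98%, and a return value of `None` simply means the target lies beyond what the table measures.

```python
# Vocabulary size -> % of running text covered, copied from the chapter's table.
COVERAGE = {100: 45.4, 500: 67.2, 1_000: 76.0, 3_000: 87.9,
            5_000: 91.9, 8_300: 95.0, 17_600: 98.0}

def required_coverage(words_per_minute, unknown_per_minute):
    """Coverage (in %) needed to hit a tolerance of unknown words per minute."""
    return 100 * (1 - unknown_per_minute / words_per_minute)

def smallest_vocabulary(target):
    """Smallest tabulated vocabulary reaching the target, or None if off the table."""
    for size in sorted(COVERAGE):
        if COVERAGE[size] >= target:
            return size
    return None

target = required_coverage(80, 1)
print(f"target coverage: {target:.2f}%")
print("vocabulary needed:", smallest_vocabulary(target))
print("for 95%:", smallest_vocabulary(95))
```
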

Why this earns a place in your toolkit

The curve is the cheat sheet for budgeting language-learning time. Most people overestimate how many words they need ("I should learn 50,000 to be fluent"). The data says no: the first thousand carry most of the load, the next two thousand polish it, and after that the marginal cost per percent coverage rises sharply. Spend most of your effort on the top 3,000, accept that the long tail is what context and re-listening are for, and you'll get further faster than the person memorising 100 new words a day from random vocabulary lists.

Recall check · no peeking

  1. What percentage of running text do the top 1,000 most frequent Hindi words cover, in this corpus?
  2. Why are the returns from learning words past 8,000 so much smaller per word than the returns from learning the first 1,000?
  3. Why is the 8,300 number an overestimate of how many words you actually have to learn?
  4. Name two of the top five meaningful (non-function) words.

Explain it back

In two sentences, explain why a frequency-ordered word list is a better study plan than a thematic one (food, family, work, etc.).

Learn · Shawon Chowdhury · a study guide, kept rough on purpose