Element sentence analysis

The entire English sentence corpus runs on 138 unique words — and that is the most interesting thing about element sentences

Building sentences from chemical element symbols sounds like a creative writing exercise. It is not. It is a vocabulary compression problem. The app splits your text on whitespace, normalizes each word, and checks whether every single one can be decomposed into valid element symbols. One bad word kills the entire line. I spent a while thinking the hard part was making sentences sound natural. The data shows the hard part is that the usable vocabulary is staggeringly small.

I have 1,048 sentence entries and 925 phrase entries across 7 languages in the current build. Those numbers sound generous until you look at what they are made of. The English sentence corpus — 95 sentences, 481 total word uses — runs on just 138 distinct words. That is a vocabulary reuse of 3.49×. French is even more extreme: 494 sentences from 143 unique words, a reuse of 14.38×. The constraint does not just limit what you can say. It reshapes the vocabulary landscape so dramatically that the story is less about individual sentences and more about the structural bottleneck they all share.

Quick answer

Last updated March 31, 2026

Can you make sentences from element symbols? Yes. The current corpus ships 1,048 sentence entries and 925 phrase entries across 7 languages. The home page already supports multi-word input — paste a sentence and the app evaluates each word independently.

What is the catch? The usable vocabulary is tiny. The entire English sentence corpus uses only 138 distinct words. Function words like "is", "one", "this", "has" dominate because they happen to be elementizable. Content words are much harder to find, which is why successful sentences often sound narrow or repetitive.

What makes it interesting? The constraint itself. Across all seven languages, the vocabulary bottleneck creates patterns you would never predict. French reuses each unique word ~14.38 times on average. Some words like "casino" and "in" travel across five languages. And the element distribution in sentences is meaningfully different from the element distribution in the word corpus.

Sentence corpus

1,048

Curated sentence entries across 7 languages in the current app build.

Phrase corpus

925

Short phrase entries that sit between single words and full sentences.

English unique words

138

Only 138 distinct words power all 95 English sentences — a vocabulary reuse of 3.49×.

French vocab reuse

14.38×

French's 494 sentences use just 143 unique words — each one appears ~14 times on average.

Methodology — where the corpus comes from

I want to be upfront about this: the sentence corpus is not found in the wild. Nobody writes sentences from element symbols by accident. Nobody stumbles upon a paragraph where every whitespace-separated word happens to decompose into a valid chain of chemical element symbols. The corpus was generated using a pipeline, and I think understanding that pipeline is essential to understanding the data.

Dictionary filtering. The app maintains a per-language dictionary of elementizable words — words that can be decomposed into a valid chain of 1-, 2-, or 3-character element symbols. Every word in the dictionary is confirmed elementizable; the filtering happens at build time. English has 13,337 such words, French has 7,510, and the total across all seven languages is 85,911.
Constrained generation. An LLM (Gemini) is given the elementizable vocabulary for each language and asked to produce grammatically valid sentences using only words from that list. The instruction explicitly constrains the model to the allowed word set. This is not free composition — it is more like a constrained optimization problem where the model searches for valid combinations within a restricted dictionary.
Verification. Each generated sentence is split and every word is re-checked against the element-matching engine. Sentences where even one word fails are discarded. What ships in the app is the verified remainder. The failure rate during verification is non-trivial — the LLM frequently produces words that look elementizable but are not in the approved dictionary, or that contain letter sequences with no valid element decomposition.

This means the corpus is curated, not spontaneous. The sentences exist because they were designed to exist. That is worth knowing, because it changes how you should interpret the statistics below: they describe the structure of a constrained vocabulary, not the structure of natural language. If you see that English sentences have a vocabulary reuse of 3.49×, that is a property of what the generator found useful, not a property of English itself.

The interesting question is not "are these sentences real." The interesting question is: given the constraint, what does the vocabulary landscape actually look like? That is what the rest of this article tries to answer with real numbers.

The vocabulary bottleneck

This is the finding I keep returning to. When you look at the sentence corpora across languages, the vocabulary compression is remarkable — and it varies wildly by language. I expected each language to have a roughly proportional relationship between corpus size and vocabulary size. The data shows no such thing.

French — most compressed

14.38×

494 sentences built from 143 unique words. Each distinct word appears ~14.38 times on average. That is extreme reuse — the French corpus essentially recombines the same small word bank over and over. The articles "un" and "une" alone account for a staggering share of all word positions.

English — moderate

3.49×

95 sentences from 138 unique words (481 total word uses). Each word appears ~3.5 times. Still compressed, but far less claustrophobic than French. The English corpus feels varied sentence to sentence — until you realize that "is" and "one" together appear in over half of all entries.

Dutch — least compressed

1.67×

89 sentences using 184 unique words. Almost every word appears only once or twice. This means Dutch sentences are more varied but harder to extend — there is less proven vocabulary to reuse. The Dutch corpus has more unique words than it has sentences.

What vocabulary reuse actually tells you

Vocabulary reuse is totalTokens / uniqueTokens. A reuse of 1.0 means every word in the corpus appears exactly once — maximum diversity, zero repetition. A reuse of 14× means you are cycling through the same small set relentlessly. The metric captures how much the corpus leans on a few workhorses versus spreading load across a broad vocabulary.

High reuse (French at 14.38×) signals that the language has found a handful of elementizable function words and leans on them hard. The words "un" (325 occurrences) and "une" (243 occurrences) together account for roughly 28% of all word uses in the French sentence corpus. That is an astonishing concentration — two words doing more than a quarter of all the work.

Low reuse (Dutch at 1.67×, Welsh at 1.65×) means the corpus is more lexically diverse. But it also means each word is less battle-tested. If you are trying to build a new sentence in Dutch, you have fewer proven building blocks to work with. You are essentially starting from scratch each time, whereas a French speaker can assemble a new sentence by recombining the same proven template.

English sits in a sweet spot: enough reuse (3.49×) that the core function words are well-established, but enough diversity (138 unique words) that the sentences do not all sound identical. It is also worth noting that the English dictionary has 13,337 elementizable words, but the sentences use only 1% of them. The rest sit unused — not because they cannot be elementized, but because they are hard to combine with other elementizable words into grammatically valid sentences.

Vocabulary compression across all languages

Sentence corpus only. Sorted by vocabulary reuse (highest first). The gap between French at 14.38× and Welsh at 1.65× is nearly 9:1 — languages are not created equal under this constraint.

Language	Sentences	Total words	Unique words	Vocab reuse	Avg words/sent.
French	494	2,057	143	14.38×	4.2
English	95	481	138	3.49×	5.1
Spanish	97	413	124	3.33×	4.3
German	94	338	127	2.66×	3.6
Italian	97	252	134	1.88×	2.6
Dutch	89	307	184	1.67×	3.4
Welsh	82	247	150	1.65×	3

Why French looks so different

French has 494 sentence entries — more than five times the Welsh corpus (82) and more than five times the Dutch (89). But it uses only 143 unique words, barely more than English's 138. How?

The answer is structural. French has two elementizable articles ("un" decomposes as U + N; "une" as U + Ne) that are grammatically required before most nouns. The generator discovered that pairing one of these articles with a rotation of elementizable nouns, verbs, and names produces hundreds of valid sentences from the same template. It is efficiency through grammar — the language's own article system makes it easier to produce valid combinations.

Italian tells the opposite story: 97 sentences but 134 unique words, giving a reuse of only 1.88×. Italian sentences tend to be very short (2.6 words on average) and use each word only once or twice across the entire corpus. There is no dominant template — each sentence is more or less unique.

The anatomy of an element sentence

Once I tagged the corpus by part of speech, the structure became much clearer. The sentence problem is not just “find words that elementize.” It is “find a small grammatical machine where every moving part also elementizes.” In English, nouns carry the biggest share of actual word use at 26.3%, but verbs are right behind at 24.7%, and determiners add another 19.3%. That means a workable sentence is usually a noun-or-name, then a bridge verb, then a determiner, then a content word that can finish the idea.

English content words

69.5%

Nouns, verbs, adjectives, adverbs, and names together dominate actual word use in English.

English function words

30.6%

Determiners, pronouns, prepositions, and conjunctions are fewer by category but essential as bridges.

English names

Proper names are unusually valuable because they can anchor a sentence without needing extra syntax.

Top English skeleton

Most common pattern: name verb determiner adjective noun.

English POS distribution by actual word use

This is the practical grammar of the corpus. Nouns and verbs do most of the semantic work, but determiners such as “his”, “one”, “this”, and “that” are the hinges that keep sentences grammatical.

noun

26.3% · 264

verb

24.7% · 248

determiner

19.3% · 194

adjective

9.7% · 97

pronoun

7.7% · 77

name

6.6% · 66

preposition

3.3% · 33

adverb

2.2% · 22

What this says about sentence-building

In English, the strongest starting pieces are names and short determiners. Names such as bruno, clara, oscar, agnes act like ready-made subjects. They are semantically rich and syntactically simple: a name can start a sentence without needing an article.

The best bridge verbs are exactly the ones that dominate the corpus: is, can, has, offer. They are short, frequent, and easy to combine with the rest of the sentence. After that, determiners like his, one, this, that, no let the sentence point toward a noun or adjective without collapsing.

The important point is that grammar is not distributed evenly. There are many more nouns available than determiners, but the determiners are harder to replace. The corpus can survive without a specific noun. It cannot survive without a small, reusable set of bridge words.

Cross-language POS comparison

Languages solve the same constraint in different grammatical ways. French leans hard on determiners, German on pronoun-verb chains, and Welsh on names and prepositions far more than I expected.

Language	Function words	Content words	Names	Top pattern
English	30.6%	69.5%	38	name verb determiner adjective noun
German	39.5%	59.4%	18	pronoun verb pronoun verb
Spanish	32.5%	67.3%	10	determiner noun verb adjective
French	30.1%	69.8%	13	name determiner noun verb
Italian	33.2%	66.6%	0	noun adjective
Dutch	15.9%	83.8%	19	noun verb preposition noun
Welsh	22.5%	77.2%	62	verb noun

The honest takeaway

“What does it take to make a sentence?” In this corpus, the answer is not “a noun and a verb.” The answer is “a stable grammatical bridge.” You need a subject that already works, a bridge verb that the corpus has proven safe, and a determiner or short preposition that can carry you into the content word without breaking the line.

That is why English keeps returning to patterns like name verb determiner adjective noun and French keeps returning to name determiner noun verb. The corpus is not merely choosing words. It is discovering the smallest grammatical machines that can survive the element filter.

Sentence patterns that work

The POS skeletons make the best practical advice on the page. They show, in plain grammar, how the current corpus actually builds sentences that survive. If you want to invent your own sentence, start from the patterns the data already trusts.

English sentence skeletons

These are the most common POS patterns in the English sentence corpus.

name verb determiner adjective noun12 sentences

“bruno has one conspicuous ambition”

determiner noun verb determiner adjective noun10 sentences

“that association has one conspicuous backer”

determiner noun verb adjective6 sentences

“this action is obscene”

determiner noun verb determiner noun6 sentences

“that assassin has one choice”

determiner verb determiner adjective noun6 sentences

“that is one poisonous substance”

name verb determiner noun6 sentences

“clara accepts this crisis”

determiner noun verb noun5 sentences

“this baroness has poise”

determiner verb determiner noun preposition noun4 sentences

“this is one action of arson”

What the patterns teach

The English winner — name verb determiner adjective noun — is a perfect demonstration of the corpus strategy. Start with a name (bruno, clara, oscar), add a bridge verb (is, can, has), use a determiner (his, one, this), and close with an adjective + noun pair that is already proven safe.

French takes a different route. Its dominant pattern is name determiner noun verb, which tells you immediately how much the articles “un” and “une” matter there. German’s top structure is pronoun verb pronoun verb, which explains why its corpus feels more conversational and pronoun-led. Welsh’s top pattern is simply verb noun, which is a reminder that sentence structure under constraint can look very different from language to language.

The practical reading is straightforward: if you want a sentence that works, do not start by chasing a clever rare noun. Start by choosing a stable skeleton that the corpus already demonstrates, then substitute within that skeleton.

Best way to create your own sentence

Step 1

Start with a proven subject: a name like bruno, clara, oscar, agnes, or a determiner-led opening like “this” or “that”.

Step 2

Add a bridge verb from the safe set: is, can, has, offer, hear. These verbs show up repeatedly because they are grammatically flexible and elementizable.

Step 3

Finish with a determiner + adjective + noun, or a compact noun phrase. The corpus keeps returning to combinations like “one serious crisis”, “this classic baron”, or “his business”.

If I were giving the shortest useful advice, it would be this: start with a name, add a known bridge verb, and finish with an elementizable noun or adjective. The data does not just suggest that pattern. It repeats it over and over.

That is also why many successful sentences feel slightly formal or stylized. The corpus optimizes for survivable structure, not spontaneity. Once you understand the structure, though, you can work with it instead of fighting it.

Which words do the heavy lifting

In every language, the most-used sentence words are function words. This is not surprising — function words are short, frequent in natural language, and more likely to be elementizable because short letter sequences have fewer chances to fail. But the degree of dominance is worth examining, because it drives the stylistic signature of element sentences.

English sentence words

Top 15 words by frequency out of 138 unique words (481 total uses).

1is

62× (13%)

2one

58× (12%)

3this

32× (7%)

4has

31× (6%)

5that

21× (4%)

6serious

12× (2%)

7person

10× (2%)

8crisis

8× (2%)

9book

8× (2%)

10of

8× (2%)

11conspicuous

7× (1%)

12classic

7× (1%)

13ashore

7× (1%)

14bruno

6× (1%)

15clara

6× (1%)

French sentence words

Top 15 words by frequency out of 143 unique words (2057 total uses).

1un

325× (16%)

2une

243× (12%)

3ose

80× (4%)

4passe

62× (3%)

5accepte

55× (3%)

6refuse

54× (3%)

7erre

45× (2%)

8pirate

44× (2%)

9sbire

44× (2%)

10artiste

43× (2%)

11flic

42× (2%)

12alice

38× (2%)

13olivier

37× (2%)

14horace

36× (2%)

15oscar

35× (2%)

The function-word trap

In English, "is" appears 62 times and "one" appears 58 times. Together, those two words account for roughly 25% of all word uses across 95 sentences. Two words doing a quarter of the work — that shapes everything about how element sentences read.

This creates a stylistic signature. Element sentences in English gravitate toward declarative structures: "X is Y" patterns dominate because "is" is elementizable (I + S), "one" is elementizable (O + Ne), and "this" is elementizable (Th + I + S). You end up with sentences like "agnes is serene" — grammatically fine, but stylistically constrained in a way you can feel even before you know the rule. The sentence could say something surprising, but the vocabulary is not cooperating.

In French, the compression is even more extreme. The article "un" (U + N) appears 325 times. The feminine "une" (U + Ne) appears 243 times. These two words together represent 28% of all French sentence words. Every other word in the language is fighting for the remaining share. When more than a quarter of your word budget is consumed by articles, the creative space for the rest of the sentence is surprisingly narrow.

German shows a different pattern entirely. The pronoun "er" (Er) leads with 56 uses, followed by "es" (Es) at 42 and "hat" (H + At) at 39. German does not have a single overwhelming anchor — it distributes load across several short, common words. That is reflected in its lower vocabulary reuse (2.66×). The German corpus has less repetition but also less predictability.

Then there is the gap between function words and content words. In English, after "is" (62×), "one" (58×), "this" (32×), and "has" (31×), the next words are "that" (21×) and "serious" (12×). Notice the drop: from 62 uses to 12 uses. The top four function words have 183 combined appearances; the next four have only 51. Function words are not just leading — they are running away with the race.

German sentence words

Top 10 words. 127 unique across 338 uses.

1er56×

2es42×

3hat39×

4wir16×

5ich13×

6sind8×

7auf7×

8gehen5×

9kam4×

10ihn4×

Spanish sentence words

Top 10 words. 124 unique across 413 uses.

1es40×

2la39×

3un33×

4artista18×

5yo17×

6para15×

7una14×

8soy10×

9no10×

10necesario8×

Why content words are rare

Look at the English top-15 list again. The content words — "serious" (12×), "person" (10×), "crisis" (8×), "book" (8×) — appear far less frequently than the function words. This is partly natural frequency bias (function words are always more common), but it is also a reflection of the constraint: there are more elementizable function words than elementizable content words, because content words tend to be longer and therefore have more opportunities to contain an unmatchable letter sequence.

The practical consequence is that element sentences tend to have strong syntactic skeletons ("is", "has", "one", "this") but limited semantic variety. You can reliably form the frame of a sentence. Filling it with interesting content is the hard part.

Words that travel across languages

Some words appear in the sentence or phrase corpora of three or more languages. These are words that happen to be (a) elementizable, (b) meaningful in multiple languages, and (c) actually used by the generator. The cross-language analysis found 16 such words. They fall into several categories, and the most widely shared ones are genuinely surprising.

Five-language words

These words appear in the corpora of five different languages — the most widely shared words in the entire dataset.

casinode, es, fr, it, nl

inde, en, fr, it, nl

"Casino" is a loanword that most European languages absorbed without significant spelling changes — and it happens to decompose cleanly into C + As + I + N + O in the parser's preferred path (with alternative splits like Ca + Si + No also valid). Meanwhile "in" (I + N) is a preposition or adverb that works across Germanic and Romance languages alike. These are not cognates in the linguistic sense; they are orthographic coincidences that survive the element filter independently in each language.

I find this genuinely interesting because it is a kind of convergent evolution. Five different language pipelines, each with their own dictionaries and generators, independently arrived at the same handful of words. The element constraint creates a shared vocabulary that transcends linguistic boundaries — not through translation, but through the accident of compatible letter sequences.

Four-language words

oscarcy, en, fr, nl

"Oscar" appears in Welsh, English, French, and Dutch. It is a proper name — one of the few that crosses the element barrier in multiple languages simultaneously. This is a reminder that names are often the bridge between the element vocabulary and readable sentences. A sentence about "Oscar" is automatically more natural than a sentence that avoids all proper nouns, because names carry their own context and do not need to justify their presence syntactically.

Three-language words

The full list of words appearing in exactly three languages.

bar

es, fr, it

car

cy, en, fr

crash

de, fr, nl

crisis

en, es, nl

eric

cy, en, fr

fiasco

es, fr, nl

flora

en, es, nl

francisco

cy, en, es

laura

cy, en, nl

en, es, it

cy, es, it

es, fr, it

was

de, en, nl

The three-language group is the most diverse. It includes loanwords ("bar", "crash", "fiasco"), shared function words ("no", "un", "was"), proper names ("eric", "laura", "francisco"), a botanical term ("flora"), and a single-letter word ("o" — used as a vocative in Welsh, a conjunction in Spanish, and a vowel-article in Italian).

Each one survived independently in three different language pipelines, which is a low-probability event given how many words fail the element filter entirely. If you estimate that a given word has maybe a 15–20% chance of being elementizable, the probability of the same word passing in three independent languages is already quite low. The words that do make it through tend to be short, common, and orthographically simple — exactly the properties that also make them elementizable.

The element backbone of sentences

When I looked at which elements actually power the sentence corpus, I expected the same O-I-N-S-C pattern that dominates the word corpus. The broad strokes are similar, but the sentence-level data reveals meaningful differences — because sentences recycle a small word set, the element distribution is driven by those few words rather than by the broad statistical tendencies of the entire dictionary.

English sentence elements (top 10)

Element occurrences across all 481 word positions.

IIodine

209× (12.7%)

SSulfur

209× (12.7%)

OOxygen

207× (12.6%)

NNitrogen

117× (7.1%)

CCarbon

95× (5.8%)

HHydrogen

74× (4.5%)

NeNeon

68× (4.1%)

ThThorium

55× (3.3%)

BBoron

52× (3.2%)

AsArsenic

49× (3%)

French sentence elements (top 10)

Element occurrences across all 2057 word positions.

UUranium

703× (9.9%)

NNitrogen

597× (8.4%)

OOxygen

512× (7.2%)

IIodine

469× (6.6%)

SeSelenium

373× (5.2%)

NeNeon

369× (5.2%)

CCarbon

342× (4.8%)

PPhosphorus

280× (3.9%)

ReRhenium

266× (3.7%)

TeTellurium

251× (3.5%)

Sentences vs. words: where the element mix shifts

In the global word corpus, the top five elements are O (41%), I (41%), N (40.1%), S (35.6%), and C (30.5%). Those percentages are computed across 85,911 dictionary words. Single-letter symbols dominate because they can fill any single-character gap in a word.

In the English sentence corpus, I and S are tied at the top (12.7%), followed closely by O (12.6%). The gap between top elements is much smaller than in the word corpus — the sentence distribution is flatter, because a small recycled word set levels out the element frequencies rather than letting the broad statistical tendencies of the dictionary drive them.

The most interesting sentence-specific element is Neon (Ne): it ranks 7th in English sentences at 4.1% but barely registers in the global word corpus. Why? Because "one" (O + Ne) is the second most common English sentence word at 58 occurrences, and it pulls Neon into the sentence distribution almost single-handedly. This is a case where a single high-frequency word reshapes the element profile of an entire corpus.

Similarly, Thorium (Th) at 3.3% rides on "this" (32×) and "that" (21×). In German, Germanium (Ge) appears at 5.2% because "gestern" and the "ge-" prefix are common in the German sentence vocabulary. And in French, Uranium (U) leads the element chart at 9.9% — because "un" (U + N) is the most common word.

The pattern is consistent: in the sentence corpus, element frequency is a downstream effect of word frequency. Whichever few words the generator leans on hardest, those words' constituent elements get amplified proportionally. This is a fundamentally different relationship than the word corpus, where element frequency reflects the statistical properties of tens of thousands of dictionary entries.

Language-by-language comparison

Full sentence and phrase metrics for all 7 corpus languages. Median word counts show how compact (or expansive) typical entries are. The variation across languages is larger than I expected.

Language	Sentences	Phrases	Median sent. words	Median phrase words	Longest sentence
en	95	130	5 words	4 words	simon has one book also monica has one book
de	94	107	4 words	3 words	er kam als ich las
es	97	128	4 words	4 words	la obra es rara pero no es un fiasco
fr	494	155	4 words	2 words	un flic passe car une alerte fuse
it	97	128	3 words	3 words	lui reagisce con astio
nl	89	151	3 words	3 words	babe is in bikini op balkon
cy	82	126	3 words	3 words	cryf yw bryce ond ofnus yw bryson

French dominates the sentence count with 494 entries — more than five times the size of the Welsh corpus (82 entries). This is not because French is inherently more elementizable. It is because the French generator found a very efficient pattern: pair "un" or "une" with a rotation of elementizable nouns, verbs, and names. The same structural template produces hundreds of valid sentences.

Italian has the lowest vocabulary reuse (1.88×) and the shortest average sentence length (2.6 words per entry). This suggests that Italian elementizable sentences tend to be very compact — often just two or three words — and do not reuse words much across entries. Italian favours brevity over repetition.

The median sentence length across languages clusters between 3 and 5 words. This is a natural ceiling: each additional word is another chance for the sentence to break, so the corpus gravitates toward shorter, safer constructions. The longest sentences — like English's "simon has one book also monica has one book" at 9 words — represent the upper boundary of what the constraint allows. They work because they chain together proven words, not because they take risks with rare vocabulary.

Longest English sentence

At 9 words, this is the upper bound of what the English corpus sustains. Notice how it leans on proven words ("has", "one", "book") and avoids risk.

simon has one book also monica has one book

Shortest English sentence

Three words — about as minimal as a sentence can get while remaining grammatically complete. Most corpora cluster around this compact end because brevity is the safest strategy under the constraint.

agnes is serene

Longest French sentence

French achieves longer sentences by anchoring on "un/une" and rotating through a bank of elementizable verbs and nouns. The template-based approach allows sentences that feel natural even at length.

un flic passe car une alerte fuse

Longest Spanish sentence

Spanish has the third-highest sentence count (97) and makes good use of elementizable articles "la" and "un." Spanish sentences lean on declarative constructions similar to English.

la obra es rara pero no es un fiasco

Why the weakest word kills the sentence

Understanding this is the key to understanding everything else about element sentences. The system is all-or-nothing by design, and that all-or-nothing rule creates exponential fragility as sentences grow longer.

How the app processes a sentence

When you type a sentence on the home page, the app does this:

Split the input on whitespace to get individual words.
Normalize each word: strip punctuation, diacritics, digits.
For each normalized word, attempt to decompose it into a valid chain of 1-, 2-, or 3-character element symbols.
If every word succeeds, the sentence is valid. If any single word fails, the entire sentence fails.

There is no partial credit. A five-word sentence where four words are elementizable and one is not produces the same outcome as a five-word sentence where zero words are elementizable: failure. The UI will show the successful words in their element form, but the sentence as a whole is not marked valid.

A concrete example

Consider a real corpus sentence: "this financial crisis is serious." Each word:

this → Th + I + S ✓
financial → F + I + Na + N + C + I + Al ✓
crisis → Cr + I + S + I + S ✓
is → I + S ✓
serious → S + Er + I + O + U + S ✓

That sentence works. Now try swapping one word: "this situation is serious." The word "situation" fails because after "si" the parser hits "tu" — and there is no element symbol Tu. The entire sentence fails because of two letters in the middle of one word. This is why the corpus leans on proven, battle-tested words rather than exploring the full vocabulary: one untested word can poison an entire line.

The probability math

If you pick a random English word from the general vocabulary, the chance it is elementizable is high — the word corpus has a 100% elementizable rate because it is pre-filtered. But that is a curated dictionary. In unconstrained natural language, many common words fail the element test: "the" works (Th + E), but "that" works (Th + At), while "what" does not (W + H + At leaves nothing for the remaining characters... actually it might — the point is that many everyday words fail).

Now consider a five-word sentence. Even if each word independently has a 70% chance of being elementizable, the probability that all five succeed is 0.7⁵ = 16.8%. At 80% per word, you get 32.8%. At 90%, you get 59%. The multiplication is merciless — and real natural language has far more words that fail the filter than succeed.

This is why the corpus exists at all. The generator does not pick words randomly; it draws from a pre-verified vocabulary. And even then, the resulting sentences lean heavily on a small proven word set rather than exploring the full elementizable vocabulary. The data confirms this: 138 unique words across 95 sentences, when English alone has a dictionary of 13,337 elementizable words.

That is a utilization rate of roughly 1% of the available vocabulary. The sentences use about one in every hundred elementizable English words. The rest of the dictionary sits unused — not because those words cannot be elementized, but because they are hard to combine with other elementizable words into grammatically valid sentences. The constraint is not on individual words; it is on the combinations.

The phrase corpus: a less constrained middle ground

Between single words and full sentences, the corpus includes 925 phrase entries. Phrases are typically two to four words — noun phrases, verb phrases, or short clauses that do not form complete sentences. They are less syntactically constrained than full sentences, which means the generator has more freedom and the vocabulary can stretch a bit further.

The English phrase corpus has 176 unique words across 130 entries, giving a vocabulary reuse of 2.97×. That is lower than the sentence reuse (3.49×), which makes sense: phrases are shorter, so the generator has less opportunity to reuse the same structural anchors. But the most frequent phrase words tell a different story than sentence words.

In English phrases, "his" (H + I + S) dominates with 75 occurrences — more than any sentence word. The possessive pronoun is a workhorse for short noun phrases like "his business," "his crisis," "his concern." The possessive is not just elementizable; it is a natural head for a phrase where the complement can be any elementizable noun. This is a case where the grammar creates a natural template: "his" + [noun] is always valid as a phrase, so the generator can produce dozens of entries by varying only the noun.

The phrase corpus also makes heavier use of second-person and first-person pronouns: "i" (23×), "you" (12×), "she" (11×), "he" (10×). These are all elementizable because they are short: "i" is just Iodine, "he" is Helium. But they rarely appear in the sentence corpus at the same frequency, because sentences need verbs and objects after the pronoun, and those additional words narrow the possibilities much more aggressively.

Pros, cons, and constraints

These points are grounded in the current element-matching rules and API limits, not generic chemistry-word advice.

Pros

The shipped corpus proves multilingual sentence creation works — 1,048 verified sentences across 7 languages.
Function words like "is" (62× in English), "un" (325× in French), and "er" (56× in German) are reliable anchors that make sentence construction learnable.
Cross-language words (16 words in 3+ languages) create reusable building blocks: "casino", "in", "oscar" work across five or more corpora.
Whitespace splitting means successful sentences can be edited word-by-word — swap one proven word for another without retesting the whole line.

Cons

Vocabulary utilization is low: English sentences use ~1% of the 13,337-word elementizable dictionary.
Sentence wording can sound narrow because the same 138 words get reused at a 3.49× rate — readers may notice the repetition.
The corpus is generated and curated, not harvested from natural text. Successful sentences exist because they were designed to exist.
Longer sentences are exponentially more fragile: each additional word multiplies the chance of failure.

Hard constraints

The app splits text on whitespace and elementalizes each word independently — there is no cross-word matching or whole-sentence parsing.
Normalization strips punctuation, digits, and diacritics before matching. Punctuation survives visually but does not participate in element resolution.
The elementalise API accepts up to 50 words per request and rendering caps output at 16 words, bounding how far a sentence can travel through the full design pipeline.
The matching engine tries 1-, 2-, and 3-character symbol prefixes recursively. A word only succeeds if every remaining segment maps to a valid element — one unmatchable cluster kills the word.

What to do next

If you want to try a sentence, start with proven words. In English, build around "is", "one", "this", "has" — they are the most battle-tested anchors. Pick a subject from the names corpus (Oscar, Bruno, Clara, Agnes all work) and fill in with short elementizable adjectives or nouns. The goal is to minimize risk per word, not to maximize creativity per sentence. Creativity comes later, once you have a working skeleton.

If you want the fastest feedback loop, the home page shows element resolution in real time as you type. Paste a candidate sentence and watch which words light up and which fail. The designer turns a working sentence into a downloadable or printable periodic-table layout. Multi-word inputs in the designer produce a sequence of element tiles that reads left to right, with spacing between words — it is the most visual way to see the constraint in action.

If single words are more your speed, the words article covers the simpler single-word case — no combinatorial fragility, just a single word and its element decomposition(s). If you want to push the constraint further, the poems article shows what happens when you add line breaks and verse structure. And for raw symbol context, the periodic table closes the loop between language and chemistry.

The broader takeaway from this analysis is that element sentences are not a creative writing exercise — they are a combinatorial one. The vocabulary is small, the reuse is high, and the constraint is merciless. But within those boundaries, the data reveals genuine structure: cross-language convergence, element distributions shaped by individual high-frequency words, and a vocabulary bottleneck that is more interesting to understand than to fight against.

Try it on home Open in designer Browse names Explore symbols

Follow the adjacent intent when you want a shorter word, a full sentence, or a more constrained poem.

Words with chemical elements

Return to the simpler single-word case when you want the most forgiving discovery flow.

Poems with chemical elements

See what happens when the same word-by-word logic is stretched into multiline composition.

Spell names with chemical elements

See how often popular first names can be written with real element symbols across six countries.

The entire English sentence corpus runs on 138 unique words — and that is the most interesting thing about element sentences

Methodology — where the corpus comes from

The vocabulary bottleneck

What vocabulary reuse actually tells you

Vocabulary compression across all languages

Why French looks so different

The anatomy of an element sentence

English POS distribution by actual word use

What this says about sentence-building

Cross-language POS comparison

The honest takeaway

Sentence patterns that work

English sentence skeletons

What the patterns teach

Best way to create your own sentence

Which words do the heavy lifting

English sentence words

French sentence words

The function-word trap

German sentence words

Spanish sentence words

Why content words are rare

Words that travel across languages

Five-language words

Four-language words

Three-language words

The element backbone of sentences

English sentence elements (top 10)

French sentence elements (top 10)

Sentences vs. words: where the element mix shifts

Language-by-language comparison

Longest English sentence

Shortest English sentence

Longest French sentence

Longest Spanish sentence

Why the weakest word kills the sentence

How the app processes a sentence

A concrete example

The probability math

The phrase corpus: a less constrained middle ground

Pros, cons, and constraints

Pros

Cons

Hard constraints

What to do next

Related articles