Skip to main content
Logo image

Applied Discrete Structures

Section 14.2 Free Monoids and Languages

In this section, we will introduce the concept of a language. Languages are subsets of a certain type of monoid, the free monoid over an alphabet. After defining a free monoid, we will discuss languages and some of the basic problems relating to them. We will also discuss the common ways in which languages are defined.
Let A be a nonempty set, which we will call an alphabet. Our primary interest will be in the case where A is finite; however, A could be infinite for most of the situations that we will describe. The elements of A are called letters or symbols. Among the alphabets that we will use are B={0,1}, and the set of ASCII (American Standard Code for Information Interchange) characters, which we symbolize as ASCII.

Definition 14.2.1. Strings over an Alphabet.

A string of length n, n1 over alphabet A is a sequence of n letters from A: a1a2an. The null string, λ, is defined as the string of length zero containing no letters. The set of strings of length n over A is denoted by An. The set of all strings over A is denoted A.

Note 14.2.2.

  1. If the length of string s is n, we write |s|=n.
  2. The null string is not the same as the empty set, although they are similar in many ways. A0={λ}.
  3. A=A0A1A2A3 and if ij,AiAj=; that is, {A0,A1,A2,A3,} is a partition of A.
  4. An element of A can appear any number of times in a string.

Proof.

Case 1. Given the alphabet B={0,1}, we can define a bijection from the positive integers into B. Each positive integer has a binary expansion dkdk1d1d0, where each dj is 0 or 1 and dk=1. If n has such a binary expansion, then 2kn2k+1. We define f:PB by f(n)=f(dkdk1d1d0)=dk1d1d0, where f(1)=λ. Every one of the 2k strings of length k are the images of exactly one of the integers between 2k and 2k+11. From its definition, f is clearly a bijection; therefore, B is countable.
Case 2: A is Finite. We will describe how this case is handled with an example first and then give the general proof. If A={a,b,c,d,e}, then we can code the letters in A into strings from B3. One of the coding schemes (there are many) is a000,b001,c010,d011, and e100. Now every string in A corresponds to a different string in B; for example, ace. would correspond with 000010100. The cardinality of A is equal to the cardinality of the set of strings that can be obtained from this encoding system. The possible coded strings must be countable, since they are a subset of a countable set, B. Therefore, A is countable.
If |A|=m, then the letters in A can be coded using a set of fixed-length strings from B. If 2k1<m2k, then there are at least as many strings of length k in Bk as there are letters in A. Now we can associate each letter in A with with a different element of Bk. Then any string in A. corresponds to a string in B. By the same reasoning as in the example above, A is countable.
Case 3: A is Countably Infinite. We will leave this case as an exercise.

Definition 14.2.4. Concatenation.

Let a=a1a2am and b=b1b2bn be strings of length m and n, respectively. The concatenation of a with b, a+b, is the string a1a2amb1b2bn of length m+n.
There are several symbols that are used for concatenation. We chose to use the one that is also used in Python and SageMath.
The set of strings over any alphabet is a monoid under concatenation.

Note 14.2.5.

  1. The null string is the identity element of [A;+]. Henceforth, we will denote the monoid of strings over A by A.
  2. Concatenation is noncommutative, provided |A|>1.
  3. If |A1|=|A2|, then the monoids A1 and A2 are isomorphic. An isomorphism can be defined using any bijection f:A1A2. If a=a1a2anA1, f(a)=(f(a1)f(a2)f(an)) defines a bijection from A1 into A2. We will leave it to the reader to prove that for all a,b,A1,f(a+b)=f(a)+f(b).
The languages of the world, English, German, Russian, Chinese, and so forth, are called natural languages. In order to communicate in writing in any one of them, you must first know the letters of the alphabet and then know how to combine the letters in meaningful ways. A formal language is an abstraction of this situation.

Definition 14.2.6. Formal Language.

If A is an alphabet, a formal language over A is a subset of A.

Example 14.2.7. Some Formal Languages.

  1. English can be thought of as a language over of letters A,B,Z, both upper and lower case, and other special symbols, such as punctuation marks and the blank. Exactly what subset of the strings over this alphabet defines the English language is difficult to pin down exactly. This is a characteristic of natural languages that we try to avoid with formal languages.
  2. The set of all ASCII stream files can be defined in terms of a language over ASCII. An ASCII stream file is a sequence of zero or more lines followed by an end-of-file symbol. A line is defined as a sequence of ASCII characters that ends with the a “new line” character. The end-of-file symbol is system-dependent.
  3. The set of all syntactically correct expressions in any computer language is a language over the set of ASCII strings.
  4. A few languages over B are
    • L1={sBs has exactly as many 1's as it has 0's}
    • L2={1+s+0sB}
    • L3=0,01 = the submonoid of B generated by {0,01}.

Investigation 14.2.1. Two Fundamental Problems: Recognition and Generation.

The generation and recognition problems are basic to computer programming. Given a language, L, the programmer must know how to write (or generate) a syntactically correct program that solves a problem. On the other hand, the compiler must be written to recognize whether a program contains any syntax errors.

Problem 14.2.8. The Recognition Problem.

Given a formal language over alphabet A, the Recognition Problem is to design an algorithm that determines the truth of sL in a finite number of steps for all sA. Any such algorithm is called a recognition algorithm.

Definition 14.2.9. Recursive Language.

A language is recursive if there exists a recognition algorithm for it.

Example 14.2.10. Some Recursive Languages.

  1. The language of syntactically correct propositions over set of propositional variables expressions is recursive.
  2. The three languages in 7(d) are all recursive. Recognition algorithms for L1 and L2 should be easy for you to imagine. The reason a recognition algorithm for L3 might not be obvious is that the definition of L3 is more cryptic. It doesn’t tell us what belongs to L3, just what can be used to create strings in L3. This is how many languages are defined. With a second description of L3, we can easily design a recognition algorithm. You can prove that
    L3={sBs=λ or s starts with a 0 and has no consecutive 1's}.

Problem 14.2.11. The Generation Problem.

Design an algorithm that generates or produces any string in L. Here we presume that A is either finite or countably infinite; hence, A is countable by Theorem 14.2.3, and LA must be countable. Therefore, the generation of L amounts to creating a list of strings in L. The list may be either finite or infinite, and you must be able to show that every string in L appears somewhere in the list.

Proof.

Part (a) follows from the fact that A is countable; therefore, there exists a complete list of strings in A.
To generate all strings of L, start with a list of all strings in A and an empty list, W, of strings in L. For each string s, use a recognition algorithm (one exists since L is recursive) to determine whether sL. If sL, add it to W; otherwise “throw it out.” Then go to the next string in the list of A.

Example 14.2.13.

Since all of the languages in 7(d) are recursive, they must have generating algorithms. The one given in the proof of Theorem 14.2.12 is not usually the most efficient. You could probably design more efficient generating algorithms for L2 and L3; however, a better generating algorithm for L1 is not quite so obvious.
The recognition and generation problems can vary in difficulty depending on how a language is defined and what sort of algorithms we allow ourselves to use. This is not to say that the means by which a language is defined determines whether it is recursive. It just means that the truth of the statement “L is recursive” may be more difficult to determine with one definition than with another. We will close this section with a discussion of grammars, which are standard forms of definition for a language. When we restrict ourselves to only certain types of algorithms, we can affect our ability to determine whether sL is true. In defining a recursive language, we do not restrict ourselves in any way in regard to the type of algorithm that will be used. In the next section, we will consider machines called finite automata, which can only perform simple algorithms.
One common way of defining a language is by means of a phrase structure grammar (or grammar, for short). The set of strings that can be produced using set of grammar rules is called a phrase structure language.

Example 14.2.14. Zeros before Ones.

We can define the set of all strings over B for which all 0’s precede all 1’s as follows. Define the starting symbol S and establish rules that S can be replaced with any of the following: λ, 0S, or S1. These replacement rules are usually called production rules. They are usually written in the format Sλ, S0S, and SS1. Now define L to be the set of all strings that can be produced by starting with S and applying the production rules until S no longer appears. The strings in L are exactly the ones that are described above.

Definition 14.2.15. Phrase Structure Grammar.

A phrase structure grammar consists of four components:
  1. A nonempty finite set of terminal characters, T. If the grammar is defining a language over A, T is a subset of A.
  2. A finite set of nonterminal characters, N.
  3. A starting symbol, SN.
  4. A finite set of production rules, each of the form XY, where X and Y are strings over AN such that XY and X contains at least one nonterminal symbol.
If G is a phrase structure grammar, L(G) is the set of strings that can be produced by starting with S and applying the production rules a finite number of times until no nonterminal characters remain. If a language can be defined by a phrase structure grammar, then it is called a phrase structure language.

Example 14.2.16. Alternating bits language.

The language over B consisting of strings of alternating 0’s and 1’s is a phrase structure language. It can be defined by the following grammar:
  1. Terminal characters: λ, 0, and 1
  2. Nonterminal characters: S, T, and U
  3. Starting symbol: S
  4. Production rules:
    STSUSλS0S1S0TS1UT10TT10U01UU01
These rules can be visualized with a graph:
Production rules for the language of alternating 0’s and 1’s
Figure 14.2.17. Production rules for the language of alternating 0’s and 1’s
We can verify that a string such as 10101 belongs to the language by starting with S and producing 10101 using the production rules a finite number of times: S1U101U10101.

Example 14.2.18. Valid SageMath Variables.

Let G be the grammar with components:
  1. Terminal symbols = all letters of the alphabet (both upper and lower case), digits 0 through 9, and underscore
  2. Nonterminal symbols: {I,X},
  3. Starting symbol: I
  4. Production rules: Iα, where α is any letter, Iα+X for any letter α, XX+β for any letter, digit or underscore, β, and Xβ for any letter, digit or underscore, β. There are a total of 52+52+63+63=230 production rules for this grammar. The language L(G) consists of all valid SageMath variable names.

Example 14.2.19. Backus-Naur Form.

Backus-Naur form (BNF) is a popular alternate form of defining the production rules in a grammar. If the production rules AB1,AB2,ABn are part of a grammar, they would be written in BNF as A::=B1B2Bn. The symbol in BNF is read as “or” while the ::= is read as “is defined as.” Additional notations of BNF are that {x}, represents zero or more repetitions of x and [y] means that y is optional.
A BNF version of the production rules for a SageMath variable, I, is
letter::=abczABZdigit::=019I::=letter{letterdigit_}

Example 14.2.20. The language of simple arithmetic expressions.

An arithmetic expression can be defined in BNF. For simplicity, we will consider only expressions obtained using addition and multiplication of integers. The terminal symbols are (,),+,*, -, and the digits 0 through 9. The nonterminal symbols are E (for expression), T (term), F (factor), and N (number). The starting symbol is E. Production rules are
E::=E+TTT::=TFFF::=(E)NN::=[]digit{digit}.
One particularly simple type of phrase structure grammar is the regular grammar.

Definition 14.2.21. Regular Grammar.

A regular (right-hand form) grammar is a grammar whose production rules are all of the form At and AtB, where A and B are nonterminal and t is terminal. A left-hand form grammar allows only At and ABt. A language that has a regular phrase structure language is called a regular language.

Example 14.2.22.

  1. The set of Sage variable names is a regular language since the grammar by which we defined the set is a regular grammar.
  2. The language of all strings for which all 0’s precede all 1’s (Example 14.2.14) is regular; however, the grammar by which we defined this set is not regular. Can you define these strings with a regular grammar?
  3. The language of arithmetic expressions is not regular.

Exercises Exercises

1.

  1. If a computer is being designed to operate with a character set of 350 symbols, how many bits must be reserved for each character? Assume each character will use the same number of bits.
  2. Do the same for 3,500 symbols.
Answer.
  1. For a character set of 350 symbols, the number of bits needed for each character is the smallest n such that 2n is greater than or equal to 350. Since 29=512>350>28, 9 bits are needed,
  2. 212=4096>3500>211; therefore, 12 bits are needed.

2.

It was pointed out in the text that the null string and the null set are different. The former is a string and the latter is a set, two different kinds of objects. Discuss how the two are similar.

3.

What sets of strings are defined by the following grammar?
  1. Terminal symbols: λ , 0 and 1
  2. Nonterminal symbols: S and E
  3. Starting symbol: S
  4. Production rules: S0S0,S1S1,SE,Eλ,E0,E1
Answer.
This grammar defines the set of all strings over B for which each string is a palindrome (same string if read forward or backward).

4.

What sets of strings are defined by the following grammar?
  1. Terminal symbols: λ, a, b, and c
  2. Nonterminal symbols: S,T,U and E
  3. Starting symbol: S
  4. Production rules:
    SaSSTTbTTUUcUUEEλ

5.

Define the following languages over B with phrase structure grammars. Which of these languages are regular?
  1. The strings with an odd number of characters.
  2. The strings of length 4 or less.
  3. The palindromes, strings that are the same backwards as forwards.
Answer.
  1. Terminal symbols: The null string, 0, and 1. Nonterminal symbols: S, E. Starting symbol: S. Production rules: S00S, S01S, S10S, S11S, SE, E0, E1 This is a regular grammar.
  2. Terminal symbols: The null string, 0, and 1. Nonterminal symbols: S, A, B, C Starting symbol: S Production rules: S0A, S1A, Sλ , A0B, A1B, Aλ, B0C, B1C, BA, C0, C1, Cλ This is a regular grammar.
  3. See Exercise 3. This language is not regular.

6.

Define the following languages over B with phrase structure grammars. Which of these languages are regular?
  1. The strings with more 0’s than 1’s.
  2. The strings with an even number of 1’s.
  3. The strings for which all 0’s precede all 1’s.

7.

Prove that if a language over A is recursive, then its complement is also recursive.
Answer.
If s is in A and L is recursive, we can answer the question “Is s in Lc?” by negating the answer to “Is s in L?

8.

Use BNF to define the grammars in Exercises 3 and 4.

9.

  1. Prove that if X1,X2,is a countable sequence of countable sets, the union of these sets, i=1Xi is countable.
  2. Using the fact that the countable union of countable sets is countable, prove that if A is countable, then A is countable.
Answer.
  1. List the elements of each set Xi in a sequence xi1, xi2, xi3 . Then draw arrows as shown below and list the elements of the union in order established by this pattern: x11, x21, x12, x13, x22, x31, x41, x32, x23, x14, x15,
  2. Each of the sets A1 , A2 , A3,, are countable and A is the union of these sets; hence A is countable.
Exercise 9
Figure 14.2.23. Exercise 9
You have attempted 1 of 1 activities on this page.