Published: Monday, April 05, 1999
Multilingual Databasing
Techniques
By Alex Buchanan
Related Article: Multilingual Databasing, Part 2
The need for this kind of script arose when I was asked to write an online
phonebook for the Quebec Department of Justice. This was to be done in ASP
accessing a database through ODBC. Most databases containing proper names
are usually stripped of accents and other extended foreign characters
(é, à, ç...). Naturally, Quebec is predominantly
French and French names often contain accents.
The major constraint I was imposed was to leave the accents in the fields.
Furthermore, the database should be searchable with or without
accents. There are many names in French which are spelt with or without
accents (i.e. Eric, Éric...). Evidently, SQL takes care of character
case, but not foreign characters.
This article will cover the solution to this kind of situation.
I - Database structure
Before modeling the database, you must figure out if you want to prioritize
speed or space.
The space-efficient solution would be to compare the queried word to all the
possible combinations of a word, taking into consideration the characters
which can be accented.
For example, if a user submits "Réjean" as a keyword, the script
would search for "Rejean", "Réjean", "Rejèan"... -- until it
finds a suitable match or exhausts all possibilities (which are almost
endless). Granted you don't need to create additional tables or fields, it
seems like overkill considering storage space is cheap and databases are
more scalable and robust nowadays. I strongly suggest you don't use this
method.
The speed-efficient solution would be to store an additional field or table
in the database with the particular name/word stripped from its accents.
There are two feasible ways of implementing this:
Adding an extra field
This method consists of adding an extra field to an existing table which
would contain the stripped version of the name. For example the following
table Entry(EID, Fname, Lname, PhoneNum) would become
Entry(EID, Fname, Fname2, Lname, Lname2, PhoneNum) where
Fname2 and Lname2 are the stripped versions of
the fields with similar names (no accents).
| Table Entry |
| EID |
Fname |
Fname2 |
Lname |
Lname2 |
PhoneNum |
| 3 |
Pierre |
Pierre |
Côté |
Cote |
452-4252 |
| 4 |
Réjean |
Rejean |
Préçus |
Precus |
863-5218 |
These two fields would then become indexed (allowing duplicates) since
searching would be done using these fields. How the query works, is that
the user submits a particular keyword which is stripped of its accents. A
search is then performed comparing it to the stripped field (Fname2 /
Lname2). If it is found, the field with accents is returned and
displayed to the user. Implementation will be discussed further in this
article.
This method works fine but keep in mind that there is a considerate amount
of redundancy in the table design (although it is controlled). Read the
Data Modeling articles on 4GuysFromRolla for more
information.
Adding an extra table
This method consists of adding a table which avoids double-storing names by
mapping accented names onto their accentless versions. Consider this new
table: MappedNames(AccentedName (PK), AccentlessName). Only
names with accents would appear in this table; first and last names can be
mapped using the same fields.
| Table Entry |
| EID |
Fname |
Lname |
PhoneNum |
| 3 |
Pierre |
Côté |
452-4252 |
| 4 |
Réjean |
Préçus |
863-5218 |
|
|
| Table MappedNames |
| AccentedName |
AccentlessName |
| Côté |
Cote |
| Réjean |
Rejean |
| Préçus |
Precus |
|
A SELECT statement to query a name would go as follows:
PARAMETER queryname: Text;
SELECT Entry.Fname, Entry.Lname, Entry.PhoneNum
FROM Entry
INNER JOIN MappedNames ON
Entry.Lname = MappedNames.AccentedName
WHERE Entry.Lname = queryname
OR
MappedNames.AccentlessName = queryname;
The parameter queryname will naturally be stripped of accents
before being used in the query. This method avoids the redundancy of
common names being stored multiple times and the event where first names
can also be last names. Both methods vary in complexity but accomplish the
same task with similar efficiency.
II - Implementation
For clarity and ease of understanding, I will illustrate the implementation
by using the extra field method. The concept applies to any similar
method.
Data modeling
The first step is to create the stripped-down fields (i.e. Fname2, Lname2),
setting them to "Indexed" (allowing duplicates). This is probably easiest
done directly in your DBMS. Add one for each field which is used for
searching.
Data definition
Next, we insert or update the stripped fields by stripping down the main
accented fields. This can be done by using the following functions and SQL
statements:
Function Unapostrophize(theword)
Unapostrophize = Replace(theword, "'", "''")
End Function
Function StripAccents(theword)
Dim letter, tempWord
For i = 1 to Len(theword)
letter = Mid(theword,i,1)
Select case LCase(letter)
case "é","è","ê","ë"
letter = "e"
case "à","â","å","ä"
letter = "a"
case "ù","û"
letter = "u"
case "ò","ö","ô"
letter = "o"
case "ç"
letter = "c"
case "ï","ì"
letter = "i"
case "ÿ"
letter = "y"
case "'"
letter = "''"
'.
'. Add more characters as needed
'.
End Select
tempWord = tempWord & letter
Next
StripAccents = tempWord
End Function
'...
SQL = "INSERT INTO Entry (Fname, Fname2) VALUES ('" & _
Unapostrophize(theFname) & "', '" & _
StripAccents(theFname) & "')"
SQL = "UPDATE Entry SET Fname2 = '" & _
StripAccents(theFname) & _
"' WHERE EID = theEID"
Unapostrophize() doubles any apostrophes found. This bypasses the problem
where SQL interprets apostrophes as string delimiters should a "D'Angelo"
be entered into the database.
StripAccents() replaces all accented letters by unaccented versions as well
as takes care of the apostrophe.
theFname, theEID are variables which assumingly have correct Fname and EID
values (This might have been returned from a query).
The fields should now all be filled as needed. We will proceed to the
querying part.
Querying
Observe the following sample program using the functions and concepts
mentioned above:
qLname = Request.QueryString("lname")
If qLname <> "" Then
SQL = "SELECT Fname, Lname, PhoneNum FROM Entry WHERE Lname2 LIKE
'%" & StripAccents(qLname) & "%' ORDER BY Lname"
'Connect to database
SET DbObj = Server.CreateObject("ADODB.CONNECTION")
DbObj.Open "DSN=phonebk;UID=;PWD=;"
SET oRs = DbObj.Execute(SQL) 'Execute above SQL statement
While Not oRs.EOF
Response.Write oRs.Fields("Lname") & ", " & _
oRs.Fields("Fname") & _
" - " & oRs.Fields("PhoneNum")
'output a linebreak
Response.Write "<" & "BR>"
oRs.MoveNext
Wend
DbObj.Close
SET DbObj = Nothing
SET oRs = Nothing
End If
As you can see, we strip the query qLname and compare it to the
stripped field Lname2 in the database. Since both fields are
on the same level, finding results is no problem. You'll notice that in
the results, we output the actual fields with accents, so the user has no
clue what goes on.
As a quick review, we create an extra field or table containing a version of
a string without accents. We then accept a query which has been stripped
too of accents, which is then compared to the stripped fields. Matches can
then be found. Fields displayed or used will contain accents.
III - Credits
- Nicole Calinoiu, for offering the speed-efficient solution of adding
an extra table as well as including a good example.
IV - Final Notes
This article covers conceptually how to deal with a database containing
foreign characters, making it easily queryable by users. These characters
are output directly to HTML, which is fine in a controlled environment
where you know all users will be running the same operating system and
platform. If this is not the case, you may have problems. This will be
the topic of my next article, a follow-up of this one.
I hope this article has been of help to you.
I strongly believe in the sharing of information and talent between
professionals and amateurs alike. Should you know a better way of doing
the same thing, please share your knowledge. I'm open to comments and
criticism and can be emailed at alex@bh.qc.ca
Happy Programming!
Alex is a computer science student at Vanier College in Montreal,
Quebec, graduating in May 1999 with intentions to pursue his education at
Concordia University. He has been
developing web sites and web applications for several years now, only
dabbling in ASP for the last few months.