This article was published in the April 2009 issue of php|architect.
Multilingual applications use two types of texts requiring translation. The first ones are messages without a relation to the data (for example the title "Contents") which can easily be translated by GNU gettext or by a simple database table. The second type are texts related to the data (for example a German and Italian title for the same product with a given price). This article will examine database strategies for this second type of text.
Multilingual applications should use simple SQL queries (to avoid bugs), keep the performance high (which mainly means to utilize indexes) and produce the exact output (sort according to each language collation and respect the fulltext threshold). In additionally database consistency should be maintained, for example to preserve uniqueness of URLs. This article compares three different approaches for storing multilingual records considering these goals. In addition, the complexity of adding a new language and working with untranslated texts are covered.
This article demonstrates all approaches on a simple table of products with translatable names, URLs and descriptions, group identifiers and prices (common for all languages). The application needs to filter the products by the group identifier and then sort them by the name and we will perform a fulltext search on the name and description in a specified language. Therefore, in a single language environment, the table would look like this:
CREATE TABLE `product` ( `product_id` int(11) NOT NULL auto_increment, `group_id` int(11) NOT NULL, `name` varchar(50) NOT NULL, `url` varchar(50) NOT NULL, `price` decimal(9, 2) NOT NULL, `description` text NOT NULL, UNIQUE KEY `url` (`url`), KEY `group_id` (`group_id`, `name`), FULLTEXT KEY `name` (`name`, `description`), PRIMARY KEY (`product_id`) );
The unique index above URL guarantees that there can be only one product at each location. The query to retrieve the products in a given group sorted by the name is simple (PHP syntax is used):
" SELECT `url`, `name`, `price` FROM `product` WHERE `group_id` = $group_id ORDER BY `name` LIMIT 30 "
The usage of indexes is of course correct:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | product | ref | group_id | group_id | 4 | const | 9 | Using where |
We can store the translatable columns in a separate table:
CREATE TABLE `language` ( `language_id` char(2) NOT NULL, `collation` varchar(64) NOT NULL, PRIMARY KEY (`language_id`) ); CREATE TABLE `product` ( `product_id` int(11) NOT NULL auto_increment, `group_id` int(11) NOT NULL, `price` decimal(9, 2) NOT NULL, KEY `group_id` (`group_id`), PRIMARY KEY (`product_id`) ); CREATE TABLE `product_translation` ( `product_id` int(11) NOT NULL, `language_id` char(2) NOT NULL, `name` varchar(50) NOT NULL, `url` varchar(50) NOT NULL, `description` tinytext NOT NULL, UNIQUE KEY `language_id_2` (`language_id`, `url`), KEY `language_id` (`language_id`, `name`), FULLTEXT KEY `name` (`name`, `description`), FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`), FOREIGN KEY (`product_id`) REFERENCES `product` (`product_id`), PRIMARY KEY (`product_id`, `language_id`) );
(The `language`.`collation`
column contains one of the collations returned by the SHOW COLLATION command and we need it to properly sort the results.)
With this schema, a simple query to retrieve the products in a given group sorted by the name is not so simple:
" SELECT `product_translation`.`url`, `product_translation`.`name`, `product`.`price` FROM `product` INNER JOIN `product_translation` USING (`product_id`) WHERE `product`.`group_id` = $group_id AND `product_translation`.`language_id` = '$language_id' ORDER BY `product_translation`.`name` COLLATE $collation LIMIT 30 "
(Variable $collation
is taken from `language`.`collation`
in the application initialization.)
The query is complex and its execution plan is bad:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | product | ref | PRIMARY, group_id | group_id | 4 | const | 12 | Using temporary; Using filesort |
1 | SIMPLE | product_translation | eq_ref | PRIMARY, language_id_2, language_id | PRIMARY | 10 | product.product_id, const | 1 | Using where |
Because of filesort, MySQL has to read all entries and cannot use the LIMIT clause to skip unnecessary rows. It is N * log N
where as to perform the ordering with LIMIT clause would be a constant! MySQL cannost use the index on (language_id, name)
for sorting at all because we need a different collation for every language.
What about fulltext search? MySQL natural language search ignores words occurring in at least half of the rows. Because all languages are mixed in one column, this feature will likely exclude most words and thus be useless. Fulltext search can be good in some scenarios but it has negative impact on performance. Moreover, MySQL cannot use a fulltext index together with a normal index so the fulltext must search in all languages and then restrict the results by hand.
In a real-world application, we have to deal with untranslated texts. Usually, we use some default language in the place of them. With this table schema, we have two options available to solve this problem:
" SELECT IFNULL(`product_translation`.`url`, `product_default`.`url`) AS `url`, IFNULL(`product_translation`.`name`, `product_default`.`name`) AS `name`, `product`.`price` FROM `product` INNER JOIN `product_translation` `product_default` USING (`product_id`) LEFT JOIN `product_translation` ON `product`.`product_id` = `product_translation`.`product_id` AND `product_translation`.`language_id` = '$language_id' WHERE `product`.`group_id` = $group_id AND `product_default`.`language_id` = '$default_id' ORDER BY IFNULL(`product_translation`.`name`, `product_default`.`name`) COLLATE $collation LIMIT 30 "Unique index above URL is useless as we would have to check for existence of the same URL in a trigger or a stored procedure at the time of inserting, updating and deleting a product translation.
This approach copies also the data common to all languages.
CREATE TABLE `language` ( `language_id` char(2) NOT NULL, `collation` varchar(64) NOT NULL, PRIMARY KEY (`language_id`) ); CREATE TABLE `product` ( `product_id` int(11) NOT NULL auto_increment, PRIMARY KEY (`product_id`) ); CREATE TABLE `product_translation` ( `product_id` int(11) NOT NULL, `language_id` char(2) NOT NULL, `group_id` int(11) NOT NULL, `name` varchar(50) NOT NULL, `url` varchar(50) NOT NULL, `price` decimal(9, 2) NOT NULL, `description` text NOT NULL, UNIQUE KEY `language_id_2` (`language_id`, `url`), KEY `language_id` (`language_id`, `group_id`, `name`), FULLTEXT KEY `name` (`name`, `description`), FOREIGN KEY (`product_id`) REFERENCES `product` (`product_id`), FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`), PRIMARY KEY (`product_id`, `language_id`) );
(The product table can be just a sequence.)
The group identifier and the price are copied to all languages. Technically a trigger can achieve it, however we have to define a trigger on all translatable tables in the database and it has to respect all columns in the table (so addition or modification of a table would be difficult). The data also takes more space so the working set is less likely to fit in memory.
The simple query remains simple:
" SELECT `url`, `name`, `price` FROM `product_translation` WHERE `language_id` = '$language_id' AND `group_id` = $group_id ORDER BY `name` COLLATE $collation LIMIT 30 "
Even the execution plan would be good only if it were not for the COLLATE clause:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | product_translation | ref | language_id_2, language_id | language_id | 10 | const, const | 9 | Using where; Using filesort |
The COLLATE is causing a filesort which is bad for the LIMIT clause. The query would be fast without it but the results would not respect the sorting order specific for the language.
Both problems with fulltext search remain.
We have to solve the same dilemma as in the previous approach regarding untranslated texts and adding a new language. Either the queries will be complicated and slower or the language addition will be painful. Unique index would not work in the case of falling back to the default language or we would have to do all modifications in a transaction when copying the default language.
The previous approaches have one basic problem - they are storing different data (texts with miscellaneous collations) to the same column. We have to separate the languages into the individual columns to solve this basic problem:
CREATE TABLE `language` ( `language_id` char(2) NOT NULL, `collation` varchar(64) NOT NULL, PRIMARY KEY (`language_id`) ); CREATE TABLE `product` ( `product_id` int(11) NOT NULL auto_increment, `group_id` int(11) NOT NULL, `name_en` varchar(50) NOT NULL, `name_cs` varchar(50) collate utf8_czech_ci NOT NULL, `url_en` varchar(50) NOT NULL, `url_cs` varchar(50) collate utf8_czech_ci NOT NULL, `price` decimal(9, 2) NOT NULL, `description_en` text NOT NULL, `description_cs` text collate utf8_czech_ci NOT NULL, UNIQUE KEY `url_en` (`url_en`), UNIQUE KEY `url_cs` (`url_cs`), KEY `group_id_en` (`group_id`, `name_en`), KEY `group_id_cs` (`group_id`, `name_cs`), FULLTEXT KEY `name_cs` (`name_cs`, `description_cs`), FULLTEXT KEY `name_en` (`name_en`, `description_en`), PRIMARY KEY (`product_id`) );
(Notice the different collations for columns with translations.)
Each translatable field gets its own column in each language. The same holds for indexes - there are more of them but they are smaller.
The table is not nice but the simple query remains simple and produces correct results:
" SELECT `url_$language_id`, `name_$language_id`, `price` FROM `product` WHERE `group_id` = $group_id ORDER BY `name_$language_id` LIMIT 30 "
(The $language_id
variable is whitelisted from `language`.`language_id`
.)
It is also as fast as a single-language query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | product | ref | group_id_en, group_id_cs | group_id_en | 4 | const | 11 | Using where |
Both problems with fulltext search disappear. Each index has separate statistics and there is no need to filter by language identifier when performing the search.
Adding a language definitely will be a difficult operation so we do not need to solve the dilemma whether to store the default language to untranslated texts. It can be stored to keep query simplicity and speed. Unique indexes above URLs simply work.
Adding a new language will be a complex operation when this approach is used. We have to go through all tables and add a column for each language field. We also have to create new indexes. We can achieve it by a stored procedure or by a PHP function:
<?php /** Addition of language columns to all tables * @param string language identifier * @param string default language identifier * @param string way of ordering texts * @return null execution of commands ALTER TABLE and UPDATE * @copyright Jakub Vrana, https://php.vrana.cz/ */ function add_language($language_id, $default_id = "en", $collation = "") { $result = mysql_query("SHOW TABLES"); while ($row = mysql_fetch_row($result)) { // addition and copy of columns $alter = array(); $update = array(); $result1 = mysql_query("SHOW COLUMNS FROM $row[0]"); while ($row1 = mysql_fetch_assoc($result1)) { if (substr($row1["Field"], -3) == "_$default_id") { $field = substr($row1["Field"], 0, -2) . $language_id; $null = ($row1["Null"] == "NO" ? " NOT NULL" : ""); $alter[] = "ADD $field $row1[Type]" . ($collation ? " COLLATE $collation" : "") . "$null AFTER $row1[Field]"; $update[] = "$field = $row1[Field]"; } } mysql_free_result($result1); if ($alter) { mysql_query("ALTER TABLE $row[0] " . implode(", ", $alter)); mysql_query("UPDATE $row[0] SET " . implode(", ", $update)); // addition of indexes $indexes = array(); $result1 = mysql_query("SHOW INDEXES FROM $row[0]"); while ($row1 = mysql_fetch_assoc($result1)) { $type = ($row1["Index_type"] == "FULLTEXT" ? "FULLTEXT" : ($row1["Non_unique"] ? "INDEX" : "UNIQUE")); $indexes[$type][$row1["Key_name"]] .= "$row1[Column_name], "; } mysql_free_result($result1); foreach ($indexes as $type => $type_indexes) { foreach ($type_indexes as $index) { if (strpos($index, "_$default_id, ")) { mysql_query("ALTER TABLE $row[0] ADD $type (" . substr(str_replace("_$default_id, ", "_$language_id, ", $index), 0, -2) . ")"); } } } } } mysql_free_result($result); } ?>
The maximum row size in MySQL is about 64 KB so with big tables and many languages, this limit could be reached. In this case, it is possible to separate some columns to a different table with one-to-one relationship. Another approach would be to store each language in a separate table and copy common columns the same way as we did in the Data copy approach. This however involves problems with trigger maintenance.
Other databases
Other database systems such as Oracle or PostgreSQL do not have support for a collation definition in each column and use a different approach for a language-dependent sorting. They define (PostgreSQL through an extension) an nlssort function accepting a string to sort and a language identifier to sort by. The database systems can use this function together with partial function indexes to sort the products effectively in the Data copy schema. However, we still have to define the indexes at the time of adding a language.
Let's name pros and cons of all three approaches:
Translations in a separate table | Data copy | Translation in the data table |
---|---|---|
- slow sorting - slow fulltext search - complex SQL queries Copy of default language: - complex language addition - complex product insertion - extra table for untranslated texts Fallback to default language: + simple language addition - even more complex SQL queries - complex URL modifications |
- slow or inaccurate sorting - slow fulltext search - bigger footprint - difficult table altering Copy of default language: + simple queries - complex language addition - complex product insertion - extra table for untranslated texts Fallback to default language: + simple language addition - complex SQL queries - complex any URL modification |
- variable database structure - very complex language addition Copy of default language: + simple queries + fast sorting + fast fulltext search - extra table for untranslated texts |
The third approach has a big handicap in the variable database structure (the user adding a language must have the ALTER privilege). It exceeds the other approaches in all other aspects and at the end of a day, we have to do the least work and still get a solution with the highest performance.
Jakub Vrána is one of the authors of official PHP manual, and he uses MySQL for developing web applications in PHP. He also teaches the MySQL basics at Charles University and conducts commercial trainings. He is the founder of compact MySQL management tool phpMinAdmin Adminer. Jakub lives in Prague, Czech Republic.
Diskuse je zrušena z důvodu spamu.