Storing multilingual records in the MySQL database

Školení, která pořádám

This article was published in the April 2009 issue of php|architect.

Multilingual applications use two types of texts requiring translation. The first ones are messages without a relation to the data (for example the title "Contents") which can easily be translated by GNU gettext or by a simple database table. The second type are texts related to the data (for example a German and Italian title for the same product with a given price). This article will examine database strategies for this second type of text.

Multilingual applications should use simple SQL queries (to avoid bugs), keep the performance high (which mainly means to utilize indexes) and produce the exact output (sort according to each language collation and respect the fulltext threshold). In additionally database consistency should be maintained, for example to preserve uniqueness of URLs. This article compares three different approaches for storing multilingual records considering these goals. In addition, the complexity of adding a new language and working with untranslated texts are covered.

This article demonstrates all approaches on a simple table of products with translatable names, URLs and descriptions, group identifiers and prices (common for all languages). The application needs to filter the products by the group identifier and then sort them by the name and we will perform a fulltext search on the name and description in a specified language. Therefore, in a single language environment, the table would look like this:

database schema
CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	`group_id` int(11) NOT NULL,
	`name` varchar(50) NOT NULL,
	`url` varchar(50) NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	`description` text NOT NULL,
	UNIQUE KEY `url` (`url`),
	KEY `group_id` (`group_id`, `name`),
	FULLTEXT KEY `name` (`name`, `description`),
	PRIMARY KEY (`product_id`)
);

The unique index above URL guarantees that there can be only one product at each location. The query to retrieve the products in a given group sorted by the name is simple (PHP syntax is used):

"
	SELECT `url`, `name`, `price`
	FROM `product`
	WHERE `group_id` = $group_id
	ORDER BY `name`
	LIMIT 30
"

The usage of indexes is of course correct:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEproductrefgroup_idgroup_id4const9Using where

Translations in a separate table

We can store the translatable columns in a separate table:

database schema
CREATE TABLE `language` (
	`language_id` char(2) NOT NULL,
	`collation` varchar(64) NOT NULL,
	PRIMARY KEY (`language_id`)
);

CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	`group_id` int(11) NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	KEY `group_id` (`group_id`),
	PRIMARY KEY (`product_id`)
);

CREATE TABLE `product_translation` (
	`product_id` int(11) NOT NULL,
	`language_id` char(2) NOT NULL,
	`name` varchar(50) NOT NULL,
	`url` varchar(50) NOT NULL,
	`description` tinytext NOT NULL,
	UNIQUE KEY `language_id_2` (`language_id`, `url`),
	KEY `language_id` (`language_id`, `name`),
	FULLTEXT KEY `name` (`name`, `description`),
	FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`),
	FOREIGN KEY (`product_id`) REFERENCES `product` (`product_id`),
	PRIMARY KEY (`product_id`, `language_id`)
);

(The `language`.`collation` column contains one of the collations returned by the SHOW COLLATION command and we need it to properly sort the results.)

With this schema, a simple query to retrieve the products in a given group sorted by the name is not so simple:

"
	SELECT `product_translation`.`url`, `product_translation`.`name`, `product`.`price`
	FROM `product`
	INNER JOIN `product_translation` USING (`product_id`)
	WHERE `product`.`group_id` = $group_id AND `product_translation`.`language_id` = '$language_id'
	ORDER BY `product_translation`.`name` COLLATE $collation
	LIMIT 30
"

(Variable $collation is taken from `language`.`collation` in the application initialization.)

The query is complex and its execution plan is bad:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEproductrefPRIMARY, group_idgroup_id4const12Using temporary; Using filesort
1SIMPLEproduct_translationeq_refPRIMARY, language_id_2, language_idPRIMARY10product.product_id, const1Using where

Because of filesort, MySQL has to read all entries and cannot use the LIMIT clause to skip unnecessary rows. It is N * log N where as to perform the ordering with LIMIT clause would be a constant! MySQL cannost use the index on (language_id, name) for sorting at all because we need a different collation for every language.

What about fulltext search? MySQL natural language search ignores words occurring in at least half of the rows. Because all languages are mixed in one column, this feature will likely exclude most words and thus be useless. Fulltext search can be good in some scenarios but it has negative impact on performance. Moreover, MySQL cannot use a fulltext index together with a normal index so the fulltext must search in all languages and then restrict the results by hand.

In a real-world application, we have to deal with untranslated texts. Usually, we use some default language in the place of them. With this table schema, we have two options available to solve this problem:

  1. One option is to copy the default language version of all texts to the newly added language. This greatly complicates adding a new language to the application. It also complicates adding a new product. Unique index above URL would work but then we would have to insert all translations in a transaction and rollback the transaction in a case of conflict.
  2. The second option is to use the default language version if a translation does not exist. This makes Adding a language very simple (just inserting a row in the language table) but the so called "simple query" gets complicated even more
    "
    	SELECT IFNULL(`product_translation`.`url`, `product_default`.`url`) AS `url`, IFNULL(`product_translation`.`name`, `product_default`.`name`) AS `name`, `product`.`price`
    	FROM `product`
    	INNER JOIN `product_translation` `product_default` USING (`product_id`)
    	LEFT JOIN `product_translation` ON `product`.`product_id` = `product_translation`.`product_id` AND `product_translation`.`language_id` = '$language_id'
    	WHERE `product`.`group_id` = $group_id AND `product_default`.`language_id` = '$default_id'
    	ORDER BY IFNULL(`product_translation`.`name`, `product_default`.`name`) COLLATE $collation
    	LIMIT 30
    "
    Unique index above URL is useless as we would have to check for existence of the same URL in a trigger or a stored procedure at the time of inserting, updating and deleting a product translation.

Data copy

This approach copies also the data common to all languages.

database schema
CREATE TABLE `language` (
	`language_id` char(2) NOT NULL,
	`collation` varchar(64) NOT NULL,
	PRIMARY KEY (`language_id`)
);

CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	PRIMARY KEY (`product_id`)
);

CREATE TABLE `product_translation` (
	`product_id` int(11) NOT NULL,
	`language_id` char(2) NOT NULL,
	`group_id` int(11) NOT NULL,
	`name` varchar(50) NOT NULL,
	`url` varchar(50) NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	`description` text NOT NULL,
	UNIQUE KEY `language_id_2` (`language_id`, `url`),
	KEY `language_id` (`language_id`, `group_id`, `name`),
	FULLTEXT KEY `name` (`name`, `description`),
	FOREIGN KEY (`product_id`) REFERENCES `product` (`product_id`),
	FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`),
	PRIMARY KEY (`product_id`, `language_id`)
);

(The product table can be just a sequence.)

The group identifier and the price are copied to all languages. Technically a trigger can achieve it, however we have to define a trigger on all translatable tables in the database and it has to respect all columns in the table (so addition or modification of a table would be difficult). The data also takes more space so the working set is less likely to fit in memory.

The simple query remains simple:

"
	SELECT `url`, `name`, `price`
	FROM `product_translation`
	WHERE `language_id` = '$language_id' AND `group_id` = $group_id
	ORDER BY `name` COLLATE $collation
	LIMIT 30
"

Even the execution plan would be good only if it were not for the COLLATE clause:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEproduct_translationreflanguage_id_2, language_idlanguage_id10const, const9Using where; Using filesort

The COLLATE is causing a filesort which is bad for the LIMIT clause. The query would be fast without it but the results would not respect the sorting order specific for the language.

Both problems with fulltext search remain.

We have to solve the same dilemma as in the previous approach regarding untranslated texts and adding a new language. Either the queries will be complicated and slower or the language addition will be painful. Unique index would not work in the case of falling back to the default language or we would have to do all modifications in a transaction when copying the default language.

Translation directly in the data table

The previous approaches have one basic problem - they are storing different data (texts with miscellaneous collations) to the same column. We have to separate the languages into the individual columns to solve this basic problem:

database schema
CREATE TABLE `language` (
	`language_id` char(2) NOT NULL,
	`collation` varchar(64) NOT NULL,
	PRIMARY KEY (`language_id`)
);

CREATE TABLE `product` (
	`product_id` int(11) NOT NULL auto_increment,
	`group_id` int(11) NOT NULL,
	`name_en` varchar(50) NOT NULL,
	`name_cs` varchar(50) collate utf8_czech_ci NOT NULL,
	`url_en` varchar(50) NOT NULL,
	`url_cs` varchar(50) collate utf8_czech_ci NOT NULL,
	`price` decimal(9, 2) NOT NULL,
	`description_en` text NOT NULL,
	`description_cs` text collate utf8_czech_ci NOT NULL,
	UNIQUE KEY `url_en` (`url_en`),
	UNIQUE KEY `url_cs` (`url_cs`),
	KEY `group_id_en` (`group_id`, `name_en`),
	KEY `group_id_cs` (`group_id`, `name_cs`),
	FULLTEXT KEY `name_cs` (`name_cs`, `description_cs`),
	FULLTEXT KEY `name_en` (`name_en`, `description_en`),
	PRIMARY KEY (`product_id`)
);

(Notice the different collations for columns with translations.)

Each translatable field gets its own column in each language. The same holds for indexes - there are more of them but they are smaller.

The table is not nice but the simple query remains simple and produces correct results:

"
	SELECT `url_$language_id`, `name_$language_id`, `price`
	FROM `product`
	WHERE `group_id` = $group_id
	ORDER BY `name_$language_id`
	LIMIT 30
"

(The $language_id variable is whitelisted from `language`.`language_id`.)

It is also as fast as a single-language query:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEproductrefgroup_id_en, group_id_csgroup_id_en4const11Using where

Both problems with fulltext search disappear. Each index has separate statistics and there is no need to filter by language identifier when performing the search.

Adding a language definitely will be a difficult operation so we do not need to solve the dilemma whether to store the default language to untranslated texts. It can be stored to keep query simplicity and speed. Unique indexes above URLs simply work.

Adding a new language will be a complex operation when this approach is used. We have to go through all tables and add a column for each language field. We also have to create new indexes. We can achieve it by a stored procedure or by a PHP function:

<?php
/** Addition of language columns to all tables
* @param string language identifier
* @param string default language identifier
* @param string way of ordering texts
* @return null execution of commands ALTER TABLE and UPDATE
* @copyright Jakub Vrana, https://php.vrana.cz/
*/
function add_language($language_id, $default_id = "en", $collation = "") {
    $result = mysql_query("SHOW TABLES");
    while ($row = mysql_fetch_row($result)) {
        
        // addition and copy of columns
        $alter = array();
        $update = array();
        $result1 = mysql_query("SHOW COLUMNS FROM $row[0]");
        while ($row1 = mysql_fetch_assoc($result1)) {
            if (substr($row1["Field"], -3) == "_$default_id") {
                $field = substr($row1["Field"], 0, -2) . $language_id;
                $null = ($row1["Null"] == "NO" ? " NOT NULL" : "");
                $alter[] = "ADD $field $row1[Type]" . ($collation ? " COLLATE $collation" : "") . "$null AFTER $row1[Field]";
                $update[] = "$field = $row1[Field]";
            }
        }
        mysql_free_result($result1);
        if ($alter) {
            mysql_query("ALTER TABLE $row[0] " . implode(", ", $alter));
            mysql_query("UPDATE $row[0] SET " . implode(", ", $update));
            
            // addition of indexes
            $indexes = array();
            $result1 = mysql_query("SHOW INDEXES FROM $row[0]");
            while ($row1 = mysql_fetch_assoc($result1)) {
                $type = ($row1["Index_type"] == "FULLTEXT" ? "FULLTEXT" : ($row1["Non_unique"] ? "INDEX" : "UNIQUE"));
                $indexes[$type][$row1["Key_name"]] .= "$row1[Column_name], ";
            }
            mysql_free_result($result1);
            foreach ($indexes as $type => $type_indexes) {
                foreach ($type_indexes as $index) {
                    if (strpos($index, "_$default_id, ")) {
                        mysql_query("ALTER TABLE $row[0] ADD $type (" . substr(str_replace("_$default_id, ", "_$language_id, ", $index), 0, -2) . ")");
                    }
                }
            }
        }
    }
    mysql_free_result($result);
}
?>

The maximum row size in MySQL is about 64 KB so with big tables and many languages, this limit could be reached. In this case, it is possible to separate some columns to a different table with one-to-one relationship. Another approach would be to store each language in a separate table and copy common columns the same way as we did in the Data copy approach. This however involves problems with trigger maintenance.

Other databases

Other database systems such as Oracle or PostgreSQL do not have support for a collation definition in each column and use a different approach for a language-dependent sorting. They define (PostgreSQL through an extension) an nlssort function accepting a string to sort and a language identifier to sort by. The database systems can use this function together with partial function indexes to sort the products effectively in the Data copy schema. However, we still have to define the indexes at the time of adding a language.

Final thoughts

Conclusion

Let's name pros and cons of all three approaches:

Translations in a separate tableData copyTranslation in the data table
- slow sorting
- slow fulltext search
- complex SQL queries

Copy of default language:
- complex language addition
- complex product insertion
- extra table for untranslated texts

Fallback to default language:
+ simple language addition
- even more complex SQL queries
- complex URL modifications
- slow or inaccurate sorting
- slow fulltext search
- bigger footprint
- difficult table altering

Copy of default language:
+ simple queries
- complex language addition
- complex product insertion
- extra table for untranslated texts

Fallback to default language:
+ simple language addition
- complex SQL queries
- complex any URL modification
- variable database structure
- very complex language addition

Copy of default language:
+ simple queries
+ fast sorting
+ fast fulltext search
- extra table for untranslated texts

The third approach has a big handicap in the variable database structure (the user adding a language must have the ALTER privilege). It exceeds the other approaches in all other aspects and at the end of a day, we have to do the least work and still get a solution with the highest performance.

About the author

Jakub Vrána is one of the authors of official PHP manual, and he uses MySQL for developing web applications in PHP. He also teaches the MySQL basics at Charles University and conducts commercial trainings. He is the founder of compact MySQL management tool phpMinAdmin Adminer. Jakub lives in Prague, Czech Republic.

Jakub Vrána, Výuka, 24.8.2009, comments: 13 (new: 0)

Comments

ikona Patrik Votoček (Vrtak-CZ):

Bude česká verze?

kluvi:

http://php.vrana.cz/ukladani-vicejazycnych-zaznamu.php
:-)nevim jestli je to uplne presne to stejny.. ale obsah je +/- stejnej :-)

ikona Jakub Vrána OpenID:

Ano, to je jednodušší článek na stejné téma. Kompletní překlad tohoto článku psát nebudu.

Marek Hrabě:

Prosím: http://translate.google.com/translate?prev=…&history_state0=

Je to taková pseudo-čeština :D

Jinak koukni ta, kam psal kluvi...(#d-9028)
#d-9033 reply


Marek Hrabě:
Moc pěkný článek, Jakube! :) Jen by to možná chtělo zaměnit phpMinAdmin za Adminer...

ikona Jakub Vrána OpenID:

Díky za upozornění. Ale jde o přetisk článku z php|architecta, takže v něm výjimečně změny dělat nebudu.

Mastodont:

Chybí mi tu čtvrtá možnost - samostatná tabulka pro každý jazyk překladu. Nevýhodou je sice proměnný počet tabulek, ale zase by se mělo značně vylepšit řazení.

ikona Jakub Vrána OpenID:

Ano, je to kompromis mezi druhým a třetím přístupem. Odstraňuje problémy

- slow or inaccurate sorting
- slow fulltext search

a naopak přidává

- variable database structure
- complex language addition (i ve variantě Fallback to default language)

Prdlořeznictví Krkovička, n. p.:

In additionally - buď Additionally, nebo In addition,
MySQL cannost use the index - cannot nebo can not
+pár menších.
Dobrý článek. Díky, že ses o něj podělil.

ikona Jakub Vrána OpenID:

Korektury si v php|architectu naštěstí dělali sami, článek byl celý červený :-). Sem jsem ale asi dal verzi bez korektur, protože se mi je nechtělo přenášet z jejich formátu do HTML.

Ludmila Pičmanová:

Ten limit 64 KB na řádku, ten platí pro součet všech sloupců kromě typů text a blob?

ikona Jakub Vrána OpenID:

Ano a je to skutečně bajtů (nikoliv znaků). Takže např. pro varchar(100) v kódování UTF-8 vyhradí MySQL 300 bajtů. Typy text a blob zabírají nějaké místo na pointery. InnoDB má navíc ještě vlastní limit, který je jen 8 KiB.

Hari KT:

Time for you to edit "He is the founder of compact MySQL management tool phpMinAdmin" and say adminer.

Thank you

ikona Jakub Vrána OpenID:

Fixed, thanks.

Diskuse je zrušena z důvodu spamu.

avatar © 2005-2024 Jakub Vrána. Publikované texty můžete přetiskovat pouze se svolením autora. Ukázky kódu smíte používat s uvedením autora a URL tohoto webu bez dalších omezení Creative Commons. Můžeme si tykat. Skripty předpokládají nastavení: magic_quotes_gpc=Off, magic_quotes_runtime=Off, error_reporting=E_ALL & ~E_NOTICE a očekávají předchozí zavolání mysql_set_charset. Skripty by měly být funkční v PHP >= 4.3 a PHP >= 5.0.