Source Code Analysis: Translation Coverage

How to Make a Translation Term Coverage Report

Let’s assume our application has a UI translation module. We created a dictionary and use its terms in the code of the application components. So far, so good. But one day we begin to suspect that not all of the terms in the dictionary are actually used in the application. Besides, some of the terms used in the components are probably missing from the dictionary. So we need a script that traverses all the source files looking for uses of the translation module. Once it has the list of all the terms it encountered, the script can compare that list with the terms in the dictionary.

PHP ships with a simple but powerful tool for this job, token_get_all. Let’s try it and see what it returns:

$tokens = token_get_all('<?php
class Foo {
    public function Bar()
    {
        return "Trolololo";
    }
}
');
print_r($tokens);

Output:

Array
(
    [0] => Array
        (
            [0] => 367
            [1] => <?php

            [2] => 1
        )

    [1] => Array
        (
            [0] => 352
            [1] => class
            [2] => 2
        )

    [2] => Array
        (
            [0] => 370
            [1] =>  
            [2] => 2
        )

    [3] => Array
        (
            [0] => 307
            [1] => Foo
            [2] => 2
        )
...
)

At first glance it is not obvious what the output array means. Each row holds the parser token code, the text matched for the token, and the line on which the token was found, in that order. The numeric codes differ between PHP versions, so always compare against the T_* constants rather than raw numbers. Using the token_name function to map the token codes, we can easily convert the array into something like this:

Array
(
    [0] => Array
        (
            [token] => T_OPEN_TAG
            [text] => <?php

            [line] => 1
        )

    [1] => Array
        (
            [token] => T_CLASS
            [text] => class
            [line] => 2
        )

    [2] => Array
        (
            [token] => T_WHITESPACE
            [text] =>  
            [line] => 2
        )

    [3] => Array
        (
            [token] => T_STRING
            [text] => Foo
            [line] => 2
        )
...
)
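The conversion itself is not shown above, so here is a minimal sketch of how it might look. The function name describe_tokens is hypothetical. The sketch also assumes you want single-character tokens such as "{" or ";" in the result: token_get_all returns those as plain strings without a line number, so the line is derived from the preceding tokens.

```php
<?php
// Hypothetical helper: converts raw token_get_all() output into labelled rows.
function describe_tokens($source)
{
    $rows = array();
    $line = 1; // line on which the next token starts
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            list($code, $text, $startLine) = $token;
            $rows[] = array(
                'token' => token_name($code),
                'text'  => $text,
                'line'  => $startLine,
            );
            // The token may span several lines (whitespace, comments, ...).
            $line = $startLine + substr_count($text, "\n");
        } else {
            // Single-character tokens arrive as plain strings; reuse the
            // character as its own name and use the derived line number.
            $rows[] = array('token' => $token, 'text' => $token, 'line' => $line);
        }
    }
    return $rows;
}

$rows = describe_tokens("<?php\nclass Foo {\n}\n");
print_r($rows);
```

Keeping the single-character tokens in the same shape as the others makes later pattern matching simpler, because every row can be read the same way.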

Now you can see that we can find anything we want in the source code. If you use the gettext() translation technique, you just search for sequences of T_STRING/gettext and T_CONSTANT_ENCAPSED_STRING. Right? It is not that easy. We have to find the scopes defined by textdomain and analyze each of them. Besides, I guess you may use some other translation approach, e.g. Zend_Translate. In that case we also need to find the declaration of the translator object and all the calls to its translate method.

My solution is to create an iterator class that extends ArrayIterator and to give it a match method, which receives an array of tokens as a parameter. Starting at the current cursor position, match checks whether the given sequence of tokens occurs there.

class Tokenizer_Iterator extends ArrayIterator
{
    /**
     * Cursor position remembered by savePos()
     * @var int
     */
    private $_savedPos = 0;

    /**
     * Save the cursor position to restore later
     */
    public function savePos()
    {
        $this->_savedPos = $this->key();
    }

    /**
     * Restore the saved cursor position
     */
    public function restorePos()
    {
        $this->seek($this->_savedPos);
    }

    /**
     * Check if a chain of tokens matches at the iterator cursor position
     * NOTE: When it matches, the cursor is left just past the end of the chain
     * @param array $array
     * @return boolean
     */
    private function _matchArray(array $array)
    {
        $this->savePos();
        foreach ($array as $fetch) {
            if (!$this->match($fetch[0], isset($fetch[1]) ? $fetch[1] : null)) {
                $this->restorePos();
                return false;
            }
            $this->next();
        }
        // When all expectations match, the cursor stays just past the last
        // token of the chain
        return true;
    }

    /**
     * Check if the token (by code and text) matches at the iterator cursor position
     * @param int|string|array $token - token code, single-character token, or
     *                                  array with a sequence of token code / text pairs
     * @param string $text
     * @return boolean
     */
    public function match($token, $text = null)
    {
        if (empty($token)) {
            throw new Exception('Invalid parameter: token must not be empty');
        }

        if (is_array($token)) {
            return $this->_matchArray($token);
        }

        if (!$this->valid()) {
            // The chain runs past the end of the token array
            return false;
        }

        $current = $this->current();
        if (!is_array($current)) {
            // token_get_all returns single-character tokens such as "(" or ";"
            // as plain strings, so give them the same shape as the others
            $current = array($current, $current);
        }
        list($_token, $_text) = $current;

        if (null === $text) {
            return ($_token === $token);
        } else {
            return ($_token === $token and $_text === $text);
        }
    }
}

So, in the case of gettext, the script finds a textdomain occurrence and collects the texts of all the gettext calls until the next textdomain is encountered.

$it = new Tokenizer_Iterator($tokens);
$terms = array();
// Current domain; update it whenever a textdomain() call is matched
$textdomain = 'messages';
while ($it->valid()) {
    if ($it->match(array(
        array(T_STRING, 'gettext'),
        array('('),
        array(T_CONSTANT_ENCAPSED_STRING),
    ))) {
        // match() leaves the cursor just past the chain, so the string
        // literal is one token back
        list(, $text) = $it[$it->key() - 1];
        $terms[$textdomain][] = $text;
        continue;
    }
    $it->next();
}
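The loop above does not show how the current domain is tracked. As a rough sketch of the textdomain switching described earlier, the following self-contained function scans the tokens directly instead of using Tokenizer_Iterator. The function name collect_gettext_terms and the default domain parameter are assumptions made for illustration. It only recognises calls whose first argument is a plain string literal, and it does not distinguish the gettext function from a method with the same name.

```php
<?php
// Hypothetical helper: groups gettext() literals by the active textdomain().
function collect_gettext_terms($source, $defaultDomain = 'messages')
{
    // Drop whitespace and comments so that call patterns are contiguous.
    $skip = array(T_WHITESPACE, T_COMMENT, T_DOC_COMMENT);
    $tokens = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && in_array($token[0], $skip, true)) {
            continue;
        }
        // Normalise single-character tokens to the array(code, text) shape.
        $tokens[] = is_array($token) ? $token : array($token, $token);
    }

    $terms = array();
    $domain = $defaultDomain;
    $count = count($tokens);
    for ($i = 0; $i + 2 < $count; $i++) {
        list($code, $name) = $tokens[$i];
        if ($code !== T_STRING
            || $tokens[$i + 1][0] !== '('
            || $tokens[$i + 2][0] !== T_CONSTANT_ENCAPSED_STRING
        ) {
            continue;
        }
        // Strip the surrounding quotes; escape sequences are left as written.
        $literal = substr($tokens[$i + 2][1], 1, -1);
        if ($name === 'textdomain') {
            $domain = $literal;            // switch the active domain
        } elseif ($name === 'gettext') {
            $terms[$domain][] = $literal;  // record the term under it
        }
    }
    return $terms;
}

$terms = collect_gettext_terms(
    "<?php\necho gettext('Hello');\ntextdomain('admin');\necho gettext(\"Users\"), gettext ( 'Roles' );\n"
);
print_r($terms);
```

Here 'Hello' ends up under the default domain, while 'Users' and 'Roles' are filed under 'admin'.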

When a translation class is used (e.g. $tr = new Zend_Translate), the object declaration can be found the same way:

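$it = new Tokenizer_Iterator($tokens);
$obj = null;
while ($it->valid()) {
    if ($it->match(array(
        array(T_WHITESPACE),
        array(T_NEW),
        array(T_WHITESPACE),
        array(T_STRING, 'Zend_Translate'),
    ))) {
        // The cursor is now just past the chain; for "$tr = new Zend_Translate"
        // the variable token sits 7 positions back
        list(, $obj) = $it[$it->key() - 7];
        continue;
    }
    $it->next();
}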

Though I prefer to remove all the T_WHITESPACE tokens from the array before passing it to Tokenizer_Iterator, so that the match chains do not have to account for the amount and placement of whitespace. Once the object name is found, we can iterate over the array once again, this time looking for $obj->translate() occurrences.
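To illustrate the two passes on a whitespace-free token array, here is a self-contained sketch that uses two plain helper functions rather than Tokenizer_Iterator. The names significant_tokens and find_chain are hypothetical, and the sample source is made up for the example. The sketch assumes the translator is assigned to a simple variable and that translate is called with a string literal.

```php
<?php
// Hypothetical helper: returns tokens without whitespace, all in array form.
function significant_tokens($source)
{
    $tokens = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_WHITESPACE) {
            continue;
        }
        $tokens[] = is_array($token) ? $token : array($token, $token);
    }
    return $tokens;
}

// Hypothetical helper: finds the index where a chain of array(code[, text])
// expectations starts at or after $from, or returns -1.
function find_chain(array $tokens, array $chain, $from = 0)
{
    $last = count($tokens) - count($chain);
    for ($i = $from; $i <= $last; $i++) {
        foreach ($chain as $offset => $expected) {
            $actual = $tokens[$i + $offset];
            if ($actual[0] !== $expected[0]
                || (isset($expected[1]) && $actual[1] !== $expected[1])
            ) {
                continue 2; // mismatch, try the next start position
            }
        }
        return $i;
    }
    return -1;
}

$source = "<?php\n\$tr = new Zend_Translate(\$options);\n"
        . "echo \$tr->translate('Save'), \$tr -> translate(\"Cancel\");\n";
$tokens = significant_tokens($source);

// Pass 1: locate "$var = new Zend_Translate" and remember the variable name.
$at = find_chain($tokens, array(
    array(T_VARIABLE),
    array('='),
    array(T_NEW),
    array(T_STRING, 'Zend_Translate'),
));
$object = $at >= 0 ? $tokens[$at][1] : null;

// Pass 2: collect the literal arguments of $var->translate('...').
$found = array();
$chain = array(
    array(T_VARIABLE, $object),
    array(T_OBJECT_OPERATOR),
    array(T_STRING, 'translate'),
    array('('),
    array(T_CONSTANT_ENCAPSED_STRING),
);
$pos = 0;
while ($object !== null && ($pos = find_chain($tokens, $chain, $pos)) >= 0) {
    $found[] = $tokens[$pos + 4][1]; // the literal, quotes included
    $pos += count($chain);
}
print_r($found);
```

With whitespace removed, the chain for the declaration no longer needs T_WHITESPACE entries, and the spacing around "->" in the second call makes no difference.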


Figure: Algorithm for extracting translation terms from source code


If you need to search within class or function scopes, traverse the array of tokens before creating the Tokenizer_Iterator and, on encountering an opening brace, keep adding its index to each array element as the id of the scope opener until the matching closing brace is found.
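One possible way to do this annotation is sketched below. The function name annotate_scopes and the 'scope' key are hypothetical. The sketch assumes that scopes are delimited by curly braces and that a token outside any block gets null as its scope. It also treats T_CURLY_OPEN and T_DOLLAR_OPEN_CURLY_BRACES as openers, because interpolations like "{$var}" are closed by an ordinary "}" token.

```php
<?php
// Hypothetical helper: adds to every token the index of the "{" that opened
// its innermost enclosing block (null at file level).
function annotate_scopes(array $tokens)
{
    $stack = array();  // indexes of the currently open braces
    $result = array();
    foreach ($tokens as $index => $token) {
        // Give single-character tokens the same shape as the others.
        $row = is_array($token) ? $token : array($token, $token);
        $code = $row[0];
        if ($code === '}' && $stack) {
            array_pop($stack); // the closing brace belongs to the outer scope
        }
        $row['scope'] = $stack ? end($stack) : null;
        if ($code === '{'
            || $code === T_CURLY_OPEN
            || $code === T_DOLLAR_OPEN_CURLY_BRACES
        ) {
            $stack[] = $index; // tokens that follow are inside this block
        }
        $result[] = $row;
    }
    return $result;
}

$source = "<?php\nclass Foo {\n    function bar() {\n        return 1;\n    }\n}\n";
$annotated = annotate_scopes(token_get_all($source));
```

In this sample, the tokens of the method body carry the index of the method's opening brace, and the method declaration itself carries the index of the class's opening brace.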

When you have an array of all the terms encountered in the source code, you can match it against the terms of the dictionary and produce any report you want.
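As a minimal sketch of that comparison, assume the dictionary is an array keyed by term and the extracted terms are a flat list with the quotes already stripped. The function name coverage_report and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: compares terms found in code with dictionary keys.
function coverage_report(array $usedTerms, array $dictionary)
{
    $used = array_unique($usedTerms);
    $known = array_keys($dictionary);
    return array(
        // Defined in the dictionary but never referenced in the code.
        'unused'  => array_values(array_diff($known, $used)),
        // Referenced in the code but absent from the dictionary.
        'missing' => array_values(array_diff($used, $known)),
    );
}

$report = coverage_report(
    array('Save', 'Cancel', 'Save', 'Delete'),
    array('Save' => 'Speichern', 'Cancel' => 'Abbrechen', 'Help' => 'Hilfe')
);
print_r($report);
```

For this sample data, 'Help' is reported as unused and 'Delete' as missing.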



What about non-PHP source code?

You can use the same technique to parse JS, CSS, Java or other source code files. Just cheat the tokenizer by prepending '<?php ' to the code:

$tokens = token_get_all('<?php ' . file_get_contents('default.css'));
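Here is a small sketch of the same trick applied to a JavaScript fragment. The function name t is a stand-in for whatever translation function the JavaScript code calls, and the sample fragment is invented for the example. Keep in mind that the result is still a stream of PHP tokens: for instance, "#" starts a PHP comment and "$name" is read as a PHP variable, so treat the output as an approximation for other languages.

```php
<?php
// Hypothetical example: "t" stands in for the JavaScript translation function.
$js = "var label = t('Save');\nalert(t(\"Cancel\"));\n";

// Tokenize the foreign code as if it were PHP and drop the whitespace.
$tokens = array();
foreach (token_get_all('<?php ' . $js) as $token) {
    if (is_array($token) && $token[0] === T_WHITESPACE) {
        continue;
    }
    $tokens[] = is_array($token) ? $token : array($token, $token);
}

// Collect the string literals passed to t(...), quotes included.
$terms = array();
$count = count($tokens);
for ($i = 0; $i + 2 < $count; $i++) {
    if ($tokens[$i][0] === T_STRING && $tokens[$i][1] === 't'
        && $tokens[$i + 1][0] === '('
        && $tokens[$i + 2][0] === T_CONSTANT_ENCAPSED_STRING
    ) {
        $terms[] = $tokens[$i + 2][1];
    }
}
print_r($terms);
```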