getting the pid from random numbers

php, until this day, uses the mersenne twister algorithm as its default pseudo random number generator. it is not cryptographically secure, as its state can be reconstructed from at most 624 consecutive outputs.

cracking the seed of the marsenne twister has been proven possible not only in php but also in python, which also uses this algorithm as its default random number generator. i want to take a different approach on attacking this prng in php and reverse-engineer how the initial seed for the algorithm is generated.

specifically, i will focus on php 7.0 which took a very interesting approach for generating the seed. it turns out that it is possible to determine the process id of the php process which generated a random number if we know exactly when it's called.

finding the seed of the seed

in order to find out how php generates the seed for the random number generator, we need to look in the php-src repository which holds the implementation of the php built-in functions

/* {{{ proto int mt_rand([int min, int max])
   Returns a random number from Mersenne Twister */
PHP_FUNCTION(mt_rand)
{
  <SNIP>

    if (!BG(mt_rand_is_seeded)) {
        php_mt_srand(GENERATE_SEED());
    }
  <SNIP>

the above code snippet is the implementation of the php function mt_rand. i removed parts of the code to only show the important part - the seed generation. we see that if a seed is needed, which is on every first call of the function in a new process, the result of GENERATE_SEED() function is supplied to the seeding function.

#define GENERATE_SEED() (((zend_long) (time(0) * getpid())) ^ ((zend_long) (1000000.0 * php_combined_lcg())))

it turns out that this function is actually a macro which combines the current time, process id and the current value of php_combined_lcg().

php_compined_lcg

so far we see that the seed for the default random number generator in php 7.0 depends on the current time and pid. this information does not help us much as long as we don’t understand how php_combined_lcg works.

understanding the inner workings of the algorithm itself, i.e. the linear congruential generator, is not needed to break the algorithm - we just have to analyse its implementation.

PHPAPI double php_combined_lcg(void) /* {{{ */
{
    php_int32 q;
    php_int32 z;

    if (!LCG(seeded)) {
        lcg_seed();
    }

    MODMULT(53668, 40014, 12211, 2147483563L, LCG(s1));
    MODMULT(52774, 40692, 3791, 2147483399L, LCG(s2));

    z = LCG(s1) - LCG(s2);
    if (z < 1) {
        z += 2147483562;
    }

    return z * 4.656613e-10;
}

the above code shows how each consecutive value of the lcg is calculated. just as with the implementation of mt_rand, the initial state of the lcg must be initialized using its respective function lcg_seed().

let’s look at its implementation:

static void lcg_seed(void) /* {{{ */
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) == 0) {
        LCG(s1) = tv.tv_sec ^ (tv.tv_usec<<11);
    }

    <SNIP>
    LCG(s2) = (zend_long) getpid();

    <SNIP>
    if (gettimeofday(&tv, NULL) == 0) {
        LCG(s2) ^= (tv.tv_usec<<11);
    }
    <SNIP>
}

well what do we have here - the seeding function for the lcg is again dependent on the time and the current pid.

putting seed generators together

all together we know:

  • php uses GENERATE_SEED() to seed the mt_rand which combines the time, pid and the output of the php_combined_lcg
    • time(0) returns the current unix time in seconds
  • php_combined_lcg is initialized using the current time and the pid
    • gettimeofday and its tv_usec attribute returns the microseconds of the current unix time
    • this function is called twice for ‘more entropy’. the time difference between the calls in my tests is less than 10us.
  • given that each php request is executed in the same thread and shares the pid, we need to only guess it once

in order to reverse-engineer the seed of an mt_rand instance in php we need:

  • the output of the random number generator
  • a somewhat precise time of when the number was generated.

having this information, we can brute-force the possible values for the pid and the microseconds used in lcg_seed and figure out which configuration returns our number

brute-forcing pid

consider the simple php code:

<?php
echo "start: ".microtime().PHP_EOL;
echo "rand: ". mt_rand().PHP_EOL; 
echo "end: ".microtime().PHP_EOL;
?>

with the following output:

start: 0.80655400 1776516932
rand: 1917976159
end: 0.80658100 1776516932

based on what we know about how the mt_rand function in php, the above code yields enough information for us to determine the pid of the process which handled the above code. we can reuse the implementation of the GENERATE_SEED() macro and php_combined_lcg() and parameterize it such that it does not use the true time and pid, but values we provide.

then, we can construct something like this:

long unix_time = 1776516932;

// iterate between the possible microsecond values usec1 to usec2
for (int i = 806554; i <= 806581; i++) {
      // go in through the '10us window' between 
      // the two calls of gettimeofday in php_combined_lcg
      for (int j = i; j <= i+10; j++) {
             // iterate over possible pids
             // 2<<15 in 32bit systems
             // 2<<22 in 64bit systems
             for (int pid = 1; pid <= max_pid ; pid++) {
                   lcg_seed_param(unix_time, i, j, pid);
                   // print possible seeds based on the pid
                   printf("%d %ld\n", pid, GENERATE_SEED(sec, pid, php_combined_lcg()));
             }
      }
}

the above code simply iterates over all combinations of the microseconds and pids which could have been used to generate the seed for the random number generator. you can view the full implementation of the code in my codeberg repository.

finding the seed

my simple bruteforcer generates pairs of possible seeds with their corresponding pid

$ ./lcg 1776516932 806554 806581 100 100 | head
1 1776528905
2 3552980914
3 5329562859
4 7106014212
5 8882597205
<SNIP>

since the time window was very small, our script generated only 40600 possible seeds. we can use the following simple php ‘seed-decoder’ to find the correct seed.

<?php
$filePath = '/seeds.txt'; // list of the seeds
$file = fopen($filePath, 'r');
if ($file) {
    while (($line = fgets($file)) !== false) {
        mt_srand(intval(explode(' ',$line)[1]); // set seed
        $v = mt_rand();
        if ($v == 1917976159) {
            echo "Seed found! -> " . $line . PHP_EOL;
            die();
        }
    }
}
echo "no luck finding the seed";
?>

running the code again takes less than one second and returns the following seed

root@php7.0-test:/$ php brute.php  
Seed found! -> 68 120802817128

where else?

it is a rare occurrence that random numbers are ever reflected as-they-are. in most cases, a generated random number is processed and used for example to shuffle an array or generate some other random string.

here are some examples which use the insecure random function

spl_object_hash

according to the official documentation this function is used for the following purpose:

"This function returns a unique identifier for the object. This id can be used as a hash key for storing objects, or for identifying an object, as long as the object is not destroyed"

the function is implemented as follows:

PHPAPI zend_string *php_spl_object_hash(zval *obj) /* {{{*/
{
  <SNIP>
    if (!BG(mt_rand_is_seeded)) {
        php_mt_srand((uint32_t)GENERATE_SEED()); //initialization
    }

    SPL_G(hash_mask_handle)   = (intptr_t)(php_mt_rand() >> 1);  //here
    SPL_G(hash_mask_handlers) = (intptr_t)(php_mt_rand() >> 1);  //here
    SPL_G(hash_mask_init) = 1;
   
    <SNIP>
    return strpprintf(32, "%016lx%016lx", hash_handle, hash_handlers);
}
/* }}} */

as we see, the code follows the same pattern → we initialize the mt_rand using GENERATE_SEED() and use the function php_mt_rand() to get the random object id hash. the output of the function looks as follows:

php-7.0 -r "echo spl_object_hash((object) array('1' => 'test'));"
000000006cf06ed200000000391dcb4d

we can plug that output and iterate over different possible seeds using mt_srand() to find the correct seed and its corresponding pid.

however, no php application would ever reflect this object id, so we need to look somewhere else for a more realistic function.

array_rand (and shuffle)

a more promising candidate for real life exploitation is the array_rand method. it is used to pick a specified amount of random elements from an array:

array_rand([1,2,3,4,5,6],3) -> [1,4,5]

it does so by randomly skipping or keeping elements from an array

/* {{{ proto mixed array_rand(array input [, int num_req])
   Return key/keys for random entry/entries in the array */
PHP_FUNCTION(array_rand)
{
    <SNIP>
    ZEND_HASH_FOREACH_KEY(Z_ARRVAL_P(input), num_key, string_key) {
        if (!num_req) {
            break;
        }

        randval = php_rand();
        if ((randval / (PHP_RAND_MAX + 1.0)) < num_req / num_avail) {

    <SNIP>

one ‘flaw’ (depending on your usecase) of this function is that it maintains the order of the initial array. if you want to randomize its order you would have to use shuffle. we will focus on array_rand here, but shuffle also uses the mt_rand internally and can be reverted in the same way.

interestingly, many people still use array_rand to generate randomized arrays of specified size. most notably the gregwar captcha library, one of the most popular php captcha image generators.

the source code for generating phrases, i.e. random strings which humans must decipher from the image, looks somewhat like this:

/**
* Generates  random phrase of given length with given charset
*/
public function build($length = null, $charset = null)
{
    <SNIP>
    $phrase = '';
    $chars = str_split('abcdefghijklmnpqrstuvwxyz123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ');

    for ($i = 0; $i < $this->length; $i++) {
        $phrase .= $chars[array_rand($chars)];
    }
    return $phrase;
}

using array_rand to build the phrase allows us to extract the pid from captcha images generated by this library.

getting the pid from captcha images

consider the following captcha image which was generated using the gregwar’s captcha library

i setup the most simple captcha image server running on php7.0 and timed a request which returned the above image

$ date +%s.%N && curl -s localhost:3000 > image.jpg && date +%s.%N
1776607071.104677403
1776607071.245074633

this output gives us enough information to bruteforce the process id:

  • we have the unix time
  • the range of microseconds when the image was generated
  • and we know how the random string is being generated

we can plug the information into our seed generator which now takes significantly longer to generate the possible seeds than our previous experiments in lab conditions

$ time ./lcg 1776607071 104677 245074 20 100 > seeds.txt

real    0m39.491s
user    0m39.176s
sys     0m0.230s

then, using a reconstructed phrase generator with variable seeds, we can find the seed which returns the correct phrase. again, it takes some time, as opposed to our previous examples

root@php7.0-test:/$ time php brute_captcha.php 
Seed found! -> 43 76395027261
real    5m51.244s
user    5m48.816s
sys     0m0.707s

the pid is only 43 because i run it in a docker container. it uses a separate namespace for processes and always starts at 1. in a real-world scenario this number would be much higher.

there is an important caveat to this experiment - it works only if each request is processed in a separate process. if we just start the php server using php -s localhost:3000, each request will be handled in the same process and reuse the already initialized mt_rand.

our technique, as is, would only work for the first request to that server. each subsequent request will share the pid, but the mt_rand function will be initialized already. in order to guess the pid, we would have to know the precise time when the first call of mt_rand was performed and then generate subsequent random values to eventually find our output.

conclusion

in php7.0 the initial seed for the default random number generator, and all functions which use it, is derived from the current time and the pid. if we can time the call of the initialization function, we can brute-force all possible seeds within that time frame. by trying these possible seeds we can eventually find the one which was used to generate our random data and determine its pid.

is the pid value relevant in any way? absolutely not

is this exploit practical in real life? absolutely not

was it fun to reverse-engineer the algorithm? absolutely yes