0

I'm looking for an optimal method for putting unique constraints on strings.

My use case is linking all email messages together by their "Message-ID" and "In-Reply-To" fields on the SMTP feed.

However, because the number of messages can grow into the millions, and we have no plans to delete anything, I need a way to index them very quickly. The problem is that I'm under the impression that strings are inherently slower to index that numbers (If I'm wrong about that, please explain).

My solution so far is to convert the message ID to a sha256 hash, then convert to into an 8 x 32 bit chunk 256 bit number like so:

// Its not actually written in C
struct message_id {
    int32_t id;
    char[255] originalMessageId;

    int32_t p01;
    int32_t p02;
    int32_t p03;
    int32_t p04;
    int32_t p05;
    int32_t p06;
    int32_t p07;
    int32_t p08;
}

And then put a unique constraint on all the nodes.

Now before anyone says anything about the quality of uniqueness of the message ID, I am aware, but this system isn't designed to have high integrity, just high performance.

So my question is this:

Will this suffice, or is there a trick to indexing strings in MySql that I've missed?

EDIT: Adding Model Design.

MessageIdentity.php

/**
 * MessageIdentity
 *
 * @ORM\Table(name="inbox_message_identity",
 *      uniqueConstraints={
 *          @ORM\UniqueConstraint(columns={
 *              "p01", "p02", "p03", "p04",
 *              "p05", "p06", "p07", "p08"
 *          })
 *      })
 * @ORM\Entity(repositoryClass="AppBundle\Entity\Inbox\MessageIdentityRepository")
 */
class MessageIdentity
{
    /**
     * @var integer
     *
     * @ORM\Column(name="id", type="integer")
     * @ORM\Id
     * @ORM\GeneratedValue(strategy="AUTO")
     */
    private $id;

    /**
     * @var string
     *
     * @ORM\Column(name="originalMessageId", type="string", length=255)
     */
    private $originalMessageId;

    /**
     * @var integer
     *
     * @ORM\Column(name="p01", type="integer")
     */
    private $p01;

    /**
     * @var integer
     *
     * @ORM\Column(name="p02", type="integer")
     */
    private $p02;

    /**
     * @var integer
     *
     * @ORM\Column(name="p03", type="integer")
     */
    private $p03;

    /**
     * @var integer
     *
     * @ORM\Column(name="p04", type="integer")
     */
    private $p04;

    /**
     * @var integer
     *
     * @ORM\Column(name="p05", type="integer")
     */
    private $p05;

    /**
     * @var integer
     *
     * @ORM\Column(name="p06", type="integer")
     */
    private $p06;

    /**
     * @var integer
     *
     * @ORM\Column(name="p07", type="integer")
     */
    private $p07;

    /**
     * @var integer
     *
     * @ORM\Column(name="p08", type="integer")
     */
    private $p08;

    /**
     * @param $string
     */
    public function __construct($string)
    {
        parent::__construct();

        $bits = self::createBits($this->originalMessageId = $string);

        $this->p01 = $bits[0];
        $this->p02 = $bits[1];
        $this->p03 = $bits[2];
        $this->p04 = $bits[3];
        $this->p05 = $bits[4];
        $this->p06 = $bits[5];
        $this->p07 = $bits[6];
        $this->p08 = $bits[7];
    }

    public static function createBits($string)
    {
        $hash = hash('sha256', $string);
        $bits = array();


        // Bits are packed in pairs of 16 bit chunks before unpacking as signed 32 bit chunks
        // in order to guarrentee there is no overflow when converting the unsigned hex number into a
        // PHP integer on 32 bit machines.
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 0,  4))) . pack('s', hexdec(substr($hash, 4,  4)))));
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 8,  4))) . pack('s', hexdec(substr($hash, 12, 4)))));
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 16, 4))) . pack('s', hexdec(substr($hash, 20, 4)))));
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 24, 4))) . pack('s', hexdec(substr($hash, 28, 4)))));
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 32, 4))) . pack('s', hexdec(substr($hash, 36, 4)))));
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 40, 4))) . pack('s', hexdec(substr($hash, 44, 4)))));
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 48, 4))) . pack('s', hexdec(substr($hash, 52, 4)))));
        $bits[] = self::pluck(unpack('l', pack('s', hexdec(substr($hash, 56, 4))) . pack('s', hexdec(substr($hash, 60, 4)))));

        return $bits;
    }

    protected static function pluck($array)
    {
        return $array[1];
    }
}

MessageIdentityRepository.php

class MessageIdentityRepository extends \Doctrine\ORM\EntityRepository
{
    public function getExisting($string)
    {
        $bits = MessageIdentity::createBits($string);

        $qb = $this->createQueryBuilder('i');
        $qb
            ->where($qb->expr()->andX(
                $qb->expr()->eq('i.p01', $qb->expr()->literal($bits[0])),
                $qb->expr()->eq('i.p02', $qb->expr()->literal($bits[1])),
                $qb->expr()->eq('i.p03', $qb->expr()->literal($bits[2])),
                $qb->expr()->eq('i.p04', $qb->expr()->literal($bits[3])),
                $qb->expr()->eq('i.p05', $qb->expr()->literal($bits[4])),
                $qb->expr()->eq('i.p06', $qb->expr()->literal($bits[5])),
                $qb->expr()->eq('i.p07', $qb->expr()->literal($bits[6])),
                $qb->expr()->eq('i.p08', $qb->expr()->literal($bits[7]))
            ))
            ->setMaxResults(1)
        ;

        return $qb->getQuery()->getOneOrNullResult();
    }
}

MessageRepository.php

class MessageRepository extends \Doctrine\ORM\EntityRepository
{
    public function getLastWithMessageID(MessageIdentity $messageIdentity)
    {
        $qb = $this->createQueryBuilder('m');
        $qb
            ->where('m.messageIdentity = :identity')
            ->setParameter(':identity', $messageIdentity)
            ->orderBy('m.date', 'DESC')
            ->setMaxResults(1)
        ;
        return $qb->getQuery()->getOneOrNullResult();
    }
}

This is a model built using Doctrine2. The message itself holds a foreign key to the MessageIdentity table.

The MessageIdentity is searched for by reconstructing the bit set, and searching on all of the columns which should perfectly utilize the unique constraint placed on the table.

The message is searched for based on the mapped identity, order by descending date, and only a single row is fetched.

13
  • So you have a table messages or something? Showing the structure could help answering your question. Commented Aug 24, 2015 at 14:06
  • 2
    you will eventually have collisions. you will note that any cryptography method has it's own moderate to extreme degradation of performance. Have you shown that using the smtp ID does not scale? Commented Aug 24, 2015 at 14:06
  • The problem of making strings unique in the database is usually solved the way you went with it - create a hash, store the hash, make it unique. Commented Aug 24, 2015 at 14:08
  • @Drew That is OK, because when matching a new message to another, I will only select the most recent match. Its a usability tool. To prevent accidental duplicates, I will use the message ID, plus a few other fields to "guess" whether it is actually unique. Commented Aug 24, 2015 at 14:11
  • 1
    Sorry if I was misleading you - no, there's no native thingy whatsoever, I was just trying to confirm that what you did is ok. You can store the hash in the binary column using UNHEX() MySQL method so you can cut down on storage requirement (no need for varchars). Commented Aug 24, 2015 at 14:59

1 Answer 1

3

You are solving a non-existent problem.

Sure comparing two strings is a little slower than comparing two INTs. But not enough slower to warrant standing on your head with MD5/SHA1/etc. All the overhead of such would slow things down more than the string compares.

On the other hand, if you plan on having a unique string longer than 767 bytes, something needs to be done. If so, I'll discuss some ways out.

Meanwhile, I will argue that SHA256 is gross overkill. For a 128-bit MD5, "There is one chance in 9 trillion of a false duplicate in a table of 9 trillion strings." For mere "millions", the odds are even more remote.

Another point... BINARY(20) and CHAR(20) COLLATE ..._bin are handled identically. More complex collations take some more effort.

Addenda

Background: Essentially the only index type in MySQL is BTree. That is an ordered list. It is also very efficient for a "point query" (given a key, find the record). For InnoDB, the BTree blocks are 16KB blocks. For a million rows, the BTree will be about 3 levels deep. A trillion -- 6.

SHA256 / UUID / GUID / other digests -- all of these perform terribly when you have a very large number of rows. This is because they are very random, and you are unlikely to have the desired block cached in RAM in the case of a very large table.

Here is a compromise solution, using an artificial table as my example:

CREATE TABLE foo (
    in_reply_to VARCHAR(500) NOT NULL CHARACTER SET ...,
    ...
    KEY(in_reply_to(20))
    PRIMARY KEY (...)
) ENGINE=InnoDB;

SELECT ... FROM foo
    WHERE in_reply_to = '...';

Notice that I do not have a sha256 or other digest.

KEY(in_reply_to(20)) builds an index with the first 20 characters of that field.
UNIQUE(in_reply_to(20)) would do likewise, but also constrain the 20 chars to be unique. This is not what you want.

Here's how my SELECT works:

  1. Look in that key, finding 1 or 2 or maybe several rows matching the first 20 chars.
  2. Verify that the entire in_reply_to matches the search string -- exactly.
  3. Deliver the resultset.

It would be good to pick the '20' to be a good compromise between

  • Long enough to usually be unique
  • Short enough to keep the key's BTree small. (Smaller --> more cacheable --> faster).

You have probably noticed one thing missing -- MySQL is not doing the uniqueness check; your code will have to do that by performing some variant of my SELECT.

A SHA256 has 256 bits, so it needs BINARY(32). Even with 1e25 different documents, there is still a chance of about 1 in 1e25 of having a false duplicate.

Sign up to request clarification or add additional context in comments.

5 Comments

The hashing algorithm isn't the part I'm worried about. That part could take a whole second for the number of times it will get used. As soon as the ID is collected and transformed, only the transformed version matters, I.E. the Sha256 encoded string with a fixed length of 64 chars. The problem will occur when millions of these start being entered in. Uniqueness is not very important in this scenario, it only needs to be very unlikely, so if reducing the hash complexity is better, so be it. But search speed is what I really what to get O(1) times in.
This is why I've set up these partitions, in the hope MySql treats it like a tree lookup. What I'm hoping to figure out with this question, is if MySql performs this kind of partitioning on indexed strings or binary values anyway.
Thanks for the update. This is quite informative. The fortunate thing for me is that I can implement multiple ways of doing this at once and find out in a few months which is best. I will add this example to my implementation. Just to clarify, I need to lose the unique index, add the raw data to a varchar[500] and index with length 20? Thanks again.
Yes, implement multiple approaches. While doing it, see if you can figure out why one works better for your situation. Write a blog about it all. Yes, lose the UNIQUE index, have VARCHAR(something), and use the prefix of 20 (or some value).
SELECT COUNT(DISTINCT LEFT(foo, 15)), COUNT(DISTINCT( LEFT(foo, 20), ... will give you some feel for how big a prefix to use.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.