Consider the following "tweets" (left) and "retweets" (right) tables:

  +----------+-----------------+     +----------+----+
  | tweet_id |  text           |     | tweet_id | rt |
  +----------+-----------------+     +----------+----+
  |  1       | foo {RT|123} bar|     |  1       | 123|
  |  2       | foobar          |     |  3       | 456|
  |  3       | {RT|456} baz    |     |  4       | 789|
  |  4       | bazbar {RT|789} |     +----------+----+
  |  5       | bar baz         |
  +----------+-----------------+

The tweets table contains millions of preprocessed tweets. Some tweets contain a custom label of the form {RT|xx}, where xx is a 17- to 20-digit number. The retweets table is currently empty, but it needs to be filled as demonstrated: tweets.text should be scanned for {RT|xx} labels, and if one is found, the number should be extracted from the label and inserted into the retweets table together with the tweet_id.

To do this, I started off with selecting all tweets that have {RT}-labels:

SELECT * FROM tweets WHERE `text` LIKE '%{RT|%'

A second step would be to loop through the resultset in PHP and filter the number from the label using a regular expression, and then perform an INSERT INTO operation. This, however, would take a lot of time - making me wonder if this would perhaps be faster with a SQL query? And if so, what would the query have to look like? I have never worked with regular expressions in SQL statements before.
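
For comparison, the application-side step described above (loop over the result set and pull the number out with a regular expression) could look roughly like this. It is a minimal sketch in Python rather than PHP; `RT_LABEL`, `extract_retweets`, and the sample rows are hypothetical names and data used only for illustration. It matches `[0-9]+` instead of enforcing 17 to 20 digits so that the short ids from the example tables are found, and it collects every label in a tweet, not only the first.

```python
import re

# Matches the custom label {RT|<digits>} and captures the digits.
RT_LABEL = re.compile(r"\{RT\|([0-9]+)\}")


def extract_retweets(rows):
    """Return (tweet_id, rt) pairs for every {RT|xx} label in rows.

    rows is an iterable of (tweet_id, text) tuples. The id is kept as a
    string because a 20-digit number can exceed a signed 64-bit integer.
    """
    pairs = []
    for tweet_id, text in rows:
        for match in RT_LABEL.finditer(text):
            pairs.append((tweet_id, match.group(1)))
    return pairs


# Sample rows copied from the tweets table in the question.
sample = [
    (1, "foo {RT|123} bar"),
    (2, "foobar"),
    (3, "{RT|456} baz"),
    (4, "bazbar {RT|789}"),
    (5, "bar baz"),
]
pairs = extract_retweets(sample)
```

The resulting pairs could then be written in batches with a multi-row INSERT instead of one statement per tweet.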

3 Answers

If your database is MySQL, you can do it with a single query:

INSERT INTO `retweets` SELECT `tweet_id`, SUBSTR(`text`, LOCATE('{RT|', `text`)+4, LOCATE('}', `text`, LOCATE('{RT|', `text`)) - LOCATE('{RT|', `text`)-4) AS `num` FROM `tweets` HAVING `num` REGEXP '^[0-9]+$';

3 Comments

Thanks, but I don't understand the "+4"? The number in the label could be of any length, for instance a tweet could be "hello {RT|12} world {RT|43653465} this is {RT|13253453534543543}a string with{RT|13}examples". Also notice there might not be a space in front or after the label, but nothing (meaning: label is at beginning of string) or another character ({RT|342}foo)
The +4 is the length of the '{RT|' prefix string, so the substring starts right after the prefix.
Wow, this worked brilliantly...only a minute or two...would have taken hours and hours with my initial PHP-based, cronjob approach :-) thanks!
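
To try the INSERT ... SELECT idea without a MySQL server, here is a rough sketch that runs it against an in-memory SQLite database from Python. SQLite is an assumption made only so the example runs anywhere: it has no LOCATE, and no REGEXP by default, so `instr()` and `GLOB` stand in for them. The `rt` column is declared as TEXT because a 20-digit id does not fit a signed 64-bit integer. Like the query above, it only picks up the first label in each tweet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE retweets (tweet_id INTEGER, rt TEXT);
""")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [
        (1, "foo {RT|123} bar"),
        (2, "foobar"),
        (3, "{RT|456} baz"),
        (4, "bazbar {RT|789}"),
        (5, "bar baz"),
    ],
)

# Innermost query: cut off everything up to and including '{RT|' (the +4).
# Middle query: keep what precedes the first '}' in that tail.
# Outer filter: keep only non-empty, all-digit results.
conn.execute("""
    INSERT INTO retweets (tweet_id, rt)
    SELECT tweet_id, num
    FROM (
        SELECT tweet_id, substr(tail, 1, instr(tail, '}') - 1) AS num
        FROM (
            SELECT tweet_id, substr(text, instr(text, '{RT|') + 4) AS tail
            FROM tweets
            WHERE text LIKE '%{RT|%'
        )
    )
    WHERE num <> '' AND num NOT GLOB '*[^0-9]*'
""")

rows = conn.execute(
    "SELECT tweet_id, rt FROM retweets ORDER BY tweet_id"
).fetchall()
```

Doing the extraction inside the database avoids shipping every matching row to the client and back, which is consistent with the speed-up reported in the comment above.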

Maybe like this (untested):

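SELECT SUBSTR(
    `text`,
    LOCATE('{RT|', `text`) + 4,
    -- SUBSTR takes a length, not an end position, so subtract the start offset
    LOCATE('}', `text`, LOCATE('{RT|', `text`)) - LOCATE('{RT|', `text`) - 4
)
FROM `tweets`
WHERE `text` LIKE '%{RT|%';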

This should work in Oracle:

SELECT tweet_id, REGEXP_SUBSTR(REGEXP_SUBSTR(text, '\{RT\|[0-9]+\}'), '[[:digit:]]+') FROM tweets WHERE text LIKE '%{RT|%'
