Convert chararray user IDs to integers with Pig

In a previous article, Friso explained how to generate monotonically increasing row IDs with MapReduce. If you read the article (the markdown version of which runs to some 1700 words) you may have noticed that the process is not exactly straightforward. Thankfully Pig, from version 0.11, introduced the RANK function, which lets you do the same with only a handful of lines of code.

The basic usage of RANK is as simple as typing:

score = LOAD '/path/to/my/file' AS (object_id:int, score:float);

new_score = RANK score;

RANK returns new_score, which has one more column than score: a unique integer for each row.
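To check what RANK produced, you can inspect the resulting relation. The snippet below is a minimal sketch reusing the placeholder path from above. The extra column is expected to be called rank_score, since Pig names it after the ranked alias.

```pig
score = LOAD '/path/to/my/file' AS (object_id:int, score:float);
new_score = RANK score;

-- Print the schema of the ranked relation; the extra column is
-- expected to be named rank_score (rank_ followed by the input alias)
DESCRIBE new_score;

-- Print the rows themselves (only sensible for small test files)
DUMP new_score;
```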

The usefulness of the RANK function doesn't end here, though. Let's imagine we need to use Mahout's itemsimilarity algorithm, and all we have is a CSV file where every line has the form

user_id, object_id, score

where user_id is a chararray, for instance a hash. Unfortunately, Mahout does not accept such strings as user IDs in this case, only integers.

But blindly applying the RANK function to this file wouldn't cut it. As you may know, itemsimilarity computes the similarity between objects from the interactions of users with multiple objects, so every occurrence of a user must map to the same integer. RANK, however, numbers rows: a user_id that appears on several lines would get a different integer on each of them.
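A small sketch makes the problem concrete. The three input rows in the comments are made up for illustration, and the file is assumed to be comma-delimited (hence PigStorage(',')). The exact numbers Pig assigns may differ, but each row gets its own.

```pig
-- Hypothetical input rows (user_id, object_id, score):
--   alice,10,4.0
--   alice,11,3.5
--   bob,10,5.0
score = LOAD '/path/to/my/file' USING PigStorage(',')
        AS (user_id:chararray, object_id:int, score:float);

-- RANK without BY numbers rows, not users
ranked = RANK score;
-- Possible result: (1,alice,10,4.0), (2,alice,11,3.5), (3,bob,10,5.0)

-- Dropping user_id now leaves alice under two different integers,
-- so her two interactions look like they come from two users
naive = FOREACH ranked GENERATE rank_score, object_id, score;
```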

Let's see how we can solve this problem. The desired output is in the form

integer_user_id, object_id, score

as this is what Mahout loves. The code to accomplish this is

score = LOAD '/path/to/my/file' USING PigStorage(',') AS (user_id:chararray, object_id:int, score:float);

user = FOREACH score GENERATE user_id;
unique_users = DISTINCT user;
new_users = RANK unique_users;

new_score = JOIN score BY user_id, new_users BY user_id;
new_score = FOREACH new_score GENERATE rank_unique_users, object_id, score;

STORE new_score INTO '/path/to/new/file' USING PigStorage(',');

The first line loads the file, putting it into score. Then using

user = FOREACH score GENERATE user_id;
unique_users = DISTINCT user;
new_users = RANK unique_users;

we put the user_id column into user, and all distinct user_ids into unique_users. This is a crucial step, as we want equal users to get equal integer IDs. The last line of the block creates new_users, which contains the additional rank_unique_users column (the name is a Pig convention: rank_ followed by the alias of the ranked relation). After that we create new_score

new_score = JOIN score BY user_id, new_users BY user_id;
new_score = FOREACH new_score GENERATE rank_unique_users, object_id, score;

through a JOIN. The last line keeps only the three columns we need, which discards the user_id columns coming from both sides of the join, as Mahout no longer needs them. Finally, we save our file

STORE new_score INTO '/path/to/new/file' USING PigStorage(',');

which will now have the desired format.
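As a recap, the whole script can be traced on a tiny input. This is only a sketch: the sample rows and the results shown in the comments are made up for illustration, the intermediate alias joined and the output name integer_user_id are introduced here for readability, and the file is assumed to be comma-delimited. Which user receives which integer is not guaranteed. The alias::field prefixes are optional in this case, because the three selected field names are unambiguous after the join, but spelling them out shows where each column comes from.

```pig
-- Hypothetical input rows (user_id, object_id, score):
--   alice,10,4.0
--   alice,11,3.5
--   bob,10,5.0
score = LOAD '/path/to/my/file' USING PigStorage(',')
        AS (user_id:chararray, object_id:int, score:float);

user = FOREACH score GENERATE user_id;   -- alice, alice, bob
unique_users = DISTINCT user;            -- alice, bob
new_users = RANK unique_users;           -- e.g. (1,alice), (2,bob)

joined = JOIN score BY user_id, new_users BY user_id;

-- After a JOIN, fields can be addressed as alias::field
new_score = FOREACH joined GENERATE
    new_users::rank_unique_users AS integer_user_id,
    score::object_id AS object_id,
    score::score AS score;
-- e.g. (1,10,4.0), (1,11,3.5), (2,10,5.0)

STORE new_score INTO '/path/to/new/file' USING PigStorage(',');
```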
