Hadoop and LDAP, as seen through Venetian blinds

My wife recently asked me to mount new Venetian blinds in the kids' bathroom. I thought that I'd be done in five minutes, but two hours later I had yet to drill a single hole.

It is not lack of experience. We have exactly the same type of blinds in four other rooms and they were all mounted by me. What happened?

As I was approaching the window to see where to drill the holes, I couldn't help but notice that the blinds were approximately 4 to 5 mm wider than the opening. That meant cutting each slat individually (there are about 40 slats).

If you have any familiarity with manual work or with software development, you know that anything can be done quickly as long as there is no custom work involved. Once you need modifications on top of a library, or slightly modified Venetian blinds, the problems begin. And five minutes ain't nearly enough!

I have a saw to cut iron, but I didn't have anything for the slats. I turned to my neighbour who was very helpful and gave me a good saw.

That meant, however, that all the slats had to be removed individually. If you are unfamiliar with how Venetian blinds are made, this image should help you:

[Image: Venetian blinds]

Basically all slats are kept together by three small ropes. The three ropes have to be untied. The slats can then be removed, cut, and put back in place. The ropes then have to be threaded back in.

This takes a long time. But I wanted to get it done, so I went into the garden with my son, and I started.

As I was working, one of the neighbours came over to ask if I knew why water was coming out of the wall of one of the nearby houses [1]. Of course I didn't know, but we went and looked through the windows: water was dripping from the first to the ground floor, and apparently overflowing to the outer wall as well.

After 15 minutes the police arrived [2] to determine whether the firefighters had to be called. Five minutes later the firefighters arrived. They got into the house and found a weed plantation whose irrigation system had broken down, flooding the whole house.

The whole neighbourhood was in the street, watching what was happening. The firefighters were kind enough to host all my kids on their truck. Then the real police, with a real car, arrived, and they were also kind enough to host all my kids in their car.

Meanwhile, only half of my slats were cut, and the other half was waiting.

Long story short, because of this distraction, I needed 3 hours just to cut and re-assemble the slats.

Mounting them after resizing took five minutes.

What is most peculiar, however, is that I had the same type of experience at work that day, trying to connect Hadoop and LDAP.

When Kerberos is not used, Hadoop relies on simple security, i.e. it believes you are who you say you are.

However, to determine which groups a user belongs to, Hadoop by default performs a lookup on the NameNode. If the user does not exist on the NameNode, the lookup will turn up empty, and there is no way to know which groups someone belongs to.
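To make that concrete, here is a rough Python analogue of such a local lookup. It is only a sketch under the assumption that the default mapping boils down to asking the operating system of the NameNode host; it is not Hadoop's own code, and local_groups is a name I made up.

```python
import grp
import os
import pwd


def local_groups(username):
    """Return the OS groups of `username`, or [] if the account is unknown."""
    try:
        account = pwd.getpwnam(username)
    except KeyError:
        # The user does not exist on this host: the lookup turns up empty.
        return []
    names = []
    for gid in os.getgrouplist(username, account.pw_gid):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A gid without a named group entry; skip it.
            continue
    return names


# A user that lives only in LDAP looks exactly like this to the host.
print(local_groups("user_known_only_to_ldap"))
```

Run on a machine where the account is missing, the function has nothing to return, which is the situation LdapGroupsMapping is meant to fix.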

Enter LdapGroupsMapping

To fix this, Hadoop provides LdapGroupsMapping to look up a user inside an LDAP directory. I thought I would give it a crack. How hard can it be?

If you clicked on the previous link, you probably thought that the piece of documentation isn't enough to really get started. Luckily we have search engines these days, so I stumbled upon a page by Hortonworks "explaining" how to do it.

I quote "explaining" as you don't get much wiser if you look at it (there are also a couple of typos). Let's start with the easy part:

  • hadoop.security.group.mapping should be org.apache.hadoop.security.LdapGroupsMapping. Nothing to change here;
  • hadoop.security.group.mapping.ldap.bind.user should be the user that has read access to the LDAP directory, usually the administrator. In my case it was cn=Administrator,cn=users,dc=some,dc=domain,dc=com. You have to configure this to your situation though;
  • hadoop.security.group.mapping.ldap.bind.password: no comment here;
  • hadoop.security.group.mapping.ldap.url needs to be the LDAP address. The form is ldap://address:port. If your LDAP is listening on the standard port 389, you can omit it. If your LDAP is behind SSL you need to use the ldaps protocol (it then assumes port 636) and configure some extra SSL properties;
  • hadoop.security.group.mapping.ldap.base is almost straightforward as well. It is the common part that all users of your LDAP will have. In my example it could be dc=some,dc=domain,dc=com so that all users under dc=other,dc=domain,dc=com will not be found.
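The two rules in the last bullets (default ports and the search base) can be sketched in a few lines of Python. The helper names effective_port and under_base are hypothetical, and the base check is a naive suffix comparison, not a real DN parser.

```python
from urllib.parse import urlsplit

# Default ports as described above: 389 for ldap, 636 for ldaps.
DEFAULT_PORTS = {"ldap": 389, "ldaps": 636}


def effective_port(url):
    """Port an LDAP URL resolves to when the port may be omitted."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"not an LDAP URL: {url!r}")
    return parts.port or DEFAULT_PORTS[parts.scheme]


def under_base(dn, base):
    """True if `dn` sits at or below `base` (naive, case-insensitive check)."""
    dn, base = dn.lower(), base.lower()
    return dn == base or dn.endswith("," + base)


BASE = "dc=some,dc=domain,dc=com"
print(effective_port("ldap://url"), effective_port("ldaps://url"))
print(under_base("cn=glanzani,cn=users,dc=some,dc=domain,dc=com", BASE))
print(under_base("cn=someone,cn=users,dc=other,dc=domain,dc=com", BASE))
```

The last call shows why a user under dc=other,dc=domain,dc=com stays invisible with this base.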

Now comes the most challenging part, namely:

  1. hadoop.security.group.mapping.ldap.search.filter.user;
  2. hadoop.security.group.mapping.ldap.search.filter.group;
  3. hadoop.security.group.mapping.ldap.search.attr.member;
  4. hadoop.security.group.mapping.ldap.search.attr.group.name.

To find out what to fill in, we need ldapsearch, a tool available in most Linux distributions. [3]

Once it is available, you can query your LDAP like so:

export PW=...
export ADMIN="cn=Administrator,cn=users,dc=some,dc=domain,dc=com"
export URL="ldap://url"
export BASE="dc=some,dc=domain,dc=com"

ldapsearch -b "$BASE" -H "$URL" -D "$ADMIN" -w "$PW" -x \
           "your query here"
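If you prefer to drive the same command from a script, a small Python sketch could assemble it as an argument list, which sidesteps shell quoting of passwords and filters. The function name ldapsearch_argv is made up; the flags are the ones used above.

```python
import shlex


def ldapsearch_argv(base, url, bind_dn, password, query):
    """Assemble the ldapsearch invocation shown above as an argv list."""
    return [
        "ldapsearch",
        "-b", base,      # search base
        "-H", url,       # LDAP URL
        "-D", bind_dn,   # user to bind as
        "-w", password,  # bind password
        "-x",            # simple authentication
        query,
    ]


argv = ldapsearch_argv(
    base="dc=some,dc=domain,dc=com",
    url="ldap://url",
    bind_dn="cn=Administrator,cn=users,dc=some,dc=domain,dc=com",
    password="insert_password_here",
    query="(objectCategory=group)",
)
# Print a copy-pasteable shell command; the list itself can be handed
# to subprocess.run(argv) on a machine that has ldapsearch installed.
print(shlex.join(argv))
```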

filter.user

To determine the filter.user property, you need to find the query that returns users.

You need to look into your LDAP to see how users are defined (as opposed to groups). In my case I was using Amazon AD, so I just looked up how this is defined. I found that using (&(objectCategory=user)(sAMAccountName=glanzani)) was returning my user.

If you got that far, then you can use the following value for the hadoop.security.group.mapping.ldap.search.filter.user property:

(&(objectCategory=user)(sAMAccountName=*{0}*))

This instructs Hadoop to run the equivalent of this command to look up users:

ldapsearch -b "$BASE" -H "$URL" -D "$ADMIN" -w "$PW" -x \
           "(&(objectCategory=user)(sAMAccountName=*{0}*))"

where {0} is the name of the user Hadoop is looking up. The wildcards are very important here: if we are searching for the spark user, for example, the query will return all users whose name contains the string spark. I will show later why this is extremely important.
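A tiny Python sketch of what the placeholder and the wildcards do. The account list is invented for illustration, and the substring match ignores case as an assumption about how the directory compares sAMAccountName values.

```python
USER_FILTER = "(&(objectCategory=user)(sAMAccountName=*{0}*))"


def render_filter(template, username):
    """Fill the {0} placeholder with the user being looked up."""
    return template.replace("{0}", username)


def matching_accounts(username, accounts):
    """Accounts that *username* would match: any name containing it."""
    needle = username.lower()
    return [name for name in accounts if needle in name.lower()]


# Invented directory content, for illustration only.
ACCOUNTS = ["glanzani", "spark_user", "hdfs_user"]

print(render_filter(USER_FILTER, "spark"))
print(matching_accounts("spark", ACCOUNTS))
```

Without the wildcards the filter would only match an account called exactly spark, and nothing in this list qualifies.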

filter.group

In a similar fashion, you need to know how Hadoop (or ldapsearch) can find groups. In the case of Amazon this is (objectCategory=group) [4]:

ldapsearch -b "$BASE" -H "$URL" -D "$ADMIN" -w "$PW" -x \
           "(objectCategory=group)"

What you will get back is fundamental for the next two steps. In my case this was, for each group, an entry that looks like this [5]:

# spark, Users, some.website.com
dn: CN=spark,CN=Users,DC=some,DC=website,DC=com
objectClass: top
objectClass: group
cn: spark  # NOTE `cn` HERE
[...]
member: CN=glanzani,CN=Users,DC=some,DC=website,DC=com  # NOTE `member` HERE AND ON THE NEXT LINE
member: CN=spark_user,CN=Users,DC=some,DC=website,DC=com

We can see that the group name is given by cn (i.e. spark), and its members are in the various member attributes.
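If you have many groups, eyeballing the output gets old quickly. Here is a minimal Python sketch that pulls the attributes out of one such entry. It only handles plain "attr: value" lines (no folded lines, no base64 values), and parse_entry is a name I made up.

```python
def parse_entry(text):
    """Parse 'attr: value' lines into {attr: [values]}, skipping comments."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ": " not in line:
            continue
        attr, value = line.split(": ", 1)
        entry.setdefault(attr, []).append(value)
    return entry


# The spark group from above, without the inline notes.
SAMPLE = """\
# spark, Users, some.website.com
dn: CN=spark,CN=Users,DC=some,DC=website,DC=com
objectClass: top
objectClass: group
cn: spark
member: CN=glanzani,CN=Users,DC=some,DC=website,DC=com
member: CN=spark_user,CN=Users,DC=some,DC=website,DC=com
"""

entry = parse_entry(SAMPLE)
print(entry["cn"])      # the group name
print(entry["member"])  # one DN per member
```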

attr.member and attr.group.name

If you got the previous step down, the next is super easy:

  • hadoop.security.group.mapping.ldap.search.attr.member should be member and
  • hadoop.security.group.mapping.ldap.search.attr.group.name should be cn.
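To see why these two attributes matter, here is how I picture the group query being put together once the user has been found. This is my reading of how the pieces fit, not a copy of Hadoop's code, and the helper name is hypothetical. A real implementation would also escape special characters in the DN.

```python
GROUP_FILTER = "(objectCategory=group)"  # filter.group
MEMBER_ATTR = "member"                   # attr.member
GROUP_NAME_ATTR = "cn"                   # attr.group.name, read from each hit


def group_membership_filter(group_filter, member_attr, user_dn):
    """Combine filter.group and attr.member into 'groups listing user_dn'."""
    return f"(&{group_filter}({member_attr}={user_dn}))"


user_dn = "CN=glanzani,CN=Users,DC=some,DC=website,DC=com"
print(group_membership_filter(GROUP_FILTER, MEMBER_ATTR, user_dn))
```

Every entry returned by such a query is a group the user belongs to, and its cn attribute is the name Hadoop reports.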

Putting it all together

At this point you can stitch everything together in /etc/hadoop/conf/core-site.xml (note that the & in the filter.user property has to be escaped for XML: it is now &amp;)

  <property>
    <name>hadoop.security.group.mapping</name>
    <value>org.apache.hadoop.security.LdapGroupsMapping</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.bind.user</name>
    <value>cn=Administrator,cn=users,dc=some,dc=domain,dc=com</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.bind.password</name>
    <value>insert_password_here</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.url</name>
    <value>ldap://url</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.base</name>
    <value>dc=some,dc=domain,dc=com</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.search.filter.user</name>
    <value>(&amp;(objectCategory=user)(sAMAccountName=*{0}*))</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
    <value>(objectCategory=group)</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.search.attr.member</name>
    <value>member</value>
  </property>

  <property>
    <name>hadoop.security.group.mapping.ldap.search.attr.group.name</name>
    <value>cn</value>
  </property>
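The &amp; escaping is easy to forget when editing by hand. As a sketch, you could let Python render the property blocks and do the escaping for you. The property_xml helper is hypothetical; escape and ElementTree come from the standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

NAME = "hadoop.security.group.mapping.ldap.search.filter.user"
USER_FILTER = "(&(objectCategory=user)(sAMAccountName=*{0}*))"


def property_xml(name, value):
    """Render one core-site.xml <property> block, escaping &, < and >."""
    return (
        "  <property>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <value>{escape(value)}</value>\n"
        "  </property>"
    )


snippet = property_xml(NAME, USER_FILTER)
print(snippet)

# Parsing the snippet back should give the original, unescaped filter.
roundtrip = ET.fromstring(snippet.strip()).find("value").text
print(roundtrip == USER_FILTER)
```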

Restarting HDFS should be the final step:

$ sudo service hadoop-hdfs-namenode stop
$ sudo service hadoop-hdfs-namenode start
$ hdfs groups glanzani
glanzani : spark
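If you want to check a handful of users in a script, the output format above is easy to pick apart. A minimal sketch, assuming the "user : group1 group2" shape shown here and a made-up function name:

```python
def parse_hdfs_groups(line):
    """Split 'user : group1 group2' into (user, [groups])."""
    user, _, groups = line.partition(":")
    return user.strip(), groups.split()


print(parse_hdfs_groups("glanzani : spark"))
```

A user without any groups yields an empty list, which is the case you want to catch after a misconfigured filter.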

What about non-interactive users?

This section would probably contain far fewer workarounds if I knew more about LDAP.

But I'm a data scientist at heart and I want to get things done.

If you have ever dealt with Hadoop, you know that there are a bunch of non-interactive users, i.e. users who are not supposed to log in, such as hdfs, spark, hadoop, etc. These users are important to have. However, the groups with the same name are also important to have. For example, when using airflow and launching a spark job, the log folders will be created under the airflow user, in the spark group.

LDAP, however, doesn't allow you, to my knowledge [6], to have a user and a group with the same name, as Unix does.

The way I solved it was to create, in LDAP, the spark_user (or hdfs_user or ...) to work around this limitation. In fact, using the wildcards specified above to match a Hadoop user to an LDAP user, the flow looks like this:

# we ask: to which groups does the spark user (in Unix land) belong?
hdfs groups spark
# hadoop performs the equivalent of the following `ldapsearch` query
ldapsearch -b "$BASE" -H "$URL" -D "$ADMIN" -w "$PW" -x \
           "(&(objectCategory=user)(sAMAccountName=*spark*))"

Here LDAP matches the spark_user, which belongs to the spark group. It doesn't care that I've asked about the spark user. At this point Hadoop creates a query to look up which groups spark_user belongs to. It will return, in my case, the spark group.
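The whole flow fits in a small in-memory simulation. The directory content below mirrors the spark group entry shown earlier, and everything else (names, the choice to use only the first matching user) is an assumption of this sketch, not a statement about Hadoop internals.

```python
# sAMAccountName -> DN, mirroring the members of the spark group above
USERS = {
    "glanzani": "CN=glanzani,CN=Users,DC=some,DC=website,DC=com",
    "spark_user": "CN=spark_user,CN=Users,DC=some,DC=website,DC=com",
}
# group cn -> list of member DNs
GROUPS = {
    "spark": [USERS["glanzani"], USERS["spark_user"]],
}


def groups_for(hadoop_user):
    """Wildcard-match a user, then collect the groups listing its DN."""
    needle = hadoop_user.lower()
    # Step 1: the *{0}* filter matches any account containing the name.
    matches = [dn for name, dn in USERS.items() if needle in name.lower()]
    if not matches:
        return []
    user_dn = matches[0]  # assumption: only the first match is used
    # Step 2: find the groups whose member attribute lists that DN.
    return [cn for cn, members in GROUPS.items() if user_dn in members]


print(groups_for("spark"))     # resolved through spark_user
print(groups_for("glanzani"))
print(groups_for("hive"))      # no hive_user in this toy directory
```

One thing the sketch makes visible: the wildcard matches any account containing the name, so it is worth keeping the service account names distinctive.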

Great!

Quickly creating users and groups

Creating users and groups with LDAP can be a pain. A quicker way to do so is to use adtool. Create a ~/.adtool.cfg with the following content:

## Do NOT surround values with either " or '
uri ldap://url
binddn cn=Administrator,cn=users,dc=some,dc=domain,dc=com
bindpw insert_password_here
searchbase dc=some,dc=domain,dc=com

At this point you can do the following (add more users if you want)

export BASE="dc=some,dc=domain,dc=com"
# for each service: create the group, create the matching *_user, link them
for service in hdfs spark hadoop hive; do
    adtool groupcreate "$service" "$BASE"
    adtool usercreate "${service}_user" "$BASE"
    adtool groupadduser "$service" "${service}_user"
done
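Before letting a loop like this loose on a real directory, I find it reassuring to print the commands first. A dry-run sketch in Python, using only the adtool subcommands from the loop above (the function name is made up):

```python
import shlex

BASE = "dc=some,dc=domain,dc=com"
SERVICES = ["hdfs", "spark", "hadoop", "hive"]


def adtool_commands(services, base):
    """List the adtool invocations for each service as argv lists."""
    commands = []
    for service in services:
        user = f"{service}_user"
        commands.append(["adtool", "groupcreate", service, base])
        commands.append(["adtool", "usercreate", user, base])
        commands.append(["adtool", "groupadduser", service, user])
    return commands


# Dry run: show what would be executed, one command per line.
for argv in adtool_commands(SERVICES, BASE):
    print(shlex.join(argv))
```

Once the list looks right, each argv can be passed to subprocess.run on a machine where adtool is configured.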

Wrapping things up

Reading it in a blog post takes maybe three minutes. After you know the parameters of your AD you need five minutes to implement it. But if you don't know all of the above, it can take you a day, just like mounting my Venetian blinds.

If you have any feedback on how to improve the non-interactive users part, I'd love to hear it. You can find me on Twitter.


  1. The house was built one year ago, but the owner is not living there yet. 

  2. In typical Dutch style, they arrived with an electrical scooter. 

  3. Not the Mozilla variant. 

  4. No ampersand here. 

  5. This is the example for the spark group. 

  6. Which, again, is very limited and I couldn't get Google to tell me much more. 
