Spark packages from a password-protected repository

At my current client, we use Sonatype Nexus to store our artifacts. The repository is secured with a username/password, both for publishing and for downloading artifacts.

Spark supports custom repositories through the --repositories option.

We use it like this:

pyspark \
  --repositories https://readonly:secret_password@nexus/repository/maven-public/ \
  --packages com.example:foobar:1.0.0

Unfortunately, we ran into the following issue:

    ==== repo-1: tried

      https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom

      -- artifact com.example#foobar;1.0.0!foobar.jar:

      https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

        ::          UNRESOLVED DEPENDENCIES         ::

        ::::::::::::::::::::::::::::::::::::::::::::::

        :: com.example#foobar;1.0.0: not found

        ::::::::::::::::::::::::::::::::::::::::::::::

The strange thing: the URL is correct. With curl we can download the dependency:

curl -s -o /dev/null -v https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom
* Hostname was NOT found in DNS cache
*   Trying 35...
* Connected to foobar.com (35.xxx.xxx.x) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
...
...
< HTTP/1.1 200 OK

Okay, let's debug this by using Ivy directly.

Ivy uses a settings file to configure repositories, so I tried the following ivy.settings:

<ivysettings>
  <settings defaultResolver="nexus"/>
  <property name="nexus-public"
            value="https://nexus/repository/maven-public"/>
  <resolvers>
    <ibiblio name="nexus" m2compatible="true" root="${nexus-public}"/>
  </resolvers>
</ivysettings>
curl -L -o ivy-2.4.0.jar "http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar"
java -jar ivy-2.4.0.jar -settings ivy.settings -dependency com.example foobar 1.0.0 -debug

Here we end up with the same error, so the issue is not Spark-related; it lies with Ivy.

    ==== nexus: tried

      https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom

      -- artifact com.example#foobar;1.0.0!foobar.jar:

      https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

        ::          UNRESOLVED DEPENDENCIES         ::

        ::::::::::::::::::::::::::::::::::::::::::::::

        :: com.example#foobar;1.0.0: not found

        ::::::::::::::::::::::::::::::::::::::::::::::

With the -debug option we find the following:

HTTP response status: 401 url=https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
CLIENT ERROR: Unauthorized url=https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
    nexus: resource not reachable for com/example#foobar;1.0.0: res=https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar

Now that we understand the issue, we can start googling. I found a StackOverflow question describing the same problem.

So let's move the basic-auth credentials from the URL into a credentials block.

<ivysettings>
  <settings defaultResolver="nexus"/>
  <property name="nexus-public"
            value="https://nexus/repository/maven-public"/>
  <credentials host="nexus" realm="Sonatype Nexus Repository Manager"
               username="readonly" passwd="secret_password"/>
  <resolvers>
    <ibiblio name="nexus" m2compatible="true" root="${nexus-public}"/>
  </resolvers>
</ivysettings>

Now everything works like a charm. Time to fix the pyspark command.

pyspark \
  --packages com.example:foobar:1.0.0 \
  --conf spark.jars.ivySettings=/tmp/ivy.settings

Now Spark is able to download the packages as well. I'm a happy camper again. What is left for us to do is to add this to our init script, so that new Dataproc clusters are initialized with this setup.
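As a sketch, such an init action could write the Ivy settings file and make it the default for every Spark job via spark-defaults.conf. The paths, the demo default directory, and the way credentials end up in the file below are assumptions for illustration, not our actual script; on a real cluster you would not want a plain-text password baked into an init script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a Dataproc init action (paths assumed).
# On a real cluster SPARK_CONF_DIR would typically be /etc/spark/conf;
# we default to a temporary demo directory so this is safe to run anywhere.
set -euo pipefail

SPARK_CONF_DIR="${SPARK_CONF_DIR:-/tmp/spark-conf-demo}"
mkdir -p "$SPARK_CONF_DIR"

# Write the settings file from this post. The heredoc delimiter is quoted
# so the shell leaves ${nexus-public} alone for Ivy to expand.
cat > "$SPARK_CONF_DIR/ivy.settings" <<'EOF'
<ivysettings>
  <settings defaultResolver="nexus"/>
  <property name="nexus-public"
            value="https://nexus/repository/maven-public"/>
  <credentials host="nexus" realm="Sonatype Nexus Repository Manager"
               username="readonly" passwd="secret_password"/>
  <resolvers>
    <ibiblio name="nexus" m2compatible="true" root="${nexus-public}"/>
  </resolvers>
</ivysettings>
EOF

# Point every Spark job on the cluster at the settings file by default.
echo "spark.jars.ivySettings $SPARK_CONF_DIR/ivy.settings" \
  >> "$SPARK_CONF_DIR/spark-defaults.conf"
```

With this in place, jobs can use --packages without repeating the --conf flag on every invocation.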
