
I'm using Docker to test and develop an ETL data pipeline with Airflow and AWS Glue, following this blog post as a guide to launch the containers: https://towardsdatascience.com/develop-glue-jobs-locally-using-docker-containers-bffc9d95bd1 (Dockerfile on GitHub: https://github.com/jnshubham/aws-glue-local-etl-docker/blob/master/Dockerfile). When I run docker build -t glue:latest . I get the error below. The failing step is this line in the Dockerfile: RUN pip install 'apache-airflow[postgres]'==1.10.10 --constraint https://raw.githubusercontent.com/apache/airflow/1.10.10/requirements/requirements-python3.7.txt. I've googled the first error and tried adding RUN yum install -y python3-devel to the Dockerfile, but I still get the same error. I've also read that it may have to do with the gcc version. On my host it's currently:

Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

docker build -t glue:latest . Error:

    Running setup.py install for psutil: started
    Running setup.py install for psutil: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ndmkn_ag/psutil/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ndmkn_ag/psutil/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-nduz8awp/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.6m/psutil
    gcc -pthread -Wno-unused-result -Wsign-compare -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -D_GNU_SOURCE -fPIC -fwrapv -fPIC -DPSUTIL_POSIX=1 -DPSUTIL_SIZEOF_PID_T=4 -DPSUTIL_VERSION=570 -DPSUTIL_LINUX=1 -DPSUTIL_ETHTOOL_MISSING_TYPES=1 -I/usr/include/python3.6m -c psutil/_psutil_common.c -o build/temp.linux-x86_64-3.6/psutil/_psutil_common.o
    unable to execute 'gcc': No such file or directory
    Traceback (most recent call last):
      File "/usr/lib64/python3.6/distutils/unixccompiler.py", line 127, in _compile
        extra_postargs)
      File "/usr/lib64/python3.6/distutils/ccompiler.py", line 909, in spawn
        spawn(cmd, dry_run=self.dry_run)
      File "/usr/lib64/python3.6/distutils/spawn.py", line 36, in spawn
        _spawn_posix(cmd, search_path, dry_run=dry_run)
      File "/usr/lib64/python3.6/distutils/spawn.py", line 159, in _spawn_posix
        % (cmd, exit_status))
    distutils.errors.DistutilsExecError: command 'gcc' failed with exit status 1
    

My Dockerfile consists of:

FROM centos as glue
# initialize package env variables
ENV MAVEN=https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
ENV SPARK=https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
ENV GLUE=https://github.com/awslabs/aws-glue-libs.git
#install required packages needed for aws glue
RUN yum install -y python3 java-1.8.0-openjdk java-1.8.0-openjdk-devel tar git wget zip

RUN yum install -y python3-devel

RUN ln -s /usr/bin/python3 /usr/bin/python
RUN ln -s /usr/bin/pip3 /usr/bin/pip
RUN mkdir /usr/local/glue
WORKDIR /usr/local/glue
RUN git clone -b glue-1.0 $GLUE
RUN wget $SPARK
RUN wget $MAVEN
RUN tar zxfv apache-maven-3.6.0-bin.tar.gz
RUN tar zxfv spark-2.4.3-bin-hadoop2.8.tgz
RUN rm spark-2.4.3-bin-hadoop2.8.tgz
RUN rm apache-maven-3.6.0-bin.tar.gz
RUN mv $(rpm -q -l java-1.8.0-openjdk-devel | grep "/bin$" | rev | cut -d"/" -f2- |rev) /usr/lib/jvm/jdk
ENV SPARK_HOME /usr/local/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
ENV MAVEN_HOME /usr/local/glue/apache-maven-3.6.0
ENV JAVA_HOME /usr/lib/jvm/jdk
ENV GLUE_HOME /usr/local/glue/aws-glue-libs
ENV PATH $PATH:$MAVEN_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin:$GLUE_HOME/bin
RUN sh aws-glue-libs/bin/glue-setup.sh
#compile dependencies with maven build
RUN sed -i '/mvn -f/a rm /usr/local/glue/aws-glue-libs/jarsv1/netty-*' /usr/local/glue/aws-glue-libs/bin/glue-setup.sh
RUN sed -i '/mvn -f/a rm /usr/local/glue/aws-glue-libs/jarsv1/javax.servlet-3.*' /usr/local/glue/aws-glue-libs/bin/glue-setup.sh
#clean tmp dirs
RUN yum clean all
RUN rm -rf /var/cache/yum

ENV AIRFLOW_HOME /usr/local/airflow

WORKDIR /usr/local/src

COPY requirements.txt ./

RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt && \
    pip install 'apache-airflow[postgres]'==1.10.10 \
    --constraint https://raw.githubusercontent.com/apache/airflow/1.10.10/requirements/requirements-python3.7.txt

RUN mkdir glue_etl_scripts
COPY glue_etl_scripts/log_data.py glue_etl_scripts/log_data.py

RUN mkdir config
COPY config/aws.cfg /config/aws.cfg
COPY config/airflow.cfg $AIRFLOW_HOME/airflow.cfg

RUN mkdir scripts
COPY scripts/entrypoint.sh scripts/entrypoint.sh
COPY scripts/connections.sh scripts/connections.sh

ENTRYPOINT ["scripts/entrypoint.sh"]
CMD ["webserver"]
  • Can you please extract and provide a minimal reproducible example and revise the tags you applied? As a new user, please also take the tour and read How to Ask. Commented Jun 23, 2020 at 5:29
  • You have no gcc in your image. It is needed to compile the C components of psutil. Commented Jun 23, 2020 at 5:38
  • Doing so many RUN commands will result in a bloated image (and slow builds). Consider reducing them. A RUN rm makes no sense: the removed file is saved in the image history anyway. Commented Jun 23, 2020 at 8:55
  • Thanks for the suggestions, I edited my post @Ulrich Eckhardt. Thanks for the hint @Klaus D.; I added a line to install gcc and it works great now (see answer). I also updated my Dockerfile to reduce the number of RUN lines, thanks for the tip @KamilCuk. Commented Jun 23, 2020 at 9:03
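
Following the layer-reduction tip in the comments, the download/extract/cleanup steps can be collapsed into a single layer. This is only a sketch, reusing the exact commands from the Dockerfile in the question; chaining them means the tarballs never get baked into an intermediate layer:

```dockerfile
# chain the download, extract, and cleanup steps into one layer so
# the Spark and Maven tarballs are not preserved in the image history
RUN wget $SPARK && wget $MAVEN && \
    tar zxf apache-maven-3.6.0-bin.tar.gz && \
    tar zxf spark-2.4.3-bin-hadoop2.8.tgz && \
    rm apache-maven-3.6.0-bin.tar.gz spark-2.4.3-bin-hadoop2.8.tgz
```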

2 Answers


Adding the line below to the Dockerfile did the trick.

RUN yum install -y gcc python3-devel
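
Equivalently, gcc can be folded into the package-install step already near the top of the Dockerfile, so that pip can compile psutil's C extension later in the build. A sketch against the Dockerfile in the question (package names assume the same centos base image):

```dockerfile
# install gcc and the Python headers alongside the other build
# prerequisites, so pip can compile psutil during the Airflow install
RUN yum install -y gcc python3-devel \
    python3 java-1.8.0-openjdk java-1.8.0-openjdk-devel \
    tar git wget zip
```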




To resolve the error, I had to run this on openSUSE Leap 15.3:

sudo zypper install -t pattern devel_basis

Which is equivalent to running this on Ubuntu:

sudo apt-get install build-essential

https://stackoverflow.com/a/58680740/3405291
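
Since the same missing-compiler error appears under different package managers, a small shell check can suggest the right install command for the distro at hand. This is only a sketch; the echoed commands mirror the ones above and in the accepted answer, and nothing is actually installed:

```shell
#!/bin/sh
# suggest the compiler-toolchain install command for whichever
# package manager is present; only prints it, does not run it
if command -v yum >/dev/null 2>&1; then
    PKG_CMD="yum install -y gcc python3-devel"
elif command -v zypper >/dev/null 2>&1; then
    PKG_CMD="zypper install -t pattern devel_basis"
elif command -v apt-get >/dev/null 2>&1; then
    PKG_CMD="apt-get install -y build-essential"
else
    PKG_CMD=""
fi
echo "suggested: ${PKG_CMD:-unknown package manager}"
```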

