Difference between revisions of "VScaler: Troubleshooting Kolla issues"

From Define Wiki
(bifrost issues and workarounds)
(Added "Waiting for nova-compute service up timing out due to rabbitmq breaking")
 
 
</syntaxhighlight>
 
</syntaxhighlight>
  
== bifrost-base source image build failure ==
 
 
The current version of kolla in the pip repositories is affected by the following bug:
 
 
 
[https://bugs.launchpad.net/kolla/+bug/1667308]
 
 
 
Symptoms:
 
 
 
<syntaxhighlight>
 
ERROR:kolla.image.build.bifrost-base:The command '/bin/sh -c bash -c './scripts/env-setup.sh && source ./env-vars && ansible-playbook -vvvv -i /bifrost/playbooks/inventory/target /bifrost/playbooks/install.yaml -e @/tmp/build_arg.yml && yum clean all'' returned a non-zero code: 1
 
</syntaxhighlight>
 
 
 
Workaround - Apply the following patch to kolla/docker/bifrost/bifrost-base/Dockerfile.j2:
 
 
 
[https://review.openstack.org/#/c/437974/2/docker/bifrost/bifrost-base/Dockerfile.j2]
 
 
 
<syntaxhighlight>
 
- RUN bash -c './scripts/env-setup.sh && source ./env-vars && \
 
+ RUN bash -c 'sed -e "s/\-\-force\-reinstall //g" -i /bifrost/playbooks/roles/bifrost-{ironic,keystone}-install/tasks/install.yml' \
 
    && bash -c './scripts/env-setup.sh && source ./env-vars && \
 
</syntaxhighlight>
 
 
 
== Missing rabbitmq-server during bifrost container deploy ==
 
<syntaxhighlight>
 
TASK [bifrost-ironic-install : Start rabbitmq-server] **************************
 
task path: /bifrost-base-source/bifrost-4.0.0/playbooks/roles/bifrost-ironic-install/tasks/bootstrap.yml:55
 
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: root
 
<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1511182013.54-192485478560145 `" && echo ansible-tmp-1511182013.54-192485478560145="` echo $HOME/.ansible/tmp/ansible-tmp-1511182013.54-192485478560145 `" ) && sleep 0'
 
<127.0.0.1> PUT /tmp/tmpiFmEMA TO /root/.ansible/tmp/ansible-tmp-1511182013.54-192485478560145/service
 
<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1511182013.54-192485478560145/ /root/.ansible/tmp/ansible-tmp-1511182013.54-192485478560145/service && sleep 0'
 
<127.0.0.1> EXEC /bin/sh -c 'LANG=en_US.UTF-8 https_proxy='"'"''"'"' LC_MESSAGES=en_US.UTF-8 no_proxy='"'"''"'"' LC_ALL=en_US.UTF-8 http_proxy='"'"''"'"' /var/lib/kolla/venv/bin/python /root/.ansible/tmp/ansible-tmp-1511182013.54-192485478560145/service; rm -rf "/root/.ansible/tmp/ansible-tmp-1511182013.54-192485478560145/" > /dev/null 2>&1 && sleep 0'
 
fatal: [127.0.0.1]: FAILED! => {"changed": false, "failed": true, "invocation": {"module_args": {"arguments": "", "enabled": true, "name": "rabbitmq-server", "pattern": null, "runlevel": "default", "sleep": null, "state": "started"}, "module_name": "service"}, "msg": "Error when trying to enable rabbitmq-server: rc=1 Failed to execute operation: No such file or directory\n"}
 
</syntaxhighlight>
 
 
 
Solution:
 
Install rabbitmq-server in the bifrost container
 
<syntaxhighlight>
 
docker exec -it bifrost_deploy bash
 
yum install rabbitmq-server
 
</syntaxhighlight>
 
 
 
==  "Upgrade ironic DB Schema" fails during bifrost deploy ==
 
 
 
"Upgrade ironic DB Schema" fails at
 
"ironic-dbsync upgrade --config-file /etc/ironic/ironic.conf"
 
with a file not found error. This is caused by the ironic-dbsync executable not being installed.
 
Solution - Install ironic packages into the container:
 
<syntaxhighlight>
 
yum install openstack-ironic-common openstack-ironic-api openstack-ironic-conductor
 
</syntaxhighlight>
 
 
 
== PXE/iPXE (pxelinux.0/undionly.kpxe) missing during bifrost deploy ==
 
 
 
Follow the guide at [https://docs.openstack.org/project-install-guide/baremetal/draft/configure-pxe.html] to install and configure PXE/iPXE.
 
 
 
== nginx is missing during bifrost deploy ==
 
 
 
Install nginx:
 
<syntaxhighlight>
 
yum install nginx
 
</syntaxhighlight>
 
 
 
== SELinux related tasks failing due to SELinux being disabled in the bifrost deploy container ==
 
 
 
 
Symptoms:

<syntaxhighlight>

TASK [bifrost-ironic-install : Explicitly allow nginx and IPA port (TCP) on selinux] ***

task path: /bifrost-base-source/bifrost-4.0.0/playbooks/roles/bifrost-ironic-install/tasks/bootstrap.yml:260

fatal: [127.0.0.1]: FAILED! => {"failed": true, "msg": "The conditional check '(ansible_os_family == 'RedHat' or ansible_os_family == 'Suse') and ansible_selinux.status == 'enabled' and ansible_selinux.mode == \"enforcing\"' failed. The error was: error while evaluating conditional ((ansible_os_family == 'RedHat' or ansible_os_family == 'Suse') and ansible_selinux.status == 'enabled' and ansible_selinux.mode == \"enforcing\"): 'bool object' has no attribute 'status'\n\nThe error appears to have been in '/bifrost-base-source/bifrost-4.0.0/playbooks/roles/bifrost-ironic-install/tasks/bootstrap.yml': line 259, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n    iptables -I INPUT -p tcp --dport 6385 -i {{ network_interface }} -j ACCEPT\n- block:\n  ^ here\n"}

</syntaxhighlight>

Solution:

Edit <syntaxhighlight>vi /bifrost-base-source/bifrost-4.0.0/playbooks/roles/bifrost-ironic-install/tasks/bootstrap.yml</syntaxhighlight>
 
Comment out or remove the following block:
 
<syntaxhighlight>
 
- block:
 
    - name: "Explicitly allow nginx and IPA port (TCP) on selinux"
 
      seport:
 
        ports: "{{ file_url_port }},6385"
 
        proto: tcp
 
        setype: http_port_t
 
        state: present
 
 
    - name: "Add proper context on created data for http_boot"
 
      command: semanage fcontext -a -t httpd_sys_content_t "{{ http_boot_folder }}(/.*)?"
 
 
    - name: Copy ironic policy file to temporary directory
 
      copy:
 
        src: ironic_policy.te
 
        dest: /tmp/ironic_policy.te
 
 
    - name: Check ironic policy module
 
      command: checkmodule -M -m -o /tmp/ironic_policy.mod /tmp/ironic_policy.te
 
 
    - name: Package ironic policy module
 
      command: semodule_package -m /tmp/ironic_policy.mod -o /tmp/ironic_policy.pp
 
 
    - name: Include ironic policy module
 
      command: semodule -i /tmp/ironic_policy.pp
 
  
    - name: Enable ironic policy module

      command: semodule -e ironic_policy
 
  when: (ansible_os_family == 'RedHat' or ansible_os_family == 'Suse') and
 
ansible_selinux.status == 'enabled' and ansible_selinux.mode == "enforcing"
 
</syntaxhighlight>
 
  
== Image-build failing during bifrost deploy due to missing diskimage-builder ==
 
 
Symptoms:
 
<syntaxhighlight>
 
TASK [bifrost-create-dib-image : Initiate image build] *************************
 
task path: /bifrost-base-source/bifrost-4.0.0/playbooks/roles/bifrost-create-dib-image/tasks/main.yml:121
 
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: root
 
<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1511265018.75-262833849350496 `" && echo ansible-tmp-1511265018.75-262833849350496="` echo $HOME/.ansible/tmp/ansible-tmp-1511265018.75-262833849350496 `" ) && sleep 0'
 
<127.0.0.1> PUT /tmp/tmprxrYqP TO /root/.ansible/tmp/ansible-tmp-1511265018.75-262833849350496/command
 
<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1511265018.75-262833849350496/ /root/.ansible/tmp/ansible-tmp-1511265018.75-262833849350496/command && sleep 0'
 
<127.0.0.1> EXEC /bin/sh -c 'LANG=C LC_MESSAGES=C no_proxy='"'"''"'"' http_proxy='"'"''"'"' https_proxy='"'"''"'"' LC_ALL=C DIB_INSTALLTYPE_simple_init=repo /var/lib/kolla/venv/bin/python /root/.ansible/tmp/ansible-tmp-1511265018.75-262833849350496/command; rm -rf "/root/.ansible/tmp/ansible-tmp-1511265018.75-262833849350496/" > /dev/null 2>&1 && sleep 0'
 
fatal: [127.0.0.1]: FAILED! => {"changed": false, "cmd": "disk-image-create -o /httpboot/deployment_image.qcow2 -t qcow2 centos vm enable-serial-console simple-init", "failed": true, "invocation": {"module_args": {"_raw_params": "disk-image-create        -o /httpboot/deployment_image.qcow2 -t qcow2          centos vm enable-serial-console simple-init ", "_uses_shell": false, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}, "module_name": "command"}, "msg": "[Errno 2] No such file or directory", "rc": 2}
 
</syntaxhighlight>
 
 
 
Solution:
 
Install diskimage-builder
 
 
<syntaxhighlight>

yum install diskimage-builder

</syntaxhighlight>
  
Note: Make sure <syntaxhighlight>dib_os_element: centos7</syntaxhighlight> is set in <syntaxhighlight>/etc/bifrost/dib.yml</syntaxhighlight> instead of <syntaxhighlight>dib_os_element: centos</syntaxhighlight>.
Otherwise the following failure may occur:
 
 
<syntaxhighlight>

TASK [bifrost-create-dib-image : Initiate image build] *************************

task path: /bifrost-base-source/bifrost-4.0.0/playbooks/roles/bifrost-create-dib-image/tasks/main.yml:121

<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: root

<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1511265599.4-237699719654061 `" && echo ansible-tmp-1511265599.4-237699719654061="` echo $HOME/.ansible/tmp/ansible-tmp-1511265599.4-237699719654061 `" ) && sleep 0'

<127.0.0.1> PUT /tmp/tmpeWn782 TO /root/.ansible/tmp/ansible-tmp-1511265599.4-237699719654061/command

<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1511265599.4-237699719654061/ /root/.ansible/tmp/ansible-tmp-1511265599.4-237699719654061/command && sleep 0'

<127.0.0.1> EXEC /bin/sh -c 'LANG=C LC_MESSAGES=C no_proxy='"'"''"'"' http_proxy='"'"''"'"' https_proxy='"'"''"'"' LC_ALL=C DIB_INSTALLTYPE_simple_init=repo /var/lib/kolla/venv/bin/python /root/.ansible/tmp/ansible-tmp-1511265599.4-237699719654061/command; rm -rf "/root/.ansible/tmp/ansible-tmp-1511265599.4-237699719654061/" > /dev/null 2>&1 && sleep 0'

fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": ["disk-image-create", "-o", "/httpboot/deployment_image.qcow2", "-t", "qcow2", "centos", "vm", "enable-serial-console", "simple-init"], "delta": "0:00:00.740333", "end": "2017-11-21 12:00:00.311265", "failed": true, "invocation": {"module_args": {"_raw_params": "disk-image-create        -o /httpboot/deployment_image.qcow2 -t qcow2          centos vm enable-serial-console simple-init ", "_uses_shell": false, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}, "module_name": "command"}, "rc": 1, "start": "2017-11-21 11:59:59.570932", "stderr": "Traceback (most recent call last):\n  File \"/usr/bin/element-info\", line 10, in <module>\n    sys.exit(main())\n  File \"/usr/lib/python2.7/site-packages/diskimage_builder/element_dependencies.py\", line 337, in main\n    elements = _get_elements(args.elements)\n  File \"/usr/lib/python2.7/site-packages/diskimage_builder/element_dependencies.py\", line 248, in _get_elements\n    return _expand_element_dependencies(elements, all_elements)\n  File \"/usr/lib/python2.7/site-packages/diskimage_builder/element_dependencies.py\", line 148, in _expand_element_dependencies\n    raise MissingElementException(\"Element '%s' not found\" % element)\ndiskimage_builder.element_dependencies.MissingElementException: Element 'centos' not found", "stdout": "diskimage-builder version 2.8.0\nBuilding elements: base  centos vm enable-serial-console simple-init", "stdout_lines": ["diskimage-builder version 2.8.0", "Building elements: base  centos vm enable-serial-console simple-init"], "warnings": []}

</syntaxhighlight>

Latest revision as of 10:36, 14 March 2018

Log location

The logs are on the nodes under: /var/lib/docker/volumes/kolla_logs/_data/

When a service fails, the kolla logs for that service's container usually contain useful information. For example, to check the logs of the nova-conductor service:

[root@head01 ~]# ssh controller01
[root@controller01-enp2s0 ~]# tail /var/lib/docker/volumes/kolla_logs/_data/nova/nova-conductor.log
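
When a log is long, filtering by level can be quicker than reading the tail. The following is a minimal sketch, assuming the usual oslo log layout of "date time pid LEVEL logger ..."; the helper names `error_lines` and `service_errors` are hypothetical and not part of kolla.

```python
from pathlib import Path
from typing import List

# Log root taken from the text above; files are laid out as <service>/<name>.log.
KOLLA_LOG_ROOT = Path("/var/lib/docker/volumes/kolla_logs/_data")


def error_lines(log_text: str, level: str = "ERROR") -> List[str]:
    """Return the log lines whose fourth whitespace-separated field equals `level`."""
    matches = []
    for line in log_text.splitlines():
        fields = line.split()
        # Assumed layout: <date> <time> <pid> <LEVEL> <logger> ...
        if len(fields) >= 4 and fields[3] == level:
            matches.append(line)
    return matches


def service_errors(service: str, name: str) -> List[str]:
    """Read <root>/<service>/<name>.log on the node and return its ERROR lines."""
    path = KOLLA_LOG_ROOT / service / f"{name}.log"
    return error_lines(path.read_text(errors="replace"))
```

Run on the node itself, `service_errors("nova", "nova-conductor")` would read the same file as the tail command above.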

Interface ansible_<if> does not exist

If a message of this sort appears in the kolla-ansible output, it most likely refers to a node whose interface name differs from the one set in the "network_interface" variable in /etc/kolla/globals.yml.

Solution

To get past this issue, add api_interface=ens5 next to the node's name in the inventory file. In some cases tunnel_interface=ens5 is needed as well; the error message indicates whether the tunnel or the API interface is the problem. The inventory file should look like this:

...
gpu01-ens5 tunnel_interface=ens5 api_interface=ens5
...
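
To audit which hosts already carry per-host interface overrides, the inventory can be scanned with a small script. This is a simplified sketch for INI-style inventories only: it ignores `[group:children]` and `[group:vars]` sections and does not handle host ranges or quoted values. The function name `host_overrides` is hypothetical.

```python
from typing import Dict


def host_overrides(inventory_text: str) -> Dict[str, Dict[str, str]]:
    """Map each host in an INI-style inventory to its inline key=value variables."""
    overrides: Dict[str, Dict[str, str]] = {}
    skip_section = False
    for raw in inventory_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("["):
            # [group:children] and [group:vars] sections contain no host lines.
            skip_section = ":" in line
            continue
        if skip_section:
            continue
        host, *pairs = line.split()
        overrides.setdefault(host, {}).update(
            pair.split("=", 1) for pair in pairs if "=" in pair
        )
    return overrides
```

A host that reports the interface error but has an empty entry in the result is a candidate for an api_interface or tunnel_interface override.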

Connection refused errors in nova-conductor logs

During a deploy, the nova service would not come up properly. The nova-conductor logs on the controller node that reported the error contained many errors like this:

ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.0.106:5672 is unreachable: [Errno 111] ECONNREFUSED

Solution

Disable SELinux on the controller nodes and reboot them.
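
To see which broker endpoints are refusing connections, the log can be searched for the message shown above. This is a minimal sketch; the regular expression is based only on that message format, and the function name `unreachable_brokers` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches e.g. "AMQP server on 192.168.0.106:5672 is unreachable".
_UNREACHABLE = re.compile(r"AMQP server on (\S+):(\d+) is unreachable")


def unreachable_brokers(log_text: str) -> List[Tuple[str, int]]:
    """Return the distinct (host, port) pairs reported as unreachable, in first-seen order."""
    seen: List[Tuple[str, int]] = []
    for host, port in _UNREACHABLE.findall(log_text):
        endpoint = (host, int(port))
        if endpoint not in seen:
            seen.append(endpoint)
    return seen
```

If every reported endpoint belongs to one controller, that node is the first place to check.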

Debugging containers that don't start

When a container fails to start, you can reproduce the error by looking up the container name with docker ps -a and then passing -a to docker start, which attaches the container's output to the terminal:

[root@controller01 ~]# docker ps -a
CONTAINER ID        IMAGE                                                     COMMAND             CREATED             STATUS                    PORTS               NAMES
4de70ccf5a4d        10.10.10.1:4000/kolla/centos-binary-glance-api:4.0.3      "kolla_start"       10 hours ago        Exited (1) 10 hours ago                       bootstrap_glance
10f4038a7d77        10.10.10.1:4000/kolla/centos-binary-keystone:4.0.3        "kolla_start"       10 hours ago        Up 10 hours                                   keystone
e96bf1cb3258        10.10.10.1:4000/kolla/centos-binary-rabbitmq:4.0.3        "kolla_start"       10 hours ago        Up 10 hours                                   rabbitmq
b0094b42cb75        10.10.10.1:4000/kolla/centos-binary-mariadb:4.0.3         "kolla_start"       10 hours ago        Up 10 hours                                   mariadb
d49e0b00bf84        10.10.10.1:4000/kolla/centos-binary-memcached:4.0.3       "kolla_start"       10 hours ago        Up 10 hours                                   memcached
1a1599296c59        10.10.10.1:4000/kolla/centos-binary-keepalived:4.0.3      "kolla_start"       10 hours ago        Up 10 hours                                   keepalived
accc84f93171        10.10.10.1:4000/kolla/centos-binary-haproxy:4.0.3         "kolla_start"       10 hours ago        Up 10 hours                                   haproxy
f25d30f403d2        10.10.10.1:4000/kolla/centos-binary-cron:4.0.3            "kolla_start"       10 hours ago        Up 10 hours                                   cron
0be143a36b6d        10.10.10.1:4000/kolla/centos-binary-kolla-toolbox:4.0.3   "kolla_start"       10 hours ago        Up 10 hours                                   kolla_toolbox
2f667f97a160        10.10.10.1:4000/kolla/centos-binary-fluentd:4.0.3         "kolla_start"       10 hours ago        Up 10 hours                                   fluentd
[root@controller01 ~]# docker start -a  bootstrap_glance
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Deleting file /etc/glance/glance-api.conf
INFO:__main__:Coping file from /var/lib/kolla/config_files/glance-api.conf to /etc/glance/glance-api.conf
INFO:__main__:Setting file /etc/glance/glance-api.conf owner to glance:glance
INFO:__main__:Setting file /etc/glance/glance-api.conf permission to 0600
ERROR:__main__:MissingRequiredSource: /var/lib/kolla/config_files/ceph.* file is not found
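
On a node with many containers, the exited ones can be listed directly instead of scanning the table by eye. The sketch below relies on the `--filter status=exited` and `--format` options of `docker ps`. The function names are hypothetical, and the docker CLI is assumed to be on the PATH.

```python
import subprocess
from typing import List, Tuple


def parse_exited(ps_output: str) -> List[Tuple[str, str]]:
    """Split tab-separated "name<TAB>status" lines into (name, status) tuples."""
    rows = []
    for line in ps_output.splitlines():
        if "\t" in line:
            name, status = line.split("\t", 1)
            rows.append((name, status))
    return rows


def exited_containers() -> List[Tuple[str, str]]:
    """Ask docker for all exited containers on this node."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--filter", "status=exited",
         "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_exited(result.stdout)
```

Each returned name can then be passed to docker start -a as in the example above.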

"AttributeError: 'module' object has no attribute 'APIClient'" during kolla-build

If the following error appears during kolla build:

[root@node01-head tools]# kolla-build bifrost-deploy --type source --base centos
INFO:kolla.image.build:Found the docker image folder at /usr/share/kolla/docker
Traceback (most recent call last):
  File "/usr/bin/kolla-build", line 11, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/kolla/cmd/build.py", line 30, in main
    statuses = build.run_build()
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 1110, in run_build
    kolla = KollaWorker(conf)
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 586, in __init__
    self.dc = docker.APIClient(version='auto', **docker_kwargs)
AttributeError: 'module' object has no attribute 'APIClient'

Work around by removing docker-py and installing docker==2.4.

pip uninstall docker-py
pip install docker==2.4

If a later version of docker is installed the following error will be produced instead:

[root@node01-head tools]# kolla-build bifrost-deploy --type source --base centos
Traceback (most recent call last):
  File "/usr/bin/kolla-build", line 7, in <module>
    from kolla.cmd.build import main
  File "/usr/lib/python2.7/site-packages/kolla/cmd/build.py", line 26, in <module>
    from kolla.image import build
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 32, in <module>
    import docker
ImportError: No module named docker

Downgrade docker to work around:

pip install docker==2.4
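
The two tracebacks above correspond to two states of the Python environment: no importable docker module at all, or one that lacks APIClient. A quick check can tell them apart before running kolla-build. This is a sketch; the function name and the returned labels are illustrative only.

```python
import importlib


def docker_sdk_status(module_name: str = "docker") -> str:
    """Classify the state of the Python docker module in the current environment.

    Returns "missing" if the module cannot be imported, "no-apiclient" if it
    imports but lacks APIClient (the AttributeError case above), otherwise "ok".
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "missing"
    return "ok" if hasattr(module, "APIClient") else "no-apiclient"
```

This needs to run with the same interpreter that kolla-build uses (Python 2.7 in the tracebacks above, in which case the type annotations would have to be removed).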

Waiting for nova-compute service up timing out due to rabbitmq breaking

Symptoms:

TASK [nova : Waiting for nova-compute service up] ****************************************************************************************************
FAILED - RETRYING: Waiting for nova-compute service up (20 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (19 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (18 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (17 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (16 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (15 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (14 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (13 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (12 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (11 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (10 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (9 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (8 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (7 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (6 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (5 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (4 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (3 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (2 retries left).
FAILED - RETRYING: Waiting for nova-compute service up (1 retries left).
fatal: [node01 -> node01]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["docker", "exec", "kolla_toolbox", "openstack", "--os-interface", "internal", "--os-auth-url", "http://10.10.11.254:35357", "--os-identity-api-version", "3", "--os-project-domain-name", "default", "--os-tenant-name", "admin", "--os-username", "admin", "--os-password", "uBz6L2iZDlkKIv8kCaOzoFVjGSOHRf6x9gsgGNDp", "--os-user-domain-name", "default", "compute", "service", "list", "-f", "json", "--service", "nova-compute"], "delta": "0:00:02.176113", "end": "2018-03-13 15:59:06.604696", "rc": 0, "start": "2018-03-13 15:59:04.428583", "stderr": "", "stderr_lines": [], "stdout": "[]", "stdout_lines": ["[]"]}
 
 
/var/lib/docker/volumes/kolla_logs/_data/nova/nova-compute.log
2018-03-13 16:58:38.788 7 ERROR oslo.messaging._drivers.impl_rabbit [req-803bcc07-60c9-4129-ab06-40809c99e46d - - - - -] [89b4b687-89f7-42fe-92a1-9085e6fe35a7] AMQP server on 10.10.11.3:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None: error: [Errno 111] ECONNREFUSED

Solution:

RabbitMQ server names must match the node hostnames, and the connection has to go through the correct interface.

For example:

root@headnode:~# cat /etc/hosts
10.10.10.2	node01
10.10.11.2	node01-eth1

Assuming kolla is configured to use the eth1 interface, the hostname used in the inventory file will be "node01-eth1" in this case. The node hostname must match the one in the inventory.

root@node01:~# hostnamectl set-hostname node01-eth1
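
The consistency rule above can be checked before a deploy. The sketch below assumes the standard /etc/hosts layout of "IP name [aliases...]"; the function names are hypothetical, and the hostname is passed in rather than read from the system so the check can be run from the head node.

```python
from typing import Dict, List


def parse_hosts(hosts_text: str) -> Dict[str, str]:
    """Map every name and alias in /etc/hosts-style text to its IP address."""
    mapping: Dict[str, str] = {}
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # strip comments, split on whitespace
        if len(fields) >= 2:
            ip, *names = fields
            for name in names:
                mapping[name] = ip
    return mapping


def check_node(inventory_name: str, node_hostname: str, hosts_text: str) -> List[str]:
    """Return a list of problems; an empty list means the names are consistent."""
    problems = []
    if inventory_name not in parse_hosts(hosts_text):
        problems.append(f"{inventory_name} has no entry in the hosts file")
    if node_hostname != inventory_name:
        problems.append(
            f"node hostname {node_hostname!r} differs from inventory name {inventory_name!r}"
        )
    return problems
```

With the hosts file from the example, `check_node("node01-eth1", "node01", hosts_text)` reports the hostname mismatch that the hostnamectl command above corrects.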