
Troubleshooting Guide

by admin last modified 2007-07-07 12:00
  • Problem (dCache): I get this error from SRM:
    No Route to cell for packet {uoid=<1181077961014:278>;path=[>PinManager@local];msg=Tunnel cell PinManager@local< not found at >dCacheDomain<}
    Solution: The PinManager cell cannot be found, which usually means it has crashed.  The PinManager runs on the head node (currently dcache-head) in the utility domain.  It requires Postgres to be running on the same node.  (Note: Postgres crashes if the head node's disk fills up.)  To restart the two services, run:
    /etc/init.d/postgresql restart
    /opt/d-cache/jobs/utility stop
    /opt/d-cache/jobs/utility start
    Restart Postgres only if it has crashed; you can check whether it is functional by logging in with psql.  Afterwards, run an SRM transfer by hand to make sure everything has recovered.
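    The psql check above can be scripted so that Postgres is restarted only when a trivial query fails.  This is a minimal sketch, not the site's actual procedure: the function name pg_is_up, the "postgres" role and the "template1" database are assumptions, and it presumes local connections do not prompt for a password.

    ```shell
    # Sketch: succeed only if Postgres accepts a connection and answers a query.
    # Assumptions (not from the original guide): role "postgres", database
    # "template1", and no password prompt for local connections.
    PSQL="${PSQL:-psql}"   # client binary; overridable for testing

    pg_is_up() {
        "$PSQL" -U postgres -d template1 -c 'SELECT 1' >/dev/null 2>&1
    }

    # Usage (as root on the head node):
    #   pg_is_up || /etc/init.d/postgresql restart
    #   /opt/d-cache/jobs/utility stop
    #   /opt/d-cache/jobs/utility start
    ```

    Because the restart is guarded by the check, a healthy database is left alone and only the utility domain is cycled.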
  • Problem (UNL): NAT is down.

    Solution: Run the following on dcache-head as root.
    /sbin/iptables -A POSTROUTING -t nat -o eth0 -j MASQUERADE
    /sbin/iptables -A INPUT -j DROP -m state --state NEW,INVALID -i ippp0
    /sbin/iptables -A FORWARD -j DROP -m state --state NEW,INVALID -i ippp0
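    Since -A appends a rule every time it is run, it can help to check whether the MASQUERADE rule is already loaded before adding it again.  The sketch below assumes an iptables version that supports the -S listing option; the helper name has_masquerade is hypothetical.

    ```shell
    # Sketch: detect whether the MASQUERADE rule for eth0 is already present.
    # Reads a rule listing on stdin, in the format printed by
    # `iptables -t nat -S POSTROUTING`.
    has_masquerade() {
        grep -q -e '-o eth0 -j MASQUERADE'
    }

    # Usage (as root on dcache-head):
    #   /sbin/iptables -t nat -S POSTROUTING | has_masquerade ||
    #       /sbin/iptables -A POSTROUTING -t nat -o eth0 -j MASQUERADE
    ```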
  • Problem (UNL): SRMWatch is down.  SRMWatch should be available at this URL (on campus):
    http://dcache-head.unl.edu:8080/srmwatch/
    Solution: Run the following command on dcache-head:
    /root/srmwatch_start.sh
    There's a corresponding srmwatch_stop.sh to stop it.
  • Problem: Root partition of srm.unl.edu filled up.  If the root partition of srm.unl.edu fills up, freeing space is not enough to return the node to health.

    Solution: After freeing space, three additional steps are required:
    • Restart postgres:
      /etc/init.d/postgresql restart
      The database crashes as soon as the partition fills and does not automatically recover.
    • Restart srm:
      /opt/d-cache/bin/dcache-srm stop
      /opt/d-cache/bin/dcache-srm start
      SRM does not recover from the connections it loses when Postgres is restarted.
    • Check SRM for functionality:
      srmcp -debug=true srm://srm.unl.edu:8443/pnfs/unl.edu/data4/testfile.unl.3 file:////dev/null
      You must do this from a node where you have a valid proxy.
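    To confirm that space has actually been freed before restarting the services, the root partition's usage can be read from df.  This is a rough sketch; the function names and the 95 percent default threshold are assumptions, not site policy.

    ```shell
    # Sketch: report how full the root partition is, as an integer percentage.
    root_usage_pct() {
        # df -P prints a header plus one line per filesystem; field 5 is the
        # usage percentage, e.g. "87%".
        df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
    }

    # Succeeds if usage is at or above the threshold in $1
    # (hypothetical default: 95).
    root_is_full() {
        [ "$(root_usage_pct)" -ge "${1:-95}" ]
    }

    # Usage:
    #   root_is_full && echo "root partition still nearly full: $(root_usage_pct)%"
    ```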
  • Problem: Globus submission works but Condor-G submission doesn't.  It is possible to see GSI/SSL errors from the Condor client while the Globus client works.  One cause in the past was that Condor and Globus were using different CA certificate directories, and the Condor one wasn't being updated.
    Solution: Make sure that Condor is using the correct directory for CA certificates.  Examine the value of GSI_DAEMON_TRUSTED_CA_DIR in the condor_config file.  On gpn-husker.unl.edu, it is currently set as:
    GSI_DAEMON_TRUSTED_CA_DIR = /etc/grid-security/certificates
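    The value Condor actually uses can be queried with condor_config_val and compared against the expected directory.  A minimal sketch follows; the helper name ca_dir_matches is hypothetical, and the expected path is simply the gpn-husker.unl.edu setting quoted above.

    ```shell
    # Sketch: compare the CA directory Condor reports with the expected one.
    EXPECTED_CA_DIR=/etc/grid-security/certificates

    ca_dir_matches() {
        # $1: directory reported by Condor; one trailing slash is ignored.
        [ "${1%/}" = "$EXPECTED_CA_DIR" ]
    }

    # Usage:
    #   ca_dir_matches "$(condor_config_val GSI_DAEMON_TRUSTED_CA_DIR)" ||
    #       echo "Condor is using a different CA certificate directory"
    ```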
  • Problem: dCap transfers fail.  The following error is given from the dCap client:
    java.net.ConnectException: Connection timed out
    This message comes from the remote dCache pool, not from the local node.  It means that the remote pool cannot contact the local node.  This usually indicates an incorrect network configuration on the remote pool, so check that it is set up correctly.  Alternatively, try to ping the local node from the remote dCache pool; if the pings go through, the network problem has been resolved.
  • Problem:
