DVM Troubleshooting

This playbook provides detailed instructions for resolving common DVM problems. Knowing what to do when your system displays certain symptoms can greatly reduce data loss.


First Step

As soon as your DVM goes down, contact DefenseStorm. Depending on our agreement with you, we may already have been notified of the issue and started remediation.

Reboot the DVM only as a last step. While the DVM is down, it keeps trying to send logs. Once the connection is restored, all logs sent over TCP while the DVM was down are delivered to the console, and they display correctly because we use the ingestion timestamp, not the time of connection.

Note for cloud-based services

If your DVM goes down, we still receive logs from cloud-based services such as Office365 and OpenDNS. These logs flow from their cloud directly to our cloud, not through the DVM. If your laptop uses the DefenseStorm Windows Agent (DWA), it also sends data directly to our cloud, bypassing the DVM. Rebooting the DVM causes all logs from the time it went down until the time it is back up to be lost.


DVM Alerts and Troubleshooting Resolutions


Reset DVM Clock

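Run the following commands. The first displays the daily time-update job; the second synchronizes the clock against pool.ntp.org.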

cat /etc/cron.daily/ntpupdate

ntpdate -s -u pool.ntp.org


Disk Almost Full

Symptom: The DVM warns that the disk is almost full.

Cause (possible): Logs are not rotating off the system as expected, or update packages have been downloaded more than once and are taking up double the space.

Mitigation Steps: Manually clean the logs off the system or purge unused packages to free space.

STEP 1: Determine if the disk is full because of a log rotation error

Check /var/log for runaway log file sizes (ls -lh /var/log), then inspect any very large file using tail -n 40 filenamehere.
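To avoid reading a long ls listing by eye, the check in STEP 1 can be sketched as a small helper that prints only the largest files. This is a minimal sketch, not a DefenseStorm tool: the function name and the default count are illustrative, and on the DVM you may need to run it with sudo to read every file under /var/log.

```shell
# List the N largest files under a directory, biggest first.
# Sizes are disk usage in KiB as reported by du -k.
largest_files() {
    local dir="${1:-/var/log}"   # directory to scan (defaults to /var/log)
    local count="${2:-10}"       # number of entries to print
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}

# Example: largest_files /var/log 5
```

Any file that stands out in this list is a candidate for the tail -n 40 inspection described above.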

STEP 2: Determine if the disk is full because of already applied update packages

Run sudo apt-get autoremove --purge. The command first runs an analysis that estimates the reclaimable space, and you can cancel at that point before any permanent removal is triggered.

Space commands

df

display disk usage by device

du -h <pathhere>            

display size of directory, append / to display per file

uname -r

display active DVM kernel version (useful when comparing packages on disk)

Sudo commands

These commands require the DVM administrator to enter the DVM login password.

sudo du -x / | sort -n

get size of all file objects, sorted by size (largest entries last)

sudo apt-get autoremove --purge

remove packages that were installed automatically and are no longer needed, along with their configuration files
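The df output can also be reduced to a single number for a quick before-and-after comparison when cleaning up. The sketch below assumes the standard POSIX df -P layout; the function names and the 90 percent threshold are illustrative values, not DVM settings.

```shell
# Print the percentage of space used on the filesystem that holds a path.
disk_used_pct() {
    # df -P forces the one-line-per-filesystem POSIX format;
    # column 5 is the capacity, for example 37%.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report whether usage has crossed a threshold (90 is an example value).
check_disk() {
    local path="${1:-/}" limit="${2:-90}" used
    used=$(disk_used_pct "$path")
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full"
    else
        echo "OK: $path is ${used}% full"
    fi
}

# Example: check_disk / 90
```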


Syslog Configuration Problems

Symptom: /var/log/syslog-ng.log contains repeated error lines such as: maximum connections reached; rejecting connection. Maximum concurrent connections: 500.

Mitigation Steps: Increase Max Connections

STEP 1: Check the DVM configuration (the SyslogNG section) and inspect the maxconnections values. If a value is lower than the number of machines registered to the DVM, error log spam can overflow the log file size (and possibly fill the disk).

The following lines appear in /etc/praesidio/praesidio.conf:

[SyslogNG] (port 514, 516, 601…..)

Tcp514maxconnections = 100

Tcp516maxconnections = 100

Tcp601maxconnections = 500
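The comparison in STEP 1 can be sketched as a short check that flags every entry below the number of devices sending to the DVM. It assumes the "key = value" layout shown in the lines above; the function name is hypothetical and the device count is a number you supply.

```shell
# Print every Tcp<port>maxconnections entry whose value is below the
# number of devices that send logs to the DVM.
low_maxconnections() {
    local conf="$1" devices="$2"
    awk -F'=' -v n="$devices" '
        tolower($1) ~ /^[ \t]*tcp[0-9]+maxconnections[ \t]*$/ {
            key = $1; val = $2
            gsub(/[ \t]/, "", key); gsub(/[ \t]/, "", val)
            if (val + 0 < n + 0) print key " = " val " (below " n ")"
        }' "$conf"
}

# Example: low_maxconnections /etc/praesidio/praesidio.conf 800
```

Any entry printed here is a candidate for the increase described in the next step.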

STEP 2: Increase the number of connections for the port in use (example: = 1000).

Run sudo vi /etc/praesidio/praesidio.conf

STEP 3: Edit /etc/syslog-ng/conf.d/praesidio.conf

For all config sections that look like:

network(
  …
  tags("tcp514")
  tags("tcp516")
  tags("tcp601")

Where max_connections(100) is present, replace it with max_connections(1000), or some other high number, to reduce the possibility of this occurring again.

STEP 4: Repeat for the other conf sections for ports 516 and 601.
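If you prefer not to edit each section by hand, STEP 3 and STEP 4 can be sketched as one substitution that raises every max_connections value at once. This is a sketch under assumptions: the full layout of the real file is not shown here, so the helper (a hypothetical name) only prints the modified text and leaves the original untouched. Review the result before installing it.

```shell
# Print a copy of a syslog-ng config with every max_connections(N)
# raised to a new value. Nothing is modified in place.
raise_max_connections() {
    local conf="$1" new="${2:-1000}"
    sed -E "s/max_connections\([0-9]+\)/max_connections(${new})/g" "$conf"
}

# Example (the /tmp path is a placeholder):
#   raise_max_connections /etc/syslog-ng/conf.d/praesidio.conf 1000 > /tmp/praesidio.conf.new
#   diff /etc/syslog-ng/conf.d/praesidio.conf /tmp/praesidio.conf.new
```

Writing to a separate file and comparing with diff keeps the change reviewable and leaves a copy of the original configuration to fall back on.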

STEP 5: Log clean up and final check

  • Delete the error.log and error.log.1 files in /var/log. (sudo rm /var/log/error.log /var/log/error.log.1)
  • Reboot the box (sudo reboot)
  • DVM Console --> Get DVM Status
  • DVM Console --> Troubleshooting

STEP 6: Post reboot log check

Bash Shell --> Check contents of error.log in /var/log.

  • Run the command tail -n 50 error.log

STEP 7: Post reboot monitor of syslog

Syslog-ng.conf (syslogng)

  • Open /etc/syslog-ng/conf.d/praesidio.conf and confirm that the max_connections changes are still in place.


Frequent DVM Reboot Alerts 

Symptom: The DVM sends unusually frequent reboot alerts.

Cause: The reboot-required flag is set when an OS package upgrade requests it. The DVM is hardcoded to check for security updates daily. Until all packages have been updated, the alert may continue to display frequently.

Mitigation Steps: Enable the automatic DVM reboot feature. This only reboots the DVM if a security update has been applied in the last day that requires it.

STEP 1: Select Option (11) Configure Automatic Security Updates

Within the DVM’s Main Menu, select option 11 to enable Automatic Security Updates.

STEP 2: Select Automatic Reboot Time

After you enable automatic reboot, set the reboot time.
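To confirm from the bash shell whether a reboot is actually pending, a check along the following lines may help. It assumes the DVM relies on the standard Ubuntu flag file /var/run/reboot-required and its companion .pkgs list; whether the DVM alert reads the same file is an assumption, and the function name is illustrative.

```shell
# Report whether the OS has flagged a pending reboot.
# Assumption: the standard Ubuntu flag file is in use.
reboot_pending() {
    local flag="${1:-/var/run/reboot-required}"
    if [ -f "$flag" ]; then
        echo "reboot required"
        # Ubuntu lists the packages that requested the reboot in <flag>.pkgs
        [ -f "${flag}.pkgs" ] && sort -u "${flag}.pkgs"
        return 0
    fi
    echo "no reboot pending"
    return 1
}

# Example: reboot_pending
```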


DVM down soon after reboot 

Symptom: After being rebooted, the DVM went down again after an hour.

Root cause: The disk filled up with log spam from syslog-ng, caused by too many client connections. For example, if the maximum pool (port 601) is set to 500 and 800 machines are configured to communicate with the DVM, the error spam fills the disk and prevents syslog-ng from starting. As a result, event data never reaches the DefenseStorm platform.

Mitigation Steps: The following actions brought the DVM back to a stable state and should prevent this from recurring in the near future.

STEP 1: Run the command >> df -h

  • /dev/sd1  17 GB / 1.1 GB free

STEP 2: Run the command >> du -h /var/log

  • 16G  /var/log

STEP 3: Run the command >> cd /var/log

STEP 4: Run the command >> tail python_sqs.log (the output was blank)

STEP 5: Run the command >> ls -lh

  • error.log  4.5G
  • error.log.1 (Sep 18)  5.8G
  • syslog  3.#G

STEP 6: Run the command >> tail error.log

  • “Rejecting connection from client: maximum connection attempts reached”
    • IPs listed are local bank IPs.
      • Desktops, Windows servers, various hardware


STEP 7: Run the command >> du -h .

  • 16G  /var/log
  • ….

Syslog-ng.log: maximum connections reached; rejecting connection. Maximum concurrent connections: 500.
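To put a number on how much of a log is rejection spam, a count such as the following can be taken before the files are deleted. This is a minimal sketch: it matches only the phrase "rejecting connection" quoted above, the function name is illustrative, and the exact log line format on your DVM may differ.

```shell
# Summarize how many lines of a log file are connection rejections.
rejection_summary() {
    local log="$1" hits total
    # grep -c prints the number of matching lines; -i ignores case.
    hits=$(grep -ci 'rejecting connection' "$log")
    total=$(wc -l < "$log")
    echo "$hits of $((total)) lines are connection rejections"
}

# Example: sudo bash -c "$(declare -f rejection_summary); rejection_summary /var/log/error.log"
```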

STEP 8: Identify the syslog config problem

Correct the syslog config problem before purging logs and restoring customer asset connectivity to the DVM.

  1. Check the DVM configuration (the SyslogNG section) and inspect the maxconnections values. If a value is lower than the number of machines registered to the DVM, error log spam can overflow the log file size (and possibly fill the disk).

/etc/praesidio/praesidio.conf

[SyslogNG] (port 514, 516, 601…..)

Tcp514maxconnections = 100

Tcp516maxconnections = 100

Tcp601maxconnections = 500

>> sudo vi /etc/praesidio/praesidio.conf

  • Changed entries to 1000 on each port.

Next, edit /etc/syslog-ng/conf.d/praesidio.conf

STEP 9: Mitigation steps for this file:

For all config regions that look like:

network(
…
tags("tcp514")
tags("tcp516")
tags("tcp601")

Example:

network(
port(514)
…
…
max_connections(100)
tags("tcp514")

STEP 10: Change max_connections(100) to max_connections(1000)

STEP 11: Repeat for other conf sections for 516 and 601.

STEP 12: Delete the error.log and error.log.1 files.

Run the following command from /var/log >> sudo rm error.log error.log.1

STEP 13: Reboot the box

Run the following command >>  sudo reboot

STEP 14: DVM Console → Get DVM Status


  • All services up, 37% disk usage, queues are empty

STEP 15: DVM Console → Verification of resolution

  1. Ran connectivity test, all green now
  2. Post-reboot log check:
    1. Bash Shell → Check contents of error.log in /var/log.
    2. Run command >> tail -n 50 error.log
    3. Just NTP errors observed. No issues with IPs right now.


Verify data flow

STEP 1: Open a shell session on the DVM.

STEP 2: From the DVM, run the following command:

sudo tcpdump -vvv -s 4096 -X host 10.10.10.10 and port 514

Here the IP address is the device sending log data, and the port is the port it is sending to, which is typically 514 or 516.
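When checking several devices in a row, it can help to generate the command rather than retype it. The helper below is a sketch with a hypothetical name: it only prints the tcpdump command so you can review it before running it. The -c option, which stops the capture after a fixed number of packets, is an addition to the command in STEP 2, and 20 is an arbitrary default.

```shell
# Print a tcpdump command for one sending device and one syslog port.
# -c stops the capture after a fixed number of packets.
capture_cmd() {
    local ip="$1" port="${2:-514}" packets="${3:-20}"
    case "$port" in
        ''|*[!0-9]*) echo "port must be numeric" >&2; return 1 ;;
    esac
    echo "sudo tcpdump -vvv -s 4096 -X -c $packets host $ip and port $port"
}

# Example: capture_cmd 10.10.10.10 516 5
```

If packets from the device appear in the capture, log data is reaching the DVM on that port.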


Increase partition size

For the purpose of this procedure, we are increasing the primary partition to 22 GB. Always make a snapshot or backup of the current instance first, in case something goes wrong.

STEP 1: Log into the DVM and go to the command line.

STEP 2: Turn off swap

sudo swapoff --all --verbose

STEP 3: Remove swap partition

sudo parted /dev/sda rm 2

STEP 4: Resize root partition

sudo parted /dev/sda resizepart 1 yes 24000

STEP 5: Interactively make new swap partition

praesidio@ubuntu:~$ sudo parted /dev/sda mkpart
Partition type? primary/extended? primary
File system type? [ext2]? linux-swap
Start? 24001
End? 25000

STEP 6: Make swap filesystem

sudo mkswap /dev/sda2

STEP 7:  Turn swap back on

sudo swapon --all --verbose

STEP 8: Resize root filesystem

sudo resize2fs /dev/sda1 22000M

STEP 9: Check that filesystem has grown

df -h

STEP 10: Reboot

sudo reboot

STEP 11: After reboot check that filesystem is still 22GB

df -h

STEP 12: Check that swap is present (you should see 900-odd MB of swap)

free -h
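The numbers in this procedure mix two unit systems, which is why the swap size appears as 900-odd rather than 1000. The arithmetic can be sketched as follows, assuming that parted reads bare numbers as decimal megabytes (10^6 bytes) and that resize2fs and free use binary units (2^20 bytes). Both are assumptions about the tool defaults on your DVM, so treat the result as an estimate.

```shell
# Convert between decimal megabytes (MB) and binary mebibytes (MiB).
mb_to_mib() { echo $(( $1 * 1000000 / 1048576 )); }
mib_to_mb() { echo $(( $1 * 1048576 / 1000000 )); }

# Swap partition from STEP 5: start 24001, end 25000.
swap_mb=$(( 25000 - 24001 ))
echo "swap: ${swap_mb} MB is about $(mb_to_mib "$swap_mb") MiB"

# Root filesystem from STEP 8 (22000M) against the partition end from STEP 4 (24000).
fs_mb=$(mib_to_mb 22000)
echo "filesystem: 22000 MiB is about ${fs_mb} MB; the partition ends at 24000 MB"
```

Under these assumptions the swap partition works out to about 952 MiB, and the 22000M filesystem (about 23068 MB) fits inside the 24000 MB partition.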


High CPU Usage

Symptom: The DVM appears to be running hot (high CPU usage) with a low number of CPUs.

Cause: While copying the syslog file, the DVM got stuck in a bad state.

Mitigation Steps: If it is the syslog-ng service that is stuck, you can restart it without rebooting the whole system. The following steps should return the DVM to a healthy state and normal CPU usage.

STEP 1: Enter the bash shell, and execute the following:

sudo top

This provides a table with all running processes. At least one process should display a high CPU percentage.

STEP 2: Restart the stuck process or DVM

If the syslog-ng service is the stuck process, you can restart it without rebooting the whole system by executing the following command:

sudo /etc/init.d/syslog-ng restart

If syslog-ng is not the process that is stuck, you can always try a reboot of the DVM itself. If that still does not set the DVM to a healthy CPU usage state, then escalate this further to DefenseStorm.
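When you need a snapshot to attach to an escalation rather than the interactive top view, the busiest processes can be captured in one shot. This is a sketch with illustrative function names. It uses the POSIX ps output options, and the "=" after each column name suppresses the header line.

```shell
# Filter: keep the N lines with the highest leading number (here, %CPU).
busiest() {
    sort -rn | head -n "${1:-5}"
}

# One-shot snapshot of the processes using the most CPU.
cpu_snapshot() {
    ps -e -o pcpu= -o pid= -o comm= | busiest "${1:-5}"
}

# Example: cpu_snapshot 5
```

If syslog-ng appears at the top of the snapshot, the restart command from STEP 2 applies.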


Missed Events

Symptom: The console may be missing some event logs.

Cause: Traffic is too high and the compression setting is not correct.

Mitigation Steps: If you believe your DVM may be dropping events, you can follow these mitigation steps to create a fallback for network monitoring.

STEP 1: nload

  1. nload provides a good picture of overall network utilization in real time, displayed per network interface. Useful to determine if a single NIC is saturated.
  2. Install steps:
    1. sudo apt-get update
    2. sudo apt-get upgrade
    3. sudo apt-get install nload

STEP 2: iftop

  1. iftop provides a good picture of overall network utilization, with utilization displayed per connection (pair of hosts). Useful to determine which connections are using bandwidth and to identify unexpected sources.
  2. Install steps:
    1. sudo apt-get install iftop


Usage:

iftop, run by itself, drops into a console monitoring mode. For the DVM, you’ll want to identify the connections that are:

  • Receiving traffic on syslog ports
  • Sending traffic out to an external AWS IP over port 443 (https)

Screenshots of the various values here, along with the measured totals from vnstat, should help us understand the characteristics of the network better.

 

STEP 3: mtr (on Ubuntu: mtr-tiny)

  1. mtr can be used to trace network routes and obtain reporting data. Useful to determine if the path to our SQS server (Amazon US-West) is congested, and if the source of the issue is within the customer network, the customer ISP, or an external network location altogether.
  2. Install steps:
    1. sudo apt-get install mtr-tiny

Additional info (external link to Linode’s website): https://www.linode.com/docs/networking/diagnostics/diagnosing-network-issues-with-mtr

 

Down Disk Queue during Start-up

Sometimes during the initial DVM startup, the screen displays a disk write error (example shown below).


To correct this error, simply reboot the DVM by selecting Option 9 - Reboot from the DVM main menu.