This playbook provides detailed instructions for common DVM troubleshooting resolutions. Knowing what to do when your system displays certain symptoms could greatly reduce data loss.
As soon as your DVM goes down, contact DefenseStorm. Depending on our agreement with you, we may already have been notified of the issue and started remediation.
Reboot the DVM only as a last step. While the DVM is down, it keeps trying to send logs. Once the connection is restored, all logs sent over TCP while the DVM was down are delivered to the console, and they display correctly because we use the ingestion timestamp, not the time of connection.
Note for cloud-based services
If your DVM goes down, we still receive logs from cloud-based services such as Office365 and OpenDNS. These logs flow from their cloud directly to our cloud, not through the DVM. If your laptop uses the DefenseStorm Windows Agent (DWA), it also sends data directly to our cloud, bypassing the DVM. Rebooting the DVM causes all logs from the time it went down until the time it is back up to be lost.
DVM Alerts and Troubleshooting Resolutions
DVM Disk Queue Malfunctioning
The DVM Disk Queue is the service on the DVM that writes incoming events to disk during a long-term outbound network connection failure, that is, when the DVM cannot send events to the Amazon SQS cloud. This generally happens when the internet connection goes out but the DVM still has local network access.
Initial Troubleshooting Steps
- Run this query: app_name:pvm_stats AND -category:alert
- Look at the "pdiskqueue" field. If it says DOWN, you need to restart the pdiskqueue service.
- Look at the "syslog_ng" field, which is more important than the "pdiskqueue" field. If it says "UP", you are receiving the pvm_stats heartbeat events normally.
- Look at the "Dropped_events" and "Stored_events" counts recorded in the DVM stats messages since the incident. If the stored event volume is close to zero, the DVM appears to be healthy and operating normally at this time.
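The checks above amount to a small decision rule. The sketch below encodes it as a shell function, assuming the three values are copied by hand from a pvm_stats event. The function name, the output wording, and the cutoff of 10 stored events for "close to zero" are illustrative assumptions, not part of the DVM.

```shell
# dvm_triage PDISKQUEUE SYSLOG_NG STORED_EVENTS
# Hypothetical helper: arguments are values read from a pvm_stats event.
# The cutoff of 10 stored events for "close to zero" is an assumption.
dvm_triage() {
    pdq_state=$1
    syslog_state=$2
    stored_count=$3
    if [ "$syslog_state" != "UP" ]; then
        # syslog_ng matters most, so it is checked first.
        echo "syslog_ng not UP: investigate this first"
    elif [ "$pdq_state" = "DOWN" ]; then
        echo "restart pdiskqueue"
    elif [ "$stored_count" -le 10 ]; then
        echo "healthy"
    else
        echo "events still stored: keep monitoring"
    fi
}
```

For example, `dvm_triage DOWN UP 0` prints `restart pdiskqueue`.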
What needs to be done?
To fully resolve this issue, restart the "pdiskqueue" service and restart the DVM by following the instructions listed below.
- From the DVM main menu, select option 10 (Bash shell).
- Run these commands:
- sudo /etc/init.d/pdiskqueue restart
- exit (returns to the DVM menu)
- Select option 8 (Get DVM Status) from the main menu.
- Check whether this action re-enabled the disk queue.
Reset DVM Clock
Run the following commands:
cat /etc/cron.daily/ntpupdate
ntpdate -s -u pool.ntp.org
Disk Almost Full
Symptom: The DVM warns that the disk is almost full.
Cause (possible): Logs are not rotating off the system as expected, or update packages have been downloaded more than once and take up double the space.
Mitigation Steps: Manually clean the logs off the system or remove installers for already applied update packages.
STEP 1: Determine if the disk is full because of log rotation error
In the Bash Shell (DVM menu, option 10), check /var/log for runaway log file size using the shell command below.
|ls -lhrS /var/log||List files in /var/log, sorted by size ascending (largest at end).|
Inspect any very large files at the end of the list by running the following command in the shell:
|tail -n 40 filenamehere||Print the last 40 lines of the file.|
Note any errors for discussion with DefenseStorm support, and delete these log files to reclaim space.
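As an alternative to scanning the full listing, a size filter can narrow the output to likely culprits. This is a minimal sketch using `find`. The function name and the 100M default are assumptions, so pick a cutoff that suits your disk.

```shell
# large_logs [DIR] [SIZE]
# Lists regular files under DIR that are larger than SIZE (find syntax,
# for example 100M or 500k), sorted largest first by ls.
# Both defaults are assumptions, not DVM settings.
large_logs() {
    log_dir=${1:-/var/log}
    min_size=${2:-100M}
    find "$log_dir" -type f -size +"$min_size" -exec ls -lhS {} +
}
```

Run `large_logs` with no arguments to check /var/log. Some files there may only be readable by root.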
STEP 2: Determine if the disk is full because of already applied update packages
Run the sudo apt-get autoremove --purge command. This first runs an analysis estimating the reclaimable space, and can be cancelled at this point before any permanent removal is triggered.
|df||display disk usage by device|
|du -h directorypath||display size of directory, append / to display per file|
|uname -r||display active DVM kernel version (useful when comparing packages on disk)|
The following commands require the DVM administrator to enter the DVM login password.
|sudo du -x / | sort -n||get size of all file objects, then sort by size (largest last)|
|sudo du -hx | sort -h||(optionally, in human-readable format)|
|sudo apt-get autoremove --purge||remove all unused / already installed packages|
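To confirm that a cleanup actually freed space, the usage percentage can be read from `df` before and after. The sketch below parses the POSIX-format output of `df -P`. The function names and the 90 percent default limit are assumptions for illustration.

```shell
# pct_used: reads `df -P` output on stdin and prints the Capacity column
# of the first filesystem line as a bare number.
pct_used() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_alert [MOUNTPOINT] [LIMIT]: prints a warning when usage is at or
# above LIMIT percent. The default of 90 is an assumption.
disk_alert() {
    mount_point=${1:-/}
    limit=${2:-90}
    used=$(df -P "$mount_point" | pct_used)
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $mount_point is ${used}% full"
    else
        echo "OK: $mount_point is ${used}% full"
    fi
}
```

Calling `disk_alert / 90` before and after the cleanup shows whether the disk dropped back below the limit.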
DVM Hung Upon Boot
Applicable Versions: DVM 1.1.5 and below (Ubuntu LTS 12.04, 14.04), VMware Fusion 5 and 6.
Symptom: Error - Host SMBus controller not enabled
Cause (possible): VMware does not expose that level of interface for CPU access, but Ubuntu tries to load the kernel module anyway.
Mitigation Steps: These steps work for VMware Fusion 5 and 6 with Ubuntu LTS 12.04 and 14.04.
- Reboot the DVM and watch for the GRUB splash screen.
- Press ESC at the GRUB prompt.
- Press 'e' to edit.
- Highlight the line that begins "ubuntu ......... or kernel (recovery mode)". If there are multiple versions with recovery mode, select the topmost one marked as recovery mode.
- Press 'e'.
- Highlight the "Kernel...." line and press 'e'.
- Replace "ro single" with "rw init=/bin/bash".
- Press Enter.
- Press 'b' to boot the system.
- You are now in a bash shell.
- Open this file: vi /etc/modprobe.d/blacklist.conf
- Add the following lines to the bottom of the file:
- blacklist i2c-piix4
- blacklist piix4_smbus
- blacklist intel_rapl
- Reboot the DVM again
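If editing in vi from the recovery shell is awkward, the three blacklist lines can also be appended from the command line. The helper below is a sketch, not a DVM tool, and its name is an assumption. It skips any line that is already present, so running it twice does not create duplicates.

```shell
# add_blacklist FILE MODULE...
# Appends "blacklist MODULE" to FILE unless that exact line already exists.
add_blacklist() {
    bl_file=$1
    shift
    for bl_module in "$@"; do
        grep -qxF "blacklist $bl_module" "$bl_file" 2>/dev/null ||
            echo "blacklist $bl_module" >> "$bl_file"
    done
}
```

Usage for this procedure would be `add_blacklist /etc/modprobe.d/blacklist.conf i2c-piix4 piix4_smbus intel_rapl`.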
Syslog Configuration Problems
Symptom: /var/log/syslog-ng.log is flooded with error lines such as: maximum connections reached; rejecting connection. Maximum concurrent connections: 500.
Mitigation Steps: Increase Max Connections
STEP 1: Check DVM configuration
From the DVM menu, open a Bash Shell (option 10), then run the following command to view the DVM configuration file.
|On DVM 1.2.0+: nano /etc/praesidio/praesidio.conf||Uses nano (DVM 1.2.0+) to view the main configuration file.|
|On DVM 1.1.5 or below: vi /etc/praesidio/praesidio.conf||Uses vi (DVM 1.1.5 or below) to view the main configuration file.|
Navigate to the SyslogNG section and inspect the maxconnections values. If a value is lower than the number of machines registered to the DVM, error log spam can bloat the log file (and possibly fill the disk).
/etc/praesidio/praesidio.conf: Relevant section and default values
Tcp514maxconnections = 100
Tcp516maxconnections = 100
Tcp601maxconnections = 500
Tcp1602maxconnections = 500
STEP 2: Increase the number of connections for the port in use (example: 1000)
Change the port connection values as needed in the editor to accommodate your logging host count; save the file and exit back to the shell once complete.
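The same change can be scripted instead of made in an editor. The sketch below assumes the key spelling shown in the default values listed earlier (for example `Tcp601maxconnections = 500`); check the real file first, since the case may differ. The function name is hypothetical. It keeps a `.bak` copy and must be run by a user with write access to the file.

```shell
# set_maxconn FILE PORT VALUE
# Rewrites the "Tcp<PORT>maxconnections = N" line in FILE to use VALUE.
# A backup is kept as FILE.bak; sed reads the backup and writes the result
# over the original file, so its permissions are unchanged.
set_maxconn() {
    conf_file=$1
    conf_port=$2
    conf_value=$3
    cp -p "$conf_file" "$conf_file.bak" &&
        sed "s/^\(Tcp${conf_port}maxconnections\)[[:space:]]*=.*/\1 = ${conf_value}/" \
            "$conf_file.bak" > "$conf_file"
}
```

For example, `set_maxconn /etc/praesidio/praesidio.conf 601 1000` raises the port 601 limit and leaves the other ports untouched.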
STEP 3: Run "sudo /usr/local/bin/pConfig --syslog" to reconfigure the DVM to use the new settings
STEP 4: Run "sudo service syslog-ng restart" to restart syslog-ng
STEP 5: Log clean up and final check
- Delete the error.log and error.log.1 files in /var/log (sudo rm /var/log/error.log /var/log/error.log.1).
- Reboot the box (sudo reboot).
- DVM Console --> Get DVM Status
- DVM Console --> Troubleshooting
STEP 6: Post-reboot log check
Bash Shell --> Check contents of error.log in /var/log.
- Run the command tail -n 50 /var/log/error.log
STEP 7: Post-reboot monitoring of syslog
- Open the file /etc/syslog-ng/conf.d/praesidio.conf and confirm that it reflects the new connection limits.
Frequent DVM Reboot Alerts
Symptom: The DVM sends unusually frequent reboot alerts.
Cause: The reboot-required flag is set when an OS package upgrade requests it. The DVM is hardcoded to check for security updates daily. Until all packages have been updated, the alert may continue to appear frequently.
Mitigation Steps: Enable the automatic DVM reboot feature. This only reboots the DVM if a security update has been applied in the last day that requires it.
STEP 1: Select Option (11) Configure Automatic Security Updates
Within the DVM’s Main Menu, select option 11 to enable Automatic Security Updates.
STEP 2: Select Automatic Reboot Time
After you enable automatic reboot, set the reboot time.
DVM down soon after reboot
Symptom: After being rebooted, the DVM went down again after an hour.
Root cause: The disk filled with log spam from syslog-ng caused by too many client connections. For example, if the maximum pool (port 601) is set to 500 and 800 machines are configured to communicate with the DVM, the error spam fills the disk and prevents syslog-ng from starting, so event data never reaches the DefenseStorm platform.
Mitigation Steps: The following actions brought the DVM back to a stable state and should prevent a recurrence in the near future.
STEP 1: Run the command >> df -h
- /dev/sd1: 17 GB / 1.1 GB free
STEP 2: Run the command >> du -h /var/log
STEP 3: Run the command >> cd /var/log
STEP 4: Run the command >> tail python_sqs.log (output was blank)
STEP 5: Run the command >> ls -lh
- error.log.1 (Sep 18): 5.8G
STEP 6: Run the command >> tail error.log
- “Rejecting connection from client: maximum connection attempts reached”
- IPs listed are local bank IPs.
- Desktops, Windows servers, various hardware
STEP 7: Run the command >> du -h .
- syslog-ng.log: maximum connections reached; rejecting connection. Maximum concurrent connections: 500.
STEP 8: Identify the syslog config problem
Correct the syslog config problem before purging logs and restoring customer asset connectivity to the DVM.
- Check the DVM configuration (the SyslogNG section) and inspect the maxconnections values. If a value is lower than the number of machines registered to the DVM, error log spam can bloat the log file (and possibly fill the disk).
[SyslogNG] (port 514, 516, 601…..)
Tcp514maxconnections = 100
Tcp516maxconnections = 100
Tcp601maxconnections = 500
>> sudo vi /etc/praesidio/praesidio.conf
- Changed entries to 1000 on each port.
STEP 9: Mitigation steps for this file:
For all config regions that look like:
network( … tags("tcp514")
network( … tags("tcp516")
network( … tags("tcp601")
Example:
network(
port(514)
…
…
max_connections(100)
tags("tcp514")
STEP 10: Change the max_connections value to: max_connections(1000)
STEP 11: Repeat for other conf sections for 516 and 601
STEP 12: Delete the error.log and error.log.1 files.
Run the following command >> sudo rm error.log error.log.1
STEP 13: Reboot the box
Run the following command >> sudo reboot
STEP 14: DVM Console → Get DVM Status
- All services up, 37% disk usage, queues are empty
STEP 15: DVM Console → Verification of resolution
- Ran connectivity test, all green now
- Post-reboot log check:
- Bash Shell → Check contents of error.log in /var/log.
- Run command >> tail -n 50 error.log
- Just NTP errors observed. No issues with IPs right now.
Verify data flow
STEP 1: Open a shell session on the DVM.
STEP 2: From the DVM, run the following command:
sudo tcpdump -vvv -s 4096 -X host 10.10.10.10 and port 514
Here the IP address is that of the device sending log data, and the port is the one it sends to, typically 514 or 516.
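When checking several log sources in a row, it helps to build the capture command from a host and port rather than retype the filter `host <IP> and port <port>` each time. The helper below is a hypothetical sketch that only prints the command so it can be reviewed before running. The optional third argument maps to tcpdump's `-c` flag, which stops the capture after that many packets.

```shell
# capture_cmd HOST PORT [COUNT]
# Prints (does not run) the tcpdump command for one log source.
# COUNT is optional and maps to tcpdump's -c flag.
capture_cmd() {
    cap_host=$1
    cap_port=$2
    cap_count=${3:-}
    case $cap_port in
        ''|*[!0-9]*) echo "port must be numeric" >&2; return 1 ;;
    esac
    if [ -n "$cap_count" ]; then
        echo "sudo tcpdump -vvv -s 4096 -X -c $cap_count host $cap_host and port $cap_port"
    else
        echo "sudo tcpdump -vvv -s 4096 -X host $cap_host and port $cap_port"
    fi
}
```

For example, `capture_cmd 10.10.10.10 516 20` prints a command that captures 20 packets from that device on port 516.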
Increase partition size
For the purpose of this procedure, we are increasing the primary partition to 22 GB. Always make a snapshot or backup of the current instance first, in case something goes wrong.
STEP 1: Log into the DVM and go to the command line.
STEP 2: Turn off swap
sudo swapoff --all --verbose
STEP 3: Remove swap partition
sudo parted /dev/sda rm 2
STEP 4: Resize root partition
sudo parted /dev/sda resizepart 1 yes 24000
STEP 5: Interactively make new swap partition
praesidio@ubuntu:~$ sudo parted /dev/sda mkpart
Partition type? primary/extended? primary
File system type? [ext2]? linux-swap
Start? 24001
End? 25000
STEP 6: Make swap filesystem
sudo mkswap /dev/sda2
STEP 7: Turn swap back on
sudo swapon --all --verbose
STEP 8: Resize root filesystem
sudo resize2fs /dev/sda1 22000M
STEP 9: Check that filesystem has grown
STEP 10: Reboot
STEP 11: After reboot check that filesystem is still 22GB
STEP 12: Check swap is present (you should see 900 odd M for swap)
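Steps 9 to 12 do not name specific commands. One way to run these checks is sketched below, assuming a standard Linux `df` and `/proc/meminfo`. The function names are illustrative, not DVM tools.

```shell
# root_fs_size: prints the size of the root filesystem as reported by df.
root_fs_size() {
    df -hP / | awk 'NR == 2 { print $2 }'
}

# swap_total_mb [MEMINFO_FILE]
# Prints total swap in whole MB, read from /proc/meminfo by default.
swap_total_mb() {
    awk '/^SwapTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

After the resize, `root_fs_size` should report roughly 22G, and `swap_total_mb` should report a value in the 900s.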
High CPU Usage
Symptom: The DVM appears to be running hot (high CPU usage) with a low number of CPUs.
Cause: While copying the syslog file, the DVM got stuck in a bad state.
Mitigation Steps: If the syslog-ng service is what is stuck, you can restart it without rebooting the whole system. The following steps should return the DVM to a healthy state and normal CPU usage.
STEP 1: Enter the bash shell, and execute the following:
This provides a table with all running processes. At least one process should display a high CPU percentage.
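The command for this step does not appear above. `top` is the usual interactive tool that matches the description. For a non-interactive view, the sketch below filters `ps` output down to processes above a CPU cutoff. The function name and the default cutoff of 50 percent are assumptions. Note that `ps` reports CPU use averaged over each process's lifetime, so its numbers can differ from the live view in `top`.

```shell
# hot_processes [THRESHOLD]
# Reads `ps -eo pcpu,pid,comm` output on stdin and prints "PID COMMAND CPU%"
# for each process at or above THRESHOLD percent CPU.
# The default threshold of 50 is an assumption.
hot_processes() {
    awk -v limit="${1:-50}" 'NR > 1 && $1 + 0 >= limit + 0 { print $2, $3, $1 }'
}
```

Usage: `ps -eo pcpu,pid,comm | hot_processes 50`.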
STEP 2: Restart the stuck process or DVM
If the syslog-ng service is the stuck process, you can restart it without rebooting the whole system by executing the following command:
sudo service syslog-ng restart
If syslog-ng is not the process that is stuck, you can always try a reboot of the DVM itself. If that still does not set the DVM to a healthy CPU usage state, then escalate this further to DefenseStorm.
Missed Events or Event Lag
Symptom: The console may be missing events, or events may arrive late.
Cause: Traffic is too high, the compression setting is incorrect, or the Windows Agent profile needs to be updated.
Mitigation Steps: If you believe your DVM may be dropping events, you can follow these mitigation steps to create a fallback for network monitoring.
STEP 1: nload
- nload provides a good picture of overall network utilization in real time, displayed per network interface. It is useful for determining whether a single NIC is saturated.
- Install steps:
- sudo apt-get update
- sudo apt-get upgrade
- sudo apt-get install nload
STEP 2: iftop
- iftop provides a good picture of overall network utilization, displayed per connection (host pair). It is useful for determining which connections are using bandwidth and for identifying unexpected sources.
- Install steps:
- sudo apt-get install iftop
iftop, run by itself, drops into a console monitoring mode. For the DVM, you'll want to identify the connections that are:
- Receiving traffic on syslog ports
- Sending traffic out to an external AWS IP over port 443 (https)
Screenshots of the various values here, along with the measured totals from vnstat, should help us understand the characteristics of the network better.
STEP 3: mtr (on Ubuntu: mtr-tiny)
- mtr can be used to trace network routes and obtain reporting data. It is useful for determining whether the path to our SQS server (Amazon US-West) is congested, and whether the source of the issue is within the customer network, the customer ISP, or an external network location altogether.
- Install steps:
- sudo apt-get install mtr-tiny
Additional info (external link to Linode’s website): https://www.linode.com/docs/networking/diagnostics/diagnosing-network-issues-with-mtr
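Before installing anything, it can be worth checking which of these tools are already present, so that only the missing packages are installed. This is a small sketch using the shell built-in `command -v`; the function name is an assumption.

```shell
# missing_tools NAME...
# Prints each NAME that is not found on PATH, one per line.
missing_tools() {
    for tool_name in "$@"; do
        command -v "$tool_name" >/dev/null 2>&1 || echo "$tool_name"
    done
}
```

For example, `missing_tools nload iftop mtr` prints nothing when all three are already installed.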
Down Disk Queue during Start-up
Sometimes during the initial DVM startup, the screen displays a disk write error.
To correct this error, reboot the DVM by selecting Option 9 (Reboot) from the DVM main menu.