System Monitoring and Troubleshooting

Learn why system monitoring and troubleshooting is a fundamental component of an IT team’s responsibilities.

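What is system monitoring and troubleshooting?

System monitoring and troubleshooting is a fundamental component of an IT team’s responsibilities. While compliance frameworks such as NIST and ITIL can provide monitoring guidance, these standards often leave a lot of room for interpretation, and implementing a monitoring strategy can be daunting. The following sections provide an overview of the who, what, where, when, and how of monitoring your IT environment.

Types of data to monitor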

One way to think about monitoring your environment is to consider data in three categories.

First is log data, which can be defined as any data written to a log file, whether it is in a common structured format or simple text. Log data provides a detailed record of the transactions occurring across your IT environment. Second is asset data, which refers to any data taken directly from the asset.

This can range from basic resource metrics like CPU and memory to information about the processes and applications running on a given IT asset. Asset data can be particularly useful when monitoring for events that wouldn’t normally be captured in standard log files. Last is network data, which refers to data specific to network performance, including bandwidth, network connection details, and routing behavior.

While monitoring all three of these data types is foundational to mature security operations, system monitoring typically focuses on the analysis of log data and asset data specifically.

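Systems to monitor

There are many systems you can monitor, and which ones you choose ultimately depends on your environment. Options may include: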

Servers: Server monitoring covers a wide range of systems, including servers hosting applications, Active Directory domain controllers, file shares, and email servers. Whether it’s a Windows, Linux, or Mac machine, most servers will offer some degree of event logging.

Databases: Many databases offer different logging levels to help administrators debug errors and identify issues that are on the horizon. Typical events logged from databases can include slow queries and SQL timeouts, row limits, memory limits, and cache issues.

Applications: Applications include both third-party applications you’ve purchased and applications that have been developed in-house. Some third-party applications will write logs to their host, which can then be collected.

Applications developed by your internal team should also be built to log important events that can be captured. Consider whether these applications are customer-facing or employee-facing. While application performance monitoring is important regardless of the application’s audience, customer-facing applications and services may require more detailed logging.

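Cloud services: Cloud services, particularly infrastructure-as-a-service solutions like AWS and Azure, are instrumental to a system monitoring plan. These services may offer log viewing functionality within the service itself, but you can also collect and store logs outside of these services. Collecting and storing all of your logs in a single location can make it easier to find information later.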

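Containers: Containerization is becoming a popular approach to architecting and hosting both applications and infrastructure thanks to services like Docker. As infrastructure becomes more compartmentalized, more ephemeral, and more dependent on code than on physical machines, container security can play a role in system health.

Employee workstations: When software or processes on an employee’s machine are in conflict or perhaps flooding your network with packets, being able to see what’s running on an employee’s workstation is necessary. Being able to do this remotely is important, because tracking down physical assets is time-consuming and often infeasible.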

Events and metrics to monitor

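Errors: Logging application and system errors is an easy choice, and the keyword “error” often serves as a good starting place for IT investigations. Some systems categorize errors by type, which can provide an indication of which events need attention.

CRUD events: In general, capturing when information is created, read, updated, or deleted can be useful for debugging issues later, particularly in applications. While these events often won’t provide a direct indication of a problem, they can be excellent sources of information when tracing an issue back to its root cause.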

Transactions: “Transactions” typically refers to important events like purchases, subscriptions, cancellations, and submissions. Transactions should be closely monitored for failures and incomplete transactions. Depending on the system, error codes may get logged with important information on what caused the transaction issue. Some systems, such as Microsoft SQL Server, provide a dedicated Transaction Log for capturing this information in one log. In other systems, you may need to centralize this information yourself.

Access requests and permission changes: Logging from a service like Active Directory can offer an important view of user behavior in your environment. Monitoring and collecting data on things like permission changes can help you prevent users from getting unintended admin rights. This type of monitoring is often necessary to meet certain compliance standards.

System metrics: System metrics like CPU, memory, and disk utilization should be closely monitored at all times to prevent system failure. Dramatic changes in these values could indicate an outage or an impending outage. Collecting these metrics over longer periods of time can also help with capacity planning in the future.
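Two of the checks described above, searching logs for the keyword “error” and watching for dramatic changes in a system metric, can be sketched in a few lines of Python. This is a minimal illustration only: the log layout (timestamp, source, message), the sample data, the function names, and the 30-point jump threshold are all assumptions made for the example, not part of any particular product.

```python
from collections import Counter


def find_errors(lines):
    """Return the log lines that mention 'error', ignoring case."""
    return [line for line in lines if "error" in line.lower()]


def count_by_source(error_lines):
    """Count error lines per source.

    Assumes a hypothetical 'timestamp source message' layout.
    """
    counts = Counter()
    for line in error_lines:
        parts = line.split(maxsplit=2)
        source = parts[1] if len(parts) > 1 else "unknown"
        counts[source] += 1
    return counts


def flag_spikes(samples, max_jump=30.0):
    """Return indexes where a metric moved more than max_jump since the previous sample."""
    return [
        i
        for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) > max_jump
    ]


# Made-up sample data for illustration.
sample_log = [
    "2024-01-01T10:00:00 db ERROR SQL timeout after 30s",
    "2024-01-01T10:00:05 web INFO request served",
    "2024-01-01T10:00:09 db Error: slow query",
    "2024-01-01T10:00:12 web error rendering page",
]
cpu_percent = [22.0, 25.0, 24.0, 91.0, 93.0]

errors = find_errors(sample_log)
print(count_by_source(errors))
print(flag_spikes(cpu_percent))
```

A plain keyword match is deliberately crude. It is a starting place for an investigation, and a real deployment would more likely rely on the severity levels or error types that the logging system already records.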

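How to monitor systems

Given the breadth of systems, events, and metrics that need to be monitored, centralizing your data collection into a single source of truth may be a good decision, especially when a system goes down. There are log management solutions available to collect, centralize, and organize logs in a way that makes them easy to search, visualize, and report on.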
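To make the idea of centralizing concrete, the sketch below merges log entries from two systems into one time-ordered list that can be searched in a single place. It assumes each source is already sorted by time and that every entry is a hypothetical (ISO timestamp, system, message) tuple. A log management solution does far more than this, so treat it as an illustration of the principle only.

```python
import heapq


def centralize(*sources):
    """Merge per-system log streams into one list ordered by timestamp.

    Each source must already be sorted by timestamp. ISO 8601 timestamps
    in the same format and time zone sort correctly as plain strings.
    """
    return list(heapq.merge(*sources, key=lambda entry: entry[0]))


def search(entries, keyword):
    """Return entries whose message contains the keyword, ignoring case."""
    keyword = keyword.lower()
    return [entry for entry in entries if keyword in entry[2].lower()]


# Made-up sample data for illustration.
server_log = [
    ("2024-01-01T10:00:00", "server", "disk at 91%"),
    ("2024-01-01T10:00:07", "server", "service restarted"),
]
database_log = [
    ("2024-01-01T10:00:03", "database", "SQL timeout"),
    ("2024-01-01T10:00:09", "database", "connection restored"),
]

merged = centralize(server_log, database_log)
for entry in merged:
    print(entry)
print(search(merged, "timeout"))
```

Seeing the entries interleaved by time is what makes a central store useful during an outage: the database timeout appears between the disk warning and the service restart instead of in a separate file.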

Monitoring may also expand beyond log management to include the monitoring of individual IT assets. This type of monitoring includes the ongoing measurement of resource utilization metrics 和 the tracking of software 和 processes running on the assets. Software usage isn’t often captured in traditional logs but can offer vital clues to the root cause of system issues. Being able to not only measure IT asset data but log the results offers significant visibility across your IT environment.
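As a rough sketch of measuring asset data and logging the result, the example below takes a snapshot of a few resource figures using only the Python standard library and serializes it as one JSON log line. The field names and the choice of metrics are assumptions for illustration. Tracking running software and processes, as described above, would need a platform tool or a third-party library such as psutil, which is not shown here.

```python
import json
import os
import shutil
import time


def asset_snapshot(path="."):
    """Collect a small set of resource figures for the machine running this code."""
    usage = shutil.disk_usage(path)
    used_percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_used_percent": used_percent,
    }


def to_log_line(snapshot):
    """Serialize a snapshot as a single JSON line suitable for a log file."""
    return json.dumps(snapshot, sort_keys=True)


line = to_log_line(asset_snapshot())
print(line)
```

Writing each snapshot as a structured line means the same collection and search process used for ordinary logs can also cover asset data.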

When to monitor

In short, system monitoring should be happening 24/7 if your systems need to maintain constant availability. Often, monitoring can happen in the background without you needing to pay constant attention. That said, there are some occasions when you should expect to keep an active eye on your system data, including:

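System updates: When a system is updated, there is a risk of the update failing or causing unintended complications.

Application deployments and rollbacks: When deploying code (or rolling back code) to an application, unexpected issues can arise, even if all unit tests and acceptance tests have passed.

Migrations: Data migrations can often be challenging and present issues with mismatched data types, validation problems, and more.

Peak transaction times: Some businesses have known periods of increased transactions, such as e-commerce companies during holidays or promotions. Availability issues occurring during these peak times could have significant consequences if not caught quickly.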

There are a lot of factors that go into IT system monitoring and troubleshooting. By breaking down your IT environment into which systems and events you need to monitor, you’ll be one step closer to determining the best monitoring strategy and solution for your organization.
