Apache log parser

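Apache access logs in the Common Log Format have a precise structure (the combined format adds quoted referer and user-agent fields at the end of each line):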

Code

127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234

A pattern with named groups extracts the fields of interest in a single pass:

Code

^(?<ip>\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"(?<metodo>\w+)\s+(?<path>\S+)\s+\S+"\s+(?<status>\d+)\s+(?<size>\d+)

It looks monstrous, but it's the concatenation of simple patterns:

(?<ip>\d+\.\d+\.\d+\.\d+) -- the source IPv4.
\s+\S+\s+\S+ -- the identd logname and the authenticated user (both usually -).
\[[^\]]+\] -- the date in square brackets.
"(?<metodo>\w+)\s+(?<path>\S+)\s+\S+" -- the request in quotes, with HTTP method, path and version.
(?<status>\d+)\s+(?<size>\d+) -- status code and bytes sent. Apache writes - instead of a number when no bytes are sent, and \d+ will not match such lines.
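
As a minimal sketch, the pattern can be applied to the sample line in Python. Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`, so the pattern is adapted accordingly; the names `LOG_PATTERN` and `fields` are illustrative and not part of the lesson.

```python
import re

# The lesson's pattern, with Python's (?P<name>...) spelling for named groups.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+'
    r'"(?P<metodo>\w+)\s+(?P<path>\S+)\s+\S+"\s+(?P<status>\d+)\s+(?P<size>\d+)'
)

line = '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234'

match = LOG_PATTERN.match(line)
# groupdict() maps each group name to the text it captured.
fields = match.groupdict() if match else {}

print(fields.get("ip"))      # 127.0.0.1
print(fields.get("metodo"))  # GET
print(fields.get("path"))    # /index.html
print(fields.get("status"))  # 200
print(fields.get("size"))    # 1234
```

All captured values are strings; a line that does not fit the pattern yields an empty dict here rather than an exception.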

General strategy

Start from the real line and pin the delimiters: spaces, quotes, brackets.
Between delimiters, identify the type of token (IP, word, number).
Wrap in named groups only what you need to extract.
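
The three steps can be sketched with Python's `re.VERBOSE` flag, which allows one piece of the pattern per line with a comment next to it. This is only an illustration under the assumption that the IP, method, path and status are the fields needed; the name `REQUEST_LINE` is hypothetical, and named groups use Python's `(?P<name>...)` spelling.

```python
import re

# Step 1: pin the delimiters (spaces, brackets, quotes).
# Step 2: pick a token type between them (\d+, \w+, \S+, or a negated class).
# Step 3: name only the fields that should be extracted.
REQUEST_LINE = re.compile(
    r"""
    ^(?P<ip>\d+\.\d+\.\d+\.\d+)   # IPv4 token at the start of the line
    \s+\S+\s+\S+                  # two fields that are skipped, so no named groups
    \s+\[[^\]]+\]                 # bracket-delimited date, skipped as well
    \s+"(?P<metodo>\w+)           # opening quote, then the method as a word
    \s+(?P<path>\S+)              # path: a run of non-space characters
    \s+\S+"                       # HTTP version and closing quote
    \s+(?P<status>\d+)            # status code as a number
    """,
    re.VERBOSE,
)

line = '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234'

m = REQUEST_LINE.match(line)
if m:
    print(m["ip"], m["metodo"], m["path"], m["status"])  # 127.0.0.1 GET /index.html 200
```

Because the pattern stops after the status code, the byte count is simply left unmatched at the end of the line.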

Parsing complex logs

In log parsing, fields are usually separated by spaces, except for fields wrapped in quotes or brackets, which may contain spaces themselves (e.g. the user-agent). Using negated character classes such as [^"]* or [^\]]+ instead of the wildcard dot keeps a match from running past its closing delimiter and merging several columns into one.
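
The difference can be seen on a line with several quoted fields. The sketch below uses a made-up combined-format line (the referer and user-agent values are invented for illustration) and compares a greedy dot with a negated character class in Python.

```python
import re

# Hypothetical combined-format line: request, referer and user-agent are all quoted.
line = (
    '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" '
    '200 1234 "http://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"'
)

# Greedy dot: runs from the first quote to the LAST quote on the line.
greedy = re.search(r'"(?P<request>.*)"', line)

# Negated class: stops at the first closing quote.
negated = re.search(r'"(?P<request>[^"]*)"', line)

print(negated.group("request"))  # GET /index.html HTTP/1.1

# The same negated class picks out each quoted column separately.
quoted_fields = re.findall(r'"([^"]*)"', line)
print(len(quoted_fields))  # 3
```

Here `greedy.group("request")` swallows the status, size and referer along with the request, which is exactly the column merging described above.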

Try it

Exercise #regex.m8.l2.e1

Extract the IPv4 at the start of every log line as a named group `ip`.

Hint

Anchor to the start of the line with ^ and the m flag, and wrap the IP in (?<ip>...).
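
To see what the m flag changes, here is a small Python sketch in which `re.MULTILINE` plays that role. The second log line is made up for illustration, and named groups use Python's `(?P<name>...)` spelling.

```python
import re

# Two-line excerpt; the second line is a hypothetical example.
log_text = (
    '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234\n'
    '192.168.0.7 - - [10/Oct/2024:13:55:40 +0000] "POST /login HTTP/1.1" 302 512\n'
)

ip_pattern = r"^(?P<ip>\d+\.\d+\.\d+\.\d+)"

# Without the flag, ^ only matches at the very start of the text.
first_only = [m.group("ip") for m in re.finditer(ip_pattern, log_text)]

# With re.MULTILINE, ^ also matches right after every newline.
every_line = [m.group("ip") for m in re.finditer(ip_pattern, log_text, re.MULTILINE)]

print(first_only)  # ['127.0.0.1']
print(every_line)  # ['127.0.0.1', '192.168.0.7']
```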

Review exercise

Exercise #regex.m8.l2.e2

Extract the HTTP method and path of the request between quotes, as groups `metodo` and `path`.

Hint

Open with ", then (?<metodo>\w+)\s+(?<path>\S+).

Additional challenge

Exercise #regex.m8.l2.e3

Extract only the date and time enclosed in brackets on each log line, e.g. `10/Oct/2024:13:55:36 +0000`, as a named group `timestamp`.

Hint

Use \[(?<timestamp>[^\]]+)\] to prevent the engine from matching past the closing bracket.
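
Once captured, the timestamp is still plain text. As a hedged sketch of a possible next step, Python's `datetime.strptime` can turn it into a timezone-aware value; this assumes English month abbreviations (the default C locale), and the group again uses Python's `(?P<name>...)` spelling.

```python
import re
from datetime import datetime

line = '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234'

# The negated class [^\]]+ stops at the first closing bracket.
m = re.search(r"\[(?P<timestamp>[^\]]+)\]", line)
timestamp = m.group("timestamp") if m else None
print(timestamp)  # 10/Oct/2024:13:55:36 +0000

# Assumption: %b matches "Oct" because the locale uses English month names.
parsed = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z") if timestamp else None
print(parsed.isoformat() if parsed else None)  # 2024-10-10T13:55:36+00:00
```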
