Apache log parser
Apache access logs have a precise structure. The sample below shows the common-format fields; the combined format appends a quoted referer and user-agent:
127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234
A pattern with named groups extracts the fields of interest in one pass:
^(?<ip>\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"(?<metodo>\w+)\s+(?<path>\S+)\s+\S+"\s+(?<status>\d+)\s+(?<size>\d+)
It looks monstrous, but it's the concatenation of simple patterns:
- `(?<ip>\d+\.\d+\.\d+\.\d+)`: the source IPv4 address.
- `\s+\S+\s+\S+`: the identity and user fields (usually `-`).
- `\[[^\]]+\]`: the date in square brackets.
- `"(?<metodo>\w+)\s+(?<path>\S+)\s+\S+"`: the request in quotes, with HTTP method, path and version.
- `(?<status>\d+)\s+(?<size>\d+)`: status code and bytes sent.
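To see the pattern at work, here is a minimal sketch in JavaScript, chosen only because its regex engine accepts the `(?<name>...)` syntax unchanged. The helper name `parseLine` and the shape of the returned object are assumptions made for illustration, not part of the lesson.

```javascript
// The pattern from the lesson, unchanged.
const LOG_PATTERN =
  /^(?<ip>\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"(?<metodo>\w+)\s+(?<path>\S+)\s+\S+"\s+(?<status>\d+)\s+(?<size>\d+)/;

// Hypothetical helper: returns the named fields, or null if the line does not fit.
function parseLine(line) {
  const match = LOG_PATTERN.exec(line);
  if (match === null) return null;
  const { ip, metodo, path, status, size } = match.groups;
  // Captures are always strings; convert the numeric fields explicitly.
  return { ip, metodo, path, status: Number(status), size: Number(size) };
}

const sample =
  '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234';
console.log(parseLine(sample));
```

Returning `null` for lines that do not match keeps malformed entries visible to the caller instead of silently producing empty fields.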
General strategy
- Start from the real line and pin down the delimiters: spaces, quotes, brackets.
- Between delimiters, identify the type of token (IP, word, number).
- Wrap in named groups only what you need to extract.
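The strategy above can be sketched by writing each token as its own small pattern and joining the pieces on the whitespace delimiter. This is one possible way to organize it in JavaScript; the variable names are illustrative assumptions.

```javascript
// One small pattern per token type; String.raw keeps the backslashes literal.
const ip = String.raw`(?<ip>\d+\.\d+\.\d+\.\d+)`;
const ids = String.raw`\S+\s+\S+`; // not captured: we do not need these fields
const date = String.raw`\[[^\]]+\]`; // delimited by brackets, not captured
const request = String.raw`"(?<metodo>\w+)\s+(?<path>\S+)\s+\S+"`;
const result = String.raw`(?<status>\d+)\s+(?<size>\d+)`;

// Join the pieces on the field delimiter and anchor at the start of the line.
const assembled = new RegExp('^' + [ip, ids, date, request, result].join(String.raw`\s+`));

const line =
  '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234';
console.log(assembled.exec(line).groups);
```

Keeping the pieces separate makes it easy to test each one against the real line before combining them.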
Parsing complex logs
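In log parsing, fields are usually separated by spaces, but quoted or bracketed fields (e.g. the request, the date, the user-agent) contain spaces themselves. Matching their contents with a negated character class such as `[^"]` or `[^\]]` instead of the wildcard dot stops the match at the closing delimiter, so several columns are not merged into one by mistake.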
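The difference is easiest to see on a line with several quoted fields. The sketch below uses a made-up combined-format line (the referer and user-agent values are invented for illustration) and compares a greedy dot with a negated class.

```javascript
// Hypothetical combined-format line: request, referer and user-agent are all quoted.
const line =
  '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (X11; Linux x86_64)"';

// Greedy dot: runs to the LAST quote on the line, merging three columns.
const greedy = line.match(/"(?<request>.+)"/).groups.request;

// Negated class: cannot cross a quote, so it stops where the first field ends.
const bounded = line.match(/"(?<request>[^"]+)"/).groups.request;

console.log(greedy);
console.log(bounded);
```

The greedy version swallows the status, size, and referer along with the request, while the negated class returns only `GET /index.html HTTP/1.1`.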
Try it
Extract the IPv4 address at the start of every log line as a named group `ip`.
Hint: anchor to the start of the line with `^` and the `m` flag, and wrap the IP in `(?<ip>...)`.
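To check an attempt outside the exercise widget, the effect of the `m` flag can be observed on a multi-line string. This is a minimal JavaScript sketch; the second log line is invented for illustration.

```javascript
// Hypothetical two-line log held in a single string.
const log = [
  '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1234',
  '192.168.0.7 - - [10/Oct/2024:13:55:40 +0000] "POST /login HTTP/1.1" 302 0',
].join('\n');

// Without m, ^ matches only at the very start of the string.
const withoutM = [...log.matchAll(/^(?<ip>\d+\.\d+\.\d+\.\d+)/g)].map((m) => m.groups.ip);

// With m, ^ also matches right after each newline; g lets matchAll visit every match.
const withM = [...log.matchAll(/^(?<ip>\d+\.\d+\.\d+\.\d+)/gm)].map((m) => m.groups.ip);

console.log(withoutM.length, withM.length);
```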
Review exercise
Extract the HTTP method and path of the request between quotes, as groups `metodo` and `path`.
Hint: open with `"`, then `(?<metodo>\w+)\s+(?<path>\S+)`.
Additional challenge
Extract only the date and time enclosed in square brackets, e.g. `10/Oct/2024:13:55:36 +0000`, as a named group `timestamp`.
Hint: use `\[(?<timestamp>[^\]]+)\]` to prevent the engine from matching past the closing bracket.