Extracting URLs and IPs
In free text (logs, articles, dumps) it's common to want to extract URLs and IP addresses. Let's look at robust patterns for both.
http/https URLs
Pattern: https?:\/\/[\w.-]+(?:\:\d+)?(?:\/[^\s]*)?
https?:\/\/ -- scheme, with the s optional.
[\w.-]+ -- host (domain, subdomains, possibly localhost).
(?:\:\d+)? -- optional port.
(?:\/[^\s]*)? -- optional path, up to the first whitespace.
Captures https://example.com, http://localhost:3000/api/users,
https://docs.dev/path?query=value.
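As a quick check of the pattern above, here is a minimal Python sketch. The sample text and the variable names are made up for illustration, and the final clean-up step is an assumption about what you might want, not part of the pattern itself. The backslashes before / and : are harmless in Python's re module, so the pattern is used as written.

```python
import re

# The URL pattern from the lesson, unchanged.
URL_RE = re.compile(r"https?:\/\/[\w.-]+(?:\:\d+)?(?:\/[^\s]*)?")

# Hypothetical sample text, invented for this example.
text = (
    "Visit https://example.com, call http://localhost:3000/api/users "
    "and read https://docs.dev/path?query=value."
)

# With only non-capturing groups, findall returns the full matches.
urls = URL_RE.findall(text)
print(urls)

# The path part runs up to the first whitespace, so sentence punctuation
# glued to a URL is captured too. Stripping it afterwards is a heuristic.
cleaned = [u.rstrip(".,;:!?") for u in urls]
print(cleaned)
```

Note that the last URL is first captured with the sentence's final period attached, because [^\s]* only stops at whitespace. The host class [\w.-]+ behaves the same way with a trailing dot, so some post-processing is usually worth considering when extracting from prose.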
IPv4
An IPv4 address is four decimal octets separated by dots:
Pattern: \b(?:\d{1,3}\.){3}\d{1,3}\b
"Good enough" version: it also accepts invalid values like
999.999.999.999. For the strict version you'd need range alternation
(25[0-5]|2[0-4]\d|[01]?\d\d?), which is much longer.
Pattern (strict): \b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b
Precision versus brevity
Matching URLs or IPs means trading precision against brevity. A strict IP validator checks that no octet exceeds 255. A practical extractor, on the other hand, typically uses a simpler pattern and delegates fine-grained validation to dedicated code.
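To make the trade-off concrete, the sketch below runs both IPv4 patterns on the same text, then shows the "extract permissively, validate in code" approach using Python's standard ipaddress module. The sample log line and the helper name is_valid_ipv4 are hypothetical, chosen only for this illustration.

```python
import ipaddress
import re

# Permissive pattern from the lesson: any four groups of 1 to 3 digits.
PERMISSIVE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Strict pattern from the lesson, built from one reusable octet fragment.
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
STRICT = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")

# Hypothetical log line, invented for this example.
log = "ok 192.168.1.10, gateway 10.0.0.1, bogus 999.999.999.999 and 256.1.1.1"

candidates = PERMISSIVE.findall(log)  # includes the two invalid addresses
strict_hits = STRICT.findall(log)     # only addresses with octets in 0-255


def is_valid_ipv4(value: str) -> bool:
    """Delegate range checking to the standard library (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


# Short regex for extraction, dedicated code for validation.
validated = [c for c in candidates if is_valid_ipv4(c)]

print(candidates)
print(strict_hits)
print(validated)
```

On this sample, the strict pattern and the permissive-plus-validation route agree. The second route keeps the regex short and leaves the numeric rules to code that is easier to read and test.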
Try it
Find every http or https URL in the text. Scheme + host + optional path.
Hint: https? for the optional s, [\w.-]+ for the domain, and a group (?:\/[^\s]*)? for the optional path.
Review exercise
Find every IPv4 (4 decimal octets separated by dots). Permissive version, no 0-255 check.
Hint: Use (?:\d{1,3}\.){3} to repeat 'octet + dot' 3 times, then \d{1,3} for the last one.
Additional challenge
Find all IPv4 addresses in the format `X.X.X.X` (composed of four 1-to-3 digit numbers separated by dots).
Hint: Use \b to enforce boundaries, (?:\d{1,3}\.){3} to repeat the octet and dot three times, and finally \d{1,3}.