🏷️ Table of Contents

1. What is Content Discovery?
2. Manual Discovery: Robots.txt
3. Manual Discovery: Favicon
   3.1 Practical Exercise
4. Manual Discovery: Sitemap.xml
5. Manual Discovery: HTTP Headers
6. Manual Discovery: Framework Stack
7. Manual Discovery: Google Hacking and Dorking
8. OSINT: Wappalyzer
9. OSINT: Wayback Machine
10. OSINT: GitHub
11. OSINT: S3 Buckets
12. Automated Discovery
   12.1 What is Automated Discovery?
   12.2 What are wordlists?
   12.3 Automation Tools

📚 Study Notes

What is Content Discovery?

Content discovery means looking for hidden or less obvious parts of a website, such as old pages, backup files, or staff-only sections that regular visitors don’t normally see. These can be found by checking the site manually, using automated tools, or gathering information from publicly available sources (OSINT).

Now it is time to start your AttackBox :)

❓What is the Content Discovery method that begins with M?
Manually

❓What is the Content Discovery method that begins with A?
Automated

❓What is the Content Discovery method that begins with O?
OSINT

Manual Discovery: Robots.txt

One simple place to check when looking for hidden website content is the robots.txt file. This file tells search engine crawlers which parts of a website they may and may not crawl, which usually keeps those parts out of search results.

Website owners often use it to hide areas such as admin panels, private customer files, or other sensitive pages from search engines. While this doesn’t actually secure those pages, it can accidentally reveal useful locations to someone testing the website.

For beginners in security testing, the robots.txt file can act like a small map of pages the website owner prefers people not to find.

Look at the robots.txt file on the Acme IT Support website: open Firefox on the AttackBox and check the URL: http://MACHINE_IP/robots.txt
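When a robots.txt file is long, the Disallow rules can be pulled out with a small filter. The following is a minimal sketch, not part of the room itself: the function name `disallowed_paths` is an assumption made for illustration, and it only handles the plain `Disallow: /path` form.

```shell
# Print the path of every Disallow rule found in robots.txt content on stdin.
disallowed_paths() {
  # Keep only Disallow lines (the directive name is case-insensitive).
  grep -i '^[[:space:]]*disallow[[:space:]]*:' |
    # Drop the directive name, then any trailing comment, space, or
    # carriage return, and finally discard rules that had no path at all.
    sed -E -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]#].*$//' -e '/^$/d'
}
```

On the AttackBox it could be used as `curl -s http://MACHINE_IP/robots.txt | disallowed_paths`.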

❓What is the directory in the robots.txt that isn't allowed to be viewed by web crawlers?
/staff-portal

Manual Discovery: Favicon

A favicon is the small icon you see in a browser tab next to a website’s name.

Sometimes websites are built using pre-made frameworks, and developers forget to replace the default favicon that comes with them. If that happens, the icon can reveal which framework or technology the website is using.

Security testers can compare the favicon with known icons in databases like the OWASP favicon database to identify the technology behind the website. Once the framework is known, it becomes easier to research possible weaknesses or common vulnerabilities related to it.

Practical Exercise:

Open https://static-labs.tryhackme.cloud/sites/favicon/ in Firefox on the AttackBox. The icon shown in the browser tab confirms that the site has a favicon.

Viewing the page source you’ll see line six contains a link to the images/favicon.ico file.

To download the favicon and get its MD5 hash, run:

`curl https://static-labs.tryhackme.cloud/sites/favicon/images/favicon.ico | md5sum`

Note: This curl command will fail on the AttackBox if you are a free user, in which case you should use a VM instead. If your hash ends with 427e, the download failed (that is the MD5 of empty input), so try it again.

To do the same on Windows in PowerShell, run:

`PS C:\> curl https://static-labs.tryhackme.cloud/sites/favicon/images/favicon.ico -UseBasicParsing -o favicon.ico`

`PS C:\> Get-FileHash .\favicon.ico -Algorithm MD5`
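The 427e check can be automated so that a failed download is never mistaken for a real favicon hash. This is a sketch under assumptions: the function name `favicon_md5` is made up for illustration, and it relies on `md5sum` being available, as it is on most Linux systems.

```shell
# MD5 of zero bytes of input; md5sum prints this when the download was empty.
EMPTY_MD5="d41d8cd98f00b204e9800998ecf8427e"

# Read favicon bytes on stdin and print their MD5 hash.
# Returns a non-zero status if the input was empty.
favicon_md5() {
  local digest
  # md5sum prints "<hash>  -" for stdin; keep only the hash column.
  digest=$(md5sum | awk '{ print $1 }')
  if [ "$digest" = "$EMPTY_MD5" ]; then
    echo "empty download: hash matches the MD5 of no data" >&2
    return 1
  fi
  echo "$digest"
}
```

A possible use is `curl -s https://static-labs.tryhackme.cloud/sites/favicon/images/favicon.ico | favicon_md5`, after which the printed hash can be looked up in the OWASP favicon database.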

❓What framework did the favicon belong to?
cgiirc

Manual Discovery: Sitemap.xml

The sitemap.xml file lists pages that a website owner wants search engines to index. It can sometimes reveal hard-to-find pages or old parts of the website that are still accessible.

Look at the sitemap.xml file on the Acme IT Support website by opening http://MACHINE_IP/sitemap.xml in the Firefox browser on your AttackBox.
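A sitemap lists each page inside a `<loc>` element, so the URLs can also be extracted on the command line. The sketch below assumes a conventional sitemap with plain `<loc>` tags; the function name `sitemap_urls` is hypothetical, and a real XML parser would be more robust.

```shell
# Print every URL found inside <loc>...</loc> in sitemap XML on stdin.
sitemap_urls() {
  # grep -o emits each <loc> element on its own line, even when several
  # elements share one line; sed then strips the surrounding tags.
  grep -o '<loc>[^<]*</loc>' | sed -e 's/^<loc>//' -e 's#</loc>$##'
}
```

For example: `curl -s http://MACHINE_IP/sitemap.xml | sitemap_urls`.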

❓What is the path of the secret area that can be found in the sitemap.xml file?
/s3cr3t-area

Manual Discovery: HTTP Headers

When you visit a website, the server sends HTTP headers with its response. These headers can reveal useful information such as the web server software and programming language being used (for example, NGINX or PHP). This information can help security testers check if the website is running outdated or vulnerable software.

To view the headers, run `curl http://MACHINE_IP -v`. The `-v` switch enables verbose mode, which prints the request and response headers along with the page. You will find something very interesting ;)
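Verbose output can be noisy, so it may help to keep only the headers that tend to reveal software. This is a rough sketch: the function name `interesting_headers` and the choice of headers (`Server` plus any `X-` header) are assumptions for illustration, not a rule from the room.

```shell
# Read raw HTTP response headers on stdin and print the revealing ones:
# the Server header and any custom X- header.
interesting_headers() {
  # HTTP header lines end in CRLF, so strip carriage returns first.
  tr -d '\r' | grep -i -E '^(server|x-[a-z0-9-]+):'
}
```

One way to feed it is `curl -s -D - -o /dev/null http://MACHINE_IP | interesting_headers`, where `-D -` writes the response headers to standard output and `-o /dev/null` discards the page body.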

❓What is the flag value from the X-FLAG header?
THM{H*****_F***}

Manual Discovery: Framework Stack

After identifying the framework a website uses (for example from the favicon or page source), you can research it on its official website. This can help you learn how the framework works and where hidden files or features might exist, which can lead to discovering more website content.

When checking the page source of the Acme IT Support website, you can find a comment that includes a link to the framework’s website. By reading the framework’s documentation, you can discover the location of the admin panel, which can then be accessed on the target website.

❓What is the flag from the framework's administration portal?
THM{C*****_D*****_C**********}

Manual Discovery: Google Hacking and Dorking

OSINT (Open-Source Intelligence) tools are publicly available resources that help gather information about a target website.

Google Hacking (also called Google Dorking) is a technique that uses advanced search operators in Google Search to find specific information on the internet.

Instead of a normal search, you use special filters (operators) to narrow down results and locate hidden or sensitive information such as admin pages, documents, or login portals.

Common Google Dork Operators

| Operator | Example | What it Does |
| --- | --- | --- |
| `site:` | `site:tryhackme.com` | Shows results only from a specific website |
| `inurl:` | `inurl:admin` | Finds pages that contain the word in the URL |
| `filetype:` | `filetype:pdf` | Finds specific file types |
| `intitle:` | `intitle:admin` | Finds pages that contain the word in the page title |

More info about google hacking is here

❓What Google dork operator can be used to only show results from a particular site?
site:

OSINT: Wappalyzer

Wappalyzer is an online tool and browser extension that identifies the technologies a website uses, such as frameworks, Content Management Systems (CMS), and payment processors. It can often detect version numbers as well.

❓What online tool can be used to identify what technologies a website is running?
Wappalyzer

OSINT: Wayback Machine

The Wayback Machine is a historical archive of websites that dates back to the late 90s. You can search a domain name, and it will show you all the times the service scraped the web page and saved the contents. This service can help uncover old pages that may still be active on the current website.
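Besides the web interface, the Internet Archive offers a CDX query endpoint that lists archived URLs for a domain. The sketch below only builds the query URL; it is not covered in the room, the function name `wayback_cdx_url` is hypothetical, and the parameters (`fl=original`, `collapse=urlkey`) follow my understanding of the public CDX documentation, so treat them as assumptions to verify.

```shell
# Build a Wayback Machine CDX query URL that lists archived URLs under a domain.
wayback_cdx_url() {
  local domain="$1"
  # url=<domain>/* asks for every capture under the domain,
  # fl=original returns only the original URL column,
  # collapse=urlkey collapses repeated captures of the same URL.
  printf 'https://web.archive.org/cdx/search/cdx?url=%s/*&fl=original&collapse=urlkey\n' "$domain"
}
```

A possible use is `curl -s "$(wayback_cdx_url example.com)" | sort -u`, then checking which of the old paths still respond on the live site.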

❓What is the website address for the Wayback Machine?
https://archive.org/web/

OSINT: GitHub

Git is a tool that tracks changes in project files. It helps teams work together by keeping a history of edits and showing who changed what.

Basic workflow:

1. A user edits files on their computer
2. They commit the changes with a short message
3. They push the changes to a central storage called a repository
4. Other users pull the changes to update their own copies
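The workflow above can be tried locally without any hosting service. In this sketch a bare repository in a temporary directory stands in for the central repository; the user names, file name, and the function name `git_workflow_demo` are all made up for illustration.

```shell
# Walk through edit -> commit -> push -> pull using throwaway local repositories.
# Prints the file content that the second user sees after pulling.
git_workflow_demo() {
  local work
  work=$(mktemp -d) || return 1

  # A bare repository plays the role of the central repository.
  git init --quiet --bare "$work/central.git" || return 1
  git clone --quiet "$work/central.git" "$work/alice" 2>/dev/null || return 1
  git clone --quiet "$work/central.git" "$work/bob" 2>/dev/null || return 1

  # 1. Alice edits a file on her computer.
  echo "first draft" > "$work/alice/notes.txt"

  # 2. She commits the change with a short message.
  git -C "$work/alice" add notes.txt || return 1
  git -C "$work/alice" -c user.name="Alice" -c user.email="alice@example.com" \
    commit --quiet -m "Add notes" || return 1

  # 3. She pushes the commit to the central repository.
  git -C "$work/alice" push --quiet origin HEAD || return 1

  # 4. Bob pulls the change to update his own copy.
  git -C "$work/bob" pull --quiet origin HEAD || return 1

  cat "$work/bob/notes.txt"
  rm -rf "$work"
}
```

Every committed version stays in the history, which is why secrets that were committed and later removed from a public repository can still be recovered.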

GitHub is a website that hosts Git repositories online.

Repositories can be:

Public - anyone can see them
Private - only authorized users can access them

❓What is Git?
version control system

OSINT: S3 Buckets

S3 buckets are storage containers in Amazon S3, a cloud storage service provided by Amazon AWS.
They let users store files, images, backups, or even static websites that can be accessed over HTTP or HTTPS.

The owner of an S3 bucket can set different permissions:

Public - anyone can access the files
Private - only authorized users can access them
Writable - users may be able to upload files

Sometimes these permissions are misconfigured, which can accidentally expose files that should not be public.

S3 bucket URL format: http(s)://{bucket-name}.s3.amazonaws.com

The bucket name is chosen by the owner, e.g. http://tryhackme-assets.s3.amazonaws.com/
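Because the format is fixed, candidate bucket URLs can be generated from a guessed name. This is a minimal sketch: both function names are hypothetical, and the suffix list (assets, www, public, private) is an illustrative assumption rather than a list taken from any tool.

```shell
# Print the URL for an S3 bucket name, following the format above.
s3_bucket_url() {
  printf 'https://%s.s3.amazonaws.com/\n' "$1"
}

# Print candidate bucket URLs for a base name combined with a few suffixes.
# The suffixes are hypothetical examples; adjust them to the target.
s3_candidate_urls() {
  local name="$1" suffix
  for suffix in assets www public private; do
    s3_bucket_url "$name-$suffix"
  done
}
```

For example, `s3_candidate_urls acme` prints four URLs, starting with `https://acme-assets.s3.amazonaws.com/`, which can then be requested one by one to see whether a bucket exists and whether it is publicly listable.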

❓What URL format do Amazon S3 buckets end in?
.s3.amazonaws.com

Automated Discovery

What is Automated Discovery?

Automated discovery means using tools to find hidden files and directories on a website instead of searching manually.

These tools send many requests (sometimes thousands or millions) to a web server to check if certain files or folders exist.
This can reveal content that wasn’t publicly linked or visible before.

What are wordlists?

Wordlists are text files that contain many commonly used words.

In web discovery, wordlists usually contain common file names and directory names that tools test on a website.

Example uses of wordlists:

Password cracking - list of common passwords
Web discovery - list of common directories and files

An excellent resource for wordlists, preinstalled on the THM AttackBox, is SecLists, which Daniel Miessler curates.
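To make the idea concrete, here is a stripped-down sketch of what a discovery tool does with a wordlist: request each word as a path and report the ones that exist. It is an illustration only, with hypothetical function names; real tools add concurrency, response filtering, recursion, and much more.

```shell
# Print the HTTP status code for a URL. Kept as a separate function so it
# can be replaced, for example with a stub when testing offline.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Try every word from a wordlist (read from stdin) as a path under a base URL
# and print "<status> /<word>" for each one that appears to exist.
discover_paths() {
  local base="${1%/}" word code
  while IFS= read -r word; do
    # Skip blank lines and comment lines in the wordlist.
    case $word in ''|'#'*) continue ;; esac
    code=$(http_status "$base/$word")
    # 404 means not found; curl reports 000 when the request itself failed.
    if [ "$code" != "404" ] && [ "$code" != "000" ]; then
      printf '%s /%s\n' "$code" "$word"
    fi
  done
}
```

It could be run as `discover_paths http://MACHINE_IP < /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt`, although it sends one request at a time and is far slower than dedicated tools.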

Automation Tools

Many content discovery tools are available, but this room covers only three, all of which are preinstalled on the AttackBox: ffuf, dirb and gobuster.
On the AttackBox, run these commands:
1. ffuf: `user@machine$ ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://MACHINE_IP/FUZZ`
2. dirb: `user@machine$ dirb http://MACHINE_IP/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt`
3. gobuster: `user@machine$ gobuster dir --url http://MACHINE_IP/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt`

❓What is the name of the directory beginning "/mo...." that was discovered?
/monthly

❓What is the name of the log file that was discovered?
/development.log