Introducing Katana: The CLI web crawler from PD

What is Katana?

Katana is a command-line interface (CLI) web crawling tool written in Golang. It is designed to crawl websites to gather information and endpoints. One of the defining features of Katana is its ability to use headless browsing to crawl applications. This means it can crawl single-page applications (SPAs) built with JavaScript frameworks such as Angular or React. These types of applications are becoming increasingly common but can be difficult to crawl using traditional tools. By using headless browsing, Katana is able to access and gather information from these applications more effectively.

Katana is designed to be CLI-friendly, fast, and efficient, with a simple output format. This makes it an attractive option for those looking to use the tool as part of an automation pipeline. Furthermore, regular updates and maintenance ensure that this tool remains a valuable and indispensable part of your hacker arsenal for years to come.

Tool integrations

Katana is an excellent tool for several reasons, one of which is its simple input/output formats. These formats are easy to understand and use, allowing users to quickly and easily integrate Katana into their workflow. Katana is designed to be easily integrated with other tools in the ProjectDiscovery suite, as well as other widely used CLI-based recon tools.

What is web crawling?

Any search engine you use today is populated using web crawlers. A web crawler indexes web applications by automating the “click every button” approach to discovering paths, scripts and other resources. Web application indexing is an important step in uncovering an application’s attack surface.

Installation

Katana offers a few different installation methods: downloading the pre-compiled binary, compiling the binary with go install, or using Docker.

Binary

There are two ways to install the binary directly onto your system:

  1. Download the pre-compiled binary from the release page.
  2. Run go install:
go install github.com/projectdiscovery/katana/cmd/katana@latest

Docker

  1. Install/Update the docker image to the latest tag:

docker pull projectdiscovery/katana:latest

  2. Running Katana:

a. Normal mode:

docker run projectdiscovery/katana:latest -u https://tesla.com

b. Headless mode:

docker run projectdiscovery/katana:latest -u https://tesla.com -system-chrome -headless

Options

Here are the raw options for your perusal – we'll take a closer look at each below!

Configuration

-d, -depth Defines maximum crawl depth, ex: -d 2
-jc, -js-crawl Enables endpoint parsing/crawling from JS files
-ct, -crawl-duration Maximum time to crawl the target for, ex: -ct 100
-kf, -known-files Enable crawling for known files, ex: all,robotstxt,sitemapxml, etc.
-mrs, -max-response-size Maximum response size to read, ex: -mrs 200000
-timeout Time to wait for request in seconds, ex: -timeout 5
-aff, -automatic-form-fill Enable optional automatic form filling. This is still experimental
-retry Number of times to retry the request, ex: -retry 2
-proxy HTTP/SOCKS5 proxy to use, ex: -proxy http://127.0.0.1:8080
-H, -headers Include custom headers/cookies with your request, ex: TODO
-config Path to the katana configuration file, ex: -config /home/g0lden/katana-config.yaml
-fc, -form-config Path to form configuration file, ex: -fc /home/g0lden/form-config.yaml

Headless

-hl, -headless Enable headless hybrid crawling. This is experimental
-sc, -system-chrome Use a locally installed Chrome browser instead of katana’s
-sb, -show-browser Show the browser on screen when in headless mode
-ho, -headless-options Start headless chrome with additional options
-nos, -no-sandbox Start headless chrome in --no-sandbox mode

Scope

-cs, -crawl-scope In-scope URL regex to be followed by crawler, ex: -cs login
-cos, -crawl-out-scope Out-of-scope URL regex to be excluded by crawler, ex: -cos logout
-fs, -field-scope Pre-defined scope field (dn,rdn,fqdn), ex: -fs dn
-ns, -no-scope Disables host-based default scope, allowing for internet scanning
-do, -display-out-scope Display external endpoints found from crawling

Filter

-f, -field Field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir), ex: -f qurl
-sf, -store-field Field to store in selected output option (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir), ex: -sf qurl
-em, -extension-match Match output for given extension, ex: -em php,html,js
-ef, -extension-filter Filter output for given extension, ex: -ef png,css

Rate-limit

-c, -concurrency Number of concurrent fetchers to use, ex: -c 50
-p, -parallelism Number of concurrent inputs to process, ex: -p 50
-rd, -delay Delay between each request in seconds, ex: -rd 3
-rl, -rate-limit Maximum requests to send per second, ex: -rl 150
-rlm, -rate-limit-minute Maximum number of requests to send per minute, ex: -rlm 1000

Output

-o, -output File to write output to, ex: -o findings.txt
-j, -json Write output in JSONL(ines) format
-nc, -no-color Disable output content coloring (ANSI escape codes)
-silent Display output only
-v, -verbose Display verbose output
-version Display project version
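
As a quick illustration, several of these options can be combined in a single command; for example, the following crawls to a depth of 3 with JavaScript parsing enabled and writes the results to a file:

katana -u https://tesla.com -d 3 -jc -silent -o output.txt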

Different inputs

There are four different ways to give katana input:

  1. URL input

katana -u https://tesla.com

  2. Multiple URL input

katana -u https://tesla.com,https://google.com

  3. List input

katana -list url_list.txt

  4. STDIN input (piped)

echo "https://tesla.com" | katana

Crawling modes

Standard mode

Standard mode uses the standard Golang HTTP library to make requests. The upside of this mode is that there is no browser overhead, so it’s much faster than headless mode. The downside is that the HTTP library in Go analyzes the HTTP response as is and any dynamic JavaScript or DOM (Document Object Model) manipulations won’t load, causing you to miss post-rendered endpoints or asynchronous endpoint calls.

If you are confident the application you are crawling does not use complex DOM rendering or asynchronous events, then this mode is the one to use, as it is faster. Standard mode is the default:

katana -u https://tesla.com

Headless mode

Headless mode uses internal headless calls to handle HTTP requests/responses within a browser context. This solves two major issues:

  • The HTTP fingerprint, including TLS and user agent, identifies the client as a real browser and is accepted as such.
  • Better coverage, since both the raw HTTP response and the browser-rendered response (with JavaScript executed) are analyzed.

If you are crawling a modern complex application that utilizes DOM manipulation and/or asynchronous events, consider using headless mode by utilizing the -headless option:

katana -u https://tesla.com -headless

Controlling your scope

Controlling your scope is important for returning valuable results. Katana has four main ways to control the scope of your crawl:

  • Field-scope
  • Crawl-scope
  • Crawl-out-scope
  • No-scope

Field-scope

When setting the field scope, you have three options:

  1. rdn - crawling scoped to the root domain name and all subdomains (default)

a. Running katana -u https://tesla.com -fs rdn returns anything that matches *.tesla.com

  2. fqdn - crawling scoped to the given sub(domain)

a. Running katana -u https://tesla.com -fs fqdn returns nothing, because no URLs containing only “tesla.com” are found

b. Running katana -u https://www.tesla.com -fs fqdn only returns URLs that are on the “www.tesla.com” domain.

  3. dn - crawling scoped to the domain name keyword

a. Running katana -u https://tesla.com -fs dn returns anything that contains the domain name itself. In this example, that is “tesla”. Notice how the results include an entirely different domain, suppliers.teslamotors.com.

Crawl-scope

The crawl-scope (-cs) flag works as a regex filter, only returning matching URLs. For example, when filtering for “shop” on tesla.com, only results containing the word “shop” are returned:
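
katana -u https://tesla.com -cs shop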

Crawl-out-scope

Similarly, the crawl-out-scope (-cos) flag works as a filter that removes any URLs matching the regex given after the flag. Filtering out “shop” removes all URLs that contain the string “shop” from the output:
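
katana -u https://tesla.com -cos shop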

No-scope

Setting the no-scope flag will allow the crawler to start at the target and crawl out across the internet. Running katana -u https://tesla.com -ns will pick up domains beyond the starting target site “tesla.com”, as the crawler will follow any links it finds.

Making Katana a crawler for you with configuration

Depth

Define the depth of your crawl. The higher the depth, the more recursive crawls you will get. Be aware this can lead to long crawl times against large web applications.

katana -u https://tesla.com -d 5

Crawling JavaScript

For web applications that lean heavily on JavaScript files, turn on JavaScript parsing/crawling. This is off by default; enabling it allows the crawler to crawl and parse JavaScript files, which can be hiding all kinds of useful endpoints.

katana -u https://tesla.com -jc

Crawl duration

Set a predefined crawl duration and the crawler will return all URLs it finds in the specified time.

katana -u https://tesla.com -ct 2

Known files

Find and crawl any robots.txt or sitemap.xml files that are present. This functionality is turned off by default.

katana -u https://tesla.com -kf robotstxt,sitemapxml

Automatic form fill

Enables automatic form-filling for known and unknown fields. Known field values can be customized in the form config file (default location: $HOME/.config/katana/form-config.yaml).

katana -u https://tesla.com -aff

Handling your output

Field

The field flag is used to filter the output for the desired information you are searching for. ProjectDiscovery’s documentation includes a very detailed table of all the available fields along with examples.

For example, the qurl field filters the crawl output to only return URLs that contain query parameters:
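
katana -u https://tesla.com -f qurl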

Store-field

The store-field flag does the same thing as the field flag we just went over, except that it filters the output being stored in the file of your choice. Because the two flags are separate, you can make the data you see and the data you store different if needed.

katana -u https://tesla.com -sf key,fqdn,qurl

Extension-match & extension-filter

You can use the extension-match flag to only return URLs that end with your chosen extensions:

katana -u https://tesla.com -silent -em js,jsp,json

If you would rather filter out file extensions you DON’T want in the output, you can remove them using the extension-filter flag:

katana -u https://tesla.com -silent -ef css,txt,md

JSON

Katana has a JSON flag that outputs results in JSONL format, including the source, tag, and attribute name related to each discovered endpoint.
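
For example, to write JSONL results straight to a file:

katana -u https://tesla.com -j -o results.json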

Rate limiting and delays

Delay

The delay flag allows you to set a delay (in seconds) between requests while crawling. This feature is turned off by default.

katana -u https://tesla.com -delay 20

Concurrency

The concurrency flag is used to set the number of URLs per target to fetch at a time. Notice that this flag is used along with the parallelism flag to create the total concurrency model.

katana -u https://tesla.com -c 20

Parallelism

The parallelism flag is used to set the number of targets to be processed at one time. If you only have one target, then there is no need to set this flag.

katana -u https://tesla.com -p 20

Rate-limit

This flag allows you to set the maximum number of requests the crawler sends out per second.

katana -u https://tesla.com -rl 100

Rate-limit-minute

A rate-limiting flag similar to the one above, but used to set a maximum number of requests per minute.

katana -u https://tesla.com -rlm 500

Chaining Katana with other ProjectDiscovery tools

Since katana can take input from STDIN, it is straightforward to chain katana with the other tools that ProjectDiscovery has released. A good example of this is:

subfinder -d tesla.com -silent | httpx -silent | katana
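
To take it a step further, the crawled endpoints could be piped into yet another ProjectDiscovery tool. For instance, assuming nuclei is installed, a chain like the following would hand everything Katana finds straight to the scanner:

subfinder -d tesla.com -silent | httpx -silent | katana -silent | nuclei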

Conclusion

Hopefully, this has excited you to go out and crawl the planet. With all the options available, you should have no problem fitting this tool into your workflows. ProjectDiscovery has made this wonderful web crawler to cover many sore spots created by crawlers of the past. Katana makes crawling look like running!

Author – Gunnar Andrews, @g0lden1
