What is Katana?
Katana is designed to be CLI-friendly, fast, and efficient, with a simple output format. This makes it an attractive option for anyone looking to use the tool as part of an automation pipeline. Furthermore, regular updates and maintenance ensure that this tool remains a valuable and indispensable part of your hacker arsenal for years to come.
Katana is an excellent tool for several reasons, one of which is its simple input and output formats. These formats are easy to understand and use, allowing users to quickly integrate Katana into their workflow. Katana is designed to integrate easily with other tools in the ProjectDiscovery suite, as well as with other widely used CLI-based recon tools.
What is web crawling?
Any search engine you use today is populated using web crawlers. A web crawler indexes web applications by automating the “click every button” approach to discovering paths, scripts and other resources. Web application indexing is an important step in uncovering an application’s attack surface.
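The core of that "click every button" loop is simply parsing each page for links and queueing them for the next round of requests. As a rough illustration of the idea (my own sketch, not Katana's code), here is link extraction using only Python's standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href/src attributes - the paths, scripts and other
    resources a crawler would queue up and visit next."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL
                self.found.append(urljoin(self.base_url, value))

page = '<a href="/shop">Shop</a><script src="/static/app.js"></script>'
parser = LinkExtractor("https://example.com")
parser.feed(page)
# parser.found now holds the absolute URLs a crawler would visit next
```

A real crawler repeats this for every discovered URL up to some depth, which is exactly what the depth and scope flags below control.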
Katana offers a few different installation methods: downloading the pre-compiled binary, installing with Go, or using Docker.
There are two ways to install the binary directly onto your system:
- Download the pre-compiled binary from the release page.
- Run go install:
go install github.com/projectdiscovery/katana/cmd/katana@latest
To use Docker instead:
1. Install/update the Docker image to the latest tag
docker pull projectdiscovery/katana:latest
2. Running Katana
a. Normal mode:
docker run projectdiscovery/katana:latest -u https://tesla.com
b. Headless mode:
docker run projectdiscovery/katana:latest -u https://tesla.com -system-chrome -headless
Here are the raw options for your perusal – we'll take a closer look at each below!
-d, -depth  Defines maximum crawl depth
-jc, -js-crawl  Enables endpoint parsing/crawling from JS files
-ct, -crawl-duration  Maximum time to crawl the target for
-kf, -known-files  Enable crawling for known files
-mrs, -max-response-size  Maximum response size to read
-timeout  Time to wait for request in seconds
-aff, -automatic-form-fill  Enable optional automatic form filling. This is still experimental
-retry  Number of times to retry the request
-proxy  HTTP/SOCKS5 proxy to use
-H, -headers  Include custom headers/cookies with your request
-config  Path to the katana configuration file
-fc, -form-config  Path to form configuration file
-hl, -headless  Enable headless hybrid crawling. This is experimental
-sc, -system-chrome  Use a locally installed Chrome browser instead of katana's
-sb, -show-browser  Show the browser on screen when in headless mode
-ho, -headless-options  Start headless Chrome with additional options
-nos, -no-sandbox  Start headless Chrome in --no-sandbox mode
-cs, -crawl-scope  In-scope URL regex to be followed by crawler
-cos, -crawl-out-scope  Out-of-scope URL regex to be excluded by crawler
-fs, -field-scope  Pre-defined scope field (dn,rdn,fqdn)
-ns, -no-scope  Disables host-based default scope, allowing for internet scanning
-do, -display-out-scope  Display external endpoints found from crawling
-f, -field  Field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
-sf, -store-field  Field to store in selected output option (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
-em, -extension-match  Match output for given extension
-ef, -extension-filter  Filter output for given extension
-c, -concurrency  Number of concurrent fetchers to use
-p, -parallelism  Number of concurrent inputs to process
-rd, -delay  Request delay between each request in seconds
-rl, -rate-limit  Maximum requests to send per second
-rlm, -rate-limit-minute  Maximum number of requests to send per minute
-o, -output  File to write output to
-j, -json  Write output in JSONL(ines) format
-nc, -no-color  Disable output content coloring (ANSI escape codes)
-silent  Display output only
-v, -verbose  Display verbose output
-version  Display project version
There are four different ways to give katana input:
1. URL input
katana -u https://tesla.com
2. Multiple URL input
katana -u https://tesla.com,https://google.com
3. List input
katana -list url_list.txt
4. STDIN input (piped)
echo "https://tesla.com" | katana
If you are confident the application you are crawling does not use complex DOM rendering or asynchronous events, this mode is the one to use, as it is faster. Standard mode is the default:
katana -u https://tesla.com
Headless mode uses internal headless calls to handle HTTP requests/responses within a browser context. This solves two major issues:
- The HTTP fingerprint from a headless browser will be identified and accepted as a real browser, including TLS and user agent.
- Pages that build their content with JavaScript are rendered by the browser before parsing, so endpoints created at runtime can still be discovered.
If you are crawling a modern, complex application that utilizes DOM manipulation and/or asynchronous events, consider using headless mode via the -headless flag:
katana -u https://tesla.com -headless
Controlling your scope
Controlling your scope is important to returning valuable results. Katana has four main ways to control the scope of your crawl:
When setting the field scope, you have three options:
- rdn - crawling scoped to root domain name and all subdomains (default)
katana -u https://tesla.com -fs rdn returns anything that matches *.tesla.com
- fqdn - crawling scoped to given sub(domain)
katana -u https://tesla.com -fs fqdn returns nothing because no URLs containing only “tesla.com” are found
katana -u https://www.tesla.com -fs fqdn only returns URLs that are on the “www.tesla.com” domain.
- dn - crawling scoped to domain name keyword
katana -u https://tesla.com -fs dn returns anything that contains the domain name itself. In this example, that is “tesla”. Notice how the results returned a totally new domain, suppliers.teslamotors.com.
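The three field-scope modes can be approximated with a few lines of hostname matching. This is my own rough illustration of the decision each mode makes, not Katana's actual implementation:

```python
def in_scope(host, target="tesla.com", mode="rdn"):
    """Approximate the three -fs modes (illustrative, not Katana's real code)."""
    if mode == "fqdn":          # exact (sub)domain only
        return host == target
    if mode == "rdn":           # root domain name plus all subdomains
        return host == target or host.endswith("." + target)
    if mode == "dn":            # domain-name keyword match
        keyword = target.split(".")[0]   # "tesla"
        return keyword in host
    raise ValueError(mode)

in_scope("shop.tesla.com", mode="rdn")               # True: subdomain of tesla.com
in_scope("shop.tesla.com", "www.tesla.com", "fqdn")  # False: not the exact host
in_scope("suppliers.teslamotors.com", mode="dn")     # True: contains "tesla"
```

The dn case shows why that mode can surface whole new domains such as suppliers.teslamotors.com: only the keyword has to match.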
The crawl-scope (-cs) flag works as a regex filter, only returning matching URLs. Look at what happens when filtering for “shop” on tesla.com. Only results with the word “shop” are returned.
Similarly, the crawl-out-scope (-cos) flag works as a filter that removes any URLs matching the regex given after the flag. Filtering for “shop” removes all URLs that contain the string “shop” from the output.
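The effect of both flags is that of an ordinary regex filter over the discovered URLs. A small sketch of the same logic (my own illustration, not Katana's code):

```python
import re

urls = [
    "https://www.tesla.com/shop/accessories",
    "https://www.tesla.com/careers",
    "https://shop.tesla.com/cart",
]

# -cs "shop": keep only URLs matching the in-scope regex
kept = [u for u in urls if re.search("shop", u)]

# -cos "shop": drop URLs matching the out-of-scope regex
dropped = [u for u in urls if not re.search("shop", u)]
```

Since both flags take a regex, you can be far more precise than a bare keyword, e.g. anchoring on a path prefix.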
Setting the no-scope flag will allow the crawler to start at the target and crawl the internet. Running katana -u https://tesla.com -ns will pick up other domains that are not on the beginning target site “tesla.com”, as the crawler will follow any links it finds.
Making Katana a crawler for you with configuration
Define the depth of your crawl. The higher the depth, the more recursive crawls you will get. Be aware this can lead to long crawl times against large web applications.
katana -u https://tesla.com -d 5
Enable parsing and crawling of endpoints found inside JavaScript files with the js-crawl flag:
katana -u https://tesla.com -jc
Set a predefined crawl duration and the crawler will return all URLs it finds in the specified time.
katana -u https://tesla.com -ct 2
Find and crawl any robots.txt or sitemap.xml files that are present. This functionality is turned off by default.
katana -u https://tesla.com -kf robotstxt,sitemapxml
Automatic form fill
Enables automatic form-filling for known and unknown fields. Known field values can be customized in the form config file (default location:
katana -u https://tesla.com -aff
Handling your output
The field flag is used to filter the output for the desired information you are searching for. ProjectDiscovery has been kind enough to provide a very detailed table of all the fields, with examples, in the documentation.
Look what happens when filtering the output of the crawl to only return URLs with query parameters in them:
katana -u https://tesla.com -f qurl
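Several of these fields are just different slices of the same parsed URL. As a rough reimplementation of a few of them (illustrative only, not Katana's code), using Python's standard URL parser:

```python
from urllib.parse import urlparse

def extract_fields(url):
    """Approximate a handful of -f fields from a single URL."""
    parsed = urlparse(url)
    has_query = bool(parsed.query)
    return {
        "fqdn": parsed.hostname,                # fully qualified domain name
        "path": parsed.path,
        # qurl: the whole URL, but only when it carries query parameters
        "qurl": url if has_query else None,
        # qpath: path plus query string
        "qpath": parsed.path + "?" + parsed.query if has_query else None,
    }

fields = extract_fields("https://www.tesla.com/search?q=model3")
```

Fields like key, value and kv then come from splitting that query string further.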
The store-field flag does the same thing as the field flag we just went over, except that it filters the output that is being stored in the file of your choice. It is awesome that they are split up. Between the store-field flag and the field flag above, you can make the data you see and the data you store different if needed.
katana -u https://tesla.com -sf key,fqdn,qurl
Extension-match & extension-filter
You can use the extension-match flag to only return URLs that end with your chosen extensions:
katana -u https://tesla.com -silent -em js,jsp,json
If you would rather filter out file extensions you DON’T want in the output, use the extension-filter flag:
katana -u https://tesla.com -silent -ef css,txt,md
Katana has a JSON flag that outputs results in JSONL format, including the source, tag, and attribute name related to the discovered endpoint.
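One line of JSON per result is easy to post-process with a few lines of Python. The record shape below is a hypothetical sample based on the fields the article names (endpoint, source, tag, attribute); check your Katana version's actual schema before relying on it:

```python
import json

# Hypothetical JSONL output - field names assumed from the article's
# description, not taken from a real Katana run.
raw = "\n".join([
    '{"endpoint": "https://www.tesla.com/static/app.js", '
    '"source": "https://www.tesla.com/", "tag": "script", "attribute": "src"}',
    '{"endpoint": "https://www.tesla.com/shop", '
    '"source": "https://www.tesla.com/", "tag": "a", "attribute": "href"}',
])

# Parse one record per line, then pull out only script endpoints
records = [json.loads(line) for line in raw.splitlines()]
script_endpoints = [r["endpoint"] for r in records if r["tag"] == "script"]
```

The same pattern works for feeding selected endpoints into the next tool in your pipeline.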
Rate limiting and delays
The delay flag allows you to set a delay (in seconds) between requests while crawling. This feature is turned off by default.
katana -u https://tesla.com -delay 20
The concurrency flag is used to set the number of URLs per target to fetch at a time. Notice that this flag is used along with the parallelism flag to create the total concurrency model.
katana -u https://tesla.com -c 20
The parallelism flag is used to set the number of targets to be processed at one time. If you only have one target, then there is no need to set this flag.
katana -u https://tesla.com -p 20
This flag allows you to set the maximum number of requests the crawler sends out per second:
katana -u https://tesla.com -rl 100
The rate-limit-minute flag is similar to the one above, but sets a maximum number of requests per minute:
katana -u https://tesla.com -rlm 500
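Per-second limits like these are commonly implemented as a token bucket: each request spends a token, and tokens refill at the configured rate. This is a generic sketch of that idea (my own illustration, not Katana's internals), with the clock passed in explicitly so the behavior is easy to trace:

```python
class RateLimiter:
    """Token-bucket sketch of a per-second rate limit."""
    def __init__(self, max_per_second):
        self.capacity = max_per_second
        self.tokens = max_per_second
        self.last = 0.0

    def allow(self, now):
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.capacity)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# At 2 requests/second: two quick requests pass, a third burst request
# is rejected, and a request a second later passes after refill.
limiter = RateLimiter(max_per_second=2)
results = [limiter.allow(t) for t in (0.0, 0.1, 0.2, 1.2)]
```

A per-minute limit works the same way with a larger bucket and slower refill, which is why the two flags can coexist.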
Chaining Katana with other ProjectDiscovery tools
Since katana can take input from STDIN, it is straightforward to chain katana with the other tools that ProjectDiscovery has released. A good example of this is:
subfinder -d tesla.com -silent | httpx -silent | katana
Hopefully, this has excited you to go out and crawl the planet. With all the options available, you should have no problem fitting this tool into your workflows. ProjectDiscovery has made this wonderful web crawler to cover many sore spots created by crawlers of the past. Katana makes crawling look like running!
Author – Gunnar Andrews, @g0lden1