Keyword analysis is an essential part of driving traffic to your website, and it's something every content creator and SEO professional should master. Luckily, there are some great tools out there, and you can even build your own using the Python programming language. Here's how to do it.
This guide assumes you have Python installed and know how to run scripts – if you're new, check out our guide to getting started with Python.
I'll explain each part of the script and provide the complete code at the end so you can copy and paste it.
If you don't already have them installed, you'll need the beautifulsoup4, requests, and nltk libraries. Install them by running the following command in the command line:
pip install beautifulsoup4 requests nltk
Now, let's get to the script. First, we import nltk and download the tokenizer and stop word data the script needs:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
Next, we'll use the Requests library to retrieve content from a web page. The fetch_content function takes a URL as input, sends an HTTP GET request to that URL, and returns the HTML content of the page.
import requests

def fetch_content(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to retrieve content from {url}")
        return ""
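If you want to sanity-check this step on its own, you can call the function directly; the URL below is just a placeholder:

test_html = fetch_content("https://example.com")  # Placeholder URL
print(test_html[:200])  # Preview the first 200 characters of the HTML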
The text is then cleaned up by removing punctuation and excess whitespace, and tokenized for further analysis and processing.
import re
from nltk.tokenize import word_tokenize

def clean_and_tokenize(text):
    text = re.sub(r'\s+', ' ', text)  # Remove excess whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = word_tokenize(text.lower())
    return tokens
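As a quick illustration of what comes out (assuming the NLTK “punkt” data downloaded earlier is available):

tokens = clean_and_tokenize("Hello, World!  How are you?")
print(tokens)  # ['hello', 'world', 'how', 'are', 'you']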
The next step is to filter out stop words: common words like “the”, “is”, and “in” that carry little meaning of their own but can skew your keyword analysis.
from nltk.corpus import stopwords

def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens
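Continuing the example above, NLTK's English stop word list includes words such as “how”, “are”, and “you”, so only the meaningful tokens survive:

print(remove_stopwords(['hello', 'world', 'how', 'are', 'you']))
# ['hello', 'world']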
Next, we calculate the frequency and density of each keyword. Keyword density is the number of times a keyword appears in a text, expressed as a percentage of the total word count.
from collections import Counter

def analyze_keywords(tokens):
    counter = Counter(tokens)
    total_words = sum(counter.values())
    keyword_density = {word: (count / total_words) * 100 for word, count in counter.items()}
    return keyword_density
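To see the math on a toy example: in a list of four tokens where “seo” appears twice, “seo” has a density of 2 ÷ 4 = 50%:

density = analyze_keywords(['seo', 'python', 'seo', 'keywords'])
print(density)  # {'seo': 50.0, 'python': 25.0, 'keywords': 25.0}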
Once the content has been analyzed, the script formats the results into a report and prints it to the screen.
def generate_report(url, keyword_density):
    sorted_keywords = sorted(keyword_density.items(), key=lambda item: item[1], reverse=True)
    report = f"Keyword density report for {url}\n"
    report += "-" * 50 + "\n"
    for keyword, density in sorted_keywords[:10]:  # Display the top 10 keywords
        report += f"Keyword: {keyword}, Density: {density:.2f}%\n"
    return report
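Passing the toy data from the density example above to generate_report (with a placeholder URL) produces output along these lines:

print(generate_report("https://example.com", density))
# Keyword density report for https://example.com
# --------------------------------------------------
# Keyword: seo, Density: 50.00%
# Keyword: python, Density: 25.00%
# Keyword: keywords, Density: 25.00%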
Once all the individual functions are in place, you can tie them together in a main function that takes a URL as input and prints the report when the script is run. The main function also uses BeautifulSoup to strip the HTML tags and extract the page's visible text, so it needs one more import.
from bs4 import BeautifulSoup

def main():
    url = input("Enter the URL of the webpage: ")
    html_content = fetch_content(url)
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        text = soup.get_text()
        tokens = clean_and_tokenize(text)
        filtered_tokens = remove_stopwords(tokens)
        keyword_density = analyze_keywords(filtered_tokens)
        report = generate_report(url, keyword_density)
        print(report)

if __name__ == "__main__":
    main()
Here's the complete code for you to copy and paste:
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import nltk
import re

# Download the required NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Get the webpage content
def fetch_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError for invalid responses (4xx and 5xx)
        return response.text
    except requests.exceptions.HTTPError as http_err:
        print(f"An HTTP error occurred: {http_err}")
    except requests.exceptions.ConnectionError as conn_err:
        print(f"Connection error: {conn_err}")
    except requests.exceptions.Timeout as timeout_err:
        print(f"A timeout error occurred: {timeout_err}")
    except requests.exceptions.RequestException as req_err:
        print(f"An error occurred: {req_err}")
    return ""  # Return an empty string on any error

# Clean and tokenize the text
def clean_and_tokenize(text):
    text = re.sub(r'\s+', ' ', text)  # Remove excess whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert text to lowercase
    tokens = word_tokenize(text)
    return tokens

# Remove stop words from the tokens
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

# Analyze keyword density
def analyze_keywords(tokens):
    counter = Counter(tokens)
    total_words = sum(counter.values())
    keyword_density = {word: (count / total_words) * 100 for word, count in counter.items()}
    return keyword_density

# Generate a keyword density report
def generate_report(url, keyword_density):
    sorted_keywords = sorted(keyword_density.items(), key=lambda item: item[1], reverse=True)
    report = f"Keyword density report for {url}\n"
    report += "-" * 50 + "\n"
    for keyword, density in sorted_keywords[:10]:  # Display the top 10 keywords
        report += f"Keyword: {keyword}, Density: {density:.2f}%\n"
    return report

# Main function
def main():
    url = input("Enter the URL of the webpage: ")
    html_content = fetch_content(url)
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        text = soup.get_text()
        tokens = clean_and_tokenize(text)
        filtered_tokens = remove_stopwords(tokens)
        keyword_density = analyze_keywords(filtered_tokens)
        report = generate_report(url, keyword_density)
        print(report)
    input("Press Enter to exit...")

if __name__ == "__main__":
    main()
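Save the script under any name you like (keyword_analyzer.py is just an example), then run it from the command line and paste in a URL when prompted:

python keyword_analyzer.py  # The filename here is just an example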
For more useful Python scripts, and to share your comments and questions, follow GeekSided.