Importing Organic Google Search data to Adobe Analytics with a single script

Frederik Werner | Posted on May 30, 2020 |

Some time ago, I published an article explaining how to get Google Search performance data from the Google Search Console to Adobe Analytics. For that post, I explained to query the Google Search API, write the result to an Adobe Analytics Data Source file, and upload it to Adobe Analytics. The same can be achieved in a more automated way using the Adobe Analytics Data Sources API, which is what this article is about. It explains how to use a script I published on Github. If this feels to advanced, feel free to go back to the old article.

So, why another article about this topic? The old post received a lot of attention and led to some companies adopting the methods I described there. But if you try to implement it in a production environment, you would need to take care of some things yourself. For example, you need to create a cronjob to dynamically feed the current date to the script. After that, you need to upload the result file to Adobe Analytics. Both tasks can be automated, but there is some reliance on external tools. On the other hand, it allows a novice user to understand the mechanisms behind a Data Source in Adobe Analytics. After using it, you will understand what happens when you import data to Analytics much better.

For production environments, some more configurability and less reliance on other tools would be nice. Ideally, a cronjob could invoke a single script that handles everything itself. If it doesn’t need to create additional files we could deploy it on a simple serverless environment. As a bonus, it should keep track of which days have already been imported, so we don’t double count any performance data.

Such a script is what I created and published on Github. This article describes how to use it and how it works on the inside. We will start with the configuration (which might be everything you are looking for) and then dive into what the script actually does.

Modules & Configuration

Since this script is written in Python, we start with the modules we need. If you don’t have it already, you will need to install PyJWT and the Google Python Client modules:

import datetime
import requests
import sys
import jwt
import re
import httplib2

from apiclient.discovery import build
from apiclient.errors import HttpError
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client.tools import argparser, run_flow

More relevant to the normal user is what comes next. This is where we put the configuration, so adjust this section as needed. Here is the block of code, see below for explainations:

config = {
"apiKey":"3ec159485be87ed8fk6f9g37j79d67153b31e6",
"technicalAccountId":"6JCD048F50A6495F35C8D9D4D2@techacct.adobe.com",
"orgId":"25DB24210614E744C980A8A7@AdobeOrg",
"secret":"d033109-fd7a71ba2-489-9cf455-f2f87f4298ab",
"google_property":"https://www.website.com/",
"data_source_name": "Google Search Console Import",
"report_suite_id": 'orgreportsuiteid',
"job_prefix": "GSC-Import",
"lookback_days": 100,
"type_evar":"197",
"url_evar":"198",
"keyword_evar":"199",
"ctr_event":"997",
"clicks_event":"998",
"impressions_event":"999",
"position_event":"1000",
"key":b'-----BEGIN PRIVATE KEY-----\nMIIEvAIBADAN7wGu1P3aNA3yjqGA==\n-----END PRIVATE KEY-----',

"metascopes":"ent_analytics_bulk_ingest_sdk",
"imsHost":"ims-na1.adobelogin.com",
"imsExchange":"https://ims-na1.adobelogin.com/ims/exchange/jwt",
"discoveryUrl":"https://analytics.adobe.io/discovery/me"
}

So, let’s dive in. Line 2-5 must be filled with the client data you create in the Adobe IO console. It is used to connect to the Adobe Analytics API. At the bottom of the config block, put in your private key to the Adobe IO account at line 18.
Line 6 holds the property from the Google Search Console. This is the property that your data will be imported from. Usually, it is equal to your website URL.
Line 7 contains the name of the Data Source you create in Adobe Analytics. Make sure this value matches the name exactly. In line 8 we define the Report Suite ID where we want our data to be imported. To better recognize your imported files, Line 9 demands a prefix of the imported filenames.
The lookback window we define in Line 10 limits how far back the imported data should be. This value is in days and starts with today. The script keeps track of what data has been imported already, so a long window only leads to a longer import on the first run.

Line 11 trough 17 are most important for what data gets actually imported. This is where we define which Dimensions and Events will be used. If you don’t want to use a certain dimension, you can leave the value empty. At minimum, we need the type eVar and one Event. The script determines what should be imported by the variables we define. Some examples:

#Only import the Query type to eVar 197 with Clicks in Event 998 and Impressions in Event 999:
"type_evar":"197",
"clicks_event":"998",
"impressions_event":"999"

#Import all available data to the mentioned variables:
"type_evar":"197",
"url_evar":"198",
"keyword_evar":"199",
"ctr_event":"997",
"clicks_event":"998",
"impressions_event":"999",
"position_event":"1000"

#Import Query Type and Keywords together with Clicks, Impressions and Click-trough-rate:
"type_evar":"197",
"keyword_evar":"199",
"ctr_event":"997",
"clicks_event":"998",
"impressions_event":"999"

Connecting to the APIs

To differentiate between import types, the type eVar describes what was imported in the current query. In the next section of the code, we create a list of dates to query and check if the configuration regarding the dimensions and events is valid. If not, the script exits:

base = datetime.datetime.today()
date_list = [(base - datetime.timedelta(days=x)).strftime("%Y-%m-%d") for x in range(config["lookback_days"])]

if config["clicks_event"] or config["impressions_event"] or config["position_event"] or config["ctr_event"]:
    if config["url_evar"] and config["keyword_evar"]:
        print("Both URL and Keyword eVar given. Importing Keywords per URL...")
        operating_mode = "URL and Keyword"
        query_dimensions = ['date','page','query']
        datasource_columns = ['Date', 'Evar '+config["type_evar"], 'Evar '+config["url_evar"], 'Evar '+config["keyword_evar"]]
    elif config["keyword_evar"]:
        print("Only Keyword eVar given. Importing only Keywords...")
        operating_mode = "Keyword Only"
        query_dimensions = ['date','query']
        datasource_columns = ['Date', 'Evar '+config["type_evar"], 'Evar '+config["keyword_evar"]]
    elif config["url_evar"]:
        print("Only URL eVar given. Importing only URLs...")
        operating_mode = "URL Only"
        query_dimensions = ['date','page']
        datasource_columns = ['Date', 'Evar '+config["type_evar"], 'Evar '+config["url_evar"]]
    else:
        print("No eVars given. Importing metrics only")
        operating_mode = "Metrics Only"
        query_dimensions = ['date']
        datasource_columns = ['Date', 'Evar '+config["type_evar"]]
    if config["clicks_event"]:
        datasource_columns.append("Event "+config["clicks_event"])
    if config["impressions_event"]:
        datasource_columns.append("Event "+config["impressions_event"])
    if config["position_event"]:
        datasource_columns.append("Event "+config["position_event"])
    if config["ctr_event"]:
        datasource_columns.append("Event "+config["ctr_event"])
else:
    print("No events given. Aborting...")
    sys.exit()

if not config["type_evar"]:
    print("No Type Evar given. Aborting...")
    sys.exit()

Next, we need to connect to the different APIs. The Google APIs expect the client_credentials.json from the Google API Console. The next blocks contain the code to connect to the Adobe APIs through Oauth:

def get_authenticated_google_service():
    flow = flow_from_clientsecrets("client_secrets.json", scope="https://www.googleapis.com/auth/webmasters.readonly",
    message="MISSING_CLIENT_SECRETS_MESSAGE")
    storage = Storage("oauth2.json")
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        credentials = run_flow(flow, storage)
    return build("webmasters", "v3", http=credentials.authorize(httplib2.Http()))

search_console = get_authenticated_google_service()

def get_jwt_token(config):
    return jwt.encode({
        "exp": datetime.datetime.utcnow() + datetime.timedelta(seconds=30),
        "iss": config["orgId"],
        "sub": config["technicalAccountId"],
        "https://{}/s/{}".format(config["imsHost"], config["metascopes"]): True,
        "aud": "https://{}/c/{}".format(config["imsHost"], config["apiKey"])
    }, config["key"], algorithm='RS256')

def get_access_token(config, jwt_token):
    post_body = {
        "client_id": config["apiKey"],
        "client_secret": config["secret"],
        "jwt_token": jwt_token
    }

    response = requests.post(config["imsExchange"], data=post_body)
    return response.json()["access_token"]

def get_first_global_company_id(config, access_token):
    response = requests.get(
        config["discoveryUrl"],
        headers={
            "Authorization": "Bearer {}".format(access_token),
            "x-api-key": config["apiKey"]
        }
    )
    return response.json().get("imsOrgs")[0].get("companies")[0].get("globalCompanyId")

jwt_token = get_jwt_token(config)
access_token = get_access_token(config, jwt_token)
global_company_id = get_first_global_company_id(config, access_token)

Receiving completed Data Source import jobs from Adobe Analytics

Now that we are connected and have our Access Token, we can query the Adobe Analytics API to receive a list of Data Sources for the configured Report Suite. If it matches the name we have chosen in the configuration we continue with the Data Source ID:

dataSources = requests.post(
        "https://api.omniture.com/admin/1.4/rest/?method=DataSources.Get",
        headers={
            "Authorization": "Bearer {}".format(access_token),
            "x-api-key": config["apiKey"],
            "x-proxy-global-company-id": global_company_id
        }, 
        data={'reportSuiteID':config["report_suite_id"]}
    ).json()

for dataSource in dataSources:
    if dataSource["name"] == config["data_source_name"]:
        dataSourceID = dataSource["id"]
        print("Found Data Source ID")
        break

If we find the Data Source we are looking for, we get a list of the completed import jobs for this Data Source. This is how we keep track of what data has been imported already, because we tightly define how the jobs should be named and have them include the date of the imported data. If we find completed jobs we remove those dates from the list of dates we want to import:

if dataSourceID:
    jobs = requests.post(
        "https://api.omniture.com/admin/1.4/rest/?method=DataSources.GetJobs",
        headers={
            "Authorization": "Bearer {}".format(access_token),
            "x-api-key": config["apiKey"],
            "x-proxy-global-company-id": global_company_id
        }, 
        data={'reportSuiteID':config["report_suite_id"],'dataSourceID':dataSourceID}
    ).json()
    for job in jobs:
        jobname = job["fileName"]
        if config["job_prefix"].lower() in jobname:
            matchstring = '^'+re.escape(config["job_prefix"].lower())+"_"+operating_mode.lower()+'_([0-9]{4}-[0-9]{2}-[0-9]{2})_'+config["report_suite_id"]+'_'+dataSourceID+'_[0-9]*\.tab$'
            p = re.compile(matchstring)
            regex_match = p.match(job["fileName"])
            if regex_match and job["status"] != "failed":
                jobdate = regex_match.group(1)
                if jobdate in date_list:
                    date_list.remove(jobdate)
else:
    print("Data Source not found. Please check your configured Data Source name.")
    sys.exit()

Query Search Console data and upload to Adobe Analytics

Now to the main gig. Now that we know what dates we need to import, we can start pulling data from Google day by day and create an import jobs for the Adobe Analytics Data Source:

print("Number of days to fetch: "+str(len(date_list)))
i = 1
for query_date in date_list:
    print("Fetching Google Search Console Data for "+query_date+". Query "+str(i)+"/"+str(len(date_list)))
    request = {
        'startDate': query_date,
        'endDate': query_date,
        'dimensions': query_dimensions,
        'rowLimit': 10000
    }
    result_rows = []
    result = search_console.searchanalytics().query(siteUrl=config["google_property"], body=request).execute()
    if "rows" in result:
        print("Received "+str(len(result["rows"]))+" rows of data. Uploading to Adobe...")
        for row in result["rows"]:
            row_to_append = []
            row_to_append.append(row["keys"][0][5:7]+"/"+row["keys"][0][8:10]+"/"+row["keys"][0][0:4]+"/00/00/00")
            row_to_append.append("Import Type: "+operating_mode)

            if operating_mode != "Metrics Only":
                row_to_append.append(row["keys"][1])
            if operating_mode == "URL and Keyword":
                row_to_append.append(row["keys"][2])

            if config["clicks_event"]:
                row_to_append.append(str(row["clicks"]))
            if config["impressions_event"]:
                row_to_append.append(str(row["impressions"]))
            if config["position_event"]:
                row_to_append.append(str(row["position"]))
            if config["ctr_event"]:
                row_to_append.append(str(row["ctr"]))
            result_rows.append(row_to_append)

        if len(result_rows)  > 0:
            jobresponse = requests.post(
                "https://api.omniture.com/admin/1.4/rest/?method=DataSources.UploadData",
                headers={
                    "Authorization": "Bearer {}".format(access_token),
                    "x-api-key": config["apiKey"],
                    "x-proxy-global-company-id": global_company_id
                }, 
                json={
                    "columns": datasource_columns,
                    'reportSuiteID': config["report_suite_id"],
                    'dataSourceID':dataSourceID,
                    "finished": True,
                    "jobName": config["job_prefix"]+"_"+operating_mode+"_"+query_date,
                    "rows": result_rows
                }
            )
    else:
        print("No Data for "+query_date)
    i+=1

We are “only” importing 10,000 lines of data per day to not stress the Adobe Analytics API too much. If you want to contribute to this little script, you could come up with some iteration to import more than 10,000 rows per imported day.

That’s it! Go ahead and build something amazing with this, the complete script is on Github. Feel free to adapt it to other tools as well and maybe think about sharing your work as well ?

Frederik Werner

German Analyst and Data Scientist working in and writing about (Web) Analytics and Online Marketing Tech.

2021 – current Adobe Analytics Champion

EMEA Adobe Analytics User Group Lead

Adobe Analytics Community Advisor

Modules & Configuration

Connecting to the APIs

Receiving completed Data Source import jobs from Adobe Analytics

Query Search Console data and upload to Adobe Analytics

Other cool posts: