Processing Adobe Analytics Data Feeds with Apache NiFi for Adobe Experience Platform
In the series of posts currently being released on this blog, I’m showing how companies can move from Adobe Analytics to the brand-new Customer Journey Analytics to utilize the many advantages of the new tool. However, I feel like the current Adobe-provided solution for bringing data from the old world to the new one lacks some essential information. I did an extensive comparison in the most recent post of the series, but will give some of the reasons here again.
When we use the Adobe Analytics Data Connector to bring data from an Adobe Analytics Report Suite into Experience Platform, we are dealing with some limitations:
- The data is based on what Adobe calls mid-values. Those sit between raw, unprocessed data and fully processed data in the processing chain. Because of this, we don’t have access to dimensions like persisted eVars, Visit Number, and other data points we take for granted with Adobe Analytics.
- We can only export up to 13 months of data. If you want to export more data (or want to start with one Report Suite today but wait a few months for the next one), you are out of luck.
In the post mentioned above, I describe an alternative using manual data processing with tools like the open-source Apache NiFi. Since many Adobe Analytics users might not know about this awesome, free tool, I wanted to write this post to introduce you to it with a concrete use case. Specifically, we are also going to cover exporting data out of Adobe Analytics and importing it into Experience Platform.
Step 0: Preparation
There are quite a few different ways to bring data from Adobe Analytics data feeds to other systems. Besides the usual cloud storage options like AWS S3 or Azure Blob Storage, we can keep things old-school and widely compatible by using FTP connections. One of the big advantages of that approach is that Adobe will be happy to provide you with a dedicated FTP account if you reach out to Client Care and ask them nicely. That is super cool, so we will just go with that!
On the FTP server, I typically like to use a folder structure like this:
In my aepdatafeeds folder, I have two subfolders: one for the raw data feed files (the rawfeeds folder) and one for the processed files for Experience Platform to ingest (the processedfeeds folder). That’s all we need to get going!
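Sketched out, that structure on the FTP server looks like this:

aepdatafeeds/
├── rawfeeds/          <- Adobe Analytics delivers the raw data feed files here
└── processedfeeds/    <- NiFi writes the processed files for Experience Platform here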
Step 1: Creating the Data Feed in Adobe Analytics
Now we need to create the actual data feed in the Admin interface. Head over to the Data Feeds section in the Analytics Admin menu and create a new feed:
For this post, we are going to use a pretty simple setup: a daily feed that is delivered as multiple files per day. Note how the path matches what we have created on the FTP server:
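As a rough sketch (the exact field names in the Admin UI may differ slightly), the relevant parts of my feed configuration look something like this:

Feed Interval:       Daily
Destination Type:    FTP
Path:                /aepdatafeeds/rawfeeds
Compression Format:  Gzip
Packaging Type:      Multiple Files
Manifest:            Manifest File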
With the data feed saved, Adobe Analytics will start delivering data to the FTP server. With the setup above, files like the ones shown below will arrive in the FTP account:
Awesome, that worked like a charm! Now we have the data file (.tsv.gz), the accompanying lookup file (.tar.gz), and the completion file (.txt). If you have never worked with those files before, let me explain why we need the manual processing: The data file contains only the actual data in a tab-separated format, without any headers. That is why we need to process it, since Experience Platform needs files with headers in CSV (comma-separated) format to work with them properly. The headers are delivered as one of the files in the lookup archive (column_headers.tsv), together with other lookups for browsers, devices, etc. If we extract that lookup archive, we get quite a list of lookup files:
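For a standard data feed, that lookup archive typically contains a set like this (the exact list depends on your feed configuration):

browser.tsv, browser_type.tsv, color_depth.tsv, column_headers.tsv,
connection_type.tsv, country.tsv, event.tsv, javascript_version.tsv,
languages.tsv, operating_systems.tsv, plugins.tsv, referrer_type.tsv,
resolution.tsv, search_engines.tsv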
Now that we have all the files ready, we can start with the main gig: Processing it all in NiFi!
Step 2: Processing data feeds with NiFi
Since I don’t expect everyone to know NiFi, I will be a bit more detailed here. If you haven’t already, you can download and install NiFi from https://nifi.apache.org. It’s a free and open-source Java application that should run virtually everywhere. Once you run it, you can access its very nice web interface:
I already went ahead and created a process group in this screenshot. With a process group, we can keep NiFi nice and tidy and group all processing steps into a single, high-level group. We can enter that group by double-clicking it.
In that group, we first need to get the file that indicates a completed transfer from the Adobe FTP server. To do that, we drag a new GetFTP processor onto the canvas:
Now we need to configure it. Double-click it and go to the Properties tab. In there, we give the processor all the details of our FTP server, matching the data feeds directory:
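As a sketch, the most important GetFTP properties end up something like this (the hostname and credentials are placeholders for your Adobe-provided FTP account, and the file filter makes sure we only pick up the manifest .txt files):

Hostname:            ftp.example.com (placeholder)
Username / Password: your FTP credentials
Remote Path:         /aepdatafeeds/rawfeeds
File Filter Regex:   .*\.txt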
After all is configured, we just need to connect the GetFTP processor to the next one. In our case, that is an ExtractText processor. We drag it onto the canvas, connect it to the GetFTP processor, and can now start the GetFTP processor. Once we do that, we will see the files showing up in the processing queue:
With the ExtractText processor, we can extract information from the file we are processing to work with it later. From the .txt file, we are interested in two pieces of information: the name of the data file and the name of the lookup file. As a refresher, here is what that file looks like:
Datafeed-Manifest-Version: 1.0
Lookup-Files: 1
Data-Files: 1
Total-Records: 14
Lookup-File: filename_date-lookup_data.tar.gz
MD5-Digest: 62fd74e46c10a9bdac302376a3a00820
File-Size: 33502558
Data-File: 01-filename_date.tsv.gz
MD5-Digest: b6033da2813d3ee846113bd70866cc0f
File-Size: 1746
Record-Count: 14
The ExtractText processor lets us define regular expressions for the data we want to extract. To do that, we add two new properties by hitting the plus icon and defining the regex to fetch the file names:
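Based on the manifest format above, the two properties could look like this (the attribute names lookup.file and data.file are simply my choice):

lookup.file    Lookup-File: (.*\.tar\.gz)
data.file      Data-File: (.*\.tsv\.gz)

ExtractText will then place the first capture group of each expression into the lookup.file and data.file attributes (plus indexed variants like lookup.file.1), which we can reference in the following processors.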
With that information extracted, we can now fetch those two files from the FTP server with the FetchFTP processor. We need to do it that way to be sure the files are completely uploaded by Adobe. This is guaranteed since the .txt file is always uploaded last. For simplicity, we are going to use two FetchFTP processors, one for the lookup and one for the data file:
In the properties, we can then define which file we want to get from the FTP server. We use the information we have extracted in the step before, using either the lookup or data file name in the processor config:
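Assuming the attribute names from the ExtractText sketch above, the Remote File property of the two FetchFTP processors could look like this (hostname and credentials again match our FTP account):

FetchFTP (lookup file)   Remote File: /aepdatafeeds/rawfeeds/${lookup.file}
FetchFTP (data file)     Remote File: /aepdatafeeds/rawfeeds/${data.file}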
Once both FetchFTP processors are configured, you can funnel their output files into a CompressContent processor to decompress their content:
For both processors, we configure them to decompress the data like this:
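For reference, the relevant CompressContent properties are roughly:

Mode:                decompress
Compression Format:  gzip (or "use mime.type attribute" if you want NiFi to detect the format)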
On the branch for the lookup files, we now need to extract the files that are packed together, using an UnpackContent processor with the Tar packaging format. You can see how we now have 14 files in the resulting queue instead of only one:
Everything up to this point should be quite clear: We downloaded the files from the FTP server, decompressed them, and extracted the content of the lookup file archive. Now we need to take care of the main problem, which is combining the header file with the actual data file. To do this, we need to first identify the header file in the queue of 14 files from the lookup archive. We can do this by inserting a RouteOnAttribute processor next:
The configuration of that processor is quite simple. We need to add a property that evaluates to true if a file should be routed to the outgoing connection. In our case, we are looking at the file’s name, since that will always be column_headers.tsv:
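With NiFi’s Expression Language, such a property could look like this (I call the property headers here, and the Routing Strategy stays at the default "Route to Property name"):

headers    ${filename:equals('column_headers.tsv')}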
Now we need two types of processors to join the header and data files together. For the actual merge, we are going to use the MergeContent processor with a defragment merge strategy:
But there is one little thing we need to do first. MergeContent with this configuration will try to recombine files that have been split before. To know which files should go where and how many it needs to merge, we first need to add this information to our files using the UpdateAttribute processors. For the header file, we need a fragment.index of 0 (as below), and an index of 1 for the data file:
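Since the defragment strategy also expects a fragment identifier and a fragment count on every piece, a sketch of the two UpdateAttribute processors could look like this (using the data file name as the shared identifier is just my choice, any value works as long as it is identical on both files):

UpdateAttribute (header file):
  fragment.identifier:  ${data.file}
  fragment.index:       0
  fragment.count:       2

UpdateAttribute (data file):
  fragment.identifier:  ${data.file}
  fragment.index:       1
  fragment.count:       2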
To give you a bit of an overview of what is happening here, below is what the flow looks like to merge the data together. On the top route, we are decompressing, unpacking, and filtering the lookup files to then only process the header. On the bottom route, we are decompressing and enriching the data file to then merge both together in the bottom right:
That is quite sophisticated, and it gives us one nicely merged file with all column headers and the data.
We also need to give the rest of the lookup files a proper header. For that, we use one more MergeContent processor, but this time with a different configuration. This processor is pretty cool for this use case since we can just manually define the header we want to use. In my case, I want to name the columns “ID” and “Value” as below:
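A sketch of that configuration, assuming the Text delimiter strategy and a bin size of one file so every lookup file simply gets the header prepended (note that the header has to be tab-separated here, since the lookup files are still TSV at this point):

Merge Strategy:             Bin-Packing Algorithm
Minimum Number of Entries:  1
Maximum Number of Entries:  1
Delimiter Strategy:         Text
Header:                     ID<tab>Value (with a trailing line break, so the data starts on its own line)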
Since we are cleaning things up already, let’s define a better filename for the data file that we merged in the step before. Luckily we can just use a part of the original file name:
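Assuming the original data file name is still available in the data.file attribute from the ExtractText step, a simple UpdateAttribute rule for this could be:

filename:  ${data.file:substringBefore('.gz')}

which would turn something like 01-filename_date.tsv.gz into 01-filename_date.tsv.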
We are almost done! The only big thing left is to convert the tab-separated files into comma-separated CSV files. To do that, we can use the ConvertRecord processor (in the bottom right of the screenshot):
In that processor, we need to define record formats for both input and output. We use CSV services for both but indicate the difference in the name:
We don’t need to change anything on the CSV writer, but the TSVReader needs to be switched to a Tab Separated format:
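Roughly, the two controller services end up configured like this (the service names are my own):

TSVReader (a CSVReader service):
  CSV Format:                  Tab-Delimited
  Treat First Line as Header:  true
  Schema Access Strategy:      Use String Fields From Header

CSVWriter (a CSVRecordSetWriter service):
  default settings (comma-separated output with a header line)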
Once the conversion is done, we clean up the filename to replace the .tsv with a .csv extension. Then we are ready to upload the files again, for which we use a PutFTP processor:
Remember the other directory we created in step 0? Now it’s time to use it in the PutFTP processor!
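Put together, the last two steps are just a small UpdateAttribute rule for the rename and the destination path in PutFTP (hostname and credentials again match our FTP account):

UpdateAttribute:
  filename:     ${filename:replace('.tsv', '.csv')}

PutFTP:
  Remote Path:  /aepdatafeeds/processedfeeds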
Once we start this whole flow, we can see the data coming in to our destination folder on the FTP server. With our configuration above, the lookup files will always be overwritten when a new version comes in, which is super handy. In production, we may want to have the data files in a dedicated folder, but for now we are good:
If you want a (literally) high-level overview, this is what our complete NiFi flow looks like now:
With NiFi, we are able to pull the data from the FTP server, merge the headers and the data file together, convert everything from TSV to CSV, give the files proper names, and re-upload them, all in an easy-to-understand graphical interface. I hope you know now why I like NiFi so much!
Step 3: Pull data into Experience Platform
Now that we have the data in a good format on our FTP server, it’s time to pull it all into Experience Platform. In the Sources section of Platform we can select the FTP option:
That will open the wizard where we first connect to the FTP server:
On the next screen, we can select which of our files (or folders) we want to import and check if the file can be read as we expect it:
After that is configured, we can go to the next step. In there, we have to map the columns in our file to the XDM schema fields of an existing or new data set. Here is where we define where the data should actually go:
Once you are done with the mapping, you can complete the wizard and schedule the import. That’s all we need!
Conclusion
This post was, at least for me, refreshingly in-depth and nerdy. I hope I could show why I really like processing data with NiFi, since both the capabilities and the user interface are very, very good. A process like pulling data from an FTP server, processing it in a completely custom way, and storing it again is super straightforward. We could even mix and match other sources and destinations like AWS, GCP, or Azure along the way. For a production setup, we would likely not reference single files in Platform but whole folders, which means we would need to separate data files from lookup files in NiFi and upload them to dedicated folders. I trust you would know how to do that by now, since we’ve done similar things above!
Let me know if you found this post useful. See you on the next one!
