XML Importing Tutorial

How to import from another job board (Advanced)

Author: Adam M. / Jamit Software
Date: Jun 25th, 2009

Introduction

Here I will show you how I set up my Jamit Job Board to import jobs from another job board.
I assume that you, the reader of this tutorial, have experience with XML files and PHP programming and are comfortable with the technical details given here.

If you want to display jobs from other job boards and you are looking for an easier way, please try the Indeed XML plugin, which is available from http://www.jamit.com/plugins.htm

1. XML Feed Analysis

Is it compliant?

I begin by obtaining a sample XML file from the other job board and looking at what fields it gives me and how they are structured. Typically, the importing tool that is bundled with Jamit can import almost any type of XML feed, as long as it stays within the following limitations:

- Sequence element: The built-in importing tool in Jamit cannot read XML files with multiple sequence elements. It can only import files with one sequence element. Sequence elements are elements that are repeated in the feed; the importer loops through them and imports one record at a time. An example of a sequence element is a job element, which is repeated throughout the XML feed, so the jobs are imported one by one in sequence.

- Attributes: The built-in importing tool in Jamit cannot import data from attributes, and discards this data. Data can only be imported from between an element's opening and closing tags, e.g. <something attribute="ignored blah blah">This is the data that will be imported</something>

- Check-boxes and multiple-select fields assume that multiple selections are comma delimited. For example, if you have a 'skills' field which allows multiple selections, then you can import a value such as 'php,javascript,mysql'. When importing in to a check-box or multiple-select field, the value is split using the comma as the delimiter and each part is processed individually.

- Categories cannot be multiple selected. This means that the category elements in your feed can only contain a single value.

- Feed encoding must be UTF-8.

Most feeds are compliant and should work, including RSS feeds, which are really just simple XML files. In the future we may expand the functionality of our import tool to remove these limitations, although this may cost us some simplicity. In the meantime, a workaround can be developed for non-compliant feeds, and here I will show you one.
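As a rough illustration of these limitations, the following PHP sketch scans a raw feed before you try to import it. It is not part of Jamit: the function names check_feed_limitations() and split_multi_value() are hypothetical, the attribute test is a simple regex that only recognises double-quoted attributes, and the splitting function only mirrors the comma rule described above rather than reproducing the importer's own code.

```php
<?php
// Hypothetical helper (not part of Jamit): reports which of the
// limitations listed above a raw feed is likely to run in to.
function check_feed_limitations($xml)
{
    $warnings = array();

    // preg_match() with the /u modifier returns false when the
    // subject is not valid UTF-8.
    if (preg_match('//u', $xml) === false) {
        $warnings[] = 'Feed is not valid UTF-8';
    }

    // Count opening tags that carry at least one double-quoted attribute.
    // The XML declaration is skipped because an element name cannot
    // start with a question mark.
    if (preg_match_all('/<[A-Za-z_][\w:.-]*\s+[\w:.-]+\s*=\s*"[^"]*"/', $xml, $m)) {
        $warnings[] = count($m[0]) . ' element(s) carry attributes';
    }

    return $warnings;
}

// Mirrors the comma-delimited rule for check-box and multiple-select fields.
function split_multi_value($value)
{
    return explode(',', $value);
}
```

For example, check_feed_limitations(file_get_contents('feed.xml')) returns an empty array when neither problem is detected. Multiple sequence elements are not detected by this sketch; you still need to look at the feed structure for those.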

Do I have all the fields that I need?

To import a job, the crucial fields that I require are:
- Job Title
- Description / snippet of the description
- Date
- Link to original job posting
- Location
(That's pretty much the data that is available in an RSS file, except for the location.)

Additionally, it would be great if I could also have these:
- Application URL
- Company

Finally, any other information helps too, including the account information.

First look at the sample

OK, now let's see what I received from the other job board:

(I've taken a screenshot of the raw feed which I received, and highlighted some sections which I will discuss below)

This feed contains all the data that I need, but I cannot work with it yet because there are some incompatibilities:

- The vertical line that I've drawn in green is what I refer to as a 'Sequence Element'. In this case the element's name is job. This is the element which contains all the job sub-elements and the importer will loop through a batch of them in sequence to import all the jobs. Perfect!

- The first line that I've highlighted in yellow shows a Client element. I need the client ID because I can then use it to import jobs under that ID in the job board. However, it looks like my luck ran out: the data is placed in an attribute, 'ID', and the Jamit importer ignores attributes.

- The next two lines that I've highlighted in yellow show an Item element which is repeated. If an element is repeated, it becomes a sequence element. Unfortunately, the Jamit importer can work with only one sequence element per feed, and there is already a job sequence element in this feed. It will also ignore the 'Name' attributes.

So this means that I cannot import this feed directly using the built-in importer. I will need to write some custom code to massage the XML feed a bit.

If your feed does not have the problems highlighted above, excellent! You can skip the workaround step.

2. Workaround

I'm going to prepare a custom PHP script that will process the feed and massage it to the proper format that I want. I'm going to use PHP's regular expression functions (regex) to manipulate the text.

Changing the attributes in to an element

The client id would be useful in the feed because it means that I can import the feed under different usernames. Because it's an attribute, I need to get the value out of the attribute and put it in as data inside an element. I use this regex to extract the client id from the Client element:

/<Client\s+ID=\"([^"]+?)\"/i

(\s+ means one or more white-space characters; ([^"]+?) means capture the characters in to a variable, stopping at the next double quote.)

For example

<client ID="1234567">

The above regex will match this line (the match is case insensitive) and capture 1234567.
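To see the capture in action, here is a minimal sketch that runs the same pattern against that sample line with preg_match(). The sample value and the <client_id> element name are only illustrations; use whatever element name suits your own feed.

```php
<?php
// Sample line in the shape described above; the ID value is made up.
$line = '<client ID="1234567">';

// Same pattern as above. The /i modifier makes the match case insensitive,
// so <client ...> and <Client ...> are both accepted.
if (preg_match('/<Client\s+ID="([^"]+?)"/i', $line, $m)) {
    // $m[0] holds the whole match, $m[1] holds the first captured group.
    $client_id_elem = '<client_id>' . $m[1] . '</client_id>';
} else {
    $client_id_elem = '';
}

echo $client_id_elem, "\n"; // prints <client_id>1234567</client_id>
```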

The other attributes I need to convert are Reference and MarketSegments, and I use a similar regex for these.

Changing the sequence elements to unique elements

I need to change all the Item element names so that they are unique. I notice that they all have a unique attribute called Name, so I can use this in my regex. This is the regex that I came up with for preg_replace(Match, Replace, String):

Match:
/<Item\s+Name=\"([^"]+?)\"\s*?>(.*?)<\/Item>/i

(This means capture the Name attribute value and put it in to a variable $1, and also capture the data stored in the element and put it in to a variable $2. \s* means match any white-space, and adding a ? makes the match 'not greedy'.)

Replace:
<Item_$1>$2</Item_$1>

($1 and $2 are the variables which were captured by the above regex.)

The code

Here is the PHP script that I came up with. It reads the original feed from a file called 'feed.xml' line by line, and processes each line with the regex functions. In this example the output is sent to the browser, so I also added a header() call to tell the browser what type of output it generates. I also use the utf8_decode() and utf8_encode() functions because the feed file is UTF-8 encoded.

<?php
header('Content-type: application/xml; charset=UTF-8');

$fp = fopen('feed.xml', 'r');
$client_id_elem = ''; // stays empty until a Client element has been seen

while (($line = fgets($fp, 8192)) !== false) {
    $line = utf8_decode($line);

    // copy the client id:
    if (preg_match('/<Client\s+ID=\"([^"]+?)\"/i', $line, $m)) {
        $client_id_elem = '<client_id>'.$m[1].'</client_id>';
    }

    // put the Reference attribute inside the job element
    // this also detects the start of the job element
    // so we can put in a bunch of additional elements that we need
    if (preg_match('/<Job\s+Reference=\"([^"]+?)\"/i', $line, $m)) {
        // put the reference in to the feed as an element
        $line .= '<reference>'.$m[1]."</reference>\n";
        // put in client_id element
        $line .= $client_id_elem."\n";
    }

    // copy the MarketSegments attribute and put it inside the job element
    if (preg_match('/<Listing\s+MarketSegments=\"([^"]+?)\"/i', $line, $m)) {
        $ref = '<MarketSegments>'.$m[1].'</MarketSegments>';
        // paste it in to the feed as an element
        $line .= $ref."\n";
    }

    // replace the <Item> elements to make the element names unique,
    // e.g. <Item_JOBTITLE></Item_JOBTITLE>
    $line = preg_replace('/<Item\s+Name=\"([^"]+?)\"\s*?>(.*?)<\/Item>/i', '<Item_$1>$2</Item_$1>', $line);

    // replace the Classification elements to make them unique
    $line = preg_replace('/<Classification\s+Name=\"([^"]+?)\"\s*?>(.*?)<\/Classification>/i', '<Classification_$1>$2</Classification_$1>', $line);

    echo utf8_encode($line);
}
fclose($fp);
?>

It would be possible to modify this script to save its output to a file on your disk. For now, I just used my browser to save a copy of the output file, and I will use that copy for the next step.
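If you would rather write the output straight to disk, one way to do it is sketched below. The helper massage_to_file() and the stand-in rename_items() are hypothetical names introduced only for this illustration; in practice you would pass in a function containing all of your own per-line processing.

```php
<?php
// Hypothetical helper: applies $process_line to every line of $in_path
// and writes the result to $out_path.
// Returns the number of lines written, or false if a file cannot be opened.
function massage_to_file($in_path, $out_path, $process_line)
{
    $in = fopen($in_path, 'r');
    if ($in === false) {
        return false;
    }
    $out = fopen($out_path, 'w');
    if ($out === false) {
        fclose($in);
        return false;
    }

    $lines = 0;
    while (($line = fgets($in)) !== false) {
        fwrite($out, call_user_func($process_line, $line));
        $lines++;
    }

    fclose($in);
    fclose($out);
    return $lines;
}

// Stand-in for the per-line processing; only the Item rewrite is shown.
function rename_items($line)
{
    return preg_replace(
        '/<Item\s+Name="([^"]+?)"\s*?>(.*?)<\/Item>/i',
        '<Item_$1>$2</Item_$1>',
        $line
    );
}
```

A call such as massage_to_file('feed.xml', 'out.xml', 'rename_items') would then produce out.xml next to the script, assuming the web server user is allowed to write there.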

One limitation is that the script above works line by line, so it does not process elements that span multiple lines.
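The following sketch shows that limitation on a made-up three-line sample: the Item that fits on one line is renamed, while the Item whose data continues on a second line is left untouched, because its opening and closing tags are never seen in the same line.

```php
<?php
$pattern = '/<Item\s+Name="([^"]+?)"\s*?>(.*?)<\/Item>/i';
$replace = '<Item_$1>$2</Item_$1>';

// Made-up sample: the second Item spans two lines.
$lines = array(
    "<Item Name=\"JOBTITLE\">PHP Developer</Item>\n",
    "<Item Name=\"DESCRIPTION\">First line\n",
    "second line</Item>\n",
);

$output = '';
foreach ($lines as $line) {
    // Each line is matched in isolation, as in a line-by-line script.
    $output .= preg_replace($pattern, $replace, $line);
}

echo $output;
```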

Here is a more complete PHP code listing which is able to process multiple lines, and which also converts the data to CDATA and cleans it. To be able to read multiple lines, the algorithm puts the lines in a buffer until the ending tag is matched. (This assumes that each element starts on a separate line.)

<?php

#############################

massage_feed();

#############################

// map external user id to jamit job board id
function get_emp_username($client_id) {
    return 1; // just to illustrate this example.
}

function get_employer($username) {
    // just to illustrate this example.
    // your function could query the database to get this value.
    return array('CompName' => 'Jamit Software');
}

// convert string to CDATA
function to_cdata($str) {
    $str = trim($str);
    $str = str_replace(']]>', ']]]]><![CDATA[>', $str); // http://en.wikipedia.org/wiki/CDATA
    $str = '<![CDATA['.$str.']]>';
    return $str;
}

function convert_data($str) {
    // your function could do anything here to clean the data
    // (the value is lower-cased first, so the case labels are lower-case too)
    switch (strtolower(trim($str))) {
        case 'northern territory': return 'NT - Other';
        case 'new south wales': return 'NSW - Other';
        case 'victoria': return 'VIC - Other';
        case 'queensland': return 'QLD - Other';
        case 'south australia': return 'SA - Other';
        case 'australian capital territory': return 'ACT - Other';
        case 'western australia': return 'WA - Other';
        default: return $str;
    }
}

// callback for replace_items(): $m[1] is the Name attribute, $m[2] is the element data
function replace_item($m) {
    return '<Item_'.$m[1].'>'.to_cdata(convert_data($m[2])).'</Item_'.$m[1].'>';
}

// replace the <Item> elements to make the element names unique,
// e.g. <Item_JOBTITLE></Item_JOBTITLE>
// $count is set to the number of replacements that were made
function replace_items($str, &$count) {
    // the /e modifier was removed in PHP 7, so preg_replace_callback() is used
    return preg_replace_callback('/<Item\s+Name=\"([^"]+?)\"\s*?>(.*?)<\/Item>/is', 'replace_item', $str, -1, $count);
}

function process_line($line, &$count) {
    $count = 0;
    static $emp_username = '';
    static $emp_username_elem = '';

    // copy the client id:
    if (preg_match('/<Client\s+ID=\"([^"]+?)\"/i', $line, $m)) {
        $emp_username = get_emp_username($m[1]);
        $emp_username_elem = '<username>'.$emp_username.'</username>';
    }

    // put the Reference attribute inside the job element
    // this also detects the start of the job element
    // so we can put in a bunch of additional elements that we need
    if (preg_match('/<Job\s+Reference=\"([^"]+?)\"/i', $line, $m)) {
        // put the reference in to the feed as an element
        $line .= '<reference>'.$m[1]."</reference>\n";
        $line .= "<pass></pass>\n"; // blank password
        if ($employer = get_employer($emp_username)) {
            $line .= "<company>".$employer['CompName']."</company>\n"; // company
        }
        // put in emp_username element
        $line .= $emp_username_elem."\n";
    }

    return replace_items($line, $count);
}

function massage_feed() {
    // un-comment if outputting to file
    //$fp_out = fopen('out.xml', 'w');

    header('Content-type: application/xml; charset=UTF-8');

    $fp = fopen('feed.xml', 'r'); // change this to your input file
    $start_matched = false;

    while (($line = fgets($fp, 8192)) !== false) {
        $line = utf8_decode($line);

        if (!$start_matched) {
            // skip blank lines before the XML declaration
            $line = trim($line);
            if ($line == '') continue;
        }
        if (strpos($line, '<?xml') !== false) {
            $start_matched = true;
            $line .= "\n";
        }

        $line = process_line($line, $count);
        $buffer = $line;

        $i = 0;
        while (!$count && preg_match('/<Item\s+Name=/i', $line)) {
            // line started with <Item Name, but the end was not matched
            // here we read further lines to try to match the closing </Item>
            // if the closing </Item> is found then replace it
            $i++;
            $next = fgets($fp, 8192);
            if ($next !== false) {
                $buffer .= str_replace("\n", '', utf8_decode($next));
                $rep = replace_items($buffer, $count);
                if ($count) {
                    $line = $rep."\n"; // restore the line break stripped above
                    break;
                }
            }
            if ($next === false || $i == 100) {
                // give up at the end of the file or after 100 lines..
                $line = $buffer;
                break;
            }
        }

        echo utf8_encode($line);
        // un-comment if outputting to file
        //fwrite($fp_out, utf8_encode($line));
    }
    fclose($fp);
    // un-comment if outputting to file
    //fclose($fp_out);
}
?>

3. Set up the feed

Now we are ready to set up the job board to import the jobs feed.

2. Click on the 'Add a new Feed to Import' button

3. Fill in the required fields. For the XML Sample file, I use the sample file that I came up with in step 2. If you skipped step 2, then just use the sample file that was given to you.

(There are a few ways in which the job board can pick up the XML feed. For this example, I will choose 'Fetch the file from URL', and put in the URL to the XML feed that is generated by the script I came up with in step 2.)

This is where we identify the 'sequence element', which is the element that will be looped through to import all the job records. I click on the radio button to identify it, and a green line will be drawn to indicate the start and end of this element.

5. Click on the 'Please map your fields' link next to your feed name. Here I associate the fields on my posting form with the elements from the XML feed. At a minimum, I need to map the fields which are required. These are denoted with an *.

Each job board is different, and may require different fields. If your feed does not have a certain field, then you will need to get your programming tools out and massage the feed a little, as I showed in step 2.

For my case, my setup looks something like this:

For the setting 'How to associate the jobs with employer accounts', I have chosen the 3rd one, 'Insert using the employer's username, create a new account from the account data present in the feed'. The reason is that I do not have a password in the feed; I only have 'client_id', which I can use as a username. For the password field when mapping the Account Details, I just choose a dummy field which I know will always be blank.

I do not select anything for the Command Field; I assume that all the jobs in the feed are to be inserted. The most important field to set is the 'GUID'. This field must be unique for each job that is to be imported.
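Before relying on a field as the GUID, it can be worth checking that its values really are unique. The sketch below is only an illustration: it assumes that the <reference> element produced in step 2 is the one you map to the GUID (your feed may use a different element), and find_duplicate_values() is a hypothetical helper, not a Jamit function.

```php
<?php
// Hypothetical check: returns the values of <$element> that occur
// more than once in the massaged feed $xml.
function find_duplicate_values($xml, $element = 'reference')
{
    $name = preg_quote($element, '/');
    preg_match_all('/<' . $name . '>(.*?)<\/' . $name . '>/is', $xml, $m);

    // Count how often each (trimmed) value occurs.
    $counts = array_count_values(array_map('trim', $m[1]));

    $duplicates = array();
    foreach ($counts as $value => $n) {
        if ($n > 1) {
            $duplicates[] = (string) $value;
        }
    }
    return $duplicates;
}
```

An empty array from find_duplicate_values(file_get_contents('out.xml')) suggests the field is safe to use; anything else lists the values that would collide.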

I then complete the rest of the form. Here is a screenshot of my final setup:

Importing

I'm ready to import! I click the 'Fetch' link next to the feed name, which tells the job board to fetch my feed and start importing it. A log of the import is shown in the IFRAME below. This is what I get:

Looks good. Now I check my job board and have these jobs in: