Sunday, May 07, 2006

Using MySQL 5.0 stored procedures with MySQLdb

When I wanted to try out MySQL 5.0 stored procedures I didn't find much on the web. That was a few months ago and there may be better tutorials out there now, but I figured I would share some of my tricks.

Why use stored procedures in the first place? While stored procedures can give you a performance boost (I have not benchmarked, so I can't say how much), the argument I usually make when pitching them is abstraction. You can change a stored procedure without touching the calling code, as long as you don't break the interface. There have been many instances where I avoided a code change - QA cycle - rollout just by adding an additional field to a select statement in a stored procedure.

Step 0: Install and configure MySQL 5.0
MySQL is at version 5.0.21 at the time of this writing. Installing MySQL can be tricky, and a full walkthrough is beyond the scope of this article, but here are some tips...

I usually install from source. The key is to follow the directions provided in the install notes and make sure that the directory MySQL installs to is owned by "mysql:mysql". Once the permissions are correct and the initial database has been installed, MySQL usually fires right up.

Step 1: Installing MySQLdb.
Get the latest version of MySQLdb (1.2.1_p2 when I wrote this) and install...
$> tar zxfv MySQL-python-1.2.1_p2.tar.gz && cd MySQL-python-1.2.1_p2/
#> python setup.py install
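
A quick sanity check from the interpreter; if the import works, version_info should report the build you just installed:

>>> import MySQLdb
>>> MySQLdb.version_info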

Step 2: Create some tables and insert some data

mysql> use test;
mysql> create table products (
productid int unsigned not null primary key auto_increment,
categoryid int unsigned not null,
productname varchar(80) not null,
description varchar(255) not null,
statuscd tinyint(1) unsigned not null,
createdt datetime not null,
index cat_prod_idx
(categoryid, productid, productname, description, statuscd)
) engine=innodb;

mysql> create table categories (
categoryid int unsigned not null primary key auto_increment,
category varchar(80) not null,
description varchar(255) not null,
statuscd tinyint(1) unsigned not null,
createdt datetime not null
) engine=innodb;

mysql> insert into categories (category, description, statuscd, createdt)
values ('cool category', 'these are all cool products', 1, now());

mysql> insert into categories (category, description, statuscd, createdt)
values ('lame category', 'these are the not so cool products', 1, now());

mysql> insert into products (categoryid, productname, description, statuscd, createdt)
values (1, 'slackware linux', 'this is my distro of choice.', 1, now());

mysql> insert into products (categoryid, productname, description, statuscd, createdt)
values (1, 'MicroSoft Windows', 'see category for description.', 2, now());



Step 3: Build the 4 basic types of queries (select, insert, update and delete)

Select stored procedures are pretty simple. The things to note are:
1. Only bring back the fields you need; 'select *' is evil.
2. I prefixed the input parameter with an underscore so that MySQL can't confuse it with a column name, and so the next person who has to read this sp can spot it easily.


mysql> delimiter //
mysql> create procedure usp_get_products_by_category
(
_categoryid int unsigned
)
begin
select
p.productid,
p.productname,
p.description,
c.category,
c.description
from
products p
inner join categories c on p.categoryid = c.categoryid
where
p.categoryid = _categoryid
and p.statuscd = 1
and c.statuscd = 1;
end;
//
mysql> delimiter ;



When writing an insert sp you always want to return the id of the record you just created, hence the "select LAST_INSERT_ID()". This means that you will get a one-record result set back after you call the insert sp, and the first and only field in the record will be the id.


mysql> delimiter //
mysql> create procedure usp_ins_product
(
_categoryid int unsigned,
_productname varchar(80),
_description varchar(255),
_statuscd tinyint(1) unsigned
)
begin
insert into products
(categoryid, productname, description, statuscd, createdt)
values
(_categoryid, _productname, _description, _statuscd, now());
select LAST_INSERT_ID();
end;
//
mysql> delimiter ;
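
For example, once you have a cursor (see the connection setup in Step 4 below), grabbing the new id looks something like this; the 'OpenBSD' row is just sample data:

>>> curs.execute("call usp_ins_product(1, 'OpenBSD', 'free, functional and secure', 1)")
>>> newid = curs.fetchone()[0] # first and only field of the one-record result set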



When I write my update and delete stored procedures I usually let MySQL do its thing. I always check the number of rows affected through the MySQLdb cursor object:
"count = curs.execute(stmt)"


mysql> delimiter //
mysql> create procedure usp_upd_productstatus
(
_productid int unsigned,
_statuscd tinyint(1) unsigned
)
begin
update products
set statuscd = _statuscd
where
productid = _productid;
end;
//
mysql> delimiter ;

mysql> delimiter //
mysql> create procedure usp_del_product
(
_productid int unsigned
)
begin
delete from products where productid = _productid;
end;
//
mysql> delimiter ;
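
To make that row-count check concrete, here is a minimal sketch using the connection from Step 4 below; a count of 0 means no product matched _productid:

>>> count = curs.execute('call usp_upd_productstatus(%d, %d)' % (1, 0))
>>> if count != 1:
...     pass # no rows updated - handle the missing product here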



Step 4: Testing it all out!

Fire up a python interpreter and ...

>>> import MySQLdb
>>> from MySQLdb.constants import CLIENT
>>> cnn = MySQLdb.connect(host='localhost', user='USER', passwd='PASS', db='test', client_flag=CLIENT.MULTI_STATEMENTS)
>>> cnn.autocommit(1)
>>> curs = cnn.cursor()
>>> count = curs.execute('call usp_upd_productstatus(%d, %d)' % (1,0)) # update
>>> count = curs.execute('call usp_get_products_by_category(%d)' % (2)) # select
>>> curs.fetchall()
>>> curs.execute("call usp_ins_product(1, 'FreeBSD', 'Another great OS', 1)")
>>> curs.fetchall() # this returns the id of the new 'FreeBSD' record
>>> count = curs.execute('call usp_del_product(%d)' % (2)) # delete
>>> curs.close()
>>> cnn.close()


The InnoDB storage engine supports row-level locking, which means writes won't tie up the entire table. Because I created the tables with 'engine=innodb' I had to set the autocommit flag to true on the connection, otherwise my queries wouldn't get committed.
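
If you would rather not run with autocommit on, the explicit alternative is a commit per transaction. A minimal sketch, assuming the same connection parameters as above:

>>> cnn = MySQLdb.connect(host='localhost', user='USER', passwd='PASS', db='test', client_flag=CLIENT.MULTI_STATEMENTS)
>>> curs = cnn.cursor()
>>> count = curs.execute('call usp_upd_productstatus(%d, %d)' % (1, 0))
>>> cnn.commit() # without this (or autocommit) InnoDB rolls the update back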

Monday, February 27, 2006

URL to DSL: Turning JobJitsu's URLs into a job query language.

Useful URLs (KIHF)

One of the goals behind JobJitsu is to uphold the KISS (Keep It Simple Stupid) philosophy and, wherever possible, the KIHF (Keep It Hacker Friendly) philosophy. That is especially true for URLs. One thing I really like about del.icio.us is its site navigation: it is pretty simple and at the same time very expressive/powerful. With that in mind, we decided to make our URLs useful for searching and navigation.

For example:
When searching for python software engineering jobs in San Diego, one would use a url like...
  http://jobjitsu.com/jobs/sandiego/sw/python
The structure of the url is:
  http://jobjitsu.com/jobs/locations/categories/tags

where
  locations -> comma separated list of locations
  categories -> comma separated list of categories
  tags -> list of tags and operators
  operators -> , + !
    , := or
    + := and
    ! := not

A more complex example:
Show me all the jobs in San Diego or Los Angeles that are in the software engineering or quality assurance categories having the tags python and lisp and linux but not microsoft...
  http://jobjitsu.com/jobs/sandiego,losangeles/sw,qa/+python+lisp+linux!microsoft
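
To make the tags grammar concrete, here is roughly what the url library shown later in this post does with that tags segment (sql escaping omitted for readability):

>>> tokenize('+python+lisp+linux!microsoft')
['+', 'python', '+', 'lisp', '+', 'linux', '!', 'microsoft']
>>> process('+python+lisp+linux!microsoft') # -> (AND-tags, OR-tags, NOT-tags)
(['linux', 'lisp', 'python'], [], ['microsoft'])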

But wait there is more...
By adding the feed sub-domain to your url you'll get a jobs atom feed:
  http://jobjitsu.com/jobs/sandiego/sw/python
gives you something that looks good in a browser, and
  http://feed.jobjitsu.com/jobs/sandiego/sw/python
will give you something that looks good in your feed reader. This allows you to easily syndicate your job searches. So if you always want to see the latest lisp jobs you can now embed:
  http://feed.jobjitsu.com/jobs/all/all/lisp
into your google/ig homepage.

Now for the fun stuff... how do we make it work?
We first built a url library to handle most of the low level url parsing which allows us to build a parse tree of operators and tokens. we then created the sql-generation code that allowed us to translate this parse tree into sql statements and finally we wrapped up the whole thing w/ some query caching. Disclaimer: this is all still first pass and is subject to bugs and change but I figured I would post it cause it is kind of neat.

The sql-generation code is not included because it is quite hackish (even more so than the rest of my code) and will need to be improved...
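
That said, to give a rough idea of its shape: each sqlgen helper returns a fragment for the where clause assembled in construct_sql() below. A completely hypothetical job_tags_and (the jb_job_tags table name is made up for illustration, this is not the actual code) might look like:

def job_tags_and(tags):
    """ hypothetical sketch: require every tag via an exists subquery """
    return " \nand ".join([
        ("exists (select 1 from jb_job_tags t"
         " where t.jobid = j.jobid and t.tag = '%s')") % tag
        for tag in tags])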

Job handler...


from ats.cache import memcache_mgr
from ats.db import dao
from ats.log import logger
from ats.util import converter
from ats.util import url as url_lib
from common.model import job as job_model
from common.model import sqlgen
from common.ui import navigation
from home.view.jobs import jobs as jobtmpl
from home.view.sidebar import sidebar
from mod_python import apache

# ------------------------------------------------------------------
# globals
# ------------------------------------------------------------------
cache = memcache_mgr.get_cache()
log = logger.get_logger()

# ------------------------------------------------------------------
# handler
# ------------------------------------------------------------------
def handler(req, sesh):
    """ This function will return all the jobs based on the query string
    """
    global log
    jobs, start, pagesize, error = [], 0, 0, 0
    try:
        jobs, start, pagesize = get_jobs(req.uri)
        jobs = sort_jobs(jobs)
        if jobs and start > 0:
            jobs = jobs[start-1:pagesize]
    except Exception, e:
        log.exception(e)
        error = 1

    if error:
        # do some error handling here
        pass

    return jobtmpl(searchList=[{
        "site_nav": navigation.get_site_nav('jobs'),
        "sesh": sesh,
        "jobs": jobs,
        "start": start,
        "pagesize": pagesize,
        "sidebar":
            sidebar(searchList=[{"sesh": sesh, "metros": job_model.get_metros()}])}])

# ------------------------------------------------------------------
# get jobs from cache or db
# ------------------------------------------------------------------
def get_jobs(url):
    """ Check the cache to see if this query has been run recently;
        if so then return those results.
        On cache miss we construct the query and go to the database.
    """
    global cache
    if not url: return [], 0, 0
    urlkey = 'jj:jobs:%s' % url
    segments = url_lib.unpack(
        url,
        ['locations', 'categories', 'tags', 'start', 'pagesize'],
        keep_blank_values=0)
    start = converter.toint(segments.get('start', 1))
    pagesize = converter.toint(segments.get('pagesize', 50))

    # check cache first
    jobs = cache.get(urlkey)
    if jobs:
        return jobs, start, pagesize

    # on cache miss parse query string and run query
    locations = filter(  # split on ',' + filter empties and 'all'
        lambda x: x and x != 'all',
        segments.get('locations', '').lower().split(','))
    categories = filter(  # split on ',' + filter empties and 'all'
        lambda x: x and x != 'all',
        segments.get('categories', '').lower().split(','))
    tags = url_lib.process(segments.get('tags', 'all').lower())

    # run query
    ct, jobs = dao.get_records(
        'jobjitsu',
        construct_sql(locations, categories, tags))

    # cache for 5 min
    if ct > 0:
        cache.set(urlkey, jobs, 5)
    else:
        jobs, start, pagesize = [], 0, 0

    return jobs, start, pagesize

# ------------------------------------------------------------------
# translate query string to job query
# ------------------------------------------------------------------
def construct_sql(locations, categories, tags):
    """ build sql stmt from sections of a query string of the form:
        /jobs/locations/categories/tags/start/pagesize
        returns the sql statement as a string
    """
    global log

    # construct the query
    sql = []
    if locations:
        sql.append(sqlgen.job_loc_or(locations))
    if categories:
        sql.append(sqlgen.job_cat_or(categories))
    if tags and tags[0]:
        sql.append(sqlgen.job_tags_and(tags[0]))
    if tags and tags[1]:
        sql.append(sqlgen.job_tags_or(tags[1]))
    if tags and tags[2]:
        sql.append(sqlgen.job_tags_not(tags[2]))
    stmt = "%s%s;" % (" select j.* from jb_jobs j where ", " \nand ".join(sql))
    return stmt

# ------------------------------------------------------------------
# utility
# ------------------------------------------------------------------
def sort_jobs(jobs):
    if not jobs: return []
    sorted = jobs[:]
    sorted.sort(lambda x, y: cmp(y.createdt, x.createdt))
    return sorted




Low Level URI handling...


from ats.db import dao
from ats.log import logger  # process() logs parse failures

log = logger.get_logger()

escape_char = "'"
op_and = '+'
op_or = ','
op_not = '!'
operators = (op_and, op_or, op_not)

def unpack(uri, keys, keep_blank_values=0):
    """ This function will take a req.uri and a list and return a dict
        of key->val pairs where the keys are taken from the list and
        values are taken from the req.uri.split('/')
    """
    d = {}
    if keep_blank_values:
        for key in keys:
            d[key] = None
    if not uri or not keys:
        return d
    try:
        num_keys = len(keys)
        tokens = uri.split('/')[2:]
        for i in xrange(len(tokens)):
            if i < num_keys and tokens[i]:
                d[keys[i]] = tokens[i]
        return d
    except:
        return d

def process(str):
    global log
    if not str or str == 'all': return None
    try:
        return simplify_qry_expr(classify_qry_tokens(tokenize(str)))
    except Exception, e:
        log.exception(e)
        return None

def tokenize(str):
    """ takes a string like 'red,green+blue' and returns a list
        in [OPER1, TOKEN1, OPER2, TOKEN2, ...] format e.g.:
        [',', 'red', ',', 'green', '+', 'blue']
        this list can later be operated on and transformed into
        a sql query by examining the operators and tokens
    """
    global operators, op_or, escape_char
    tokens = []
    word = []
    i = 0
    while i < len(str):
        #########################################################
        ### save this section for when we impl group'd exprs  ###
        #########################################################
        # if str[i] == '(':
        #     jmp = match_brace(str[i:])
        #     tokens.append(tokenize(str[i+1:i+jmp]))
        #     i += jmp+1
        #     continue
        # if str[i] == ')':
        #     if word:
        #         tokens.append("".join(word))
        #     if tokens and tokens[0] not in operators:
        #         return [op_or] + tokens
        #     return tokens
        #########################################################
        if str[i] == escape_char:
            i += 1
            word.append(str[i])
            i += 1
            continue
        elif str[i] in operators:
            if word:
                tokens.append(dao.escape("".join(word)))
                word = []
            tokens.append(str[i])
        else:
            word.append(str[i])
        i += 1
    if word:
        tokens.append(dao.escape("".join(word)))
    if tokens and tokens[0] not in operators:
        return [op_or] + tokens
    return tokens

def match_brace(str):
    """ Given a string '(hello (not) cool) world' will return 17
        useful for (grouped expressions) -- coming soon
    """
    brc_count = 1
    length = len(str)
    for i in xrange(length-1):
        if str[i+1] == '(':
            brc_count += 1
        elif str[i+1] == ')':
            brc_count -= 1
        if brc_count == 0:
            return i + 1
    if str[length-1] == ')':
        return length-1
    return 0

def classify_qry_tokens(tokens):
    """ Takes a flat list of the form op, word, op, word, ...
        returns a tuple of lists representing AND-tokens, OR-tokens, NOT-tokens
    """
    global operators, op_and, op_or, op_not
    _and, _or, _not = [], [], []
    copied = tokens[:]
    while copied:
        token = copied.pop()
        oper = copied.pop()
        if isinstance(token, list):
            token = classify_qry_tokens(token)  # grouped exprs nest recursively
        if oper == op_and: _and.append(token)
        elif oper == op_or: _or.append(token)
        elif oper == op_not: _not.append(token)
    return _and, _or, _not

def simplify_qry_expr(expr):
    """ OR is the lowest level operator;
        it only matters if AND and NOT aren't present:
        * AND + OR reduces to just AND
        * NOT + OR reduces to just NOT
    """
    if expr[0] or expr[2]:
        _or = []
    else:
        _or = expr[1]
    return expr[0], _or, expr[2]

if __name__ == "__main__":
    print "keep_blank_values=0"
    print """/cool/666/42 - ['evilid', 'goodid']"""
    d = unpack("/cool/666/42", ['evilid', 'goodid'])
    for k, v in d.items(): print k, v

    print """/cool/666/42 - ['evilid']"""
    d = unpack("/cool/666/42", ['evilid'])
    for k, v in d.items(): print k, v

    print """/cool/666 - ['evilid', 'goodid']"""
    d = unpack("/cool/666", ['evilid', 'goodid'])
    for k, v in d.items(): print k, v

    print "\nkeep_blank_values=1"
    print """/cool/666/42 - ['evilid', 'goodid']"""
    d = unpack("/cool/666/42", ['evilid', 'goodid'], 1)
    for k, v in d.items(): print k, v

    print """/cool/666/42 - ['evilid']"""
    d = unpack("/cool/666/42", ['evilid'], 1)
    for k, v in d.items(): print k, v

    print """/cool/666 - ['evilid', 'goodid']"""
    d = unpack("/cool/666", ['evilid', 'goodid'], 1)
    for k, v in d.items(): print k, v

Friday, February 17, 2006

The Technology behind JobJitsu.com

JobJitsu Technology Overview:
So now a little bit about the technology behind the job board...

JobJitsu is mostly a home grown web app. We have some general purpose modules that handle most of the common web application facilities:

  • cache -- a wrapper around the memcache client provided by tummy.com
  • conf -- where we keep common configuration settings
  • db -- a data access module that manages connections and uses Greg Stein's DTuple.py
  • dispatch -- this guy handles the url to resource mapping, sessions and a few other things.
  • log -- wrapper around python logging module
  • mail -- mail wrapper
  • re -- common precompiled regexes (email, url, etc.)
  • sesh -- our session module
  • util -- utility stuff (string manipulation, url parsing, etc.)

The presentation is handled by Cheetah. Cheetah has nice syntax (no XML), flexibility, and performance (precompiled templates). One other feature that I find hard to live without is template inheritance: Cheetah allows you to define a base page which other pages can extend (good for header + sidebar + footer).

We are using MySQL 5.0 (which supports stored procedures) on top of the InnoDB storage engine for persistence.

Finally what modern web app would be complete without some cool AJAX? Matt has some neat JSON stuffs that he is working on.

Next post will be an in-depth look at how we manage our data, including how we defined our data model, a quick discussion of stored procedures, indexing strategies, and how we actually pass data back and forth between Python and MySQL.

Thursday, February 16, 2006

Ambition

It's all about Ambition

The real challenge of starting a job board is not the technology behind it, but getting employers to post jobs and job seekers to search it. There is a critical mass that needs to happen on both sides before the job board becomes relevant. In the past I believe this was accomplished with massive advertising campaigns. Matt and I discussed the idea of running a 30 second commercial during the Super Bowl but decided to save that for next year and take a more grass-roots approach instead.

Getting employers to post shouldn't be too hard... make it free to post for the first 6 months; when other job boards are charging hundreds of dollars per posting, free isn't so bad.

Getting job seekers to use it will be a bit harder (enter the shameless self promotion you are now reading). A good place to start will be by documenting the project in my blog. I figure the worst case scenario is that in the future someone may find a shred of something useful here, while the best case scenario would be that I might be able to generate some *interest*.

I am convinced that if you give 2 good hackers 2 months with nothing else to do, they can build just about anything. We, on the other hand, have day jobs, so this is going to be a tight deadline for us.

That being said, my philosophy on making things happen is this...
1) If you have half a brain.
2) And you are completely honest with yourself and others.
3) And you are not afraid of a little hard work.

Then the only thing stopping you from achieving whatever it is that you set out to do is your own ambition.

The name of the site is going to be JobJitsu.com.
Why did we choose JobJitsu.com?
Simple: because JobNinja was already taken.

Anyway, enough Tony Robbins speak; tomorrow's entry is all about tech.

Wednesday, February 15, 2006

2 hackers

2 hackers, 2 months, and an idea...

Build a better job board. I was talking to my friend Matt about the idea of building a better job board. It didn't take much to convince each other that we could do much better than what is currently out there.

Monster, Dice, HotJobs, Jobing (spelling?) et al have all gotten fat, dumb and happy - Internet Explorer style (oh yes I did). The one thing they all have in common is how hard they suck. It's like there has been a feature freeze since 2002; moreover, these job boards have all been overrun by staffing companies.

I am going to assert that coming up with better technology than what the current job boards have to offer won't be a problem. So what is it that we are going to do better? Good question. The answer is a closely guarded secret, but you should have some idea by 4/1/06 (our first launch date). Over the next 6-8 weeks I will blog about our methods and progress.

What makes us so sure that we can pull this off?
We have a secret weapon.

More on that tomorrow.

Monday, January 09, 2006

Creating your own mod_python request dispatcher

Creating your own mod_python request dispatcher that maps URLs to python request handlers.

Conceptually I have something that looks like this:

+-------------+
|   Clients   |
| Web Browser |
+-------------+
   | [http://somecooldomain.com/coolrequest]
   |   ^
   |   | [Cool Response]
   v   |
+--------+     +------------+     +---------------+     +---------+
| Apache | --> | Mod_Python | --> | dispatcher.py | --> | cool.py | --+
|  (A)   | <-- |            | <-- |      (B)      |     |   (D)   |   |
+--------+     +------------+     +---------------+     +---------+   |
                                      |   ^                           |
                                      |   |                           |
                                      |   +---------------------------+
                                      v
                               +-------------+
                               | urlmap.conf |
                               |     (C)     |
                               +-------------+


A) Apache Configuration:
On my server I have mod_python set up to handle any request that comes in without a file extension. That way I can have those cool REST-style URIs. I use the apache "FilesMatch" directive in my httpd.conf file to match any request that does not contain a '.' before the '?' in the query string:


<FilesMatch "(^[^\.]*$|^[^\?]*[\?]+[^$]+$)">
SetHandler python-program
PythonHandler common.dispatch.dispatcher
PythonDebug On
</FilesMatch>


B) dispatcher.py
The source for dispatcher.py can be found here.

C) Example urlmap.conf


[somecooldomain.com]
/=cooldomain.handlers.home.handler
/signup=cooldomain.handlers.member.signup
/login=cooldomain.handlers.member.login
/logout=cooldomain.handlers.member.logout
/cool=cooldomain.handlers.cool.handler
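
Putting B and C together: I haven't reproduced dispatcher.py here (the source is linked above), but conceptually, and this is just a hypothetical sketch rather than the actual code, the dispatch step boils down to:

# assumes urlmap.conf has already been parsed into a dict like this
URLMAP = {
    '/':     'cooldomain.handlers.home.handler',
    '/cool': 'cooldomain.handlers.cool.handler',
}

from mod_python import apache

def dispatcher(req):
    dotted = URLMAP.get(req.uri)
    if not dotted:
        return apache.HTTP_NOT_FOUND
    modname, funcname = dotted.rsplit('.', 1)
    module = __import__(modname, {}, {}, [funcname])
    req.content_type = 'text/html'
    req.write(str(getattr(module, funcname)(req)))  # hand off to the handler
    return apache.OK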



D) cool.py
This is the handler that generates your content. In my content handlers I usually do things like query the database, process business logic, select a Cheetah template and return the result as html, something like the sketch below.
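
A minimal content handler might look like this (hypothetical; cooltmpl would be a compiled Cheetah template and get_cool_stuff() a made-up data access call):

from cooldomain.view.cool import cooltmpl
from cooldomain.model import cool as cool_model

def handler(req):
    # query the db / run the business logic, then feed the results
    # to a cheetah template and return the html
    rows = cool_model.get_cool_stuff()
    return cooltmpl(searchList=[{'rows': rows}])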

In summary, I use this approach for a couple of reasons. First, it allows me to decouple my URLs from my python code, so I don't have python code + html sitting around in the same directory. Secondly, I get one piece of code that handles every request (good for sessions and things like that).

Friday, January 06, 2006

urlencoder/decoder for python

EDIT (I so stand corrected.):

I could have just used:
urllib.quote
urllib.unquote
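
A quick example of the round trip:

>>> import urllib
>>> urllib.quote('a b&c')
'a%20b%26c'
>>> urllib.unquote('a%20b%26c')
'a b&c'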

---------------------------------------------------

Every once in a while I run into the need for a urlencoder/decoder function that just accepts a string and returns a string.

The one that comes with urllib(2) takes a dictionary and returns a "name=val" formatted string.
My only problem is that it does 2 things at once:
1) url encodes.
2) constructs a query or post string.

So I wrote a quick utility to do just #1:


_keys = [
    "$", "&", "+", ",", "/", ":", ";", "=", "?", "@", " ", '"',
    "<", ">", "#", "%", "{", "}", "|", "\\", "^", "~", "[", "]", "`"]

_vals = [
    '%24', '%26', '%2B', '%2C', '%2F', '%3A', '%3B', '%3D', '%3F',
    '%40', '%20', '%22', '%3C', '%3E', '%23', '%25', '%7B', '%7D',
    '%7C', '%5C', '%5E', '%7E', '%5B', '%5D', '%60']

def encode(str=""):
    """ URL encodes a string without side effects
    """
    return "".join([_swap(x) for x in str])

def decode(str=""):
    """ Takes a URL encoded string and decodes it without side effects
    """
    if not str: return None
    for v in _vals:
        if v in str: str = str.replace(v, _keys[_vals.index(v)])
    return str

def _swap(x):
    """ Helper function for encode.
    """
    if x in _keys: return _vals[_keys.index(x)]
    return x

### units ###
if __name__ == "__main__":
    assert("".join(_keys) == decode(encode("".join(_keys))))
    assert("".join(_vals) == encode(decode("".join(_vals))))
    print "passed all unit tests."