Paolo Donadeo — LifeLOG

All about my life, job and thoughts

PyCon Due

Paolo | 14/05/2008 | 13:32


The second Italian convention devoted to the Python language, PyCon2, has just ended; it was held in Florence last weekend. Although I am not a true Pythonista, I am interested in Python because at the moment much of my consulting work requires knowledge of the language and, I must say, because I was strongly attracted by some of the keynotes, most notably Alex Martelli's talk on Google App Engine.

Another interesting aspect for me was seeing how many talks focused on web technologies; since I am writing Ex-nunc, a web framework in Objective Caml, I wanted to find out about the state of the art of similar projects written in Python.

The organization of the event was excellent, to say the least. Develer, the company that promoted and organized PyCon2, overlooked no detail: the event website, highly detailed and full of information; the choice of a hotel right in the city center, within walking distance of the railway station; and the rich buffet.

The logistics of the talks were also exceptional: excellent acoustics and simultaneous translation available for English-speaking guests.

Finally, though perhaps less important, the city of Florence is a splendid setting and, although I imagine it would be possible to organize a PyCon3 in Sesto San Giovanni, I really hope that next year they again choose a city where every corner offers views like this one.

Flawless organization, interesting topics and international guests made PyCon2 an event of real importance, one that holds its own against any similar event in the world. For the Italian scene it is also quite exceptional from a cultural point of view, in a country that talks less and less about technology and not at all about free software.

Categories
Article, Computer programming, Italian, Python, Spare time

Sending emails via Gmail with Objective Caml

Paolo | 26/04/2008 | 23:00


Motivation

Last week I was writing a Python script to make an automatic backup, and I decided to send myself an email in case of scp failure. I wanted to send the email from Python, possibly via GMail, and I found this interesting blog post: Sending emails via Gmail with Python. I like Python, it's a good programming language, but my heart (as a developer!) beats for the Objective Caml programming language.

So I decided to port the script presented in that post to OCaml. The result is this sendmail.ml.

Compiling the script

To compile the script you need four software components:

  1. the Objective Caml environment. You can download it from the INRIA site;
  2. Findlib, to make compiling very simple;
  3. Ocamlnet: here is the home page of the project;
  4. OCaml binding to the SSL library.

You can of course compile all of this from source, but every decent Linux distribution has it all packaged. On Debian you have to run the following command:

# aptitude install ocaml libocamlnet-ocaml-dev \
  libssl-ocaml-dev ocaml-findlib

Now, to compile the script, issue the command:

$ ocamlfind ocamlopt -linkpkg -package \
  netstring,smtp,ssl,str sendmail.ml -o sendmail

Before using it, remember to customize your name, email address, GMail user and password.

Code comparison

The first difference that jumps out at anyone comparing the two scripts is the number of lines: 41 lines for Python against 163 for my OCaml version. The difference is justified by the fact that the Python standard library comes with an almost full-featured SMTP client, with ESMTP and TLS capability. Objective Caml, on the other hand, has a very concise standard library, which includes essential modules and data structures, but no "batteries" are provided out of the box. This is a precise design decision by INRIA and, in some ways, I agree with them. Luckily the OCaml community is a source of excellent libraries and bindings, like Ocamlnet by Gerd Stolpmann and the SSL library binding, written by Samuel Mimram. The first one in particular is the Swiss Army knife for network-oriented battles.

Since the SMTP client provided by Ocamlnet doesn't include TLS capability, I decided to steal its source code and adapt it to my needs, to get a more comfortable, higher-level interface resembling the one offered by the Python standard library.

So the different length is easily explained: 109 lines of code are devoted to the smtp_client class, and the actual script is 54 lines long.

The forward pipe operator

All Turing-complete computer languages are equivalent, but everyone knows this is only the theory, and everyone has a programming language of choice. Here are two examples of what you can do in OCaml.

The first is the pipe operator:

let (|>) x f = f x

Here we define an infix operator (very common in FP) which simply inverts the order of its operands. What the frack is this? Very simple: we use it to swap a function and its last argument, so if we want to compute the 3rd Fibonacci number we can write:

let fib3 = fibonacci 3

but also:

let fib3 = 3 |> fibonacci

This is not just a matter of style: the operator feeds a value into a function, so we can of course chain several functions together, as in a shell script with the Unix pipe operator, transforming an ugly, hard-to-read call:

let result = func1(func2 (func3(x)))

into:

let result = x |> func3 |> func2 |> func1
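The snippets above can be put together into a small self-contained sketch. The fibonacci function and the list pipeline are hypothetical helpers added only for illustration; they are not taken from sendmail.ml. Note that recent compilers (OCaml 4.01 and later) already provide |> in the standard library, so the first definition is only needed on older versions.

```ocaml
(* Forward pipe, defined as in the post. Since OCaml 4.01 the standard
   library already provides it; redefining it here is harmless. *)
let ( |> ) x f = f x

(* Hypothetical helper: naive Fibonacci with fibonacci 0 = 0 and
   fibonacci 1 = 1. *)
let rec fibonacci n =
  if n < 2 then n else fibonacci (n - 1) + fibonacci (n - 2)

(* Same value as [fibonacci 3], written with the pipe. *)
let fib3 = 3 |> fibonacci

(* A pipeline read left to right: keep the even numbers, square them,
   then add everything up. *)
let sum_of_even_squares =
  [1; 2; 3; 4; 5; 6]
  |> List.filter (fun n -> n mod 2 = 0)
  |> List.map (fun n -> n * n)
  |> List.fold_left ( + ) 0
```

Each stage receives the result of the previous one as its last argument, which is why partially applied functions such as List.map (fun n -> n * n) fit naturally into the chain.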

In the sendmail.ml script, line 127, we read:

email_string |>
  Str.global_replace new_line_regexp "\r\n" |>
    Str.split crlf_regexp |>
      List.iter (fun s ->
        self#output_string (if String.length s > 0 && s.[0] = '.' then
                              ("." ^ s ^ "\r\n")
                            else s^"\r\n"));

Here we take the string containing the email, replace all new lines with the sequence "\r\n", split the string into lines and finally send each line to the SMTP server, taking care to escape each line starting with a period by prepending an extra one. In 6 lines of code.
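The same transformation can be sketched as a pure function that returns the lines instead of writing them to the server, which makes it easy to try in the toplevel. This is only an approximation under stated assumptions: it uses String.split_on_char from the standard library instead of the Str regular expressions of the script, and the names split_lines, dot_stuff and wire_lines are invented here. Behaviour at the edges (for example a trailing newline) may differ from the original.

```ocaml
(* Normalise line endings: drop every '\r', then split on '\n'.
   Assumption: a '\r' never carries meaning on its own. *)
let split_lines (s : string) : string list =
  s
  |> String.split_on_char '\r'
  |> String.concat ""
  |> String.split_on_char '\n'

(* SMTP transparency: a line that starts with a period gets one more
   period in front, so it is not mistaken for the end of the message. *)
let dot_stuff (line : string) : string =
  if String.length line > 0 && line.[0] = '.' then "." ^ line else line

(* The lines as they would travel on the wire, each ending in CRLF. *)
let wire_lines (email : string) : string list =
  email
  |> split_lines
  |> List.map (fun line -> dot_stuff line ^ "\r\n")
```

For instance, wire_lines "Hello\n.hidden\r\nbye" yields the three lines "Hello\r\n", "..hidden\r\n" and "bye\r\n".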

Algebraic data type

Algebraic data types are a very interesting aspect of functional programming. We can easily wrap two heterogeneous data types into a single one with a couple of lines of code:

type socket =
  | Unix_socket of Unix.file_descr
  | SSL_socket of Ssl.socket

The smtp_client class contains a reference to the connection handle used for communicating with the server, which is either a plain file descriptor or an SSL socket, depending on the state of the communication. I do not want to create a virtual class or an interface plus two implementing classes, as I would have to do in horrible languages like Java, spending half an hour deciding which methods to put in the public interface, and so on; after all, it's only a file descriptor!

Now I have a new type which is a disjoint union of the two original types and I can write code like this (line 54):

let input = match channel with
  | Unix_socket s -> Unix.read s
  | SSL_socket s -> Ssl.read s

Here we say: if channel is actually a Unix file descriptor, let's define a new function "input" as the standard function "read" from the Unix module; otherwise, if channel is an SSL socket, let's define "input" as the Ssl.read function, which works only on encrypted sockets. From now on I'll use input instead of either of the two original functions.
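To try the idea without installing the Unix and Ssl libraries, here is a minimal sketch with stand-in types. Everything except the shape of the socket type is hypothetical: plain_conn, tls_conn, read_plain, read_tls and make_input are invented for illustration, and the "decryption" is a toy byte shift, not real SSL.

```ocaml
(* Stand-ins for Unix.file_descr and Ssl.socket, so that the sketch
   needs no external library. Both record types are hypothetical. *)
type plain_conn = { plain_data : string }
type tls_conn = { cipher_data : string; key : int }

(* Disjoint union of the two connection kinds, as in the post. *)
type socket =
  | Unix_socket of plain_conn
  | SSL_socket of tls_conn

let read_plain (c : plain_conn) : string = c.plain_data

(* Toy "decryption": shift every byte back by [key] (0 <= key <= 255). *)
let read_tls (c : tls_conn) : string =
  String.map
    (fun ch -> Char.chr ((Char.code ch - c.key + 256) mod 256))
    c.cipher_data

(* Choose the reader once; afterwards the caller only uses the returned
   function and never inspects the constructor again. *)
let make_input (channel : socket) : unit -> string =
  match channel with
  | Unix_socket s -> (fun () -> read_plain s)
  | SSL_socket s -> (fun () -> read_tls s)
```

If a third constructor were added to socket later, the compiler would warn that the match in make_input is no longer exhaustive, which is part of what makes this approach comfortable compared with a class hierarchy.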

Ok, it's time to stop the waffle. Enjoy the script if you need it; it's completely free, as in free beer, as in free speech and even as in free sex! :-)

Categories
Article, Computer programming, English, Objective Caml, Python, Spare time

Tagging is not the right way

Paolo | 20/10/2007 | 17:00


Yesterday I was listening to some music with the Last.fm client, when a song I particularly like started. As always, I decided to mark the song as "loved" and to tag it with something useful. If you use the Last.fm client, you know that it suggests the most common tags for the tune you want to tag. Ok, usually the list of tags includes a lot of stupid words, but this time I was surprised to see the word "gnocca" in the list.

For people not speaking Italian, "gnocca" is a coarse term referring to the vagina and, by extension, to a sexy girl: it can be translated with "pussy" in the first case and... I don't know a suitable translation for the second meaning.

Actually, this is the worst tag I've ever found, but it's not the first time I've been disappointed by other users' choice of tags.

As a matter of fact, many internet sites that use tags to classify user content are showing their limits. The whole paradigm of user-defined tags, known as folksonomy, is based on three ideas:

  1. it's nearly impossible to classify contents inside a tree of categories;
  2. associating words (tags) with contents is effective, because the user will remember the word and then, searching for the tag, will easily be able to recover the piece of information he or she is looking for;
  3. as a good side effect, if many users tag the same object, the most appropriate tags will emerge and a large number of users will automatically filter out unused or irrelevant tags, so that other people will easily retrieve information.

While I agree with the first statement, the last two are questionable, to say the least. Sure, it's very difficult, if not impossible, to arrange a large and heterogeneous set of objects into a tree-shaped data structure, particularly if the set grows with time and you don't know in what "direction" the tree is growing. Anyone who owns a personal computer and has tried to sort out his or her "document" folder can understand what I mean: there isn't a hierarchy that fits all needs, because many documents can correctly be filed in different places at the same time.

The proposed solution is to tag documents with a chaotic cloud of words freely chosen by people, where the only valid criterion seems to be common sense or a more hazy association of ideas.

My experience is that tagging without a criterion is only another way to lose information. Using a hierarchical tree of directories (or categories) leads the user to lose documents, because people tend to forget which aspect of the document they chose to catalogue it by. The situation gets even worse with tags: I usually choose as many tags as I think appropriate, in the secret hope that the large number will help me when searching for information later on. The net effect is a proliferation of synonyms, singular and plural tags (e.g. tool and tools) and completely useless words, either because they are too generic (e.g. hardware, programming or software) or too specialized (e.g. xgl or xen), so that about 48% of my tags actually label only one document.

These statistics are based on my personal experience with del.icio.us, one of the services where I pay great attention when choosing my tags, because Internet bookmarks are very important for my job. You can download here the file containing my del.icio.us tags, ordered by frequency. More than 48% of the tags are used only once, and a further 20% are used only twice, so I guess that most of my tags are completely useless. Not very good, actually.
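Percentages of this kind are simple counts over a tag frequency list. The sketch below shows one way to compute them; the sample data is invented for illustration and is not my actual del.icio.us export, and the function name percent_used_exactly is hypothetical.

```ocaml
(* Hypothetical sample: (tag, number of bookmarks carrying that tag).
   These figures are made up; they are not the real export. *)
let tag_counts =
  [ ("ocaml", 12); ("linux", 9); ("python", 7);
    ("debian", 2); ("smtp", 2);
    ("tool", 1); ("tools", 1); ("xgl", 1); ("xen", 1); ("hardware", 1) ]

(* Percentage of tags that label exactly [n] documents. *)
let percent_used_exactly (n : int) (counts : (string * int) list) : float =
  let total = List.length counts in
  if total = 0 then 0.0
  else
    let matching =
      List.length (List.filter (fun (_, c) -> c = n) counts)
    in
    100.0 *. float_of_int matching /. float_of_int total

let once = percent_used_exactly 1 tag_counts
let twice = percent_used_exactly 2 tag_counts
```

With this made-up sample, half of the tags label a single document, which is the kind of long tail described above.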

The lesson here is that cataloguing a large quantity of information does not come for free: a simple and easy way to keep tons of documents well ordered and always at your fingertips is a utopian dream.

I think it's now time to drop buzzwords like "web2.0" (the parent of all buzzwords) and to move on to some more serious and structured ideas about information architecture. Since I don't like to reinvent the wheel again and again, and since I need something to index my documents, I decided to investigate how librarians organize knowledge in a big library, following these two ideas:

  1. librarianship is a very old profession, and librarians can boast a thousand years of experience;
  2. I have many documents on my hard disk, and these documents differ widely in topic, media type and relevance, but that is nothing compared to the Library of Congress or other similar libraries in the world.

A quick search around led me to some readings, and I discovered a whole universe of studies about information indexing. The most appealing theory I found is faceted classification, in which multiple trees of "facets" are used to reach information. What are facets, actually?

A faceted classification system is composed of a number of categories (facets) that represent different aspects of the items we are going to classify. Each facet (aspect) is explained and developed in a tree of terms (known as "foci", or individually, "a focus"). To classify an item, therefore, you apply one or more terms from one or more facets to the item. In this way you have a multidimensional approach to the items you are indexing.

There are two main criteria developed by librarians to compose faceted classifications:

  1. the list of facets should represent several aspects of the items to be classified, and should be "orthogonal", as much as possible;
  2. the tree of terms belonging to each facet should present at each node a unique criterion of division, i.e. the set of children of a node must be a partition of the whole parent node, so that the hierarchy has no overlapping terms.

Following these two principles, the result will be a set of trees in which items are classified, and consequently several access points from which to start the search. Instead of a tree, the final data structure resulting from this kind of classification is a DAG (a directed acyclic graph), which provides a flexible way to organize knowledge without being chaotic like a "tag cloud".
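To make the structure concrete, here is a minimal sketch of a faceted classification as a data structure. All the facets, terms and items are invented for illustration, and the function names (all_terms, subtree_terms, matches, select) are hypothetical; it assumes that term names are unique within a facet.

```ocaml
(* A facet is a tree of terms ("foci"). *)
type focus = Focus of string * focus list

(* An item carries terms taken from one or more facets. *)
type item = { title : string; foci : string list }

(* Hypothetical facets, invented for this sketch. *)
let topic =
  Focus ("Topic",
    [ Focus ("Programming", [ Focus ("OCaml", []); Focus ("Python", []) ]);
      Focus ("Music", []) ])

let media = Focus ("Media", [ Focus ("Text", []); Focus ("Audio", []) ])

(* Every term in the tree rooted at the given node. *)
let rec all_terms (Focus (name, children)) =
  name :: List.concat (List.map all_terms children)

(* Terms of the subtree rooted at [wanted]; [] if the facet lacks it. *)
let rec subtree_terms wanted (Focus (name, children) as node) =
  if name = wanted then all_terms node
  else List.concat (List.map (subtree_terms wanted) children)

(* An item matches a (facet, focus) pair when it carries the focus
   itself or any term below it in that facet. *)
let matches (facet, wanted) item =
  let terms = subtree_terms wanted facet in
  List.exists (fun t -> List.mem t terms) item.foci

(* Keep the items satisfying every constraint, one per chosen facet. *)
let select constraints items =
  List.filter
    (fun it -> List.for_all (fun c -> matches c it) constraints)
    items

let library =
  [ { title = "sendmail.ml notes"; foci = [ "OCaml"; "Text" ] };
    { title = "PyCon talk recording"; foci = [ "Python"; "Audio" ] };
    { title = "Nu-jazz playlist"; foci = [ "Music"; "Audio" ] } ]

let titles items = List.map (fun it -> it.title) items
```

Asking for select [ (topic, "Programming"); (media, "Audio") ] library narrows the search along two facets at once and returns only the PyCon recording. Each item hangs from several trees at the same time, which is where the DAG shape comes from.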

An excellent example of faceted classification is the Nobel Prize winners page, a demo of the Flamenco Project at the University of California, Berkeley: you can navigate through the various criteria by which Nobel laureates are classified in an intuitive way, with a simple and effective interface.

Another example of use of facets is the Amazon jewelry page. You can reach the page going to Amazon.com and looking for "Jewelry and Watches" in the Product Categories menu.

Since I find the idea of facets very interesting, I decided to start a little experiment in this blog: Wordpress handles trees of categories, so I decided to adapt them and use the whole category system as a facet system. It will not be perfect, because there are no facilities to navigate the DAG, as in the Flamenco Project, and the reader cannot choose more than one focus at a time, but it's a starting point. If the result, in a few months, is better than my experience with del.icio.us tags, I'll probably start a software project to handle files on my computer using facets.

Categories
Article, English, Information retrieval, Librarianship, Writing