Pippilongstrings/20100121
From EU-wiki
- Beginning of log 2010 Jan 21 19:03:54 ****
2010 Jan 21 19:03:54 --> jaywalk (~strkmjlk@forthelulz.se) has joined #pippi 2010 Jan 21 19:03:55 -@- Nicks #pippi: [@peterlj jaywalk stf] 2010 Jan 21 19:03:55 -=- Channel #pippi: 3 nicks (1 op, 0 halfop, 0 voice, 2 normal) 2010 Jan 21 19:04:02 <peterlj> yo 2010 Jan 21 19:04:15 --> juice (~sushipum@h-223-98.A149.priv.bahnhof.se) has joined #pippi 2010 Jan 21 19:04:16 --> ehj (~erik@136.173.180.68) has joined #pippi 2010 Jan 21 19:05:10 <ehj> hey stf, can you tell us what you have done? 2010 Jan 21 19:05:11 <stf> so i am sitting in H.a.c.k. a budapest hackerspace. you? 2010 Jan 21 19:05:15 <stf> yep. 2010 Jan 21 19:05:21 <ehj> I'm in Strasbourg 2010 Jan 21 19:05:25 <ehj> in the EP 2010 Jan 21 19:05:29 <peterlj> im in gothenburg 2010 Jan 21 19:05:40 <jaywalk> Linköping 2010 Jan 21 19:06:04 <stf> ok, so i implemented pippi using python. 2010 Jan 21 19:06:17 <stf> funny, but there is a module in python for that: difflib 2010 Jan 21 19:06:36 <jaywalk> haha, nice :) 2010 Jan 21 19:06:56 <juice> and jussi in stockholm 2010 Jan 21 19:07:00 <stf> so i wrote a simple app, that fetches the docs from http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri= 2010 Jan 21 19:07:04 <peterlj> you basically look for strings of x words in multiple documents? 2010 Jan 21 19:07:13 <stf> where i simply append the CELEX or OJ id 2010 Jan 21 19:07:24 <jaywalk> juice: ah! didnt know your nick. :) hi 2010 Jan 21 19:07:40 <stf> then i take all docs and compare them pairwise and store them in a basic db 2010 Jan 21 19:07:52 <stf> this is done in xpippy 2010 Jan 21 19:07:56 <ehj> http://92.243.28.240:14148/xpippi/?doc=CARIFORUM 2010 Jan 21 19:08:04 <ehj> to check the results 2010 Jan 21 19:08:08 <stf> http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:62005J0039:EN:HTML 2010 Jan 21 19:08:09 <ehj> so far 2010 Jan 21 19:08:11 <juice> stf - when you write "compare them pairwise" what comparison is that? 2010 Jan 21 19:08:17 <stf> oops. worngwone 2010 Jan 21 19:08:21 <stf> wrongwone. 
2010 Jan 21 19:08:33 <ehj> oups, sorry 2010 Jan 21 19:08:33 <stf> juice: sequence matcher. 2010 Jan 21 19:08:47 <stf> no, my link is wrong, yours is perfect. 2010 Jan 21 19:08:48 <stf> sorry 2010 Jan 21 19:09:19 --> Teirdes (~user@136.173.180.70) has joined #pippi 2010 Jan 21 19:09:34 <stf> you can also go to a form, where you can specify any valid CELEX id, if it is not yet in the cache it will be downloaded, parsed and displayed to you. although this might take some time. 2010 Jan 21 19:10:01 <stf> you can go here: http://92.243.28.240:14148/xpippi/ 2010 Jan 21 19:10:02 <peterlj> so to find a sentence that is almost identical but with a middle word changed, you would have to find the matching half-sentences before and after? 2010 Jan 21 19:10:08 <stf> no. 2010 Jan 21 19:10:16 <ehj> me and Teirdes put the result for one of the international agreements (Cariforum) compared to IPRED1 here: http://euwiki.org/index.php?title=FTA%2FCARIFORUM%2FIPRED1&diff=3220&oldid=3219 2010 Jan 21 19:10:49 <ehj> (which you all maybe already saw from my mail) 2010 Jan 21 19:11:02 <stf> before comparison i tokenize and to a stemming. 2010 Jan 21 19:11:09 <peterlj> ok i see 2010 Jan 21 19:11:16 <stf> the stemming results in in removal of non words. 2010 Jan 21 19:11:47 <stf> thus punctiation, numbers and acronyms are discarded and match like wildcards. 2010 Jan 21 19:11:47 <Teirdes> stf: you seem to have solved the parallel comparison problem already? 2010 Jan 21 19:11:50 <peterlj> do you also use something like the porter stemming algorithm? 2010 Jan 21 19:11:58 <stf> Teirdes: indeed. ;) 2010 Jan 21 19:12:18 <stf> Teirdes: i beat my own performance by a couple of magnitudes. 2010 Jan 21 19:12:21 <stf> :) 2010 Jan 21 19:12:45 <stf> xpippy gives you all the matches from all the documents parsed so far. 2010 Jan 21 19:13:01 <ehj> here, I think this is a result from the latest version of xpippi, no? 
2010 Jan 21 19:13:04 <ehj> http://92.243.28.240:14148/xpippi/?doc=Korea 2010 Jan 21 19:13:08 <stf> if you want to do a direct comparison between two docs, you can pipi 2010 Jan 21 19:13:15 <stf> ehj: it is. 2010 Jan 21 19:13:32 <ehj> the right column are CELEX docs, e.g. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32000L0031:EN:HTML 2010 Jan 21 19:14:03 <ehj> the left is this text (but I guess cleaned from wikimarkup): http://euwiki.org/FTA/Korea 2010 Jan 21 19:14:30 <stf> it is also possible to add non CELEX documents. but this needs some manual processing, thus we have some documents that do not follow the CELEX naming scheme. 2010 Jan 21 19:15:03 <ehj> jaywalk, how's the callisto server set up? is it "hot"? 2010 Jan 21 19:15:21 <jaywalk> hottest so far, yes. :) 2010 Jan 21 19:15:23 <ehj> stf, what about OJ "numbers"? 2010 Jan 21 19:15:32 <stf> these are: Canada, CARIFORUM, Korea, st06221.en06.html st11245.en05.html and the funnily named getDoc-2.html 2010 Jan 21 19:15:37 <jaywalk> ehj: I wanna discuss that with you, we really should make some changes very soon :) 2010 Jan 21 19:15:50 <ehj> jaywalk, how "big" is it? what do you need? 2010 Jan 21 19:15:56 <stf> ehj: OJ numbers work the same as CELEX numbers! 2010 Jan 21 19:16:22 <ehj> so you can load an OJ url then? 2010 Jan 21 19:16:36 <jaywalk> ehj: it's 8 shares right now. And instead of more I suggest a dedicated machine next 2010 Jan 21 19:16:47 <ehj> juice, maybe you see that the result can be used directly in the political debate... 2010 Jan 21 19:16:58 <stf> so that's about what i did in the last couple of days. all code is here: http://github.com/stef/le-n-x 2010 Jan 21 19:17:10 <juice> yep, i can see how this might be interesting pretty much in the form it's there now! 2010 Jan 21 19:17:12 <stf> ehj: yes! 2010 Jan 21 19:17:32 <ehj> jaywalk, I can pay a couple of € temporarily not to interrupt the flow 2010 Jan 21 19:17:36 <peterlj> stf: nice work.
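Note on the comparison step stf describes above: a minimal sketch, assuming a plain regex tokenizer stands in for the stemming step and that the input is English text. Only `difflib.SequenceMatcher` is taken from the log; `tokenize`, `shared_fragments` and the `min_words` threshold are hypothetical names chosen for illustration, and the actual le-n-x code may work differently. Punctuation and numbers are simply dropped here, so the words on either side of them become adjacent, which roughly approximates the "match like wildcards" behaviour. Acronyms are not filtered in this sketch.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    # Assumption: lowercase ASCII words only; punctuation and numbers are dropped.
    return re.findall(r"[a-z]+", text.lower())


def shared_fragments(text_a, text_b, min_words=4):
    """Return word sequences of at least min_words that occur in both texts."""
    a, b = tokenize(text_a), tokenize(text_b)
    # autojunk=False keeps frequent words such as "the" from being ignored.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words  # also filters the zero-length sentinel block
    ]


if __name__ == "__main__":
    doc_a = "The judicial authorities may order the recovery of profits, or the payment of damages."
    doc_b = "Member States shall ensure that the judicial authorities may order the recovery of profits (Article 13)."
    print(shared_fragments(doc_a, doc_b))
```

Comparing token lists rather than raw characters is what lets a changed comma or article number leave the surrounding match intact.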
2010 Jan 21 19:17:41 <juice> question is, what needs can we see for future devs? 2010 Jan 21 19:18:11 <ehj> juice, techncal, or user interface? 2010 Jan 21 19:18:15 <jaywalk> ehj: sure, I already added some extra temporairly ;) 2010 Jan 21 19:18:45 -*- ehj all, are you all ok if we log this chat publicly for documentation? 2010 Jan 21 19:18:59 <juice> i was thinking technical needs. 2010 Jan 21 19:19:15 <juice> logging is fine with me. 2010 Jan 21 19:19:17 <jaywalk> ehj: fine by me 2010 Jan 21 19:19:21 <ehj> jaywalk, so what do you suggest for "after now"? 2010 Jan 21 19:19:44 <jaywalk> ehj: colocation and a dedicated server 2010 Jan 21 19:19:49 <jaywalk> it would pay itself in a few months 2010 Jan 21 19:20:06 <stf> my plan is to create a proper diff view, and i already crawled the complete eur-lex db. so there's lot's of things to process. ;) 2010 Jan 21 19:20:08 <ehj> ah... you mean to buy a real box? 2010 Jan 21 19:20:13 <jaywalk> yepp 2010 Jan 21 19:20:16 <stf> jaywalk: you sure? 2010 Jan 21 19:20:19 <ehj> how much? 2010 Jan 21 19:20:23 <peterlj> as i suggested to ehj before, we can work on several levels. like (1) collect, pre-parse and build up a large corpus of potentially interesting documents. (2) work on parsing, indexing, comparing, clustering etc algorithms. (3) make a nice user interface to provide a fluent way of working with the system 2010 Jan 21 19:20:36 <jaywalk> stf: yes 2010 Jan 21 19:20:50 <stf> peterlj: i am doing exactly all these things now. 2010 Jan 21 19:20:51 <stf> :) 2010 Jan 21 19:20:58 <ehj> juice, where to colocate? 2010 Jan 21 19:20:59 <stf> jaywalk: who's going to pay for this? 2010 Jan 21 19:21:05 <jaywalk> ehj: a 1U server costs anything from 6000-10000 + VAT, colocation 500-1000 SEK / month 2010 Jan 21 19:21:18 <jaywalk> current setup costs well over 200 euros per month 2010 Jan 21 19:21:27 <stf> wow! 2010 Jan 21 19:21:32 <ehj> jaywalk, don't remind me... 2010 Jan 21 19:21:40 <Teirdes> jaywalk: i will be getting office money. 
2010 Jan 21 19:21:49 <Teirdes> ehj: ce is getting office money too. 2010 Jan 21 19:21:50 <jaywalk> Teirdes: oh, nice. when? 2010 Jan 21 19:21:57 <Teirdes> jaywalk: i have no idea, which is a problem. 2010 Jan 21 19:22:06 <Teirdes> but it's included in the parliamentarian package. 2010 Jan 21 19:22:08 <jaywalk> Teirdes: ah, that I already knew then :) 2010 Jan 21 19:22:30 <juice> i'm thinking we should cloud this instead of setting up a physical box. 2010 Jan 21 19:22:30 <ehj> peterlj, juice any "institutional" alternatives? 2010 Jan 21 19:22:41 <ehj> juice, ah... 2010 Jan 21 19:22:51 <stf> currently the whole app is running on django. we should migrate this to lighttpd or apache sometimes in the future if traffic increases. 2010 Jan 21 19:23:02 <jaywalk> juice: clouded atm, gets quite expensive quite fast 2010 Jan 21 19:23:09 <ehj> hmm.... how to log this (I suck at xchat) 2010 Jan 21 19:23:22 <jaywalk> juice: we could add extra resources that way for special occations, but we have quite a bit of continous needs too 2010 Jan 21 19:23:41 <jaywalk> ehj: I can send you the log when we're done if you want me to :) 2010 Jan 21 19:23:46 <Teirdes> ehj: there is no way to access green group resources on this? 2010 Jan 21 19:23:51 <Teirdes> we could ask mab to find out. 2010 Jan 21 19:23:58 <ehj> jaywalk, I want you to :-) 2010 Jan 21 19:24:14 <juice> sics would be happy to host whatever i suggest without too much supervision. but there might be a sudden panic to close it down if someone starts to worry. 2010 Jan 21 19:24:42 <ehj> Teirdes, not yet, I have to work my way into the group slowly (and deliver some impressive results first) 2010 Jan 21 19:24:47 <Teirdes> juice: worry...? 2010 Jan 21 19:25:55 <juice> if i ask our board if it would be ok to host a service with direct political implications they will say no. 2010 Jan 21 19:26:11 <stf> my short-term plans are the following, test and make the bulk processor more robust if necessary. 
2010 Jan 21 19:26:11 <juice> if i ask our board if i may process large amounts of legal text for public reading, they'll say yes. 2010 Jan 21 19:26:12 <ehj> juice, how sudden is "sudden panic"? I mean, will it be enough if jaywalk takes backups on a weekly basis? 2010 Jan 21 19:26:21 <Teirdes> juice: actually, you'll want to run this through an NGO. 2010 Jan 21 19:26:36 <Teirdes> ehj: is there any way we can outsource it on a non-establishment organisation? 2010 Jan 21 19:26:42 <ehj> juice, well, the processing is not political, we're just finding pippilongstrings :-) 2010 Jan 21 19:26:47 <Teirdes> like, you're sort of running it now and you're clearly politically affiliated. 2010 Jan 21 19:27:00 <stf> and create a nice diffview on documents. so that Teirdes and ehj do not have to wikify them all the time. 2010 Jan 21 19:27:22 <Teirdes> stf: perhaps you could set up an organisation of data-handling. 2010 Jan 21 19:27:23 <juice> i would not ask our board any question at all. i'm fine setting up a first host. 2010 Jan 21 19:27:46 <Teirdes> being an NGO is a lot easier than being a political party :/ 2010 Jan 21 19:28:18 <peterlj> but then again, if we want we could separate the development from the "production" site 2010 Jan 21 19:28:30 <stf> data.eu? 2010 Jan 21 19:28:31 <peterlj> the development part is perfectly fine from an academic perspective 2010 Jan 21 19:28:32 <ehj> hey, I can take 300€ per month for a couple of months.... 2010 Jan 21 19:28:55 <ehj> I don't want to get bogged down with administration 2010 Jan 21 19:29:09 <ehj> and when Teirdes gets her seat, everything will be fine 2010 Jan 21 19:29:20 <Teirdes> ehj: it's not administration, it's just a... paper tiger. 2010 Jan 21 19:29:21 <ehj> should not be that long 2010 Jan 21 19:29:36 <ehj> I have enough paper tigers to handle...
2010 Jan 21 19:30:05 <ehj> for commercial co-location you'll need VAT registration etc 2010 Jan 21 19:30:26 <ehj> so the bureaucracy burden is not negliable 2010 Jan 21 19:30:45 <jaywalk> ehj: thats a part I could handle though 2010 Jan 21 19:30:54 <ehj> let's assume I can pay for what we need for another 3 months 2010 Jan 21 19:31:06 <stf> ok, how many documents / month? 2010 Jan 21 19:31:13 <jaywalk> stf: how satisfied with the performance are you atm? 2010 Jan 21 19:31:36 <stf> because we do not need a computing farm in the longterm. only in spikes when we process a lot of documents. 2010 Jan 21 19:31:44 <stf> jaywalk: it's ok. 2010 Jan 21 19:32:02 <stf> jaywalk: depends on how impatient the customers are. ;))) 2010 Jan 21 19:32:16 <jaywalk> stf: then it sounds like we need more ;) 2010 Jan 21 19:32:35 <ehj> specific target now is to download the whole of eur-lex, no? 2010 Jan 21 19:32:44 <stf> i mean, if we calculate all differences for the current corpus, then we need some horsepower for that, but then later we need to only add the new documents. 2010 Jan 21 19:32:45 <ehj> and index it? 2010 Jan 21 19:32:58 <stf> ehj: the complete eur-lex is alread downloaded. ;) 2010 Jan 21 19:33:04 <ehj> what?? 2010 Jan 21 19:33:09 <ehj> all chapters? 2010 Jan 21 19:33:11 <stf> it's just not parsed. only the Trade agreements + cankor. ;) 2010 Jan 21 19:33:19 <stf> ehj: yes and in all languages. ;) 2010 Jan 21 19:33:28 <ehj> my god! how big is that? 2010 Jan 21 19:33:31 <stf> but only for 201001 2010 Jan 21 19:33:40 <peterlj> stf: how does your algorithm scale with the number of documents? linear? exponential? 2010 Jan 21 19:34:20 <stf> we now use 11G on my dedicated storage. 2010 Jan 21 19:34:26 <peterlj> in terms of processing power etc 2010 Jan 21 19:34:47 <stf> peterlj: adding a new document means comparing it with all other docs in the corpus atm. 2010 Jan 21 19:35:01 <ehj> oh.. that does not sound too much 2010 Jan 21 19:35:22 <stf> we have about 15 thousand docs. 
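Note on peterlj's scaling question: if adding a document means comparing it with every document already in the corpus, as stf says, then each new document costs a number of comparisons linear in the corpus size, and comparing the whole corpus against itself is quadratic. A quick sketch of the arithmetic, taking the "about 15 thousand docs" figure at face value (the function names are illustrative only):

```python
def comparisons_for_new_doc(corpus_size):
    # One comparison against each document already stored.
    return corpus_size


def all_pairs(corpus_size):
    # Number of unordered document pairs: n * (n - 1) / 2.
    return corpus_size * (corpus_size - 1) // 2


if __name__ == "__main__":
    print(comparisons_for_new_doc(15_000))  # 15000
    print(all_pairs(15_000))                # 112492500
```

This is why the initial bulk run needs the horsepower, while later additions are comparatively cheap.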
2010 Jan 21 19:35:44 <juice> so i guess to scale up an order of magnitude will be a challenge :-) 2010 Jan 21 19:35:59 <stf> ehj, but there is a corpus for each month for the last couple of years, and then a corpus for each year going back to the 1950-ies. 2010 Jan 21 19:36:01 <juice> my suggestion then would be to build a repository of stock phrases and compare against that. 2010 Jan 21 19:36:05 <peterlj> sounds like it will be kind of slow performance if you have many docs. like millions. 2010 Jan 21 19:36:20 <stf> peterlj: yes. 2010 Jan 21 19:36:44 <juice> so here's how: harvest all multi-word terms and phrases and paragraphs from each incoming document. 2010 Jan 21 19:36:58 <jaywalk> stf: is it helpful comparing everything against everything though? we should be able to take the likely hits first 2010 Jan 21 19:37:00 <stf> juice: yes, that's another interesting path to follow. i'd like to display the most long or most often reused fragments, and let people classify them. 2010 Jan 21 19:37:13 <juice> save them for some time. if they reoccur, note that fact. if they don't reoccur, scratch them. 2010 Jan 21 19:37:36 <stf> jaywalk, if anyone does tell me a clustering of docs that should be compared together i am more than happy to start the engine. 2010 Jan 21 19:37:47 <juice> then you'd have an index of multi-word-terms to documents. the longer mwt-s the more likeness between docs. 2010 Jan 21 19:38:28 <ehj> I think the directives themselves (like IPRED1, Telecoms package, Dat retention Directive etc) are the core docs 2010 Jan 21 19:38:29 <jaywalk> kinda like a cache decay 2010 Jan 21 19:38:44 <peterlj> jaywalk: yes 2010 Jan 21 19:38:45 <jaywalk> juice: that sounds like a good idea I think 2010 Jan 21 19:38:54 <ehj> maybe also the trade agrements as a whole is another "cluster"? 2010 Jan 21 19:39:10 <Teirdes> the xpippi adds stuff to the corpus right? 2010 Jan 21 19:39:22 <stf> ehj has such a clustering of interesting docs. 
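Note on juice's stock-phrase idea: a rough sketch, assuming fixed-length word n-grams stand in for "multi-word terms" and that age is counted in documents added rather than in wall-clock time. The class name `PhraseIndex`, the `n` and `max_age` parameters and the whitespace tokenizer are all hypothetical; the log does not specify any of them.

```python
from collections import defaultdict


def ngrams(words, n):
    # All runs of n consecutive words, as strings.
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


class PhraseIndex:
    def __init__(self, n=5, max_age=3):
        self.n = n
        self.max_age = max_age
        self.docs = defaultdict(set)  # phrase -> ids of documents containing it
        self.last_seen = {}           # phrase -> clock value when last seen
        self.clock = 0                # counts documents added so far

    def add(self, doc_id, text):
        self.clock += 1
        for phrase in ngrams(text.lower().split(), self.n):
            self.docs[phrase].add(doc_id)
            self.last_seen[phrase] = self.clock
        # Scratch phrases that never reoccurred within max_age documents.
        stale = [p for p, ids in self.docs.items()
                 if len(ids) == 1 and self.clock - self.last_seen[p] >= self.max_age]
        for p in stale:
            del self.docs[p]
            del self.last_seen[p]

    def shared(self):
        # Phrases found in more than one document, with the documents they link.
        return {p: ids for p, ids in self.docs.items() if len(ids) > 1}


if __name__ == "__main__":
    idx = PhraseIndex(n=3, max_age=2)
    idx.add("A", "the judicial authorities may order")
    idx.add("B", "where the judicial authorities decide")
    idx.add("C", "nothing in common here at all")
    print(idx.shared())
```

With such an index, a new document is looked up against stored phrases instead of being diffed against every other document, which is the point of the proposal.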
2010 Jan 21 19:39:29 <stf> Teirdes: yes it does. 2010 Jan 21 19:39:46 <ehj> stf, how do you add a non-celex doc? 2010 Jan 21 19:39:57 <stf> ehj: TAs as a cluster. excellent. then we can also use the 20 chapters of eur-lex. 2010 Jan 21 19:40:11 <stf> ehj: only OJ: links work automatically. 2010 Jan 21 19:40:19 <stf> otherwise manual adding is necessary. 2010 Jan 21 19:40:26 <peterlj> stf: do you strip off all metadata like http markups or do you include it in some way? 2010 Jan 21 19:40:43 <stf> i store the raw html. 2010 Jan 21 19:40:57 <ehj> stf, can we do a "upload doc for cleaning" page/parser? 2010 Jan 21 19:41:06 <stf> eur-lex has this nice custom of wrapping all the important stuff in a
;)
2010 Jan 21 19:41:10 <peterlj> do you include the http tags when comparing strings? 2010 Jan 21 19:41:45 <stf> no before tokenization i run the html document through beautifulsoup, extracting the text. 2010 Jan 21 19:42:00 <peterlj> ok 2010 Jan 21 19:42:12 <stf> ehj: not yet. but i think i'll come up with some kind of automatic upload service. 2010 Jan 21 19:42:54 <stf> can we summarize what's our progress until now? 2010 Jan 21 19:43:03 <stf> i mean in this meeting. 2010 Jan 21 19:43:19 <ehj> juice, stf has the same idea as I mentioned to you regarding finding longstrings in national law, did you think more about that idea? 2010 Jan 21 19:43:47 <juice> i have been thinking a bit about how to align the various languages 2010 Jan 21 19:43:51 <stf> yes, that's why i already crawled the other languages. 2010 Jan 21 19:43:55 <juice> to be able to extract multiling terms 2010 Jan 21 19:44:14 <juice> there's been some work to do that. 2010 Jan 21 19:44:14 <stf> and i employ hunglish, the spelling engine used in openoffice 2010 Jan 21 19:44:21 <ehj> let's edit the pad: http://etherpad.com/pippilongstrings in a bit 2010 Jan 21 19:44:29 <stf> so doing all this in hungarian is also a snap. 2010 Jan 21 19:44:30 <peterlj> thats great! i want that multi-ling stuff as input for my lsi system 2010 Jan 21 19:44:34 <Teirdes> stf: i was trying to add some council... regulations, i suppose. 2010 Jan 21 19:44:47 <stf> wtf:lsi? 2010 Jan 21 19:44:50 <Teirdes> they're 4NNNNXNNNN documents. 2010 Jan 21 19:44:55 <Teirdes> well, technically 2010 Jan 21 19:45:01 <peterlj> lsi = latent semantic indexing 2010 Jan 21 19:45:03 <Teirdes> 4<year>X<number>(011) 2010 Jan 21 19:45:13 <juice> peterlj: we're against lsi. you are to use random indexing. 2010 Jan 21 19:45:14 <stf> Teirdes: what's wrong? 2010 Jan 21 19:45:20 <Teirdes> stf: it gives me an error message. 
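Note on the text extraction step: stf says the raw HTML is run through beautifulsoup before tokenization. The sketch below shows the same idea with the standard library's `html.parser` instead, purely so that it runs without extra packages; it is a swapped-in technique, not the code used in the project. `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # element contents that are never document text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collect visible text chunks, ignoring whitespace-only runs.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = ("<html><head><style>p{color:red}</style></head>"
            "<body><p>Article <b>13</b></p><p>Damages</p></body></html>")
    print(html_to_text(page))
```

The extracted text, not the markup, is what would then be tokenized and compared, which matches stf's answer to peterlj.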
2010 Jan 21 19:45:24 <ehj> what I am interesting in is to get a quantiative "number" on hom much EU-direcvtive text is sutn'paste into member states law 2010 Jan 21 19:45:29 <Teirdes> i don't know if it's because of the (01) or not. 2010 Jan 21 19:45:31 <ehj> how* 2010 Jan 21 19:46:03 <stf> Teirdes: i see. is this OJ or CELEX? 2010 Jan 21 19:46:12 <Teirdes> ehj: i think we need to run comparisons on how many times the EU has declared mutual recognition of legislation as well. 2010 Jan 21 19:46:20 <Teirdes> stf: celex. 2010 Jan 21 19:46:28 <Teirdes> i've never worked with the OJ number because I find it messy. :/ 2010 Jan 21 19:46:37 <Teirdes> I could change that though. 2010 Jan 21 19:46:40 <peterlj> juice: wot? 2010 Jan 21 19:46:54 <Teirdes> ehj: the mutual recognition part also means that our legislation applies to them, and their legislation applies to us. 2010 Jan 21 19:47:07 <stf> Teirdes: please use the complete CELEX id: CELEX:42009X0325%2801%29:EN:HTML 2010 Jan 21 19:47:11 <juice> peterlj: http://www.sics.se/~mange/random_indexing.html 2010 Jan 21 19:47:25 <Teirdes> in the north african agreements from 2005, in particular, it says that legislation in certain areas should be "approximated" 2010 Jan 21 19:47:55 <juice> hee hee "approximated" legislation is just such a beautiful term! 2010 Jan 21 19:48:06 <peterlj> juice: ok. i'm not up to date with all comp.linguistics i guess 2010 Jan 21 19:48:10 <stf> juice: thx for the url. gotta have alook at it. 2010 Jan 21 19:48:18 <Teirdes> stf: okay. thanks :) 2010 Jan 21 19:48:53 <stf> Teirdes: i made up that id, you better check wether it's valid. 2010 Jan 21 19:49:09 <ehj> Teirdes, mutual recognition is a "political problem", particularly in the field of criminal law, thats why EuroPol and EuroJust is so controvercial (and complex) 2010 Jan 21 19:50:02 <stf> sorry, i gotta leave for 5 minutes. 2010 Jan 21 19:50:08 <juice> the thing we should define is the information flow thru the system. 
if we have terminology resource, e.g., (which e.g. peterlj and myself will be able to help out to design using some funky tools) how to link it up with the analysis and the interface in a useful way? 2010 Jan 21 19:50:26 <juice> i have dinner baking in the stove - need to skip out in a minute or so. 2010 Jan 21 19:50:29 <ehj> the old system of "conventions" (ECHR, EPC etc) are running in paralell to the new "acquis" and "treaty" solutions 2010 Jan 21 19:51:06 <ehj> ok, 10 minutes break 2010 Jan 21 19:54:21 <Teirdes> i'll have to leave in 10 minutes since I'm seeing the chair of italian fsf. :/ 2010 Jan 21 19:54:37 <Teirdes> but basically i'm tracking down customs regulations :) 2010 Jan 21 19:54:40 <Teirdes> i love customs. 2010 Jan 21 19:55:15 <Teirdes> i was in tilburg yesterday starting a co-operation between PP and Incubate independent art festival. 2010 Jan 21 19:55:22 <Teirdes> jaywalk: that's why i asked about lejsrar. 2010 Jan 21 19:56:06 <jaywalk> Teirdes: I'll take a look at that, there will be many lejsers in 2010 :) 2010 Jan 21 19:59:10 <Teirdes> jaywalk: \o/ 2010 Jan 21 19:59:25 <Teirdes> i already talked to chrisk about it too. but i suppose he'd have tlaked to you? 2010 Jan 21 19:59:42 <stf> back 2010 Jan 21 19:59:43 <stf> sorry 2010 Jan 21 19:59:44 <jaywalk> Teirdes: briefly, yes 2010 Jan 21 20:01:11 -*- ehj asks all: what about the wiki? drop it or develop it? 2010 Jan 21 20:01:51 <jaywalk> ehj: do you mean euwiki? 2010 Jan 21 20:03:06 <peterlj> you mean about hosting hjalmar or migrating to some other platform? 2010 Jan 21 20:03:46 <stf> use the wiki wisely. i love wikis. but somethings are not meant to be done in wikis ;) 2010 Jan 21 20:03:49 <ehj> yes euwiki 2010 Jan 21 20:03:55 <Teirdes> ehj: listen to stef. 2010 Jan 21 20:04:00 <Teirdes> :) 2010 Jan 21 20:04:09 <ehj> I am listenting :-) 2010 Jan 21 20:04:26 <ehj> I want users, that's all 2010 Jan 21 20:04:34 <Teirdes> ehj: iwill write to malte? 
2010 Jan 21 20:04:38 <jaywalk> hosting-wise I'll migrate it to a new machine, but that's already slowly under way. that doesnt affect its use or anything else though :) 2010 Jan 21 20:04:43 <stf> i have a couple of wikis myself. but what you're doing should be done by programs. 2010 Jan 21 20:04:46 <ehj> wiki is a known platform 2010 Jan 21 20:04:56 <Teirdes> ehj: i'm... sort of using the wiki, but not so much for comparisons as for notes :) 2010 Jan 21 20:04:59 <Teirdes> i need to rush. 2010 Jan 21 20:05:01 <Teirdes> take care all! 2010 Jan 21 20:05:02 <jaywalk> well when I'm done all langs are there, that will ofc have an effect on the use :) 2010 Jan 21 20:05:04 <-- Teirdes (~user@136.173.180.70) has quit (Client closed connection) 2010 Jan 21 20:05:30 <jaywalk> so the question is more how to use the tool, I guess 2010 Jan 21 20:05:47 <ehj> jaywalk, yes, keep repeating that there is no resource out there which provides links to articles like euwiki 2010 Jan 21 20:06:10 <stf> ehj, this is possible if someone codes this up. 2010 Jan 21 20:06:23 <stf> if i find the time this will be also a feature. 2010 Jan 21 20:06:31 <stf> this is actually my next item. 2010 Jan 21 20:06:34 <peterlj> keep the wiki for collaborative notes etc. develop a dedicated ui for the new tools. useful output can be directly saved in the wiki if wanted. 2010 Jan 21 20:06:49 <ehj> peterlj, sounds good 2010 Jan 21 20:06:59 <stf> peterlj: indeed. 2010 Jan 21 20:07:14 <stf> there must be a way to include external content into the mediawiki platform. 2010 Jan 21 20:07:15 <juice> yep, the output doesn't need to be in a wiki framework. it should seem more ... constant, i guess. 2010 Jan 21 20:07:24 <stf> simple iframes would do it. 2010 Jan 21 20:07:25 <jaywalk> yeah, that sounds like a good plan 2010 Jan 21 20:07:28 <ehj> so we can get a new tool which can provide the diff like the cariforum one? 2010 Jan 21 20:07:36 <ehj> automagically? 2010 Jan 21 20:07:55 <stf> something better.
2010 Jan 21 20:08:12 <ehj> what's wrong with the wiki diffs I have made? 2010 Jan 21 20:08:30 <ehj> I mean, how can it be "better"? :-) 2010 Jan 21 20:08:30 <stf> just let me finish up the parallel bulk processor. i will start on displaying documents. 2010 Jan 21 20:08:50 <ehj> sure :-) 2010 Jan 21 20:09:03 <jaywalk> only risk with a dedicated view of the raw pippi output is that it can fast get too vast and hard to get an overview of 2010 Jan 21 20:09:14 <jaywalk> still needs to be interconnected with wikistyle texts and context 2010 Jan 21 20:09:31 <stf> imagine looking at a document and seeing all copypastes from all documents. not only from one diffed pair of documents. 2010 Jan 21 20:09:32 <jaywalk> otherwise it'll just be another db that ppl will have a hard time getting an overview of 2010 Jan 21 20:09:38 <juice> i think we want an index, again, using frequent fragments. 2010 Jan 21 20:09:45 <ehj> yes, that's why I wonder how much "better" it can be (but my imagination is a bit limited...) 2010 Jan 21 20:10:04 <ehj> juice, what is a "frequent fragment"? 2010 Jan 21 20:10:10 <juice> also, going along the track peterlj mentioned, we can infer relations between the fragments in question, so that we can relate fragments. 2010 Jan 21 20:10:10 <jaywalk> ehj: I think it can get better, just want to avoid data overload at the same time :) 2010 Jan 21 20:10:17 <stf> juice. indeed. 2010 Jan 21 20:10:25 <juice> ehj: frequent sequences of words. 2010 Jan 21 20:10:31 <ehj> classical user problem? 2010 Jan 21 20:10:58 <ehj> juice, you mean like the ones displayed in the cariforum wiki diff? 2010 Jan 21 20:11:00 <juice> fragment example: "the judicial authorities may order the recovery of profits or the payment of damages which may be pre-established" 2010 Jan 21 20:11:11 <ehj> perfect! I understand :-) 2010 Jan 21 20:11:13 <juice> actually, i mean the stuff that DOESN'T diff.
2010 Jan 21 20:11:35 <ehj> pippilongstring="frequent fragment" :-) 2010 Jan 21 20:11:59 <juice> that would be an entry point into the database: an index of freqfrags/pippilongstrings. 2010 Jan 21 20:11:59 <stf> i'd like to display the longest and the most frequent long fragments, so people can classify these fragments. using this classification there is an automatic relation between all documents containgn these frags. 2010 Jan 21 20:12:09 <peterlj> i think the stuff stf is working on sounds greast, but i would like to complement it with some vectorspace svd-based approach (singular value decomposition) 2010 Jan 21 20:12:11 <stf> ehj, indeed. 2010 Jan 21 20:12:32 <juice> there could be a temporal feature, taking up most recent tag phrases, longstrings. 2010 Jan 21 20:12:36 <ehj> peterlj, woha! lost me there!! 2010 Jan 21 20:12:50 <peterlj> where the vectors can represent words, fragments, documents, etc as points in a multidimensional space 2010 Jan 21 20:13:03 <stf> peterlj: i suppose you study or perhaps tutor this topic somewhere? 2010 Jan 21 20:13:05 <juice> peterlj: vectorspace yes, but no svd. random indexing whups its ass. 2010 Jan 21 20:14:08 <ehj> regarding Korea Canada etc, there are about 70 "trade agreements" in the pipeline... 2010 Jan 21 20:14:20 <stf> ? 2010 Jan 21 20:14:31 <ehj> so there will be a need to see exaclt which directive articels they are exporting 2010 Jan 21 20:14:32 <peterlj> juice: well,if you say so. my argument for using svd would be the scalability, to be able to use 100.000s of documents and more in reasonable time 2010 Jan 21 20:14:45 <ehj> DG TRADE is on speed 2010 Jan 21 20:14:58 <juice> peterlj: we need to take this off line, but that's precisely where RI beats SVD. 2010 Jan 21 20:15:11 <ehj> flamewar!!!! 2010 Jan 21 20:15:36 <peterlj> also, in a vectorspace model you could potentially also include other metadata like urls, references to other legal documents etc, in addition to the actual words and sentences. 
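Note on the random indexing approach juice argues for: a toy sketch, assuming whole documents are used as contexts (sliding word windows are another common choice) and using plain Python lists instead of a numeric library. Each document gets a sparse random vector of +1/-1 entries, and a word's vector is the sum of the vectors of the documents it appears in, so no matrix decomposition is needed. The function names, the dimension of 512 and the 8 non-zero entries are illustrative assumptions, not values from the log or from any particular implementation.

```python
import math
import random
from collections import defaultdict


def index_vector(rng, dim, nonzero):
    # Sparse ternary vector: a few random positions set to +1 or -1.
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_word_vectors(docs, dim=512, nonzero=8, seed=0):
    rng = random.Random(seed)
    words = defaultdict(lambda: [0] * dim)
    for text in docs:
        doc_vec = index_vector(rng, dim, nonzero)
        # Add this document's index vector to every distinct word in it.
        for word in set(text.lower().split()):
            vec = words[word]
            for i, value in enumerate(doc_vec):
                if value:
                    vec[i] += value
    return dict(words)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    docs = ["recovery of profits", "payment of damages", "recovery and profits again"]
    vecs = build_word_vectors(docs)
    print(cosine(vecs["recovery"], vecs["profits"]))
    print(cosine(vecs["recovery"], vecs["damages"]))
```

Words that occur in the same documents end up with near-identical vectors, and new documents can be added incrementally without recomputing anything, which is the scalability argument made in the chat.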
2010 Jan 21 20:15:39 <ehj> stf, do you understand "in the pieline"? 2010 Jan 21 20:15:55 <juice> my vector space is more random than peterljs. 2010 Jan 21 20:16:12 <peterlj> juice: i guess you are right.. i just need to read up... 2010 Jan 21 20:16:47 <juice> ok - i need to go feed the progeny. i hear sounds of unrest in the living quarters. 2010 Jan 21 20:17:10 <peterlj> juice: :) 2010 Jan 21 20:17:28 <ehj> juice, thanks for coming! any chance you can come up with a date for a afk meeting in Stockholm? 2010 Jan 21 20:17:39 <peterlj> my kids are eating here beside me right now 2010 Jan 21 20:18:23 <peterlj> i will be in sthlm feb 1-4. can we talk then juice ? 2010 Jan 21 20:18:34 <ehj> juice, formal (institutional) or informal does not matter 2010 Jan 21 20:19:06 <stf> ehj: which pipeline? 2010 Jan 21 20:19:10 <ehj> that's too soon for me, I was thinking May-June 2010 Jan 21 20:19:18 <ehj> the legislative pipeline 2010 Jan 21 20:19:42 <ehj> trying to keep track here: http://euwiki.org/Tratten/oeil/Stage_reached_in_procedure 2010 Jan 21 20:19:52 <juice> peterlj: i will be in the gbg area feb 1-3 :-). back on feb 4. random indexing tutor session at sics then? 2010 Jan 21 20:20:07 <peterlj> juice: hehe 2010 Jan 21 20:20:15 <juice> ehj: i'm mostly in sthlm, sure, can chat whenever. 2010 Jan 21 20:20:40 <stf> ehj: ok. can you prepare a list of urls where i can download these docs with the least amount of non-relevant noise. 2010 Jan 21 20:20:42 <peterlj> juice: 4 feb afternoon? 2010 Jan 21 20:20:55 <juice> peterlj: bokat 2010 Jan 21 20:21:08 <juice> peterlj: for chat abt r.i. with you, i mean. 2010 Jan 21 20:21:26 <peterlj> juice: yes 2010 Jan 21 20:21:45 <jaywalk> for now, are we mostly done here then? 2010 Jan 21 20:22:02 <jaywalk> I'll be nearby most of the evening anyway 2010 Jan 21 20:22:37 <ehj> juice, is this a possible research project "measure verbatim implementation rate EU-laws / MS laws"? 2010 Jan 21 20:23:32 <peterlj> could be i guess. 
we'll need to find funding for it. 2010 Jan 21 20:24:05 <ehj> well, that sounds boring :-) 2010 Jan 21 20:25:27 <ehj> I heard Slovenia is not processing directives very much 2010 Jan 21 20:25:31 <peterlj> well, we'll do the work now anyway, and then see if we can get some money for it afterwards. then we use that money to do our next project. 2010 Jan 21 20:26:07 <ehj> you know how it works, I don't... 2010 Jan 21 20:26:10 <ehj> :-) 2010 Jan 21 20:26:17 <peterlj> :) 2010 Jan 21 20:26:42 <ehj> jaywalk, thanks for coming 2010 Jan 21 20:26:45 <peterlj> well this kind of work you usually do for the lulz 2010 Jan 21 20:26:48 <stf> ok. can anyone summarize? 2010 Jan 21 20:27:08 <stf> i am interested in the next steps, the costs, meetings, todos, etc. 2010 Jan 21 20:27:10 <ehj> I put some points on the pad http://etherpad.com/pippilongstrings 2010 Jan 21 20:27:22 <jaywalk> ehj: we'll talk more later, are you on irc for the next hour or so? 2010 Jan 21 20:27:29 <ehj> jaywalk, yes 2010 Jan 21 20:28:09 <stf> longstring acta drafts? 2010 Jan 21 20:28:09 <ehj> stf, I hope I can carry the server costs for a couple of months 2010 Jan 21 20:28:33 <stf> you don't need. how long is the current setup paid? 2010 Jan 21 20:28:34 <ehj> I heard yesterday there is a leaked draft in Brussels somewhere 2010 Jan 21 20:29:01 <ehj> I am paying gandi every month 2010 Jan 21 20:29:29 <ehj> and i can take 300€ until April if needed 2010 Jan 21 20:29:39 <stf> but this setup is more expensive than usual? 2010 Jan 21 20:29:45 <ehj> i pay something like 150 now anyways 2010 Jan 21 20:29:50 <stf> that's fucking insane! 300eur. 2010 Jan 21 20:30:05 <ehj> jaywalk, says it's ok for what you get 2010 Jan 21 20:30:07 <peterlj> if you pay that out of your own pocket and we can find a better solution, i think we should 2010 Jan 21 20:30:52 <ehj> yes, but I *want* to pay to make sure things get done (as they are!!) 2010 Jan 21 20:31:05 <stf> i'll try to streamline all processor intensive stuff.
afterwards it'll be only a db query. 2010 Jan 21 20:31:06 <ehj> I also have some other servers 2010 Jan 21 20:31:42 <ehj> jaywalk is optimising price/performance for me, so I'm more worried about him 2010 Jan 21 20:33:01 <stf> ok. 2010 Jan 21 20:33:28 <jaywalk> yeah, there are 6 servers atm, guess 5 of em live doing things 2010 Jan 21 20:34:18 <jaywalk> ehj: it is a good solution price wise for the smaller ones, but you're growing out of the gandi sized things quickly now :) 2010 Jan 21 20:34:28 <ehj> ok 2010 Jan 21 20:35:22 <ehj> can I pay e.g. dustin per month if we buy a real server? 2010 Jan 21 20:36:02 <stf> would you mind if i ask for help in our hackerspace? 2010 Jan 21 20:36:17 <stf> i mean dev & hw+hosting. 2010 Jan 21 20:36:25 <ehj> (btw, are you sure CPU/memory prizes are not going down faster than a new machine gets old?) 2010 Jan 21 20:36:43 <ehj> stf, that would be greeat!! 2010 Jan 21 20:37:07 <jaywalk> ehj: the machine is a small part of the price compared to hosting/bandwith in the long run 2010 Jan 21 20:37:27 <ehj> ah.. 2010 Jan 21 20:37:32 <jaywalk> and well, most of gandis savings with faster hw/price doesnt get passed on to their customers unless they really have to due to competition ;) 2010 Jan 21 20:37:40 <stf> i am renting VPS at 10eur/month in germany. 2010 Jan 21 20:37:56 <jaywalk> stf: thats the minimal cost on gandi too 2010 Jan 21 20:38:06 <jaywalk> but you don't get 4 cores and 2GB of ram on a VPS for that ;) 2010 Jan 21 20:38:06 <stf> ok. 2010 Jan 21 20:38:25 <ehj> isn't memory cheap on gandi? 2010 Jan 21 20:38:27 <stf> no, for 10 you get 300M + 10G + 1core. 2010 Jan 21 20:38:31 <ehj> 2GB is nothing... 2010 Jan 21 20:38:51 <ehj> ah... 2GB *ram* 2010 Jan 21 20:38:52 <jaywalk> ehj: 2GB is plenty for what we do, but no, memory is decently expensive 2010 Jan 21 20:40:00 <ehj> this is an old question, but the multilingual..? 
2010 Jan 21 20:40:22 <jaywalk> yes, time time time :( 2010 Jan 21 20:40:27 <jaywalk> trying to get there though 2010 Jan 21 20:40:52 <ehj> I have made it into a bottle neck... 2010 Jan 21 20:41:02 <jaywalk> hmm.. ok 2010 Jan 21 20:41:35 <ehj> like "when we have mulitling, we can begin running kattla's stuff over wiki API" 2010 Jan 21 20:41:45 <stf> wiki api? wtf? 2010 Jan 21 20:42:27 <ehj> I think there is a way of making wiki pages directly on the server, that's an API, np? 2010 Jan 21 20:42:31 <ehj> no? 2010 Jan 21 20:43:09 <stf> yes. i do not know about mediawiki, i but i have a bad opinion on mediawiki. 2010 Jan 21 20:43:16 <ehj> php? 2010 Jan 21 20:43:52 <stf> not the php, but the development setup. 2010 Jan 21 20:44:02 <ehj> moinmoin is python... 2010 Jan 21 20:44:29 <ehj> well, again, it's the "how to get users" problem 2010 Jan 21 20:44:30 <peterlj> if we index lots of documens in a vectorspace model, we can query in one language and get results in another (given that some multilingual docs using similar terms already exist in the corpus) 2010 Jan 21 20:44:51 <ehj> peterlj, hey! brilliant! 2010 Jan 21 20:45:07 <ehj> that means we could "read " french and spanish policy docs... 2010 Jan 21 20:45:16 <ehj> or polish... 2010 Jan 21 20:45:27 <peterlj> well, yes, to some extent. 
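A minimal sketch of peterlj's point about querying in one language and getting hits in another. This is not the random indexing that juice and peterlj refer to; it swaps in a plain co-occurrence model where every term is represented by the aligned documents it appears in. The function names and the tiny English/Swedish corpus are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def term_vectors(aligned):
    """Map each term to a Counter of the aligned doc ids it occurs in.

    aligned: {doc_id: [text_in_lang1, text_in_lang2, ...]} (hypothetical input shape).
    Terms from different languages that share doc ids end up with similar vectors.
    """
    vecs = defaultdict(Counter)
    for doc_id, versions in aligned.items():
        for text in versions:
            for term in text.lower().split():
                vecs[term][doc_id] += 1
    return vecs

def embed(text, vecs):
    """Represent a query or document as the sum of its known term vectors."""
    total = Counter()
    for term in text.lower().split():
        total.update(vecs.get(term, {}))
    return total

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Illustrative parallel corpus: each id has an English and a Swedish version.
aligned = {
    "d1": ["trade agreement", "handel avtal"],
    "d2": ["fishing quota", "fiske kvot"],
}
vecs = term_vectors(aligned)
query = embed("trade", vecs)                      # English query
score_trade = cosine(query, embed("handel avtal", vecs))  # Swedish text, same topic
score_fish = cosine(query, embed("fiske kvot", vecs))     # Swedish text, other topic
```

As peterlj says, this only works "to some extent": a term that never occurs in an aligned document has no vector and contributes nothing.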
2010 Jan 21 20:45:43 <peterlj> me and juice will try to work on that part 2010 Jan 21 20:46:00 <ehj> people have come up again and again with the idea of an "early warning" system 2010 Jan 21 20:47:59 <ehj> but what i think will make people really interested in EU lawmaking is when we can compare the "copypasteness" betwenn different countries legislation 2010 Jan 21 20:48:04 <peterlj> something like this could atleast pinpoint if the new proposed document is clearly based on some previous one, albeit from a very different area of legislation 2010 Jan 21 20:48:21 <peterlj> ehj: exactly 2010 Jan 21 20:48:37 <ehj> "Here is why EU matters" 2010 Jan 21 20:49:22 <stf> what about an early warning system based on bayes like spam categorization? 2010 Jan 21 20:49:34 <ehj> let's say that Sweden has a "copypaste index" of 5.7 and Denmark only 3.2, what woudl that mean? 2010 Jan 21 20:50:13 <ehj> what's so different in the way Denmark and Sweden is processing EU directives ? 2010 Jan 21 20:50:20 <stf> btw do you guys now a good formula to show the relevance of fragments given these paramters: number of tokens in fragment and frequency of fragments. 2010 Jan 21 20:50:40 <ehj> nope :-) 2010 Jan 21 20:51:03 <ehj> I guess that's the pippi index I am talking about? 2010 Jan 21 20:52:20 <stf> i guess the early warning system is possible using some kind of bayes filtering. 2010 Jan 21 20:52:23 <peterlj> ehj i will pass by brussels on march 15-17 2010 Jan 21 20:52:46 <ehj> peterlj, if the pippi index could be connected to some social sience (statsvetenskap), then we have a multidiciplinary research project? 2010 Jan 21 20:53:00 <stf> there will be misses (false positives) but it will catch a lot of things. especially if we look for already identified textblocks as well. 2010 Jan 21 20:53:14 <stf> statsvetenskap? 2010 Jan 21 20:53:17 <ehj> peterlj, cool! is it i2010? or "internet of things"? 
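On stf's question about a formula for fragment relevance from "number of tokens in fragment and frequency of fragments": nobody in the channel proposes one, so the following is only one possible sketch, borrowing the inverse-document-frequency idea from information retrieval. The function name and the extra `n_docs` parameter (corpus size) are assumptions, not part of pippi.

```python
import math

def fragment_score(n_tokens, doc_freq, n_docs):
    """Hypothetical relevance score: longer and rarer fragments rank higher.

    n_tokens: number of tokens in the fragment
    doc_freq: number of documents that contain the fragment
    n_docs:   total number of documents in the corpus (assumed to be known)
    """
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    # idf-style rarity weight: 0 when the fragment occurs in every document
    rarity = math.log(n_docs / doc_freq)
    return n_tokens * rarity
```

With this weighting, boilerplate that shows up in every document scores zero regardless of length, while a long string shared by only a few documents floats to the top.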
2010 Jan 21 20:53:38 <ehj> stf, I dont know the english word 2010 Jan 21 20:53:43 <peterlj> stf, take a look at the proceedings from http://trec.nist.gov/ 2010 Jan 21 20:53:44 <stf> i do this under the motto: law is code. ;) 2010 Jan 21 20:53:44 <ehj> maybe political sience 2010 Jan 21 20:54:13 <stf> peterlj: thanks. i will 2010 Jan 21 20:54:32 <stf> i think we could definitly set up an eu financed project for this. 2010 Jan 21 20:54:45 <peterlj> ehj, well yes i guess so. mathias klang would probably be interested (he is juris dr. ) 2010 Jan 21 20:55:01 <stf> the only problem, i have some experiences with fp[5-7] projects, and i think they reduce the chance of success. 2010 Jan 21 20:55:11 <ehj> yes, the law, and how it is processed, and the results... 2010 Jan 21 20:55:39 <peterlj> ehj, no, it is the next annual review of our current fp7 project 2010 Jan 21 20:56:03 <ehj> ok 2010 Jan 21 20:56:56 <peterlj> stf: perhaps it would be possible to apply for funding within fp7 for a project like this, but it would take a lot of effort and time with preparations. 2010 Jan 21 20:57:31 <stf> indeed. that's one the reasons i don't like fp* projects. 2010 Jan 21 20:57:37 <peterlj> i think it would be easier to try to start with a smaller research funding scheme. 2010 Jan 21 20:57:57 <stf> osi? soros foundation? he's located here in budapest. ;) 2010 Jan 21 20:58:42 <peterlj> i have no idea on what kind of funding you could get from there. here in sweden i would probably try vinnova 2010 Jan 21 20:59:00 <stf> ah, vinnova. i know those guys. ;) 2010 Jan 21 20:59:13 <peterlj> :) 2010 Jan 21 20:59:40 <stf> ok. so what's next? 2010 Jan 21 21:01:03 <ehj> here's another datamining field: http://www.erikjosefsson.eu/sites/default/files/Lei_Wright_Why_Weak_Patents.pdf 2010 Jan 21 21:01:12 <ehj> but we take that another time 2010 Jan 21 21:01:23 <stf> wah. pdfs. how i loathe them. ;) 2010 Jan 21 21:01:38 <ehj> patent datamining 2010 Jan 21 21:01:38 <peterlj> well. 
harvest and build corpus. also try to locate other potentially interesting public document repositories: the US, national with EU, UN, OECD, etc. 2010 Jan 21 21:02:05 <ehj> peterlj, can you please pad that? 2010 Jan 21 21:02:14 <stf> ok. let's crowdsource the parsing of documents. 2010 Jan 21 21:02:24 <ehj> http://etherpad.com/pippilongstrings 2010 Jan 21 21:02:24 <peterlj> work on indexing and clusering algorithms. probably good with several complementary approaches. 2010 Jan 21 21:02:31 <stf> i'll try to think of a solution how to run this on client hosts. 2010 Jan 21 21:02:50 <ehj> distributed pippi? 2010 Jan 21 21:03:00 <stf> distributed mass pippi. 2010 Jan 21 21:03:09 <stf> bulkpippi 2010 Jan 21 21:03:14 <ehj> woooaaahh 2010 Jan 21 21:03:34 <peterlj> yey: like SETI! 2010 Jan 21 21:03:42 <stf> that's something not done in the next week. it might be even quite expensive to develop. 2010 Jan 21 21:04:03 <stf> but nothing is impossible. maybe expensive, but never impossible. ;) 2010 Jan 21 21:04:14 <ehj> I think step by step development is necessary 2010 Jan 21 21:04:32 <ehj> so that each dev step meets a user need 2010 Jan 21 21:04:39 <stf> yep, but some long-term wild-ass-visions help yo. 2010 Jan 21 21:04:43 <stf> indeed. 2010 Jan 21 21:04:47 <stf> i agree. 2010 Jan 21 21:04:57 <ehj> me too :) 2010 Jan 21 21:05:48 <stf> i only have to wait for our club mate order, that'll arrive hopefully tonight. and then i'll start to setup the new db and to the document view of pippi. 2010 Jan 21 21:06:06 <stf> so you can hopefully tomorrow enjoy browsing lot's of TAs 2010 Jan 21 21:06:17 <ehj> I've said it before... 2010 Jan 21 21:06:19 <peterlj> user interfaces can be command line and/or web interfaces. will probably have to be developed to fit particular modeling/clustering/comparing algorithms. 2010 Jan 21 21:06:24 <ehj> you're AWESOME! 2010 Jan 21 21:06:31 <stf> and later on have a diff view on the doucuments. 
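The "diff view on the documents" stf mentions could be sketched with Python's `difflib`, the module he says pippi is built on. This is an illustrative word-level diff, not stf's actual code; the function name and the `[-...-]` / `{+...+}` markers are assumptions.

```python
import difflib

def word_diff(a_text, b_text):
    """Return b_text's differences from a_text, marked up word by word."""
    a, b = a_text.split(), b_text.split()
    # autojunk=False so frequent words in long legal texts are not ignored
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
            continue
        if i1 < i2:  # words only in the first text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 < j2:  # words only in the second text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

# Made-up sentences, not quotes from any directive or agreement.
print(word_diff("the provider is in no way involved",
                "the provider is not involved"))
```

Splitting on whitespace keeps punctuation attached to words, so a moved or added comma shows up as a changed word.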
2010 Jan 21 21:07:06 <stf> peterlj: currently the engine is bases on some simple python scripts glued together by shell pipes, and a django web interface. all using the same core. 2010 Jan 21 21:07:06 <ehj> I should research how to get access to EU doc databases directly... 2010 Jan 21 21:07:43 <stf> yes, it would be excellent to have digital access not having to scrape the info would make things much easier. 2010 Jan 21 21:08:17 <peterlj> stf: yes. i am just trying to look forward. but it doesn't have to be complex, it just has to work smoothly. 2010 Jan 21 21:08:21 <ehj> hmm.. the Official Journal on a couple of DVDs... 2010 Jan 21 21:08:23 <stf> peterlj: this is how i run some documents contained in a file called asdf 2010 Jan 21 21:08:26 <stf> rm -rf db ; mkdir -p db/docs; cat /tmp/asdf | ../brain/bulkproducer.py 0 1 | tee | ../brain/bulkprocessor.py | ../brain/bulkadd.py 2010 Jan 21 21:09:29 <stf> i hope this is somewhat ETL conform. ;) 2010 Jan 21 21:15:09 --> Paolo (~kvirc@188.135.132.233) has joined #pippi 2010 Jan 21 21:15:15 <ehj> hi paolo 2010 Jan 21 21:15:16 <Paolo> hello :) 2010 Jan 21 21:15:31 <ehj> we're more or less done 2010 Jan 21 21:15:43 <ehj> some points here: http://etherpad.com/pippilongstrings 2010 Jan 21 21:16:07 <ehj> it has been a technical discussion 2010 Jan 21 21:19:38 <stf> hi paolo. 2010 Jan 21 21:19:43 <ehj> the results so far are awesome (thanks to stef), see e.g. http://92.243.28.240:14148/xpippi/?doc=Korea 2010 Jan 21 21:20:14 <ehj> stf, I et an error frm CARIFORUM: http://92.243.28.240:14148/xpippi/?doc=CARIFORUM 2010 Jan 21 21:20:19 <ehj> get* 2010 Jan 21 21:20:19 <Paolo> hi stf 2010 Jan 21 21:20:50 <stf> yes, CARIFORUM has a CELEX id, so there is no CARIFORUM anymore. 2010 Jan 21 21:20:54 <ehj> ok 2010 Jan 21 21:21:03 <stf> but i do not know the CELEX id for CARIFORUM unfortunately. 2010 Jan 21 21:21:04 <ehj> so you managed the size ? 2010 Jan 21 21:21:14 <ehj> ok, wait 2010 Jan 21 21:21:20 <stf> yes. 
amelia threw out all the tariff tables. 2010 Jan 21 21:22:03 <ehj> here it is: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:22008A1030%2801%29:EN:HTML 2010 Jan 21 21:22:20 <Paolo> I see, awesome results for Korea, really 2010 Jan 21 21:22:29 <ehj> and here (with one big table removed): http://euwiki.org/FTA/CARIFORUM 2010 Jan 21 21:23:04 <stf> Paolo: that's the point. ;) 2010 Jan 21 21:23:11 <ehj> Paolo, I have put the IPRED1/CARIFORUM stuff on the wiki: http://euwiki.org/index.php?title=FTA%2FCARIFORUM%2FIPRED1&diff=3220&oldid=3219 2010 Jan 21 21:23:53 <ehj> but stf will make these diffs automagically even better by tomorrow :-) 2010 Jan 21 21:24:16 <Paolo> amazing 2010 Jan 21 21:24:36 <Paolo> exporting laws with copy&paste... 2010 Jan 21 21:24:36 <ehj> which I think will be great, since it takes a llot of time to find and paste the articles by hand 2010 Jan 21 21:25:06 <Paolo> by hand it becomes an overwhelming task 2010 Jan 21 21:25:18 <ehj> we should soon be able to prove if CARIFORUM is the first "law export" 2010 Jan 21 21:25:44 <ehj> not overwhelming, but a bit 2010 Jan 21 21:25:45 <stf> ehj: it will be a short night, i have a openstandardsalliance presidential meeting tomorrow at 8 in the morning with the biggest hungarian bank. 2010 Jan 21 21:25:48 <stf> :) 2010 Jan 21 21:26:05 <ehj> stf, don't kill yourself.. 2010 Jan 21 21:26:48 <stf> nono, that's the job of some blackops guys, but it will look like a suicide. don't worry. ;) 2010 Jan 21 21:27:05 <ehj> uh.. :-) 2010 Jan 21 21:27:46 <ehj> but maybe it's enough to have something to show on monday, we've already proven the "export" point with the hand made wikidiff 2010 Jan 21 21:28:14 <ehj> now we only need to show to what extent this export policy has been applied 2010 Jan 21 21:28:33 <ehj> and then the really hard part: to understand what it means in legal terms 2010 Jan 21 21:28:47 <ehj> like if we find some words missing or inserted... 
2010 Jan 21 21:29:21 <ehj> and questions like "Why didn't you export Telecoms Package art 1.3a (old 138)?" 2010 Jan 21 21:29:56 <ehj> but with the diffs we can change strategy and be much more aggressive 2010 Jan 21 21:30:07 <Paolo> this is a very nice question, it can put someone in an awkward position 2010 Jan 21 21:30:17 <ehj> jaywalk, hey! can you save this log now? 2010 Jan 21 21:30:52 <ehj> the key question is whether or not the Commission has overstepped it's competence 2010 Jan 21 21:31:21 <ehj> you could argue it has if they have changed the way EU law is interpreted 2010 Jan 21 21:31:49 <ehj> so it can be important to see where they put the commas... 2010 Jan 21 21:32:09 <ehj> small changes can have big implications 2010 Jan 21 21:32:59 <jaywalk> ehj: in a bit 2010 Jan 21 21:33:10 <ehj> :-) 2010 Jan 21 21:35:11 --> _paoloj3 (~kvirc@188.135.132.233) has joined #pippi 2010 Jan 21 21:35:11 <-- Paolo (~kvirc@188.135.132.233) has quit (Client closed connection) 2010 Jan 21 21:35:38 <-- _paoloj3 (~kvirc@188.135.132.233) has quit ("Changing server...") 2010 Jan 21 21:36:11 --> paolo (~kvirc@188.135.132.233) has joined #pippi 2010 Jan 21 21:40:27 <stf> so what's on monday? what's the target? 2010 Jan 21 21:41:25 <ehj> target is to raise awareness in the EP so that the Korea agreeement discussion (in february?) will be on sunstance 2010 Jan 21 21:41:30 <ehj> substance* 2010 Jan 21 21:41:52 <stf> ok. so we need a nice view on korea? 2010 Jan 21 21:41:53 <ehj> if we can stall the adoption of Korea, then ACTA will also be slowed down 2010 Jan 21 21:42:00 <stf> \o/ 2010 Jan 21 21:42:14 <ehj> if possible on Korea Canada and Cariforum 2010 Jan 21 21:42:31 <stf> what other docs? do we need the TAs? 2010 Jan 21 21:43:30 <ehj> they are not debated right now, but if it was possible to prove that Cariforum is the first TA with extensive copypaste... 2010 Jan 21 21:43:53 <ehj> ... and to prove that we need to run all TAs 2010 Jan 21 21:44:11 <stf> ok. deal. 
2010 Jan 21 21:44:31 <ehj> but last point is possibly overkill right now 2010 Jan 21 21:44:48 <ehj> (prove Cariforum is first copypaste) 2010 Jan 21 21:44:59 <stf> i'll work with the current + the TA corpus, and will start working on a nice presentation of pippies in documents. my hope is substantial that this is possible until monday. 2010 Jan 21 21:45:23 <ehj> just a question, are all chapters in the corpus now? 2010 Jan 21 21:45:29 <stf> ehj. we have the TAs already parsed. 2010 Jan 21 21:45:44 <stf> we have downloaded everything in all languages. 2010 Jan 21 21:45:56 <ehj> ye, but the directives? 2010 Jan 21 21:45:57 <stf> parsed only the TAs and what ever got into our cankor list. 2010 Jan 21 21:46:33 <stf> directives downloaded, but not parsed, except if their in the cankor list. 2010 Jan 21 21:46:33 <ehj> e.g IPRED1 was in Chapter 11, no? 2010 Jan 21 21:46:42 <ehj> ok, then I understand 2010 Jan 21 21:46:57 <stf> everything is downloaded. 2010 Jan 21 21:47:32 <ehj> this: http://eur-lex.europa.eu/en/legis/20100101/index.htm ? 2010 Jan 21 21:47:36 <ehj> all of it? 2010 Jan 21 21:47:54 <ehj> "Directory of Community legislation in force" 2010 Jan 21 21:47:54 <stf> but parsed only the all. 2010 Jan 21 21:47:59 <stf> all 2010 Jan 21 21:48:02 <stf> yes. 2010 Jan 21 21:48:07 <stf> everything under this. 2010 Jan 21 21:48:13 <stf> the aquis iirc. 2010 Jan 21 21:48:19 <ehj> yes :-) 2010 Jan 21 21:48:22 <stf> and we have parsed 3200[0-9]L* 2010 Jan 21 21:48:39 <ehj> and it only need to be "indexed" or "pippied" 2010 Jan 21 21:49:12 <stf> plus yes. 2010 Jan 21 21:49:17 <stf> yes. 2010 Jan 21 21:49:29 <stf> parse=pippied. 2010 Jan 21 21:49:34 <ehj> so when you run cariforum, you will find *all* longstrings in the ""Directory of Community legislation in force"? 2010 Jan 21 21:49:39 <stf> TAs = 3200[0-9]L* 2010 Jan 21 21:49:46 <ehj> ok! thanks! 2010 Jan 21 21:49:57 <ehj> I'll put it on the pad 2010 Jan 21 21:50:08 <stf> no. 
we do not have the directory of community legislation in force. 2010 Jan 21 21:50:25 <stf> parsed. only downloaded. 2010 Jan 21 21:50:58 <stf> i just sent you the cankor list. 2010 Jan 21 21:51:11 <stf> via mail 2010 Jan 21 21:52:20 <ehj> * 2010 Jan 21 21:52:22 <ehj> * corpus pippied: TAs = 3200[0-9]L* and euwiki selection: http://euwiki.org/Special:PrefixIndex 2010 Jan 21 21:52:23 <ehj> * corpus to be pippied: "Directory of Community legislation in force" http://eur-lex.europa.eu/en/legis/20100101/index.htm 2010 Jan 21 21:52:40 <ehj> maybe I use "corpus" in the wrong way? 2010 Jan 21 21:53:54 <stf> no, this is ok. it's a very ambigious term. 2010 Jan 21 21:55:13 <stf> euwiki selection i am not sure if this is yet pippied. 2010 Jan 21 21:55:23 <ehj> http://euwiki.org/Pippilongstrings/selection 2010 Jan 21 21:55:45 <stf> yes, that's what i sent you just now 2010 Jan 21 21:55:46 <stf> ;) 2010 Jan 21 21:56:10 <stf> the euwiki selection is not a big deal. the bigger deal is to collect all the CELEX ids for these docs. 2010 Jan 21 21:56:18 <ehj> EHELO 2010 Jan 21 21:57:13 <ehj> btu there are no celex ids for getDoc-2.html 2010 Jan 21 21:57:13 <ehj> st06221.en06.html 2010 Jan 21 21:57:13 <ehj> st11245.en05.html 2010 Jan 21 21:58:36 <ehj> I'm sure you'll work it out :) 2010 Jan 21 21:59:01 <ehj> I mean, to keep tracj of what has a celex and what has not 2010 Jan 21 21:59:22 <ehj> and how to handle those who has no celex 2010 Jan 21 22:02:06 <stf> but this means work diverted from a proper diff view of documents. :) 2010 Jan 21 22:04:23 <stf> 2000/31/EC is in SV 2010 Jan 21 22:05:07 <stf> others as well? 2010 Jan 21 22:05:22 <stf> i mean on your Special:PrefixIndex 2010 Jan 21 22:05:39 <stf> btw. wtf:getDoc-2.html? 2010 Jan 21 22:05:46 <stf> can't we give it a proper name? 
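A small sketch of the CELEX bookkeeping discussed here, assuming the LexUriServ URL layout that appears in the links pasted in this log (`...?uri=CELEX:<id>:EN:HTML`) and reading stf's `3200[0-9]L*` as a shell-style glob over CELEX ids. The helper names are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import quote

# Base URL as quoted in the log; the language/format suffix follows the pasted examples.
BASE = "http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri="

def celex_url(celex_id, lang="EN", fmt="HTML"):
    """Build a fetch URL for a CELEX id, percent-encoding characters like parentheses."""
    return BASE + "CELEX:" + quote(celex_id, safe="") + ":" + lang + ":" + fmt

def select(celex_ids, pattern="3200[0-9]L*"):
    """Keep only the ids matching a glob such as the one stf quotes."""
    return [cid for cid in celex_ids if fnmatchcase(cid, pattern)]

ids = ["32000L0031", "62005J0039", "22008A1030(01)"]
print(select(ids))
print(celex_url("22008A1030(01)"))
```

Documents without a CELEX id (such as the `st06221.en06.html` files ehj lists) would need a separate lookup table; this sketch does not cover them.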
2010 Jan 21 22:09:28 <ehj> sorry, the special pags is all that's on the wiki 2010 Jan 21 22:09:45 <ehj> it comes from Amelia, I don't now what it is 2010 Jan 21 22:12:48 <ehj> getDoc-2.html comes from Amelia, I don't now what it is 2010 Jan 21 22:15:12 <stf> ok. 2010 Jan 21 22:15:53 <stf> i just thought to point out, that some of the content on euwiki seems to be in SV... 2010 Jan 21 22:16:18 <ehj> yes, that was before the multiling project started 2010 Jan 21 22:16:56 <ehj> when jaywalk finds the time, sv.euwiki.org, hu.euwiki.org etc will be created 2010 Jan 21 22:18:03 <stf> i'm not a fan of those language subdomains. 2010 Jan 21 22:18:25 <stf> i have a browser, where i have set up which language i prefer. the site should automatically honor that. 2010 Jan 21 22:18:50 <stf> there is no need for the user to select the language via the domain or otherwise. 2010 Jan 21 22:19:01 <ehj> I thought wikipedia was a good enough solution 2010 Jan 21 22:19:17 <jaywalk> stf: prolly doable, but everything should be accessible by anyone too 2010 Jan 21 22:19:24 <ehj> but how do you do if you want to work in both english and swedish? 2010 Jan 21 22:19:30 <stf> wikipedia is from 1990-ies, there has been some innovation since then. 2010 Jan 21 22:20:30 <stf> ehj: true. 2010 Jan 21 22:20:47 <ehj> phew.. 2010 Jan 21 22:20:49 <ehj> :-) 2010 Jan 21 22:20:50 <jaywalk> stf: I think the solution is fair enough, but euwiki needs languagework ofc since its gone from one to many 2010 Jan 21 22:21:14 <jaywalk> and if you work with the site you'll prolly login, and then it will go to your prefered language 2010 Jan 21 22:21:29 <ehj> "root wiki" is in english: "euwiki.org" (no prefix) 2010 Jan 21 22:21:30 <jaywalk> if you just visit for a short stop you can click an extra link to the left for the lang you want ;) 2010 Jan 21 22:22:29 <stf> with a login you have your language preferences anyway. 
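stf's suggestion that the site should honor the browser's language setting corresponds to reading the HTTP `Accept-Language` request header. A minimal sketch, assuming a hypothetical `pick_language` helper and a made-up set of supported wiki languages; it is not tied to MediaWiki or any existing euwiki code.

```python
def pick_language(accept_language, supported, default="en"):
    """Choose the best supported language from an Accept-Language header value."""
    prefs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        lang, _, params = piece.partition(";")
        q = 1.0  # a missing q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat malformed weights as "not acceptable"
        prefs.append((q, lang.strip().lower()))
    # sorted() is stable, so equal weights keep the order the browser sent
    for q, lang in sorted(prefs, key=lambda p: -p[0]):
        primary = lang.split("-")[0]  # "sv-SE" -> "sv"
        if q > 0 and primary in supported:
            return primary
    return default

print(pick_language("sv-SE,sv;q=0.9,en;q=0.8", {"en", "sv", "hu"}))
```

ehj's objection still applies: someone working in both English and Swedish needs a way to override the header, which is what the subdomains or a login preference provide.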
2010 Jan 21 22:22:37 <ehj> en.euwiki.org should resolve to euwiki.org 2010 Jan 21 22:23:13 <ehj> most work (by me) will be done in English 2010 Jan 21 22:23:55 <ehj> all this stuff: http://euwiki.org/Tratten/oeil/Stage_reached_in_procedure 2010 Jan 21 22:32:55 <ehj> stf, I think it is most important to pippi all of the "Directory of Community legislation in force" so that we can find *all* directives copypasted into Cariforum etc 2010 Jan 21 22:33:30 <ehj> etc= Cariforum, Korea and Canada 2010 Jan 21 22:34:23 <ehj> and present the diff in an awesome way :-) 2010 Jan 21 22:34:29 <ehj> and present the diffs in an awesome way :-) 2010 Jan 21 22:34:49 <ehj> can that be done by Monday? 2010 Jan 21 22:36:41 <ehj> Monday on there is a "Briefing on the negotiation of the Second Review of Cotonou Partnership Agreement " and a "EU-Brazil trade relations" hearing 2010 Jan 21 22:37:55 <ehj> both cotonou and brazil agreements are influenced by Canada, Korea and Cariforum 2010 Jan 21 22:40:03 <ehj> jaywalk, cough cough... 2010 Jan 21 22:40:15 <ehj> jaywalk, kick :-) 2010 Jan 21 22:40:26 <ehj> mail the log and I'll paste it up 2010 Jan 21 22:41:10 <jaywalk> ehj: when people stop calling me, then... 2010 Jan 21 22:42:22 <ehj> hahaha 2010 Jan 21 22:42:23 <paolo> I wonder how they can push Korea to sign the agreement. I mean, Korea just signed a couple of years ago an agreement with the USA which included secondary liability for ISPs. Now this treaty is bad but at least has a lot for mere conduit. 2010 Jan 21 22:43:13 <paolo> (I see articles 12, 13 and 14 of 2000/31/EC pasted into it) 2010 Jan 21 22:43:47 <ehj> yes, but that's debatabe since recital 46/10.
is not complete http://euwiki.org/FTA/Korea#ARTICLE_10.62 2010 Jan 21 22:44:39 <paolo> I see 2010 Jan 21 22:44:40 <ehj> sorry rec 43 2010 Jan 21 22:44:43 <ehj> http://euwiki.org/2000/31/EC#Recital_43 2010 Jan 21 22:44:50 <ehj> wait, I'll make a diff 2010 Jan 21 22:45:22 <paolo> don't bother I got the point 2010 Jan 21 22:46:04 <paolo> yet, I still think that articles 12-15 are deeply incompatible with the treaty they signed with the USA (and incompatible with ACTA too) 2010 Jan 21 22:46:44 <paolo> (12-15 ---> 10.63-->10.66) 2010 Jan 21 22:47:14 <paolo> just thinking 2010 Jan 21 22:47:30 <ehj> what comes after: "in no way involved with the information transmitted" 2010 Jan 21 22:47:43 <ehj> is not copyasted: 2010 Jan 21 22:47:46 <ehj> http://euwiki.org/index.php?title=Sandbox%2Fehj&diff=3265&oldid=3264 2010 Jan 21 22:48:21 <paolo> true 2010 Jan 21 22:48:41 <ehj> I hope there is a question on this on it's way to the Commission 2010 Jan 21 22:49:55 <paolo> I guess a lot of MEPs are not yet aware of the issue, or am I wrong? 2010 Jan 21 22:50:39 <ehj> no, I don't think they are aware 2010 Jan 21 22:50:53 <paolo> so, your Monday target is strategic, very important 2010 Jan 21 22:51:45 <ehj> the consequences of cut'npaste is that the Aquis can be distorted 2010 Jan 21 22:54:04 <paolo> absolutely. As you said, any little difference potentially can bring a distortion to the Acquis. Even apparently safe ones. 2010 Jan 21 22:55:05 <paolo> I didn't see how massive the work on pippilongstrings has become. I'm impressed. 2010 Jan 21 23:01:02 <stf> sorry our shipment of club mate has arrived. im now back. 2010 Jan 21 23:02:12 <ehj> club mate is great! 2010 Jan 21 23:02:38 <stf> ok you guys. i need to leave for home. there i will be available in about half an hour, but then i need to seriously get to bed maybe even today. tomorrow i have a meeting in the morning, when i usually go to bed. :( 2010 Jan 21 23:02:57 <stf> ehj: club mate already also in hungary now. 
we just got our first shipment. 2010 Jan 21 23:03:05 <stf> running a hackerspace is a lot of fun ;) 2010 Jan 21 23:03:33 <paolo> stf: good night 2010 Jan 21 23:03:38 <peterlj> stf: way to go! 2010 Jan 21 23:03:54 <stf> i'll be back soon. ;) 2010 Jan 21 23:04:01 <stf> 30min. 2010 Jan 21 23:04:01 <ehj> paolo, good night, nice to talk! 2010 Jan 21 23:04:15 <ehj> I'm also leaving now 2010 Jan 21 23:04:21 <peterlj> thx for today btw. talk to you guys again soon. 2010 Jan 21 23:04:24 <ehj> stf, thanks a lot!! 2010 Jan 21 23:04:27 <paolo> ehj ok, talk to you asap 2010 Jan 21 23:04:30 <stf> i thank you. 2010 Jan 21 23:04:35 <stf> all
